Today we will learn more about the tasks of social network analysis (**SNA**), and review the **Apache Spark** library designed to analyze Big Data. We will also consider one of the components of Apache Spark, designed for the analysis of graphs — **GraphX**. We’ll try to understand how this library has implemented the distributed graph storage and computations on them. The provided examples will show how we can use this library in practice. For instance, to search for the spam, rank search results, determine communities in social networks, or search for opinion leaders, and it’s not a complete list of applying methods for analyzing graphs.

Let’s start with recalling the main objects we’ll work with in the given article.

## Graph Theory, Random Graphs and Webgraphs

A graph is an ordered pair **G = (V,E)**, in which **V** is a set of **vertices** (say, websites on the Internet), and **E** is a set of **edges** connecting the vertices (links between the websites, respectively). It is quite a clear structure that has been analyzed for many years. However, in practice, when we analyze real networks, we wonder how a graph is built. For instance, there is a website on the Internet. What will it link to in the first place? What is the average number of new links? How well will this site rank in the search results?

People deal with this problem (the structure of a **web graph**) almost since the times the Internet appeared. Quite a lot of models have been invented during this time. Not so long ago, Yandex has introduced a generalized model, and also studied its features.

If the graph is already given, its features, as well as further calculations on it, are quite defined. For example, we can explore the degree of a particular vertex (the number of friends of a person in a social network), or measure the distance between two specific vertices (how many steps away, by way of introduction, two people are connected to each other), compute connected components (a group of people, where any two people know each other) and a lot more.

**Here are Some Classical Algorithms:**

PageRank is a well-known algorithm for computing the “authority” of a vertex in a graph. It was offered by Google in 1998. For a long time, it has been used to rank the search results.

The search for the strongly connected components is the algorithm to search for subsets of graph vertices. There is a path between any two vertices of a specific subset. At the same time, there is no path between the vertices of different subsets.

The shortest path problem is the problem of finding a path between two vertices in a graph such that the sum of the weights of its constituent edges is minimized.

As well as counting the number of triangles, clustering, distributing degrees of vertices, searching for clicks in a graph, and a lot more. It should be noted that most of the algorithms are iterative, and therefore, the **GraphX** library performs well in this context, as it caches data in memory. We will consider the capabilities provided by this library.

## Spark GraphX

We should note that **GraphX** is not the first and the only tool for the analysis of graphs (popular tools are, for example, **GraphLab**, the nowadays Dato.com, or the **Pregel** computational model, the API of which is partially used in **GraphX**). It should be also noted that at the time this article was written, the given library was still in the development stage and its capabilities were not so great. Nevertheless, **GraphX** proves its application in practice.

As for now, **GraphX** has no support for Python. Therefore, we will write the code in Scala, assuming **SparkContext** has already been created (the **sc** variable in the code below). Most of the code below is taken from the documentation and public materials. So, to begin with, let’s import all the necessary libraries:

```
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
```

In Spark, the concept of graph is implemented in the so-called **Property Graph**. It is a multigraph with a label (additional information) on vertices and edges. A **multigraph** is a **directed** graph (edges are directed), which is permitted to have multiple edges (can have several edges between two vertices) and loops (an edge that connects a vertex to itself). We should also say that in the case of directed graphs there are such notions as **in-degree** (the number of incoming edges) and **out-degree** (the number of outgoing edges). Let’s see how to build a particular graph.

## Building a Graph

We can build a graph with the help of the **Graph** constructor by passing vertices and edges into it (do not forget to make an **RDD** of them using the **.parallelize()** function):

```
val vertexArray = Array(
(1L, ("Alice", 28)),
(2L, ("Bob", 27)),
(3L, ("Charlie", 65)),
(4L, ("David", 42)),
(5L, ("Ed", 55)),
(6L, ("Fran", 50))
)
val edgeArray = Array(
Edge(2L, 1L, 7),
Edge(2L, 4L, 2),
Edge(3L, 2L, 4),
Edge(3L, 6L, 3),
Edge(4L, 1L, 1),
Edge(5L, 2L, 2),
Edge(5L, 3L, 8),
Edge(5L, 6L, 3)
)
val vertexRDD = sc.parallelize(vertexArray)
val edgeRDD = sc.parallelize(edgeArray)
val graph = Graph(vertexRDD, edgeRDD)
```

Or, if it is required to build vertices and edges on the basis of some data, located, say, in **HDFS**, it is necessary to process the source data at first (as usual, using the **.map()** transformation). For instance, if we have Wikipedia articles stored in the form of **(id, title)**, and links between the articles stored in the form of pairs, the graph is built easily. We should separate **id** from **title** in the first case, and construct the edges (use the **Edge** constructor) in the second case. At the output, we’ll get a list of vertices and edges that can be passed to the **Graph** constructor:

```
val articles = sc.textFile("articles.txt")
val links = sc.textFile("links.txt")
val vertices = articles.map { line =>
val fields = line.split('\t')
(fields(0).toLong, fields(1))
}
val edges = links.map { line =>
val fields = line.split('\t')
Edge(fields(0).toLong, fields(1).toLong, 0)
}
val graph = Graph(vertices, edges, "").cache()
```

After building the graph, we can compute some characteristics, and also run algorithms on it, including the ones mentioned above. It should also be noted that in addition to the concept of vertices and edges, **Spark** also has the notion of **Triplet**, which is an object that somehow widens the **Edge** object. It stores information about an edge and vertices adjacent to it.

## Computations on Graphs

The remarkable fact is that after building a graph in most packages, and **GraphX** is no exception, it becomes easy to perform computations on it, as well as run standard algorithms. In fact, methods of computations on graphs are studied well enough, and sometimes the most difficult thing in some application tasks is to determine the graph, namely, determine its vertices and edges. Below is a list of methods available in **Graph** object with some comments:

```
class Graph[VD, ED] {
// Basic statistic of the graph
val numEdges: Long // the number of edges
val numVertices: Long // the number of vertices
val inDegrees: VertexRDD[Int] // inbound degrees pf vertices
val outDegrees: VertexRDD[Int] // outbound degrees of vertices
val degrees: VertexRDD[Int] // total degrees of vertices
// representations of vertices, edges and triplets
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
val triplets: RDD[EdgeTriplet[VD, ED]]
// Changing the attributes (additional information) of vertices and edges
def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED]
def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
def mapEdges[ED2](map: (PartitionID, Iterator[Edge[ED]]) => Iterator[ED2]): Graph[VD, ED2]
def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
// Modifications of graphs
def reverse: Graph[VD, ED] // conversion – all edges change their direction
def subgraph(
epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
vpred: (VertexID, VD) => Boolean = ((v, d) => true))
: Graph[VD, ED] // selecting subgrpahs that meet certain conditions
def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED] // merging graphs
// Basic graph algorithms
def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double] // computing the PageRank
def connectedComponents(): Graph[VertexID, ED] // searching for connected components
def triangleCount(): Graph[Int, ED] // counting the number of triangles
def stronglyConnectedComponents(numIter: Int): Graph[VertexID, ED] // searching for strongly connected components
}
```

It is worth noting that the current implementation of **SparkX** contains few implemented algorithms. Therefore, it’s still relevant to use the mentioned above packages instead of **Apache Spark**. However, we can hope that **GraphX** will be improved in the future. Thanks to the fact that it provides the capability to cache data in memory, it can become popular in solving graph problems. To summarize, let’s take a look at the examples of practical tasks, in which it is necessary to apply graph methods.

## Practical Tasks

As mentioned before, the main problem in solving practical tasks is not to run complex algorithms, but to properly determine the graph, pre-process it correctly, and reduce the task to the classical one, already solved. Let’s consider the following examples, and leave the reader a lot of questions to think about:

### Link Prediction

The problem is quite common: there is a sequence of edges that are added to the graph till some moment in time. It is necessary to predict what edges will appear in our graph in the future. From the perspective of real life, this task is a part of a recommender system. For example, predict relations (“friendship”) between two users of a social network. For each pair of randomly selected vertices, it is necessary to predict the probability that there will appear an edge connecting them. That's exactly when we should work with the features and the description of vertices. For instance, one of the features can be an intersection of subsets of friends, or the Jaccard index. The reader should think of possible metrics of similarity between the vertices and write a version of his own in comments.

### Determining Communities in Social Networks

Actually, it is quite difficult to refer it to a certain type of problems. This problem is usually considered in the context of the “clustering on graphs” task. There are plenty of methods to solve this problem, from the simple selection of connected components (the algorithm has been mentioned above) with the properly determined subset of vertices to the sequential removal of edges from the graph till there’s one component left, the necessary one. It’s important to think about the kind of communities we want to select in the network. That is to say that we should first work with the description of vertices and edges, and then think of how to select communities in the obtained subgraph.

### Shortest Paths in Graphs

This problem is also classical and is implemented in various services that help to find the shortest path in a graph, no matter if it’s a social graph or a graph of connections between two cities.

It can also be a problem that, for example, a mobile network operator can face with:

### Determining Opinion Leaders in a Network

For instance, we need to promote a new option – a mobile Internet. To optimize the advertising budget, we’d like to select people who are in some sense the center of attention. In our terms, they are vertices of a graph that have a great authority on the Internet. With the proper structure of the graph, this problem is reduced to the **PageRank**.

Thus, we have reviewed typical application tasks of graph analysis that can occur in practice. We have also considered one of the tools we can apply to solve them. In the next article, we will focus on algorithms and particular tasks.