Tutorial 11: Community detection algorithms¶

Recapitulation of theory¶

Hierarchical clustering¶

Clusters have hierarchical structure - larger clusters/communities consist of smaller ones. Algorithms are often based on the similarity of vertices, where for each pair we count the similarity and the result is written into similarity matrix. The output of these algorithms is dendrogram, which is a tree representation of clusters.

Typical types of algorithms:

aglomerative - clusters are merged if their similarity is large enough
divisive - we start with all vertices of the graph and we delete edges between groups of vertices with small similarity

Note: It might be necessary to define similarity between groups of vertices, not only between pairs of vertices, i.e. the pairwise similarity might not extend nicely to groups of vertices.

Girvan-Newman algoritm¶

is a divisive algorithm using betweenness centrality for edges: $$C_{EB}(e) = \sum_{u,v \notin e} \frac{\sigma_{uv}(e)}{\sigma_{uv}}$$ where $\sigma_{uv}(e)$ is the number of shortest paths passing through edge $e$.

In the algorithm we expect, that the edge $e_{i,j}$ has large centrality, if it connectes vertices in different communities and small centrality, if it connects vertices in the same community.

Algorithm:

count/update the centrality of all edges
remove edge with largest centrality
iterate until removing all edges

Modularity¶

Modularity is a way to evaluate the quality of the division into communities. Each division of vertices into disjoint sets receives a real number - the larger the better. The main idea is comparison with a random graph, where we do not expect a community structure.

Modularity is defined as $$Q = \frac{1}{2m} \sum_{ij} (A_{ij} - P_{ij})\delta(C_i,C_j)$$ where $P_{ij}$ is expected number of edges between communities $C_i$ and $C_j$ in a reference random model and $\delta(C_i,C_j)$ is indicator of two clusters being the same.

If we choose a reference random model which:

has (only) the same number of edges - $P_{ij} = p = \frac{2m}{n(n-1)} \forall i,j$
has the same degree distribution - $P_{ij} = 2mp_ip_j = \frac{k_ik_j}{2m}$

If we use configuration model as the reference model, we can simplify the expression for modularity as follows $$Q = \sum_{c=1}^{n_c} \left[\frac{m_c}{m} - \left(\frac{d_c}{2m}\right)^2\right],$$ where $n_c$ and $m_c$ are number of vertices and edges in a community $C_c$ and $d_c$ is total degree of a vertex $c$ in community $C_c$.

Maximization of modularity leads to $$Q_{max} = \frac{1}{m} \max_{\mathcal{P}} \left\{\sum_{c=1}^{n_c} \left[m_c - \mathbb{E}[m_c]\right]\right\} = - \frac{1}{m} \min_{\mathcal{P}} \left\{\sum_{c=1}^{n_c} \left[m_c - \mathbb{E}[m_c]\right]\right\}$$ which can be transformed into $$Q_{max} = -\frac{1}{m} \min_{\mathcal{P}} \left( |\text{Cut}_{\mathcal{P}}| - |\text{ExpCut}_{\mathcal{P}}|\right)$$ where

$|\text{Cut}_{\mathcal{P}}| = m - \sum_{c=1}^{n_c} m_c$ is the number of edges inside the cluster
$|\text{ExpCut}_{\mathcal{P}}| = m - \sum_{c=1}^{n_c} \mathbb{E}[m_c]$ is expected value of edges inside cluster in the reference model

Graph partitioning¶

is the problem of dividing vertices of a graph to a fixed number of clusters with given sizes with the goal of minimizing number of edges between groups of vertices. A special case is the partitioning of a graph in two sets, which can be solved by the following algorithm.

Kernighan-Lin

start with a random bisection of $V(G)$ to two parts $X,Y$
for all pairs $i \in X$, $j \in Y$ try to exchange their partitions
- evaluate the difference in the size of the cut and remember the result
exchange the pair of vertices, which decreases the cut the most (or increases the cut the least if it is not possible to decrease it) and at the same time none of the vertices of the pair was already exchanged
repeat previous step until all pairs of vertices have been tried
return the solution with minimal cut
iterate the whole process a few times (typically until the result decreases)

When separating the vertices into two parts $V_1 \cup V_2 = V(G)$ and when using configuration model as a reference model we obtain $$Q = \frac{1}{4m} \sum_{ij} B_{ij}s_is_j$$ where $B_{ij} = A_{ij} - P_{ij} = A_{ij} - \frac{k_ik_j}{2m}$ a $s_i = -1$ pro $i \in V_1$ and $s_i = +1$ pro $i \in V_2$.

In matrix form we obtain $$Q = \frac{1}{4m}s^T B s.$$

Similarly we can express the size of the cut $R$ as $$R = \frac{1}{4} s^T L s,$$ where $L$ is Laplacian matrix defined as $L = D - A$ where $D$ is diagonal matrix containing the degrees of vertices.

The problem can be solved by

solving the problem for $s_i \in [-1,1]$ instead of $s_i \in \{-1, +1\}$ and then by rounding the result (using a suitable threshold)
using spectral graph theory - there is a relation with eigenvalues of Laplacian matrix
- eigenvector belonging to $\lambda_2$ is called a Fiedler vector and it plays important role in the search of nontrivial solutions

Theorem A connected graph has the largest eigenvalue of modularity matrix equal to zero if and only if it is a complete multipartite graph.

Theorem (Petrovic) A connected graph has exactly one positive eigenvalue of its adjacency matrix iff it is a complete multipartite graph.

Coloring vertices of a graph¶

For coloring the vertices of a graph we can use parameter node_color for function draw_networkx, which can be also used in function draw:

Apart from directly giving colors to nodes we can give vertices a number between 0 and 1 and as another parameter of the drawing function give a color map, which gives a color to any number from this interval according to predefined pallette. The predefined pallettes cam be found in the documentation to matplotlib:

https://matplotlib.org/stable/tutorials/colors/colormaps.html

More information about coloring of vertices can be found in the documentation to networkx:

https://networkx.org/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw_networkx.html#networkx.drawing.nx_pylab.draw_networkx

Community detection¶

There are more algorithms implemented in networkx which are related to community detection or graph partitioning:

Kerninghan-Lin algorithm for dividing the graph to two parts
Girvan-Newman algorithm
community detection algorithms based on k-cliques, maximization of modularity and others

For a given division to communities it is also possible to obtain the corresponding value of modularity.

In order to be able to use these functions, it is necessary to import into networkx the namespace for communities. More on how to do this can be found in the documentation, see

https://networkx.org/documentation/stable/reference/algorithms/community.html

Tasks¶

Task 1: Run the community detection algorithm based on community maximization on the grap representing the social network of dolphins:

http://networkrepository.com/soc-dolphins.php What is the resulting number of communities $k$?

Task 2: Newman algorithm using edge betweenness centrality returns a hierarchy of communities. Compare its output corresponding to $k$ communities to the result of the previous task.