Tutorial 11: Community detection algorithms

Recapitulation of theory

Hierarchical clustering

Clusters have hierarchical structure - larger clusters/communities consist of smaller ones. Algorithms are often based on the similarity of vertices, where for each pair we count the similarity and the result is written into similarity matrix. The output of these algorithms is dendrogram, which is a tree representation of clusters.

Typical types of algorithms:

Note: It might be necessary to define similarity between groups of vertices, not only between pairs of vertices, i.e. the pairwise similarity might not extend nicely to groups of vertices.

Girvan-Newman algoritm

is a divisive algorithm using betweenness centrality for edges: $$C_{EB}(e) = \sum_{u,v \notin e} \frac{\sigma_{uv}(e)}{\sigma_{uv}}$$ where $\sigma_{uv}(e)$ is the number of shortest paths passing through edge $e$.

In the algorithm we expect, that the edge $e_{i,j}$ has large centrality, if it connectes vertices in different communities and small centrality, if it connects vertices in the same community.

Algorithm:

Modularity

Modularity is a way to evaluate the quality of the division into communities. Each division of vertices into disjoint sets receives a real number - the larger the better. The main idea is comparison with a random graph, where we do not expect a community structure.

Modularity is defined as $$Q = \frac{1}{2m} \sum_{ij} (A_{ij} - P_{ij})\delta(C_i,C_j)$$ where $P_{ij}$ is expected number of edges between communities $C_i$ and $C_j$ in a reference random model and $\delta(C_i,C_j)$ is indicator of two clusters being the same.

If we choose a reference random model which:

If we use configuration model as the reference model, we can simplify the expression for modularity as follows $$Q = \sum_{c=1}^{n_c} \left[\frac{m_c}{m} - \left(\frac{d_c}{2m}\right)^2\right],$$ where $n_c$ and $m_c$ are number of vertices and edges in a community $C_c$ and $d_c$ is total degree of a vertex $c$ in community $C_c$.

Maximization of modularity leads to $$Q_{max} = \frac{1}{m} \max_{\mathcal{P}} \left\{\sum_{c=1}^{n_c} \left[m_c - \mathbb{E}[m_c]\right]\right\} = - \frac{1}{m} \min_{\mathcal{P}} \left\{\sum_{c=1}^{n_c} \left[m_c - \mathbb{E}[m_c]\right]\right\}$$ which can be transformed into $$Q_{max} = -\frac{1}{m} \min_{\mathcal{P}} \left( |\text{Cut}_{\mathcal{P}}| - |\text{ExpCut}_{\mathcal{P}}|\right)$$ where

Graph partitioning

is the problem of dividing vertices of a graph to a fixed number of clusters with given sizes with the goal of minimizing number of edges between groups of vertices. A special case is the partitioning of a graph in two sets, which can be solved by the following algorithm.

Kernighan-Lin

When separating the vertices into two parts $V_1 \cup V_2 = V(G)$ and when using configuration model as a reference model we obtain $$Q = \frac{1}{4m} \sum_{ij} B_{ij}s_is_j$$ where $B_{ij} = A_{ij} - P_{ij} = A_{ij} - \frac{k_ik_j}{2m}$ a $s_i = -1$ pro $i \in V_1$ and $s_i = +1$ pro $i \in V_2$.

In matrix form we obtain $$Q = \frac{1}{4m}s^T B s.$$

Similarly we can express the size of the cut $R$ as $$R = \frac{1}{4} s^T L s,$$ where $L$ is Laplacian matrix defined as $L = D - A$ where $D$ is diagonal matrix containing the degrees of vertices.

The problem can be solved by

Theorem A connected graph has the largest eigenvalue of modularity matrix equal to zero if and only if it is a complete multipartite graph.

Theorem (Petrovic) A connected graph has exactly one positive eigenvalue of its adjacency matrix iff it is a complete multipartite graph.

Coloring vertices of a graph

For coloring the vertices of a graph we can use parameter node_color for function draw_networkx, which can be also used in function draw:

Apart from directly giving colors to nodes we can give vertices a number between 0 and 1 and as another parameter of the drawing function give a color map, which gives a color to any number from this interval according to predefined pallette. The predefined pallettes cam be found in the documentation to matplotlib:

https://matplotlib.org/stable/tutorials/colors/colormaps.html

More information about coloring of vertices can be found in the documentation to networkx:

https://networkx.org/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw_networkx.html#networkx.drawing.nx_pylab.draw_networkx

Community detection

There are more algorithms implemented in networkx which are related to community detection or graph partitioning:

For a given division to communities it is also possible to obtain the corresponding value of modularity.

In order to be able to use these functions, it is necessary to import into networkx the namespace for communities. More on how to do this can be found in the documentation, see

https://networkx.org/documentation/stable/reference/algorithms/community.html

Tasks

Task 1: Run the community detection algorithm based on community maximization on the grap representing the social network of dolphins:

http://networkrepository.com/soc-dolphins.php What is the resulting number of communities $k$?

Task 2: Newman algorithm using edge betweenness centrality returns a hierarchy of communities. Compare its output corresponding to $k$ communities to the result of the previous task.