Tutorial 12: Community detection

Recapitulation of theory

A community is a set of vertices that have a larger probability of being connected to each other than to the vertices in the rest of the graph.

Let a community $C$ be a subset of vertices. For a vertex $v \in C$ we define the internal degree $k_v^{\text{in}}(C) = |\{w \in N(v)\ |\ w \in C\}|$ and the external degree $k_v^{\text{ext}}(C) = |\{w \in N(v)\ |\ w \notin C\}|$.

The total internal degree and total external degree are then defined as $k_C^{\text{in}} = \sum_{v \in C} k_v^{\text{in}}(C)$ and $k_C^{\text{ext}} = \sum_{v \in C} k_v^{\text{ext}}(C)$. The total degree of $C$ is $k_C = k_C^{\text{in}} + k_C^{\text{ext}}$. Community $C$ is strong if $k_v^{\text{in}}(C) > k_v^{\text{ext}}(C)$ for all $v\in C$, and weak if $k_C^{\text{in}} > k_C^{\text{ext}}$.
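The strong/weak test follows directly from these definitions; a minimal sketch in Python, assuming a networkx graph (the helper names and the example community are illustrative, not part of the notes):

```python
# Minimal sketch: internal/external degrees and the strong/weak community test,
# assuming an undirected networkx graph; helper names are illustrative.
import networkx as nx

def degree_split(G, C):
    """Return k_v^in(C) and k_v^ext(C) for every vertex v in C."""
    C = set(C)
    k_in = {v: sum(1 for w in G[v] if w in C) for v in C}
    k_ext = {v: G.degree(v) - k_in[v] for v in C}
    return k_in, k_ext

def community_type(G, C):
    k_in, k_ext = degree_split(G, C)
    if all(k_in[v] > k_ext[v] for v in k_in):      # every vertex dominated internally
        return "strong"
    if sum(k_in.values()) > sum(k_ext.values()):   # only the totals dominate
        return "weak"
    return "neither"

G = nx.karate_club_graph()
print(community_type(G, [0, 1, 2, 3, 7, 13]))
```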

The density of a community $C$ in a graph $G$, where $|C| = n_c$ and $|V(G)| = n$, can be defined as internal, $$\rho_{\text{int}} = \frac{|E(C)|}{{n_c \choose 2}},$$ or external, $$\rho_{\text{ext}} = \frac{|E(C,G\setminus C)|}{n_c(n - n_c)}.$$ For the average graph density $\rho = |E(G)| / {n \choose 2}$ and a community $C$ we require $$\rho_{\text{int}} \gg \rho \gg \rho_{\text{ext}}.$$ In algorithms we then maximize $\rho_{\text{int}} - \rho_{\text{ext}}$.
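These densities can again be computed directly; a small sketch assuming networkx (the example community is illustrative):

```python
# Minimal sketch: internal and external density of a candidate community,
# following the formulas above.
import networkx as nx

def densities(G, C):
    C = set(C)
    n, n_c = G.number_of_nodes(), len(C)
    m_int = G.subgraph(C).number_of_edges()                       # |E(C)|
    m_ext = sum(1 for u, v in G.edges() if (u in C) != (v in C))  # |E(C, G \ C)|
    rho_int = m_int / (n_c * (n_c - 1) / 2)
    rho_ext = m_ext / (n_c * (n - n_c))
    rho = G.number_of_edges() / (n * (n - 1) / 2)
    return rho_int, rho, rho_ext

G = nx.karate_club_graph()
print(densities(G, [0, 1, 2, 3, 7, 13]))   # expect rho_int >> rho >> rho_ext
```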

Other conditions on the structure of communities:

Typically we want the communities to cover the whole graph and to be pairwise disjoint. In some cases we allow overlapping communities; membership of a vertex in a given community is then expressed by a weight.

Problem formulation

A $k$-clique is a maximal set of vertices $C$ such that $d_G(u,v) \leq k$ for all $u,v \in C$. The shortest paths may lead outside of $C$, i.e. it can happen that $\text{diam}(G[C]) > k$.

A $k$-club is a $k$-clique with $\text{diam}(G[C]) \leq k$. In contrast to a $k$-clique it guarantees connectivity inside $C$.

A $k$-plex is a set of vertices $C$ in which every vertex is adjacent to all other vertices of $C$ with at most $k-1$ exceptions, i.e. $\deg_{G[C]}(v) = |N(v) \cap C| \geq |C| - k$ for all $v\in C$.
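The defining condition is easy to check; a minimal sketch assuming networkx (the example graph is illustrative):

```python
# Minimal sketch: testing whether a vertex set C is a k-plex of G.
import networkx as nx

def is_k_plex(G, C, k):
    C = set(C)
    H = G.subgraph(C)
    # every vertex must have at least |C| - k neighbours inside C
    return all(H.degree(v) >= len(C) - k for v in C)

G = nx.complete_graph(5)
G.remove_edge(0, 1)                  # vertices 0 and 1 each miss one neighbour
print(is_k_plex(G, range(5), 1))     # False: no longer a clique (1-plex)
print(is_k_plex(G, range(5), 2))     # True: at most one non-neighbour per vertex
```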

$k$-PLEX is the problem in which, for a given graph $G$ and numbers $c$ and $k$, we have to find a $k$-plex of size $c$. For $k=1$ this contains the problem of finding a maximum clique, so it is NP-hard.

Theorem (Balasundaram, Butenko, Hicks, Sachdeva 2011) $k$-PLEX is NP-complete for any fixed $k$.

This problem can also be formulated as an integer program as follows:

Let $x_i \in \{0,1\}$ be a binary decision variable for every vertex $i \in V$.

Let us have the following program for $k > 1$,

where $x_i = 1$ represents that vertex $i\in V$ is in the $k$-plex and $x_i = 0$ otherwise. The middle condition ensures that $i$ has at most $k - 1$ non-neighbours inside the $k$-plex.
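For concreteness, one standard integer-programming formulation of $k$-PLEX, in the spirit of Balasundaram et al. (the exact form shown here is an assumption), is $$\max \sum_{i \in V} x_i \quad \text{s.t.} \quad \sum_{j \in \overline{N}(i)} x_j \;\leq\; (k-1)\,x_i + |\overline{N}(i)|\,(1 - x_i) \quad \forall i \in V, \qquad x_i \in \{0,1\},$$ where $\overline{N}(i) = V \setminus N[i]$ is the set of non-neighbours of $i$. If $x_i = 1$, the constraint bounds the number of selected non-neighbours of $i$ by $k-1$; if $x_i = 0$, it is vacuous.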

Communities using vertex similarity

Vertices can also be divided into communities based on their similarity. We can use an external vector of vertex properties or, for example, a distance between vertices. If it is possible to (reasonably) embed $G$ into some space, we set
$$d^E(u,v) = \sqrt{\sum_{i=1}^n (u_i - v_i)^2}$$ and assume that close vertices are similar to each other.

Otherwise we assess vertex similarity from the connectivity itself, for example by structural equivalence $$d^S(i,j) = \sqrt{\sum_{k\neq i,j} (A_{ik} - A_{jk})^2}$$ or by neighbourhood overlap (Jaccard index) $$d^N(i,j) = \frac{|N(i) \cap N(j)|}{|N(i) \cup N(j)|}.$$
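A minimal sketch of both measures in Python, assuming a networkx graph (function names are illustrative):

```python
# Minimal sketch: structural-equivalence distance and neighbourhood overlap
# (Jaccard index) for a pair of vertices.
import networkx as nx
import numpy as np

def structural_distance(G, i, j):
    A = nx.to_numpy_array(G)
    nodes = list(G.nodes())
    a, b = nodes.index(i), nodes.index(j)
    mask = np.ones(len(nodes), dtype=bool)
    mask[[a, b]] = False                       # sum over k != i, j
    return float(np.sqrt(np.sum((A[a, mask] - A[b, mask]) ** 2)))

def neighbourhood_overlap(G, i, j):
    Ni, Nj = set(G[i]), set(G[j])
    return len(Ni & Nj) / len(Ni | Nj)

G = nx.karate_club_graph()
print(structural_distance(G, 0, 1), neighbourhood_overlap(G, 0, 1))
```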

Hierarchical clustering

The clusters have a hierarchical structure: larger clusters are composed of smaller ones. The algorithms are often based on vertex similarity, where for each pair of vertices we compute the similarity and write the result into a similarity matrix. The output of the algorithms is a dendrogram, a tree representation of the clusters.

Typical types of algorithms are agglomerative (bottom-up, starting from single vertices and repeatedly merging the most similar groups) and divisive (top-down, starting from the whole graph and repeatedly splitting it, e.g. by removing edges).

Note: It might be necessary to define similarity between two groups of vertices.
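As an illustration, agglomerative clustering on a vertex-distance matrix can be done with scipy; average linkage is one way to define similarity between two groups of vertices (the distance choice and parameters here are assumptions):

```python
# Minimal sketch: agglomerative hierarchical clustering of vertices.
# Rows of the adjacency matrix serve as vertex profiles (close to d^S above);
# scipy's linkage builds the dendrogram, fcluster cuts it into clusters.
import networkx as nx
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)

D = pdist(A, metric="euclidean")                   # pairwise vertex distances
Z = linkage(D, method="average")                   # average linkage between groups
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 parts
print(labels)
```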

Girvan-Newman algorithm

is a divisive algorithm based on edge betweenness centrality $$C_{EB}(e) = \sum_{u \neq v} \frac{\sigma_{uv}(e)}{\sigma_{uv}},$$ where $\sigma_{uv}$ is the number of shortest paths between $u$ and $v$, and $\sigma_{uv}(e)$ is the number of those that pass through the edge $e$.

In the algorithm we use the assumption that an edge $e_{ij}$ has large centrality if it connects vertices in different communities and small centrality if it lies inside a single community. Algorithm: repeatedly compute the betweenness of all edges, remove the edge with the largest value and recompute; the connected components at each stage form the communities (see the sketch below).
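A minimal sketch of this loop, assuming networkx (which also ships a ready-made `girvan_newman` generator in `networkx.algorithms.community`):

```python
# Minimal sketch of the Girvan-Newman loop: repeatedly remove the edge with the
# largest edge-betweenness centrality and record the resulting components.
import networkx as nx

def girvan_newman_levels(G):
    H = G.copy()
    while H.number_of_edges() > 0:
        eb = nx.edge_betweenness_centrality(H)
        u, v = max(eb, key=eb.get)             # edge most "between" communities
        H.remove_edge(u, v)
        yield [sorted(c) for c in nx.connected_components(H)]

G = nx.karate_club_graph()
for communities in girvan_newman_levels(G):
    if len(communities) == 2:                  # first split into two communities
        print(communities)
        break
```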

Modularity

Modularity is a way to evaluate the quality of a division into communities. Each partition of the vertices into disjoint sets gets a real number: the higher it is, the better the community detection. The main idea is a comparison with a random graph, in which we do not expect any community structure.

Modularity is defined as $$Q = \frac{1}{2m} \sum_{ij} (A_{ij} - P_{ij})\,\delta(C_i,C_j),$$ where $P_{ij}$ is the expected number of edges between vertices $i$ and $j$ in the reference random model, $C_i$ denotes the community of vertex $i$, and $\delta(C_i,C_j)$ indicates whether $i$ and $j$ lie in the same community.

If we choose the Erdős–Rényi graph $G(n,p)$ as the reference random model, then $P_{ij} = p = \frac{2m}{n(n-1)}$; if we choose the configuration model, then $P_{ij} = \frac{k_ik_j}{2m}$, where $k_i$ is the degree of vertex $i$.

If we use the configuration model as the reference random model, we can simplify the definition of modularity to $$Q = \sum_{c=1}^{n_c} \left[\frac{m_c}{m} - \left(\frac{d_c}{2m}\right)^2\right],$$ where $m_c$ is the number of edges inside community $C_c$, $n_c$ is the number of communities and $d_c$ is the total degree of the vertices in community $C_c$.
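A minimal sketch computing this formula for a given partition, assuming networkx (the example partition is the well-known faction split of the karate club graph):

```python
# Minimal sketch: modularity of a partition via Q = sum_c [ m_c/m - (d_c/2m)^2 ].
import networkx as nx

def modularity(G, partition):
    m = G.number_of_edges()
    Q = 0.0
    for C in partition:
        C = set(C)
        m_c = G.subgraph(C).number_of_edges()      # edges inside community C
        d_c = sum(deg for _, deg in G.degree(C))   # total degree of C
        Q += m_c / m - (d_c / (2 * m)) ** 2
    return Q

G = nx.karate_club_graph()
part = [{v for v in G if G.nodes[v]["club"] == "Mr. Hi"},
        {v for v in G if G.nodes[v]["club"] != "Mr. Hi"}]
print(modularity(G, part))
print(nx.algorithms.community.modularity(G, part))   # should agree
```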

Modularity maximization then leads to $$Q_{max} = \frac{1}{m} \max_{\mathcal{P}} \left\{\sum_{c=1}^{n_c} \left[m_c - \mathbb{E}[m_c]\right]\right\} = - \frac{1}{m} \min_{\mathcal{P}} \left\{\sum_{c=1}^{n_c} \left[\mathbb{E}[m_c] - m_c\right]\right\}.$$ Since every edge is either inside some community or in the cut, $\sum_c m_c = m - |\text{Cut}_{\mathcal{P}}|$, and we can rewrite this as $$Q_{max} = -\frac{1}{m} \min_{\mathcal{P}} \left( |\text{Cut}_{\mathcal{P}}| - |\text{ExpCut}_{\mathcal{P}}|\right),$$ where $\text{Cut}_{\mathcal{P}}$ is the set of edges with endpoints in different parts of the partition $\mathcal{P}$ and $\text{ExpCut}_{\mathcal{P}}$ is its expected counterpart in the reference model.

Graph partitioning

is the problem of dividing the vertices of a graph into a given number of groups (sometimes of given sizes) so as to minimize the number of edges between the groups. A special case is bisection of the graph into two parts, which is solved by the following algorithm.

Kernighan-Lin

When dividing the vertices into two parts $V_1 \cup V_2 = V(G)$ and taking the configuration model as the reference model, we obtain $$Q = \frac{1}{4m} \sum_{ij} B_{ij}s_is_j,$$ where $B_{ij} = A_{ij} - P_{ij} = A_{ij} - \frac{k_ik_j}{2m}$, and $s_i = -1$ for $i \in V_1$ and $s_i = +1$ for $i \in V_2$.

In matrix form, $$Q = \frac{1}{4m}s^T B s.$$

Similarly we can express the size of the cut $R$ as $$R = \frac{1}{4} s^T L s,$$ where $L$ is the Laplacian matrix defined as $L = D - A$, with $D$ the diagonal matrix of vertex degrees.
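Both quadratic forms are easy to check numerically; a small sketch with numpy and networkx (the random bisection is just for illustration):

```python
# Minimal sketch: verify Q = (1/4m) s^T B s and R = (1/4) s^T L s numerically.
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
k = A.sum(axis=1)                        # degree vector
m = A.sum() / 2                          # number of edges
B = A - np.outer(k, k) / (2 * m)         # modularity matrix
L = np.diag(k) - A                       # Laplacian matrix

rng = np.random.default_rng(0)
s = rng.choice([-1.0, 1.0], size=A.shape[0])   # a random bisection V_1 / V_2

Q = s @ B @ s / (4 * m)
R = s @ L @ s / 4
cut = sum(1 for u, v in G.edges() if s[u] != s[v])
print(Q, R, cut)                         # R equals the directly counted cut size
```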

There are different approaches to solving this problem:
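Two common ones are the Kernighan-Lin heuristic itself and spectral bisection via the Fiedler vector; a minimal sketch with networkx (the example graph and parameters are illustrative):

```python
# Minimal sketch: two bisection approaches on the same graph.
import networkx as nx

G = nx.karate_club_graph()

# 1) Kernighan-Lin heuristic: iteratively swap vertex pairs to reduce the cut.
part1, part2 = nx.algorithms.community.kernighan_lin_bisection(G, seed=0)

# 2) Spectral bisection: split by the sign of the Fiedler vector
#    (eigenvector of L belonging to the second-smallest eigenvalue).
f = nx.fiedler_vector(G)
neg = {v for v, x in zip(G, f) if x < 0}
pos = set(G) - neg

print(nx.cut_size(G, part1, part2), nx.cut_size(G, neg, pos))
```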

Theorem A connected graph has the largest eigenvalue of its modularity matrix equal to zero if and only if it is a complete multipartite graph.

Theorem (Petrovic) A connected graph has exactly one positive eigenvalue of the adjacency matrix if and only if it is a complete multipartite graph.

Tasks

Task 1: Assess the asymptotic complexity of the Kernighan-Lin algorithm.

Task 2: Let $|V(G)| = n = n_1 + n_2$. What is the (approximate) complexity of splitting the vertices into two disjoint sets of sizes $n_1$ and $n_2$ by brute force? For which values of $n_1$ and $n_2$ is the complexity largest?

Task 3: Let $P$ be a graph isomorphic to a path on $n$ vertices. Compute the modularity of a split of the vertices of $P$ into two parts, which is equivalent to removing one edge of the path, i.e. we obtain two paths with $r$ and $n-r$ vertices.