Tutorial 12: Community detection

Recapitulation of theory

A community is a set of vertices that have a larger probability of being connected to each other than to the vertices in the rest of the graph.

Let a community $C$ be a subset of vertices. For a vertex $v \in C$ we define the internal degree $k_v^{\text{in}}(C) = |\{w \in N(v)\ |\ w \in C\}|$ and the external degree $k_v^{\text{ext}}(C) = |\{w \in N(v)\ |\ w \notin C\}|$.

The total internal degree and total external degree are then defined as $k_C^{\text{in}} = \sum_{v \in C} k_v^{\text{in}}(C)$ and $k_C^{\text{ext}} = \sum_{v \in C} k_v^{\text{ext}}(C)$. The total degree of $C$ is $k_C = k_C^{\text{in}} + k_C^{\text{ext}}$. Community $C$ is strong if $k_v^{\text{in}}(C) > k_v^{\text{ext}}(C)$ for all $v\in C$, and weak if $k_C^{\text{in}} > k_C^{\text{ext}}$.
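The strong/weak test follows directly from these definitions; a minimal sketch in Python, assuming a networkx graph (the helper names and the example community are illustrative, not part of the notes):

```python
# Minimal sketch: internal/external degrees and the strong/weak community test,
# assuming an undirected networkx graph; helper names are illustrative.
import networkx as nx

def degree_split(G, C):
    """Return k_v^in(C) and k_v^ext(C) for every vertex v in C."""
    C = set(C)
    k_in = {v: sum(1 for w in G[v] if w in C) for v in C}
    k_ext = {v: G.degree(v) - k_in[v] for v in C}
    return k_in, k_ext

def community_type(G, C):
    k_in, k_ext = degree_split(G, C)
    if all(k_in[v] > k_ext[v] for v in k_in):      # every vertex dominated internally
        return "strong"
    if sum(k_in.values()) > sum(k_ext.values()):   # only the totals dominate
        return "weak"
    return "neither"

G = nx.karate_club_graph()
print(community_type(G, [0, 1, 2, 3, 7, 13]))
```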

The density of a community $C$ in a graph $G$, where $|C| = n_c$ and $|V(G)| = n$, can be defined as internal, $$\rho_{\text{int}} = \frac{|E(C)|}{{n_c \choose 2}},$$ or external, $$\rho_{\text{ext}} = \frac{|E(C,G\setminus C)|}{n_c(n - n_c)}.$$ For the average graph density $\rho = |E(G)| / {n \choose 2}$ and a community $C$ we require $$\rho_{\text{int}} \gg \rho \gg \rho_{\text{ext}}.$$ In algorithms we then maximize $\rho_{\text{int}} - \rho_{\text{ext}}$.
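These densities can again be computed directly; a small sketch assuming networkx (the example community is illustrative):

```python
# Minimal sketch: internal and external density of a candidate community,
# following the formulas above.
import networkx as nx

def densities(G, C):
    C = set(C)
    n, n_c = G.number_of_nodes(), len(C)
    m_int = G.subgraph(C).number_of_edges()                       # |E(C)|
    m_ext = sum(1 for u, v in G.edges() if (u in C) != (v in C))  # |E(C, G \ C)|
    rho_int = m_int / (n_c * (n_c - 1) / 2)
    rho_ext = m_ext / (n_c * (n - n_c))
    rho = G.number_of_edges() / (n * (n - 1) / 2)
    return rho_int, rho, rho_ext

G = nx.karate_club_graph()
print(densities(G, [0, 1, 2, 3, 7, 13]))   # expect rho_int >> rho >> rho_ext
```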

Other conditions on the structure of communities:

Typically we want the communities to cover the whole graph and to be pairwise disjoint. In some cases we allow overlapping communities; membership of a vertex in a given community is then expressed by a weight.

Problem formulation

A $k$-clique is a maximal set of vertices $C$ such that $d_G(u,v) \leq k$ for all $u,v \in C$. The shortest paths may lead outside of $C$, i.e. it can happen that $\text{diam}(G[C]) > k$.

A $k$-club is a $k$-clique with $\text{diam}(G[C]) \leq k$. In contrast to a $k$-clique it guarantees connectivity inside $C$.

A $k$-plex is a set of vertices $C$ in which every vertex is adjacent to all other vertices of $C$ with at most $k-1$ exceptions, i.e. $\deg_{G[C]}(v) = |N(v) \cap C| \geq |C| - k$ for all $v\in C$.
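The defining condition is easy to check; a minimal sketch assuming networkx (the example graph is illustrative):

```python
# Minimal sketch: testing whether a vertex set C is a k-plex of G.
import networkx as nx

def is_k_plex(G, C, k):
    C = set(C)
    H = G.subgraph(C)
    # every vertex must have at least |C| - k neighbours inside C
    return all(H.degree(v) >= len(C) - k for v in C)

G = nx.complete_graph(5)
G.remove_edge(0, 1)                  # vertices 0 and 1 each miss one neighbour
print(is_k_plex(G, range(5), 1))     # False: no longer a clique (1-plex)
print(is_k_plex(G, range(5), 2))     # True: at most one non-neighbour per vertex
```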

$k$-PLEX is the problem in which, for a given graph $G$ and numbers $c$ and $k$, we have to find a $k$-plex of size $c$. For $k=1$ this contains the problem of finding a maximum clique, so it is NP-hard.

Theorem (Balasundaram, Butenko, Hicks, Sachdeva 2011) $k$-PLEX is NP-complete for any fixed $k$.

This problem can also be formulated as an integer program as follows:

Let $x_i \in \{0,1\}$ be a binary decision variable for every vertex $i \in V$.

Let us have the following program for $k > 1$,

where $x_i = 1$ represents that vertex $i\in V$ is in the $k$-plex and $x_i = 0$ otherwise. The middle condition ensures that $i$ has at most $k - 1$ non-neighbours inside the $k$-plex.
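For concreteness, one standard integer-programming formulation of $k$-PLEX, in the spirit of Balasundaram et al. (the exact form shown here is an assumption), is $$\max \sum_{i \in V} x_i \quad \text{s.t.} \quad \sum_{j \in \overline{N}(i)} x_j \;\leq\; (k-1)\,x_i + |\overline{N}(i)|\,(1 - x_i) \quad \forall i \in V, \qquad x_i \in \{0,1\},$$ where $\overline{N}(i) = V \setminus N[i]$ is the set of non-neighbours of $i$. If $x_i = 1$, the constraint bounds the number of selected non-neighbours of $i$ by $k-1$; if $x_i = 0$, it is vacuous.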

Communities using vertex similarity

Vertices can also be divided into communities based on their similarity. We can use an external vector of vertex properties or, for example, a distance between vertices. If it is possible to (reasonably) embed $G$ into some space, we set
$$d^E(u,v) = \sqrt{\sum_{i=1}^n (u_i - v_i)^2}$$ and assume that close vertices are similar to each other.

Otherwise we assess vertex similarity from the connectivity itself, for example by structural equivalence $$d^S(i,j) = \sqrt{\sum_{k\neq i,j} (A_{ik} - A_{jk})^2}$$ or by neighbourhood overlap (Jaccard index) $$d^N(i,j) = \frac{|N(i) \cap N(j)|}{|N(i) \cup N(j)|}.$$
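A minimal sketch of both measures in Python, assuming a networkx graph (function names are illustrative):

```python
# Minimal sketch: structural-equivalence distance and neighbourhood overlap
# (Jaccard index) for a pair of vertices.
import networkx as nx
import numpy as np

def structural_distance(G, i, j):
    A = nx.to_numpy_array(G)
    nodes = list(G.nodes())
    a, b = nodes.index(i), nodes.index(j)
    mask = np.ones(len(nodes), dtype=bool)
    mask[[a, b]] = False                       # sum over k != i, j
    return float(np.sqrt(np.sum((A[a, mask] - A[b, mask]) ** 2)))

def neighbourhood_overlap(G, i, j):
    Ni, Nj = set(G[i]), set(G[j])
    return len(Ni & Nj) / len(Ni | Nj)

G = nx.karate_club_graph()
print(structural_distance(G, 0, 1), neighbourhood_overlap(G, 0, 1))
```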

Hierarchical clustering

The clusters have a hierarchical structure: larger clusters are composed of smaller ones. The algorithms are often based on vertex similarity, where for each pair of vertices we compute the similarity and write the result into a similarity matrix. The output of the algorithms is a dendrogram, a tree representation of the clusters.

Typical types of algorithms are agglomerative (bottom-up, starting from single vertices and repeatedly merging the most similar groups) and divisive (top-down, starting from the whole graph and repeatedly splitting it, e.g. by removing edges).

Note: It might be necessary to define similarity between two groups of vertices.
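As an illustration, agglomerative clustering on a vertex-distance matrix can be done with scipy; average linkage is one way to define similarity between two groups of vertices (the distance choice and parameters here are assumptions):

```python
# Minimal sketch: agglomerative hierarchical clustering of vertices.
# Rows of the adjacency matrix serve as vertex profiles (close to d^S above);
# scipy's linkage builds the dendrogram, fcluster cuts it into clusters.
import networkx as nx
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)

D = pdist(A, metric="euclidean")                   # pairwise vertex distances
Z = linkage(D, method="average")                   # average linkage between groups
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 parts
print(labels)
```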

Girvan-Newman algorithm

is a divisive algorithm based on edge betweenness centrality $$C_{EB}(e) = \sum_{u \neq v} \frac{\sigma_{uv}(e)}{\sigma_{uv}},$$ where $\sigma_{uv}$ is the number of shortest paths between $u$ and $v$, and $\sigma_{uv}(e)$ is the number of those that pass through the edge $e$.

In the algorithm we use the assumption that an edge $e_{ij}$ has large centrality if it connects vertices in different communities and small centrality if it lies inside a single community. Algorithm: repeatedly compute the betweenness of all edges, remove the edge with the largest value and recompute; the connected components at each stage form the communities (see the sketch below).
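A minimal sketch of this loop, assuming networkx (which also ships a ready-made `girvan_newman` generator in `networkx.algorithms.community`):

```python
# Minimal sketch of the Girvan-Newman loop: repeatedly remove the edge with the
# largest edge-betweenness centrality and record the resulting components.
import networkx as nx

def girvan_newman_levels(G):
    H = G.copy()
    while H.number_of_edges() > 0:
        eb = nx.edge_betweenness_centrality(H)
        u, v = max(eb, key=eb.get)             # edge most "between" communities
        H.remove_edge(u, v)
        yield [sorted(c) for c in nx.connected_components(H)]

G = nx.karate_club_graph()
for communities in girvan_newman_levels(G):
    if len(communities) == 2:                  # first split into two communities
        print(communities)
        break
```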

Modularity

Modularity is a way to evaluate the quality of a division into communities. Each partition of the vertices into disjoint sets gets a real number: the higher it is, the better the community detection. The main idea is a comparison with a random graph, in which we do not expect any community structure.

Modularity is defined as $$Q = \frac{1}{2m} \sum_{ij} (A_{ij} - P_{ij})\,\delta(C_i,C_j),$$ where $P_{ij}$ is the expected number of edges between vertices $i$ and $j$ in the reference random model, $C_i$ denotes the community of vertex $i$, and $\delta(C_i,C_j)$ indicates whether $i$ and $j$ lie in the same community.

If we choose the Erdős–Rényi graph $G(n,p)$ as the reference random model, then $P_{ij} = p = \frac{2m}{n(n-1)}$; if we choose the configuration model, then $P_{ij} = \frac{k_ik_j}{2m}$, where $k_i$ is the degree of vertex $i$.

If we use the configuration model as the reference random model, we can simplify the definition of modularity to $$Q = \sum_{c=1}^{n_c} \left[\frac{m_c}{m} - \left(\frac{d_c}{2m}\right)^2\right],$$ where $m_c$ is the number of edges inside community $C_c$, $n_c$ is the number of communities and $d_c$ is the total degree of the vertices in community $C_c$.
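A minimal sketch computing this formula for a given partition, assuming networkx (the example partition is the well-known faction split of the karate club graph):

```python
# Minimal sketch: modularity of a partition via Q = sum_c [ m_c/m - (d_c/2m)^2 ].
import networkx as nx

def modularity(G, partition):
    m = G.number_of_edges()
    Q = 0.0
    for C in partition:
        C = set(C)
        m_c = G.subgraph(C).number_of_edges()      # edges inside community C
        d_c = sum(deg for _, deg in G.degree(C))   # total degree of C
        Q += m_c / m - (d_c / (2 * m)) ** 2
    return Q

G = nx.karate_club_graph()
part = [{v for v in G if G.nodes[v]["club"] == "Mr. Hi"},
        {v for v in G if G.nodes[v]["club"] != "Mr. Hi"}]
print(modularity(G, part))
print(nx.algorithms.community.modularity(G, part))   # should agree
```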

Modularity maximization then leads to $$Q_{max} = \frac{1}{m} \max_{\mathcal{P}} \left\{\sum_{c=1}^{n_c} \left[m_c - \mathbb{E}[m_c]\right]\right\} = - \frac{1}{m} \min_{\mathcal{P}} \left\{\sum_{c=1}^{n_c} \left[\mathbb{E}[m_c] - m_c\right]\right\}.$$ Since every edge is either inside some community or in the cut, $\sum_c m_c = m - |\text{Cut}_{\mathcal{P}}|$, and we can rewrite this as $$Q_{max} = -\frac{1}{m} \min_{\mathcal{P}} \left( |\text{Cut}_{\mathcal{P}}| - |\text{ExpCut}_{\mathcal{P}}|\right),$$ where $\text{Cut}_{\mathcal{P}}$ is the set of edges with endpoints in different parts of the partition $\mathcal{P}$ and $\text{ExpCut}_{\mathcal{P}}$ is its expected counterpart in the reference model.

Graph partitioning

is the problem of dividing the vertices of a graph into a given number of groups (sometimes of given sizes) so as to minimize the number of edges between the groups. A special case is bisection of the graph into two parts, which is solved by the following algorithm.

Kernighan-Lin

When dividing the vertices into two parts $V_1 \cup V_2 = V(G)$ and taking the configuration model as the reference model, we obtain $$Q = \frac{1}{4m} \sum_{ij} B_{ij}s_is_j,$$ where $B_{ij} = A_{ij} - P_{ij} = A_{ij} - \frac{k_ik_j}{2m}$, and $s_i = -1$ for $i \in V_1$ and $s_i = +1$ for $i \in V_2$.

In matrix form, $$Q = \frac{1}{4m}s^T B s.$$

Similarly we can express the size of the cut $R$ as $$R = \frac{1}{4} s^T L s,$$ where $L$ is the Laplacian matrix defined as $L = D - A$, with $D$ the diagonal matrix of vertex degrees.
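Both quadratic forms are easy to check numerically; a small sketch with numpy and networkx (the random bisection is just for illustration):

```python
# Minimal sketch: verify Q = (1/4m) s^T B s and R = (1/4) s^T L s numerically.
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
k = A.sum(axis=1)                        # degree vector
m = A.sum() / 2                          # number of edges
B = A - np.outer(k, k) / (2 * m)         # modularity matrix
L = np.diag(k) - A                       # Laplacian matrix

rng = np.random.default_rng(0)
s = rng.choice([-1.0, 1.0], size=A.shape[0])   # a random bisection V_1 / V_2

Q = s @ B @ s / (4 * m)
R = s @ L @ s / 4
cut = sum(1 for u, v in G.edges() if s[u] != s[v])
print(Q, R, cut)                         # R equals the directly counted cut size
```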

There are different approaches to solving this problem:
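Two common ones are the Kernighan-Lin heuristic itself and spectral bisection via the Fiedler vector; a minimal sketch with networkx (the example graph and parameters are illustrative):

```python
# Minimal sketch: two bisection approaches on the same graph.
import networkx as nx

G = nx.karate_club_graph()

# 1) Kernighan-Lin heuristic: iteratively swap vertex pairs to reduce the cut.
part1, part2 = nx.algorithms.community.kernighan_lin_bisection(G, seed=0)

# 2) Spectral bisection: split by the sign of the Fiedler vector
#    (eigenvector of L belonging to the second-smallest eigenvalue).
f = nx.fiedler_vector(G)
neg = {v for v, x in zip(G, f) if x < 0}
pos = set(G) - neg

print(nx.cut_size(G, part1, part2), nx.cut_size(G, neg, pos))
```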

Theorem A connected graph has the largest eigenvalue of its modularity matrix equal to zero if and only if it is a complete multipartite graph.

Theorem (Petrovic) A connected graph has exactly one positive eigenvalue of the adjacency matrix if and only if it is a complete multipartite graph.

Tasks

Task 1: Assess the asymptotic complexity of the Kernighan-Lin algorithm.

Task 2: Let $|V(G)| = n = n_1 + n_2$. What is the (approximate) complexity of splitting the vertices into two disjoint sets of sizes $n_1$ and $n_2$ by brute force? For which values of $n_1$ and $n_2$ is the complexity largest?

Task 3: Let $P$ be a graph isomorphic to a path on $n$ vertices. Compute the modularity of a split of the vertices of $P$ into two parts, which is equivalent to removing one edge of the path, i.e. we obtain two paths with $r$ and $n-r$ vertices.