A node-based index for clustering validation of graph data

Tosyali, Ali; Tavakkol, Behnam

doi:10.1007/s10479-021-04376-7

A node-based index for clustering validation of graph data

Original Research
Published: 08 November 2021

Volume 341, pages 197–221, (2024)
Cite this article

Annals of Operations Research Aims and scope Submit manuscript

321 Accesses
1 Citation
Explore all metrics

Abstract

Clustering validity indices are designed for evaluating the performance of clustering algorithms in terms of quality of clusters. They are also used for detecting the correct number of clusters. Numerous clustering validity indices have been proposed in the literature for different types of data. However, to the best of our knowledge, there are a few well-known clustering validity indices for graph data. In this paper, we propose a new clustering validity index for graph data called the node-based clustering validity index. The main characteristic of our proposed index is that it captures the exclusive contribution of each node to the separation and compactness of the clusters in a graph. This characteristic gives an advantage to the method which is the detection of the correct number of clusters for cases that two sets of nodes are bonded to each other with a single node. We provide an illustrative example to show that many existing validity indices fail to detect the correct number of clusters in such cases. We evaluate the performance of our proposed index with several experiments on different real-world and synthetic graphs. Experiments show that our proposed node-based index outperforms the existing validity indices for different graph data.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

QGraph: A Quality Assessment Index for Graph Clustering

On community structure validation in real networks

Article Open access 04 October 2021

A New Clustering Algorithm Based on Graph Connectivity

References

Abadi, A., Rajabioun, T., Ioannou, P. A., et al. (2015). Traffic flow prediction for road transportation networks with limited traffic data. IEEE Transactions on Intelligent Transportation Systems, 16(2), 653–662.
Google Scholar
Ah-Pine, J., Csurka, G., & Clinchant, S. (2015). Unsupervised visual and textual information fusion in CBMIR using graph-based methods. ACM Transactions on Information Systems, 33(2), 1–31.
Article Google Scholar
Almeida, H., Guedes, D., Meira, W., & Zaki, M. J. (2011). Is there a best quality metric for graph clusters? In Joint European conference on machine learning and knowledge discovery in databases (pp. 44–59). Springer.
Boutin, F., & Hascoët, M. (2004). Cluster validity indices for graph partitioning. In Proceedings. Eighth international conference on information visualisation, 2004. IV 2004 (pp. 376–381). IEEE.
Brandes, U., Gaertler, M., & Wagner, D. (2003). Experiments on graph clustering algorithms. In European symposium on algorithms (pp. 568–579). Springer.
Condon, A., & Karp, R. M. (2001). Algorithms for graph partitioning on the planted partition model. Random Structures & Algorithms, 18(2), 116–140.
Article Google Scholar
Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005). Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09), P09008.
Article Google Scholar
Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 224–227.
Article Google Scholar
Demir, E., Aykanat, C., & Cambazoglu, B. B. (2008). Clustering spatial networks for aggregate query processing: A hypergraph approach. Information Systems, 33(1), 1–17.
Article Google Scholar
Dunn, J. C. (1973). A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters.
Emmons, S., Kobourov, S., Gallant, M., & Börner, K. (2016). Analysis of network clustering algorithms and cluster quality metrics at scale. PLoS ONE, 11(7), e0159161.
Article Google Scholar
Girvan, M., & Newman, M. E. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12), 7821–7826.
Article Google Scholar
Gómez, D., Zarrazola, E., Yáñez, J., & Montero, J. (2015). A divide-and-link algorithm for hierarchical clustering in networks. Information Sciences, 316, 308–328.
Article Google Scholar
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.
Kehagias, A., & Pitsoulis, L. (2013). Bad communities with high modularity. The European Physical Journal B, 86(7), 1–11.
Article Google Scholar
Kobourov, S. G., Pupyrev, S., & Simonetto, P. (2014). Visualizing graphs as maps with contiguous regions. in EuroVis14. Accepted to appear 4.
Lancichinetti, A., Fortunato, S., & Radicchi, F. (2008). Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4), 046110.
Article Google Scholar
Li, L. T., Xiong, Z. Y., Dai, Q. Z., Zha, Y. F., Zhang, Y. F., & Dan, J. P. (2020). A novel graph-based clustering method using noise cutting. Information Systems, 91, 101504.
Article Google Scholar
Liang, S., Ren, Z., Zhao, Y., Ma, J., Yilmaz, E., & Rijke, M. D. (2017). Inferring dynamic user interests in streams of short texts for user clustering. ACM Transactions on Information Systems, 36(1), 1–37.
Google Scholar
Newman, M. (2018). Networks. Oxford University Press.
Newman, M. E. (2003). Mixing patterns in networks. Physical Review E, 67(2), 026126.
Article Google Scholar
Newman, M. E. (2004). Fast algorithm for detecting community structure in networks. Physical Review E, 69(6), 066133.
Article Google Scholar
Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113.
Article Google Scholar
Rojas-Thomas, J., Santos, M., & Mora, M. (2017). New internal index for clustering validation based on graphs. Expert Systems with Applications, 86, 334–349.
Article Google Scholar
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Article Google Scholar
Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1(1), 27–64.
Article Google Scholar
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Article Google Scholar
Tavakkol, B., Jeong, M. K., & Albin, S. L. (2018). Validity indices for clusters of uncertain data objects. Annals of Operations Research, 303, 1–37.
Google Scholar
Tosyali, A., Kim, J., Choi, J., & Jeong, M. K. (2019). Regularized asymmetric nonnegative matrix factorization for clustering in directed networks. Pattern Recognition Letters, 125, 750–757.
Article Google Scholar
Tosyali, A., Kim, J., Choi, J., Kang, Y., & Jeong, M. K. (2020). New node anomaly detection algorithm based on nonnegative matrix factorization for directed citation networks. Annals of Operations Research, 288, 457–474.
Article Google Scholar
van der Pol, J., & Rameshkoumar, J. P. (2018). The co-evolution of knowledge and collaboration networks: The role of the technology life-cycle. Scientometrics, 114(1), 307–323.
Article Google Scholar
Wang, F., Li, T., Wang, X., Zhu, S., & Ding, C. (2011). Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery, 22(3), 493–521.
Article Google Scholar
Wang, S., & Siskind, J. M. (2003). Image segmentation with ratio cut. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(6), 675–690.
Article Google Scholar
Xie, X. L., & Beni, G. (1991). A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis & Machine Intelligence, 8, 841–847.
Article Google Scholar
Zachary, W. W. (1977). An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4), 452–473.
Article Google Scholar
Zhu, J., Wu, X., Lin, X., Huang, C., Fung, G. P. C., & Tang, Y. (2018). A novel multiple layers name disambiguation framework for digital libraries using dynamic clustering. Scientometrics, 114(3), 781–794.
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of MIS, Marketing, and Analytics, Rochester Institute of Technology, Rochester, NY, 14623, USA
Ali Tosyali
Business Studies Program, Stockton University, Galloway, NJ, 08205, USA
Behnam Tavakkol

Authors

Ali Tosyali
View author publications
You can also search for this author in PubMed Google Scholar
Behnam Tavakkol
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Behnam Tavakkol.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Details of calculations for illustrative example

In the following, we show the details of calculating the value of our proposed node-based clustering validity index and the existing clustering validity indices for graph data.

When $K=2$:

Performance:

$$\begin{aligned} perf(G)=\frac{f(G)+g(G)}{\frac{1}{2}n(n-1)}=\frac{22+69}{\frac{1}{2}\times 17\times 16}=0.669 \end{aligned}$$

Rcut:

$$\begin{aligned} Rcut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{|C_i|}=\frac{1}{10}+\frac{1}{7}=0.242 \end{aligned}$$

Ncut:

$$\begin{aligned} Ncut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{vol(C_i)}=\frac{1}{12}+\frac{1}{12}=0.166 \end{aligned}$$

Modularity:

$$\begin{aligned} e_{11}= & {} \frac{11 \times 2}{23\times 2}=\frac{22}{46}, \quad a_1=\frac{(11\times 2)+1}{23\times 2}=\frac{23}{46}\\ e_{22}= & {} \frac{11 \times 2}{23\times 2}=\frac{22}{46}, \quad a_2=\frac{(11\times 2)+1}{23\times 2}=\frac{23}{46}\\ Q(G)= & {} \sum _k(e_{kk}-a^2_k)=\left( \frac{22}{46}-\left( \frac{23}{46}\right) ^2\right) + \left( \frac{22}{46}-\left( \frac{23}{46}\right) ^2\right) =0.456 \end{aligned}$$

Conductance:

$$\begin{aligned} \Phi (1)= & {} \frac{1}{\min (23,23)}=\frac{1}{23}, \quad \Phi (2)=\frac{1}{\min (23,23)}=\frac{1}{23}\\ \Phi (G)= & {} 1-\frac{1}{k}\sum _k\Phi (C_k)=1-\frac{1}{2}\left( \frac{1}{23}+\frac{1}{23}\right) =0.956 \end{aligned}$$

Coverage:

$$\begin{aligned} \lambda (G)=\frac{\sum _{i,j}A_{ij}I(C_{l_i},C_{l_j})}{\sum _{i,j}A_{ij}}=\frac{(11\times 2)+(11\times 2)}{23\times 2}=0.956 \end{aligned}$$

Node-based validity index:

$$\begin{aligned}&\begin{array}{|c|c|c|c|c|c|c|c|} \hline \text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}&{}\text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}\\ \hline 1&{}2/9&{}0&{}1&{}10&{}2/9&{}1/7&{}0.357\\ \hline 2&{}2/9&{}0&{}1&{}11&{}4/6&{}1/10&{}0.85\\ \hline 3&{}2/9&{}0&{}1&{}12&{}2/6&{}0&{}1\\ \hline 4&{}3/9&{}0&{}1&{}13&{}2/6&{}0&{}1\\ \hline 5&{}2/9&{}0&{}1&{}14&{}4/6&{}0&{}1\\ \hline 6&{}3/9&{}0&{}1&{}15&{}4/6&{}0&{}1\\ \hline 7&{}2/9&{}0&{}1&{}16&{}3/6&{}0&{}1\\ \hline 8&{}2/9&{}0&{}1&{}17&{}3/6&{}0&{}1\\ \hline 9&{}2/9&{}0&{}1&{}&{}&{}&{}\\ \hline \end{array}\\&N(G)=\frac{1}{17}\sum _{i=1}^{17}\frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}=0.953 \end{aligned}$$

When $K=3$:

Performance:

$$\begin{aligned} perf(G)=\frac{f(G)+g(G)}{\frac{1}{2}n(n-1)}=\frac{21+93}{\frac{1}{2}\times 17\times 16}=0.838 \end{aligned}$$

Rcut:

$$\begin{aligned} Rcut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{|C_i|}=\frac{1}{5}+\frac{2}{5}+\frac{1}{7}=0.742 \end{aligned}$$

Ncut:

$$\begin{aligned} Ncut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{vol(C_i)}=\frac{1}{6}+\frac{2}{7}+\frac{1}{12}=0.535 \end{aligned}$$

Modularity:

$$\begin{aligned} e_{11}= & {} \frac{5 \times 2}{23\times 2}=\frac{10}{46}, \quad a_1=\frac{(5\times 2)+1}{23\times 2}=\frac{11}{46}\\ e_{22}= & {} \frac{5 \times 2}{23\times 2}=\frac{10}{46}, \quad a_2=\frac{(5\times 2)+2}{23\times 2}=\frac{12}{46}\\ e_{33}= & {} \frac{11 \times 2}{23\times 2}=\frac{22}{46}, \quad a_3=\frac{(11\times 2)+1}{23\times 2}=\frac{23}{46}\\ Q(G)= & {} \sum _k(e_{kk}-a^2_k)=\left( \frac{10}{46}-\left( \frac{11}{46}\right) ^2\right) + \left( \frac{10}{46}-\left( \frac{12}{46}\right) ^2\right) \\&+ \left( \frac{22}{46}-\left( \frac{23}{46}\right) ^2\right) =0.537 \end{aligned}$$

Conductance:

$$\begin{aligned} \Phi (1)= & {} \frac{1}{\min (11,35)}=\frac{1}{11}, \quad \Phi (2)=\frac{2}{\min (12,34)}=\frac{2}{12}\\ \Phi (3)= & {} \frac{1}{\min (23,23)}=\frac{1}{23}\\ \Phi (G)= & {} 1-\frac{1}{k}\sum _k\Phi (C_k)=1-\frac{1}{3}\left( \frac{1}{11}+\frac{2}{12}+\frac{1}{23}\right) =0.899 \end{aligned}$$

Coverage:

$$\begin{aligned} \lambda (G)=\frac{\sum _{i,j}A_{ij}I\left( C_{l_i},C_{l_j}\right) }{\sum _{i,j}A_{ij}}=\frac{(5\times 2)+(5\times 2)+(11\times 2)}{23\times 2}=0.913 \end{aligned}$$

Node-based validity index:

$$\begin{aligned}&\begin{array}{|c|c|c|c|c|c|c|c|} \hline \text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}&{}\text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}\\ \hline 1&{}2/4&{}0&{}1&{}10&{}2/4&{}1/12&{}0.833\\ \hline 2 &{}2/4 &{}0 &{}1 &{}11 &{}4/6 &{}1/10 &{}0.85\\ \hline 3&{} 2/4&{} 0&{} 1&{} 12&{} 2/6&{} 0&{} 1\\ \hline 4&{} 2/4&{} 1/12&{} 0.833&{} 13&{} 2/6&{} 0&{} 1\\ \hline 5&{} 2/4&{} 0&{} 1&{} 14&{} 4/6&{} 0&{} 1\\ \hline 6&{} 2/4&{} 1/12&{} 0.833&{} 15&{} 4/6&{} 0&{} 1\\ \hline 7&{} 2/4&{} 0&{} 1&{} 16&{} 3/6&{} 0&{} 1\\ \hline 8&{} 2/4&{} 0&{} 1&{} 17&{} 3/6&{} 0&{} 1\\ \hline 9&{} 2/4&{} 0&{} 1&{}&{}&{}&{}\\ \hline \end{array}\\&N(G)=\frac{1}{17}\sum _{i=1}^{17}\frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}=0.961 \end{aligned}$$

When $K=4$:

Performance:

$$\begin{aligned} perf(G)=\frac{f(G)+g(G)}{\frac{1}{2}n(n-1)}=\frac{19+103}{\frac{1}{2}\times 17\times 16}=0.897 \end{aligned}$$

Rcut:

$$\begin{aligned} Rcut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{|C_i|}=\frac{1}{5}+\frac{2}{5}+\frac{3}{3}+\frac{2}{4}=2.100 \end{aligned}$$

Ncut:

$$\begin{aligned} Ncut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{vol(C_i)}=\frac{1}{6}+\frac{2}{7}+\frac{3}{6}+\frac{2}{8}=1.202 \end{aligned}$$

Modularity:

$$\begin{aligned} e_{11}= & {} \frac{5 \times 2}{23\times 2}=\frac{10}{46}, \quad a_1=\frac{(5\times 2)+1}{23\times 2}=\frac{11}{46}\\ e_{22}= & {} \frac{5 \times 2}{23\times 2}=\frac{10}{46}, \quad a_2=\frac{(5\times 2)+2}{23\times 2}=\frac{12}{46}\\ e_{33}= & {} \frac{3 \times 2}{23\times 2}=\frac{6}{46}, \quad a_3=\frac{(3\times 2)+3}{23\times 2}=\frac{9}{46}\\ e_{44}= & {} \frac{6 \times 2}{23\times 2}=\frac{12}{46}, \quad a_4=\frac{(6\times 2)+2}{23\times 2}=\frac{14}{46}\\ Q(G)= & {} \sum _k(e_{kk}-a^2_k)=\left( \frac{10}{46}-\left( \frac{11}{46}\right) ^2\right) + \left( \frac{10}{46}-\left( \frac{12}{46}\right) ^2\right) \\&+ \left( \frac{6}{46}-\left( \frac{9}{46}\right) ^2\right) +\left( \frac{12}{46}-\left( \frac{14}{46}\right) ^2\right) =0.569 \end{aligned}$$

Conductance:

$$\begin{aligned} \Phi (1)= & {} \frac{1}{\min (11,35)}=\frac{1}{11}, \quad \Phi (2)=\frac{2}{\min (12,34)}=\frac{2}{12}\\ \Phi (3)= & {} \frac{3}{\min (9,37)}=\frac{3}{9}, \quad \Phi (4)=\frac{2}{\min (14,32)}=\frac{2}{14}\\ \Phi (G)= & {} 1-\frac{1}{k}\sum _k\Phi (C_k)=1-\frac{1}{4} \left( \frac{1}{11}+\frac{2}{12}+\frac{3}{9}+\frac{2}{12}\right) =0.816 \end{aligned}$$

Coverage:

$$\begin{aligned} \lambda (G)=\frac{\sum _{i,j}A_{ij}I(C_{l_i},C_{l_j})}{\sum _{i,j}A_{ij}} =\frac{(5\times 2)+(5\times 2)+(3\times 2)+(6\times 2)}{23\times 2}=0.826 \end{aligned}$$

Node-based validity index:

$$\begin{aligned}&\begin{array}{|c|c|c|c|c|c|c|c|} \hline \text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}&{}\text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}\\ \hline 1&{} 2/4&{} 0&{} 1&{} 10&{} 2/4&{} 1/12&{} 0.833\\ \hline 2&{} 2/4&{} 0&{} 1&{} 11&{} 1&{} 3/14&{} 0.785\\ \hline 3&{} 2/4&{} 0&{} 1&{} 12&{} 1&{} 0&{} 1\\ \hline 4&{} 2/4&{} 1/12&{} 0.833&{} 13&{} 1&{} 0&{} 1\\ \hline 5&{} 2/4&{} 0&{} 1&{} 14&{} 1&{} 1/13&{} 0.923\\ \hline 6&{} 2/4&{} 1/12&{} 0.833&{} 15&{} 1&{} 1/13&{} 0.923\\ \hline 7&{} 2/4&{} 0&{} 1&{} 16&{} 1&{} 0&{} 1\\ \hline 8&{} 2/4&{} 0&{} 1&{} 17&{} 1&{} 0&{} 1\\ \hline 9&{} 2/4&{} 0&{} 1&{}&{}&{}&{}\\ \hline \end{array}\\&N(G)=\frac{1}{17}\sum _{i=1}^{17}\frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}=0.948 \end{aligned}$$

Rights and permissions

Reprints and permissions

About this article

Cite this article

Tosyali, A., Tavakkol, B. A node-based index for clustering validation of graph data. Ann Oper Res 341, 197–221 (2024). https://doi.org/10.1007/s10479-021-04376-7

Download citation

Accepted: 19 October 2021
Published: 08 November 2021
Issue Date: October 2024
DOI: https://doi.org/10.1007/s10479-021-04376-7

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A node-based index for clustering validation of graph data

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

QGraph: A Quality Assessment Index for Graph Clustering

On community structure validation in real networks

A New Clustering Algorithm Based on Graph Connectivity

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendix: Details of calculations for illustrative example

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

A node-based index for clustering validation of graph data

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

QGraph: A Quality Assessment Index for Graph Clustering

On community structure validation in real networks

A New Clustering Algorithm Based on Graph Connectivity

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendix: Details of calculations for illustrative example

Appendix: Details of calculations for illustrative example

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation