iBet uBet web content aggregator. Adding the entire web to your favor.
iBet uBet web content aggregator. Adding the entire web to your favor.



Link to original content: https://doi.org/10.1007/s10479-021-04376-7
A node-based index for clustering validation of graph data | Annals of Operations Research Skip to main content
Log in

A node-based index for clustering validation of graph data

  • Original Research
  • Published:
Annals of Operations Research Aims and scope Submit manuscript

Abstract

Clustering validity indices are designed for evaluating the performance of clustering algorithms in terms of quality of clusters. They are also used for detecting the correct number of clusters. Numerous clustering validity indices have been proposed in the literature for different types of data. However, to the best of our knowledge, there are a few well-known clustering validity indices for graph data. In this paper, we propose a new clustering validity index for graph data called the node-based clustering validity index. The main characteristic of our proposed index is that it captures the exclusive contribution of each node to the separation and compactness of the clusters in a graph. This characteristic gives an advantage to the method which is the detection of the correct number of clusters for cases that two sets of nodes are bonded to each other with a single node. We provide an illustrative example to show that many existing validity indices fail to detect the correct number of clusters in such cases. We evaluate the performance of our proposed index with several experiments on different real-world and synthetic graphs. Experiments show that our proposed node-based index outperforms the existing validity indices for different graph data.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

References

  • Abadi, A., Rajabioun, T., Ioannou, P. A., et al. (2015). Traffic flow prediction for road transportation networks with limited traffic data. IEEE Transactions on Intelligent Transportation Systems, 16(2), 653–662.

    Google Scholar 

  • Ah-Pine, J., Csurka, G., & Clinchant, S. (2015). Unsupervised visual and textual information fusion in CBMIR using graph-based methods. ACM Transactions on Information Systems, 33(2), 1–31.

    Article  Google Scholar 

  • Almeida, H., Guedes, D., Meira, W., & Zaki, M. J. (2011). Is there a best quality metric for graph clusters? In Joint European conference on machine learning and knowledge discovery in databases (pp. 44–59). Springer.

  • Boutin, F., & Hascoët, M. (2004). Cluster validity indices for graph partitioning. In Proceedings. Eighth international conference on information visualisation, 2004. IV 2004 (pp. 376–381). IEEE.

  • Brandes, U., Gaertler, M., & Wagner, D. (2003). Experiments on graph clustering algorithms. In European symposium on algorithms (pp. 568–579). Springer.

  • Condon, A., & Karp, R. M. (2001). Algorithms for graph partitioning on the planted partition model. Random Structures & Algorithms, 18(2), 116–140.

    Article  Google Scholar 

  • Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005). Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09), P09008.

    Article  Google Scholar 

  • Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 224–227.

    Article  Google Scholar 

  • Demir, E., Aykanat, C., & Cambazoglu, B. B. (2008). Clustering spatial networks for aggregate query processing: A hypergraph approach. Information Systems, 33(1), 1–17.

    Article  Google Scholar 

  • Dunn, J. C. (1973). A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters.

  • Emmons, S., Kobourov, S., Gallant, M., & Börner, K. (2016). Analysis of network clustering algorithms and cluster quality metrics at scale. PLoS ONE, 11(7), e0159161.

    Article  Google Scholar 

  • Girvan, M., & Newman, M. E. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12), 7821–7826.

    Article  Google Scholar 

  • Gómez, D., Zarrazola, E., Yáñez, J., & Montero, J. (2015). A divide-and-link algorithm for hierarchical clustering in networks. Information Sciences, 316, 308–328.

    Article  Google Scholar 

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.

  • Kehagias, A., & Pitsoulis, L. (2013). Bad communities with high modularity. The European Physical Journal B, 86(7), 1–11.

    Article  Google Scholar 

  • Kobourov, S. G., Pupyrev, S., & Simonetto, P. (2014). Visualizing graphs as maps with contiguous regions. in EuroVis14. Accepted to appear 4.

  • Lancichinetti, A., Fortunato, S., & Radicchi, F. (2008). Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4), 046110.

    Article  Google Scholar 

  • Li, L. T., Xiong, Z. Y., Dai, Q. Z., Zha, Y. F., Zhang, Y. F., & Dan, J. P. (2020). A novel graph-based clustering method using noise cutting. Information Systems, 91, 101504.

    Article  Google Scholar 

  • Liang, S., Ren, Z., Zhao, Y., Ma, J., Yilmaz, E., & Rijke, M. D. (2017). Inferring dynamic user interests in streams of short texts for user clustering. ACM Transactions on Information Systems, 36(1), 1–37.

    Google Scholar 

  • Newman, M. (2018). Networks. Oxford University Press.

  • Newman, M. E. (2003). Mixing patterns in networks. Physical Review E, 67(2), 026126.

    Article  Google Scholar 

  • Newman, M. E. (2004). Fast algorithm for detecting community structure in networks. Physical Review E, 69(6), 066133.

    Article  Google Scholar 

  • Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113.

    Article  Google Scholar 

  • Rojas-Thomas, J., Santos, M., & Mora, M. (2017). New internal index for clustering validation based on graphs. Expert Systems with Applications, 86, 334–349.

    Article  Google Scholar 

  • Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.

    Article  Google Scholar 

  • Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1(1), 27–64.

    Article  Google Scholar 

  • Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.

    Article  Google Scholar 

  • Tavakkol, B., Jeong, M. K., & Albin, S. L. (2018). Validity indices for clusters of uncertain data objects. Annals of Operations Research, 303, 1–37.

    Google Scholar 

  • Tosyali, A., Kim, J., Choi, J., & Jeong, M. K. (2019). Regularized asymmetric nonnegative matrix factorization for clustering in directed networks. Pattern Recognition Letters, 125, 750–757.

    Article  Google Scholar 

  • Tosyali, A., Kim, J., Choi, J., Kang, Y., & Jeong, M. K. (2020). New node anomaly detection algorithm based on nonnegative matrix factorization for directed citation networks. Annals of Operations Research, 288, 457–474.

    Article  Google Scholar 

  • van der Pol, J., & Rameshkoumar, J. P. (2018). The co-evolution of knowledge and collaboration networks: The role of the technology life-cycle. Scientometrics, 114(1), 307–323.

    Article  Google Scholar 

  • Wang, F., Li, T., Wang, X., Zhu, S., & Ding, C. (2011). Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery, 22(3), 493–521.

    Article  Google Scholar 

  • Wang, S., & Siskind, J. M. (2003). Image segmentation with ratio cut. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(6), 675–690.

    Article  Google Scholar 

  • Xie, X. L., & Beni, G. (1991). A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis & Machine Intelligence, 8, 841–847.

    Article  Google Scholar 

  • Zachary, W. W. (1977). An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4), 452–473.

    Article  Google Scholar 

  • Zhu, J., Wu, X., Lin, X., Huang, C., Fung, G. P. C., & Tang, Y. (2018). A novel multiple layers name disambiguation framework for digital libraries using dynamic clustering. Scientometrics, 114(3), 781–794.

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Behnam Tavakkol.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Details of calculations for illustrative example

Appendix: Details of calculations for illustrative example

In the following, we show the details of calculating the value of our proposed node-based clustering validity index and the existing clustering validity indices for graph data.

When \(K=2\):

Performance:

$$\begin{aligned} perf(G)=\frac{f(G)+g(G)}{\frac{1}{2}n(n-1)}=\frac{22+69}{\frac{1}{2}\times 17\times 16}=0.669 \end{aligned}$$

Rcut:

$$\begin{aligned} Rcut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{|C_i|}=\frac{1}{10}+\frac{1}{7}=0.242 \end{aligned}$$

Ncut:

$$\begin{aligned} Ncut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{vol(C_i)}=\frac{1}{12}+\frac{1}{12}=0.166 \end{aligned}$$

Modularity:

$$\begin{aligned} e_{11}= & {} \frac{11 \times 2}{23\times 2}=\frac{22}{46}, \quad a_1=\frac{(11\times 2)+1}{23\times 2}=\frac{23}{46}\\ e_{22}= & {} \frac{11 \times 2}{23\times 2}=\frac{22}{46}, \quad a_2=\frac{(11\times 2)+1}{23\times 2}=\frac{23}{46}\\ Q(G)= & {} \sum _k(e_{kk}-a^2_k)=\left( \frac{22}{46}-\left( \frac{23}{46}\right) ^2\right) + \left( \frac{22}{46}-\left( \frac{23}{46}\right) ^2\right) =0.456 \end{aligned}$$

Conductance:

$$\begin{aligned} \Phi (1)= & {} \frac{1}{\min (23,23)}=\frac{1}{23}, \quad \Phi (2)=\frac{1}{\min (23,23)}=\frac{1}{23}\\ \Phi (G)= & {} 1-\frac{1}{k}\sum _k\Phi (C_k)=1-\frac{1}{2}\left( \frac{1}{23}+\frac{1}{23}\right) =0.956 \end{aligned}$$

Coverage:

$$\begin{aligned} \lambda (G)=\frac{\sum _{i,j}A_{ij}I(C_{l_i},C_{l_j})}{\sum _{i,j}A_{ij}}=\frac{(11\times 2)+(11\times 2)}{23\times 2}=0.956 \end{aligned}$$

Node-based validity index:

$$\begin{aligned}&\begin{array}{|c|c|c|c|c|c|c|c|} \hline \text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}&{}\text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}\\ \hline 1&{}2/9&{}0&{}1&{}10&{}2/9&{}1/7&{}0.357\\ \hline 2&{}2/9&{}0&{}1&{}11&{}4/6&{}1/10&{}0.85\\ \hline 3&{}2/9&{}0&{}1&{}12&{}2/6&{}0&{}1\\ \hline 4&{}3/9&{}0&{}1&{}13&{}2/6&{}0&{}1\\ \hline 5&{}2/9&{}0&{}1&{}14&{}4/6&{}0&{}1\\ \hline 6&{}3/9&{}0&{}1&{}15&{}4/6&{}0&{}1\\ \hline 7&{}2/9&{}0&{}1&{}16&{}3/6&{}0&{}1\\ \hline 8&{}2/9&{}0&{}1&{}17&{}3/6&{}0&{}1\\ \hline 9&{}2/9&{}0&{}1&{}&{}&{}&{}\\ \hline \end{array}\\&N(G)=\frac{1}{17}\sum _{i=1}^{17}\frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}=0.953 \end{aligned}$$

When \(K=3\):

Performance:

$$\begin{aligned} perf(G)=\frac{f(G)+g(G)}{\frac{1}{2}n(n-1)}=\frac{21+93}{\frac{1}{2}\times 17\times 16}=0.838 \end{aligned}$$

Rcut:

$$\begin{aligned} Rcut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{|C_i|}=\frac{1}{5}+\frac{2}{5}+\frac{1}{7}=0.742 \end{aligned}$$

Ncut:

$$\begin{aligned} Ncut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{vol(C_i)}=\frac{1}{6}+\frac{2}{7}+\frac{1}{12}=0.535 \end{aligned}$$

Modularity:

$$\begin{aligned} e_{11}= & {} \frac{5 \times 2}{23\times 2}=\frac{10}{46}, \quad a_1=\frac{(5\times 2)+1}{23\times 2}=\frac{11}{46}\\ e_{22}= & {} \frac{5 \times 2}{23\times 2}=\frac{10}{46}, \quad a_2=\frac{(5\times 2)+2}{23\times 2}=\frac{12}{46}\\ e_{33}= & {} \frac{11 \times 2}{23\times 2}=\frac{22}{46}, \quad a_3=\frac{(11\times 2)+1}{23\times 2}=\frac{23}{46}\\ Q(G)= & {} \sum _k(e_{kk}-a^2_k)=\left( \frac{10}{46}-\left( \frac{11}{46}\right) ^2\right) + \left( \frac{10}{46}-\left( \frac{12}{46}\right) ^2\right) \\&+ \left( \frac{22}{46}-\left( \frac{23}{46}\right) ^2\right) =0.537 \end{aligned}$$

Conductance:

$$\begin{aligned} \Phi (1)= & {} \frac{1}{\min (11,35)}=\frac{1}{11}, \quad \Phi (2)=\frac{2}{\min (12,34)}=\frac{2}{12}\\ \Phi (3)= & {} \frac{1}{\min (23,23)}=\frac{1}{23}\\ \Phi (G)= & {} 1-\frac{1}{k}\sum _k\Phi (C_k)=1-\frac{1}{3}\left( \frac{1}{11}+\frac{2}{12}+\frac{1}{23}\right) =0.899 \end{aligned}$$

Coverage:

$$\begin{aligned} \lambda (G)=\frac{\sum _{i,j}A_{ij}I\left( C_{l_i},C_{l_j}\right) }{\sum _{i,j}A_{ij}}=\frac{(5\times 2)+(5\times 2)+(11\times 2)}{23\times 2}=0.913 \end{aligned}$$

Node-based validity index:

$$\begin{aligned}&\begin{array}{|c|c|c|c|c|c|c|c|} \hline \text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}&{}\text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}\\ \hline 1&{}2/4&{}0&{}1&{}10&{}2/4&{}1/12&{}0.833\\ \hline 2 &{}2/4 &{}0 &{}1 &{}11 &{}4/6 &{}1/10 &{}0.85\\ \hline 3&{} 2/4&{} 0&{} 1&{} 12&{} 2/6&{} 0&{} 1\\ \hline 4&{} 2/4&{} 1/12&{} 0.833&{} 13&{} 2/6&{} 0&{} 1\\ \hline 5&{} 2/4&{} 0&{} 1&{} 14&{} 4/6&{} 0&{} 1\\ \hline 6&{} 2/4&{} 1/12&{} 0.833&{} 15&{} 4/6&{} 0&{} 1\\ \hline 7&{} 2/4&{} 0&{} 1&{} 16&{} 3/6&{} 0&{} 1\\ \hline 8&{} 2/4&{} 0&{} 1&{} 17&{} 3/6&{} 0&{} 1\\ \hline 9&{} 2/4&{} 0&{} 1&{}&{}&{}&{}\\ \hline \end{array}\\&N(G)=\frac{1}{17}\sum _{i=1}^{17}\frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}=0.961 \end{aligned}$$

When \(K=4\):

Performance:

$$\begin{aligned} perf(G)=\frac{f(G)+g(G)}{\frac{1}{2}n(n-1)}=\frac{19+103}{\frac{1}{2}\times 17\times 16}=0.897 \end{aligned}$$

Rcut:

$$\begin{aligned} Rcut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{|C_i|}=\frac{1}{5}+\frac{2}{5}+\frac{3}{3}+\frac{2}{4}=2.100 \end{aligned}$$

Ncut:

$$\begin{aligned} Ncut\left( C_1,C_2, \ldots ,C_K\right) =\sum _{i=1}^{K}\frac{cut\left( C_i, \bar{C_i}\right) }{vol(C_i)}=\frac{1}{6}+\frac{2}{7}+\frac{3}{6}+\frac{2}{8}=1.202 \end{aligned}$$

Modularity:

$$\begin{aligned} e_{11}= & {} \frac{5 \times 2}{23\times 2}=\frac{10}{46}, \quad a_1=\frac{(5\times 2)+1}{23\times 2}=\frac{11}{46}\\ e_{22}= & {} \frac{5 \times 2}{23\times 2}=\frac{10}{46}, \quad a_2=\frac{(5\times 2)+2}{23\times 2}=\frac{12}{46}\\ e_{33}= & {} \frac{3 \times 2}{23\times 2}=\frac{6}{46}, \quad a_3=\frac{(3\times 2)+3}{23\times 2}=\frac{9}{46}\\ e_{44}= & {} \frac{6 \times 2}{23\times 2}=\frac{12}{46}, \quad a_4=\frac{(6\times 2)+2}{23\times 2}=\frac{14}{46}\\ Q(G)= & {} \sum _k(e_{kk}-a^2_k)=\left( \frac{10}{46}-\left( \frac{11}{46}\right) ^2\right) + \left( \frac{10}{46}-\left( \frac{12}{46}\right) ^2\right) \\&+ \left( \frac{6}{46}-\left( \frac{9}{46}\right) ^2\right) +\left( \frac{12}{46}-\left( \frac{14}{46}\right) ^2\right) =0.569 \end{aligned}$$

Conductance:

$$\begin{aligned} \Phi (1)= & {} \frac{1}{\min (11,35)}=\frac{1}{11}, \quad \Phi (2)=\frac{2}{\min (12,34)}=\frac{2}{12}\\ \Phi (3)= & {} \frac{3}{\min (9,37)}=\frac{3}{9}, \quad \Phi (4)=\frac{2}{\min (14,32)}=\frac{2}{14}\\ \Phi (G)= & {} 1-\frac{1}{k}\sum _k\Phi (C_k)=1-\frac{1}{4} \left( \frac{1}{11}+\frac{2}{12}+\frac{3}{9}+\frac{2}{12}\right) =0.816 \end{aligned}$$

Coverage:

$$\begin{aligned} \lambda (G)=\frac{\sum _{i,j}A_{ij}I(C_{l_i},C_{l_j})}{\sum _{i,j}A_{ij}} =\frac{(5\times 2)+(5\times 2)+(3\times 2)+(6\times 2)}{23\times 2}=0.826 \end{aligned}$$

Node-based validity index:

$$\begin{aligned}&\begin{array}{|c|c|c|c|c|c|c|c|} \hline \text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}&{}\text {Node ID} &{} \gamma _i &{} \theta _i &{} \frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}\\ \hline 1&{} 2/4&{} 0&{} 1&{} 10&{} 2/4&{} 1/12&{} 0.833\\ \hline 2&{} 2/4&{} 0&{} 1&{} 11&{} 1&{} 3/14&{} 0.785\\ \hline 3&{} 2/4&{} 0&{} 1&{} 12&{} 1&{} 0&{} 1\\ \hline 4&{} 2/4&{} 1/12&{} 0.833&{} 13&{} 1&{} 0&{} 1\\ \hline 5&{} 2/4&{} 0&{} 1&{} 14&{} 1&{} 1/13&{} 0.923\\ \hline 6&{} 2/4&{} 1/12&{} 0.833&{} 15&{} 1&{} 1/13&{} 0.923\\ \hline 7&{} 2/4&{} 0&{} 1&{} 16&{} 1&{} 0&{} 1\\ \hline 8&{} 2/4&{} 0&{} 1&{} 17&{} 1&{} 0&{} 1\\ \hline 9&{} 2/4&{} 0&{} 1&{}&{}&{}&{}\\ \hline \end{array}\\&N(G)=\frac{1}{17}\sum _{i=1}^{17}\frac{\gamma _i-\theta _i}{\max (\gamma _i,\theta _i)}=0.948 \end{aligned}$$

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Tosyali, A., Tavakkol, B. A node-based index for clustering validation of graph data. Ann Oper Res 341, 197–221 (2024). https://doi.org/10.1007/s10479-021-04376-7

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10479-021-04376-7

Keywords

Navigation