Abstract
Measuring similarities among different vertices is a fundamental problem in graph analysis. Among different similarity measurements, SimRank is one of the most promising and popular. In reality, instead of computing the whole similarity matrix, people often issue SimRank queries in a pairwise manner, each of which needs to estimate an approximate SimRank value within a specified accuracy for a given pair of nodes. These pairwise SimRank queries are often processed on real-life graphs, which typically evolve over time, requiring efficient algorithms that can query pairwise SimRank under dynamic graph updates. However, current single-pair SimRank solutions are either static or inefficient in handling dynamic cases with good-quality results. Observing that the sample size is the major factor that determines the efficiency and the accuracy in Monte Carlo methods to estimate pairwise SimRank, in this paper, we propose three algorithms to query pairwise SimRank over static and dynamic graphs efficiently, by using different sample reduction strategies. The accuracy of our algorithms is guaranteed by the different invariants we propose for pairwise SimRank. We show that our algorithms outperform the state-of-the-art static and dynamic solutions for pairwise SimRank estimation.
Similar content being viewed by others
Notes
If \(|In(a)|=0\), the SimRank score between a and any other node is 0.
\(\delta \) is the failure probability.
d is the average in-degree of G.
The details of the derivation of the concentration inequality, which determines N, are shown in Sect. 3.
Since \(N=\frac{c^{2}\ln {\frac{2}{\delta }}}{2\epsilon ^{2}}= 0.6^{2} \times \frac{\ln {\frac{2}{0.01}}}{2 * (0.01)^{2}} \approx 9537\).
We assume \(\epsilon \) is a rationale number, instead of a real number; otherwise, \(\epsilon \) is uncountable and the problem is undecidable.
Mathematically, we can view \(\mathbf {p} = \overrightarrow{vec}(\mathbf {P})\) and \(\mathbf {p} = \overrightarrow{vec}(\mathbf {R})\), where \(\mathbf {p}, \mathbf {r} \in {\mathbb {R}}^{n^{2} \times 1}\) and \(\mathbf {P},\mathbf {R} \in {\mathbb {R}}^{n \times n}\).
If both algorithms A and B achieve the \(\epsilon \)-bound, and B has a smaller ME than A, we cannot conclude that B has higher accuracy than A, since \(\epsilon \) is a predefined accuracy parameter. In contrast, this indicates that B may have unnecessary redundant computational cost.
Since the ground truth all-pairs SimRank scores of these four small data sets can be computed within several minutes, it would be unaffordable to run the algorithm which has a total running time longer than 1 h.
References
Abbassi, Z., Mirrokni, V.S.: A recommender system based on local random walks and spectral methods. In: WebKDD/SNA-KDD (2007). https://doi.org/10.1145/1348549.1348561
Andersen, R., Chung, F., Lang, K.: Local graph partitioning using pagerank vectors. In: FOCS (2006)
Antonellis, I., Garcia-Molina, H., Chang, C.: Simrank++: query rewriting through link analysis of the click graph. PVLDB 1(1), 408–421, (2008). http://www.vldb.org/pvldb/1/1453903.pdf
Fogaras, D., Rácz, B.: Scaling link-based similarity search. In: WWW (2005)
Fujiwara, Y., Nakatsuji, M., Shiokawa, H., Onizuka, M.: Efficient search algorithm for SimRank. In: ICDE (2013)
He, G., Feng, H., Li, C., Chen, H.: Parallel SimRank computation on large graphs with iterative aggregation. In: KDD (2010). https://doi.org/10.1145/1835804.1835874
He, J., Liu, H., Yu, J.X., Li, P., He, W., Du, X.: Assessing single-pair similarity over graphs by aggregating first-meeting probabilities. Inf. Syst. 42, 107–122 (2014). https://doi.org/10.1016/j.is.2013.12.008
Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58(301), 13–30 (1963). http://www.jstor.org/stable/2282952
Jeh, G., Widom, J.: Simrank: a measure of structural-context similarity. In: KDD (2002)
Jeh, G., Widom, J.: Scaling personalized web search. In: Proceedings of the Twelfth International World Wide Web Conference, WWW 2003, Budapest, Hungary, 20–24 May 2003, pp. 271–279 (2003). https://doi.org/10.1145/775152.775191
Jiang, M., Fu, A.W., Wong, R.C., Wang, K.: READS: a random walk approach for efficient and accurate dynamic SimRank. PVLDB 10(9), 937–948 (2017)
Kusumoto, M., Maehara, T., Kawarabayashi, K: Scalable similarity search for SimRank. In: SIGMOD (2014). https://doi.org/10.1145/2588555.2610526
Li, C., Han, J., He, G., Jin, X., Sun, Y., Yu, Y., Wu, T.: Fast computation of SimRank for static and dynamic information networks. In: EDBT (2010a). https://doi.org/10.1145/1739041.1739098
Li, P., Liu, H., Yu, J.X., He, J., Du, X.: Fast single-pair SimRank computation. In: Proceedings of the 2010 SIAM International Conference on Data Mining, SIAM, pp. 571–582 (2010b)
Li, Z., Fang, Y., Liu, Q., Cheng, J., Cheng, R., Lui, J.C.S.: Walking in the cloud: parallel simrank at scale. PVLDB 9(1), 24–35 (2015)
Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Assoc. Inf. Sci. Technol. 58(7), 1019–1031 (2007)
Liu, Y., Zheng, B., He, X., Wei, Z., Xiao, X., Zheng, K., Lu, J.: Probesim: scalable single-source and top-k SimRank computations on dynamic graphs. PVLDB 11(1), 14–26 (2017). http://www.vldb.org/pvldb/vol11/p14-liu.pdf
Lizorkin, D., Velikhov, P., Grinev, M., Turdakov, D.: Accuracy estimate and optimization techniques for SimRank computation. VLDB 1(1), 422–433 (2008)
Lofgren, P., Banerjee, S., Goel, A.: Personalized PageRank estimation and search: a bidirectional approach. In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, ACM, New York (WSDM ’16), pp. 163–172 (2016). https://doi.org/10.1145/2835776.2835823
Lu, J., Gong, Z., Lin, X.: A novel and fast SimRank algorithm. IEEE Trans. Knowl. Data Eng. (2017). https://doi.org/10.1109/TKDE.2016.2626282
Maehara, T., Kusumoto, M., Kawarabayashi, K.: Efficient SimRank computation via linearization (2014). CoRR arXiv:1411.7228
Mislove, A., Koppula, H.S., Gummadi, K.P., Druschel, P., Bhattacharjee, B.: Growth of the flickr social network. In: Proceedings of the First Workshop on Online Social Networks (WOSN 2008), Seattle, 17–22 Aug 2008, pp. 25–30 (2008). https://doi.org/10.1145/1397735.1397742
Shao, Y., Cui, B., Chen, L., Liu, M., Xie, X.: An efficient similarity search framework for SimRank over large dynamic graphs. PVLDB 8(8), 838–849 (2015)
Spirin, N., Han, J.: Survey on web spam detection: principles and algorithms. SIGKDD Explor. Newsl. 13(2), 50–64 (2012)
Tao, W., Yu, M., Li, G.: Efficient top-k SimRank-based similarity join. PVLDB 8(3):317–328, (2014). http://www.vldb.org/pvldb/vol8/p317-tao.pdf
Tian, B., Xiao, X.: Sling: A near-optimal index structure for SimRank. SIGMOD (2016).https://doi.org/10.1145/2882903.2915243
Wang, Y., Lian, X., Chen, L.: Efficient SimRank tracking in dynamic graphs. In: ICDE (2018)
Yin, X., Han, J., Yu, P.S.: Linkclus: efficient clustering via heterogeneous semantic links. In: VLDB (2006)
Yoon, M., Jin, W., Kang, U.: Fast and accurate random walk with restart on dynamic graphs with guarantees (2017). CoRR arXiv:1712.00595
Yu, W., McCann, J.A.: Efficient partial-pairs SimRank search for large networks. PVLDB 8(5), 569–580 (2015)
Yu, W., Zhang, W., Lin, X., Zhang, Q., Le, J.: A space and time efficient algorithm for simrank computation. WWW 15(3) (2012). https://doi.org/10.1007/s11280-010-0100-6
Yu, W., Lin, X., Zhang, W.: Towards efficient SimRank computation on large networks. In: ICDE, pp. 601–612 (2013a). https://doi.org/10.1109/ICDE.2013.6544859
Yu, W., Lin, X., Zhang, W., Chang, L., Pei, J.: More is simpler: effectively and efficiently assessing node-pair similarities based on hyperlinks. PVLDB 7(1), 13–24 (2013b)
Yu, W., Lin, X., Zhang, W.: Fast incremental SimRank on link-evolving graphs. In: ICDE, pp. 304–315 (2014). https://doi.org/10.1109/ICDE.2014.6816660
Yu, W., Lin, X., Zhang, W., McCann, J.A.: Dynamical simrank search on time-varying networks. VLDB J. 27(1), 79–104 (2018). https://doi.org/10.1007/s00778-017-0488-z
Zhao, P., Han, J., Sun, Y.: P-rank: a comprehensive structural similarity measure over information networks. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, ACM, pp. 553–562 (2009)
Zheng, W., Zou, L., Chen, L., Zhao, D.: Efficient simrank-based similarity join. ACM Trans. Database Syst. 42(3), 16:1–16:37 (2017). https://doi.org/10.1145/3083899
Acknowledgements
Yue Wang and Lei Chen are supported in part by the Hong Kong RGC GRF Project 16214716, National Grand Fundamental Research 973 Program of China under Grant 2014CB340303, the National Science Foundation of China (NSFC) under Grant No. 61729201, Science and Technology Planning Project of Guangdong Province, China, No. 2015B010110006, Huawei Co.Ltd Collaboration Project, YBCB2009041-45, Hong Kong ITC ITF grants ITS/391/15FX and ITS/212/16FP, Microsoft Research Asia Collaborative Research Grant, and WeChat-HKUST Joint Lab on Artificial Intelligence Technology, RDC-17182280. Yulin Che and Qiong Luo are supported in part by grants 16206414 from the Hong Kong Research Grants Council and MRA11EG01 from Microsoft.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Wang, Y., Chen, L., Che, Y. et al. Accelerating pairwise SimRank estimation over static and dynamic graphs. The VLDB Journal 28, 99–122 (2019). https://doi.org/10.1007/s00778-018-0521-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-018-0521-x