Abstract
The problem of mining correlated heavy hitters (CHH) from a two-dimensional data stream has been introduced recently, and a deterministic algorithm based on the use of the Misra–Gries algorithm has been proposed by Lahiri et al. to solve it. In this paper we present a new counter-based algorithm for tracking CHHs, formally prove its error bounds and correctness and show, through extensive experimental results, that our algorithm outperforms the Misra–Gries based algorithm with regard to accuracy and speed whilst requiring asymptotically much less space.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. ACM SIGMOD Rec 22(2):207–216
Ananthakrishna R, Das A, Gehrke J, Korn F, Muthukrishnan S, Srivastava D (2003) Efficient approximation of correlated sums on data streams. IEEE Trans Knowl Data Eng 15(3):569–572
Cafaro M, Pulimeno M (2016) Merging frequent summaries. In: Proceedings of the 17th Italian conference on theoretical computer science (ICTCS 2016), vol 1720, CEUR proceedings, pp 280–285
Cafaro M, Tempesta P (2011) Finding frequent items in parallel. Concurr Comput Pract Exp 23(15):1774–1788. doi:10.1002/cpe.1761
Cafaro M, Pulimeno M, Epicoco I, Aloisio G (2016a) Mining frequent items in the time fading model. Inf Sci 370–371:221–238. doi:10.1016/j.ins.2016.07.077
Cafaro M, Pulimeno M, Tempesta P (2016b) A parallel space saving algorithm for frequent items and the hurwitz zeta distribution. Inf Sci 329:1–19. doi:10.1016/j.ins.2015.09.003, http://www.sciencedirect.com/science/article/pii/S002002551500657X
Cafaro M, Pulimeno M, Epicoco I, Aloisio G (2017) Parallel space saving on multi- and many-core processors. Concurr Comput Pract Exp. doi:10.1002/cpe.4160
Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. In: ICALP ’02: proceedings of the 29th international colloquium on automata. Languages and programming. Springer-Verlag, pp 693–703
Chen L, Mei Q (2014) Mining frequent items in data stream using time fading model. Inf Sci 257:54–69. doi:10.1016/j.ins.2013.09.007, http://www.sciencedirect.com/science/article/pii/S0020025513006403
Cheng J, Ke Y, Ng W (2008) A survey on algorithms for mining frequent itemsets over data streams. Knowl Inf Syst 16(1):1–27. doi:10.1007/s10115-007-0092-4
Cormode G, Hadjieleftheriou M (2009) Finding the frequent items in streams of data. Commun ACM 52(10):97–105. doi:10.1145/1562764.1562789
Cormode G, Muthukrishnan S (2005a) An improved data stream summary: the count-min sketch and its applications. J Algorithms 55(1):58–75. doi:10.1016/j.jalgor.2003.12.001
Cormode G, Muthukrishnan S (2005b) What’s hot and what’s not: tracking most frequent items dynamically. ACM Trans Database Syst 30(1):249–278. doi:10.1145/1061318.1061325
Cormode G, Korn F, Tirthapura S (2008) Exponentially decayed aggregates on data streams. In: IEEE 24th international conference on data engineering, 2008, ICDE 2008, pp 1379–1381, doi:10.1109/ICDE.2008.4497562
Cormode G, Tirthapura S, Xu B (2009) Time-decayed correlated aggregates over data streams. Stat Anal Data Min 2(5–6):294–310
Das S, Antony S, Agrawal D, El Abbadi A (2009) Thread cooperation in multicore architectures for frequency counting over multiple data streams. Proc VLDB Endow 2(1):217–228. doi:10.14778/1687627.1687653
Datar M, Gionis A, Indyk P, Motwani R (2002) Maintaining stream statistics over sliding windows: (extended abstract). In: Proceedings of the thirteenth annual ACM-SIAM symposium on discrete algorithms, SODA ’02. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp 635–644
Demaine ED, López-Ortiz A, Munro JI (2002) Frequency estimation of internet packet streams with limited space. In: ESA, pp 348–360
Erra U, Frola B (2012) Frequent items mining acceleration exploiting fast parallel sorting on the gpu. Proc Comput Sci 9(0):86–95, doi:10.1016/j.procs.2012.04.010, http://www.sciencedirect.com/science/article/pii/S1877050912001317, proceedings of the international conference on computational science, ICCS 2012
Gehrke J, Korn F, Srivastava D (2001) On computing correlated aggregates over continual data streams. SIGMOD Rec 30(2):13–24. doi:10.1145/376284.375665
Govindaraju NK, Raghuvanshi N, Manocha D (2005) Fast and approximate stream mining of quantiles and frequencies using graphics processors. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data, SIGMOD ’05, ACM, pp 611–622, doi:10.1145/1066157.1066227
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. ACM SIGMOD Rec 29(2):1–12
Jin C, Qian W, Sha C, Yu JX, Zhou A (2003) Dynamically maintaining frequent items over a data stream. In: Proceedings of CIKM, ACM Press, pp 287–294
Karp RM, Shenker S, Papadimitriou CH (2003) A simple algorithm for finding frequent elements in streams and bags. ACM Trans Database Syst 28(1):51–55. doi:10.1145/762471.762473
Lahiri B, Mukherjee AP, Tirthapura S (2016) Identifying correlated heavy-hitters in a two-dimensional data stream. Data Min Knowl Discov 30(4):797–818. doi:10.1007/s10618-015-0438-6
Manerikar N, Palpanas T (2009) Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl Eng 68(4):415–430. doi:10.1016/j.datak.2008.11.001
Manjhi A, Shkapenyuk V, Dhamdhere K, Olston C (2005) Finding (recently) frequent items in distributed data streams. In: Proceedings of 21st international conference on data engineering, 2005, ICDE 2005, pp 767–778, doi:10.1109/ICDE.2005.68
Manku GS, Motwani R (2002) Approximate frequency counts over data streams. In: VLDB, pp 346–357
Metwally A, Agrawal D, Abbadi AE (2006) An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans Database Syst 31(3):1095–1133. doi:10.1145/1166074.1166084
Mirylenka K, Cormode G, Palpanas T, Srivastava D (2015) Conditional heavy hitters: detecting interesting correlations in data streams. VLDB J 24(3):395–414. doi:10.1007/s00778-015-0382-5
Misra J, Gries D (1982) Finding repeated elements. Sci Comput Progr 2(2):143–152
Roy P, Teubner J, Alonso G (2012) Efficient frequent item counting in multi-core hardware. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’12, ACM, pp 1451–1459, doi:10.1145/2339530.2339757
Tangwongsan K, Tirthapura S, Wu KL (2014) Parallel streaming frequency-based aggregates. In: Proceedings of the 26th ACM symposium on parallelism in algorithms and architectures, SPAA ’14, ACM, pp 236–245, doi:10.1145/2612669.2612695
Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390
Zhang Y (2012) Parallelizing the weighted lossy counting algorithm in high-speed network monitoring. In: Second international conference on instrumentation, measurement, computer, communication and control (IMCCC), pp 757–761, doi:10.1109/IMCCC.2012.183
Zhang Y, Singh S, Sen S, Duffield N, Lund C (2004) Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications. In: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, ACM, pp 101–114
Zhang Y, Sun Y, Zhang J, Xu J, Wu Y (2014) An efficient framework for parallel and continuous frequent item monitoring. Concurr Comput Pract Exp 26(18):2856–2879. doi:10.1002/cpe.3182
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Toon Calders.
Rights and permissions
About this article
Cite this article
Epicoco, I., Cafaro, M. & Pulimeno, M. Fast and accurate mining of correlated heavy hitters. Data Min Knowl Disc 32, 162–186 (2018). https://doi.org/10.1007/s10618-017-0526-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-017-0526-x