Abstract
Finding patterns or clusters in streaming data is an important task in modern data mining. The most critical issue is the huge volume of data relative to the limited storage space. In previous work, the essential information in such data was represented by subsets of the data, grid summarization, or spherical functions. These representations are not compact enough to capture the topology of the arriving data points and may lose information needed to produce an accurate clustering result. In this work, we propose a new versatile hyper-elliptic clustering algorithm, called VHEC, which clusters streaming data in a one-pass-thrown-away fashion while preserving the original topology of the data space. To cope with the constraints of one-pass-thrown-away clustering, a new set of elliptic micro-cluster parameters is introduced: boundary, density, direction, intra-distance, and inter-distance. Furthermore, a feasible technique for merging two micro-clusters is developed. The proposed parameters and the one-pass-thrown-away clustering algorithm were tested on several benchmark data sets and structural clustering data sets, and their performance was compared with that of existing algorithms. Across data of different sizes, shapes, and densities, VHEC outperformed the compared data stream clustering algorithms on both synthetic and real data sets. Moreover, VHEC is significantly more robust to streaming speed and to the order of incoming data than the compared algorithms in terms of purity, Rand index, and adjusted Rand index.
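The one-pass-thrown-away idea can be illustrated with a small sketch. It is not the VHEC algorithm: the abstract does not define how boundary, density, direction, intra-distance, or inter-distance are computed, so the class `EllipticMicroCluster` and the function `merge` below are hypothetical names. They use a standard running mean and scatter-matrix update as a stand-in, since a covariance matrix is one common way to describe the orientation and extent of an ellipse. Each point updates the summary and can then be discarded, and two summaries can be merged without revisiting raw data.

```python
class EllipticMicroCluster:
    """Hypothetical summary: point count, mean, and scatter matrix."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        # Running sum of outer products of deviations from the mean.
        self.scatter = [[0.0] * dim for _ in range(dim)]

    def add(self, x):
        """Absorb one point; the caller may discard it afterwards."""
        self.n += 1
        delta = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + di / self.n for mi, di in zip(self.mean, delta)]
        resid = [xi - mi for xi, mi in zip(x, self.mean)]
        for i, di in enumerate(delta):
            for j, rj in enumerate(resid):
                self.scatter[i][j] += di * rj

    def covariance(self):
        """Sample covariance; requires at least two absorbed points."""
        return [[s / (self.n - 1) for s in row] for row in self.scatter]


def merge(a, b):
    """Combine two non-empty summaries without access to the raw points."""
    dim = len(a.mean)
    out = EllipticMicroCluster(dim)
    out.n = a.n + b.n
    delta = [mb - ma for ma, mb in zip(a.mean, b.mean)]
    out.mean = [ma + d * b.n / out.n for ma, d in zip(a.mean, delta)]
    w = a.n * b.n / out.n  # weight of the between-cluster term
    out.scatter = [
        [a.scatter[i][j] + b.scatter[i][j] + w * delta[i] * delta[j]
         for j in range(dim)]
        for i in range(dim)
    ]
    return out


# Usage: two micro-clusters built from separate parts of a stream.
left, right = EllipticMicroCluster(2), EllipticMicroCluster(2)
for p in [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0)]:
    left.add(p)
for p in [(6.0, 1.0), (7.0, 3.0), (8.0, 8.0)]:
    right.add(p)
merged = merge(left, right)
```

Under these assumptions, the merged summary matches the one obtained by feeding all six points into a single micro-cluster, which is the property a merge step needs when the raw points are no longer available.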
Cite this article
Wattanakitrungroj, N., Maneeroj, S. & Lursinsap, C. Versatile Hyper-Elliptic Clustering Approach for Streaming Data Based on One-Pass-Thrown-Away Learning. J Classif 34, 108–147 (2017). https://doi.org/10.1007/s00357-017-9222-1