Abstract
Clustering is an important technique for exploratory data analysis. Most earlier clustering algorithms focused on numerical data, but real-world problems and data mining applications frequently involve categorical data. Here, we propose a new clustering algorithm for categorical data that is based on the frequency of attribute value combinations. Our algorithm finds all the combinations of attribute values in a record, each of which is a subset of that record's attribute values, and then groups the records using the frequency of these combinations. Because our algorithm considers all the subsets of attribute values in a record, records in a cluster have not only similar attribute value sets but also strongly associated attribute values. We evaluated our algorithm with real and synthetic data sets, and the experimental results demonstrate the effectiveness of our algorithm.
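The combination-counting step described in the abstract can be illustrated with a minimal sketch. It assumes each record is a mapping from attribute name to categorical value; the function names `record_combinations` and `combination_frequencies` and the toy records are hypothetical and not taken from the paper. The abstract does not say how the frequencies are turned into clusters, so that step is left out here.

```python
from collections import Counter
from itertools import combinations


def record_combinations(record):
    """Yield every non-empty subset of a record's (attribute, value) pairs.

    Assumption: a record is a dict such as {"color": "red", "shape": "round"}.
    """
    items = sorted(record.items())  # fixed order so subsets are enumerated consistently
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def combination_frequencies(records):
    """Count how many records contain each attribute-value combination."""
    counts = Counter()
    for record in records:
        counts.update(record_combinations(record))
    return counts


# Toy data, invented for illustration only.
records = [
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "red", "shape": "round", "size": "large"},
    {"color": "blue", "shape": "square", "size": "large"},
]

freq = combination_frequencies(records)
pair = frozenset({("color", "red"), ("shape", "round")})
print(freq[pair])  # number of records containing both color=red and shape=round
```

A record with m attributes produces 2^m - 1 non-empty combinations, so exhaustive enumeration like this is only practical for a modest number of attributes.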
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Do, HJ., Kim, JY. (2008). Categorical Data Clustering Using the Combinations of Attribute Values. In: Gervasi, O., Murgante, B., Laganà, A., Taniar, D., Mun, Y., Gavrilova, M.L. (eds) Computational Science and Its Applications – ICCSA 2008. ICCSA 2008. Lecture Notes in Computer Science, vol 5073. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69848-7_19
DOI: https://doi.org/10.1007/978-3-540-69848-7_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69840-1
Online ISBN: 978-3-540-69848-7
eBook Packages: Computer Science, Computer Science (R0)