Abstract
This work investigates how approximate binary patterns can be evaluated objectively, using as a proxy measure the quality achieved by a text clustering algorithm whose document features are derived from such patterns. Specifically, we exploit approximate patterns within the well-known FIHC (Frequent Itemset-based Hierarchical Clustering) algorithm, which was originally designed to employ exact frequent itemsets to achieve a concise and informative representation of text data. We analyze several state-of-the-art algorithms for approximate pattern mining. In particular, we measure their ability to extract patterns that characterize the document topics well, in terms of the clustering quality obtained by FIHC. Extensive and reproducible experiments, conducted on publicly available text corpora, show that approximate itemsets provide a better representation than exact ones.
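The evaluation idea can be illustrated with a toy sketch. It is not FIHC and not the authors' experimental pipeline: the function names, the `min_overlap` tolerance, the first-match assignment rule, and the use of purity as the quality measure are all assumptions made for illustration, and the four documents are invented data that says nothing about the reported results.

```python
from collections import defaultdict


def pattern_features(docs, patterns, min_overlap=1.0):
    """Build a binary document-by-pattern matrix.

    docs and patterns are lists of term sets. A document supports a pattern
    when it contains at least a fraction `min_overlap` of the pattern's items
    (1.0 means exact containment). The tolerance is a hypothetical knob.
    """
    return [
        [1 if len(doc & pat) >= min_overlap * len(pat) else 0 for pat in patterns]
        for doc in docs
    ]


def assign_clusters(features):
    """Assign each document to the first pattern it supports, or -1 if none."""
    return [row.index(1) if 1 in row else -1 for row in features]


def purity(assignments, labels):
    """Share of documents that carry the majority label of their cluster.

    Unassigned documents (-1) are treated as one extra cluster.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for cluster, label in zip(assignments, labels):
        counts[cluster][label] += 1
    majority = sum(max(by_label.values()) for by_label in counts.values())
    return majority / len(labels)


# Invented toy corpus: two sport documents, two politics documents.
docs = [
    {"goal", "match", "team"},
    {"goal", "team", "coach"},
    {"vote", "party", "election"},
    {"vote", "election", "poll"},
]
labels = ["sport", "sport", "politics", "politics"]
patterns = [{"goal", "team", "match"}, {"vote", "election", "party"}]

exact = assign_clusters(pattern_features(docs, patterns, min_overlap=1.0))
approx = assign_clusters(pattern_features(docs, patterns, min_overlap=0.6))
print(exact, purity(exact, labels))
print(approx, purity(approx, labels))
```

In this sketch the clustering quality is the only signal used to compare the two ways of matching patterns to documents, which is the sense in which clustering acts as a proxy measure for pattern quality.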
Notes
- 1. \(set(\cdot )\) takes an indicator vector and returns the corresponding subset.
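The \(set(\cdot )\) operator from Note 1 can be sketched as follows, assuming the items form an ordered list that the indicator vector indexes position by position. The name `indicator_to_set` and the sample items are hypothetical.

```python
def indicator_to_set(indicator, universe):
    """Return the subset of `universe` selected by a 0/1 indicator vector."""
    if len(indicator) != len(universe):
        raise ValueError("indicator and universe must have the same length")
    # Keep the items whose position holds a nonzero flag.
    return {item for flag, item in zip(indicator, universe) if flag}


items = ["apple", "bread", "milk", "tea"]
subset = indicator_to_set([1, 0, 1, 0], items)
print(subset)
```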
Acknowledgments
This work was partially supported by the EC H2020 Program INFRAIA-1-2014-2015 SoBigData: Social Mining & Big Data Ecosystem (654024).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Lucchese, C., Orlando, S., Perego, R. (2016). Evaluating Top-K Approximate Patterns via Text Clustering. In: Madria, S., Hara, T. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2016. Lecture Notes in Computer Science, vol 9829. Springer, Cham. https://doi.org/10.1007/978-3-319-43946-4_8
DOI: https://doi.org/10.1007/978-3-319-43946-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43945-7
Online ISBN: 978-3-319-43946-4
eBook Packages: Computer Science, Computer Science (R0)