Abstract
In this chapter we describe how to successfully apply the MDL principle to pattern mining. In particular, we discuss how pattern-based models can be designed and induced by means of compression, resulting in succinct and characteristic descriptions of the data.
As motivation, we argue that traditional pattern mining asks the wrong question: instead of asking for all patterns satisfying some interestingness measure, one should ask for a small, non-redundant, and interesting set of patterns—which allows us to avoid the pattern explosion. Firmly rooted in algorithmic information theory, the approach we discuss in this chapter states that the best set of patterns is that set that compresses the data best. We formalize this problem using the Minimum Description Length (MDL) principle, describe useful model classes, and briefly discuss algorithmic approaches to inducing good models from data. Last but not least, we describe how the obtained models—in addition to showing the key patterns of the data—can be used for a wide range of data mining tasks; hence showing that MDL selects useful patterns.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
P. Adriaans and P. Vitányi. Approximation of the two-part MDL code. IEEE TIT, 55(1):444–457, 2009.
H. Akaike. A new look at the statistical model identification. IEEE TAC, 19(6):716–723, 1974.
L. Akoglu, H. Tong, J. Vreeken, and C. Faloutsos. CompreX (refer MSP): Compression based anomaly detection. In CIKM. ACM, 2012.
L. Akoglu, J. Vreeken, H. Tong, N. Tatti, and C. Faloutsos. Mining connection pathways for marked nodes in large graphs. In SDM. SIAM, 2013.
R. Bathoorn, A. Koopman, and A. Siebes. Reducing the frequent pattern set. In ICDM-Workshop, pages 1–5, 2006.
C. Böhm, C. Faloutsos, J.-Y. Pan, and C. Plant. Robust information-theoretic clustering. In KDD, pages 65–75, 2006.
F. Bonchi, M. van Leeuwen, and A. Ukkonen. Characterizing uncertain data using compression. In SDM, pages 534–545, 2011.
S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. In VLDB, pages 606–617. Morgan Kaufmann, 1998.
D. Chakrabarti, S. Papadimitriou, D. S. Modha, and C. Faloutsos. Fully automatic cross-associations. In KDD, pages 79–88, 2004.
V. Chandola and V. Kumar. Summarization – compressing data into an informative representation. Knowl. Inf. Sys., 12(3):355–378, 2007.
R. Cilibrasi and P. Vitányi. Clustering by compression. IEEE TIT, 51(4):1523–1545, 2005.
T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience New York, 2006.
T. De Bie. An information theoretic framework for data mining. In KDD, pages 564–572. ACM, 2011.
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39(1):1–38, 1977.
C. Faloutsos and V. Megalooikonomou. On data mining, compression and Kolmogorov complexity. Data Min. Knowl. Disc., 15(1):3–20, 2007.
U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In UAI, pages 1022–1027, 1993.
R. A. Fisher. On the interpretation of χ2from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85(1):87–94, 1922.
F. Geerts, B. Goethals, and T. Mielikäinen. Tiling databases. In DS, pages 278–289, 2004.
A. Gionis, H. Mannila, and J. K. Seppänen. Geometric and combinatorial tiles in 0-1 data. In PKDD, pages 173–184. Springer, 2004.
P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
T. Guns, S. Nijssen, and L. D. Raedt. Itemset mining: A constraint programming perspective. Artif. Intell., 175(12-13):1951–1983, 2011.
E. Halperin and R. M. Karp. The minimum-entropy set cover problem. TCS, 348(2-3):240–250, 2005.
H. Heikinheimo, J. K. Seppänen, E. Hinkkanen, H. Mannila, and T. Mielikäinen. Finding low-entropy sets and trees from binary data. In KDD, pages 350–359, 2007.
H. Heikinheimo, J. Vreeken, A. Siebes, and H. Mannila. Lowentropy set selection. In SDM, pages 569–580, 2009.
E. Jaynes. On the rationale of maximum-entropy methods. Proc. IEEE, 70(9):939–952, 1982.
U. Kang and C. Faloutsos. Beyond caveman communities: Hubs and spokes for graph compression and mining. In ICDM, pages 300–309. IEEE, 2011.
R. M. Karp. Reducibility among combinatorial problems. In Proc. Compl. Comp. Comput., pages 85–103, New York, USA, 1972.
E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards parameter-free data mining. In KDD, pages 206–215, 2004.
E. Keogh, S. Lonardi, C. A. Ratanamahatana, L. Wei, S.-H. Lee, and J. Handley. Compression-based data mining of sequential data. Data Min. Knowl. Disc., 14(1):99–129, 2007.
P. Kontkanen and P. Myllymäki. A linear-time algorithm for computing the multinomial stochastic complexity. Inf. Process. Lett., 103(6):227–233, 2007.
P. Kontkanen, P. Myllymäki, W. Buntine, J. Rissanen, and H. Tirri. An MDL framework for clustering. Technical report, HIIT, 2004. Technical Report 2004–6.
A. Koopman and A. Siebes. Discovering relational items sets efficiently. In SDM, pages 108–119, 2008.
A. Koopman and A. Siebes. Characteristic relational patterns. In KDD, pages 437–446, 2009.
H. T. Lam, F. Mörchen, D. Fradkin, and T. Calders. Mining compressing sequential patterns. In SDM, 2012.
H. T. Lam, T. Calders, J. Yang, F. Moerchen, and D. Fradkin.: Mining compressing sequential patterns in streams. In IDEA, pages 54–62, 2013.
M. van Leeuwen and A. Siebes. StreamKrimp: Detecting change in data streams. In ECML PKDD, pages 672–687, 2008.
M. van Leeuwen, J. Vreeken, and A. Siebes. Compression picks the item sets that matter. In PKDD, pages 585–592, 2006.
M. van Leeuwen, F. Bonchi, B. Sigurbjörnsson, and A. Siebes. Compressing tags to find interesting media groups. In CIKM, pages 1147–1156, 2009.
M. van Leeuwen, J. Vreeken, and A. Siebes. Identifying the components. Data Min. Knowl. Disc., 19(2):173–292, 2009.
M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer, 1993.
M. Li, X. Chen, X. Li, B. Ma, and P. Vitanyi. The similarity metric. IEEE TIT, 50(12): 3250–3264, 2004.
C. Lucchese, S. Orlando, and R. Perego. Mining top-k patterns from binary datasets in presence of noise. In SDM, pages 165–176, 2010.
M. Mampaey and J. Vreeken. Summarising categorical data by clustering attributes. Data Min. Knowl. Disc., 26(1):130–173, 2013.
M. Mampaey, J. Vreeken, and N. Tatti. Summarizing data succinctly with the most informative itemsets. ACM TKDD, 6:1–44, 2012.
P. Miettinen and J. Vreeken. Model order selection for Boolean matrix factorization. In KDD, pages 51–59. ACM, 2011.
P. Miettinen and J. Vreeken. mdl4bmf: Minimum description length for Boolean matrix factorization. ACM TKDD. In Press.
S. Papadimitriou, J. Sun, C. Faloutsos, and P. S. Yu. Hierarchical, parameter-free community discovery. In ECML PKDD, pages 170–187, 2008.
B. Pfahringer. Compression-based feature subset selection. In Proc. IJCAI'95 Workshop on Data Engineering for Inductive Learning, pages 109–119, 1995.
B. A. Prakash, J. Vreeken, and C. Faloutsos. Spotting culprits in epidemics: How many and which ones? In ICDM. IEEE, 2012.
J. Quinlan. C4.5: Programs for Machine Learning. Morgan-Kaufmann, Los Altos, California, 1993.
L. D. Raedt. Declarative modeling for machine learning and data mining. In ECML PKDD, pages 2–3, 2012.
J. Rissanen. Modeling by shortest data description. Automatica, 14(1):465–471, 1978.
G. Schwarz. Estimating the dimension of a model. Annals Stat., 6(2):461–464, 1978.
H. Shao, B. Tong, and E. Suzuki. Extended MDL principle for feature-based inductive transfer learning. Knowl. Inf. Sys., 35(2):365–389, 2013.
A. Siebes. Queries for data analysis. In IDA, pages 7–22, 2012.
A. Siebes and R. Kersten. A structure function for transaction data. In SDM, pages 558–569. SIAM, 2011.
A. Siebes, J. Vreeken, and M. van Leeuwen. Item sets that compress. In SDM, pages 393–404. SIAM, 2006.
K. Smets and J. Vreeken. The odd one out: Identifying and characterising anomalies. In SDM, pages 804–815. SIAM, 2011.
K. Smets and J. Vreeken. Slim: Directly mining descriptive patterns. In SDM, pages 236–247. SIAM, 2012.
J. Sun, C. Faloutsos, S. Papadimitriou, and P. S. Yu. Graphscope: parameter-free mining of large time-evolving graphs. In KDD, pages 687–696, 2007.
N. Tatti. Computational complexity of queries based on itemsets. Inf. Process. Lett., 98(5): 183–187, 2006.
N. Tatti and J. Vreeken. Finding good itemsets by packing data. In ICDM, pages 588–597, 2008.
N. Tatti and J. Vreeken. Discovering descriptive tile trees by fast mining of optimal geometric subtiles. In ECML PKDD. Springer, 2012.
N. Tatti and J. Vreeken. The long and the short of it: Summarizing event sequences with serial episodes. In KDD. ACM, 2012.
J. Vreeken and A. Siebes. Filling in the blanks: Krimp minimisation for missing data. In ICDM, pages 1067–1072. IEEE, 2008.
J. Vreeken, M. van Leeuwen, and A. Siebes. Characterising the difference. In KDD, pages 765–774, 2007.
J. Vreeken, M. van Leeuwen, and A. Siebes. Preserving privacy through data generation. In ICDM, pages 685–690. IEEE, 2007.
J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: Mining itemsets that compress. Data Min. Knowl. Disc., 23(1):169–214, 2011.
C. Wallace. Statistical and inductive inference by minimum message length. Springer-Verlag, 2005.
C. Wang and S. Parthasarathy. Summarizing itemset patterns using probabilistic models. In KDD, pages 730–735, 2006.
H. Warner, A. Toronto, L. Veasey, and R. Stephenson. A mathematical model for medical diagnosis, application to congenital heart disease. J. Am. Med. Assoc., 177:177–184, 1961.
Acknowledgments
Matthijs van Leeuwen is supported by a Post-doctoral Fellowship of the Research Foundation Flanders (fwo). Jilles Vreeken is supported by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
van Leeuwen, M., Vreeken, J. (2014). Mining and Using Sets of Patterns through Compression. In: Aggarwal, C., Han, J. (eds) Frequent Pattern Mining. Springer, Cham. https://doi.org/10.1007/978-3-319-07821-2_8
Download citation
DOI: https://doi.org/10.1007/978-3-319-07821-2_8
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07820-5
Online ISBN: 978-3-319-07821-2
eBook Packages: Computer ScienceComputer Science (R0)