Abstract
The discovery of patterns plays an important role in data mining. A pattern can be any type of regularity displayed in that data, such as, e.g. which items are typically sold together, which genes are mostly active for patients of a certain disease, etc, etc. Generally speaking, finding a pattern
... read more
is easy. Discovering interesting patterns, however, is complicated. Existing techniques for mining patterns from data lead to enormous amounts of results. Often more patterns are discovered than there is original data. As such, the problem is that it is impossible to consider the potentially interesting patterns. This thesis is about finding interesting patterns, and, more boldly, about making pattern mining useful. It is about how to discover few, but highly interesting patterns. And, prominently, it is about how to put these patterns to good use, solving a number of data mining problems. This thesis proposes to use the Minimum Description Length principle to select small groups of patterns that together describe the data well. To this end, it introduces KRIMP: a heuristic for finding the itemsets that together optimally compress the database. Through extensive evaluation, we show the high quality of these code tables. Experiments regarding the choices in the algorithm identified the best settings for the algorithm, making it parameter-free for all practical purposes. We successfully apply these patterns in a number of different data mining problems. We show how the difference between transaction databases can be measured and characterised through the use of code tables. Next, we specify the problem of identifying the parts of a transaction database drawn from different distributions in terms of MDL: the best decomposition minimises the total compressed size. We also discuss using the pattern sets as generative models. The resulting generated data preserves privacy but contains the patterns present in the original data, including correct margins. Further, we consider the problem of high quality imputation of incomplete records. Experiments show these approaches to be highly successful. By using entropy instead of frequency, the LESS algorithm we introduce is particularly suited for mining dense data. Further, by regarding data 0/1 symmetric, all major interactions in the data are captured, not just co-occurences. Experiments show that this approach is able to succinctly describe data in only tens of patterns, allowing for very easy consideration by experts. The PACK algorithm we introduce also considers data symmetrically and selects interesting itemsets. It achieves very high compression ratios. Further, it is able to mine its models directly on the data, without the need of first mining large collections of candidate patterns. The thesis concludes that to the end of making pattern mining useful simply the best set of patterns should be mined, as opposed to all patterns that satisfy certain criteria. The MDL principle is particularly well-suited for mining these useful patterns. By using this principle to select the set of patterns that describe the data best, we are returned very few, but high-quality, patterns. These characteristic patterns are easily interpreted and used in further analysis.
show less