Abstract
The Exponential-family Dirichlet Compound Multinomial (EDCM), an approximation to the Dirichlet Compound Multinomial (DCM) proposed by Elkan (2006), is an efficient statistical model for high-dimensional and sparse count data. EDCM models correctly account for the burstiness phenomenon while being many times faster to fit than DCM. This work proposes the use of the Minimum Message Length (MML) criterion for determining the number of components that best describes the data with a finite EDCM mixture model. Parameter estimation is based on the previously proposed Deterministic Annealing Expectation-Maximization (DAEM) approach. The proposed unsupervised algorithm is validated on several real applications: text document modeling, topic novelty detection, and hierarchical image clustering. A comparison with results obtained for other information-theoretic selection criteria is provided.
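For concreteness, the EDCM distribution of Elkan (2006) that underlies the mixture model can be evaluated as in the minimal sketch below (NumPy/SciPy are assumed; the function name is illustrative, and this is not the authors' implementation):

```python
import numpy as np
from scipy.special import gammaln

def edcm_log_pdf(x, beta):
    """Log-density of the EDCM distribution (Elkan, 2006):
    EDCM(x | beta) = n! * Gamma(s)/Gamma(s+n) * prod_{w: x_w >= 1} beta_w / x_w,
    with n = sum_w x_w and s = sum_w beta_w."""
    x = np.asarray(x, dtype=float)
    beta = np.asarray(beta, dtype=float)
    n, s = x.sum(), beta.sum()
    nz = x >= 1  # only words appearing at least once contribute
    return (gammaln(n + 1.0)               # log n!
            + gammaln(s) - gammaln(s + n)  # log Gamma(s) - log Gamma(s+n)
            + np.sum(np.log(beta[nz]) - np.log(x[nz])))

# Example: a short "document" over a 4-word vocabulary
print(edcm_log_pdf([3, 0, 1, 1], [0.05, 0.02, 0.04, 0.01]))
```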
References
Aggarwal CC, Zhai C (2012) An introduction to text mining. In: Mining text data. Springer, pp 1–10
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723
Amayri O, Bouguila N (2013) Online news topic detection and tracking via localized feature selection. In: Proceedings of the international joint conference on neural networks (IJCNN). IEEE, pp 1–8
Banerjee A, Basu S (2007) Topic models over text streams: a study of batch and online unsupervised learning. In: Proceedings of the 2007 SIAM international conference on data mining. SIAM, pp 431–436
Banerjee A, Merugu S, Dhillon IS, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn Res 6:1705–1749
Baxter RA, Oliver JJ (2000) Finding overlapping components with MML. Stat Comput 10(1):5–16
Bhatia SK, Deogun JS (1998) Conceptual clustering in information retrieval. IEEE Trans Syst Man Cybern B Cybern 28(3):427–436
Bishop CM, Tipping ME (1998) A hierarchical latent variable model for data visualization. IEEE Trans Pattern Anal Mach Intell 20(3):281–293
Blashfield RK (1976) Mixture model tests of cluster analysis: accuracy of four agglomerative hierarchical methods. Psychol Bull 83(3):377
Bouguila N (2011) Count data modeling and classification using finite mixtures of distributions. IEEE Trans Neural Netw 22(2):186–198
Bouguila N, Ziou D (2006) Online clustering via finite mixtures of Dirichlet and minimum message length. Eng Appl Artif Intell 19(4):371–379
Bouguila N, Ziou D (2006) Unsupervised selection of a finite Dirichlet mixture model: an MML-based approach. IEEE Trans Knowl Data Eng 18(8):993–1009
Bouguila N, Ziou D (2007) High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length. IEEE Trans Pattern Anal Mach Intell 29(10):1716–1731
Bouguila N, Ziou D (2007) Unsupervised learning of a finite discrete mixture: applications to texture modeling and image databases summarization. J Vis Commun Image Represent 18(4):295–309
Boutemedjet S, Ziou D (2012) Predictive approach for user long-term needs in content-based image suggestion. IEEE Trans Neural Netw Learn Syst 23(8):1242–1253
Brown LD (1986) Fundamentals of statistical exponential families: with applications in statistical decision theory. In: Lecture notes-monograph series, vol 9
Cha SH (2007) Comprehensive survey on distance/similarity measures between probability density functions. Int J Math Models Methods Appl Sci 1(4):300–307
Church KW, Gale WA (1995) Poisson mixtures. Nat Lang Eng 1(2):163–190
Conway JH, Sloane NJA (2013) Sphere packings, lattices and groups, vol 290. Springer Science & Business Media, Berlin
Cover TM, Thomas JA (2012) Elements of information theory. Wiley, Hoboken
Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Proceedings of the workshop on statistical learning in computer vision (ECCV), vol 1. Prague, pp 1–2
DasGupta A (2011) The exponential family and statistical applications. In: Probability for statistics and machine learning. Springer, pp 583–612
Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1-2):143–175
Duda RO, Hart PE, Stork DG (2012) Pattern classification. Wiley, Hoboken
Edelbrock C, McLaughlin B (1980) Hierarchical cluster analysis using intraclass correlations: a mixture model study. Multivar Behav Res 15(3):299–318
Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proceedings of the 20th international conference on machine learning (ICML), pp 147–153
Elkan C (2006) Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution. In: Proceedings of the 23rd international conference on machine learning (ICML). ACM, pp 289–296
Everitt BS (1996) An introduction to finite mixture distributions
Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), vol 2. IEEE, pp 524–531
Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396
Frigui H, Krishnapuram R (1999) A robust competitive clustering algorithm with applications in computer vision. IEEE Trans Pattern Anal Mach Intell 21(5):450–465
Goldwater S, Griffiths T, Johnson M (2006) Interpolating between types and tokens by estimating power-law generators. Adv Neural Inf Proces Syst 18:459–467
Graybill FA (1983) Matrices with applications in statistics. Wadsworth
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: ACM SIGMOD Record, vol 27. ACM, pp 73–84
Handcock MS, Raftery AE, Tantrum JM (2007) Model-based clustering for social networks. J R Stat Soc A Stat Soc 170(2):301–354
Hatzivassiloglou V, Gravano L, Maganti A (2000) An investigation of linguistic features and clustering algorithms for topical document clustering. In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 224–231
Heller KA, Ghahramani Z (2005) Bayesian hierarchical clustering. In: Proceedings of the 22nd international conference on machine learning. ACM, pp 297–304
Hershey JR, Olsen PA (2007) Approximating the Kullback-Leibler divergence between Gaussian mixture models. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP), vol 4. IEEE, pp IV–317
Iwayama M, Tokunaga T (1995) Cluster-based text categorization: a comparison of category search strategies. In: Proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 273–280
Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31(8):651–666
Jain AK, Flynn PJ (1996) Image segmentation using clustering. IEEE Press, Piscataway
Jefferys WH, Berger JO (1992) Ockham’s razor and Bayesian analysis. Am Sci 80(1):64–72
Jensen JH, Ellis DP, Christensen MG, Jensen SH (2007) Evaluation of distance measures between Gaussian mixture models of MFCCs. In: ISMIR, pp 107–108
Katz SM (1996) Distribution of content words and phrases in text and language modelling. Nat Lang Eng 2(1):15–59
Kingman JFC (1993) Poisson processes. Clarendon Press, Oxford
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Tech. rep., University of Toronto
Kullback S (1997) Information theory and statistics. Courier Corporation
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
Lin YS, Jiang JY, Lee SJ (2014) A similarity measure for text classification and clustering. IEEE Trans Knowl Data Eng 26(7):1575–1590
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
Madsen RE, Kauchak D, Elkan C (2005) Modeling word burstiness using the Dirichlet distribution. In: Proceedings of the 22nd international conference on machine learning (ICML). ACM, pp 545–552
Margaritis D, Thrun S (2001) A Bayesian multiresolution independence test for continuous variables. In: Proceedings of the seventeenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp 346–353
McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI-98 workshop on learning for text categorization, vol 752, pp 41–48
McCallum AK (1996) Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow
McLachlan G, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley, Hoboken
McLachlan G, Peel D (2004) Finite mixture models. Wiley, Hoboken
Minka TP (2003) Estimating a Dirichlet distribution. Technical report
Mosimann JE (1962) On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions. Biometrika 49(1/2):65–82
Olech ŁP, Paradowski M (2016) Hierarchical Gaussian mixture model with objects attached to terminal and non-terminal dendrogram nodes. In: Proceedings of the 9th international conference on computer recognition systems (CORES). Springer, pp 191–201
Pimentel MA, Clifton DA, Clifton L, Tarassenko L (2014) A review of novelty detection. Signal Process 99:215–249
Rennie JDM, Shih L, Teevan J, Karger DR (2003) Tackling the poor assumptions of naive Bayes text classifiers. In: Proceedings of the twentieth international conference on machine learning (ICML), vol 3, pp 616–623
Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471
Sahami M (1998) Using machine learning to improve information access. Ph.D. thesis, Department of Computer Science, Stanford University
Sandler M (2007) Hierarchical mixture models: a probabilistic analysis. In: Proceedings of the 13th international conference on knowledge discovery and data mining (SIGKDD). ACM, pp 580–589
Singh S, Markou M (2004) An approach to novelty detection applied to the classification of image regions. IEEE Trans Knowl Data Eng 16(4):396–407
Sneath PH, Sokal RR (1973) Numerical taxonomy: the principles and practice of numerical classification. WH Freeman and Company, San Francisco
Timande N, Chandak M, Kamble M (2014) Document clustering with feature selection using Dirichlet process mixture model and Dirichlet multinomial allocation model. International Journal of Engineering Research and Applications 10–16. https://pdfs.semanticscholar.org/6a0f/2d593688780a73e12e171ee157bb94b037cc.pdf
Ueda N, Nakano R (1998) Deterministic annealing EM algorithm. Neural Netw 11(2):271–282
Wallace CS (1990) Classification by minimum-message-length inference. In: Proceedings of the international conference on computing and information. Springer, pp 72–81
Wallace CS (2005) Statistical and inductive inference by minimum message length. Springer Science & Business Media
Wallace CS, Boulton DM (1968) An information measure for classification. Comput J 11(2):185–194
Wallace CS, Dowe DL (2000) MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Stat Comput 10(1):73–83
Wallace CS, Freeman PR (1987) Estimation and inference by compact coding. J R Stat Soc Ser B Methodol 49(3):240–265
Yang MH, Ahuja N (1998) Gaussian mixture model for human skin color and its applications in image and video databases. In: Storage and retrieval for image and video databases VII, vol 3656. International Society for Optics and Photonics, pp 458–467
Yao B, Fei-Fei L (2010) Grouplet: a structured image representation for recognizing human and object interactions. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 9–16
Yao JF (2000) On recursive estimation in incomplete data models. Stat A J Theor Appl Stat 34(1):27–51
Zamzami N, Bouguila N (2018) MML-based approach for determining the number of topics in EDCM mixture models. In: Proceedings of the 31st Canadian conference on artificial intelligence (CAI). Springer
Zhang J, Ghahramani Z, Yang Y (2005) A probabilistic model for online document clustering with application to novelty detection. In: Advances in neural information processing systems, pp 1617–1624
Zhao Y, Karypis G, Fayyad U (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Disc 10(2):141–168
Appendices
Appendix A: Proof of (16)
We have the negative log-likelihood of the EDCM component j, given the η<sub>j</sub> documents \(x_{l}, \dots, x_{l+\eta_{j}-1}\) assigned to it:

\[-\log p(\mathcal{X}_{j} \mid \varphi_{j}) = -\sum\limits_{d=l}^{l+\eta_{j}-1}\left[\log n_{d}! + \log{\Gamma}(s_{j}) - \log{\Gamma}(s_{j}+n_{d}) + \sum\limits_{w:\, x_{dw}\geq 1}\log\frac{\varphi_{jw}}{x_{dw}}\right]\]

where \(s_{j}=\sum_{w=1}^{W}\varphi_{jw}\) and \(n_{d}=\sum_{w=1}^{W}x_{dw}\). Then, the first-order derivative of the negative log-likelihood, also called the Fisher score function, is:

\[-\frac{\partial \log p(\mathcal{X}_{j} \mid \varphi_{j})}{\partial\varphi_{jw}} = \sum\limits_{d=l}^{l+\eta_{j}-1}{\Psi}(s_{j}+n_{d}) - \eta_{j}{\Psi}(s_{j}) - \sum\limits_{d=l}^{l+\eta_{j}-1}\frac{I(x_{dw}\geq 1)}{\varphi_{jw}}\]

where Ψ is the digamma function. Then, the diagonal elements of the Fisher information matrix \(F(\varphi_{j})\) are:

\[-\frac{\partial^{2} \log p(\mathcal{X}_{j} \mid \varphi_{j})}{\partial\varphi_{jw}^{2}} = \sum\limits_{d=l}^{l+\eta_{j}-1}{\Psi}^{\prime}(s_{j}+n_{d}) - \eta_{j}{\Psi}^{\prime}(s_{j}) + \sum\limits_{d=l}^{l+\eta_{j}-1}\frac{I(x_{dw}\geq 1)}{\varphi_{jw}^{2}}\]

and its off-diagonal elements (\(w \neq w^{\prime}\)) are:

\[-\frac{\partial^{2} \log p(\mathcal{X}_{j} \mid \varphi_{j})}{\partial\varphi_{jw}\,\partial\varphi_{jw^{\prime}}} = \sum\limits_{d=l}^{l+\eta_{j}-1}{\Psi}^{\prime}(s_{j}+n_{d}) - \eta_{j}{\Psi}^{\prime}(s_{j})\]

where Ψ′ is the trigamma function. We remark that \(F(\varphi_{j})\) can be written as:

\[F(\varphi_{j}) = D + \gamma A A^{T}\]

where \(D=\operatorname{diag}\left[\sum\limits_{d=l}^{l+\eta_{j}-1}\frac{I(x_{d1}\geq 1)}{\varphi_{j1}^{2}}, \dots, \sum\limits_{d=l}^{l+\eta_{j}-1}\frac{I(x_{dW}\geq 1)}{\varphi_{jW}^{2}}\right]\), \(\gamma=-\eta_{j}{\Psi}^{\prime}(s_{j})+\sum\limits_{d=l}^{l+\eta_{j}-1}{\Psi}^{\prime}(s_{j}+n_{d})\), and \(A=\mathbf{1}_{W}\) is the W-dimensional vector of ones. Then, according to Theorem 8.4.3 of Graybill [33], the determinant of the Fisher information matrix \(F(\varphi_{j})\) is:

\[|F(\varphi_{j})| = \left(1+\gamma\sum\limits_{w=1}^{W}\frac{1}{D_{ww}}\right)\prod\limits_{w=1}^{W} D_{ww}\]
By substituting (35) and (15) into (14) and then taking the log of the result, we obtain (16).
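The rank-one determinant identity above can be verified numerically; the following standalone snippet (an added illustration, not part of the original proof) checks it for an arbitrary positive diagonal D and γ:

```python
import numpy as np

# Numerical check of the rank-one determinant identity (Graybill, Thm 8.4.3):
# for diagonal D and the all-ones vector A,
#   det(D + gamma * A A^T) = (1 + gamma * sum_w 1/D_ww) * prod_w D_ww.
rng = np.random.default_rng(0)
W = 6
d = rng.uniform(0.5, 2.0, size=W)   # stands in for the diagonal entries of F
gamma = 0.7                         # stands in for the trigamma term
A = np.ones((W, 1))

lhs = np.linalg.det(np.diag(d) + gamma * (A @ A.T))
rhs = (1.0 + gamma * np.sum(1.0 / d)) * np.prod(d)
assert np.isclose(lhs, rhs)
```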
Appendix B: Proof of (29)
The KL divergence between two distributions that belong to the exponential family is defined as [16]:

\[KL\big(P(X \mid \theta_{1})\,\|\,P(X \mid \theta_{2})\big) = E_{\theta_{1}}\left[\log\frac{P(X \mid \theta_{1})}{P(X \mid \theta_{2})}\right]\]

where \(E_{\theta}\) is the expectation with respect to \(P(X \mid \theta)\). Writing both densities in the canonical form \(P(X \mid \theta)=\exp\left(\theta^{T}T(X)-G(\theta)+h(X)\right)\), with sufficient statistic T and cumulant function G, we have:

\[KL\big(P(X \mid \theta_{1})\,\|\,P(X \mid \theta_{2})\big) = (\theta_{1}-\theta_{2})^{T}E_{\theta_{1}}[T(X)]+G(\theta_{2})-G(\theta_{1})\]

Moreover, we have the following [16]:

\[E_{\theta}[T(X)]=\nabla_{\theta}G(\theta)\]

Thus, according to (38) and (40), we have:

\[KL\big(P(X \mid \theta_{1})\,\|\,P(X \mid \theta_{2})\big) = G(\theta_{2})-G(\theta_{1})-(\theta_{2}-\theta_{1})^{T}\nabla_{\theta}G(\theta_{1})\]

which is (29).
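As an added illustration of this closed form (using the Poisson family rather than the EDCM), the Bregman-type expression reproduces the well-known Poisson KL divergence:

```python
import numpy as np

# Check of KL = G(theta2) - G(theta1) - (theta2 - theta1) * G'(theta1) on the
# Poisson family: natural parameter theta = log(lambda), cumulant
# G(theta) = exp(theta), hence G'(theta) = exp(theta) = lambda.
lam1, lam2 = 3.0, 5.0
theta1, theta2 = np.log(lam1), np.log(lam2)
bregman = np.exp(theta2) - np.exp(theta1) - (theta2 - theta1) * np.exp(theta1)
direct = lam1 * np.log(lam1 / lam2) + lam2 - lam1  # textbook Poisson KL
assert np.isclose(bregman, direct)
```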