Model selection and application to high-dimensional count data clustering via finite EDCM mixture models

Published in: Applied Intelligence (2019)

Abstract

EDCM, the exponential-family approximation to the Dirichlet Compound Multinomial (DCM) proposed by Elkan (2006), is an efficient statistical model for high-dimensional and sparse count data. EDCM models correctly account for the burstiness phenomenon while being many times faster to fit than DCM. This work proposes the use of the Minimum Message Length (MML) criterion for determining the number of components that best describes the data with a finite EDCM mixture model. Parameter estimation is based on the previously proposed Deterministic Annealing Expectation-Maximization (DAEM) approach. The proposed unsupervised algorithm is validated on several real applications: text document modeling, topic novelty detection, and hierarchical image clustering. A comparison with results obtained using other information-theoretic selection criteria is provided.

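As a concrete illustration of the model named in the abstract, the following is a minimal Python sketch (NumPy/SciPy) of the EDCM log-density introduced by Elkan (2006); the function name and implementation details are illustrative and are not taken from the authors' code.

```python
import numpy as np
from scipy.special import gammaln

def edcm_log_pdf(x, phi):
    """Log-density of the EDCM distribution (Elkan, 2006) for one count vector.

    x   : 1-D array of word counts for a document (length W).
    phi : 1-D array of positive EDCM parameters (length W).
    The log n! term depends only on the data, so it cancels in mixture
    responsibilities; it is kept here for completeness.
    """
    x = np.asarray(x, dtype=float)
    phi = np.asarray(phi, dtype=float)
    n = x.sum()                  # document length
    s = phi.sum()                # sum of the EDCM parameters
    active = x >= 1              # words appearing at least once (few, for sparse data)
    return (gammaln(n + 1.0)
            + gammaln(s) - gammaln(s + n)
            + np.sum(np.log(phi[active]) - np.log(x[active])))
```

In a finite mixture, component log-densities of this form are combined with the mixing weights in the usual EM/DAEM responsibility updates.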

Notes

  1. http://kdd.ics.uci.edu/databases/reuters21578

  2. https://cs.nyu.edu/~roweis/data.html

  3. http://www.cs.cmu.edu/~webkb

  4. Both data sets are available at: http://www.cad.zju.edu.cn/home/dengcai/Data/TextData

References

  1. Aggarwal CC, Zhai C (2012) An introduction to text mining. In: Mining text data. Springer, pp 1–10

  2. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723

  3. Amayri O, Bouguila N (2013) Online news topic detection and tracking via localized feature selection. In: Proceedings of the international joint conference on neural networks (IJCNN). IEEE, pp 1–8

  4. Banerjee A, Basu S (2007) Topic models over text streams: a study of batch and online unsupervised learning. In: Proceedings of the 2007 SIAM international conference on data mining. SIAM, pp 431–436

  5. Banerjee A, Merugu S, Dhillon IS, Ghosh J (2005) Clustering with bregman divergences. J Mach Learn Res 6:1705–1749

  6. Baxter RA, Oliver JJ (2000) Finding overlapping components with mml. Stat Comput 10(1):5–16

  7. Bhatia SK, Deogun JS (1998) Conceptual clustering in information retrieval. IEEE Trans Syst Man Cybern B Cybern 28(3):427–436

  8. Bishop CM, Tipping ME (1998) A hierarchical latent variable model for data visualization. IEEE Trans Pattern Anal Mach Intell 20(3):281–293

  9. Blashfield RK (1976) Mixture model tests of cluster analysis: accuracy of four agglomerative hierarchical methods. Psychol Bull 83(3):377

  10. Bouguila N (2011) Count data modeling and classification using finite mixtures of distributions. IEEE Trans Neural Netw 22(2):186–198

  11. Bouguila N, Ziou D (2006) Online clustering via finite mixtures of dirichlet and minimum message length. Eng Appl Artif Intell 19(4):371–379

  12. Bouguila N, Ziou D (2006) Unsupervised selection of a finite dirichlet mixture model: an mml-based approach. IEEE Trans Knowl Data Eng 18(8):993–1009

  13. Bouguila N, Ziou D (2007) High-dimensional unsupervised selection and estimation of a finite generalized dirichlet mixture model based on minimum message length. IEEE Trans Pattern Anal Mach Intell 29(10):1716–1731

  14. Bouguila N, Ziou D (2007) Unsupervised learning of a finite discrete mixture: applications to texture modeling and image databases summarization. J Vis Commun Image Represent 18(4):295–309

  15. Boutemedjet S, Ziou D (2012) Predictive approach for user long-term needs in content-based image suggestion. IEEE Trans Neural Netw Learn Syst 23(8):1242–1253

  16. Brown LD (1986) Fundamentals of statistical exponential families: with applications in statistical decision theory. In: Lecture notes-monograph series, vol 9

  17. Cha SH (2007) Comprehensive survey on distance/similarity measures between probability density functions. City 1(2):1

  18. Church KW, Gale WA (1995) Poisson mixtures. Nat Lang Eng 1(2):163–190

  19. Conway JH, Sloane NJA (2013) Sphere packings, lattices and groups, vol 290. Springer Science & Business Media, Berlin

  20. Cover TM, Thomas JA (2012) Elements of information theory. Wiley, Hoboken

  21. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Proceedings of the workshop on statistical learning in computer vision (ECCV), vol 1. Prague, pp 1–2

  22. DasGupta A (2011) The exponential family and statistical applications. In: Probability for statistics and machine learning. Springer, pp 583–612

  23. Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1-2):143–175

  24. Duda RO, Hart PE, Stork DG (2012) Pattern classification. Wiley, Hoboken

  25. Edelbrock C, McLaughlin B (1980) Hierarchical cluster analysis using intraclass correlations: a mixture model study. Multivar Behav Res 15(3):299–318

  26. Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proceedings of the 20th international conference on machine learning (ICML), pp 147–153

  27. Elkan C (2006) Clustering documents with an exponential-family approximation of the dirichlet compound multinomial distribution. In: Proceedings of the 23rd international conference on machine learning (ICML). ACM, pp 289–296

  28. Everitt BS (1996) An introduction to finite mixture distributions

  29. Fei-Fei L, Perona P (2005) A bayesian hierarchical model for learning natural scene categories. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), vol 2. IEEE, pp 524–531

  30. Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396

  31. Frigui H, Krishnapuram R (1999) A robust competitive clustering algorithm with applications in computer vision. IEEE Trans Pattern Anal Mach Intell 21(5):450–465

  32. Goldwater S, Griffiths T, Johnson M (2006) Interpolating between types and tokens by estimating power-law generators. Adv Neural Inf Proces Syst 18:459–467

  33. Graybill FA (1983) Matrices with applications in statistics. Wadsworth

  34. Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of the ACM sigmod record, vol 27. ACM, pp 73–84

  35. Handcock MS, Raftery AE, Tantrum JM (2007) Model-based clustering for social networks. J R Stat Soc A Stat Soc 170(2):301–354

  36. Hatzivassiloglou V, Gravano L, Maganti A (2000) An investigation of linguistic features and clustering algorithms for topical document clustering. In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 224–231

  37. Heller KA, Ghahramani Z (2005) Bayesian hierarchical clustering. In: Proceedings of the 22nd international conference on machine learning. ACM, pp 297–304

  38. Hershey JR, Olsen PA (2007) Approximating the kullback leibler divergence between gaussian mixture models. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP), vol 4. IEEE, pp IV–317

  39. Iwayama M, Tokunaga T (1995) Cluster-based text categorization: a comparison of category search strategies. In: Proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 273–280

  40. Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31(8):651–666

  41. Jain AK, Flynn PJ (1996) Image segmentation using clustering. IEEE Press, Piscataway

  42. Jefferys WH, Berger JO (1992) Ockham’s razor and bayesian analysis. Am Sci 80(1):64–72

  43. Jensen JH, Ellis DP, Christensen MG, Jensen SH (2007) Evaluation of distance measures between gaussian mixture models of mfccs. In: ISMIR, pp 107–108

  44. Katz SM (1996) Distribution of content words and phrases in text and language modelling. Nat Lang Eng 2(1):15–59

  45. Kingman JFC (1993) Poisson processes. Wiley Online Library

  46. Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Tech. rep., University of Toronto

  47. Kullback S (1997) Information theory and statistics. Courier Corporation

  48. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86

  49. Lin YS, Jiang JY, Lee SJ (2014) A similarity measure for text classification and clustering. IEEE Trans Knowl Data Eng 26(7):1575–1590

  50. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110

  51. Madsen RE, Kauchak D, Elkan C (2005) Modeling word burstiness using the dirichlet distribution. In: Proceedings of the 22nd international conference on machine learning (ICML). ACM, pp 545–552

  52. Margaritis D, Thrun S (2001) A bayesian multiresolution independence test for continuous variables. In: Proceedings of the seventeenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp 346–353

  53. McCallum A, Nigam K, et al. (1998) A comparison of event models for naive bayes text classification. In: AAAI-98 workshop on learning for text categorization, vol 752. Citeseer, pp 41–48

  54. McCallum AK (1996) Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow

  55. McLachlan G, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley, Hoboken

  56. McLachlan G, Peel D (2004) Finite mixture models. Wiley, Hoboken

  57. Minka TP (2003) Estimating a dirichlet distribution. Technical report

  58. Mosimann JE (1962) On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions. Biometrika 49(1/2):65–82

  59. Olech ŁP, Paradowski M (2016) Hierarchical gaussian mixture model with objects attached to terminal and non-terminal dendrogram nodes. In: Proceedings of the 9th international conference on computer recognition systems (CORES). Springer, pp 191–201

  60. Pimentel MA, Clifton DA, Clifton L, Tarassenko L (2014) A review of novelty detection. Signal Process 99:215–249

  61. Rennie JDM, Shih L, Teevan J, Karger DR (2003) Tackling the poor assumptions of naive bayes text classifiers. In: Proceedings of the twentieth international conference on machine learning (ICML), vol 3, pp 616–623

  62. Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471

  63. Sahami M, Koller D (1998) Using machine learning to improve information access. Ph.D. thesis, Stanford University Department of Computer Science

  64. Sandler M (2007) Hierarchical mixture models: a probabilistic analysis. In: Proceedings of the 13th international conference on knowledge discovery and data mining (SIGKDD). ACM, pp 580–589

  65. Singh S, Markou M (2004) An approach to novelty detection applied to the classification of image regions. IEEE Trans Knowl Data Eng 16(4):396–407

  66. Sneath PH, Sokal RR, et al. (1973) Numerical taxonomy. The principles and practice of numerical classification. WH Freeman and Company, San Francisco

  67. Timande N, Chandak M, Kamble M (2014) Document clustering with feature selection using dirichlet process mixture model and dirichlet multinomial allocation model. International Journal of Engineering Research and Applications 10–16. https://pdfs.semanticscholar.org/6a0f/2d593688780a73e12e171ee157bb94b037cc.pdf

  68. Ueda N, Nakano R (1998) Deterministic annealing em algorithm. Neural Netw 11(2):271–282

  69. Wallace CS (1990) Classification by minimum-message-length inference. In: Proceedings of the international conference on computing and information. Springer, pp 72–81

  70. Wallace CS (2005) Statistical and inductive inference by minimum message length. Springer Science & Business Media

  71. Wallace CS, Boulton DM (1968) An information measure for classification. Comput J 11(2):185–194

  72. Wallace CS, Dowe DL (2000) MML clustering of multi-state, poisson, von mises circular and gaussian distributions. Stat Comput 10(1):73–83

  73. Wallace CS, Freeman PR (1987) Estimation and inference by compact coding. J R Stat Soc Ser B Methodol 49(3):240–265

  74. Yang MH, Ahuja N (1998) Gaussian mixture model for human skin color and its applications in image and video databases. In: Storage and retrieval for image and video databases VII, vol 3656. International Society for Optics and Photonics, pp 458–467

  75. Yao B, Fei-Fei L (2010) Grouplet: a structured image representation for recognizing human and object interactions. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 9–16

  76. Yao JF (2000) On recursive estimation in incomplete data models. Stat A J Theor Appl Stat 34(1):27–51

  77. Zamzami N, Bouguila N (2018) Mml-based approach for determining the number of topics in edcm mixture models. In: Proceedings of the 31st Canadian conference on artificial intelligence (CAI). Springer

  78. Zhang J, Ghahramani Z, Yang Y (2005) A probabilistic model for online document clustering with application to novelty detection. In: Advances in neural information processing systems, pp 1617–1624

  79. Zhao Y, Karypis G, Fayyad U (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Disc 10(2):141–168

Author information

Correspondence to Nuha Zamzami.

Appendices

Appendix A: Proof of (16)

The negative log-likelihood function is:

$$\begin{aligned} -\mathcal{L}(\mathcal{X}_{j}|\boldsymbol{\varphi}_{j}) &= -\log \left( \prod\limits_{d=l}^{l+\eta_{j}-1} \mathcal{EDCM}(\mathbf{X}_{d}|\boldsymbol{\varphi}_{j}) \right) \\ &= \eta_{j}\left(-\log {\Gamma}(s_{j})\right)+\sum\limits_{d=l}^{l+\eta_{j}-1} \log {\Gamma}(s_{j}+n_{d}) -\sum\limits_{d=l}^{l+\eta_{j}-1} \sum\limits_{w:x_{dw}\geq 1} \left( \log \varphi_{jw} - \log x_{dw} \right) \end{aligned} $$
(30)

Then, the first-order derivative of the negative log-likelihood, i.e., the negative of the Fisher score function, is:

$$ -\frac{\partial \mathcal{L}(\mathcal{X}_{j}|\boldsymbol{\varphi}_{j})}{\partial\varphi_{jw}} = \eta_{j}\left(-{\Psi}(s_{j})\right)+\sum\limits_{d=l}^{l+\eta_{j}-1} {\Psi}(s_{j}+n_{d}) -\sum\limits_{d=l}^{l+\eta_{j}-1} I(x_{dw} \geq 1)\, \frac{1}{\varphi_{jw}} $$
(31)

where Ψ is the digamma function. Then,

$$ -\frac{\partial^{2} \mathcal{L}(\mathcal{X}_{j}|\boldsymbol{\varphi}_{j})}{\partial\varphi_{jw}^{2}} = \eta_{j}\left(-{\Psi}^{\prime}(s_{j})\right)+\sum\limits_{d=l}^{l+\eta_{j}-1} {\Psi}^{\prime}(s_{j}+n_{d}) +\sum\limits_{d=l}^{l+\eta_{j}-1} I(x_{dw} \geq 1)\, \frac{1}{\varphi_{jw}^{2}} $$
(32)

and:

$$ -\frac{\partial^{2} \mathcal{L}(\mathcal{X}_{j}|\boldsymbol{\varphi}_{j})}{\partial\varphi_{jw_{1}} \partial\varphi_{jw_{2}}} = \eta_{j}\left(-{\Psi}^{\prime}(s_{j})\right)+\sum\limits_{d=l}^{l+\eta_{j}-1} {\Psi}^{\prime}(s_{j}+n_{d}), \qquad w_{1} \neq w_{2} $$
(33)

where Ψ′ is the trigamma function. We remark that the Fisher information matrix F(φ_j) can be written as:

$$ F(\boldsymbol{\varphi}_{j})=D_{j} +\gamma_{j} \mathbf{AA}^{T} $$
(34)

where \(D_{j}=\operatorname{diag}\left[\sum\limits_{d=l}^{l+\eta_{j}-1} I(x_{d1} \geq 1)\,\frac{1}{\varphi_{j1}^{2}}, \dots, \sum\limits_{d=l}^{l+\eta_{j}-1} I(x_{dW} \geq 1)\,\frac{1}{\varphi_{jW}^{2}}\right]\), \(\gamma_{j}=\eta_{j}(-{\Psi}^{\prime}(s_{j}))+\sum\limits_{d=l}^{l+\eta_{j}-1} {\Psi}^{\prime}(s_{j}+n_{d})\), and \(\mathbf{A}=(1,\dots,1)^{T}\). Then, according to Theorem 8.4.3 in Graybill [33], the determinant of the Fisher information matrix \(F(\boldsymbol{\varphi}_{j})\) is:

$$ |F(\boldsymbol{\varphi}_{j})|=\left( 1+\gamma_{j} \sum\limits_{w = 1}^{W} \frac{a_{jw}^{2}}{D_{jw}} \right) \prod\limits_{w = 1}^{W} D_{jw} $$
(35)
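
As a sanity check on (35), the short snippet below verifies the diagonal-plus-rank-one determinant identity from Graybill [33] numerically; the diagonal entries and the value of γ are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check of |D + gamma * A A^T| = (1 + gamma * sum_w a_w^2 / D_w) * prod_w D_w
# (Graybill, Theorem 8.4.3), with arbitrary illustrative values.
rng = np.random.default_rng(0)
W = 6
D_diag = rng.uniform(0.5, 2.0, size=W)      # diagonal entries D_w
gamma = 0.7                                 # scalar gamma_j
a = np.ones(W)                              # A^T = (1, ..., 1), as in (34)

lhs = np.linalg.det(np.diag(D_diag) + gamma * np.outer(a, a))
rhs = (1.0 + gamma * np.sum(a**2 / D_diag)) * np.prod(D_diag)
print(np.isclose(lhs, rhs))                 # True
```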

By substituting (35) and (15) into (14), we obtain:

$$ |F({\Theta})|\simeq\frac{N}{\prod\limits_{j = 1}^{M} \mu_{j}} \prod\limits_{j = 1}^{M} \left[ \left( 1+\gamma_{j} \sum\limits_{w = 1}^{W} \frac{a_{jw}^{2}}{D_{jw}} \right) \prod\limits_{w = 1}^{W} D_{jw} \right] $$
(36)

Then, taking the log gives (16).
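
The step from (30) to (31) can be checked numerically with a central finite difference, as in the sketch below; the helper name, the toy counts, and the parameter values are illustrative assumptions, not material from the paper.

```python
import numpy as np
from scipy.special import gammaln, digamma

def neg_log_lik(X, phi):
    """Negative log-likelihood (30) of the documents X under one EDCM component."""
    s = phi.sum()
    total = 0.0
    for x in X:
        n = x.sum()
        active = x >= 1
        total += (-gammaln(s) + gammaln(s + n)
                  - np.sum(np.log(phi[active]) - np.log(x[active])))
    return total

# Central finite-difference check of the score (31) for one coordinate w.
rng = np.random.default_rng(1)
X = rng.integers(0, 4, size=(5, 8)).astype(float)   # 5 toy documents over 8 words
phi = rng.uniform(0.1, 1.0, size=8)
w, eps = 3, 1e-6
eta, s = X.shape[0], phi.sum()
analytic = (eta * (-digamma(s)) + sum(digamma(s + x.sum()) for x in X)
            - sum(float(x[w] >= 1) / phi[w] for x in X))
e = np.zeros(8); e[w] = eps
numeric = (neg_log_lik(X, phi + e) - neg_log_lik(X, phi - e)) / (2 * eps)
print(np.allclose(analytic, numeric, rtol=1e-4))     # True
```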

Appendix B: Proof of (29)

The KL divergence between two distributions that belong to the exponential family is defined as [16]:

$$ KL\left( P(X|\theta_{j1}), P(X|\theta_{j2})\right) = {\Phi}(\theta_{j1})-{\Phi}(\theta_{j2}) + \left( G(\theta_{j1})-G(\theta_{j2})\right) E_{\theta_{j1}}[T(X)] $$
(37)

where E𝜃 is the expectation with respect to P(X|𝜃). We have:

$$ {\Phi}(\theta) = \log\left(\frac{{\Gamma}(s)}{{\Gamma}(s+n)}\right) $$
(38)
$$ G(\theta) = \log(\varphi_{jw}) $$
(39)
$$ T(X) = I(x_{w} \geq 1) $$
(40)

Moreover, we have the following [16]:

$$ E_{\theta}[T(X)]=-{\Phi}^{\prime}(\theta) $$
(41)

Thus, according to (38), (40) and (41), we have:

$$ E_{\theta}\left[I(x_{w} \geq 1)\right]=-\frac{\partial {\Phi}(\theta)}{\partial \varphi_{jw}}= {\Psi}(s+n)-{\Psi}(s) $$
(42)

Substituting (42), (38) and (39) into (37) gives (29).
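
For completeness, a minimal sketch of the resulting closed-form divergence, obtained by substituting (38)–(42) into (37) for a fixed document length n, is given below; the function name is illustrative and the expression assumes the logarithmic form of Φ in (38).

```python
import numpy as np
from scipy.special import gammaln, digamma

def edcm_kl(phi1, phi2, n):
    """KL divergence between two EDCM components, obtained from (37) with (38)-(42)
    for documents of length n. Illustrative sketch, not the paper's reference code."""
    phi1, phi2 = np.asarray(phi1, dtype=float), np.asarray(phi2, dtype=float)
    s1, s2 = phi1.sum(), phi2.sum()
    phi_term = (gammaln(s1) - gammaln(s1 + n)) - (gammaln(s2) - gammaln(s2 + n))
    expected_T = digamma(s1 + n) - digamma(s1)      # E[I(x_w >= 1)] from (42)
    return phi_term + np.sum((np.log(phi1) - np.log(phi2)) * expected_T)
```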

Cite this article

Zamzami, N., Bouguila, N. Model selection and application to high-dimensional count data clustering. Appl Intell 49, 1467–1488 (2019). https://doi.org/10.1007/s10489-018-1333-9
