Abstract
Count data are widely used in machine learning and computer vision applications; however, they often suffer from the well-known curse of dimensionality, which dramatically degrades the performance of clustering algorithms. Feature selection is a major technique for handling large numbers of features, most of which are often redundant and noisy. In this paper, we propose a probabilistic approach for count data based on the concept of feature saliency in the context of mixture-based clustering using the generalized Dirichlet multinomial distribution. The saliency of irrelevant features is driven toward zero by minimizing the message length, which amounts to performing feature selection and model selection simultaneously. Using a range of challenging applications, including text and image clustering, we show that the developed approach is effective in identifying both the optimal number of clusters and the most relevant features, and thus enhances clustering performance significantly.
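To make the feature-saliency idea concrete, the sketch below implements its core likelihood in Python, following the generic saliency formulation of Law et al. [19] that the paper builds on. All names are hypothetical, and simple Poisson feature densities stand in for the paper's generalized Dirichlet multinomial components purely to keep the example short.

```python
import numpy as np
from scipy.stats import poisson

# Feature-saliency mixture sketch: each feature l has a saliency rho[l]; a
# relevant feature follows its cluster-specific density, an irrelevant one
# follows a common background density shared by all clusters.
rng = np.random.default_rng(0)
N, D, M = 300, 8, 3                        # samples, count features, clusters
X = rng.poisson(rng.uniform(1, 5, D), size=(N, D))

pi = np.full(M, 1.0 / M)                   # mixing weights
theta = rng.uniform(1, 5, size=(M, D))     # cluster-specific Poisson rates
lam = X.mean(axis=0)                       # shared background rates
rho = np.full(D, 0.5)                      # feature saliencies in (0, 1)

def log_component_lik(X, theta_j, lam, rho):
    """log p(x_i | cluster j) = sum_l log[rho_l p(x_il|theta_jl) + (1-rho_l) q(x_il|lam_l)]."""
    rel = rho * poisson.pmf(X, theta_j)        # relevant: cluster-specific part
    irr = (1.0 - rho) * poisson.pmf(X, lam)    # irrelevant: common background part
    return np.log(rel + irr + 1e-300).sum(axis=1)

# E-step responsibilities: a feature whose saliency is driven to zero contributes
# the same background factor to every cluster and stops influencing assignments.
log_resp = np.log(pi) + np.column_stack(
    [log_component_lik(X, theta[j], lam, rho) for j in range(M)])
resp = np.exp(log_resp - log_resp.max(axis=1, keepdims=True))
resp /= resp.sum(axis=1, keepdims=True)
```

In the paper itself, the saliencies, component parameters, and mixing weights are updated by MM steps derived from the message-length objective rather than by a plain EM pass like the one above.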
Notes
In our experiments, the values for \(M_{min}\) and \(M_{max}\) have been set to 2 and 50, respectively.
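For illustration, the outer model-selection loop that this search range drives can be sketched as follows. sklearn's GaussianMixture and BIC are stand-ins for the paper's generalized Dirichlet multinomial mixture and message-length criterion, used here only so the loop is runnable.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Candidate component counts are tried one by one and the model with the
# smallest criterion value is kept; BIC substitutes for the MML criterion.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, size=(100, 5)) for c in (0.0, 4.0, 8.0)])

M_min, M_max = 2, 50                         # search range from the note above

best_M, best_score = None, np.inf
for M in range(M_min, min(M_max, 10) + 1):   # range truncated for the demo
    gm = GaussianMixture(n_components=M, random_state=0).fit(X)
    score = gm.bic(X)                        # stands in for the message length
    if score < best_score:
        best_M, best_score = M, score
print("selected number of clusters:", best_M)
```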
References
Dy JG, Brodley CE (2004) Feature selection for unsupervised learning. J Mach Learn Res 5(Aug):845–889
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3(Mar):1157–1182
Liu H, Wu X, Zhang S (2011) Feature selection using hierarchical feature clustering. In: Proceedings of the 20th ACM international conference on information and knowledge management, ACM, pp 979–984
Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 333–342
Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40(1):16–28
Kohavi R, Sommerfield D (1995) Feature subset selection using the wrapper method: Overfitting and dynamic search space topology. In: KDD, pp 192–197
Wolf L, Shashua A (2005) Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach. J Mach Learn Res 6(Nov):1855–1887
Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 4:491–502
Chuang L-Y, Chang H-W, Tu C-J, Yang C-H (2008) Improved binary PSO for feature selection using gene expression data. Comput Biol Chem 32(1):29–38
Lazar C, Taminau J, Meganck S, Steenhoff D, Coletta A, Molter C, de Schaetzen V, Duque R, Bersini H, Nowe A (2012) A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 9(4):1106–1119
Tang J, Liu H (2012) Feature selection with linked data in social media. In: Proceedings of the 2012 SIAM international conference on data mining, SIAM, pp 118–128
Tang J, Liu H (2014) An unsupervised feature selection framework for social media data. IEEE Trans Knowl Data Eng 26(12):2914–2927
Liu L, Shao L, Rockett P (2013) Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition. Pattern Recogn 46(7):1810–1818
Lin C-H, Chen H-Y, Wu Y-S (2014) Study of image retrieval and classification based on adaptive features using genetic algorithm feature selection. Expert Syst Appl 41(15):6611–6621
Battiti R (1994) Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw 5(4):537–550
Zeng Z, Wang X, Zhang J, Wu Q (2016) Semi-supervised feature selection based on local discriminative information. Neurocomputing 173:102–109
Chen X, Yuan G, Nie F, Huang JZ (2017) Semi-supervised feature selection via rescaled linear regression. In: IJCAI, vol 2017, pp 1525–1531
Li Z, Tang J (2021) Semi-supervised local feature selection for data classification. Sci China Inf Sci 64(9):1–12
Law MH, Figueiredo MA, Jain AK (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell 26(9):1154–1166
Bouguila N (2009) A model-based approach for discrete data clustering and feature weighting using MAP and stochastic complexity. IEEE Trans Knowl Data Eng 21(12):1649–1664
Luo M, Nie F, Chang X, Yang Y, Hauptmann AG, Zheng Q (2017) Adaptive unsupervised feature selection with structure regularization. IEEE Trans Neural Netw Learn Syst 29(4):944–956
Li Z, Liu J, Zhu X, Liu T, Lu H (2010) Image annotation using multi-correlation probabilistic matrix factorization. In: Proceedings of the 18th ACM international conference on multimedia, ACM, pp 1187–1190
Li Z, Liu J, Yang Y, Zhou X, Lu H (2014) Clustering-guided sparse structural learning for unsupervised feature selection. IEEE Trans Knowl Data Eng 26(9):2138–2150
Hong X, Li H, Miller P, Zhou J, Li L, Crookes D, Lu Y, Li X, Zhou H (2019) Component-based feature saliency for clustering. IEEE Trans Knowl Data Eng
Ortega JM, Rheinboldt WC (1970) Iterative solution of nonlinear equations in several variables, vol 30. SIAM
Wu TT, Lange K (2010) The MM alternative to EM. Stat Sci 25(4):492–505
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodological) 39(1):1–38
Wallace CS (2005) Statistical and inductive inference by minimum message length. Springer, New York
Bouguila N (2008) Clustering of count data using generalized Dirichlet multinomial distributions. IEEE Trans Knowl Data Eng 20(4):462–474
Connor RJ, Mosimann JE (1969) Concepts of independence for proportions with a generalization of the Dirichlet distribution. J Am Stat Assoc 64(325):194–206
Madsen RE, Kauchak D, Elkan C (2005) Modeling word burstiness using the Dirichlet distribution. In: Proceedings of the 22nd international conference on machine learning, ACM, pp 545–552
Wong T-T (2009) Alternative prior assumptions for improving the performance of naïve Bayesian classifiers. Data Min Knowl Disc 18(2):183–213
Zamzami N, Bouguila N (2018) Consumption behavior prediction using hierarchical Bayesian frameworks. In: 2018 first international conference on artificial intelligence for industries (AI4I), IEEE, pp 31–34
Graham MW, Miller DJ (2006) Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection. IEEE Trans Signal Process 54(4):1289–1303
Zhou H, Lange K (2010) MM algorithms for some discrete multivariate distributions. J Comput Graph Stat 19(3):645–665
Wu X, Jiang B, Yu K, Miao C, Chen H (2019) Accurate Markov boundary discovery for causal feature selection. IEEE Trans Cybern 50:4983–4996
Liu C, Zheng C-T, Wu S, Yu Z, Wong H-S (2018) Multitask feature selection by graph-clustered feature sharing. IEEE Trans Cybern 50:74–86
Wu H, Liu T, Xie J (2017) Fine-grained product feature extraction in Chinese reviews. In: 2017 international conference on computing intelligence and information system (CIIS), IEEE, pp 327–331
Marquetti I, Link JV, Lemes ALG, dos Santos Scholz MB, Valderrama P, Bona E (2016) Partial least square with discriminant analysis and near infrared spectroscopy for evaluation of geographic and genotypic origin of arabica coffee. Comput Electron Agric 121:313–319
Fan Z, Xu Y, Zuo W, Yang J, Tang J, Lai Z, Zhang D (2014) Modified principal component analysis: An integration of multiple similarity subspace models. IEEE Trans Neural Netw Learn Syst 25(8):1538–1552
Zhao H, Wang Z, Nie F (2018) A new formulation of linear discriminant analysis for robust dimensionality reduction. IEEE Trans Knowl Data Eng 31(4):629–640
Bouveyron C, Brunet-Saumard C (2014) Model-based clustering of high-dimensional data: a review. Comput Stat Data Anal 71:52–78
Dash M, Liu H (2000) Feature selection for clustering. In: Pacific-Asia conference on knowledge discovery and data mining, Springer, pp 110–121
Wang Y, Feng L (2019) A new hybrid feature selection based on multi-filter weights and multi-feature weights. Appl Intell 49:1–25
Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning. Springer, New York
Liu H, Motoda H (2007) Computational methods of feature selection. CRC Press, Boca Raton
Dash M, Choi K, Scheuermann P, Liu H (2002) Feature selection for clustering-a filter solution. In: Proceedings of 2002 IEEE international conference on data mining, IEEE, pp 115–122
Ambusaidi MA, He X, Nanda P, Tan Z (2016) Building an intrusion detection system using a filter-based feature selection algorithm. IEEE Trans Comput 65(10):2986–2998
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324
Kabir MM, Islam MM, Murase K (2010) A new wrapper feature selection approach using neural network. Neurocomputing 73(16–18):3273–3283
Apolloni J, Leguizamón G, Alba E (2016) Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments. Appl Soft Comput 38:922–932
Moradkhani M, Amiri A, Javaherian M, Safari H (2015) A hybrid algorithm for feature subset selection in high-dimensional datasets using FICA and IWSSr algorithm. Appl Soft Comput 35:123–135
Tang B, Kay S, He H (2016) Toward optimal feature selection in naive Bayes for text categorization. IEEE Trans Knowl Data Eng 28(9):2508–2521
Bouillot F, Hai PN, Béchet N, Bringay S, Ienco D, Matwin S, Poncelet P, Roche M, Teisseire M (2012) How to extract relevant knowledge from tweets? In: International workshop on information search, integration, and personalization, Springer, pp 111–120
Mladenic D, Grobelnik M (1999) Feature selection for unbalanced class distribution and naive Bayes. In: ICML, vol 99, pp 258–267
Caropreso MF, Matwin S, Sebastiani F (2001) A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. Text Databases Doc Manage Theory Pract 5478:78–102
Li Y, Luo C, Chung SM (2008) Text clustering with feature selection by using statistical data. IEEE Trans Knowl Data Eng 20(5):641–652
Galavotti L, Sebastiani F, Simi M (2000) Experiments on the use of feature selection and negative evidence in automated text categorization. In: International conference on theory and practice of digital libraries, Springer, pp 59–68
Talavera L (1999) Feature selection as a preprocessing step for hierarchical clustering. In: ICML, vol 99, pp 389–397
He X, Cai D, Niyogi P (2006) Laplacian score for feature selection. In: Advances in neural information processing systems, vol 18, pp 507–514
Dasgupta A, Drineas P, Harb B, Josifovski V, Mahoney MW (2007) Feature selection methods for text classification. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 230–239
Sharma KK, Seal A (2020) Clustering analysis using an adaptive fused distance. Eng Appl Artif Intell 96:103928
Sharma KK, Seal A (2021) Spectral embedded generalized mean based k-nearest neighbors clustering with s-distance. Expert Syst Appl 169:114326
Sharma KK, Seal A, Herrera-Viedma E, Krejcar O (2021) An enhanced spectral clustering algorithm with s-distance. Symmetry 13(4):596
Adams S, Beling PA (2017) A survey of feature selection methods for Gaussian mixture models and hidden Markov models. Artif Intell Rev 52:1–41
Boutemedjet S, Bouguila N, Ziou D (2008) A hybrid feature extraction selection approach for high-dimensional non-Gaussian data clustering. IEEE Trans Pattern Anal Mach Intell 31(8):1429–1443
Fan W, Bouguila N, Ziou D (2012) Unsupervised hybrid feature extraction selection for high-dimensional non-Gaussian data clustering with variational inference. IEEE Trans Knowl Data Eng 25(7):1670–1685
Vaithyanathan S, Dom B (2000) Generalized model selection for unsupervised learning in high dimensions. Adv Neural Inf Process Syst 12:970–976
Wang X, Kabán A (2006) Model-based estimation of word saliency in text. In: International conference on discovery science, Springer, pp 279–290
Li Z, Yang Y, Liu J, Zhou X, Lu H (2012) Unsupervised feature selection using nonnegative spectral analysis. In: Proceedings of the AAAI conference on artificial intelligence, vol 26
Li Z, Tang J (2015) Unsupervised feature selection via nonnegative spectral analysis and redundancy control. IEEE Trans Image Process 24(12):5343–5355
Cheung Y-m, Zeng H (2007) A maximum weighted likelihood approach to simultaneous model selection and feature weighting in Gaussian mixture. In: International conference on artificial neural networks, Springer, pp 78–87
Tsai C-Y, Chiu C-C (2008) Developing a feature weight self-adjustment mechanism for a K-means clustering algorithm. Comput Stat Data Anal 52(10):4658–4672
Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396
Wallace CS, Dowe DL (2000) MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Stat Comput 10(1):73–83
Mosimann JE (1962) On the compound multinomial distribution, the multivariate \(\beta\)-distribution, and correlations among proportions. Biometrika 49(1/2):65–82
Wong T-T (2014) Generalized Dirichlet priors for naïve Bayesian classifiers with multinomial models in document classification. Data Min Knowl Disc 28(1):123–144
Caballero KL, Barajas J, Akella R (2012) The generalized Dirichlet distribution in enhanced topic detection. In: Proceedings of the 21st ACM international conference on information and knowledge management, ACM, pp 773–782
Katz SM (1996) Distribution of content words and phrases in text and language modelling. Nat Lang Eng 2(1):15–59
Puig P, Valero J (2006) Count data distributions: some characterizations with applications. J Am Stat Assoc 101(473):332–340
Haldane JB (1941) The fitting of binomial distributions. Ann Eugen 11(1):179–181
Bailey NT (1957) The mathematical theory of epidemics. Griffin, London
Griffiths D (1973) Maximum likelihood estimation for the beta-binomial distribution and an application to the household distribution of the total number of cases of a disease. Biometrics 29(4):637–648
Pudil P, Novovičová J, Choakjarernwanit N, Kittler J (1995) Feature selection based on the approximation of class densities by finite mixtures of special type. Pattern Recogn 28(9):1389–1398
Nguyen HD (2017) An introduction to Majorization-Minimization algorithms for machine learning and statistical estimation. Wiley Interdiscip Rev Data Min Knowl Discov 7(2):e1198
Tian G-L, Liu Y, Tang M-L, Li T (2019) A novel MM algorithm and the mode-sharing method in bayesian computation for the analysis of general incomplete categorical data. Comput Stat Data Anal 140:122–143
Elkan C (2006) Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution. In: Proceedings of the 23rd international conference on machine learning, ACM, pp 289–296
Baxter RA, Oliver JJ (2000) Finding overlapping components with MML. Stat Comput 10(1):5–16
Bernardo JM, Smith AFM (2000) Bayesian theory. Wiley, Chichester
Celeux G, Chrétien S, Forbes F, Mkhadri A (2001) A component-wise EM algorithm for mixtures. J Comput Graph Stat 10(4):697–712
Novovičová J, Malik A (2003) Application of multinomial mixture model to text classification. In: Iberian conference on pattern recognition and image analysis, Springer, pp 646–653
Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. In: Eleventh international AAAI conference on web and social media
Ortiz EG, Becker BC (2014) Face recognition for web-scale datasets. Comput Vis Image Underst 118:153–170
Kumar N, Berg A, Belhumeur PN, Nayar S (2011) Describable visual attributes for face verification and image search. IEEE Trans Pattern Anal Mach Intell 33(10):1962–1977
Huang GB, Mattar M, Berg T, Learned-Miller E (2008) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. In: Workshop on faces in 'Real-Life' images: detection, alignment, and recognition, Marseille, France, Oct 2008
Zhang Z, Song Y, Qi H (2017) Age progression/regression by conditional adversarial autoencoder. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5810–5818
Ricanek K, Tesafaye T (2006) Morph: A longitudinal image database of normal adult age-progression. In: 7th international conference on automatic face and gesture recognition (FGR06), IEEE, pp 341–345
Guo G, Zhang C (2014) A study on cross-population age estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4257–4263
He Z, Li X, Zhang Z, Wu F, Geng X, Zhang Y, Yang M-H, Zhuang Y (2017) Data-dependent label distribution learning for age estimation. IEEE Trans Image Process 26(8):3846–3858
Appendix A: Surrogate function construction
The complete-data log-likelihood of the proposed model (Eq. 7) is given by:
To construct an MM algorithm, we need to minorize terms such as \(\log (\pi _{jl}+k)\) and \(\log (\mu _{jl}+k)\). Noticing that the term \(\log (\pi _{jl}+k)\) occurs in the log-likelihood if and only if \(X_{il} \ge k+1\), and the term \(\log (\mu _{jl}+k)\) occurs in the log-likelihood if and only if \(Y_{il} \ge k+1\), we define the following associated counts for \(l=1,\dots ,D\):
where the index k ranges from 0 to \(\max _i m_i-1\). Recalling that \(\upsilon _{ijl}=P(Z_i=j,\phi _{jl}=1 \mid X_i)\) and \(\nu _{ijl}=P(Z_i=j,\phi _{jl}=0 \mid X_i)\), Eq. (A1) can be rewritten as:
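The associated counts described above can be tabulated directly from the data. The NumPy sketch below counts, for each feature l and each shift k, the observations with \(X_{il} \ge k+1\); it is the unweighted version, shown for clarity only, since the paper's exact definitions may also carry the posterior weights \(\upsilon _{ijl}\) and \(\nu _{ijl}\).

```python
import numpy as np

# r[l, k] = #{ i : X_il >= k+1 }: one count per feature l and shift k, which is
# exactly how often the term log(pi_jl + k) appears in the log-likelihood.
rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(100, 6))   # toy N x D count matrix
K = int(X.max())                      # k effectively ranges over 0 .. K-1 here

r = (X[:, :, None] >= np.arange(1, K + 1)[None, None, :]).sum(axis=0)
print(r.shape)                        # (D, K)
```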
Then, we apply the basic minorizations of Zhou and Lange (see equations 2.3 and 2.4 in [35]) to the previous equation, which yields the surrogate function:
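These two minorizations can be checked numerically. The snippet below verifies my restatement of them, \(\log (\theta +k) \ge \log (\theta _m+k)+\frac{\theta _m}{\theta _m+k}\log \frac{\theta }{\theta _m}\) and \(-\log (c+\theta ) \ge -\log (c+\theta _m)-\frac{\theta -\theta _m}{c+\theta _m}\), with equality at \(\theta =\theta _m\), at a few random points; these bounds are what turn the log terms of Eq. (A1) into a surrogate whose parameters separate.

```python
import numpy as np

# Numerical sanity check of the two minorizations (restated from [35]).
rng = np.random.default_rng(0)
for _ in range(1000):
    t_m, t, k, c = rng.uniform(0.1, 5.0, size=4)
    assert np.log(t + k) >= np.log(t_m + k) + (t_m / (t_m + k)) * np.log(t / t_m) - 1e-12
    assert -np.log(c + t) >= -np.log(c + t_m) - (t - t_m) / (c + t_m) - 1e-12
print("both minorizations hold at all sampled points")
```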
Cite this article
Zamzami, N., Bouguila, N. A novel minorization–maximization framework for simultaneous feature selection and clustering of high-dimensional count data. Pattern Anal Applic 26, 91–106 (2023). https://doi.org/10.1007/s10044-022-01094-z