Abstract
The development of both generative and discriminative techniques for classification has made significant progress in recent years. Given the complementary capabilities and limitations of the two paradigms, hybrid generative-discriminative approaches have received increasing attention. Our goal is to combine the advantages and desirable properties of generative models, namely finite mixtures, with those of Support Vector Machines (SVMs), a powerful discriminative technique, for modeling the count data that appears in many machine learning and computer vision applications. In particular, we derive accurate SVM kernels from mixtures of the Multinomial Scaled Dirichlet distribution and of its exponential approximation (EMSD). We demonstrate the effectiveness and merits of the proposed framework on challenging real-world applications, namely object recognition and visual scene classification. The empirical study considers large-scale datasets, including Microsoft MOCR, Fruits-360 and MIT Places.
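At a high level, the framework proceeds in two stages: a generative stage that fits a finite mixture to the count vectors, and a discriminative stage that trains an SVM on a kernel derived from the fitted model. The sketch below illustrates only this flow with a precomputed-kernel SVM; as a stand-in for the MSD/EMSD mixture-based kernels developed in the paper, it uses a simple symmetrized-KL kernel between smoothed per-image multinomial estimates (an assumption made purely for illustration) and toy Poisson counts in place of real bag-of-visual-words histograms.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy count data: 40 training and 10 test "images", each a 50-bin histogram.
X_train = rng.poisson(lam=3.0, size=(40, 50))
y_train = rng.integers(0, 2, size=40)
X_test = rng.poisson(lam=3.0, size=(10, 50))

def smoothed_proportions(X, eps=1e-3):
    """Turn count vectors into smoothed multinomial parameter estimates."""
    P = X + eps
    return P / P.sum(axis=1, keepdims=True)

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def kl_kernel(A, B, a=1.0):
    """Gram matrix of the symmetrized-KL exponential kernel
    K(x_i, x_j) = exp(-a [KL(p_i || p_j) + KL(p_j || p_i)])."""
    P, Q = smoothed_proportions(A), smoothed_proportions(B)
    K = np.zeros((len(P), len(Q)))
    for i, p in enumerate(P):
        for j, q in enumerate(Q):
            K[i, j] = np.exp(-a * (kl(p, q) + kl(q, p)))
    return K

# Generative step (here only a stand-in for fitting the mixture), then the SVM.
svm = SVC(kernel="precomputed").fit(kl_kernel(X_train, X_train), y_train)
y_pred = svm.predict(kl_kernel(X_test, X_train))
```

The symmetrized-KL exponential kernel follows the general construction of Moreno et al. (2004); in the paper the kernels are instead generated from the fitted MSD and EMSD mixtures.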
Notes
The size of each vector depends on the image representation approach; in our case, the vectors are 128-dimensional, since each image is represented as a bag of SIFT descriptors [46].
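For context on how such count vectors are typically produced, the following sketch builds bag-of-visual-words histograms from 128-dimensional SIFT descriptors. It is illustrative only: the image paths and vocabulary size are placeholders, and it assumes OpenCV ≥ 4.4 (where SIFT is exposed as cv2.SIFT_create) together with scikit-learn's MiniBatchKMeans.

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

image_paths = ["img_001.jpg", "img_002.jpg"]   # placeholder paths
vocab_size = 200                               # placeholder vocabulary size

sift = cv2.SIFT_create()                       # requires OpenCV >= 4.4
descriptors_per_image = []
for path in image_paths:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(gray, None)   # desc has shape (n_keypoints, 128)
    descriptors_per_image.append(desc)

# Learn a visual vocabulary over all descriptors, then represent each image
# as a histogram of visual-word counts (the count vectors modeled in the paper).
codebook = MiniBatchKMeans(n_clusters=vocab_size, random_state=0)
codebook.fit(np.vstack(descriptors_per_image))

count_vectors = np.array([
    np.bincount(codebook.predict(desc), minlength=vocab_size)
    for desc in descriptors_per_image
])
```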
References
Agarwal A, Daumé H III (2011) Generative kernels for exponential families. In: Proceedings of the 14th international conference on artificial intelligence and statistics, pp 85–92
Amayri O, Bouguila N (2015) Beyond hybrid generative discriminative learning: spherical data classification. Pattern Anal Appl 18(1):113–133
Banerjee A, Dhillon IS, Ghosh J, Sra S (2005) Clustering on the unit hypersphere using von mises-fisher distributions. J Mach Learn Res 6:1345–1382
Bdiri T, Bouguila N (2013) Bayesian learning of inverted dirichlet mixtures for svm kernels generation. Neural Comput Appl 23(5):1443–1458
Berk RA (2016) Support vector machines. In: Statistical learning from a regression perspective. Springer, pp 291–310
Bishop C (2006) Pattern recognition and machine learning. Information science and statistics. Springer, New York
Bishop C, Bishop CM et al (1995) Neural networks for pattern recognition. Oxford University Press, Oxford
Bosch A, Muñoz X, Martí R (2007) Which is the best way to organize/classify images by content? Image Vis Comput 25(6):778–791
Bouguila N (2008) Clustering of count data using generalized dirichlet multinomial distributions. IEEE Trans Knowl Data Eng 20(4):462–474
Bouguila N (2011) Bayesian hybrid generative discriminative learning based on finite liouville mixture models. Pattern Recogn 44(6):1183–1200
Bouguila N (2011) Count data modeling and classification using finite mixtures of distributions. IEEE Trans Neural Netw 22(2):186–198
Bouguila N (2012) Hybrid generative/discriminative approaches for proportional data modeling and classification. IEEE Trans Knowl Data Eng 24(12):2184–2202
Bouguila N (2013) Deriving kernels from generalized dirichlet mixture models and applications. Inf Process Manag 49(1):123–137
Bouguila N, Amayri O (2009) A discrete mixture-based kernel for svms: application to spam and image categorization. Inf Process Manag 45(6):631–642
Bouguila N, Ziou D (2007) Unsupervised learning of a finite discrete mixture: Applications to texture modeling and image databases summarization. J Vis Commun Image Represent 18(4):295–309
Brown LD (1986) Fundamentals of statistical exponential families: with applications in statistical decision theory. Institute of Mathematical Statistics, Hayward
Burges CJ (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc 2 (2):121–167
Campbell WM, Sturim DE, Reynolds DA (2006) Support vector machines using gmm super vectors for speaker verification. IEEE Signal Process Lett 13(5):308–311
Chan AB, Vasconcelos N, Moreno PJ (2004) A family of probabilistic kernels based on information divergence. Univ. California, San Diego, CA, Tech. Rep SVCL-TR-2004-1
Chang SK, Hsu A (1992) Image information systems: where do we go from here? IEEE Trans Knowl Data Eng 4(5):431–442
Cristianini N, Shawe-Taylor J (2000) Support vector machines, vol 93. Cambridge University Press, Cambridge, pp 935–948
Church KW, Gale WA (1995) Poisson mixtures. Nat Lang Eng 1(2):163–190
Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, ECCV. Prague, vol 1, pp 1–2
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B Methodol 39(1):1–22
Deng J, Xu X, Zhang Z, Frühholz S., Grandjean D, Schuller B (2017) Fisher kernels on phase-based features for speech emotion recognition. In: Dialogues with social robots. Springer, pp 195–203
Elisseeff A, Weston J (2002) A kernel method for multi-labelled classification. In: Advances in neural information processing systems, pp 681–687
Elkan C (2006) Clustering documents with an exponential-family approximation of the dirichlet compound multinomial distribution. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 289–296
Erhan D, Bengio Y, Courville A, Manzagol PA, Vincent P, Bengio S (2010) Why does unsupervised pre-training help deep learning? J Mach Learn Res 11(Feb):625–660
Fei-Fei L, Fergus R, Perona P (2007) Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. Comput Vis Image Underst 106(1):59–70
Fei-Fei L, Perona P (2005) A bayesian hierarchical model for learning natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition, 2005. CVPR 2005, vol 2. IEEE, pp 524–531
Ferrari V, Tuytelaars T, Van Gool L (2006) Object detection by contour segment networks. In: European conference on computer vision. Springer, pp 14–28
Grauman K, Darrell T (2005) The pyramid match kernel: Discriminative classification with sets of image features. In: 10th IEEE international conference on computer vision, 2005. ICCV 2005, vol 2. IEEE, pp 1458–1465
Gupta RD, Richards DSP (1987) Multivariate liouville distributions. J Multivar Anal 23(2):233–256
Han X, Dai Q (2018) Batch-normalized mlpconv-wise supervised pre-training network in network. Appl Intell 48(1):142–155
Hankin RK et al (2010) A generalization of the dirichlet distribution. J Stat Softw 33(11):1–18
Jaakkola T, Haussler D (1999) Exploiting generative models in discriminative classifiers. In: Advances in neural information processing systems, pp 487–493
Jebara T (2003) Images as bags of pixels. In: ICCV, pp 265–272
Jebara T, Kondor R, Howard A (2004) Probability product kernels. J Mach Learn Res 5(Jul):819–844
Jégou H, Douze M, Schmid C (2009) On the burstiness of visual elements. In: IEEE conference on computer vision and pattern recognition, 2009. CVPR 2009. IEEE, pp 1169–1176
Kailath T (1967) The divergence and bhattacharyya distance measures in signal selection. IEEE Trans Commun Technol 15(1):52–60
Katz SM (1996) Distribution of content words and phrases in text and language modelling. Nat Lang Eng 2 (1):15–59
Keerthi SS, Lin CJ (2003) Asymptotic behaviors of support vector machines with gaussian kernel. Neural Comput 15(7):1667–1689
Lin HT, Lin CJ (2003) A study on sigmoid kernels for svm and the training of non-psd kernels by smo-type methods. submitted to Neural Computation 3:1–32
Lin J (1991) Divergence measures based on the shannon entropy. IEEE Trans Inf Theor 37(1):145–151
Lochner RH (1975) A generalized dirichlet distribution in bayesian life testing. J R Stat Soc Ser B Methodol 37(1):103–113
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
Ma Y, Guo G (2014) Support vector machines applications. Springer, New York
Madsen RE, Kauchak D, Elkan C (2005) Modeling word burstiness using the dirichlet distribution. In: Proceedings of the 22nd international conference on machine learning. ACM, pp 545–552
McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In: Proceedings of the AAAI-98 workshop on learning for text categorization, vol 752. Citeseer, pp 41–48
McCallum AK (1996) Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow
McLachlan G, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley, New Jersey
Migliorati S, Monti GS, Ongaro A (2008) E–m algorithm: an application to a mixture model for compositional data. In: Proceedings of the 44th scientific meeting of the italian statistical society
Moguerza JM, Muñoz A, et al. (2006) Support vector machines with applications. Stat Sci 21(3):322–336
Monti GS, Mateu-Figueras G, Pawlowsky-Glahn V (2011) Notes on the scaled Dirichlet distribution. In: Compositional data analysis: theory and applications. Wiley, Chichester. https://doi.org/10.1002/9781119976462.ch10
Moreno PJ, Ho PP, Vasconcelos N (2004) A kullback-leibler divergence based kernel for svm classification in multimedia applications. In: Advances in neural information processing systems, pp 1385–1392
Mosimann JE (1962) On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions. Biometrika 49(1/2):65–82
Mureşan H, Oltean M (2018) Fruit recognition from images using deep learning. Acta Universitatis Sapientiae, Informatica 10(1):26–42
Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive bayes. In: Advances in neural information processing systems, pp 841–848
Oboh BS, Bouguila N (2017) Unsupervised learning of finite mixtures using scaled dirichlet distribution and its application to software modules categorization. In: Proceedings of the 2017 IEEE international conference on industrial technology (ICIT). IEEE, pp 1085–1090
Van den Oord A, Schrauwen B (2014) Factoring variations in natural images with deep gaussian mixture models. In: Advances in neural information processing systems, pp 3518–3526
Penny WD (2001) Kullback-liebler divergences of normal, gamma, dirichlet and wishart densities. Wellcome Department of Cognitive Neurology
Pérez-Cruz F (2008) Kullback-leibler divergence estimation of continuous distributions. In: IEEE international symposium on information theory, 2008. ISIT 2008. IEEE, pp 1666–1670
Raina R, Shen Y, Mccallum A, Ng AY (2004) Classification with hybrid generative/discriminative models. In: Advances in neural information processing systems, pp 545–552
Rennie JDM, Shih L, Teevan J, Karger DR (2003) Tackling the poor assumptions of naive bayes text classifiers. In: Proceedings of the 20th international conference on machine learning ICML, vol 3, pp 616–623
Rényi A et al (1961) On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California
Rubinstein YD, Hastie T et al (1997) Discriminative vs informative learning. In: KDD, vol 5, pp 49–53
Scholkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
Shmilovici A (2010) Support vector machines. In: Data mining and knowledge discovery handbook. Springer, pp 231–247
Sivazlian B (1981) On a multivariate extension of the gamma and beta distributions. SIAM J Appl Math 41 (2):205–209
Song G, Dai Q (2017) A novel double deep elms ensemble system for time series forecasting. Knowl-Based Syst 134:31–49
Van Der Maaten L (2011) Learning discriminative fisher kernels. In: ICML, vol 11, pp 217–224
Vapnik V (2013) The nature of statistical learning theory. Springer Science & Business Media, New York
Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
Variani E, McDermott E, Heigold G (2015) A gaussian mixture model layer jointly optimized with discriminative features within a deep neural network architecture. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4270–4274
Vasconcelos N, Ho P, Moreno P (2004) The kullback-leibler kernel as a framework for discriminant and localized representations for visual recognition. In: European conference on computer vision. Springer, pp 430–441
Wang P, Sun L, Yang S, Smeaton AF (2015) Improving the classification of quantified self activities and behaviour using a fisher kernel. In: Adjunct Proceedings of the 2015 ACM international joint conference on pervasive and ubiquitous computing and proceedings of the 2015 ACM international symposium on wearable computers. ACM, pp 979–984
Winn J, Criminisi A, Minka T (2005) Object categorization by learned universal visual dictionary. In: 10th IEEE international conference on computer vision, 2005. ICCV 2005, vol 2. IEEE, pp 1800–1807
Wong TT (2009) Alternative prior assumptions for improving the performance of naïve bayesian classifiers. Data Min Knowl Disc 18(2):183–213
Zamzami N, Bouguila N (2018) Text modeling using multinomial scaled dirichlet distributions. In: International conference on industrial, engineering and other applications of applied intelligent systems. Springer, pp 69–80
Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene recognition using places database. In: Advances in neural information processing systems , pp 487–495
Appendices
Appendix A: Proof of (5)
The composition of the Multinomial and Scaled Dirichlet is obtained by the following integration:
Since the Scaled Dirichlet density integrates to one, \(\int_{\rho} \mathcal{SD}(\rho\,|\,\theta)\, d\rho = 1\), straightforward manipulation yields:
and we solve the integral using the following empirically found approximation: \(\left(\sum_{d=1}^{D} \beta_{d}\,\rho_{d}\right)^{\sum_{d=1}^{D} x_{d}} \simeq \prod_{d=1}^{D} \beta_{d}^{x_{d}}\), as follows:
Using this approximation to solve the integral in (29), we obtain (5).
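For reference, the composition referred to above is the integral of the multinomial \(\mathcal{M}(x|\rho)\) against the Scaled Dirichlet density over \(\rho\); the expression below is a sketch assuming the Scaled Dirichlet parametrization used by Monti et al. (2011) and Oboh and Bouguila (2017):

\[
\mathcal{MSD}(\boldsymbol{x}\,|\,\boldsymbol{\alpha},\boldsymbol{\beta})
  = \int_{\rho} \mathcal{M}(\boldsymbol{x}\,|\,\rho)\,
    \mathcal{SD}(\rho\,|\,\boldsymbol{\alpha},\boldsymbol{\beta})\, d\rho,
\qquad
\mathcal{SD}(\rho\,|\,\boldsymbol{\alpha},\boldsymbol{\beta})
  = \frac{\Gamma(\alpha_{+})}{\prod_{d=1}^{D}\Gamma(\alpha_{d})}\,
    \frac{\prod_{d=1}^{D}\beta_{d}^{\alpha_{d}}\,\rho_{d}^{\alpha_{d}-1}}
         {\big(\sum_{d=1}^{D}\beta_{d}\,\rho_{d}\big)^{\alpha_{+}}},
\qquad \alpha_{+}=\sum_{d=1}^{D}\alpha_{d}.
\]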
Appendix B: Newton Raphson approach
The complete-data log-likelihood corresponding to a K-component mixture is given by:
By computing the second and mixed derivatives of \(\mathcal{L}(\mathcal{X},\mathcal{Z}|\Theta)\) with respect to \(\alpha_{kd},\ d = 1,\dots,D\), we obtain:
where \(\Psi^{\prime}\) is the trigamma function. By computing the second and mixed derivatives of \(\mathcal{L}(\mathcal{X},\mathcal{Z}|\Theta)\) with respect to \(\beta_{kd},\ d = 1,\dots,D\), we obtain:
The mixed second derivatives of \(\mathcal{L}(\mathcal{X},\mathcal{Z}|\Theta)\) with respect to \(\alpha_{kd}\) and \(\beta_{kd}\), \(d = 1,\dots,D\), are zero.
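The derivatives above populate the gradient and Hessian used in the Newton-Raphson update of each component's parameters. The following is a minimal, generic sketch of one damped update step; grad_fn and hess_fn are placeholders for functions returning the gradient and Hessian of the complete-data log-likelihood with respect to one component's parameter vector, and the trigamma entries can be evaluated with scipy.special.polygamma.

```python
import numpy as np
from scipy.special import polygamma

def trigamma(x):
    """Trigamma function Psi'(x), appearing in the Hessian entries above."""
    return polygamma(1, x)

def newton_raphson_step(theta, grad_fn, hess_fn, ridge=1e-8):
    """One damped Newton-Raphson update: theta <- theta - H^{-1} g.

    grad_fn / hess_fn are placeholders for the gradient and Hessian of the
    complete-data log-likelihood with respect to one component's parameters.
    """
    g = grad_fn(theta)
    H = hess_fn(theta)
    H = H + ridge * np.eye(H.shape[0])   # guard against a near-singular Hessian
    return theta - np.linalg.solve(H, g)
```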
Appendix C: Proof of (19)
The KL-divergence between two exponential-family distributions is given by [40]:
where \(E_{\theta}\) is the expectation with respect to \(p(X|\theta)\). Moreover, we have the following [16]:
Thus, according to (14), we have:
where \(n=\sum_{d=1}^{D} x_{d}\) and \(\Psi(\cdot)\) is the digamma function. By substituting the previous two equations into Eq. (35), we obtain:
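As background for this step, the divergence has a standard closed form for any exponential-family density \(p(X|\theta) = h(X)\exp\big(\theta^{\top} T(X) - A(\theta)\big)\); the identity below is this general form only, not the specific expression derived in the paper:

\[
KL\big(p(X|\theta_{1})\,\|\,p(X|\theta_{2})\big)
  = E_{\theta_{1}}\!\big[\log p(X|\theta_{1}) - \log p(X|\theta_{2})\big]
  = A(\theta_{2}) - A(\theta_{1}) - (\theta_{2}-\theta_{1})^{\top} E_{\theta_{1}}\big[T(X)\big].
\]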
Appendix D: Proof of (23)
In the case of the EMSD distribution, we can show that:
Since the PDF of an EMSD distribution integrates to one, we have:
By substituting (40) into (39), we obtain:
Appendix E: Proof of (27)
By substituting (37) into the previous equation, we obtain the following:
Cite this article
Zamzami, N., Bouguila, N. Hybrid generative discriminative approaches based on Multinomial Scaled Dirichlet mixture models. Appl Intell 49, 3783–3800 (2019). https://doi.org/10.1007/s10489-019-01437-0