Hybrid generative discriminative approaches based on Multinomial Scaled Dirichlet mixture models

Abstract

Both generative and discriminative techniques for classification have made significant progress in recent years. Given the complementary capabilities and limitations of the two, hybrid generative discriminative approaches have received increasing attention. Our goal is to combine the advantages and desirable properties of generative models, i.e. finite mixtures, with Support Vector Machines (SVMs) as powerful discriminative techniques, for modeling count data, which appears in many machine learning and computer vision applications. In particular, we select accurate kernels generated from mixtures of the Multinomial Scaled Dirichlet distribution and its exponential approximation (EMSD) for support vector machines. We demonstrate the effectiveness and merits of the proposed framework on challenging real-world applications, namely object recognition and visual scene classification. Large-scale datasets such as Microsoft MOCR, Fruits-360 and MIT Places are considered in the empirical study.
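
To make the pipeline concrete, the following is a minimal sketch of the hybrid scheme described above: a generative mixture is fitted to count vectors, a kernel matrix is derived from the generative representation, and a discriminative SVM is trained on that precomputed kernel. The mixture fit and the kernel used here are crude placeholders (the paper's actual kernels come from MSD/EMSD mixtures and are derived in the appendices); all names, data and parameter values are illustrative.

```python
# Minimal sketch of a hybrid generative-discriminative pipeline on count data:
# a generative mixture yields per-sample features, from which a kernel matrix
# is built and fed to a discriminative SVM with a precomputed kernel.
# The mixture and kernel below are placeholders for the MSD/EMSD mixtures and
# distribution-based kernels derived in the paper; everything is illustrative.
import numpy as np
from sklearn.svm import SVC

def soft_responsibilities(counts, centers):
    """Stand-in for posterior responsibilities under a fitted mixture."""
    d2 = ((counts[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    r = np.exp(-d2 / (d2.mean() + 1e-9))
    return r / r.sum(axis=1, keepdims=True)

def gram_matrix(feats_a, feats_b, gamma=1.0):
    """Generic kernel on generative features (placeholder for the KL or
    Bhattacharyya kernels between fitted distributions)."""
    d2 = ((feats_a[:, None, :] - feats_b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X_train = rng.poisson(3.0, size=(40, 16)).astype(float)   # toy count vectors
y_train = np.repeat([0, 1], 20)
X_test = rng.poisson(3.0, size=(10, 16)).astype(float)

centers = X_train[rng.choice(len(X_train), 4, replace=False)]  # "fitted" components
R_train = soft_responsibilities(X_train, centers)
R_test = soft_responsibilities(X_test, centers)

svm = SVC(kernel="precomputed").fit(gram_matrix(R_train, R_train), y_train)
print(svm.predict(gram_matrix(R_test, R_train)))
```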

Notes

  1. The size of each vector depends on the image representation approach; in our case the vectors are 128-dimensional, since we represent each image as a bag of SIFT descriptors [46] (a small quantization sketch follows these notes).

  2. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data

  3. http://kdd.ics.uci.edu/databases/reuters21578

  4. https://cs.nyu.edu/~roweis/data.html
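
The representation referenced in note 1 can be sketched as follows: local 128-dimensional SIFT descriptors (assumed already extracted) are quantized against a learned visual vocabulary, and each image becomes a count vector over visual words. The vocabulary size and the descriptor arrays below are illustrative stand-ins.

```python
# Sketch of the bag-of-visual-words representation from note 1: each image is
# turned into a count vector by quantizing its 128-D SIFT descriptors against
# a visual vocabulary. Descriptor extraction is assumed done elsewhere; the
# random arrays below only stand in for per-image descriptor sets.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors_per_image = [rng.normal(size=(rng.integers(50, 200), 128))
                         for _ in range(5)]

vocab_size = 32
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
kmeans.fit(np.vstack(descriptors_per_image))          # learn the visual vocabulary

def bag_of_words(descriptors):
    """Count how often each visual word occurs in one image."""
    words = kmeans.predict(descriptors)
    return np.bincount(words, minlength=vocab_size)

counts = np.array([bag_of_words(d) for d in descriptors_per_image])
print(counts.shape)   # (n_images, vocab_size) count vectors fed to the mixtures
```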

References

  1. Agarwal A, Daumé H et al (2011) Generative kernels for exponential families. In: Proceedings of the 14th international conference on artificial intelligence and statistics, pp 85–92

  2. Amayri O, Bouguila N (2015) Beyond hybrid generative discriminative learning: spherical data classification. Pattern Anal Appl 18(1):113–133

  3. Banerjee A, Dhillon IS, Ghosh J, Sra S (2005) Clustering on the unit hypersphere using von mises-fisher distributions. J Mach Learn Res 6:1345–1382

  4. Bdiri T, Bouguila N (2013) Bayesian learning of inverted dirichlet mixtures for svm kernels generation. Neural Comput Appl 23(5):1443–1458

  5. Berk RA (2016) Support vector machines. In: Statistical learning from a regression perspective. Springer, pp 291–310

  6. Bishop C (2006) Pattern recognition and machine learning. Information science and statistics. Springer, New York

  7. Bishop C, Bishop CM et al (1995) Neural networks for pattern recognition. Oxford University Press, Oxford

  8. Bosch A, Muñoz X, Martí R (2007) Which is the best way to organize/classify images by content? Image Vis Comput 25(6):778–791

  9. Bouguila N (2008) Clustering of count data using generalized dirichlet multinomial distributions. IEEE Trans Knowl Data Eng 20(4):462–474

  10. Bouguila N (2011) Bayesian hybrid generative discriminative learning based on finite liouville mixture models. Pattern Recogn 44(6):1183–1200

  11. Bouguila N (2011) Count data modeling and classification using finite mixtures of distributions. IEEE Trans Neural Netw 22(2):186–198

  12. Bouguila N (2012) Hybrid generative/discriminative approaches for proportional data modeling and classification. IEEE Trans Knowl Data Eng 24(12):2184–2202

  13. Bouguila N (2013) Deriving kernels from generalized dirichlet mixture models and applications. Inf Process Manag 49(1):123–137

  14. Bouguila N, Amayri O (2009) A discrete mixture-based kernel for svms: application to spam and image categorization. Inf Process Manag 45(6):631–642

  15. Bouguila N, Ziou D (2007) Unsupervised learning of a finite discrete mixture: Applications to texture modeling and image databases summarization. J Vis Commun Image Represent 18(4):295–309

  16. Brown LD (1986) Fundamentals of statistical exponential families: with applications in statistical decision theory. IMS

  17. Burges CJ (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc 2 (2):121–167

  18. Campbell WM, Sturim DE, Reynolds DA (2006) Support vector machines using gmm super vectors for speaker verification. IEEE Signal Process Lett 13(5):308–311

  19. Chan AB, Vasconcelos N, Moreno PJ (2004) A family of probabilistic kernels based on information divergence. Univ. California, San Diego, CA, Tech. Rep SVCL-TR-2004-1

  20. Chang SK, Hsu A (1992) Image information systems: where do we go from here? IEEE Trans Knowl Data Eng 4(5):431–442

  21. Cristianini N, Shawe-Taylor J (2000) Support vector machines, vol 93. Cambridge University Press, Cambridge, pp 935–948

  22. Church KW, Gale WA (1995) Poisson mixtures. Nat Lang Eng 1(2):163–190

  23. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, ECCV. Prague, vol 1, pp 1–2

  24. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B Methodol 39(1):1–22

  25. Deng J, Xu X, Zhang Z, Frühholz S., Grandjean D, Schuller B (2017) Fisher kernels on phase-based features for speech emotion recognition. In: Dialogues with social robots. Springer, pp 195–203

  26. Elisseeff A, Weston J (2002) A kernel method for multi-labelled classification. In: Advances in neural information processing systems, pp 681–687

  27. Elkan C (2006) Clustering documents with an exponential-family approximation of the dirichlet compound multinomial distribution. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 289–296

  28. Erhan D, Bengio Y, Courville A, Manzagol PA, Vincent P, Bengio S (2010) Why does unsupervised pre-training help deep learning? J Mach Learn Res 11(Feb):625–660

  29. Fei-Fei L, Fergus R, Perona P (2007) Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. Comput Vis Image Underst 106(1):59–70

  30. Fei-Fei L, Perona P (2005) A bayesian hierarchical model for learning natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition, 2005. CVPR 2005, vol 2. IEEE, pp 524–531

  31. Ferrari V, Tuytelaars T, Van Gool L (2006) Object detection by contour segment networks. In: European conference on computer vision. Springer, pp 14–28

  32. Grauman K, Darrell T (2005) The pyramid match kernel: Discriminative classification with sets of image features. In: 10th IEEE international conference on computer vision, 2005. ICCV 2005, vol 2. IEEE, pp 1458–1465

  33. Gupta RD, Richards DSP (1987) Multivariate liouville distributions. J Multivar Anal 23(2):233–256

  34. Han X, Dai Q (2018) Batch-normalized mlpconv-wise supervised pre-training network in network. Appl Intell 48(1):142–155

  35. Hankin RK et al (2010) A generalization of the dirichlet distribution. J Stat Softw 33(11):1–18

  36. Jaakkola T, Haussler D (1999) Exploiting generative models in discriminative classifiers. In: Advances in neural information processing systems, pp 487–493

  37. Jebara T (2003) Images as bags of pixels. In: ICCV, pp 265–272

  38. Jebara T, Kondor R, Howard A (2004) Probability product kernels. J Mach Learn Res 5(Jul):819–844

  39. Jégou H, Douze M, Schmid C (2009) On the burstiness of visual elements. In: IEEE conference on computer vision and pattern recognition, 2009. CVPR 2009. IEEE, pp 1169–1176

  40. Kailath T (1967) The divergence and bhattacharyya distance measures in signal selection. IEEE Trans Commun Technol 15(1):52–60

  41. Katz SM (1996) Distribution of content words and phrases in text and language modelling. Nat Lang Eng 2 (1):15–59

  42. Keerthi SS, Lin CJ (2003) Asymptotic behaviors of support vector machines with gaussian kernel. Neural Comput 15(7):1667–1689

  43. Lin HT, Lin CJ (2003) A study on sigmoid kernels for svm and the training of non-psd kernels by smo-type methods. submitted to Neural Computation 3:1–32

  44. Lin J (1991) Divergence measures based on the shannon entropy. IEEE Trans Inf Theor 37(1):145–151

  45. Lochner RH (1975) A generalized dirichlet distribution in bayesian life testing. J R Stat Soc Ser B Methodol 37(1):103–113

  46. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110

  47. Ma Y, Guo G (2014) Support vector machines applications. Springer, New York

  48. Madsen RE, Kauchak D, Elkan C (2005) Modeling word burstiness using the dirichlet distribution. In: Proceedings of the 22nd international conference on machine learning. ACM, pp 545–552

  49. McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In: Proceedings of the AAAI-98 workshop on learning for text categorization, vol 752. Citeseer, pp 41–48

  50. McCallum AK (1996) Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow

  51. McLachlan G, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley, New Jersey

  52. Migliorati S, Monti GS, Ongaro A (2008) E–m algorithm: an application to a mixture model for compositional data. In: Proceedings of the 44th scientific meeting of the italian statistical society

  53. Moguerza JM, Muñoz A, et al. (2006) Support vector machines with applications. Stat Sci 21(3):322–336

  54. Monti GS, Mateu-Figueras G, Pawlowsky-Glahn V (2011) Compositional Data Analysis: Theory and Applications, chap. Notes on the scaled Dirichlet distribution. Wiley, Chichester. https://doi.org/10.1002/9781119976462.ch10

  55. Moreno PJ, Ho PP, Vasconcelos N (2004) A kullback-leibler divergence based kernel for svm classification in multimedia applications. In: Advances in neural information processing systems, pp 1385–1392

  56. Mosimann JE (1962) On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions. Biometrika 49(1/2):65–82

  57. Mureşan H, Oltean M (2018) Fruit recognition from images using deep learning. Acta Universitatis Sapientiae, Informatica 10(1):26–42

  58. Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive bayes. In: Advances in neural information processing systems, pp 841–848

  59. Oboh BS, Bouguila N (2017) Unsupervised learning of finite mixtures using scaled dirichlet distribution and its application to software modules categorization. In: Proceedings of the 2017 IEEE international conference on industrial technology (ICIT). IEEE, pp 1085–1090

  60. Van den Oord A, Schrauwen B (2014) Factoring variations in natural images with deep gaussian mixture models. In: Advances in neural information processing systems, pp 3518–3526

  61. Penny WD (2001) Kullback-leibler divergences of normal, gamma, dirichlet and wishart densities. Wellcome Department of Cognitive Neurology

  62. Pérez-Cruz F (2008) Kullback-leibler divergence estimation of continuous distributions. In: IEEE international symposium on information theory, 2008. ISIT 2008. IEEE, pp 1666–1670

  63. Raina R, Shen Y, Mccallum A, Ng AY (2004) Classification with hybrid generative/discriminative models. In: Advances in neural information processing systems, pp 545–552

  64. Rennie JDM, Shih L, Teevan J, Karger DR (2003) Tackling the poor assumptions of naive bayes text classifiers. In: Proceedings of the 20th international conference on machine learning ICML, vol 3, pp 616–623

  65. Rényi A et al (1961) On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California

  66. Rubinstein YD, Hastie T et al (1997) Discriminative vs informative learning. In: KDD, vol 5, pp 49–53

  67. Scholkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge

  68. Shmilovici A (2010) Support vector machines. In: Data mining and knowledge discovery handbook. Springer, pp 231–247

  69. Sivazlian B (1981) On a multivariate extension of the gamma and beta distributions. SIAM J Appl Math 41 (2):205–209

  70. Song G, Dai Q (2017) A novel double deep elms ensemble system for time series forecasting. Knowl-Based Syst 134:31–49

  71. Van Der Maaten L (2011) Learning discriminative fisher kernels. In: ICML, vol 11, pp 217–224

  72. Vapnik V (2013) The nature of statistical learning theory. Springer Science & Business Media, New York

  73. Vapnik VN (1995) The nature of statistical learning theory

  74. Variani E, McDermott E, Heigold G (2015) A gaussian mixture model layer jointly optimized with discriminative features within a deep neural network architecture. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4270–4274

  75. Vasconcelos N, Ho P, Moreno P (2004) The kullback-leibler kernel as a framework for discriminant and localized representations for visual recognition. In: European conference on computer vision. Springer, pp 430–441

  76. Wang P, Sun L, Yang S, Smeaton AF (2015) Improving the classification of quantified self activities and behaviour using a fisher kernel. In: Adjunct Proceedings of the 2015 ACM international joint conference on pervasive and ubiquitous computing and proceedings of the 2015 ACM international symposium on wearable computers. ACM, pp 979–984

  77. Winn J, Criminisi A, Minka T (2005) Object categorization by learned universal visual dictionary. In: 10th IEEE international conference on computer vision, 2005. ICCV 2005, vol 2. IEEE, pp 1800–1807

  78. Wong TT (2009) Alternative prior assumptions for improving the performance of naïve bayesian classifiers. Data Min Knowl Disc 18(2):183–213

  79. Zamzami N, Bouguila N (2018) Text modeling using multinomial scaled dirichlet distributions. In: International conference on industrial, engineering and other applications of applied intelligent systems. Springer, pp 69–80

  80. Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene recognition using places database. In: Advances in neural information processing systems , pp 487–495

Author information

Corresponding author

Correspondence to Nuha Zamzami.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of (5)

The composition of the Multinomial and Scaled Dirichlet is obtained by the following integration:

$$\begin{array}{@{}rcl@{}} \mathcal{M}\mathcal{S}\mathcal{D}(\mathbf{X}|\boldsymbol{\alpha},\boldsymbol{\beta} )&=& {\int}_{\rho} \mathcal{M}(\mathbf{X}|\boldsymbol{\rho})\, \mathcal{S}\mathcal{D}(\boldsymbol{\rho}|\boldsymbol{\alpha}, \boldsymbol{\beta})\, d\rho \\ &=& {\int}_{\rho} \frac{n!}{\prod\limits_{d = 1}^{D} x_{d}!} \prod\limits_{d = 1}^{D} \rho_{d}^{x_{d}}\, \frac{{\Gamma} (A)}{\prod\limits_{d = 1}^{D} {\Gamma}(\alpha_{d})} \frac{\prod\limits_{d = 1}^{D} \beta_{d}^{\alpha_{d}} \rho_{d}^{\alpha_{d}-1}}{\left( \sum\limits_{d = 1}^{D} \beta_{d} \rho_{d} \right)^{A}} d\rho \\ &=& \frac{n!}{\prod\limits_{d = 1}^{D} x_{d}!}\frac{{\Gamma} (A)}{\prod\limits_{d = 1}^{D} {\Gamma}(\alpha_{d})} \prod\limits_{d = 1}^{D} \beta_{d}^{\alpha_{d}} {\int}_{\rho} \frac{\prod \limits_{d = 1}^{D} \rho_{d}^{x_{d}+\alpha_{d}-1}}{\left( \sum\limits_{d = 1}^{D} \beta_{d} \rho_{d} \right)^{A}} d\rho \end{array} $$
(29)

Using the fact that a PDF integrates to one, i.e. \({\int }_{\rho } \mathcal {S}\mathcal {D}(\boldsymbol{\rho} |\boldsymbol{\alpha}, \boldsymbol{\beta}) d\rho = 1\), straightforward manipulation yields:

$$\begin{array}{@{}rcl@{}} && {\int}_{\rho} \frac{{\Gamma} \left( \sum\limits_{d = 1}^{D} \alpha_{d}\right)}{\prod\limits_{d = 1}^{D} {\Gamma}(\alpha_{d})} \frac{\prod\limits_{d = 1}^{D} \beta_{d}^{\alpha_{d}} \rho_{d}^{\alpha_{d}-1}}{\left( \sum\limits_{d = 1}^{D} \beta_{d} \rho_{d} \right)^{A}} d\rho = 1 \\ && \frac{{\Gamma} \left( \sum\limits_{d = 1}^{D} \alpha_{d}\right) \prod\limits_{d = 1}^{D} \beta_{d}^{\alpha_{d}}}{\prod\limits_{d = 1}^{D} {\Gamma}(\alpha_{d})} {\int}_{\rho} \frac{\prod\limits_{d = 1}^{D} \rho_{d}^{\alpha_{d}-1}}{\left( \sum\limits_{d = 1}^{D} \beta_{d} \rho_{d} \right)^{A}} d\rho = 1 \end{array} $$
(30)

and we solve the integral using the following empirically found approximation, \(\left ({\sum }_{d = 1}^{D} \beta _{d} \rho _{d}\right )^{{\sum }_{d = 1}^{D} x_{d}} \simeq {\prod }_{d = 1}^{D} \beta _{d}^{x_{d}}\), as follows:

$$\begin{array}{@{}rcl@{}} {\int}_{\rho} \frac{\prod\limits_{d = 1}^{D} \rho_{d}^{\alpha_{d}-1}}{\left( \sum\limits_{d = 1}^{D} \beta_{d} \rho_{d} \right)^{A}} d\rho = \frac{\prod\limits_{d = 1}^{D} {\Gamma}(\alpha_{d})}{{\Gamma} \left( \sum\limits_{d = 1}^{D} \alpha_{d}\right) \prod\limits_{d = 1}^{D} \beta_{d}^{\alpha_{d}}} \end{array} $$
(31)

Using this to solve the integration in (29), we obtain (5).
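
The compounding in the first line of (29) can also be checked numerically against whatever closed form is obtained for (5). Below is a minimal Monte Carlo sketch, assuming the standard representation of the Scaled Dirichlet as independent Gamma(α_d, rate β_d) variables normalized to the simplex (that sampling construction is an assumption made here, not part of the appendix); all parameter values are illustrative.

```python
# Monte Carlo evaluation of MSD(X | alpha, beta) = E_{rho ~ SD(alpha, beta)}[M(X | rho)],
# i.e. the compound integral in (29). rho is sampled by drawing independent
# Gamma(alpha_d, rate beta_d) variables and normalizing them to the simplex
# (a standard construction of the Scaled Dirichlet, assumed here).
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.5, 3.0])
beta = np.array([1.0, 0.7, 1.3])
x = np.array([4, 2, 6])                               # fixed count vector, n = 12

g = rng.gamma(shape=alpha, scale=1.0 / beta, size=(200_000, len(alpha)))
rho = g / g.sum(axis=1, keepdims=True)                # rho ~ ScaledDirichlet(alpha, beta)

# log Multinomial(x | rho) for every sampled rho, then average the likelihoods
log_pmf = gammaln(x.sum() + 1) - gammaln(x + 1).sum() + (x * np.log(rho)).sum(axis=1)
print("MSD(x | alpha, beta) ≈", np.exp(log_pmf).mean())
```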

Appendix B: Newton Raphson approach

The complete data log likelihood corresponding to a K-component mixture is given by:

$$ \mathcal{L}(\mathcal{X},\mathcal{Z}|{\Theta})=\sum\limits_{k = 1}^{K} \sum\limits_{i = 1}^{N} z_{ik} \left( \log \pi_{k} + \log p(\mathbf{X}_{i}|\theta_{k}) \right) $$
(32)

By computing the second and mixed derivatives of \( \mathcal {L}(\mathcal {X},\mathcal {Z}|{\Theta })\) with respect to \(\alpha _{kd},\ d = 1,\dots ,D\), we obtain:

$$\begin{array}{@{}rcl@{}} &&\frac{\partial^{2} \mathcal{L}(\mathcal{X},\mathcal{Z}|{\Theta})}{\partial\alpha_{kd_{1}}\partial\alpha_{kd_{2}}} = \\ &&\left\{\begin{array}{ll} \sum\limits_{i = 1}^{N} z_{ik} \left( {\Psi}^{\prime}(A)-{\Psi}^{\prime}(n_{i}+A)\right. \\ \left. +{\Psi}^{\prime}(x_{id}+\alpha_{kd})-{\Psi}^{\prime}(\alpha_{kd}) \right) &\text{if}\quad d_{1}=d_{2}=d, \\ \sum\limits_{i = 1}^{N} z_{ik} \left( {\Psi}^{\prime}(A)-{\Psi}^{\prime}(n_{i}+A) \right) & \text{otherwise,} \end{array}\right. \end{array} $$
(33)

where \({\Psi }^{\prime }\) is the trigamma function. By computing the second and mixed derivatives of \( \mathcal {L}(\mathcal {X},\mathcal {Z}|{\Theta })\) with respect to \(\beta _{kd},\ d = 1,\dots ,D\), we obtain:

$$ \frac{\partial^{2} \mathcal{L}(\mathcal{X},\mathcal{Z}|{\Theta})}{\partial\beta_{kd_{1}}\partial\beta_{kd_{2}}} = \left\{\begin{array}{ll} \sum\limits_{i = 1}^{N} z_{ik} \left( \frac{x_{id}}{\beta_{kd}^{2}} \right) & \text{if}\ d_{1}=d_{2}=d, \\ \\ 0 & \text{otherwise,} \end{array}\right. $$
(34)

The mixed second derivatives of \(\mathcal {L}(\mathcal {X},\mathcal {Z}|{\Theta })\) with respect to \(\alpha_{kd}\) and \(\beta_{kd}\), \(d = 1,\dots ,D\), are zero.
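
The second derivatives in (33) and (34) assemble into simple Hessian blocks (a constant matrix plus a diagonal correction for the α parameters, and a diagonal matrix for the β parameters), which is what a Newton-Raphson update consumes. A minimal sketch using SciPy's polygamma(1, ·) for the trigamma function follows; the counts, responsibilities and parameter values are illustrative, and the first derivatives required for the actual Newton step are not reproduced in this appendix.

```python
# Hessian blocks of the complete-data log-likelihood, following (33) and (34).
# Psi'(.) is the trigamma function, available as polygamma(1, .).
import numpy as np
from scipy.special import polygamma

def hessian_alpha(alpha_k, X, z_k):
    """D x D block of second derivatives w.r.t. alpha_k, eq. (33)."""
    A = alpha_k.sum()
    n = X.sum(axis=1)                                    # n_i = sum_d x_id
    common = (z_k * (polygamma(1, A) - polygamma(1, n + A))).sum()
    H = np.full((len(alpha_k), len(alpha_k)), common)    # off-diagonal entries
    diag_extra = (z_k[:, None] * (polygamma(1, X + alpha_k)
                                  - polygamma(1, alpha_k))).sum(axis=0)
    H[np.diag_indices_from(H)] += diag_extra             # d1 = d2 = d case
    return H

def hessian_beta(beta_k, X, z_k):
    """Diagonal block of second derivatives w.r.t. beta_k, eq. (34)."""
    return np.diag((z_k[:, None] * X / beta_k**2).sum(axis=0))

# illustrative data: N = 6 count vectors, D = 4, responsibilities for component k
rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(6, 4)).astype(float)
z_k = rng.random(6)
print(hessian_alpha(np.array([1.0, 2.0, 0.5, 1.5]), X, z_k))
print(hessian_beta(np.array([0.8, 1.2, 1.0, 0.6]), X, z_k))
```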

Appendix C: Proof of (19)

The KL divergence between two exponential-family distributions is given by [40]:

$$\begin{array}{@{}rcl@{}} KL(p(X|{\Theta}),p^{\prime}(X|{\Theta}^{\prime}))&=&{\Phi}(\theta)-{\Phi}(\theta^{\prime})\\&&+[G(\theta)-G(\theta^{\prime})]^{tr} E_{\theta}[T(X)]\\ \end{array} $$
(35)

where \(E_{\theta}\) is the expectation with respect to \(p(X|\theta)\). Moreover, we have the following [16]:

$$ E_{\theta}[T(X)]=-{\Phi}^{\prime}(\theta) $$
(36)

Thus, according to (14), we have:

$$\begin{array}{@{}rcl@{}} E_{\theta} \left[\sum\limits_{d = 1}^{D} I(x_{d} \geq 1)\right]&=&-\frac{\partial {\Phi}(\theta)}{\partial \lambda_{d}} \\&=&{\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right) -{\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) \\ E_{\theta} \left[\sum\limits_{d = 1}^{D} I(x_{d} \geq 1) x_{d}\right]&=&-\frac{\partial {\Phi}(\theta)}{\partial \nu_{d}} = 0 \end{array} $$
(37)

where \(n={\sum }_{d = 1}^{D} x_{d}\), and Ψ(.) is the digamma function. By substituting the previous two equations into Eq.(35), we obtain:

$$\begin{array}{@{}rcl@{}} KL &&(p(X|{\Theta}),p^{\prime}(X|{\Theta}^{\prime})) \\ &=&\log \left( {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right)\right)-\log \left( {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda^{\prime}_{d}\right)\right)\\ &&-\log\left( {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d} +n\right)\right)+\log \left( {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda^{\prime}_{d} +n\right)\right) \\ &&+{\sum}_{d = 1}^{D} \left( {\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right) -{\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) \right) (\lambda_{d}-\lambda^{\prime}_{d}) \\ &=& \log \left[\frac{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}\right).{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda^{\prime}_{d} +n\right)}{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda^{\prime}_{d}\right).{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d} +n\right)}\right] \\ &&+{\sum}_{d = 1}^{D} \left( {\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right) -{\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) \right) (\lambda_{d}-\lambda^{\prime}_{d})\\ \end{array} $$
(38)
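
Equation (38) is a closed form that only needs log-Gamma and digamma evaluations, so the divergence is cheap to compute; a kernel can then be built from it, e.g. by exponentiating a (symmetrized) negative divergence as in the KL-kernel literature cited in the references. A minimal sketch with illustrative parameter values:

```python
# Closed-form KL divergence between two EMSD distributions, following (38).
import numpy as np
from scipy.special import gammaln, psi   # psi = digamma

def kl_emsd(lam, lam_p, n):
    """KL(p || p') for EMSD parameters lam, lam_p and total count n."""
    s, s_p = lam.sum(), lam_p.sum()
    log_ratio = (gammaln(s) + gammaln(s_p + n)) - (gammaln(s_p) + gammaln(s + n))
    return log_ratio + ((psi(s + n) - psi(s)) * (lam - lam_p)).sum()

lam = np.array([1.2, 0.8, 2.5, 0.4])      # illustrative lambda for p
lam_p = np.array([1.0, 1.0, 2.0, 0.5])    # illustrative lambda' for p'
n = 50                                    # total count, n = sum_d x_d
print(kl_emsd(lam, lam_p, n))             # note: KL is not symmetric
```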

Appendix D: Proof of (23)

In the case of the EMSD distribution, we can show that:

$$\begin{array}{@{}rcl@{}} {\int}_{0}^{+\infty} & p&(\mathbf{X}|{\Theta})^{\sigma} p^{\prime}(\mathbf{X}|{\Theta}^{\prime})^{1-\sigma} dX= \\ && \left[\frac{{\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right)}{{\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right)}\right]^{\sigma} \left[\frac{{\Gamma}\left( \sum\limits_{d = 1}^{D} {\lambda^{\prime}_{d}}\right)}{{\Gamma}\left( \sum\limits_{d = 1}^{D} {\lambda^{\prime}_{d}}+\sum\limits_{d = 1}^{D} x_{d}\right)}\right]^{1-\sigma} \\ &\times& {\int}_{0}^{+\infty} \left[\frac{n!}{\prod\limits_{d = 1}^{D} x_{d}!} \prod\limits_{d = 1}^{D} \frac{ \lambda_{d}}{\nu_{d}^{x_{d}}}\right]^{\sigma} dX \\ &\times& {\int}_{0}^{+\infty} \left[\frac{n!}{\prod\limits_{d = 1}^{D} x_{d}!} \prod\limits_{d = 1}^{D} \frac{\lambda^{\prime}_{d}}{{\nu^{\prime}_{d}}^{x_{d}}}\right]^{1-\sigma} dX \\ &=&\left[\frac{{\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) }{{\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+\sum\limits_{d = 1}^{D} x_{d}\right)}\right]^{\sigma} \left[\frac{{\Gamma}\left( \sum\limits_{d = 1}^{D} {\lambda^{\prime}_{d}}\right)}{{\Gamma}\left( \sum\limits_{d = 1}^{D} {\lambda^{\prime}_{d}}+n\right)}\right]^{1-\sigma} \\ &\times& {\int}_{0}^{+\infty} \frac{n!}{\prod\limits_{d = 1}^{D} x_{d}!} \prod\limits_{d = 1}^{D} \lambda_{d} \nu_{d}^{-\sigma x_{d}} dX \\ &\times& {\int}_{0}^{+\infty} \frac{n!}{\prod\limits_{d = 1}^{D} x_{d}!} \prod\limits_{d = 1}^{D} \lambda^{\prime}_{d}{\nu^{\prime}_{d}}^{-x_{d}+\sigma x_{d}} dX \end{array} $$
(39)

Since the PDF of an EMSD distribution integrates to one, we have:

$$ {\int}_{0}^{+\infty} \frac{n!}{{\prod}_{d = 1}^{D} x_{d}!} \prod\limits_{d = 1}^{D} \frac{\lambda_{d}}{\nu_{d}^{x_{d}}}dX=\frac{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d} + {\sum}_{d = 1}^{D} x_{d}\right)}{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}\right) } $$
(40)

By substituting (40) into (39), we obtain:

$$\begin{array}{@{}rcl@{}} {\int}_{0}^{+\infty} & p&(\mathbf{X}|{\Theta})^{\sigma} p^{\prime}(\mathbf{X}|{\Theta}^{\prime})^{1-\sigma} dX= \\ && \left[\frac{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}\right) }{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}+n\right)}\right]^{\sigma} \left[\frac{{\Gamma}\left( {\sum}_{d = 1}^{D} {\lambda^{\prime}_{d}}\right)}{{\Gamma}\left( {\sum}_{d = 1}^{D} {\lambda^{\prime}_{d}}+{\sum}_{d = 1}^{D} x_{d}\right)}\right]^{1-\sigma} \\ &\times& \frac{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}+{\sum}_{d = 1}^{D} -\sigma x_{d}\right)}{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}\right)}\\ &\times& \frac{{\Gamma}\left( {\sum}_{d = 1}^{D} {\lambda^{\prime}_{d}}+{\sum}_{d = 1}^{D} -x_{d}+\sigma x_{d}\right)}{{\Gamma}\left( {\sum}_{d = 1}^{D} {\lambda^{\prime}_{d}}\right)} \end{array} $$
(41)
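
The left-hand side of (39) is a probability product kernel (σ = 1/2 gives the Bhattacharyya affinity). Since the count vectors with a fixed total n form a finite set, closed forms such as (41) can be sanity-checked by brute-force summation. The sketch below does this for a generic pmf over count vectors; a plain multinomial stands in for the EMSD, purely for illustration, and all values are hypothetical.

```python
# Brute-force probability product kernel sum_X p(X)^sigma * p'(X)^(1-sigma)
# over all count vectors X with a fixed total n (sigma = 0.5: Bhattacharyya).
# A multinomial pmf stands in for the EMSD here, purely for illustration.
import itertools
import numpy as np
from scipy.special import gammaln

def log_multinomial_pmf(x, p):
    x = np.asarray(x, dtype=float)
    return gammaln(x.sum() + 1) - gammaln(x + 1).sum() + (x * np.log(p)).sum()

def count_vectors(n, D):
    """All length-D nonnegative integer vectors summing to n (stars and bars)."""
    for cuts in itertools.combinations(range(n + D - 1), D - 1):
        bounds = (-1,) + cuts + (n + D - 1,)
        yield [bounds[i + 1] - bounds[i] - 1 for i in range(D)]

def product_kernel(p, p_prime, n, sigma=0.5):
    total = 0.0
    for x in count_vectors(n, len(p)):
        total += np.exp(sigma * log_multinomial_pmf(x, p)
                        + (1 - sigma) * log_multinomial_pmf(x, p_prime))
    return total

p = np.array([0.5, 0.3, 0.2])
p_prime = np.array([0.4, 0.4, 0.2])
print(product_kernel(p, p_prime, n=10))   # in (0, 1]; equals 1 when p == p_prime
```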

Appendix E: Proof of (27)

$$\begin{array}{@{}rcl@{}} H[p(\mathbf{X}|{\Theta})]&=&- {\int}_{0}^{+ \infty} p(\mathbf{X}|{\Theta}) \log p(\mathbf{X}|{\Theta}) dX \\ &=&- {\int}_{0}^{+ \infty} p(\mathbf{X}|{\Theta}) \left[\log {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) \right.\\&&\left.-\log {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right)\right. \\ &&+\sum\limits_{d = 1}^{D} \log(\lambda_{d}) E_{\theta}[I(x_{d} \geq 1)] \\ &&\left.-\sum\limits_{d = 1}^{D} \log(\nu_{d}) E_{\theta} [I(x_{d} \geq 1) x_{d}]\right] \end{array} $$
(42)

By substituting (37) into the previous equation, we obtain the following:

$$\begin{array}{@{}rcl@{}} H[p(\mathbf{X}|{\Theta})]&=&-\log {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) +\log {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right) \\ &-&\sum\limits_{d = 1}^{D} \log(\lambda_{d}) \left( {\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right) \right.\\&&\left.-{\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) \right) \end{array} $$
(43)
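
Equation (43) gives the entropy in closed form from the λ parameters and the total count n alone. A minimal sketch with illustrative values:

```python
# Closed-form entropy of an EMSD distribution, following (43).
import numpy as np
from scipy.special import gammaln, psi   # psi = digamma

def emsd_entropy(lam, n):
    s = lam.sum()
    return (-gammaln(s) + gammaln(s + n)
            - (np.log(lam) * (psi(s + n) - psi(s))).sum())

lam = np.array([1.2, 0.8, 2.5, 0.4])   # illustrative lambda parameters
print(emsd_entropy(lam, n=50))         # entropy for total count n = 50
```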

Cite this article

Zamzami, N., Bouguila, N. Hybrid generative discriminative approaches based on Multinomial Scaled Dirichlet mixture models. Appl Intell 49, 3783–3800 (2019). https://doi.org/10.1007/s10489-019-01437-0
