Abstract
In data mining it is common to describe a group of measurements using summary statistics or empirical distribution functions. Symbolic data analysis (SDA) addresses the treatment of such data, allowing the description and analysis of conceptual data or of macrodata that summarize classical data. Within the conceptual framework of SDA, this paper presents new basic statistics for distribution-valued variables, i.e., variables whose realizations are distributions. The proposed measures extend classical univariate (mean, variance, standard deviation) and bivariate (covariance and correlation) basic statistics to distribution-valued variables, taking into account the nature and the variability of such data. The novel statistics are based on a distance between distributions, the \(\ell _2\) Wasserstein distance. A comparison with other univariate and bivariate statistics presented in the literature highlights some relevant properties of the proposed ones. An application to a clinical dataset shows the main differences in terms of interpretation of results.
Notes
We joined together the three tables describing the three histogram variables that are presented in different sections of the book.
Appendix A: Proof of the decomposition of the \(\ell _2\) squared Wasserstein distance
Let \(\phi _i\) and \(\phi _{i'}\) be two density functions with finite first two moments. The density function \(\phi _i\) is in one-to-one correspondence with its cumulative distribution function \(\varvec{\varPhi }_i\) and its quantile function \(\varvec{\varPhi }_i^{-1}\) (the inverse of the distribution function). The expected value of \(\phi _i\) is denoted by \(\mu _i\) and its standard deviation by \(\sigma _i\). In this appendix we prove the result in Eq. (15).
First of all, we note that
$$\mu = \int _{-\infty }^{+\infty } y\,\phi (y)\, dy = \int _0^1 \varvec{\varPhi }^{-1}(t)\, dt, \qquad (38)$$
where \(t = \varvec{\varPhi }(y)\), \(\varvec{\varPhi }(-\infty )=0\), \(\varvec{\varPhi }(+\infty )=1\) and \( y = \varvec{\varPhi }^{ - 1} (\varvec{\varPhi }(y)) = \varvec{\varPhi }^{ - 1} (t)\). Analogously, for \(\sigma ^2\) we have that:
$$\sigma ^2 = \int _0^1 \left( \varvec{\varPhi }^{-1}(t) - \mu \right) ^2 dt = \int _0^1 \left[ \varvec{\varPhi }^{-1}(t)\right] ^2 dt - \mu ^2. \qquad (39)$$
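As a numerical sanity check of Eqs. (38) and (39), the mean and variance of a distribution can be recovered by integrating its quantile function on a grid over \((0,1)\). The following sketch is illustrative only (not part of the paper): `quantile_moments` is a hypothetical helper, and Python's `statistics.NormalDist` merely supplies a quantile function to test against known moments.

```python
from statistics import NormalDist

def quantile_moments(quantile, n=100_000):
    """Approximate Eqs. (38)-(39) with a midpoint rule on t in (0, 1)."""
    ts = [(k + 0.5) / n for k in range(n)]
    qs = [quantile(t) for t in ts]
    mu = sum(qs) / n                             # Eq. (38)
    var = sum(q * q for q in qs) / n - mu * mu   # Eq. (39)
    return mu, var

# N(2, 3) has mean 2 and variance 9; the quantile-based estimates agree.
d = NormalDist(mu=2.0, sigma=3.0)
mu, var = quantile_moments(d.inv_cdf)
print(round(mu, 2), round(var, 1))  # close to 2.0 and 9.0
```

The midpoint rule slightly underweights the tails, but with a fine grid the error is negligible for distributions with finite second moments.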
We develop the squared term of the distance and, using Eqs. (38) and (39), we obtain:
$$d_W^2(\phi _i,\phi _{i'}) = \int _0^1 \left( \varvec{\varPhi }_i^{-1}(t) - \varvec{\varPhi }_{i'}^{-1}(t)\right) ^2 dt = \mu _i^2 + \sigma _i^2 + \mu _{i'}^2 + \sigma _{i'}^2 - 2\int _0^1 \varvec{\varPhi }_i^{-1}(t)\,\varvec{\varPhi }_{i'}^{-1}(t)\, dt. \qquad (40)$$
Now we introduce the following quantity:
$$\rho _{i,i'} = \frac{\int _0^1 \left( \varvec{\varPhi }_i^{-1}(t)-\mu _i\right) \left( \varvec{\varPhi }_{i'}^{-1}(t)-\mu _{i'}\right) dt}{\sigma _i\,\sigma _{i'}} = \frac{\int _0^1 \varvec{\varPhi }_i^{-1}(t)\,\varvec{\varPhi }_{i'}^{-1}(t)\, dt - \mu _i\,\mu _{i'}}{\sigma _i\,\sigma _{i'}}, \qquad (41)$$
that is, the correlation of two series of data where each pair of observations consists of the \(t\)th quantile of the first distribution and the \(t\)th quantile of the second. In this sense it may be regarded as the correlation between the quantile functions, represented by the curve of the infinitely many quantile points in a Q–Q plot. It is worth noting that, if \(\sigma _i\) and \(\sigma _{i'}\) are positive, then \(0 < \rho _{i,i'} \le 1\), with \(\rho _{i,i'} = 1\) exactly when the two standardized series of quantiles coincide, or, in other words, when the two distributions are identical up to their means and standard deviations. Using the last term of \(\rho _{i,i'}\) in Eq. (41), we observe that
$$\int _0^1 \varvec{\varPhi }_i^{-1}(t)\,\varvec{\varPhi }_{i'}^{-1}(t)\, dt = \sigma _i\,\sigma _{i'}\,\rho _{i,i'} + \mu _i\,\mu _{i'}.$$
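The interpretation of \(\rho _{i,i'}\) as a correlation between quantile functions can be illustrated numerically. This sketch is an illustrative assumption, not code from the paper: `quantile_corr` discretizes Eq. (41) on a grid, and two normal quantile functions (same shape up to location and scale) should give a value of 1, while a normal against a uniform should give a value strictly below 1.

```python
from statistics import NormalDist

def quantile_corr(q1, q2, n=50_000):
    """Discretized Eq. (41): correlation of the two quantile series."""
    ts = [(k + 0.5) / n for k in range(n)]
    x = [q1(t) for t in ts]
    y = [q2(t) for t in ts]
    mx, my = sum(x) / n, sum(y) / n
    sx = (sum(v * v for v in x) / n - mx * mx) ** 0.5
    sy = (sum(v * v for v in y) / n - my * my) ** 0.5
    cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
    return cov / (sx * sy)

norm = NormalDist(0.0, 1.0).inv_cdf
shifted = NormalDist(5.0, 2.0).inv_cdf

def uniform(t):  # quantile function of U(0, 1)
    return t

r_same = quantile_corr(norm, shifted)   # identical shapes: rho = 1
r_diff = quantile_corr(norm, uniform)   # different shapes: rho < 1
print(round(r_same, 4), round(r_diff, 4))
```

Because both series of quantiles are non-decreasing in \(t\), the covariance in the numerator is never negative, which is why \(\rho _{i,i'}\) stays in \((0, 1]\) rather than \([-1, 1]\).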
Thus, we continue developing Eq. (40) as follows:
$$d_W^2(\phi _i,\phi _{i'}) = \mu _i^2 + \mu _{i'}^2 + \sigma _i^2 + \sigma _{i'}^2 - 2\left( \sigma _i\,\sigma _{i'}\,\rho _{i,i'} + \mu _i\,\mu _{i'}\right) = \left( \mu _i - \mu _{i'}\right) ^2 + \sigma _i^2 + \sigma _{i'}^2 - 2\,\sigma _i\,\sigma _{i'}\,\rho _{i,i'}.$$
Finally, adding and subtracting \(2\sigma _i \sigma _{i'}\), we obtain Eq. (15):
$$d_W^2(\phi _i,\phi _{i'}) = \left( \mu _i - \mu _{i'}\right) ^2 + \left( \sigma _i - \sigma _{i'}\right) ^2 + 2\,\sigma _i\,\sigma _{i'}\left( 1 - \rho _{i,i'}\right) .$$
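The decomposition proved above can be verified numerically by comparing the direct integral of the squared quantile difference with the three-component form. The sketch below is an illustrative assumption (not from the paper); it uses two normal distributions, for which \(\rho _{i,i'} = 1\) and the third component vanishes.

```python
from statistics import NormalDist

# Midpoint grid on t in (0, 1) for the quantile-function integrals.
n = 100_000
ts = [(k + 0.5) / n for k in range(n)]
q1 = NormalDist(0.0, 1.0).inv_cdf  # mu=0, sigma=1
q2 = NormalDist(3.0, 2.0).inv_cdf  # mu=3, sigma=2

# Direct definition: integral of the squared quantile difference, Eq. (40).
d2 = sum((q1(t) - q2(t)) ** 2 for t in ts) / n

# Decomposition with rho = 1 (same shape): (mu difference)^2 + (sigma difference)^2.
decomp = (0.0 - 3.0) ** 2 + (1.0 - 2.0) ** 2
print(round(d2, 3), decomp)  # both close to 10
```

For distributions of different shapes, the residual term \(2\sigma _i\sigma _{i'}(1-\rho _{i,i'})\) would make up the gap between `decomp` and `d2`.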
Irpino, A., Verde, R. Basic statistics for distributional symbolic variables: a new metric-based approach. Adv Data Anal Classif 9, 143–175 (2015). https://doi.org/10.1007/s11634-014-0176-4