Abstract
In data mining it is common to describe a group of measurements using summary statistics or empirical distribution functions. Symbolic data analysis (SDA) addresses the treatment of such data, allowing the description and analysis of conceptual data or of macrodata that summarize classical data. Within the conceptual framework of SDA, this paper presents new basic statistics for distribution-valued variables, i.e., variables whose realizations are distributions. The proposed measures extend classical univariate (mean, variance, standard deviation) and bivariate (covariance and correlation) basic statistics to distribution-valued variables, taking into account the nature and the variability of such data. The novel statistics are based on a distance between distributions, the \(\ell _2\) Wasserstein distance. A comparison with other univariate and bivariate statistics presented in the literature highlights some relevant properties of the proposed ones. An application to a clinical dataset shows the main differences in terms of interpretation of results.
Notes
We joined together the three tables describing the three histogram variables that are presented in different sections of the book.
Appendix A: Proof of the decomposition of the \(\ell _2\) squared Wasserstein distance
Let \(\phi _i\) and \(\phi _{i'}\) be two density functions with finite first two moments. The density function \(\phi _i\) is in one-to-one correspondence with its cumulative distribution function \(\varvec{\varPhi }_i\) and its quantile function \(\varvec{\varPhi }_i^{-1}\) (the inverse of the distribution function). The expected value of \(\phi _i\) is denoted by \(\mu _i\) and its standard deviation by \(\sigma _i\). In this appendix we prove the result in Eq. (15).
First of all, we note that
$$\mu = \int _{-\infty }^{+\infty } y\,\phi (y)\, dy = \int _0^1 \varvec{\varPhi }^{-1}(t)\, dt, \qquad (38)$$
where \(t = \varvec{\varPhi }(y)\), \(\varvec{\varPhi }(-\infty )=0\), \(\varvec{\varPhi }(+\infty )=1\) and \( y = \varvec{\varPhi }^{ - 1} (\varvec{\varPhi }(y)) = \varvec{\varPhi }^{ - 1} (t)\). Analogously, for \(\sigma ^2\) we have that:
$$\sigma ^2 = \int _0^1 \left( \varvec{\varPhi }^{-1}(t) - \mu \right) ^2 dt = \int _0^1 \left[ \varvec{\varPhi }^{-1}(t)\right] ^2 dt - \mu ^2. \qquad (39)$$
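As a numerical sanity check of Eqs. (38) and (39), the mean and variance of a distribution can be recovered by integrating its quantile function on a grid over \((0,1)\). The following sketch is illustrative only (not part of the paper): `quantile_moments` is a hypothetical helper, and Python's `statistics.NormalDist` merely supplies a quantile function to test against known moments.

```python
from statistics import NormalDist

def quantile_moments(quantile, n=100_000):
    """Approximate Eqs. (38)-(39) with a midpoint rule on t in (0, 1)."""
    ts = [(k + 0.5) / n for k in range(n)]
    qs = [quantile(t) for t in ts]
    mu = sum(qs) / n                             # Eq. (38)
    var = sum(q * q for q in qs) / n - mu * mu   # Eq. (39)
    return mu, var

# N(2, 3) has mean 2 and variance 9; the quantile-based estimates agree.
d = NormalDist(mu=2.0, sigma=3.0)
mu, var = quantile_moments(d.inv_cdf)
print(round(mu, 2), round(var, 1))  # close to 2.0 and 9.0
```

The midpoint rule slightly underweights the tails, but with a fine grid the error is negligible for distributions with finite second moments.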
We develop the squared term of the distance and, using Eqs. (38) and (39), we obtain:
$$d_W^2(\phi _i,\phi _{i'}) = \int _0^1 \left( \varvec{\varPhi }_i^{-1}(t) - \varvec{\varPhi }_{i'}^{-1}(t)\right) ^2 dt = \mu _i^2 + \sigma _i^2 + \mu _{i'}^2 + \sigma _{i'}^2 - 2\int _0^1 \varvec{\varPhi }_i^{-1}(t)\,\varvec{\varPhi }_{i'}^{-1}(t)\, dt. \qquad (40)$$
Now we introduce the following quantity:
$$\rho _{i,i'} = \frac{\int _0^1 \left( \varvec{\varPhi }_i^{-1}(t)-\mu _i\right) \left( \varvec{\varPhi }_{i'}^{-1}(t)-\mu _{i'}\right) dt}{\sigma _i\,\sigma _{i'}} = \frac{\int _0^1 \varvec{\varPhi }_i^{-1}(t)\,\varvec{\varPhi }_{i'}^{-1}(t)\, dt - \mu _i\,\mu _{i'}}{\sigma _i\,\sigma _{i'}}, \qquad (41)$$
that is, the correlation of two series of data where each pair of observations consists of the \(t\)th quantile of the first distribution and the \(t\)th quantile of the second. In this sense it may be regarded as the correlation between the quantile functions, represented by the curve of the infinitely many quantile points in a Q–Q plot. It is worth noting that, if \(\sigma _i\) and \(\sigma _{i'}\) are positive, then \(0 < \rho _{i,i'} \le 1\), with \(\rho _{i,i'} = 1\) exactly when the two standardized series of quantiles coincide, or, in other words, when the two distributions are identical up to their means and standard deviations. Using the last term of \(\rho _{i,i'}\) in Eq. (41), we observe that
$$\int _0^1 \varvec{\varPhi }_i^{-1}(t)\,\varvec{\varPhi }_{i'}^{-1}(t)\, dt = \sigma _i\,\sigma _{i'}\,\rho _{i,i'} + \mu _i\,\mu _{i'}.$$
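The interpretation of \(\rho _{i,i'}\) as a correlation between quantile functions can be illustrated numerically. This sketch is an illustrative assumption, not code from the paper: `quantile_corr` discretizes Eq. (41) on a grid, and two normal quantile functions (same shape up to location and scale) should give a value of 1, while a normal against a uniform should give a value strictly below 1.

```python
from statistics import NormalDist

def quantile_corr(q1, q2, n=50_000):
    """Discretized Eq. (41): correlation of the two quantile series."""
    ts = [(k + 0.5) / n for k in range(n)]
    x = [q1(t) for t in ts]
    y = [q2(t) for t in ts]
    mx, my = sum(x) / n, sum(y) / n
    sx = (sum(v * v for v in x) / n - mx * mx) ** 0.5
    sy = (sum(v * v for v in y) / n - my * my) ** 0.5
    cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
    return cov / (sx * sy)

norm = NormalDist(0.0, 1.0).inv_cdf
shifted = NormalDist(5.0, 2.0).inv_cdf

def uniform(t):  # quantile function of U(0, 1)
    return t

r_same = quantile_corr(norm, shifted)   # identical shapes: rho = 1
r_diff = quantile_corr(norm, uniform)   # different shapes: rho < 1
print(round(r_same, 4), round(r_diff, 4))
```

Because both series of quantiles are non-decreasing in \(t\), the covariance in the numerator is never negative, which is why \(\rho _{i,i'}\) stays in \((0, 1]\) rather than \([-1, 1]\).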
Thus, we continue developing Eq. (40) as follows:
$$d_W^2(\phi _i,\phi _{i'}) = \mu _i^2 + \mu _{i'}^2 + \sigma _i^2 + \sigma _{i'}^2 - 2\left( \sigma _i\,\sigma _{i'}\,\rho _{i,i'} + \mu _i\,\mu _{i'}\right) = \left( \mu _i - \mu _{i'}\right) ^2 + \sigma _i^2 + \sigma _{i'}^2 - 2\,\sigma _i\,\sigma _{i'}\,\rho _{i,i'}.$$
Finally, adding and subtracting \(2\sigma _i \sigma _{i'}\), we obtain Eq. (15):
$$d_W^2(\phi _i,\phi _{i'}) = \left( \mu _i - \mu _{i'}\right) ^2 + \left( \sigma _i - \sigma _{i'}\right) ^2 + 2\,\sigma _i\,\sigma _{i'}\left( 1 - \rho _{i,i'}\right) .$$
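The decomposition proved above can be verified numerically by comparing the direct integral of the squared quantile difference with the three-component form. The sketch below is an illustrative assumption (not from the paper); it uses two normal distributions, for which \(\rho _{i,i'} = 1\) and the third component vanishes.

```python
from statistics import NormalDist

# Midpoint grid on t in (0, 1) for the quantile-function integrals.
n = 100_000
ts = [(k + 0.5) / n for k in range(n)]
q1 = NormalDist(0.0, 1.0).inv_cdf  # mu=0, sigma=1
q2 = NormalDist(3.0, 2.0).inv_cdf  # mu=3, sigma=2

# Direct definition: integral of the squared quantile difference, Eq. (40).
d2 = sum((q1(t) - q2(t)) ** 2 for t in ts) / n

# Decomposition with rho = 1 (same shape): (mu difference)^2 + (sigma difference)^2.
decomp = (0.0 - 3.0) ** 2 + (1.0 - 2.0) ** 2
print(round(d2, 3), decomp)  # both close to 10
```

For distributions of different shapes, the residual term \(2\sigma _i\sigma _{i'}(1-\rho _{i,i'})\) would make up the gap between `decomp` and `d2`.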
Irpino, A., Verde, R. Basic statistics for distributional symbolic variables: a new metric-based approach. Adv Data Anal Classif 9, 143–175 (2015). https://doi.org/10.1007/s11634-014-0176-4