Abstract
This paper deals with a batch self-organizing map algorithm for data described by distributional-valued variables (DBSOM). Such variables take as values probability or frequency distributions on a numeric support. In keeping with the nature of the data, the loss function is based on the L2 Wasserstein distance, one of the most widely used metrics for comparing distributions in distributional data analysis. Moreover, to account for the different contributions of the variables, four adaptive versions of the DBSOM algorithm are proposed, in which relevance weights, one for each distributional-valued variable, are learned automatically in an additional step of the algorithm. Since the L2 Wasserstein metric allows the distance to be decomposed into two components, one related to the means and one related to the size and shape of the distributions, relevance weights are also learned for each of the two components, so as to emphasize the importance of the different characteristics, related to the moments of the distributions, in the distance value. The proposed algorithms are corroborated by applications to real distributional-valued data sets.
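To make the metric concrete, the following Python sketch (our illustration, not the authors' implementation; all function names are ours) approximates the squared L2 Wasserstein distance between two samples through their empirical quantile functions and checks numerically that it splits into a mean component and a size/shape component:

```python
import numpy as np

def squared_wasserstein(x, y, n_levels=1000):
    """Approximate the squared L2 Wasserstein distance between two samples
    by discretizing their empirical quantile functions on a grid of levels."""
    t = (np.arange(n_levels) + 0.5) / n_levels    # quantile levels in (0, 1)
    qx, qy = np.quantile(x, t), np.quantile(y, t)
    return np.mean((qx - qy) ** 2)                # approximates the integral over t

def wasserstein_components(x, y, n_levels=1000):
    """Split the squared distance into a mean component and a size/shape
    component (the distance between the centered quantile functions)."""
    t = (np.arange(n_levels) + 0.5) / n_levels
    qx, qy = np.quantile(x, t), np.quantile(y, t)
    mean_part = (qx.mean() - qy.mean()) ** 2
    shape_part = np.mean(((qx - qx.mean()) - (qy - qy.mean())) ** 2)
    return mean_part, shape_part

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)
y = rng.normal(2.0, 1.5, 10_000)
m, s = wasserstein_components(x, y)
print(squared_wasserstein(x, y), m + s)   # equal up to discretization error
```

The cross term between the mean difference and the centered quantile functions vanishes because the centered quantile functions integrate to zero, which is why the two components add up exactly.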
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Notes
We remark that gmj is distributional data because the corresponding quantile function Qmj is a weighted average of quantile functions. For further details, see Irpino and Verde (2015).
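As a minimal illustration of this remark (ours, not from the paper; the grid of levels and the weights are arbitrary choices), the following sketch checks that a convex combination of empirical quantile functions evaluated on a common grid of levels is non-decreasing, and hence a valid quantile function defining a distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
t = (np.arange(200) + 0.5) / 200                      # common grid of levels
samples = [rng.normal(m, s, 1000) for m, s in [(0, 1), (2, 0.5), (5, 2)]]
w = np.array([0.2, 0.3, 0.5])                         # non-negative weights summing to 1
Q = np.stack([np.quantile(x, t) for x in samples])    # one quantile function per row
Q_bar = w @ Q                                         # weighted average quantile function
assert np.all(np.diff(Q_bar) >= 0)                    # non-decreasing: a valid quantile function
```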
It is available at https://github.com/Airpino/ADBSOM
See the Appendix for the proof.
References
Altun, K, Barshan, B, & Tunçel, O (2010). Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition, 43(10), 3605–3620.
Badran, F, Yacoub, M, & Thiria, S (2005). Self-organizing maps and unsupervised classification. In G Dreyfus (Ed.) Neural Networks: methodology and applications (pp. 379–442). Singapore: Springer.
Bao, C, Peng, H, He, D, & Wang, J (2018). Adaptive fuzzy c-means clustering algorithm for interval data type based on interval-dividing technique. Pattern Analysis and Applications, 21, 803–812.
Barshan, B, & Yuksek, M C (2014). Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units. The Computer Journal, 57(11), 1649–1667.
Barshan, B, & Yurtman, A (2016). Investigating inter-subject and inter-activity variations in activity recognition using wearable motion sensors. The Computer Journal, 59(9), 1345–1362.
Bock, H H (2002). Clustering algorithms and Kohonen maps for symbolic data. Journal of the Japanese Society of Computational Statistics, 15, 1–13.
Bock, H H, & Diday, E. (2000). Analysis of symbolic data exploratory methods for extracting statistical information from complex data. Berlin: Springer.
Cabanes, G, Bennani, Y, Destenay, R, & Hardy, A (2013). A new topological clustering algorithm for interval data. Pattern Recognition, 46, 3030–3039.
Cabanes, G, Bennani, Y, Verde, R, & Irpino, A. (2021). On the use of Wasserstein metric in topological clustering of distributional data. https://arxiv.org/abs/2109.04301.
Campello, R J G B, & Hruschka, E R (2006). A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets and Systems, 157(21), 2858–2875.
de Carvalho, F A T, & De Souza, R M C R (2010). Unsupervised pattern recognition models for mixed feature–type symbolic data. Pattern Recognition Letters, 31, 430–443.
de Carvalho, F A T, & Lechevallier, Y (2009). Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recognition, 42(7), 1223–1236.
de Carvalho, F A T, Irpino, A, & Verde, R (2015). Fuzzy clustering of distribution-valued data using an adaptive L2 Wasserstein distance. In 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) (pp. 1–8). https://doi.org/10.1109/FUZZ-IEEE.2015.7337847.
de Carvalho, F A T, Bertrand, P, & Simões, E C (2016). Batch SOM algorithms for interval-valued data with automatic weighting of the variables. Neurocomputing, 182, 66–81.
Diday, E, & Govaert, G (1977). Classification automatique avec distances adaptatives. RAIRO Informatique Computer Science, 11(4), 329–349.
Diday, E, & Simon, J C (1976). Clustering analysis. In K Fu (Ed.) Digital pattern classification (pp. 47–94). Berlin: Springer.
D’Urso, P, & Giovanni, L D (2011). Midpoint radius self-organizing maps for interval-valued data with telecommunications application. Applied Soft Computing, 11, 3877–3886.
Friedman, J H, & Meulman, J J (2004). Clustering objects on subsets of attributes. Journal of the Royal Statistical Society: Serie B, 66, 815–849.
Gibbs, A L, & Su, F E (2002). On choosing and bounding probability metrics. International Statistical Review, 70(3), 419–435.
Hajjar, C, & Hamdan, H (2011a). Self-organizing map based on city-block distance for interval-valued data. In Complex Systems Design and Management (CSDM 2011) (pp. 281–292).
Hajjar, C, & Hamdan, H (2011b). Self-organizing map based on Hausdorff distance for interval-valued data. In IEEE International Conference on Systems, Man, and Cybernetics (SMC 2011) (pp. 1747–1752).
Hajjar, C, & Hamdan, H (2011c). Self-organizing map based on L2 distance for interval-valued data. In 6th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI 2011) (pp. 317–322).
Hajjar, C, & Hamdan, H (2013). Interval data clustering using self-organizing maps based on adaptive mahalanobis distances. Neural Networks, 46, 124–132.
Huang, J Z, Ng, M K, Rong, H, & Li, Z (2005). Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 657–668.
Hubert, L, & Arabie, P (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Hulse, D, Gregory, S, & Baker, J. (2002). Willamette River Basin: Trajectories of environmental and ecological change. Oregon State University Press. http://www.fsl.orst.edu/pnwerc/wrb/Atlas_web_compressed/PDFtoc.html.
Irpino, A. (2018). HistDAWass: Histogram data analysis using Wasserstein distance. R package version 1.0.1.
Irpino, A, & Romano, E. (2007). Optimal histogram representation of large data sets: Fisher vs piecewise linear approximation. Revue des Nouvelles Technologies de l’Information RNTI-E-9, 99–110.
Irpino, A, & Verde, R (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In V Batagelj, H H Bock, A Ferligoj, & A Žiberna (Eds.) Data Science and Classification (pp. 185–192). Berlin: Springer.
Irpino, A, & Verde, R (2015). Basic statistics for distributional symbolic variables: A new metric-based approach. Advances in Data Analysis and Classification, 9(2), 143–175.
Irpino, A, Verde, R, & de Carvalho, F A T (2014). Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. Expert Systems with Applications, 41(7), 3351–3366.
Irpino, A, Verde, R, & de Carvalho, F A T (2017). Fuzzy clustering of distributional data with automatic weighting of variable components. Information Sciences, 406-407, 248–268.
Kim, J, & Billard, L (2011). A polythetic clustering process and cluster validity indexes for histogram-valued objects. Computational Statistics and Data Analysis, 55(7), 2250–2262.
Kim, J, & Billard, L (2013). Dissimilarity measures for histogram-valued observations. Communications in Statistics - Theory and Methods, 42(2), 283–303.
Kiviluoto, K (1996). Topology preservation in self-organizing maps. In IEEE International Conference on Neural Networks 1996 (Vol. 1, pp. 294–299). https://doi.org/10.1109/ICNN.1996.548907.
Kohonen, T. (1995). Self-organizing maps. New York: Springer.
Kohonen, T (2013). Essentials of the self-organizing map. Neural Networks, 37(1), 52–65.
Kohonen, T. (2014). MATLAB Implementations and Applications of the Self-Organizing Map. Helsinki: Unigrafia Oy.
Korenjak-Černe, S, & Batagelj, V (2002). Symbolic data analysis approach to clustering large datasets (pp. 319–327). Berlin: Springer.
Manning, C, Raghavan, P, & Schütze, H. (2008). Introduction to information retrieval. New York: Cambridge University Press.
Meila, M (2007). Comparing clusterings – an information based distance. Journal of Multivariate Analysis, 98(5), 873–895.
Milligan, G W, & Cooper, M C (1988). A study of standardization of variables in cluster analysis. Journal of Classification, 5(2), 181–204. https://doi.org/10.1007/BF01897163.
Modha, D S, & Spangler, W S (2003). Feature weighting in k-means clustering. Machine Learning, 52(3), 217–237.
Mount, N J, & Weaver, D (2011). Self-organizing maps and boundary effects: Quantifying the benefits of torus wrapping for mapping som trajectories. Pattern Analysis and Applications, 14(2), 139–148.
Rousseeuw, P J (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Rüschendorf, L (2001). Wasserstein metric. In Encyclopedia of Mathematics. Springer.
Terada, T, & Yadohisa, H (2010). Non-hierarchical clustering for distribution-valued data. In Y Lechevallier & G Saporta (Eds.) Proceedings of COMPSTAT 2010 (pp. 1653–1660). Berlin: Springer.
Verde, R, & Irpino, A (2008a). Comparing histogram data using a Mahalanobis-Wasserstein distance. In P Brito (Ed.) Proceedings of COMPSTAT 2008 (pp. 77–89). Heidelberg: Physica-Verlag.
Verde, R, & Irpino, A (2008b). Dynamic clustering of histogram data: Using the right metric. In P Brito, P Bertrand, G Cucumel, & F de Carvalho (Eds.) Selected Contributions in Data Analysis and Classification (pp. 123–134). Berlin: Springer.
Verde, R, & Irpino, A (2018). Multiple factor analysis of distributional data. Statistica Applicata - Italian Journal of Applied Statistics. To appear.
Verde, R, Irpino, A, & Lechevallier, Y (2006). Dynamic clustering of histograms using Wasserstein metric. In A Rizzi & M Vichi (Eds.) Proceedings of COMPSTAT 2006 (pp. 869–876). Heidelberg: Physica-Verlag.
Vesanto, J, Himberg, J, Alhoniemi, E, & Parhankangas, J (1999). Self-organizing map in Matlab: the SOM Toolbox. In Proceedings of the Matlab DSP Conference (pp. 35–40).
Vrac, M, Billard, L, Diday, E, & Chedin, A (2012). Copula analysis of mixture models. Computational Statistics, 27, 427–457.
Zhang, L, Bing, Z, & Zhang, L (2015). A hybrid clustering algorithm based on missing attribute interval estimation for incomplete data. Pattern Analysis and Applications, 18, 377–384.
Acknowledgements
The authors would like to thank the anonymous referees for their careful reviews.
Funding
The first author would like to acknowledge the partial funding of the Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq (311164/2020-0) and of the University of Campania L. Vanvitelli (1049/2018).
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Fast Computation of the Silhouette Coefficient for Euclidean Spaces
Let us consider a table Y containing N numerical data vectors \(\mathbf{y}_i = [y_{i1},\ldots,y_{iP}]\) of P components. Without loss of generality, the N vectors are clustered into two groups, A and B, of sizes \(N_A\) and \(N_B\), respectively, such that \(N_A + N_B = N\). Let \(\bar{y}_{Aj} = N_A^{-1}\sum_{\mathbf{y}_i \in A} y_{ij}\) and \(\bar{y}_{Bj} = N_B^{-1}\sum_{\mathbf{y}_i \in B} y_{ij}\), for \(j = 1,\ldots,P\), be the two cluster means, and let \(\text{SSE}_{Aj} = \sum_{\mathbf{y}_i \in A} y_{ij}^{2} - N_A\,\bar{y}_{Aj}^{2}\) and \(\text{SSE}_{Bj} = \sum_{\mathbf{y}_i \in B} y_{ij}^{2} - N_B\,\bar{y}_{Bj}^{2}\) be the two sums of squared deviations from the respective cluster means.
1.1 The Average Euclidean Distance Between a Point and All the Other Points of the Set That Contains It
Let us consider \(\mathbf{y}_i \in A\); the average squared distance of \(\mathbf{y}_i\) with respect to all the other vectors in A is computed as follows:

$$a(i) = \frac{1}{N_A - 1}\sum\limits_{\mathbf{y}_l \in A,\, l \neq i}\sum\limits_{j=1}^{P}(y_{ij} - y_{lj})^{2}.$$

It is easy to show, for a single variable j, that:

$$\sum\limits_{\mathbf{y}_l \in A}(y_{ij} - y_{lj})^{2} = N_A\,(y_{ij} - \bar{y}_{Aj})^{2} + \text{SSE}_{Aj},$$

where the sum can run over all of A because the term with \(\mathbf{y}_l = \mathbf{y}_i\) contributes zero. Then, the average distance is:

$$a(i) = \frac{1}{N_A - 1}\sum\limits_{j=1}^{P}\left[N_A\,(y_{ij} - \bar{y}_{Aj})^{2} + \text{SSE}_{Aj}\right].$$
1.2 The Average Euclidean Distance Between a Point and All the Points of a Set That Does Not Contain It
Let us consider \(\mathbf{y}_i \in A\) and suppose we want to compute the average distance of \(\mathbf{y}_i\) with respect to all the vectors in B. The average squared Euclidean distance between \(\mathbf{y}_i\) and all the vectors in B, for each variable j, is given by:

$$b_j(i) = \frac{1}{N_B}\sum\limits_{\mathbf{y}_l \in B}(y_{ij} - y_{lj})^{2}.$$

For a single variable j, it is easy to show that:

$$\frac{1}{N_B}\sum\limits_{\mathbf{y}_l \in B}(y_{ij} - y_{lj})^{2} = (y_{ij} - \bar{y}_{Bj})^{2} + \frac{\text{SSE}_{Bj}}{N_B}.$$

Then, the average distance is:

$$b(i) = \sum\limits_{j=1}^{P}\left[(y_{ij} - \bar{y}_{Bj})^{2} + \frac{\text{SSE}_{Bj}}{N_B}\right].$$
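The following sketch (ours, for illustration only; the cluster sizes and the indices i and j are arbitrary) verifies both per-variable identities numerically on random data:

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(0.0, 1.0, size=(50, 4))   # cluster A: N_A = 50 points, P = 4
B = rng.normal(2.0, 1.5, size=(30, 4))   # cluster B: N_B = 30 points
i, j = 7, 2                              # an arbitrary point of A and a variable

sse_A = ((A[:, j] - A[:, j].mean()) ** 2).sum()   # SSE_Aj
sse_B = ((B[:, j] - B[:, j].mean()) ** 2).sum()   # SSE_Bj

# Within-cluster identity: sum_l (y_ij - y_lj)^2 = N_A (y_ij - mean_Aj)^2 + SSE_Aj
lhs_A = ((A[i, j] - A[:, j]) ** 2).sum()
rhs_A = len(A) * (A[i, j] - A[:, j].mean()) ** 2 + sse_A
assert np.isclose(lhs_A, rhs_A)

# Between-cluster identity: mean_l (y_ij - y_lj)^2 = (y_ij - mean_Bj)^2 + SSE_Bj / N_B
lhs_B = ((A[i, j] - B[:, j]) ** 2).mean()
rhs_B = (A[i, j] - B[:, j].mean()) ** 2 + sse_B / len(B)
assert np.isclose(lhs_B, rhs_B)
```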
1.3 The Silhouette Coefficient
As stated above, the silhouette coefficient for a single \(\mathbf{y}_i\) is given by:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

where, in the case of two groups A and B, if \(\mathbf{y}_i \in A\):

$$a(i) = \frac{1}{N_A - 1}\sum\limits_{\mathbf{y}_l \in A,\, l \neq i} d^{2}(\mathbf{y}_i, \mathbf{y}_l)$$

and

$$b(i) = \frac{1}{N_B}\sum\limits_{\mathbf{y}_l \in B} d^{2}(\mathbf{y}_i, \mathbf{y}_l),$$

with \(d^{2}(\mathbf{y}_i, \mathbf{y}_l) = \sum_{j=1}^{P}(y_{ij} - y_{lj})^{2}\) the squared Euclidean distance.
If we consider the original formulation, computing the N silhouette indexes has a computational complexity of order \(\mathcal{O}(N^{2})\), provided that N is sufficiently larger than P. If we use the formulas suggested here, once the cluster means and the SSE terms have been computed, the computational cost is approximately \(\mathcal{O}(N)\). In fact, it is possible to compute a(i) and b(i) as follows:

$$a(i) = \frac{1}{N_A - 1}\sum\limits_{j=1}^{P}\left[N_A\,(y_{ij} - \bar{y}_{Aj})^{2} + \text{SSE}_{Aj}\right]$$

and

$$b(i) = \sum\limits_{j=1}^{P}\left[(y_{ij} - \bar{y}_{Bj})^{2} + \frac{\text{SSE}_{Bj}}{N_B}\right].$$
Since the squared L2 Wasserstein distance is a Euclidean distance between quantile functions, the SSE computed for distribution functions is defined as a sum of squared distances, and the average distributions are Fréchet means with respect to the squared L2 Wasserstein distance, the same simplification can be applied for computing the silhouette coefficient when the data are distributions compared through the squared L2 Wasserstein distance.
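Putting the pieces together, here is a minimal sketch of the fast silhouette computation for two clusters; it is our illustrative implementation, not the authors' code, and it treats a(i) and b(i) as average squared Euclidean distances, consistent with the derivation above:

```python
import numpy as np

def fast_silhouette_two_clusters(YA, YB):
    """O(N) silhouette scores for the points of cluster A against cluster B,
    using only the cluster means and the SSE terms per variable."""
    NA, NB = len(YA), len(YB)
    mA, mB = YA.mean(axis=0), YB.mean(axis=0)    # cluster means, per variable
    sseA = ((YA - mA) ** 2).sum(axis=0)          # SSE_Aj for each variable j
    sseB = ((YB - mB) ** 2).sum(axis=0)          # SSE_Bj for each variable j
    a = (NA * (YA - mA) ** 2 + sseA).sum(axis=1) / (NA - 1)
    b = ((YA - mB) ** 2 + sseB / NB).sum(axis=1)
    return (b - a) / np.maximum(a, b)

# Cross-check against the direct O(N^2) computation on random data
rng = np.random.default_rng(2)
A = rng.normal(0.0, 1.0, (200, 3))
B = rng.normal(3.0, 1.0, (150, 3))
a_dir = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1).sum(1) / (len(A) - 1)
b_dir = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1).mean(1)
s_dir = (b_dir - a_dir) / np.maximum(a_dir, b_dir)
assert np.allclose(fast_silhouette_two_clusters(A, B), s_dir)
```

Because the fast routine never forms the pairwise distance matrix, its memory footprint is also linear in N, which matters when silhouette indexes are recomputed repeatedly during clustering.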
About this article
Cite this article
de Carvalho, F.d.A.T., Irpino, A., Verde, R. et al. Batch Self-Organizing Maps for Distributional Data with an Automatic Weighting of Variables and Components. J Classif 39, 343–375 (2022). https://doi.org/10.1007/s00357-022-09411-1