Abstract
Data units are often described by discrete distributions (for example, a work described by the distribution of its citations over time, or a population pyramid described as an age-sex distribution). When the set of such units is very large, appropriate clustering methods can reveal the typical patterns hidden in the data.
In this paper we present an adapted leaders method combined with a compatible adapted agglomerative hierarchical method; both are based on a relative error measure between a unit and the corresponding cluster representative, the leader. The proposed approach is illustrated on citation distributions derived from the data set of US patents from 1980 to 1999. These new methods were developed because clustering units described by distributions with the classical k-means method reveals patterns with single high peaks, each corresponding to a single year. These patterns prevail over the other distribution shapes also present in the data. Compared with the centers in the k-means method, the cluster representatives obtained with the proposed methods better detect the typical distribution shapes of units. The main cluster types obtained for different sets of units show three main patterns: patents with an early peak of importance to the community, patents with a late peak, and patents whose importance slowly increases throughout the time period.
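The leaders step described above can be illustrated with a minimal sketch. The abstract does not state the exact relative error measure, so the sketch assumes one possible choice, d(x, t) = Σ_i (x_i − t_i)² / t_i. For this assumed measure, the leader that minimizes the total error within a cluster has the closed form t_i = sqrt(mean of x_i²). The function names `rel_error`, `leader_of`, and `leaders_method` are hypothetical and are not taken from the paper or from the clustddist package.

```python
import math
import random


def rel_error(x, t, eps=1e-12):
    """Assumed relative error between unit x and leader t:
    sum_i (x_i - t_i)^2 / t_i (eps guards against division by zero)."""
    return sum((xi - ti) ** 2 / max(ti, eps) for xi, ti in zip(x, t))


def leader_of(units):
    """Leader minimizing the summed rel_error over a cluster.
    Setting the derivative to zero gives t_i = sqrt(mean of x_i^2)."""
    n = len(units)
    dim = len(units[0])
    return [math.sqrt(sum(u[i] ** 2 for u in units) / n) for i in range(dim)]


def leaders_method(units, k, iters=50, seed=0):
    """k-means-like iteration: assign each unit to its nearest leader,
    then recompute each leader from the units assigned to it."""
    rng = random.Random(seed)
    leaders = [list(u) for u in rng.sample(units, k)]  # initial leaders
    assign = None
    for _ in range(iters):
        new_assign = [
            min(range(k), key=lambda c: rel_error(x, leaders[c]))
            for x in units
        ]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for c in range(k):
            members = [x for x, a in zip(units, assign) if a == c]
            if members:  # keep the old leader if a cluster becomes empty
                leaders[c] = leader_of(members)
    return assign, leaders


# Toy data: two early-peaked and two late-peaked distributions.
units = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
         [0.1, 0.2, 0.7], [0.1, 0.3, 0.6]]
assign, leaders = leaders_method(units, k=2)
```

On this toy input the two early-peaked units should end up in one cluster and the two late-peaked units in the other. Because the leader is a root mean square and not an arithmetic mean, it differs from a k-means center under this assumed measure.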
Additional information
The authors would like to thank the anonymous referees for many valuable comments and suggestions on how to improve this paper. This work was partially supported by the Slovenian Research Agency, Project J1-6062-0101.
Cite this article
Kejžar, N., Korenjak-Černe, S. & Batagelj, V. Clustering of Distributions: A Case of Patent Citations. J Classif 28, 156–183 (2011). https://doi.org/10.1007/s00357-011-9084-x
DOI: https://doi.org/10.1007/s00357-011-9084-x