Abstract
Curating the records of an authority file is an activity as important as it is demanding for many organizations, which have to rely on experts equipped with so-called authority control tools capable of automatically supporting complex disambiguation workflows through user-friendly interfaces. This paper presents PACE, an open source authority control tool which offers user interfaces for (i) customizing the structure (ontology) of authority files, (ii) tuning the probabilistic disambiguation of authority files through a set of similarity functions for detecting records that are candidates for duplication and overload, (iii) curating such authority files by applying record merge and split actions, and (iv) exposing authority files to third-party consumers in several ways. PACE’s back-end is based on Cassandra’s “NoSQL” technology to offer (i) read-write performance that scales linearly with the number of records and (ii) parallel and efficient (MapReduce-based) record sorting and matching algorithms.
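To make the idea of similarity-based duplicate detection over sorted records more concrete, the following is a minimal single-machine sketch in Python. It is not PACE's implementation: the abstract does not specify the similarity functions, the sorting key, or the thresholds, so the function names, field names, weights, window size, and threshold below are all hypothetical, and the string similarity is simply the one provided by Python's standard `difflib` module. The sketch sorts records by a key and compares each record only with its nearest neighbours in that order, which is one common way to avoid comparing every pair.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in similarity function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted mean of per-field similarities; the weights are hypothetical tuning knobs."""
    total = sum(weights.values())
    score = sum(
        w * field_similarity(r1.get(field, ""), r2.get(field, ""))
        for field, w in weights.items()
    )
    return score / total


def duplicate_candidates(records, sort_key, weights, window=3, threshold=0.85):
    """Sort records, then compare each one with the next (window - 1) records only."""
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            score = record_similarity(rec, other, weights)
            if score >= threshold:
                pairs.append((rec["id"], other["id"], round(score, 3)))
    return pairs


# Illustrative records; identifiers and field names are made up for this sketch.
records = [
    {"id": "a1", "name": "Smith, John", "org": "University of Pisa"},
    {"id": "a2", "name": "Zhang, Wei", "org": "Tsinghua University"},
    {"id": "a3", "name": "Smith, Jon", "org": "University of Pisa"},
]
weights = {"name": 2.0, "org": 1.0}

pairs = duplicate_candidates(records, lambda r: r["name"].lower(), weights)
for left, right, score in pairs:
    print(f"candidate duplicate: {left} ~ {right} (score {score})")
```

In this sketch the two "Smith" records end up adjacent after sorting and are flagged as a candidate pair, while the unrelated record is not. A curator would then decide whether to merge them. Raising the window trades more comparisons for a lower chance of missing duplicates whose sort keys differ.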
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Manghi, P., Mikulicic, M. (2011). PACE: A General-Purpose Tool for Authority Control. In: García-Barriocanal, E., Cebeci, Z., Okur, M.C., Öztürk, A. (eds) Metadata and Semantic Research. MTSR 2011. Communications in Computer and Information Science, vol 240. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24731-6_8
DOI: https://doi.org/10.1007/978-3-642-24731-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24730-9
Online ISBN: 978-3-642-24731-6
eBook Packages: Computer Science, Computer Science (R0)