Abstract
Data quality in databases is a critical challenge because the cost of anomalies may be very high, especially for large databases. Correcting these anomalies has therefore become increasingly important in both enterprises and academia. In this work, we address the problems of intra-column and inter-column anomalies in big data. We propose a new approach for data cleaning that takes into account the semantic dependencies between the columns of a data source. The novelty of our proposal is that it uses data semantics to reduce the size of the search space in functional dependency discovery. In this paper, we present the first steps of our work, which make it possible to recognize the semantics of data and to correct intra-column anomalies.
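To make the notion of an inter-column anomaly concrete, the following minimal sketch checks whether a candidate functional dependency holds in a small table and reports the rows that violate it. It is an illustration only, not the method of the paper: the function name `fd_violations`, the column names `zip` and `city`, and the sample rows are all assumptions introduced here.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the lhs values for which the candidate dependency lhs -> rhs fails.

    rows: list of dicts (one dict per record); lhs: list of column names;
    rhs: a single column name. All names are hypothetical.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in lhs)
        seen[key].add(row[rhs])
    # A dependency lhs -> rhs holds only if each lhs value maps to one rhs value.
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical sample data: the last row breaks zip -> city.
rows = [
    {"zip": "75001", "city": "Paris"},
    {"zip": "75001", "city": "Paris"},
    {"zip": "69001", "city": "Lyon"},
    {"zip": "69001", "city": "Lyons"},
]

violations = fd_violations(rows, ["zip"], "city")
```

Testing every possible column combination in this way grows exponentially with the number of columns, which is why restricting the candidates, for instance with knowledge about what each column means, can matter for large data sources.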
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zaidi, H., Pollet, Y., Boufarès, F., Kraiem, N. (2015). Semantic of Data Dependencies to Improve the Data Quality. In: Bellatreche, L., Manolopoulos, Y. (eds) Model and Data Engineering. Lecture Notes in Computer Science, vol 9344. Springer, Cham. https://doi.org/10.1007/978-3-319-23781-7_5
DOI: https://doi.org/10.1007/978-3-319-23781-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23780-0
Online ISBN: 978-3-319-23781-7
eBook Packages: Computer Science, Computer Science (R0)