A Survey on Data Quality Dimensions and Tools for Machine Learning

Zhou, Yuhan; Tu, Fengjiao; Sha, Kewei; Ding, Junhua; Chen, Haihua

arXiv:2406.19614 (cs)

[Submitted on 28 Jun 2024]

Abstract: Machine learning (ML) technologies have become integral to practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools from the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Building on a discussion of the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement for ML. We believe this comprehensive survey can enhance understanding of DQ in ML and drive progress in data-centric AI. A complete list of the literature investigated in this survey is available on GitHub at: this https URL.

Comments: This paper has been accepted by the 6th IEEE International Conference on Artificial Intelligence Testing (IEEE AITest 2024) as an invited paper.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2406.19614 [cs.LG] (or arXiv:2406.19614v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2406.19614

Submission history

From: Haihua Chen
[v1] Fri, 28 Jun 2024 02:41:33 UTC (379 KB)