Link to original content: https://doi.org/10.1145/3368640.3368653
A hybrid method for missing value imputation | Proceedings of the 23rd Pan-Hellenic Conference on Informatics
research-article

A hybrid method for missing value imputation

Published: 28 November 2019

Abstract

Missing values are a common occurrence in many real-world datasets across diverse domains of interest. In research, missing data constitute a significant problem, as they can affect the conclusions drawn from the data. Data preprocessing therefore becomes harder, since an inappropriate way of handling missing information can lead to untrustworthy results. Unfortunately, as in most cases in Machine Learning, no single solution fits every task related to the problem. For this reason, many strategies have been proposed to deal with this issue. One of the best known, and an efficient one, is imputation. Replacing a missing value with an estimate appears to eliminate the problem and yields complete datasets, but the difficulty shifts to selecting the right method to impute the missing values. A widely used imputation method, found in the libraries of the best-known statistical and Machine Learning suites, is IRMI. In this work, we propose a variant of IRMI that maintains the advantages of this well-known imputation method while outperforming the traditional variant used in many Machine Learning software tools. To achieve this, the benefits of boosting and of decision tree theory are exploited. To test the efficiency of our method, we executed a series of experiments over 30 datasets, measuring classification accuracy to show that the proposed method outperforms its rivals, which include both classic and more sophisticated imputation strategies. Finally, the results of our study are provided, along with the conclusions that arise from them.
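
As a rough illustration of the general idea of pairing iterative regression imputation with boosted tree learners, the sketch below may help. It is not the authors' algorithm, which the abstract does not specify. It assumes numeric data with `None` marking missing entries and at least one observed value per column. Gaps are first filled with column means, and each incomplete column is then repeatedly re-estimated from the other columns using a small gradient-boosted ensemble of decision stumps. The function names (`fit_stump`, `fit_boosted_stumps`, `iterative_impute`) and the parameters `n_rounds`, `lr`, and `n_iters` are hypothetical choices made for this example.

```python
from statistics import mean


def fit_stump(X, targets):
    """Return (feature, threshold, left_mean, right_mean) for the single
    split with the lowest squared error, or None if no split is possible."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, targets) if row[j] <= t]
            right = [r for row, r in zip(X, targets) if row[j] > t]
            lm, rm = mean(left), mean(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:] if best else None


def fit_boosted_stumps(X, y, n_rounds=20, lr=0.5):
    """Gradient boosting for squared loss: each stump fits the residuals."""
    base = mean(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:  # every feature is constant, nothing to split on
            break
        j, t, lm, rm = stump
        stumps.append(stump)
        preds = [p + lr * (lm if row[j] <= t else rm)
                 for row, p in zip(X, preds)]

    def predict(row):
        return base + sum(lr * (lm if row[j] <= t else rm)
                          for j, t, lm, rm in stumps)

    return predict


def drop(row, j):
    """Return the row without column j (the predictors for column j)."""
    return row[:j] + row[j + 1:]


def iterative_impute(data, n_iters=5):
    """Impute None entries column by column; returns a new list of rows."""
    n_cols = len(data[0])
    missing = [[v is None for v in row] for row in data]
    # Initial fill: mean of the observed values in each column (assumption).
    col_means = [mean(row[j] for row in data if row[j] is not None)
                 for j in range(n_cols)]
    filled = [[col_means[j] if v is None else v for j, v in enumerate(row)]
              for row in data]
    for _ in range(n_iters):
        for j in range(n_cols):
            if not any(m[j] for m in missing):
                continue  # column is complete, nothing to impute
            # Train only on rows where column j was actually observed.
            train_X = [drop(r, j) for r, m in zip(filled, missing) if not m[j]]
            train_y = [r[j] for r, m in zip(filled, missing) if not m[j]]
            model = fit_boosted_stumps(train_X, train_y)
            for r, m in zip(filled, missing):
                if m[j]:
                    r[j] = model(drop(r, j))
    return filled
```

Calling `iterative_impute(rows)` on a list of equal-length rows returns a completed copy and leaves the input untouched. Observed values are never overwritten; only the originally missing cells are updated on each pass.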

Cited By

  • (2022) Maintainability Challenges in ML: A Systematic Literature Review. 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 60-67. DOI: 10.1109/SEAA56994.2022.00018. Online publication date: Aug-2022
  • (2022) Systematic Review of Using Machine Learning in Imputing Missing Values. IEEE Access 10, 44483-44502. DOI: 10.1109/ACCESS.2022.3160841. Online publication date: 2022

    Published In

    PCI '19: Proceedings of the 23rd Pan-Hellenic Conference on Informatics
    November 2019
    165 pages
    ISBN:9781450372923
    DOI:10.1145/3368640

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data preprocessing
    2. imputation strategies
    3. machine learning
    4. missing values imputation

    Qualifiers

    • Research-article

    Conference

    PCI '19
    PCI '19: 23rd Pan-Hellenic Conference on Informatics
    November 28 - 30, 2019
    Nicosia, Cyprus

    Acceptance Rates

    PCI '19 Paper Acceptance Rate 18 of 35 submissions, 51%;
    Overall Acceptance Rate 190 of 390 submissions, 49%
