Abstract
Compound property assays are an important part of drug development, but assay data are often incomplete for a variety of reasons. To handle incomplete data and improve the success rate of drug development, researchers need to impute the missing values effectively. This paper therefore proposes a gene expression programming-based method, called GEP-CPI, for imputing missing compound property assay data. In GEP-CPI, a missing data imputation model is expressed by the parse tree of a chromosome, and the optimal imputation model is then mined by iteratively evolving the chromosome population. Experimental results on three compound property assay datasets demonstrate that the proposed method generally outperforms state-of-the-art methods in imputing missing compound property assay data.
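To make the idea of a chromosome that is expressed as a parse tree concrete, the following is a minimal Python sketch of standard gene expression programming decoding (breadth-first Karva notation, as in Ferreira's original formulation). It is not the GEP-CPI implementation: the function set, the feature names `a`, `b`, `c`, and the example gene are all hypothetical and chosen only for illustration. The evolutionary loop (selection, mutation, recombination) is omitted.

```python
import operator

# Hypothetical function set: symbol -> (callable, arity).
# Any symbol not listed here is treated as a terminal (an observed property).
FUNCTIONS = {
    "+": (operator.add, 2),
    "-": (operator.sub, 2),
    "*": (operator.mul, 2),
}


def decode(gene):
    """Decode a Karva-notation gene into a nested [symbol, children] tree.

    Symbols are read left to right and attached level by level
    (breadth-first). Trailing symbols that are never reached form the
    non-coding region and are ignored.
    """
    nodes = [[symbol, []] for symbol in gene]
    queue = [nodes[0]]
    next_index = 1
    while queue:
        node = queue.pop(0)
        arity = FUNCTIONS[node[0]][1] if node[0] in FUNCTIONS else 0
        for _ in range(arity):
            child = nodes[next_index]
            next_index += 1
            node[1].append(child)
            queue.append(child)
    return nodes[0]


def evaluate(tree, row):
    """Evaluate a decoded tree on one compound's observed property values."""
    symbol, children = tree
    if symbol in FUNCTIONS:
        func = FUNCTIONS[symbol][0]
        return func(*(evaluate(child, row) for child in children))
    return row[symbol]


# Hypothetical gene with head length 3 and tail length 4.
# It decodes to the expression (a * b) + c; the last two symbols are unused.
gene = ["+", "*", "c", "a", "b", "a", "c"]
tree = decode(gene)

# Observed properties of one compound (made-up values); the tree's output
# stands in for the imputed value of a missing property.
imputed = evaluate(tree, {"a": 2.0, "b": 3.0, "c": 4.0})
```

In a full GEP run, each chromosome in the population would be decoded this way, scored by how well its output matches the observed assay values, and then varied by genetic operators over successive generations.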
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (#62262044), the Natural Science Foundation of Guangxi Province (#2023GXNSFAA026027), and the Project of Guangxi Chinese Medicine Multidisciplinary Crossover Innovation Team (#GZKJ2311).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhou, H., Lin, Y., Chen, N., Peng, Y. (2024). Imputation of Compound Property Assay Data Using a Gene Expression Programming-Based Method. In: Huang, DS., Premaratne, P., Yuan, C. (eds) Applied Intelligence. ICAI 2023. Communications in Computer and Information Science, vol 2014. Springer, Singapore. https://doi.org/10.1007/978-981-97-0903-8_13
DOI: https://doi.org/10.1007/978-981-97-0903-8_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0902-1
Online ISBN: 978-981-97-0903-8
eBook Packages: Computer Science, Computer Science (R0)