Abstract
Virtual screening (VS) methods have been shown to increase success rates in many drug discovery campaigns when they complement experimental approaches such as high-throughput screening or classical medicinal chemistry. Nevertheless, the predictive capability of VS is not yet optimal, mainly because of limitations in the underlying physical principles used to describe drug binding. One way to improve VS methods is to combine them with machine learning: when enough experimental data are available to train such methods, predictive capability can increase considerably. In this work we show how a multi-objective evolutionary search strategy for feature selection, which yields small and accurate decision trees that chemists can easily understand, can drastically increase the applicability and predictive ability of these techniques and therefore aid considerably in drug discovery. With the proposed methodology, we find classification models with accuracy between 0.9934 and 1.00 and area under the ROC curve between 0.96 and 1.00 when evaluated on the full training sets, and accuracy between 0.9849 and 0.9940 and area under the ROC curve between 0.89 and 0.93 when evaluated with tenfold cross-validation over 30 iterations, while substantially reducing the model size.
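To make the multi-objective idea concrete, the following minimal sketch shows how candidate feature subsets could be compared when two objectives are traded off, here assumed to be classification accuracy (to maximize) and decision tree size (to minimize). The feature names, scores, and the helper names `dominates` and `pareto_front` are hypothetical illustrations and are not taken from the article; the evolutionary search itself and the authors' actual objective functions are not reproduced.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical candidate: (selected features, accuracy, tree size in nodes).
Candidate = Tuple[FrozenSet[str], float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both objectives and better on one.

    Assumed objectives: maximize accuracy (index 1), minimize tree size (index 2).
    """
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative, made-up scores; in practice each would come from training
# and evaluating a decision tree on the corresponding feature subset.
candidates: List[Candidate] = [
    (frozenset({"f1", "f2", "f3", "f4"}), 0.994, 15),
    (frozenset({"f1", "f3"}), 0.990, 5),
    (frozenset({"f2"}), 0.950, 3),
    (frozenset({"f1", "f2"}), 0.985, 7),  # worse than {"f1", "f3"} on both objectives
]

front = pareto_front(candidates)
for features, accuracy, size in front:
    print(sorted(features), accuracy, size)
```

A non-dominated set like this leaves the final choice to the user: a chemist may prefer the smaller tree when the loss in accuracy is negligible.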
Acknowledgements
This study was supported by computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain. This work was partially funded by the Fundación Séneca del Centro de Coordinación de la Investigación de la Región de Murcia under Project 18946/JLI/13.
Ethics declarations
Conflict of interest
Author Fernando Jiménez Barrionuevo declares that he has no conflict of interest. Author Horacio Pérez Sánchez declares that he has no conflict of interest. Author José Palma Méndez declares that he has no conflict of interest. Author Gracia Sánchez Carpena declares that she has no conflict of interest. Author Carlos Martínez Cortés declares that he has no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Jiménez, F., Pérez-Sánchez, H., Palma, J. et al. A methodology for evaluating multi-objective evolutionary feature selection for classification in the context of virtual screening. Soft Comput 23, 8775–8800 (2019). https://doi.org/10.1007/s00500-018-3479-0
DOI: https://doi.org/10.1007/s00500-018-3479-0