Abstract
Variable selection has long been a central topic in linear regression modeling, especially for high-dimensional data. Variable ranking, an advanced form of selection, is in fact more fundamental, since selection can be realized by thresholding once the variables have been suitably ranked. In recent years, ensemble learning has attracted significant interest in the context of variable selection owing to its potential to improve selection accuracy and to reduce the risk of falsely including unimportant variables. Motivated by the widespread success of boosting algorithms, a novel ensemble method, PBoostGA, is developed in this paper to implement variable ranking and selection in linear regression models. PBoostGA maintains a weight distribution over the training set and adopts a genetic algorithm as its base learner. Initially, each instance is assigned an equal weight. Then, following a weight-updating and ensemble-generation mechanism similar to that of AdaBoost.RT, a series of slightly different importance measures is sequentially produced for each variable. Finally, the candidate variables are ranked according to their average importance measures, and the significant variables are selected by a thresholding rule. Both simulation results and a real-data illustration demonstrate the effectiveness of PBoostGA in comparison with several existing counterparts; in particular, PBoostGA shows a stronger ability to exclude redundant variables.
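To make the described pipeline concrete, the sketch below gives one plausible reading of the loop in Python. It is a minimal illustration under stated assumptions, not the authors' implementation: the tiny stand-in GA (`ga_select`), its BIC-style fitness, the AdaBoost.RT constants (`phi` and the power applied to the error rate), and the final mean-based cut-off are all placeholders chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def wls_rss(X, y, w, mask):
    """Weighted residual sum of squares of a linear fit on the variables in `mask`."""
    if not mask.any():
        return float(np.sum(w * (y - np.average(y, weights=w)) ** 2))
    sw = np.sqrt(w)
    Xs = X[:, mask]
    coef, *_ = np.linalg.lstsq(sw[:, None] * Xs, sw * y, rcond=None)
    return float(np.sum(w * (y - Xs @ coef) ** 2))

def ga_select(X, y, w, pop=30, gens=40, pmut=0.05):
    """Tiny stand-in GA over binary inclusion masks; fitness is a BIC-like score."""
    n, p = X.shape
    def fitness(m):
        return -(n * np.log(wls_rss(X, y, w, m) / n + 1e-12) + np.log(n) * m.sum())
    P = rng.random((pop, p)) < 0.5                        # initial population of masks
    for _ in range(gens):
        f = np.array([fitness(m) for m in P])
        a, b = rng.integers(pop, size=pop), rng.integers(pop, size=pop)
        P = P[np.where(f[a] > f[b], a, b)]                # binary tournament selection
        mates = P[rng.permutation(pop)]
        P = np.where(rng.random(P.shape) < 0.5, P, mates)  # uniform crossover
        P ^= rng.random(P.shape) < pmut                    # bit-flip mutation
    f = np.array([fitness(m) for m in P])
    return P[f.argmax()]

def pboostga(X, y, T=20, phi=0.2, power=1):
    """PBoostGA-style loop: boosted GA runs yield averaged variable importances."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)                               # uniform initial weights
    imp, total = np.zeros(p), 0.0
    for _ in range(T):
        mask = ga_select(X, y, w)
        # refit on the selected subset and compute absolute relative errors
        Xs = X[:, mask] if mask.any() else np.ones((n, 1))
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(sw[:, None] * Xs, sw * y, rcond=None)
        are = np.abs(Xs @ coef - y) / np.maximum(np.abs(y), 1e-12)
        miss = are > phi                                  # AdaBoost.RT: "error" if ARE > phi
        eps = np.clip(w[miss].sum(), 1e-6, 1 - 1e-6)
        beta_t = eps ** power
        alpha = np.log(1.0 / beta_t)                      # member weight, as in AdaBoost.RT
        w[~miss] *= beta_t                                # well-fitted points lose weight
        w /= w.sum()
        imp += alpha * mask                               # accumulate weighted inclusion
        total += alpha
    return imp / total                                    # average importance per variable

# toy usage: 5 informative variables out of 20
n, p = 100, 20
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.array([3.0, 2.0, 1.5, 1.0, 0.5]) + rng.standard_normal(n)
scores = pboostga(X, y)
selected = np.flatnonzero(scores > scores.mean())         # naive threshold, for illustration
print(selected)
```

The paper defines the importance measures and the thresholding rule precisely; the mean-based cut-off above merely shows where such a rule would plug in.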
Notes
The authors are grateful to an anonymous referee for providing this insight.
References
Breiman L (1996a) Heuristics of instability and stabilization in model selection. Ann Stat 24(6):2350–2383
Breiman L (1996b) Bagging predictors. Mach Learn 24(2):123–140
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Bühlmann P, Hothorn T (2007) Boosting algorithms: regularization, prediction and model fitting. Stat Sci 22(4):477–505
Bühlmann P, Hothorn T (2010) Twin boosting: improved feature selection and prediction. Stat Comput 20(2):119–138
Bühlmann P, Mandozzi J (2014) High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput Stat 29(3–4):407–430
Bühlmann P, van de Geer S (2010) Statistics for high-dimensional data: methods, theory and applications. Springer, New York
Chatterjee S, Laudato M, Lynch LA (1996) Genetic algorithms and their statistical applications: an introduction. Comput Stat Data Anal 22(6):633–651
Drucker H (1997) Improving regressors using boosting techniques. In: Proceedings of the 14th international conference on machine learning. Morgan Kaufmann, San Francisco, pp 107–115
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Fan JQ, Li RZ (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Fan JQ, Lv JC (2008) Sure independence screening for ultrahigh dimensional feature space (with discussions). J R Stat Soc B 70(5):849–911
Fan JQ, Lv JC (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20(1):101–148
Freund Y, Schapire R (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232
Guo L, Boukir S (2013) Margin-based ordered aggregation for ensemble pruning. Pattern Recognit Lett 34:603–609
He HB, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Jadhav NH, Kashid DN, Kulkarni SR (2014) Subset selection in multiple linear regression in the presence of outlier and multicollinearity. Stat Methodol 19:44–59
Liu C, Shi T, Lee Y (2014) Two tales of variable selection for high dimensional regression: screening and model building. Stat Anal Data Min 7(2):140–159
Meinshausen N, Bühlmann P (2010) Stability selection (with discussion). J R Stat Soc B 72(4):417–473
Mendes-Moreira J, Soares C, Jorge AM, de Sousa JF (2012) Ensemble approaches for regression: a survey. ACM Comput Surv 45(1):1–40 (Article 10)
Miller A (2002) Subset selection in regression, 2nd edn. Chapman & Hall, New York
Rokach L (2009) Taxonomy for characterizing ensemble methods in classification tasks: a review and annotated bibliography. Comput Stat Data Anal 53(12):4046–4072
Sauerbrei W, Buchholz A, Boulesteix A, Binder H (2015) On stability issues in deriving multivariable regression models. Biom J 57(4):531–555
Shah RD, Samworth RJ (2013) Variable selection with error control: another look at stability selection. J R Stat Soc B 75(1):55–80
Shmueli G (2010) To explain or to predict? Stat Sci 25(3):289–310
Shrestha DL, Solomatine DP (2006) Experiments with AdaBoost.RT, an improved boosting scheme for regression. Neural Comput 18(7):1678–1710
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58(1):267–288
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc B 63(2):411–423
Wang SJ, Nan B, Rosset S, Zhu J (2011) Random lasso. Ann Appl Stat 5(1):468–485
Xin L, Zhu M (2012) Stochastic stepwise ensembles for variable selection. J Comput Graph Stat 21(2):275–294
Zhang C, Ma YQ (2012) Ensemble machine learning: methods and applications. Springer, New York
Zhang CX, Wang GW (2014) Boosting variable selection algorithm for linear regression models. In: Proceedings of the 10th international conference on natural computation. IEEE Press, China, pp 769–774
Zhang CX, Wang GW, Liu JM (2015a) RandGA: injecting randomness into parallel genetic algorithm for variable selection. J Appl Stat 42(3):630–647
Zhang CX, Zhang JS, Wang GW (2015b) A novel bagging ensemble approach for variable ranking and selection for linear regression models. In: The 12th international workshop on multiple classifier systems, Günzburg, Germany. LNCS, vol 9132, pp 3–14
Zhou ZH (2012) Ensemble methods: foundations and algorithms. Taylor & Francis, Boca Raton
Zhu M, Chipman HA (2006) Darwinian evolution in parallel universes: a parallel genetic algorithm for variable selection. Technometrics 48(4):491–502
Zhu M, Fan GZ (2011) Variable selection by ensembles for the Cox model. J Stat Comput Simul 81(12):1983–1992
Zhu XY, Yang YH (2015) Variable selection after screening: with or without data splitting? Comput Stat 30(1):191–203
Acknowledgments
The authors are very grateful to the anonymous referees and the editor for their critical comments, which greatly improved the presentation. This research was supported by the National Basic Research Program of China (973 Program, No. 2013CB329406), the National Natural Science Foundation of China (Nos. 11201367, 91230101, 61572393), the National Research Foundation of Korea (NRF-2012R1A1A2041661), and the Basic Research Program of Natural Science of Shaanxi Province of China (No. 2015JQ1002).
Cite this article
Zhang, CX., Zhang, JS. & Kim, SW. PBoostGA: pseudo-boosting genetic algorithm for variable ranking and selection. Comput Stat 31, 1237–1262 (2016). https://doi.org/10.1007/s00180-016-0652-8
Keywords
- Variable selection
- Variable ranking
- Genetic algorithm
- Ensemble learning
- Variable selection ensemble
- Boosting