Abstract
Ensemble learning constructs a strong classifier by training multiple weak classifiers and is widely used in the field of text classification. To improve text classification accuracy, an adaptive bootstrap aggregating (Bagging) ensemble learning algorithm that takes text length into account, called TC_Bagging, is proposed. First, the performance of several typical deep learning methods on long and short texts is compared, and optimal groups of base classifiers are constructed for each text length. Second, a random sampling method based on an adaptive threshold group is proposed to build long-text and short-text training subsets while preserving the proportion of samples in each category. Finally, to avoid the loss of accuracy that the sampling process may introduce, a text vector generation algorithm based on smooth inverse frequency (SIF) is combined with the traditional weighted-voting classifier ensemble method to obtain the final classification result. Comparing TC_Bagging with several baseline methods on three datasets, our evaluation suggests that TC_Bagging outperforms RF, WAVE, RF_WMVE and RF_WAVE by approximately 0.120 in average F1, 0.300 in average sensitivity and 0.060 in average specificity, respectively, showing that TC_Bagging has a clear advantage over typical ensemble learning algorithms.
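The three steps above map onto standard building blocks. The Python sketch below is a minimal illustration under our own assumptions, not the authors' implementation: stratified_bootstrap draws a bootstrap subset that preserves category proportions (the abstract's sampling step), embed_sif computes SIF sentence vectors with the a/(a + p(w)) weighting of Arora et al. (2017) followed by removal of the first principal component, and weighted_vote combines base-classifier probabilities with per-classifier weights. All function names and parameters here are hypothetical.

```python
import numpy as np

def stratified_bootstrap(labels, seed=0):
    """Draw bootstrap indices class by class so that each training
    subset keeps the original proportion of samples per category."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    idx = []
    for c in np.unique(labels):
        pool = np.flatnonzero(labels == c)
        idx.extend(rng.choice(pool, size=len(pool), replace=True))
    return np.array(idx)

def embed_sif(sentences, word_vecs, word_freq, a=1e-3):
    """SIF embedding: average each sentence's word vectors weighted by
    a / (a + p(w)), then remove the corpus-level common component
    (first right singular vector of the embedding matrix)."""
    dim = len(next(iter(word_vecs.values())))
    total = sum(word_freq.values())
    emb = []
    for sent in sentences:
        words = [w for w in sent if w in word_vecs]
        if not words:
            emb.append(np.zeros(dim))
            continue
        wts = np.array([a / (a + word_freq[w] / total) for w in words])
        vecs = np.array([word_vecs[w] for w in words])
        emb.append(wts @ vecs / len(words))
    emb = np.array(emb)
    u = np.linalg.svd(emb, full_matrices=False)[2][0]  # 1st principal dir.
    return emb - np.outer(emb @ u, u)

def weighted_vote(probs_per_clf, clf_weights):
    """Weighted voting: scale each base classifier's class-probability
    matrix by its weight, sum, and take the argmax class per sample."""
    score = sum(w * p for w, p in zip(clf_weights, probs_per_clf))
    return np.argmax(score, axis=1)
```

For instance, weighted_vote([p_cnn, p_lstm, p_bert], [0.90, 0.85, 0.88]) would favor the base classifiers that performed best on a validation split; how TC_Bagging actually sets its adaptive length thresholds and voting weights is specified in the paper itself.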
Acknowledgments
This research was supported by the National Natural Science Foundation of China (No. 61906220), the Humanities and Social Science Project of the Ministry of Education (No. 19YJCZH178), the National Social Science Foundation of China (No. 18CTJ008), the Natural Science Foundation of Tianjin (No. 18JCQNJC69600), the National Key R&D Program of China (No. 2017YFB1400700), and the Emerging Interdisciplinary Project of CUFE.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest related to this work. We further declare that we have no commercial or associative interest that represents a conflict of interest in connection with the submitted work.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, Y., Liu, J. & Feng, L. Text length considered adaptive bagging ensemble learning algorithm for text classification. Multimed Tools Appl 82, 27681–27706 (2023). https://doi.org/10.1007/s11042-023-14578-9