Abstract
In recent years, the demand for high-quality data has intensified, particularly in the medical field, where accurate data annotation is both costly and critical. Active Learning (AL) has emerged as a pivotal approach in these scenarios, where selecting high-quality data for training machine learning models is essential. This paper introduces a novel method, “Stochastic Featurization for Active Learning” (SFAL), designed to efficiently identify hard-to-classify unlabeled data in both medical and general-domain datasets. Unlike traditional AL methods that rely on a pre-trained estimator, SFAL extracts novelty features from the latent representations of the target model itself, circumventing the need for extensive initial training and facilitating the selection of a diverse array of challenging medical data samples. This technique is particularly effective for medical text classification and named entity recognition, tasks where precise data interpretation is crucial. Extensive experiments across seven benchmark datasets, including clinical ones, confirm that SFAL surpasses existing state-of-the-art AL methods, demonstrating its potential for advancing medical data analysis.
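The abstract describes SFAL only at a high level: stochastic latent features are drawn from the target model itself and used to flag hard-to-classify unlabeled samples, with no separately pre-trained acquisition estimator. The sketch below is a minimal illustration of that general idea, not the authors' implementation: it keeps dropout active in a BERT encoder (Monte Carlo dropout, in the spirit of Gal et al.) so that repeated forward passes yield stochastic featurizations, then scores each unlabeled text by how much its latent representation varies across draws. The helper names stochastic_featurize and select_batch, the checkpoint bert-base-uncased, and the dispersion-based score are all illustrative assumptions.

    # Illustrative sketch only (not the paper's released code): stochastic
    # featurization via Monte Carlo dropout over a BERT encoder, scoring each
    # unlabeled sample by the dispersion of its latent features across draws.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    encoder.train()  # keep dropout active so each forward pass is stochastic

    @torch.no_grad()
    def stochastic_featurize(texts, n_draws=10):
        """Return a (n_draws, batch, hidden) tensor of stochastic [CLS] features."""
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        draws = []
        for _ in range(n_draws):
            out = encoder(**batch).last_hidden_state[:, 0, :]  # [CLS] embedding
            draws.append(out)
        return torch.stack(draws)

    def select_batch(texts, k=8, n_draws=10):
        """Pick the k samples whose latent features vary most across draws."""
        feats = stochastic_featurize(texts, n_draws)       # (draws, batch, dim)
        dispersion = feats.std(dim=0).mean(dim=-1)         # per-sample score
        topk = torch.topk(dispersion, k=min(k, len(texts))).indices
        return [texts[i] for i in topk.tolist()]

    # Example: pick the 2 "hardest" texts from a toy unlabeled pool.
    pool = ["Patient denies chest pain.", "No acute distress noted.",
            "BP 120/80.", "History of diabetes mellitus."]
    print(select_batch(pool, k=2, n_draws=5))

In this illustrative setting, high dispersion across stochastic draws plays the role of the novelty signal: samples whose features the encoder cannot represent stably are queued for annotation first.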
Acknowledgements
This research is supported by the National Key Research and Development Program of China (No. 2020AAA0109400), the Shenyang Science and Technology Plan Fund (No. 21-102-0-09), and the Swiss National Science Foundation (SNSF) under contract number CRSII5_205975.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Le, L. et al. (2024). Stochastic Featurization for Active Learning. In: Chen, H., Zhou, Y., Xu, D., Vardhanabhuti, V.V. (eds) Trustworthy Artificial Intelligence for Healthcare. TAI4H 2024. Lecture Notes in Computer Science, vol 14812. Springer, Cham. https://doi.org/10.1007/978-3-031-67751-9_5
DOI: https://doi.org/10.1007/978-3-031-67751-9_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-67750-2
Online ISBN: 978-3-031-67751-9
eBook Packages: Computer Science, Computer Science (R0)