Abstract
In recent years, the demand for high-quality data has intensified, particularly in the medical field, where accurate data annotation is both costly and critical. Active Learning (AL) has emerged as a pivotal approach in these scenarios, where selecting high-quality data for training machine learning models is essential. This paper introduces a novel method, “Stochastic Featurization for Active Learning” (SFAL), designed to efficiently identify hard-to-classify unlabeled data in both medical and general-domain datasets. Unlike traditional AL methods that rely on a pre-trained estimator, SFAL extracts novelty features from the latent representations of the target model itself, circumventing the need for extensive initial training and facilitating the selection of a diverse array of challenging medical data samples. This technique is particularly effective for medical text classification and named entity recognition, tasks where precise data interpretation is crucial. Extensive experiments across seven benchmark datasets, including clinical ones, confirm that SFAL surpasses existing state-of-the-art AL methods, demonstrating its potential for advancing medical data analysis.
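The abstract describes SFAL only at a high level: stochastic latent features are drawn from the target model itself and used to flag hard-to-classify unlabeled samples, with no separately pre-trained acquisition estimator. The sketch below is a minimal illustration of that general idea, not the authors' implementation: it keeps dropout active in a BERT encoder (Monte Carlo dropout, in the spirit of Gal et al.) so that repeated forward passes yield stochastic featurizations, then scores each unlabeled text by how much its latent representation varies across draws. The helper names stochastic_featurize and select_batch, the checkpoint bert-base-uncased, and the dispersion-based score are all illustrative assumptions.

    # Illustrative sketch only (not the paper's released code): stochastic
    # featurization via Monte Carlo dropout over a BERT encoder, scoring each
    # unlabeled sample by the dispersion of its latent features across draws.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    encoder.train()  # keep dropout active so each forward pass is stochastic

    @torch.no_grad()
    def stochastic_featurize(texts, n_draws=10):
        """Return a (n_draws, batch, hidden) tensor of stochastic [CLS] features."""
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        draws = []
        for _ in range(n_draws):
            out = encoder(**batch).last_hidden_state[:, 0, :]  # [CLS] embedding
            draws.append(out)
        return torch.stack(draws)

    def select_batch(texts, k=8, n_draws=10):
        """Pick the k samples whose latent features vary most across draws."""
        feats = stochastic_featurize(texts, n_draws)       # (draws, batch, dim)
        dispersion = feats.std(dim=0).mean(dim=-1)         # per-sample score
        topk = torch.topk(dispersion, k=min(k, len(texts))).indices
        return [texts[i] for i in topk.tolist()]

    # Example: pick the 2 "hardest" texts from a toy unlabeled pool.
    pool = ["Patient denies chest pain.", "No acute distress noted.",
            "BP 120/80.", "History of diabetes mellitus."]
    print(select_batch(pool, k=2, n_draws=5))

In this illustrative setting, high dispersion across stochastic draws plays the role of the novelty signal: samples whose features the encoder cannot represent stably are queued for annotation first.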
Acknowledgements
This research is supported by the National Key Research and Development Program of China (No. 2020AAA0109400), the Shenyang Science and Technology Plan Fund (No. 21-102-0-09), and the Swiss National Science Foundation (SNSF) under contract number CRSII5_205975.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Le, L. et al. (2024). Stochastic Featurization for Active Learning. In: Chen, H., Zhou, Y., Xu, D., Vardhanabhuti, V.V. (eds) Trustworthy Artificial Intelligence for Healthcare. TAI4H 2024. Lecture Notes in Computer Science, vol 14812. Springer, Cham. https://doi.org/10.1007/978-3-031-67751-9_5
DOI: https://doi.org/10.1007/978-3-031-67751-9_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-67750-2
Online ISBN: 978-3-031-67751-9
eBook Packages: Computer Science, Computer Science (R0)