Abstract
Out-of-scope (OOS) intent classification is an emerging field in conversational AI research. The goal is to detect user intents that do not belong to a predefined intent ontology. However, establishing a reliable OOS detection system is challenging due to limited data availability, which calls for solutions rooted in few-shot learning. For such few-shot text classification tasks, prompt-based learning has been shown to be more effective than conventional finetuning of large language models with a classification layer on top. Thus, we advocate exploring prompt-based approaches for OOS intent detection. Additionally, we propose a new evaluation metric, the Area Under the In-scope and Out-of-Scope Characteristic curve (AU-IOC). This metric addresses the shortcomings of current evaluation standards for OOS intent detection: AU-IOC provides a comprehensive assessment of a model’s dual performance capacities, i.e., in-scope classification accuracy and OOS recall. Under this new evaluation method, we compare our prompt-based OOS detector against 3 strong baseline models by exploiting the metadata of intent annotations, i.e., intent descriptions. Our study found that our prompt-based model achieved the highest AU-IOC score across different data regimes. Further experiments showed that our detector is insensitive to variations in the intent descriptions. Intriguingly, for extremely low data settings (1- or 5-shot), employing a naturally phrased prompt template boosts the detector’s performance compared to more artificially structured template patterns.
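To make the metric concrete, the following is a minimal sketch of how an AU-IOC-style score could be computed: sweep a confidence threshold (questions below it are rejected as OOS), record in-scope accuracy and OOS recall at each threshold, and integrate the resulting curve. The threshold grid, rejection rule, and integration method here are our assumptions for illustration, not the authors' reference implementation.

```python
import numpy as np

def au_ioc(confidences, predictions, labels, oos_label="oos"):
    """Illustrative AU-IOC computation: area under the curve of
    in-scope accuracy versus OOS recall, traced out by sweeping a
    confidence threshold below which a question is rejected as OOS."""
    conf = np.asarray(confidences, dtype=float)
    is_oos = np.array([y == oos_label for y in labels])
    correct = np.array([p == y for p, y in zip(predictions, labels)])

    accs, recalls = [], []
    for t in np.linspace(0.0, 1.0, 101):
        accepted = conf >= t
        # In-scope accuracy: in-scope questions that are accepted AND
        # assigned the correct intent.
        accs.append((accepted & ~is_oos & correct).sum() / max((~is_oos).sum(), 1))
        # OOS recall: OOS questions that are correctly rejected.
        recalls.append((~accepted & is_oos).sum() / max(is_oos.sum(), 1))

    # Integrate accuracy over recall with the trapezoidal rule.
    r, a = np.array(recalls), np.array(accs)
    order = np.argsort(r)
    r, a = r[order], a[order]
    return float(np.sum((r[1:] - r[:-1]) * (a[1:] + a[:-1]) / 2.0))
```

A detector that separates in-scope from OOS questions perfectly at some threshold attains the maximum score of 1.0.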
Notes
The reason for using recall instead of precision to evaluate the OOS detection is that a recall error (an OOS question is wrongly classified as an in-scope intent) would generate a completely wrong response, while a precision error (i.e., an in-scope question is misclassified as OOS) is rather safe since it usually triggers a fallback response that asks the user to rephrase the question.
We also experimented on a multi-mask PET model [28] to directly predict labels yielding much worse performance and less efficient training compared to our prompt-based model.
Without specification, “intent” represents either “intent label” or “intent description”.
All annotations are available from https://bit.ly/3Xo5BAR
See [49, Tables 5–6] for a list of mono-lingual PLMs.
References
Liang S, Li Y, Srikant R (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In: Proceedings of ICLR, Vancouver, BC, Canada. https://openreview.net/forum?id=H1VGkIxRZ
Hsu Y, Shen Y, Jin H, Kira Z (2020) Generalized ODIN: detecting out-of-distribution image without learning from out-of-distribution data. In: Proceedings of CVPR, Seattle, WA, USA, pp 10948–10957. https://doi.org/10.1109/CVPR42600.2020.01096
Ren J, Liu PJ, Fertig E, Snoek J, Poplin R, DePristo MA, Dillon JV, Lakshminarayanan B (2019) Likelihood ratios for out-of-distribution detection. In: Proceedings of NeurIPS, Vancouver, BC, Canada, 32:14680–14691. https://proceedings.neurips.cc/paper/2019/hash/1e79596878b2320cac26dd792a6c51c9-Abstract.html
Lee K, Lee K, Lee H, Shin J (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In: Proceedings of NeurIPS, Montréal, Canada, 31:7167–7177. https://proceedings.neurips.cc/paper/2018/hash/abdeb6f575ac5c6676b747bca8d09cc2-Abstract.html
Zheng Y, Chen G, Huang M (2020) Out-of-domain detection for natural language understanding in dialog systems. IEEE/ACM Trans Audio, Speech and Lang Proc 28:1198–1209. https://doi.org/10.1109/TASLP.2020.2983593
Jin D, Gao S, Kim S, Liu Y, Hakkani-Tür D (2022) Towards textual out-of-domain detection without in-domain labels. IEEE/ACM Trans Audio, Speech Lang Proc 30:1386–1395. https://doi.org/10.1109/TASLP.2022.3162081
Shen Y, Hsu Y-C, Ray A, Jin H (2021) Enhancing the generalization for intent classification and out-of-domain detection in SLU. In: Proceedings of ACL, Association for Computational Linguistics, Online, 2443–2453. https://doi.org/10.18653/v1/2021.acl-long.190
Zhou W, Liu F, Chen M (2021) Contrastive out-of-distribution detection for pretrained transformers. In: Proceedings of EMNLP, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp 1100–1111. https://doi.org/10.18653/v1/2021.emnlp-main.84
Zhang H, Xu H, Lin T-E (2021) Deep open intent classification with adaptive decision boundary. In: Proceedings of AAAI, AAAI Press, Online, 35:14374–14382. https://ojs.aaai.org/index.php/AAAI/article/view/17690
Lane I, Kawahara T, Matsui T, Nakamura S (2007) Out-of-domain utterance detection using classification confidences of multiple topics. IEEE/ACM Trans Audio, Speech Lang Proc 15(1):150–161. https://doi.org/10.1109/TASL.2006.876727
Iqbal T, Cao Y, Kong Q, Plumbley MD, Wang W (2020) Learning with out-of-distribution data for audio classification. In: Proceedings of ICASSP, IEEE, Barcelona, Spain, pp 636–640. https://doi.org/10.1109/ICASSP40776.2020.9054444
Lin T-E, Xu H (2019) Deep unknown intent detection with margin loss. In: Proceedings of ACL, Association for Computational Linguistics, Florence, Italy, pp 5491–5496. https://doi.org/10.18653/v1/P19-1548
Zhan L-M, Liang H, Liu B, Fan L, Wu X-M, Lam AYS (2021) Out-of-scope intent detection with self-supervision and discriminative training. In: Proceedings of ACL, Association for Computational Linguistics, Online, pp 3521–3532. https://doi.org/10.18653/v1/2021.acl-long.273
Zhang J, Hashimoto K, Wan Y, Liu Z, Liu Y, Xiong C, Yu P (2022) Are pre-trained transformers robust in intent classification: A missing ingredient in evaluation of out-of-scope intent detection. In: Proceedings of the 4th Workshop on NLP for ConvAI, Association for Computational Linguistics, Dublin, Ireland, pp 12–20. https://doi.org/10.18653/v1/2022.nlp4convai-1.2
Zhang J, Hashimoto K, Liu W, Wu C-S, Wan Y, Yu P, Socher R, Xiong C (2020) Discriminative nearest neighbor few-shot intent detection by transferring natural language inference. In: Proceedings of EMNLP, Association for Computational Linguistics, Online, pp 5064–5082. https://doi.org/10.18653/v1/2020.emnlp-main.411
Hendrycks D, Gimpel K (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: Proceedings of ICLR, Toulon, France. https://openreview.net/forum?id=Hkg4TI9xl
Liu J, Lin Z, Padhy S, Tran D, Bedrax Weiss T, Lakshminarayanan B (2020) Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. In: Proceedings of NeurIPS, Online, 33:7498–7512. https://proceedings.neurips.cc/paper/2020/hash/543e83748234f7cbab21aa0ade66565f-Abstract.html
Schick T, Schütze H (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In: Proceedings of EACL, Association for Computational Linguistics, Online, pp 255–269. https://doi.org/10.18653/v1/2021.eacl-main.20
Schick T, Schütze H (2021) It’s not just size that matters: Small language models are also few-shot learners. In: Proceedings of NAACL, Association for Computational Linguistics, Online, pp 2339–2352. https://doi.org/10.18653/v1/2021.naacl-main.185
Li D, Hu B, Chen Q (2022) Prompt-based text entailment for low-resource named entity recognition. In: Proceedings of ICCL, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, pp 1896–1903. https://aclanthology.org/2022.coling-1.164
Chen Y, Harbecke D, Hennig L (2022) Multilingual relation classification via efficient and effective prompting. In: Proceedings of EMNLP, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, pp 1059–1075. https://aclanthology.org/2022.emnlp-main.69
Shu L, Xu H, Liu B (2017) DOC: Deep open classification of text documents. In: Proceedings of EMNLP, Association for Computational Linguistics, Copenhagen, Denmark, pp 2911–2916. https://doi.org/10.18653/v1/D17-1314
Yan G, Fan L, Li Q, Liu H, Zhang X, Wu X-M, Lam AYS (2020) Unknown intent detection using Gaussian mixture model with an application to zero-shot intent classification. In: Proceedings of ACL, Association for Computational Linguistics, Online, pp 1050–1060. https://doi.org/10.18653/v1/2020.acl-main.99
Larson S, Mahendran A, Peper JJ, Clarke C, Lee A, Hill P, Kummerfeld JK, Leach K, Laurenzano MA, Tang L, Mars J (2019) An evaluation dataset for intent classification and out-of-scope prediction. In: Proceedings of EMNLP, Association for Computational Linguistics, Hong Kong, China, pp 1311–1316. https://doi.org/10.18653/v1/D19-1131
Qu J, Hashimoto K, Liu W, Xiong C, Zhou Y (2021) Few-shot intent classification by gauging entailment relationship between utterance and semantic label. In: Proceedings of the 3rd Workshop on NLP for ConvAI, Association for Computational Linguistics, Online, pp 8–15. https://doi.org/10.18653/v1/2021.nlp4convai-1.2
Davis J, Goadrich M (2006) The relationship between precision-recall and ROC curves. In: Proceedings of ICML, Association for Computing Machinery, Pittsburgh, Pennsylvania, USA, pp 233–240. https://doi.org/10.1145/1143844.1143874
Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27(8):861–874. https://doi.org/10.1016/j.patrec.2005.10.010
Schick T, Schütze H (2022) True few-shot learning with prompts: A real-world perspective. Trans Assoc Comput Linguistics 10:716–731. https://doi.org/10.1162/tacl_a_00485
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL, Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186. https://doi.org/10.18653/v1/N19-1423
Tam D, Menon RR, Bansal M, Srivastava S, Raffel C (2021) Improving and simplifying pattern exploiting training. In: Proceedings of EMNLP, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp 4980–4991. https://doi.org/10.18653/v1/2021.emnlp-main.407
Casanueva I, Temcinas T, Gerz D, Henderson M, Vulic I (2020) Efficient intent detection with dual sentence encoders. In: Proceedings of the 2nd Workshop on NLP for ConvAI - ACL 2020. Data available at https://github.com/PolyAI-LDN/task-specific-datasets. arXiv:2003.04807
Coucke A, Saade A, Ball A, Bluche T, Caulier A, Leroy D, Doumouro C, Gisselbrecht T, Caltagirone F, Lavril T, Primet M, Dureau J (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv:1805.10190
Schuster S, Gupta S, Shah R, Lewis M (2019) Cross-lingual transfer learning for multilingual task oriented dialog. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, pp 3795–3805. https://doi.org/10.18653/v1/N19-1380. https://aclanthology.org/N19-1380
Xu J, Wang P, Tian G, Xu B, Zhao J, Wang F, Hao H (2015) Short text clustering via convolutional neural networks. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Association for Computational Linguistics, Denver, Colorado, pp 62–69. https://doi.org/10.3115/v1/W15-1509. https://aclanthology.org/W15-1509
Liu X, Eshghi A, Swietojanski P, Rieser V (2021) Benchmarking natural language understanding services for building conversational agents. In: Increasing Naturalness and Flexibility in Spoken Dialogue Interaction: 10th International Workshop on Spoken Dialogue Systems, Springer, pp 165–183. https://doi.org/10.1007/978-981-15-9323-9_15
Reimers N, Gurevych I (2019) Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of EMNLP, Association for Computational Linguistics, Hong Kong, China, pp 3982–3992. https://doi.org/10.18653/v1/D19-1410
Thakur N, Reimers N, Daxenberger J, Gurevych I (2021) Augmented SBERT: Data augmentation method for improving Bi-encoders for pairwise sentence scoring tasks. In: Proceedings of NAACL, Association for Computational Linguistics, Online, pp 296–310. https://doi.org/10.18653/v1/2021.naacl-main.28
Chen Q, Zhu X, Ling Z-H, Wei S, Jiang H, Inkpen D (2017) Enhanced LSTM for natural language inference. In: Proceedings of ACL, Association for Computational Linguistics, Vancouver, Canada, pp 1657–1668. https://doi.org/10.18653/v1/P17-1152
Williams A, Nangia N, Bowman S (2018) A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of NAACL, Association for Computational Linguistics, New Orleans, Louisiana, pp 1112–1122. https://doi.org/10.18653/v1/N18-1101
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692
Loshchilov I, Hutter F (2019) Decoupled weight decay regularization. In: Proceedings of ICLR, Vancouver, BC, Canada. https://openreview.net/forum?id=Bkg6RiCqY7
Efron B, Tibshirani RJ (1994) An Introduction to the Bootstrap. CRC Press, New York, USA
Gao T, Fisch A, Chen D (2021) Making pre-trained language models better few-shot learners. In: Proceedings of ACL, Association for Computational Linguistics, Online, pp 3816–3830. https://doi.org/10.18653/v1/2021.acl-long.295
Chen D, Yu Z (2021) GOLD: Improving out-of-scope detection in dialogues using data augmentation. In: Proceedings of EMNLP, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp 429–442. https://doi.org/10.18653/v1/2021.emnlp-main.35
Cheng Z, Jiang Z, Yin Y, Wang C, Gu Q (2022) Learning to classify open intent via soft labeling and manifold mixup. IEEE/ACM Trans Audio, Speech Lang Proc 30:635–645. https://doi.org/10.1109/TASLP.2022.3145308
Tan M, Yu Y, Wang H, Wang D, Potdar S, Chang S, Yu M (2019) Out-of-domain detection for low-resource text classification tasks. In: Proceedings of EMNLP, Association for Computational Linguistics, Hong Kong, China, pp 3566–3572. https://doi.org/10.18653/v1/D19-1364
Snell J, Swersky K, Zemel R (2017) Prototypical networks for few-shot learning. In: Proceedings of NeurIPS, Long Beach, CA, USA, 30:4077–4087. https://proceedings.neurips.cc/paper/2017/file/cb8da6767461f2812ae4290eac7cbc42-Paper.pdf
Conneau A, Lample G (2019) Cross-lingual language model pretraining. In: Proceedings of NeurIPS, Vancouver, BC, Canada, vol 32. https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf
Kalyan KS, Rajasekharan A, Sangeetha S (2022) AMMU: A survey of transformer-based biomedical pretrained language models. J Biomed Inform 126:103982. https://doi.org/10.1016/j.jbi.2021.103982
Min S, Lewis M, Hajishirzi H, Zettlemoyer L (2022) Noisy channel language model prompting for few-shot text classification. In: Proceedings of ACL, Association for Computational Linguistics, Dublin, Ireland, pp 5316–5330. https://doi.org/10.18653/v1/2022.acl-long.365
Funding
The first author is supported by China Scholarship Council (No. 201906020194) and Ghent University Special Research Fund (BOF) (No. 01SC0618). This research also received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.
Author information
Contributions
Yiwei Jiang: Conceptualization, Methodology, Software, Investigation, Writing - original draft preparation. Maarten De Raedt: Conceptualization, Investigation, Writing - review and editing. Johannes Deleu: Conceptualization, Investigation, Writing - review and editing. Thomas Demeester: Conceptualization, Investigation, Writing - review and editing, Supervision. Chris Develder: Conceptualization, Investigation, Writing - review and editing, Supervision.
Ethics declarations
Competing Interests
The authors have no competing interests to disclose in any material discussed in this article.
Ethical Compliance
This study does not involve any human participants or animals. All data used in this article are sourced from open and publicly accessible platforms. No proprietary, confidential, or private data were used.
Scientific assessment
We thank the reviewers for their useful feedback, which helped us to improve the manuscript, including with their suggestion to add more datasets.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A IOC curves at 1-50 shots
Figures 8, 9, 10, 11, 12 and 13 plot the IOC curves of the different models in 1- to 50-shot settings for the SNIPS, Facebook, CLINC-Banking, Stackoverflow, HWU64 and BANKING datasets, respectively.
Appendix B Confidence score distributions of the other 3 datasets at 5-shot
Figure 14 shows the confidence score distributions of the 4 architectures on 3 datasets (SNIPS, Facebook and HWU64) at 5-shot.
Appendix C Inference speed
Figure 15 illustrates the inference throughput against the number of in-scope classes (denoted as L). To ensure a fair comparison between the models and to simulate the online evaluation setting, we standardized the input batch size to 1 across all models, i.e., each batch contains a single user question.

The throughput of the Softmax model remains relatively stable (approximately 62 instances/s) regardless of L: it bypasses the one-vs-all binary classification and is therefore insensitive to the number of classes. For the Siamese model, we cache the intent label embeddings to improve efficiency; despite this optimization, the computational cost of the cosine similarity operation still grows with L. In contrast, the throughput of the other two models drops considerably once L exceeds 14. Notably, the prompt-based model achieves the highest AU-IOC scores, but at the expense of reduced inference throughput, particularly when L exceeds 10. A major factor in this slowdown is the tensor extraction in the prompt-based model, which requires time-consuming data transfer between the GPU and the CPU.

While our primary focus in this study is the robustness of different models on the out-of-scope (OOS) intent detection task, we acknowledge that optimizing inference speed is a critical aspect that warrants attention in future work. Note also that inference time depends on the hardware used during evaluation, so a change in hardware could alter the reported throughput numbers.
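The online-evaluation setup described above (batch size 1, throughput in instances per second) can be sketched as a minimal timing harness. This is our illustration, not the authors' measurement code; `model_fn` is a hypothetical stand-in for any of the compared classifiers.

```python
import time

def measure_throughput(model_fn, questions):
    """Time `model_fn` on one question per call (batch size 1),
    mirroring the online evaluation setting, and return the
    throughput in instances per second."""
    start = time.perf_counter()
    for q in questions:
        model_fn(q)  # one user question per forward pass
    elapsed = time.perf_counter() - start
    return len(questions) / elapsed
```

Because each call processes a single question, the measured throughput reflects per-request latency rather than batched GPU utilization, which is the quantity that matters for a deployed dialog system.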
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jiang, Y., De Raedt, M., Deleu, J. et al. Few-shot out-of-scope intent classification: analyzing the robustness of prompt-based learning. Appl Intell 54, 1474–1496 (2024). https://doi.org/10.1007/s10489-023-05215-x