
Few-shot out-of-scope intent classification: analyzing the robustness of prompt-based learning

Applied Intelligence

Abstract

Out-of-scope (OOS) intent classification is an emerging field in conversational AI research. The goal is to detect out-of-scope user intents that do not belong to a predefined intent ontology. However, establishing a reliable OOS detection system is challenging due to limited data availability. This situation necessitates solutions rooted in few-shot learning techniques. For such few-shot text classification tasks, prompt-based learning has been shown more effective than conventionally finetuned large language models with a classification layer on top. Thus, we advocate for exploring prompt-based approaches for OOS intent detection. Additionally, we propose a new evaluation metric, the Area Under the In-scope and Out-of-Scope Characteristic curve (AU-IOC). This metric addresses the shortcomings of current evaluation standards for OOS intent detection. AU-IOC provides a comprehensive assessment of a model’s dual performance capacities: in-scope classification accuracy and OOS recall. Under this new evaluation method, we compare our prompt-based OOS detector against 3 strong baseline models by exploiting the metadata of intent annotations, i.e., intent description. Our study found that our prompt-based model achieved the highest AU-IOC score across different data regimes. Further experiments showed that our detector is insensitive to a variety of intent descriptions. An intriguing finding shows that for extremely low data settings (1- or 5-shot), employing a naturally phrased prompt template boosts the detector’s performance compared to rather artificially structured template patterns.
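As a rough illustration of the proposed metric, the sketch below sweeps a confidence threshold to trace an in-scope accuracy vs. OOS recall curve and integrates it with the trapezoidal rule. The function names, the thresholding rule (inputs scored below the threshold are predicted OOS), and the toy data are our assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np

def ioc_curve(confidences, preds, labels, oos_label="oos"):
    """Sweep a confidence threshold t; inputs scored below t are predicted OOS.
    Returns (OOS recall, in-scope accuracy) pairs in increasing-threshold order.
    Illustrative sketch only -- the thresholding rule is an assumption."""
    conf = np.asarray(confidences, dtype=float)
    is_in = np.array([y != oos_label for y in labels])
    correct = np.array([p == y for p, y in zip(preds, labels)])
    points = []
    for t in np.concatenate(([0.0], np.unique(conf), [conf.max() + 1e-9])):
        rejected = conf < t                           # predicted OOS at threshold t
        r_oos = rejected[~is_in].mean()               # OOS recall
        acc_in = (correct & ~rejected)[is_in].mean()  # in-scope accuracy
        points.append((r_oos, acc_in))
    return points

def au_ioc(points):
    """Trapezoidal area under the Acc_in vs. R_oos curve."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0
    return area
```

A detector that perfectly separates in-scope from OOS inputs by confidence attains an area of 1.0; a detector that sacrifices in-scope accuracy as soon as it starts rejecting OOS inputs scores lower.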



Data Availability

The original datasets used in this study come from multiple studies [24, 31,32,33,34,35]. Our work adapted these datasets for training our models and experiment analysis. The adapted versions are available from the corresponding author on reasonable request.

Notes

  1. The reason for using recall instead of precision to evaluate the OOS detection is that a recall error (an OOS question is wrongly classified as an in-scope intent) would generate a completely wrong response, while a precision error (i.e., an in-scope question is misclassified as OOS) is rather safe since it usually triggers a fallback response that asks the user to rephrase the question.

  2. We also experimented with a multi-mask PET model [28] that predicts labels directly; it yielded much worse performance and less efficient training than our prompt-based model.

  3. Without specification, “intent” represents either “intent label” or “intent description”.

  4. https://bit.ly/3r4bDN0

  5. For completeness, in Appendix A we also plot IOC curves of the 4 architectures for all the 6 datasets from 1- to 50-shot settings in Figs. 8, 9, 10, 11, 12 and 13.

  6. To save space, the score distributions of the other 3 datasets are plotted in Fig. 14, Appendix B.

  7. All annotations are available from https://bit.ly/3Xo5BAR

  8. See [49, Tables 5–6] for a list of mono-lingual PLMs.

References

  1. Liang S, Li Y, Srikant R (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In: Proceedings of ICLR, Vancouver, BC, Canada. https://openreview.net/forum?id=H1VGkIxRZ

  2. Hsu Y, Shen Y, Jin H, Kira Z (2020) Generalized ODIN: detecting out-of-distribution image without learning from out-of-distribution data. In: Proceedings of CVPR, Seattle, WA, USA, pp 10948–10957. https://doi.org/10.1109/CVPR42600.2020.01096

  3. Ren J, Liu PJ, Fertig E, Snoek J, Poplin R, DePristo MA, Dillon JV, Lakshminarayanan B (2019) Likelihood ratios for out-of-distribution detection. In: Proceedings of NeurIPS, Vancouver, BC, Canada, 32:14680–14691. https://proceedings.neurips.cc/paper/2019/hash/1e79596878b2320cac26dd792a6c51c9-Abstract.html

  4. Lee K, Lee K, Lee H, Shin J (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In: Proceedings of NeurIPS, Montréal, Canada 31:7167–7177. https://proceedings.neurips.cc/paper/2018/hash/abdeb6f575ac5c6676b747bca8d09cc2-Abstract.html

  5. Zheng Y, Chen G, Huang M (2020) Out-of-domain detection for natural language understanding in dialog systems. IEEE/ACM Trans Audio, Speech and Lang Proc 28:1198–1209. https://doi.org/10.1109/TASLP.2020.2983593


  6. Jin D, Gao S, Kim S, Liu Y, Hakkani-Tür D (2022) Towards textual out-of-domain detection without in-domain labels. IEEE/ACM Trans Audio, Speech Lang Proc 30:1386–1395. https://doi.org/10.1109/TASLP.2022.3162081


  7. Shen Y, Hsu Y-C, Ray A, Jin H (2021) Enhancing the generalization for intent classification and out-of-domain detection in SLU. In: Proceedings of ACL, Association for Computational Linguistics, Online, 2443–2453. https://doi.org/10.18653/v1/2021.acl-long.190

  8. Zhou W, Liu F, Chen M (2021) Contrastive out-of-distribution detection for pretrained transformers. In: Proceedings of EMNLP, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp 1100–1111. https://doi.org/10.18653/v1/2021.emnlp-main.84

  9. Zhang H, Xu H, Lin T-E (2021) Deep open intent classification with adaptive decision boundary. In: Proceedings of AAAI, AAAI Press, Online 35:14374–14382. https://ojs.aaai.org/index.php/AAAI/article/view/17690

  10. Lane I, Kawahara T, Matsui T, Nakamura S (2007) Out-of-domain utterance detection using classification confidences of multiple topics. IEEE/ACM Trans Audio, Speech Lang Proc 15(1):150–161. https://doi.org/10.1109/TASL.2006.876727


  11. Iqbal T, Cao Y, Kong Q, Plumbley MD, Wang W (2020) Learning with out-of-distribution data for audio classification. In: Proceedings of ICASSP, IEEE, Barcelona, Spain pp 636–640. https://doi.org/10.1109/ICASSP40776.2020.9054444

  12. Lin T-E, Xu H (2019) Deep unknown intent detection with margin loss. In: Proceedings of ACL, Association for Computational Linguistics, Florence, Italy, pp 5491–5496. https://doi.org/10.18653/v1/P19-1548

  13. Zhan L-M, Liang H, Liu B, Fan L, Wu X-M, Lam AYS (2021) Out-of-scope intent detection with self-supervision and discriminative training. In: Proceedings of ACL, Association for Computational Linguistics, Online, pp 3521–3532. https://doi.org/10.18653/v1/2021.acl-long.273

  14. Zhang J, Hashimoto K, Wan Y, Liu Z, Liu Y, Xiong C, Yu P (2022) Are pre-trained transformers robust in intent classification: A missing ingredient in evaluation of out-of-scope intent detection. In: Proceedings of the 4th Workshop on NLP for ConvAI, Association for Computational Linguistics, Dublin, Ireland, pp 12–20. https://doi.org/10.18653/v1/2022.nlp4convai-1.2

  15. Zhang J, Hashimoto K, Liu W, Wu C-S, Wan Y, Yu P, Socher R, Xiong C (2020) Discriminative nearest neighbor few-shot intent detection by transferring natural language inference. In: Proceedings of EMNLP, Association for Computational Linguistics, Online, pp 5064–5082. https://doi.org/10.18653/v1/2020.emnlp-main.411

  16. Hendrycks D, Gimpel K (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: Proceedings of ICLR, Toulon, France. https://openreview.net/forum?id=Hkg4TI9xl

  17. Liu J, Lin Z, Padhy S, Tran D, Bedrax Weiss T, Lakshminarayanan B (2020) Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. In: Proceedings of NeurIPS, Online, 33:7498–7512. https://proceedings.neurips.cc/paper/2020/hash/543e83748234f7cbab21aa0ade66565f-Abstract.html

  18. Schick T, Schütze H (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In: Proceedings of EACL, Association for Computational Linguistics, Online pp 255–269. https://doi.org/10.18653/v1/2021.eacl-main.20

  19. Schick T, Schütze H (2021) It’s not just size that matters: Small language models are also few-shot learners. In: Proceedings of NAACL, Association for Computational Linguistics, Online, pp 2339–2352. https://doi.org/10.18653/v1/2021.naacl-main.185

  20. Li D, Hu B, Chen Q (2022) Prompt-based text entailment for low-resource named entity recognition. In: Proceedings of ICCL, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, pp 1896–1903. https://aclanthology.org/2022.coling-1.164

  21. Chen Y, Harbecke D, Hennig L (2022) Multilingual relation classification via efficient and effective prompting. In: Proceedings of EMNLP, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, pp 1059–1075. https://aclanthology.org/2022.emnlp-main.69

  22. Shu L, Xu H, Liu B (2017) DOC: Deep open classification of text documents. In: Proceedings of EMNLP, Association for Computational Linguistics, Copenhagen, Denmark, pp 2911–2916. https://doi.org/10.18653/v1/D17-1314

  23. Yan G, Fan L, Li Q, Liu H, Zhang X, Wu X-M, Lam AYS (2020) Unknown intent detection using Gaussian mixture model with an application to zero-shot intent classification. In: Proceedings of ACL, Association for Computational Linguistics, Online, pp 1050–1060. https://doi.org/10.18653/v1/2020.acl-main.99

  24. Larson S, Mahendran A, Peper JJ, Clarke C, Lee A, Hill P, Kummerfeld JK, Leach K, Laurenzano MA, Tang L, Mars J (2019) An evaluation dataset for intent classification and out-of-scope prediction. In: Proceedings of EMNLP, Association for Computational Linguistics, Hong Kong, China, pp 1311–1316. https://doi.org/10.18653/v1/D19-1131

  25. Qu J, Hashimoto K, Liu W, Xiong C, Zhou Y (2021) Few-shot intent classification by gauging entailment relationship between utterance and semantic label. In: Proceedings of the 3rd Workshop on NLP for ConvAI, Association for Computational Linguistics, Online, pp 8–15. https://doi.org/10.18653/v1/2021.nlp4convai-1.2

  26. Davis J, Goadrich M (2006) The relationship between precision-recall and roc curves. In: Proceedings of ICML, Association for Computing Machinery, Pittsburgh, Pennsylvania, USA, pp 233–240. https://doi.org/10.1145/1143844.1143874

  27. Fawcett T (2006) An introduction to roc analysis. Pattern Recognit Lett 27(8):861–874. https://doi.org/10.1016/j.patrec.2005.10.010


  28. Schick T, Schütze H (2022) True Few-Shot Learning with Prompts-A Real-World Perspective. Trans Assoc Comput Linguistics 10:716–731. https://doi.org/10.1162/tacl_a_00485


  29. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL, Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186. https://doi.org/10.18653/v1/N19-1423

  30. Tam D, R Menon R, Bansal M, Srivastava S, Raffel C (2021) Improving and simplifying pattern exploiting training. In: Proceedings of EMNLP, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp 4980–4991. https://doi.org/10.18653/v1/2021.emnlp-main.407

  31. Casanueva I, Temcinas T, Gerz D, Henderson M, Vulic I (2020) Efficient intent detection with dual sentence encoders. In: Proceedings of the 2nd Workshop on NLP for ConvAI - ACL 2020. Data available at https://github.com/PolyAI-LDN/task-specific-datasets. arXiv:2003.04807

  32. Coucke A, Saade A, Ball A, Bluche T, Caulier A, Leroy D, Doumouro C, Gisselbrecht T, Caltagirone F, Lavril T, Primet M, Dureau J (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv:1805.10190

  33. Schuster S, Gupta S, Shah R, Lewis M (2019) Cross-lingual transfer learning for multilingual task oriented dialog. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, pp 3795–3805. https://doi.org/10.18653/v1/N19-1380. https://aclanthology.org/N19-1380

  34. Xu J, Wang P, Tian G, Xu B, Zhao J, Wang F, Hao H (2015) Short text clustering via convolutional neural networks. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Association for Computational Linguistics, Denver, Colorado, pp 62–69. https://doi.org/10.3115/v1/W15-1509. https://aclanthology.org/W15-1509

  35. Liu X, Eshghi A, Swietojanski P, Rieser V (2021) Benchmarking natural language understanding services for building conversational agents. In: Increasing Naturalness and Flexibility in Spoken Dialogue Interaction: 10th International Workshop on Spoken Dialogue Systems, Springer, pp 165–183. https://doi.org/10.1007/978-981-15-9323-9_15

  36. Reimers N, Gurevych I (2019) Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of EMNLP, Association for Computational Linguistics, Hong Kong, China, pp 3982–3992. https://doi.org/10.18653/v1/D19-1410

  37. Thakur N, Reimers N, Daxenberger J, Gurevych I (2021) Augmented SBERT: Data augmentation method for improving Bi-encoders for pairwise sentence scoring tasks. In: Proceedings of NAACL, Association for Computational Linguistics, Online, pp 296–310. https://doi.org/10.18653/v1/2021.naacl-main.28

  38. Chen Q, Zhu X, Ling Z-H, Wei S, Jiang H, Inkpen D (2017) Enhanced LSTM for natural language inference. In: Proceedings of ACL, Association for Computational Linguistics, Vancouver, Canada, pp 1657–1668. https://doi.org/10.18653/v1/P17-1152

  39. Williams A, Nangia N, Bowman S (2018) A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of NAACL, Association for Computational Linguistics, New Orleans, Louisiana, pp 1112–1122. https://doi.org/10.18653/v1/N18-1101

  40. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692

  41. Loshchilov I, Hutter F (2019) Decoupled weight decay regularization. In: Proceedings of ICLR, Vancouver, BC, Canada. https://openreview.net/forum?id=Bkg6RiCqY7

  42. Efron B, Tibshirani RJ (1994) An Introduction to the Bootstrap. CRC Press, New York, USA


  43. Gao T, Fisch A, Chen D (2021) Making pre-trained language models better few-shot learners. In: Proceedings of ACL, Association for Computational Linguistics, Online. pp 3816–3830. https://doi.org/10.18653/v1/2021.acl-long.295

  44. Chen D, Yu Z (2021) GOLD: Improving out-of-scope detection in dialogues using data augmentation. In: Proceedings of EMNLP, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp 429–442. https://doi.org/10.18653/v1/2021.emnlp-main.35

  45. Cheng Z, Jiang Z, Yin Y, Wang C, Gu Q (2022) Learning to classify open intent via soft labeling and manifold mixup. IEEE/ACM Trans Audio, Speech Lang Proc 30:635–645. https://doi.org/10.1109/TASLP.2022.3145308


  46. Tan M, Yu Y, Wang H, Wang D, Potdar S, Chang S, Yu M (2019) Out-of-domain detection for low-resource text classification tasks. In: Proceedings of EMNLP, Association for Computational Linguistics, Hong Kong, China, pp 3566–3572. https://doi.org/10.18653/v1/D19-1364

  47. Snell J, Swersky K, Zemel R (2017) Prototypical networks for few-shot learning. In: Proceedings of NeurIPS, Long Beach, CA, USA 30:4077–4087. https://proceedings.neurips.cc/paper/2017/file/cb8da6767461f2812ae4290eac7cbc42-Paper.pdf

  48. Conneau A, Lample G (2019) Cross-lingual language model pretraining. In: Proceedings of NeurIPS, Vancouver, BC, Canada vol. 32. https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf

  49. Kalyan KS, Rajasekharan A, Sangeetha S (2022) AMMU: A survey of transformer-based biomedical pretrained language models. J Biomed Inf 126(C). https://doi.org/10.1016/j.jbi.2021.103982

  50. Min S, Lewis M, Hajishirzi H, Zettlemoyer L (2022) Noisy channel language model prompting for few-shot text classification. In: Proceedings of ACL, Association for Computational Linguistics, Dublin, Ireland, pp 5316–5330. https://doi.org/10.18653/v1/2022.acl-long.365


Funding

The first author is supported by China Scholarship Council (No. 201906020194) and Ghent University Special Research Fund (BOF) (No. 01SC0618). This research also received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.

Author information


Contributions

Yiwei Jiang: Conceptualization, Methodology, Software, Investigation, Writing - original draft preparation. Maarten De Raedt: Conceptualization, Investigation, Writing - review and editing. Johannes Deleu: Conceptualization, Investigation, Writing - review and editing. Thomas Demeester: Conceptualization, Investigation, Writing - review and editing, Supervision. Chris Develder: Conceptualization, Investigation, Writing - review and editing, Supervision.

Corresponding author

Correspondence to Yiwei Jiang.

Ethics declarations

Competing Interests

The authors have no competing interests to disclose in any material discussed in this article.

Ethical Compliance

This study does not involve human participants or animals. All data used in this article are sourced from open and publicly accessible platforms. No proprietary, confidential, or private data have been used.

Scientific assessment

We thank the reviewers for their useful feedback, which helped us to improve the manuscript, including with their suggestion to add more datasets.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A IOC curves at 1-50 shots

Figures 8, 9, 10, 11, 12 and 13 plot the IOC curves of the different models in 1- to 50-shot settings for the SNIPS, Facebook, CLINC-Banking, Stackoverflow, HWU64 and BANKING datasets, respectively.

Fig. 9

IOC curves (\(Acc_{in}\) vs. \(R_{oos}\)) of models in 1-50 shot settings evaluated on Facebook (test set). Better viewed in color

Fig. 10

IOC curves (\(Acc_{in}\) vs. \(R_{oos}\)) of models in 1-50 shot settings evaluated on CLINC-Banking (test set). Better viewed in color

Fig. 11

IOC curves (\(Acc_{in}\) vs. \(R_{oos}\)) of models in 1-50 shot settings evaluated on Stackoverflow (test set). Better viewed in color

Fig. 12

IOC curves (\(Acc_{in}\) vs. \(R_{oos}\)) of models in 1-50 shot settings evaluated on HWU64 (test set). Better viewed in color

Fig. 13

IOC curves (\(Acc_{in}\) vs. \(R_{oos}\)) of models in 1-50 shot settings evaluated on BANKING (test set). Better viewed in color

Appendix B Confidence score distributions of the other 3 datasets at 5-shot

Figure 14 shows the confidence score distributions of the 4 architectures on 3 datasets (SNIPS, Facebook and HWU64) at 5-shot.

Fig. 14

Confidence score histogram at the 5-shot setting on the test set of (a-d) SNIPS, (e-h) Facebook and (i-l) HWU64. Best viewed in color

Appendix C Inference speed

Figure 15 plots the inference throughput against the number of in-scope classes (denoted as L). To ensure a fair comparison between the models and to simulate the online evaluation setting, we standardized the input batch size to 1 across all models, i.e., each batch contains a single user question. The throughput of the Softmax model remains relatively stable (approximately 62 instances/s) regardless of L: it bypasses the one-vs-all binary classification and is therefore insensitive to the number of classes. For the Siamese model, we cache the intent label embeddings for efficiency; even so, the computational cost of the cosine similarity operation grows with L. In contrast, the throughput of the other two models drops sharply once L exceeds 14. Notably, the prompt-based model achieves the highest AU-IOC scores, albeit at the expense of reduced inference throughput, particularly when L exceeds 10. A significant factor behind this slowdown is the tensor extraction in the prompt-based model, which requires time-consuming data transfer between the GPU and CPU. While our primary focus in this study is the robustness of different models on the OOS intent detection task, optimizing inference speed is a critical aspect that warrants attention in future work. Note also that inference time depends on the hardware used during evaluation, so a different setup could yield different throughput numbers than those reported here.
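The online evaluation protocol described above (batch size 1, throughput in instances per second) can be sketched as follows. Here `predict_fn` and `measure_throughput` are hypothetical names of our own, standing in for any of the four models' single-question inference; they are not an API from the paper.

```python
import time

def measure_throughput(predict_fn, questions, warmup=5):
    """Online-style throughput measurement: one question per call (batch size 1).
    Returns processed instances per second over the timed loop."""
    for q in questions[:warmup]:          # warm up caches / lazy initialization
        predict_fn(q)
    start = time.perf_counter()
    for q in questions:                   # timed loop, batch size 1 throughout
        predict_fn(q)
    elapsed = time.perf_counter() - start
    return len(questions) / elapsed
```

Because wall-clock timings depend on the host and accelerator hardware, numbers obtained this way are only comparable across models measured on the same machine, matching the caveat above.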

Fig. 15

Inference throughput vs. number of in-scope classes. All throughput numbers were computed on a single NVIDIA GTX-1080Ti GPU

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jiang, Y., De Raedt, M., Deleu, J. et al. Few-shot out-of-scope intent classification: analyzing the robustness of prompt-based learning. Appl Intell 54, 1474–1496 (2024). https://doi.org/10.1007/s10489-023-05215-x



  • DOI: https://doi.org/10.1007/s10489-023-05215-x
