Evaluating the Practical Utility of Confidence-score based Techniques for Unsupervised Open-world Classification

First Workshop on Insights from Negative Results in NLP Pub Date : 1900-01-01 DOI:10.18653/v1/2022.insights-1.3

Sopan Khosla, Rashmi Gangadharaiah


Citations: 4

Abstract


Evaluating the Practical Utility of Confidence-score based Techniques for Unsupervised Open-world Classification

Open-world classification in dialog systems requires models to detect open intents while preserving the quality of in-domain (ID) intent classification. In this work, we revisit methods that leverage distance-based statistics for unsupervised out-of-domain (OOD) detection. We show that, despite their superior performance on threshold-independent metrics such as AUROC on the test set, threshold values chosen based on validation-set performance do not generalize well to the test set, resulting in substantially lower ID or OOD detection accuracy and F1 scores. Our analysis shows that this lack of generalizability can be successfully mitigated by setting aside a hold-out set from the validation data for threshold selection (sometimes achieving relative gains as high as 100%). Extensive experiments on seven benchmark datasets show that this fix puts the performance of these methods on par with, or sometimes even better than, the current state-of-the-art OOD detection techniques.
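
The hold-out idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 50/50 split, the use of OOD-class F1 as the selection criterion, and the assumption that the held-out examples carry ID/OOD labels are all hypothetical choices made for illustration. The sketch assumes each example is a (confidence score, is-OOD flag) pair, where a low score suggests OOD.

```python
import random


def f1_ood(scores, is_ood, threshold):
    """F1 of the OOD class when scores strictly below `threshold` are flagged OOD."""
    tp = fp = fn = 0
    for score, ood in zip(scores, is_ood):
        predicted_ood = score < threshold
        if predicted_ood and ood:
            tp += 1
        elif predicted_ood and not ood:
            fp += 1
        elif not predicted_ood and ood:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def split_validation(examples, holdout_fraction=0.5, seed=0):
    """Shuffle validation examples and set aside a hold-out part (assumed split ratio)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    # First part: e.g. model selection; second part: threshold selection only.
    return shuffled[:cut], shuffled[cut:]


def select_threshold(scores, is_ood):
    """Pick the candidate threshold with the highest OOD F1 on the given split."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_ood(scores, is_ood, t))


if __name__ == "__main__":
    # Toy (score, is_ood) pairs; real scores would come from a trained detector.
    validation = [(0.10, True), (0.20, True), (0.80, False), (0.90, False),
                  (0.15, True), (0.85, False), (0.30, True), (0.70, False)]
    model_part, holdout_part = split_validation(validation)
    holdout_scores = [s for s, _ in holdout_part]
    holdout_flags = [o for _, o in holdout_part]
    threshold = select_threshold(holdout_scores, holdout_flags)
    print("threshold chosen on the hold-out part:", threshold)
```

The point of the split is that the examples used to pick the threshold play no role in any other tuning step, so the chosen cut-off is less likely to be fitted to quirks of the data used for model selection.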

