Unsupervised evaluation for out-of-distribution detection

IF 7.5 | Zone 1 (Computer Science) | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition | Pub Date: 2024-11-26 | DOI: 10.1016/j.patcog.2024.111212

Yuhang Zhang, Jiani Hu, Dongchao Wen, Weihong Deng

Pattern Recognition, Volume 160, Article 111212. https://www.sciencedirect.com/science/article/pii/S0031320324009634

Citations: 0

Abstract

Evaluating existing out-of-distribution (OOD) detection methods currently requires labeled test sets. In real-world deployment, labeling every new test set is laborious because OOD data are diverse and vary in difficulty. Yet OOD detection methods must be evaluated on different OOD data, since their performance varies widely. We therefore propose evaluating OOD detection methods on unlabeled test sets, which removes the need to label each new OOD test set. The task is non-trivial: without OOD labels we cannot tell which samples are detected correctly, so evaluation metrics such as AUROC cannot be computed. This paper addresses this important yet previously unexplored task for the first time. Inspired by the bimodal distribution of OOD detection test sets, we propose an unsupervised indicator named Gscore that is related to OOD detection performance, so that neural networks can learn this relationship and predict OOD detection performance without OOD labels. Extensive experiments confirm a strong, almost linear quantitative correlation between Gscore and OOD detection performance. We also introduce Gbench, a new benchmark of 200 different real-world OOD datasets, to test the performance of Gscore. The results show that Gscore achieves state-of-the-art performance compared with other unsupervised evaluation methods and generalizes well across in-distribution (ID)/OOD datasets, OOD detection methods, backbones, and ID:OOD ratios. Finally, we analyze Gbench to study how backbones and ID/OOD datasets affect OOD detection performance. The dataset and code will be made available.
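
To make the contrast between labeled and label-free evaluation concrete, the sketch below computes AUROC (which needs OOD labels) next to a simple label-free bimodality statistic on synthetic detector scores. This is not the paper's Gscore, whose definition the abstract does not give. The function `bimodal_separation`, the 1-D two-means split it uses, and the Gaussian score model are all assumptions chosen purely for illustration.

```python
import math
import random
import statistics


def auroc(scores, labels):
    """AUROC via pairwise comparison; requires OOD labels (1 = OOD, 0 = ID)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count OOD/ID pairs where the OOD score is higher; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bimodal_separation(scores, iters=50):
    """Hypothetical label-free proxy (NOT Gscore): split the scores into two
    groups with 1-D two-means, then return the gap between the group means
    divided by the pooled within-group standard deviation."""
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        a = [s for s in scores if s <= mid]
        b = [s for s in scores if s > mid]
        lo, hi = statistics.fmean(a), statistics.fmean(b)
    pooled_var = (
        len(a) * statistics.pvariance(a) + len(b) * statistics.pvariance(b)
    ) / len(scores)
    return (hi - lo) / math.sqrt(pooled_var) if pooled_var > 0 else math.inf


# Synthetic detector scores (an assumption): ID ~ N(0, 1), OOD ~ N(shift, 1).
rng = random.Random(0)
results = []
for shift in (1.0, 3.0, 6.0):
    id_scores = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    ood_scores = [rng.gauss(shift, 1.0) for _ in range(1000)]
    scores = id_scores + ood_scores
    labels = [0] * len(id_scores) + [1] * len(ood_scores)
    results.append((shift, auroc(scores, labels), bimodal_separation(scores)))

for shift, auc, sep in results:
    print(f"shift={shift:.1f}  AUROC (labels)={auc:.3f}  separation (no labels)={sep:.2f}")
```

In this toy setting both quantities grow as the ID and OOD score modes move apart, which is the kind of relationship an unsupervised indicator would need in order to stand in for a labeled metric. The sketch makes no claim about the form or strength of the relationship reported for Gscore.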

Source journal

Pattern Recognition (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Publications: 683

Review time: 5.6 months

Journal description: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.