Yuhang Zhang, Jiani Hu, Dongchao Wen, Weihong Deng
{"title":"Unsupervised evaluation for out-of-distribution detection","authors":"Yuhang Zhang , Jiani Hu , Dongchao Wen , Weihong Deng","doi":"10.1016/j.patcog.2024.111212","DOIUrl":null,"url":null,"abstract":"<div><div>We need to acquire labels for test sets to evaluate the performance of existing out-of-distribution (OOD) detection methods. In real-world deployment, it is laborious to label each new test set as there are various OOD data with different difficulties. However, we need to use different OOD data to evaluate OOD detection methods as their performance varies widely. Thus, we propose evaluating OOD detection methods on unlabeled test sets, which can free us from labeling each new OOD test set. It is a non-trivial task as we do not know which sample is correctly detected without OOD labels, and the evaluation metric like AUROC cannot be calculated. In this paper, we address this important yet untouched task for the first time. Inspired by the bimodal distribution of OOD detection test sets, we propose an unsupervised indicator named Gscore that has a certain relationship with the OOD detection performance; thus, we could use neural networks to learn that relationship to predict OOD detection performance without OOD labels. Through extensive experiments, we validate that there does exist a strong quantitative correlation, which is almost linear, between Gscore and the OOD detection performance. Additionally, we introduce Gbench, a new benchmark consisting of 200 different real-world OOD datasets, to test the performance of Gscore. Our results show that Gscore achieves state-of-the-art performance compared with other unsupervised evaluation methods and generalizes well with different in-distribution (ID)/OOD datasets, OOD detection methods, backbones, and ID:OOD ratios. Furthermore, we conduct analyses on Gbench to study the effects of backbones and ID/OOD datasets on OOD detection performance. The dataset and code will be available.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"160 ","pages":"Article 111212"},"PeriodicalIF":7.5000,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320324009634","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Evaluating the performance of existing out-of-distribution (OOD) detection methods requires labeled test sets. In real-world deployment, labeling each new test set is laborious, as OOD data vary widely in type and difficulty. Yet different OOD data are needed to evaluate OOD detection methods, since their performance varies widely across datasets. Thus, we propose evaluating OOD detection methods on unlabeled test sets, which frees us from labeling each new OOD test set. This is a non-trivial task: without OOD labels we do not know which samples are correctly detected, and evaluation metrics such as AUROC cannot be calculated. In this paper, we address this important yet previously unexplored task for the first time. Inspired by the bimodal distribution of OOD detection test sets, we propose an unsupervised indicator named Gscore that is closely related to OOD detection performance; a neural network can then learn this relationship and predict OOD detection performance without OOD labels. Through extensive experiments, we validate that a strong, almost linear quantitative correlation exists between Gscore and OOD detection performance. Additionally, we introduce Gbench, a new benchmark consisting of 200 different real-world OOD datasets, to test the performance of Gscore. Our results show that Gscore achieves state-of-the-art performance compared with other unsupervised evaluation methods and generalizes well across different in-distribution (ID)/OOD datasets, OOD detection methods, backbones, and ID:OOD ratios. Furthermore, we conduct analyses on Gbench to study the effects of backbones and ID/OOD datasets on OOD detection performance. The dataset and code will be made available.
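To make the idea concrete, the sketch below illustrates how an unsupervised, label-free indicator might be paired with a learned regressor to predict AUROC on unlabeled test sets. It is not the paper's actual Gscore implementation: the two-component Gaussian mixture is an assumed stand-in for exploiting the bimodal score distribution, the separability measure is a hypothetical proxy for Gscore, and an ordinary linear regressor replaces the neural network described in the abstract. All function names (separability_indicator, fit_predictor, predict_auroc) are illustrative assumptions.

```python
# Illustrative sketch only; not the published Gscore method.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LinearRegression

def separability_indicator(scores: np.ndarray) -> float:
    """Fit a 2-component GMM to the (bimodal) OOD-score distribution and
    return a label-free measure of how well the two modes are separated."""
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores.reshape(-1, 1))
    mu = gmm.means_.ravel()
    sigma = np.sqrt(gmm.covariances_.ravel())
    # Distance between the two modes, normalized by their combined spread.
    return abs(mu[0] - mu[1]) / (sigma[0] + sigma[1] + 1e-12)

def fit_predictor(indicator_values, measured_aurocs):
    """Meta-training: on test sets where labels ARE available, pair each set's
    indicator with its measured AUROC and learn the (near-linear) relationship."""
    X = np.asarray(indicator_values).reshape(-1, 1)
    y = np.asarray(measured_aurocs)
    return LinearRegression().fit(X, y)

def predict_auroc(predictor, unlabeled_scores: np.ndarray) -> float:
    """Deployment: estimate AUROC for a new, unlabeled OOD test set
    from its detector scores alone."""
    g = separability_indicator(unlabeled_scores)
    return float(predictor.predict([[g]])[0])
```

The design choice follows directly from the bimodality observation: if ID and OOD samples form two modes in the detector's score distribution, then a label-free separability statistic over those modes should track how well the detector ranks OOD above ID, which is exactly what AUROC measures.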
About the journal:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.