Characterizing Submanifold Region for Out-of-Distribution Detection

IF 8.9 · CAS Tier 2, Computer Science · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Xuhui Li;Zhen Fang;Yonggang Zhang;Ning Ma;Jiajun Bu;Bo Han;Haishuai Wang
DOI: 10.1109/TKDE.2024.3468629
Journal: IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 1, pp. 130-147
Published: 2024-10-04 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10705965/
Citation count: 0

Abstract

Detecting out-of-distribution (OOD) samples poses a significant safety challenge when deploying models in open-world scenarios. Recent works assume that OOD and in-distribution (ID) samples exhibit a distribution discrepancy, an encouraging direction for estimating uncertainty from embedding features or predicted outputs. Besides incorporating auxiliary outliers to shape the decision boundary, quantifying a "meaningful distance" in the embedding space as an uncertainty measure is a promising strategy. However, these distance-based approaches overlook the data structure and rely heavily on the high-dimensional features learned by deep neural networks, yielding unreliable distances due to the "curse of dimensionality". In this work, we propose a data-structure-aware approach that mitigates the sensitivity of distances to the "curse of dimensionality": high-dimensional features are mapped to the manifold of ID samples, leveraging the well-known manifold assumption. Specifically, we present a novel distance, termed the tangent distance, which tackles the issue of generalizing the meaningfulness of distances on test samples to detect OOD inputs. Inspired by manifold learning for adversarial examples, where the adversarial region's probability density lies close to the direction orthogonal to the manifold, and by the observation that OOD and adversarial samples share a common characteristic, namely imperceptible perturbations with distribution shift, we propose that OOD samples lie relatively far from the ID manifold. The tangent distance directly computes the Euclidean distance between a sample and the nearest submanifold space, instantiated as the linear approximation of a local region on the manifold. We provide empirical and theoretical insights demonstrating the effectiveness of OOD uncertainty measurement on the low-dimensional subspace.
Extensive experiments show that the tangent distance performs competitively with other post hoc OOD detection baselines on common and large-scale benchmarks, and our theoretical analysis supports the claim that ID samples are likely to reside in high-density regions, explaining the effectiveness of internal connections among ID data.
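The abstract describes scoring a sample by its Euclidean distance to a linear approximation of the local ID submanifold. The following is a minimal sketch of that general idea, not the authors' implementation: it estimates the local tangent space from the top principal directions of a sample's k nearest ID features, then measures the residual after projecting onto that affine subspace. All names (`tangent_distance`, `k`, `d`) and the local-PCA construction are illustrative assumptions.

```python
# Sketch of a tangent-distance-style OOD score (assumed construction, not the
# paper's code): approximate the local ID submanifold by the affine span of the
# top-d principal directions of the k nearest ID features, and score a sample
# by its Euclidean distance to that linear approximation.
import numpy as np

def tangent_distance(z, id_feats, k=50, d=8):
    """Distance from feature z to a local linear (tangent) approximation
    of the ID manifold, built from z's k nearest ID features."""
    dists = np.linalg.norm(id_feats - z, axis=1)
    nbrs = id_feats[np.argsort(dists)[:k]]      # k nearest ID features
    mu = nbrs.mean(axis=0)                      # local anchor on the manifold
    # top-d right singular vectors span the estimated tangent space
    _, _, vt = np.linalg.svd(nbrs - mu, full_matrices=False)
    basis = vt[:d]                              # (d, D) orthonormal rows
    r = z - mu
    r_tangent = basis.T @ (basis @ r)           # projection onto tangent space
    return np.linalg.norm(r - r_tangent)        # residual = distance to submanifold

rng = np.random.default_rng(0)
# toy ID data lying exactly on a 2-D plane embedded in 10-D space
id_feats = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
on_manifold = id_feats[0]
off_manifold = on_manifold + 5.0 * rng.normal(size=10)
print(tangent_distance(on_manifold, id_feats[1:], d=2))   # near zero
print(tangent_distance(off_manifold, id_feats[1:], d=2))  # large
```

In this toy setting, a point on the plane has near-zero residual while a perturbed point scores high, matching the abstract's claim that OOD samples lie far from the ID manifold; a higher score would flag the input as OOD.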
Source journal
IEEE Transactions on Knowledge and Data Engineering
Category: Engineering Technology - Engineering: Electrical & Electronic
CiteScore: 11.70
Self-citation rate: 3.40%
Articles published: 515
Review time: 6 months
Journal description: The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools, and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.