Variational inference for semiparametric Bayesian novelty detection in large datasets

IF 1.3 4区计算机科学 Q2 STATISTICS & PROBABILITY

Advances in Data Analysis and Classification Pub Date : 2023-12-04 DOI:10.1007/s11634-023-00569-z

Luca Benedetti, Eric Boniardi, Leonardo Chiani, Jacopo Ghirri, Marta Mastropietro, Andrea Cappozzo, Francesco Denti

{"title":"Variational inference for semiparametric Bayesian novelty detection in large datasets","authors":"Luca Benedetti, Eric Boniardi, Leonardo Chiani, Jacopo Ghirri, Marta Mastropietro, Andrea Cappozzo, Francesco Denti","doi":"10.1007/s11634-023-00569-z","DOIUrl":null,"url":null,"abstract":"<div><p>After being trained on a fully-labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, also known as Brand, recently introduced in the literature. Leveraging on a model-based mixture representation, Brand allows clustering the test observations into known training terms or a single novelty term. Furthermore, the novelty term is modeled with a Dirichlet Process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale up Brand applicability to large datasets, we propose to resort to a variational Bayes approach, providing an efficient algorithm for posterior approximation. We demonstrate a significant gain in efficiency and excellent classification performance with thorough simulation studies. Finally, to showcase its applicability, we perform a novelty detection analysis using the openly-available <span>Statlog</span> dataset, a large collection of satellite imaging spectra, to search for novel soil types.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"681 - 703"},"PeriodicalIF":1.3000,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-023-00569-z.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in Data Analysis and Classification","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s11634-023-00569-z","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}

引用次数: 0

Abstract

After being trained on a fully-labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, also known as Brand, recently introduced in the literature. Leveraging on a model-based mixture representation, Brand allows clustering the test observations into known training terms or a single novelty term. Furthermore, the novelty term is modeled with a Dirichlet Process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale up Brand applicability to large datasets, we propose to resort to a variational Bayes approach, providing an efficient algorithm for posterior approximation. We demonstrate a significant gain in efficiency and excellent classification performance with thorough simulation studies. Finally, to showcase its applicability, we perform a novelty detection analysis using the openly-available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.

Abstract Image

查看原文本刊更多论文

大数据集中半参数贝叶斯新颖性检测的变分推理

在一个完全标记的训练集上训练后，观察结果被分组到一定数量的已知类中，新颖性检测方法的目标是对未标记的测试集的实例进行分类，同时允许存在以前未见过的类。这些模型在许多领域都很有价值，从社会网络和食品掺假分析到可能存在进化种群的生物学。在本文中，我们重点研究了最近在文献中介绍的两阶段贝叶斯半参数新颖性检测器，也称为Brand。利用基于模型的混合表示，Brand允许将测试观察聚类到已知的训练项或单个新项中。此外，用Dirichlet过程混合模型对新颖性项进行建模，以灵活地捕获与已知模式的任何偏离。Brand最初是使用MCMC方案来估计的，当应用于高维数据时，这种方案的成本非常高。为了扩大Brand对大型数据集的适用性，我们建议采用变分贝叶斯方法，提供一种有效的后验逼近算法。通过深入的仿真研究，我们证明了该方法在效率和分类性能方面的显著提高。最后，为了展示其适用性，我们使用公开可用的Statlog数据集(大量卫星成像光谱集合)进行新颖性检测分析，以搜索新的土壤类型。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Advances in Data Analysis and Classification STATISTICS & PROBABILITY-

CiteScore

3.40

自引率

6.20%

发文量

审稿时长

>12 weeks

期刊介绍： The international journal Advances in Data Analysis and Classification (ADAC) is designed as a forum for high standard publications on research and applications concerning the extraction of knowable aspects from many types of data. It publishes articles on such topics as structural, quantitative, or statistical approaches for the analysis of data; advances in classification, clustering, and pattern recognition methods; strategies for modeling complex data and mining large data sets; methods for the extraction of knowledge from data, and applications of advanced methods in specific domains of practice. Articles illustrate how new domain-specific knowledge can be made available from data by skillful use of data analysis methods. The journal also publishes survey papers that outline, and illuminate the basic ideas and techniques of special approaches.