Advances in Data Analysis and Classification: Latest Articles

Loss-guided stability selection
IF 1.6, Zone 4 (Computer Science)
Advances in Data Analysis and Classification, Pub Date: 2023-12-15, DOI: 10.1007/s11634-023-00573-3
Tino Werner
Abstract: In modern data analysis, sparse model selection becomes inevitable once the number of predictor variables is very high. It is well-known that model selection procedures like the Lasso or Boosting tend to overfit on real data. The celebrated Stability Selection overcomes these weaknesses by aggregating models, based on subsamples of the training data, followed by choosing a stable predictor set which is usually much sparser than the predictor sets from the raw models. The standard Stability Selection is based on a global criterion, namely the per-family error rate, while additionally requiring expert knowledge to suitably configure the hyperparameters. Model selection depends on the loss function, i.e., predictor sets selected w.r.t. some particular loss function differ from those selected w.r.t. some other loss function. Therefore, we propose a Stability Selection variant which respects the chosen loss function via an additional validation step based on out-of-sample validation data, optionally enhanced with an exhaustive search strategy. Our Stability Selection variants are widely applicable and user-friendly. Moreover, our Stability Selection variants can avoid the issue of severe underfitting, which affects the original Stability Selection for noisy high-dimensional data, so our priority is not to avoid false positives at all costs but to result in a sparse stable model with which one can make predictions. Experiments where we consider both regression and binary classification with Boosting as model selection algorithm reveal a significant precision improvement compared to raw Boosting models while not suffering from any of the mentioned issues of the original Stability Selection.
URL: https://doi.org/10.1007/s11634-023-00573-3
Vol. 199, no. 1 (pages not listed).
Citations: 0
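The subsample-and-aggregate step that the abstract describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the base selector `top_q_by_covariance`, the toy data, and the threshold 0.8 are all hypothetical stand-ins (the paper uses Boosting), and the loss-guided validation step is not shown.

```python
import random

def top_q_by_covariance(X, y, rows, q):
    """Hypothetical base selector: the q columns with the largest |covariance| with y."""
    n = len(rows)
    y_mean = sum(y[i] for i in rows) / n
    scores = []
    for j in range(len(X[0])):
        x_mean = sum(X[i][j] for i in rows) / n
        cov = sum((X[i][j] - x_mean) * (y[i] - y_mean) for i in rows) / n
        scores.append((abs(cov), j))
    return {j for _, j in sorted(scores, reverse=True)[:q]}

def selection_frequencies(X, y, q, n_subsamples=50, seed=0):
    """Fraction of half-size subsamples on which each column is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        rows = rng.sample(range(n), n // 2)  # subsample without replacement
        for j in top_q_by_covariance(X, y, rows, q):
            counts[j] += 1
    return [c / n_subsamples for c in counts]

def stable_set(freqs, threshold):
    """Keep the columns whose selection frequency reaches the threshold."""
    return [j for j, f in enumerate(freqs) if f >= threshold]

# Toy data (assumption): y depends only on column 0.
X = [[i, i % 2, i % 3] for i in range(40)]
y = [2.0 * row[0] for row in X]
freqs = selection_frequencies(X, y, q=1)
stable = stable_set(freqs, threshold=0.8)
print(freqs, stable)
```

In a loss-guided variant as described in the abstract, the threshold would presumably be chosen by comparing candidate stable sets on held-out validation data rather than being fixed in advance.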
A fresh look at mean-shift based modal clustering
IF 1.4, Zone 4 (Computer Science)
Advances in Data Analysis and Classification, Pub Date: 2023-12-14, DOI: 10.1007/s11634-023-00575-1
Jose Ameijeiras-Alonso, Jochen Einbeck
Abstract: Modal clustering is an unsupervised learning technique where cluster centers are identified as the local maxima of nonparametric probability density estimates. A natural algorithmic engine for the computation of these maxima is the mean shift procedure, which is essentially an iteratively computed chain of local means. We revisit this technique, focusing on its link to kernel density gradient estimation, in this course proposing a novel concept for bandwidth selection based on the concept of a critical bandwidth. Furthermore, in the one-dimensional case, an inverse version of the mean shift is developed to provide a novel approach for the estimation of antimodes, which is then used to identify cluster boundaries. A simulation study is provided which assesses, in the univariate case, the classification accuracy of the mean-shift based clustering approach. Three (univariate and multivariate) examples from the fields of philately, engineering, and imaging, illustrate how modal clusterings identified through mean shift based methods relate directly and naturally to physical properties of the data-generating system. Solutions are proposed to deal computationally efficiently with large data sets.
Vol. 18, no. 4, pp. 1067-1095.
Citations: 0
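The "iteratively computed chain of local means" mentioned in the abstract can be illustrated with a small univariate sketch. The Gaussian kernel, the fixed bandwidth of 0.5, the merge tolerance, and the toy data are assumptions chosen for illustration; the paper's critical-bandwidth selection and inverse mean shift are not reproduced here.

```python
import math

def mean_shift_1d(data, start, bandwidth, tol=1e-8, max_iter=1000):
    """Repeat the Gaussian-weighted local mean from `start` until it stops moving."""
    x = float(start)
    for _ in range(max_iter):
        weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data]
        new_x = sum(w * xi for w, xi in zip(weights, data)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

def modal_clusters(data, bandwidth, merge_tol=1e-3):
    """Label each point by the mode its mean-shift chain converges to."""
    modes, labels = [], []
    for xi in data:
        m = mean_shift_1d(data, xi, bandwidth)
        for k, existing in enumerate(modes):
            if abs(existing - m) < merge_tol:
                labels.append(k)
                break
        else:
            # No existing mode is close enough: register a new one.
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels

# Toy data (assumption): two well-separated groups around 1 and 5.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
modes, labels = modal_clusters(data, bandwidth=0.5)
print(modes, labels)
```

The result depends strongly on the bandwidth: a much larger value would merge both groups into a single mode, which is why bandwidth selection is a central concern of the paper.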
A probabilistic method for reconstructing the Foreign Direct Investments network in search of ultimate host economies
IF 1.6, Zone 4 (Computer Science)
Advances in Data Analysis and Classification, Pub Date: 2023-12-08, DOI: 10.1007/s11634-023-00571-5
Nadia Accoto, Valerio Astuti, Costanza Catalano
Abstract: The Ultimate Host Economies (UHEs) of a given country are defined as the ultimate destinations of Foreign Direct Investment (FDI) originating in that country. Bilateral FDI statistics struggle to identify them due to the non-negligible presence of conduit jurisdictions, which provide attractive intermediate destinations for pass-through investments due to favorable tax regimes. At the same time, determining UHEs is crucial for understanding the actual paths followed by FDI among increasingly interdependent economies. In this paper, we first reconstruct the global FDI network through mirroring and clustering techniques, starting from data collected by the International Monetary Fund. Then we provide a method for computing an (approximate) distribution of the UHEs of a country by using a probabilistic approach to this network, based on Markov chains. More specifically, we analyze the Italian case.
URL: https://doi.org/10.1007/s11634-023-00571-5
Vol. 251, no. 1 (pages not listed).
Citations: 0
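One generic way to turn pass-through investment into a distribution over final destinations is an absorbing Markov chain. The sketch below is an assumption-laden illustration, not the authors' method: the jurisdiction names, the absorption probabilities, and the transit shares are invented, and the function `ultimate_hosts` is hypothetical.

```python
def ultimate_hosts(direct, absorb, transit, tol=1e-12, max_steps=10_000):
    """Propagate investment mass until (almost) all of it has been absorbed.

    direct[c]     : share of the origin's investment sent first to c
    absorb[c]     : probability that investment reaching c stays in c
    transit[c][d] : probability that investment reaching c is passed on to d
                    (for each c, absorb[c] plus the transit shares sums to 1)
    """
    final = {}
    mass = dict(direct)
    for _ in range(max_steps):
        if sum(mass.values()) < tol:
            break
        next_mass = {}
        for country, m in mass.items():
            # Part of the mass settles here, the rest moves on.
            final[country] = final.get(country, 0.0) + m * absorb[country]
            for dest, p in transit.get(country, {}).items():
                next_mass[dest] = next_mass.get(dest, 0.0) + m * p
        mass = next_mass
    return final

# Hypothetical three-jurisdiction example: "Conduit" passes most investment on.
direct = {"Conduit": 0.6, "B": 0.4}
absorb = {"Conduit": 0.1, "B": 1.0, "C": 1.0}
transit = {"Conduit": {"B": 0.3, "C": 0.6}}
result = ultimate_hosts(direct, absorb, transit)
print(result)
```

In this toy case the bilateral statistics would report 60% of investment going to the conduit, while the chain attributes only 6% to it as an ultimate host.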
Variational inference for semiparametric Bayesian novelty detection in large datasets
IF 1.4, Zone 4 (Computer Science)
Advances in Data Analysis and Classification, Pub Date: 2023-12-04, DOI: 10.1007/s11634-023-00569-z
Luca Benedetti, Eric Boniardi, Leonardo Chiani, Jacopo Ghirri, Marta Mastropietro, Andrea Cappozzo, Francesco Denti
Abstract: After being trained on a fully-labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, also known as Brand, recently introduced in the literature. Leveraging on a model-based mixture representation, Brand allows clustering the test observations into known training terms or a single novelty term. Furthermore, the novelty term is modeled with a Dirichlet Process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale up Brand applicability to large datasets, we propose to resort to a variational Bayes approach, providing an efficient algorithm for posterior approximation. We demonstrate a significant gain in efficiency and excellent classification performance with thorough simulation studies. Finally, to showcase its applicability, we perform a novelty detection analysis using the openly-available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.
Vol. 18, no. 3, pp. 681-703.
Open access PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00569-z.pdf
Citations: 0
Claims fraud detection with uncertain labels
IF 1.4, Zone 4 (Computer Science)
Advances in Data Analysis and Classification, Pub Date: 2023-11-30, DOI: 10.1007/s11634-023-00568-0
Félix Vandervorst, Wouter Verbeke, Tim Verdonck
Abstract: Insurance fraud is a non self-revealing type of fraud. The true historical labels (fraud or legitimate) are only as precise as the investigators’ efforts and successes to uncover them. Popular approaches of supervised and unsupervised learning fail to capture the ambiguous nature of uncertain labels. Imprecisely observed labels can be represented in the Dempster–Shafer theory of belief functions, a generalization of supervised and unsupervised learning suited to represent uncertainty. In this paper, we show that partial information from the historical investigations can add valuable, learnable information for the fraud detection system and improves its performances. We also show that belief function theory provides a flexible mathematical framework for concept drift detection and cost sensitive learning, two common challenges in fraud detection. Finally, we present an application to a real-world motor insurance claim fraud.
Vol. 18, no. 1, pp. 219-243.
Citations: 0
Robust and sparse logistic regression
IF 1.4, Zone 4 (Computer Science)
Advances in Data Analysis and Classification, Pub Date: 2023-11-27, DOI: 10.1007/s11634-023-00572-4
Dries Cornilly, Lise Tubex, Stefan Van Aelst, Tim Verdonck
Abstract: Logistic regression is one of the most popular statistical techniques for solving (binary) classification problems in various applications (e.g. credit scoring, cancer detection, ad click predictions and churn classification). Typically, the maximum likelihood estimator is used, which is very sensitive to outlying observations. In this paper, we propose a robust and sparse logistic regression estimator where robustness is achieved by means of the γ-divergence. An elastic net penalty ensures sparsity in the regression coefficients such that the model is more stable and interpretable. We show that the influence function is bounded and demonstrate its robustness properties in simulations. The good performance of the proposed estimator is also illustrated in an empirical application that deals with classifying the type of fuel used by cars.
Vol. 18, no. 3, pp. 663-679.
Citations: 0
Semiparametric mixture of linear regressions with nonparametric Gaussian scale mixture errors
IF 1.4, Zone 4 (Computer Science)
Advances in Data Analysis and Classification, Pub Date: 2023-11-23, DOI: 10.1007/s11634-023-00570-6
Sangkon Oh, Byungtae Seo
Abstract: In finite mixture of regression models, normal assumption for the errors of each regression component is typically adopted. Though this common assumption is theoretically and computationally convenient, it often produces inefficient and undesirable estimates which undermine the applicability of the model particularly in the presence of outliers. To reduce these defects, we propose to use nonparametric Gaussian scale mixture distributions for component error distributions. By this means, we can lessen the risk of misspecification and obtain robust estimators. In this paper, we study the identifiability of the proposed model and develop a feasible estimating algorithm. Numerical studies including simulation studies and real data analysis to demonstrate the performance of the proposed method are also presented.
Vol. 18, no. 1, pp. 5-31.
Citations: 0
Functional clustering of fictional narratives using Vonnegut curves
IF 1.4, Zone 4 (Computer Science)
Advances in Data Analysis and Classification, Pub Date: 2023-11-04, DOI: 10.1007/s11634-023-00567-1
Shan Zhong, David B. Hitchcock
Abstract: Motivated by a public suggestion by the famous novelist Kurt Vonnegut, we clustered functional data that represented sentiment curves for famous fictional stories. We analyzed text data from novels written between 1612 and 1925, and transformed them into curves measuring sentiment as a function of the percentage of elapsed contents of the novel. We employed sentence-level sentiment evaluation and nonparametric curve smoothing. Our clustering methods involved finding the optimal number of clusters, aligning curves using different chronological warping functions to account for phase and amplitude variation, and implementing functional K-means algorithms under the square root velocity framework. Our results revealed insights about patterns in fictional narratives that Vonnegut and others have suggested but not analyzed in a functional way.
Vol. 18, no. 4, pp. 1045-1066.
Citations: 0
A between-cluster approach for clustering skew-symmetric data
IF 1.4, Zone 4 (Computer Science)
Advances in Data Analysis and Classification, Pub Date: 2023-10-28, DOI: 10.1007/s11634-023-00566-2
Donatella Vicari, Cinzia Di Nuzzo
Abstract: In order to investigate exchanges between objects, a clustering model for skew-symmetric data is proposed, which relies on the between-cluster effects of the skew-symmetries that represent the imbalances of the observed exchanges between pairs of objects. The aim is to detect clusters of objects that share the same behaviour of exchange so that origin and destination clusters are identified. The proposed model is based on the decomposition of the skew-symmetric matrix pertaining to the imbalances between clusters into a sum of a number of off-diagonal block matrices. Each matrix can be approximated by a skew-symmetric matrix by using a truncated Singular Value Decomposition (SVD) which exploits the properties of the skew-symmetric matrices. The model is fitted in a least-squares framework and an efficient Alternating Least Squares algorithm is provided. Finally, in order to show the potentiality of the model and the features of the resulting clusters, an extensive simulation study and an illustrative application to real data are presented.
Vol. 18, no. 1, pp. 163-192.
Open access PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00566-2.pdf
Citations: 0
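As background for the term "skew-symmetric data": any square matrix of exchanges splits into a symmetric part (average exchange) and a skew-symmetric part (imbalance). The sketch below shows only this standard decomposition on an invented flow matrix; it is not the clustering model or the ALS algorithm of the paper.

```python
def skew_symmetric_part(flows):
    """Return K with K[i][j] = (F[i][j] - F[j][i]) / 2, so that K[j][i] = -K[i][j]."""
    n = len(flows)
    return [[(flows[i][j] - flows[j][i]) / 2 for j in range(n)] for i in range(n)]

# Hypothetical exchange matrix: flows[i][j] is the amount sent from object i to object j.
flows = [
    [0, 10, 4],
    [6, 0, 8],
    [2, 2, 0],
]
imbalance = skew_symmetric_part(flows)
print(imbalance)
```

A positive entry `imbalance[i][j]` means object i sends more to object j than it receives back, which is the kind of origin/destination asymmetry the abstract refers to.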
Applications of dual regularized Laplacian matrix for community detection
IF 1.4, Zone 4 (Computer Science)
Advances in Data Analysis and Classification, Pub Date: 2023-10-26, DOI: 10.1007/s11634-023-00565-3
Huan Qing, Jingli Wang
Abstract: Spectral clustering is widely used for detecting clusters in networks for community detection, while a small change on the graph Laplacian matrix could bring a dramatic improvement. In this paper, we propose a dual regularized graph Laplacian matrix and then employ it to the classical spectral clustering approach under the degree-corrected stochastic block model. If the number of communities is known as K, we consider more than K leading eigenvectors and weight them by their corresponding eigenvalues in the spectral clustering procedure to improve the performance. The improved spectral clustering method is dual regularized spectral clustering (DRSC). Theoretical analysis of DRSC shows that under mild conditions it yields stable consistent community detection. Meanwhile, we develop a strategy by taking advantage of DRSC and Newman’s modularity to estimate the number of communities K. We compare the performance of DRSC with several spectral methods and investigate the behaviors of our strategy for estimating K by substantial simulated networks and real-world networks. Numerical results show that DRSC enjoys satisfactory performance and our strategy on estimating K performs accurately and consistently, even in cases where there is only one community in a network.
Vol. 18, no. 4, pp. 1001-1043.
Citations: 0