Advances in Data Analysis and Classification: Latest Articles

Model-based clustering of functional data via mixtures of t distributions
IF 1.4 · Zone 4 · Computer Science
Advances in Data Analysis and Classification, vol. 18, no. 3, pp. 563-595 · Pub Date: 2023-05-12 · DOI: 10.1007/s11634-023-00542-w
Cristina Anton, Iain Smith
Abstract: We propose a procedure, called T-funHDDC, for clustering multivariate functional data with outliers, which extends the functional high dimensional data clustering (funHDDC) method (Schmutz et al. in Comput Stat 35:1101–1131, 2020) by considering a mixture of multivariate t distributions. We define a family of latent mixture models following the approach used for the parsimonious models considered in funHDDC, with the degrees of freedom of the multivariate t distributions either constrained to be equal across the mixture components or left unconstrained. The parameters of these models are estimated using an expectation maximization algorithm. In addition to proposing the T-funHDDC method, we add a family of parsimonious models to C-funHDDC, which is an alternative method for clustering multivariate functional data with outliers based on a mixture of contaminated normal distributions (Amovin-Assagba et al. in Comput Stat Data Anal 174:107496, 2022). We compare T-funHDDC, C-funHDDC, and other existing methods on simulated functional data with outliers and on real-world data. T-funHDDC outperforms funHDDC when applied to functional data with outliers, and its good performance makes it an alternative to C-funHDDC. We also apply the T-funHDDC method to the analysis of traffic flow in Edmonton, Canada.
Citations: 0
Finite mixture of hidden Markov models for tensor-variate time series data
IF 1.4 · Zone 4 · Computer Science
Advances in Data Analysis and Classification, vol. 18, no. 3, pp. 545-562 · Pub Date: 2023-04-29 · DOI: 10.1007/s11634-023-00540-y
Abdullah Asilkalkan, Xuwen Zhu, Shuchismita Sarkar
Abstract: The need to model data with higher dimensions, such as a tensor-variate framework where each observation is considered a three-dimensional object, is increasing due to rapid improvements in computational power and data storage capabilities. In this study, a finite mixture of hidden Markov models for tensor-variate time series data is developed. Simulation studies demonstrate high classification accuracy for both cluster and regime IDs. To further validate the usefulness of the proposed model, it is applied to real-life data with promising results.
Citations: 0
Application of distance standard deviation in functional data analysis
IF 1.4 · Zone 4 · Computer Science
Advances in Data Analysis and Classification, vol. 18, no. 2, pp. 431-454 · Pub Date: 2023-04-21 · DOI: 10.1007/s11634-023-00538-6
Mirosław Krzyśko, Łukasz Smaga
Abstract: This paper concerns the measurement and testing of equality of variability of functional data. We apply the distance standard deviation constructed based on distance correlation, which was recently introduced as a measure of spread. For functional data, the distance standard deviation seems to measure different kinds of variability, not only scale differences. Moreover, the distance standard deviation is just one real number, and for this reason, it is of more practical value than the covariance function, which is a more difficult object to interpret. For testing equality of variability in two groups, we propose a permutation method based on centered observations, which controls the type I error level much better than the standard permutation method. We also consider the applicability of other correlations to measure the variability of functional data. The finite sample properties of two-sample tests are investigated in extensive simulation studies. We also illustrate their use in five real data examples based on various data sets.
PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00538-6.pdf
Citations: 0
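The distance standard deviation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the usual sample definition (the square root of the sample distance variance, computed from the double-centered matrix of pairwise distances) and treats each observation as a vector, for example a curve evaluated on a common grid. The function name and the discretization are assumptions made for illustration; the paper's own estimator and its permutation test are not reproduced here.

```python
import numpy as np

def distance_standard_deviation(sample) -> float:
    """Sample distance standard deviation (V-statistic form).

    Each row of `sample` is one observation; a 1-D input is treated
    as n scalar observations. Distances are Euclidean (an assumption).
    """
    x = np.asarray(sample, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    # Pairwise Euclidean distances between observations.
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    # Double centering: subtract row and column means, add the grand mean.
    a = d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()
    # Distance variance is the mean of squared centered distances.
    return float(np.sqrt(np.mean(a ** 2)))

# Hypothetical toy data: five scalar observations.
toy = np.array([0.0, 1.0, 2.0, 4.0, 7.0])
spread = distance_standard_deviation(toy)
```

Under this definition the value is zero for a constant sample, unchanged by shifting the data, and multiplied by |c| when the data are multiplied by c, which is what one expects from a measure of spread.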
An enhanced version of the SSA-HJ-biplot for time series with complex structure
IF 1.4 · Zone 4 · Computer Science
Advances in Data Analysis and Classification, vol. 18, no. 2, pp. 409-430 · Pub Date: 2023-04-18 · DOI: 10.1007/s11634-023-00541-x
Alberto Silva, Adelaide Freitas
Abstract: HJ-biplots can be used with singular spectral analysis to visualize and identify patterns in univariate time series. Named SSA-HJ-biplots, these graphs guarantee the simultaneous representation of the trajectory matrix’s rows and columns with maximum quality in the same factorial axes system and allow visualization of the separation of the time series components. Structural changes in the time series can make it challenging to visualize the components’ separation and lead to erroneous conclusions. This paper discusses an improved version of the SSA-HJ-biplot capable of handling this type of complexity. After separating the series’ signal and identifying points where structural changes occurred using multivariate techniques, the SSA-HJ-biplot is applied separately to the series’ homogeneous intervals, with the aim of improving the visualization of the components’ separation.
Citations: 0
Composite likelihood methods for parsimonious model-based clustering of mixed-type data
IF 1.4 · Zone 4 · Computer Science
Advances in Data Analysis and Classification, vol. 18, no. 2, pp. 381-407 · Pub Date: 2023-04-09 · DOI: 10.1007/s11634-023-00539-5
Monia Ranalli, Roberto Rocci
Abstract: In this paper, we propose twelve parsimonious models for clustering mixed-type (ordinal and continuous) data. The dependence among the different types of variables is modeled by assuming that ordinal and continuous data follow a multivariate finite mixture of Gaussians, where the ordinal variables are a discretization of some continuous variates of the mixture. The general class of parsimonious models is based on a factor decomposition of the component-specific covariance matrices. Parameter estimation is carried out using an EM-type algorithm based on composite likelihood. The proposal is evaluated through a simulation study and an application to real data.
PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00539-5.pdf
Citations: 0
Identification of representative trees in random forests based on a new tree-based distance measure
IF 1.4 · Zone 4 · Computer Science
Advances in Data Analysis and Classification, vol. 18, no. 2, pp. 363-380 · Pub Date: 2023-03-16 · DOI: 10.1007/s11634-023-00537-7
Björn-Hergen Laabs, Ana Westenberger, Inke R. König
Abstract: In life sciences, random forests are often used to train predictive models. However, gaining any explanatory insight into the mechanics leading to a specific outcome is rather complex, which impedes the implementation of random forests into clinical practice. By simplifying a complex ensemble of decision trees to a single most representative tree, it is assumed to be possible to observe common tree structures, the importance of specific features and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants. Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. Thus, we developed a new tree-based distance measure, which incorporates more of the underlying tree structure than other metrics. We compared our new method with the existing metrics in an extensive simulation study and applied it to predict the age at onset based on a set of genetic risk factors in a clinical data set. In our simulation study we were able to show the advantages of our weighted splitting variable approach. Our real data application revealed that representative trees are not only able to replicate the results from a recent genome-wide association study, but also can give additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the R package timbR (https://github.com/imbs-hl/timbR).
PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00537-7.pdf
Citations: 0
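The abstract above characterizes a representative tree as the one with minimal distance to all other trees. The selection step alone can be sketched as follows, assuming a pairwise distance matrix between trees is already available. The matrix values and the function name are hypothetical; the paper's tree-based distance measure and the timbR API are not reproduced here.

```python
import numpy as np

def most_representative(distances) -> int:
    """Return the index of the item with the smallest total distance to all others."""
    d = np.asarray(distances, dtype=float)
    if d.ndim != 2 or d.shape[0] != d.shape[1]:
        raise ValueError("distances must be a square matrix")
    # Summing each row gives the total distance from one tree to every other tree.
    return int(np.argmin(d.sum(axis=1)))

# Hypothetical pairwise distances between four trees (symmetric, zero diagonal).
toy_distances = np.array([
    [0.0, 0.2, 0.7, 0.6],
    [0.2, 0.0, 0.5, 0.4],
    [0.7, 0.5, 0.0, 0.3],
    [0.6, 0.4, 0.3, 0.0],
])
representative_index = most_representative(toy_distances)  # row sums: 1.5, 1.1, 1.5, 1.3
```

In this toy matrix the second tree (index 1) has the smallest row sum. Ties are resolved in favor of the first minimal index, which is a property of `np.argmin` rather than a choice made in the paper.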
Threshold-based Naïve Bayes classifier
IF 1.4 · Zone 4 · Computer Science
Advances in Data Analysis and Classification, vol. 18, no. 2, pp. 325-361 · Pub Date: 2023-03-14 · DOI: 10.1007/s11634-023-00536-8
Maurizio Romano, Giulia Contu, Francesco Mola, Claudio Conversano
Abstract: The Threshold-based Naïve Bayes (Tb-NB) classifier is introduced as a (simple) improved version of the original Naïve Bayes classifier. Tb-NB extracts the sentiment from a Natural Language text corpus and allows the user not only to predict how positive (negative) a sentence is but also to quantify a sentiment with a numeric value. It is based on the estimation of a single threshold value that helps define a decision rule that classifies a text as a positive (negative) opinion based on its content. One of the main advantages of Tb-NB is the possibility of using its results as the input of post-hoc analysis aimed at observing how the quality associated with the different dimensions of a product or a service or, in a mirrored fashion, the different dimensions of customer satisfaction evolve in time or change with respect to different locations. The effectiveness of Tb-NB is evaluated by analyzing data concerning the tourism industry and, specifically, hotel guests’ reviews from all hotels located in the Sardinian region and available on Booking.com. Moreover, Tb-NB is compared with other popular classifiers used in sentiment analysis in terms of model accuracy, resistance to noise and computational efficiency.
PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00536-8.pdf
Citations: 0
Editorial for ADAC issue 1 of volume 17 (2023)
IF 1.6 · Zone 4 · Computer Science
Advances in Data Analysis and Classification, vol. 17, no. 1, pp. 1-4 · Pub Date: 2023-02-17 · DOI: 10.1007/s11634-023-00535-9
Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs
PDF: https://link.springer.com/content/pdf/10.1007/s11634-023-00535-9.pdf
Citations: 0
Clustering data with non-ignorable missingness using semi-parametric mixture models assuming independence within components
IF 1.6 · Zone 4 · Computer Science
Advances in Data Analysis and Classification, vol. 17, no. 4, pp. 1081-1122 · Pub Date: 2023-02-12 · DOI: 10.1007/s11634-023-00534-w
Marie du Roy de Chaumaray, Matthieu Marbac
Abstract: We propose a semi-parametric clustering model assuming conditional independence given the component. One advantage is that this model can handle non-ignorable missingness. The model defines each component as a product of univariate probability distributions but makes no assumption on the form of each univariate density. Note that the mixture model is used for clustering but not for estimating the density of the full variables (observed and unobserved). Estimation is performed by maximizing an extension of the smoothed likelihood allowing missingness. This optimization is achieved by a Majorization-Minorization algorithm. We illustrate the relevance of our approach by numerical experiments conducted on simulated data. Under mild assumptions, we show the identifiability of the model defining the distribution of the observed data and the monotonicity of the algorithm. We also propose an extension of this new method to the case of mixed-type data that we illustrate on a real data set. The proposed method is implemented in the R package MNARclust available on CRAN.
Citations: 0
Robust instance-dependent cost-sensitive classification
IF 1.6 · Zone 4 · Computer Science
Advances in Data Analysis and Classification, vol. 17, no. 4, pp. 1057-1079 · Pub Date: 2023-01-07 · DOI: 10.1007/s11634-022-00533-3
Simon De Vos, Toon Vanderschueren, Tim Verdonck, Wouter Verbeke
Abstract: Instance-dependent cost-sensitive (IDCS) learning methods have proven useful for binary classification tasks where individual instances are associated with variable misclassification costs. However, we demonstrate in this paper by means of a series of experiments that IDCS methods are sensitive to noise and outliers in relation to instance-dependent misclassification costs and their performance strongly depends on the cost distribution of the data sample. Therefore, we propose a generic three-step framework to make IDCS methods more robust: (i) detect outliers automatically, (ii) correct outlying cost information in a data-driven way, and (iii) construct an IDCS learning method using the adjusted cost information. We apply this framework to cslogit, a logistic regression-based IDCS method, to obtain its robust version, which we name r-cslogit. The robustness of this approach is introduced in steps (i) and (ii), where we make use of robust estimators to detect and impute outlying costs of individual instances. The newly proposed r-cslogit method is tested on synthetic and semi-synthetic data and proven to be superior in terms of savings compared to its non-robust counterpart for variable levels of noise and outliers. All our code is made available online at https://github.com/SimonDeVos/Robust-IDCS.
Citations: 0
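Steps (i) and (ii) of the framework in the abstract above (detect outlying costs, then correct them) can be sketched in a simplified form. The abstract does not say which robust estimators the authors use, so this sketch swaps in a median and MAD rule for detection and clipping to the resulting bounds for correction. Both choices, the function name, and the threshold of 3 are assumptions made for illustration and are not the method implemented in the Robust-IDCS repository.

```python
import numpy as np

def correct_outlying_costs(costs, threshold=3.0):
    """Flag costs far from the median (in MAD units) and clip them to the bounds.

    Returns (corrected_costs, is_outlier). This is an illustrative stand-in
    for steps (i) and (ii); step (iii), fitting the classifier, is not shown.
    """
    c = np.asarray(costs, dtype=float)
    median = np.median(c)
    # 1.4826 rescales the MAD so it estimates the standard deviation under normality.
    mad = 1.4826 * np.median(np.abs(c - median))
    if mad == 0:
        # No robust spread available: leave the costs unchanged.
        return c.copy(), np.zeros(c.shape, dtype=bool)
    lower, upper = median - threshold * mad, median + threshold * mad
    is_outlier = (c < lower) | (c > upper)
    return np.clip(c, lower, upper), is_outlier

# Hypothetical misclassification costs with one extreme value.
toy_costs = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
corrected, flagged = correct_outlying_costs(toy_costs)
```

The corrected costs would then be passed to whichever instance-dependent cost-sensitive learner is being trained, in place of the raw costs.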