Advances in Data Analysis and Classification: Latest Articles

Determinantal consensus clustering
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date : 2022-08-25 DOI: 10.1007/s11634-022-00514-6
Serge Vicente, Alejandro Murua-Sazo
Abstract: Random restart of a given algorithm produces many partitions that can be aggregated to yield a consensus clustering. Ensemble methods have been recognized as more robust approaches for data clustering than single clustering algorithms. We propose the use of determinantal point processes (DPPs) for the random restart of clustering algorithms based on initial sets of center points, such as k-medoids or k-means. The relation between DPPs and kernel-based methods makes DPPs suitable to describe and quantify similarity between objects. DPPs favor diversity of the center points in initial sets, so that sets with similar points have less chance of being generated than sets with very distinct points. Most current initial sets are generated with center points sampled uniformly at random. We show through extensive simulations that, contrary to DPPs, this technique fails both to ensure diversity and to obtain a good coverage of all data facets. These two properties are key to the good performance of DPPs. Simulations with artificial datasets and applications to real datasets show that determinantal consensus clustering outperforms consensus clusterings which are based on uniform random sampling of center points.
Volume 17, issue 4, pp. 829-858.
Citations: 1
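The diversity property described in the abstract of "Determinantal consensus clustering" can be illustrated with a small sketch. It assumes an L-ensemble DPP, where the probability of a set S is proportional to det(L_S), and a Gaussian kernel with a hypothetical `bandwidth` parameter; neither choice is taken from the paper, and the function names are illustrative only.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian similarity between two points (an assumed kernel choice)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def pair_weight(x, y, bandwidth=1.0):
    """Unnormalized L-ensemble weight det(L_S) for the two-point set S = {x, y}.

    With a Gaussian kernel the diagonal entries are 1, so the 2x2
    determinant reduces to 1 - k(x, y)^2.
    """
    k = gaussian_kernel(x, y, bandwidth)
    return 1.0 - k * k


close_pair = pair_weight((0.0, 0.0), (0.1, 0.0))  # nearly identical centers
far_pair = pair_weight((0.0, 0.0), (3.0, 0.0))    # well-separated centers
print(close_pair < far_pair)  # the well-separated pair gets the larger weight
```

Under these assumptions, a pair of near-duplicate center points receives a weight close to zero, whereas uniform random sampling would treat both pairs as equally likely.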
Sequential classification of customer behavior based on sequence-to-sequence learning with gated-attention neural networks
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date : 2022-08-24 DOI: 10.1007/s11634-022-00517-3
Licheng Zhao, Yi Zuo, Katsutoshi Yada
Abstract: During the last decade, an increasing number of supermarkets have begun to use RFID technology to track consumers' in-store movements and collect data on their shopping behavior. Marketers hope that such new types of RFID data will improve the accuracy of existing customer segmentation and provide effective marketing positioning from the customer's perspective. Therefore, this paper presents an integrated work on combining RFID data with traditional point of sales (POS) data, and proposes a sequential classification-based model to classify and identify consumers' purchasing behavior. We chose an island area of the supermarket to perform the tracking experiment and collected customer behavioral data for two months. RFID data are used to extract behavior explanatory variables, such as residence time and wandering direction. For these customers, we extracted their purchasing historical data for the past three months from the POS system to define customer background and segmentation. Finally, this paper proposes a novel classification model based on sequence-to-sequence (Seq2seq) learning architecture. The encoder–decoder of Seq2seq uses an attention mechanism to pursue sequential inputs, with gating units in the encoder and decoder adjusting the output weights based on the input variables. The experimental results showed that the proposed model has a higher accuracy and area under curve value for customer classification and recognition compared with other benchmark models. Furthermore, the validity of behavioral description variables among heterogeneous customers was verified by adjusting the attention mechanism.
Volume 17, issue 3, pp. 549-581.
Citations: 0
Identification of representative trees in random forests based on a new tree-based distance measure
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date : 2022-08-19 DOI: 10.1101/2022.05.15.492004
Björn-Hergen Laabs von Holt, A. Westenberger, I. König
Abstract: In life sciences, random forests are often used to train predictive models. However, gaining any explanatory insight into the mechanics leading to a specific outcome is rather complex, which impedes the implementation of random forests into clinical practice. By simplifying a complex ensemble of decision trees to a single most representative tree, it is assumed to be possible to observe common tree structures, the importance of specific features and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants. Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. Thus, we developed a new tree-based distance measure, which incorporates more of the underlying tree structure than other metrics. We compared our new method with the existing metrics in an extensive simulation study and applied it to predict the age at onset based on a set of genetic risk factors in a clinical data set. In our simulation study we were able to show the advantages of our weighted splitting variable approach. Our real data application revealed that representative trees are not only able to replicate the results from a recent genome-wide association study, but also can give additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the R package timbR (https://github.com/imbs-hl/timbR).
Volume 30, issue 5, pp. 1-18.
Citations: 0
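The idea in the abstract above, that a representative tree is the one with minimal distance to all other trees, can be sketched as a medoid search. The distance used here is only a stand-in (a Jaccard distance between the sets of splitting variables, with each tree reduced to that set); it is not the weighted splitting variable measure developed in the paper, and all names are hypothetical.

```python
def split_variable_distance(vars_a, vars_b):
    """Jaccard distance between two sets of splitting variables (stand-in metric)."""
    union = vars_a | vars_b
    if not union:
        return 0.0  # two trees without any split are treated as identical
    return 1.0 - len(vars_a & vars_b) / len(union)


def most_representative(trees):
    """Index of the tree whose summed distance to all other trees is smallest."""
    totals = [
        sum(split_variable_distance(tree, other) for other in trees)
        for tree in trees
    ]
    return min(range(len(trees)), key=totals.__getitem__)


# Each tree is reduced to the set of variables it splits on (illustrative data).
forest = [{"x1", "x2"}, {"x1", "x2", "x3"}, {"x1", "x3"}, {"x4"}]
representative_index = most_representative(forest)
print(representative_index)
```

Any pairwise tree distance can be substituted for `split_variable_distance` without changing the selection step.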
Localization processes for functional data analysis
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date : 2022-08-19 DOI: 10.1007/s11634-022-00512-8
Antonio Elías, Raúl Jiménez, J. E. Yukich
Abstract: We propose an alternative to k-nearest neighbors for functional data whereby the approximating neighboring curves are piecewise functions built from a functional sample. Using a locally defined distance function that satisfies stabilization criteria, we establish pointwise and global approximation results in function spaces when the number of data curves is large. We exploit this feature to develop the asymptotic theory when a finite number of curves is observed at time-points given by an i.i.d. sample whose cardinality increases up to infinity. We use these results to investigate the problem of estimating unobserved segments of a partially observed functional data sample as well as to study the problem of functional classification and outlier detection. For such problems our methods are competitive with and sometimes superior to benchmark predictions in the field. The R package localFDA provides routines for computing the localization processes and the estimators proposed in this article.
Volume 17, issue 2, pp. 485-517.
Citations: 0
Editorial for ADAC issue 3 of volume 16 (2022)
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date : 2022-08-16 DOI: 10.1007/s11634-022-00511-9
Maurizio Vichi, Andrea Ceroli, Hans A. Kestler, Akinori Okada, Claus Weihs
Volume 16, issue 3, pp. 487-490.
Citations: 0
Model based clustering of multinomial count data
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date : 2022-07-28 DOI: 10.1007/s11634-023-00547-5
Panagiotis Papastamoulis
Volume 50, issue 1.
Citations: 0
Mixed-effect models with trees
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date : 2022-07-08 DOI: 10.1007/s11634-022-00509-3
Anna Gottard, Giulia Vannucci, Leonardo Grilli, Carla Rampichini
Abstract: Tree-based regression models are a class of statistical models for predicting continuous response variables when the shape of the regression function is unknown. They naturally take into account both non-linearities and interactions. However, they struggle with linear and quasi-linear effects and assume iid data. This article proposes two new algorithms for jointly estimating an interpretable predictive mixed-effect model with two components: a linear part, capturing the main effects, and a non-parametric component consisting of three trees for capturing non-linearities and interactions among individual-level predictors, among cluster-level predictors or cross-level. The first proposed algorithm focuses on prediction. The second one is an extension which implements a post-selection inference strategy to provide valid inference. The performance of the two algorithms is validated via Monte Carlo studies. An application on INVALSI data illustrates the potential of the proposed approach.
Volume 17, issue 2, pp. 431-461.
Open-access PDF: https://link.springer.com/content/pdf/10.1007/s11634-022-00509-3.pdf
Citations: 0
On mathematical optimization for clustering categories in contingency tables
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date : 2022-06-28 DOI: 10.1007/s11634-022-00508-4
Emilio Carrizosa, Vanesa Guerrero, Dolores Romero Morales
Abstract: Many applications in data analysis study whether two categorical variables are independent using a function of the entries of their contingency table. Often, the categories of the variables, associated with the rows and columns of the table, are grouped, yielding a less granular representation of the categorical variables. The purpose of this is to attain reasonable sample sizes in the cells of the table and, more importantly, to incorporate expert knowledge on the allowable groupings. However, it is known that the conclusions on independence depend, in general, on the chosen granularity, as in the Simpson paradox. In this paper we propose a methodology that, for a given contingency table and a fixed granularity, finds a clustered table with the highest $\chi^2$ statistic. Repeating this procedure for different values of the granularity, we can either identify an extreme grouping, namely the largest granularity for which the statistical dependence is still detected, or conclude that it does not exist and that the two variables are dependent regardless of the size of the clustered table. For this problem, we propose an assignment mathematical formulation and a set partitioning one. Our approach is flexible enough to include constraints on the desirable structure of the clusters, such as must-link or cannot-link constraints on the categories that can, or cannot, be merged together, and to ensure reasonable sample sizes in the cells of the clustered table, from which trustworthy statistical conclusions can be derived. We illustrate the usefulness of our methodology using a dataset of a medical study.
Volume 17, issue 2, pp. 407-429.
Open-access PDF: https://link.springer.com/content/pdf/10.1007/s11634-022-00508-4.pdf
Citations: 1
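The objective described in the contingency-table abstract above, finding the grouping of categories whose clustered table has the highest chi-square statistic, can be sketched on a toy example. This sketch merges row categories only, fixes the granularity at two groups, and uses brute-force enumeration; it is not the assignment or set partitioning formulation proposed in the paper, and the function names and data are hypothetical.

```python
def chi2_statistic(table):
    """Pearson chi-square statistic of a contingency table given as a list of rows.

    Assumes every row total and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def merge_rows(table, groups):
    """Sum the rows listed in each group, giving one clustered row per group."""
    n_cols = len(table[0])
    return [[sum(table[i][j] for i in group) for j in range(n_cols)] for group in groups]


def best_two_group_merge(table):
    """Try every split of the rows into two non-empty groups; keep the highest statistic."""
    n_rows = len(table)
    best_stat, best_groups = None, None
    # Row 0 always goes to the first group so each split is visited once;
    # the mask with all bits set is skipped because it leaves group_b empty.
    for mask in range(2 ** (n_rows - 1) - 1):
        group_a = [0] + [i for i in range(1, n_rows) if (mask >> (i - 1)) & 1]
        group_b = [i for i in range(1, n_rows) if not (mask >> (i - 1)) & 1]
        stat = chi2_statistic(merge_rows(table, [group_a, group_b]))
        if best_stat is None or stat > best_stat:
            best_stat, best_groups = stat, (group_a, group_b)
    return best_stat, best_groups


counts = [[10, 20], [11, 19], [30, 5]]  # illustrative 3x2 table
stat, groups = best_two_group_merge(counts)
print(groups, round(stat, 2))
```

Enumeration grows exponentially with the number of categories, which is one reason an optimization formulation is attractive for larger tables.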
Classification based on multivariate mixed type longitudinal data with an application to the EU-SILC database
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date : 2022-06-25 DOI: 10.1007/s11634-022-00504-8
Jan Vávra, Arnošt Komárek
Abstract: Although many present day studies gather data of a diverse nature (numeric quantities, binary indicators or ordered categories) on the same units repeatedly over time, there exist only a limited number of approaches in the literature to analyse so-called mixed-type longitudinal data. We present a statistical model capable of jointly modelling several mixed-type outcomes, which also accounts for possible dependencies among the investigated outcomes. A thresholding approach to link binary or ordinal variables to their latent numeric counterparts allows us to jointly model all, including latent, numeric outcomes using a multivariate version of the linear mixed-effects model. We avoid the independence assumption over outcomes by relaxing the variance matrix of random effects to a completely general positive definite matrix. Moreover, we follow model-based clustering methodology to create a mixture of such models to model heterogeneity in the temporal evolution of the considered outcomes. The estimation of such a hierarchical model is approached by Bayesian principles with the use of Markov chain Monte Carlo methods. After a successful simulation study with the aim to examine the ability to consistently estimate the true parameter values and thus discover the different patterns, the EU-SILC dataset consisting of Czech households that were followed for 4 years in a time span from 2005 to 2016 was analysed. The households were classified into groups with a similar evolution of several closely related indicators of monetary poverty based on estimated classification probabilities.
Volume 17, issue 2, pp. 369-406.
Citations: 3
Correction to: Principal component analysis constrained by layered simple structures
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date : 2022-06-24 DOI: 10.1007/s11634-022-00506-6
Naoto Yamashita
Volume 16, issue 4, pp. 1099-1100.
Citations: 1