Advances in Data Analysis and Classification: Latest Articles

Special issue on “New methodologies in clustering and classification for complex and/or big data”
IF 1.4, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date: 2024-09-04 DOI: 10.1007/s11634-024-00605-6
Paula Brito, Andrea Cerioli, Luis Angel García-Escudero, Gilbert Saporta
Citations: 0
Marginal models with individual-specific effects for the analysis of longitudinal bipartite networks
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date: 2024-09-03 DOI: 10.1007/s11634-024-00604-7
Francesco Bartolucci, Antonietta Mira, Stefano Peluso
Abstract: A new modeling framework for bipartite social networks arising from a sequence of partially time-ordered relational events is proposed. We directly model the joint distribution of the binary variables indicating if each single actor is involved or not in an event. The adopted parametrization is based on first- and second-order effects, formulated as in marginal models for categorical data and free higher order effects. In particular, second-order effects are log-odds ratios with meaningful interpretation from the social perspective in terms of tendency to cooperate, in contrast to first-order effects interpreted in terms of tendency of each single actor to participate in an event. These effects are parametrized on the basis of the event times, so that suitable latent trajectories of individual behaviors may be represented. Inference is based on a composite likelihood function, maximized by an algorithm with numerical complexity proportional to the square of the number of units in the network. A classification composite likelihood is used to cluster the actors, simplifying the interpretation of the data structure. The proposed approach is illustrated on simulated data and on a dataset of scientific articles published in four top statistical journals from 2003 to 2012.
Citations: 0
Using Bagging to improve clustering methods in the context of three-dimensional shapes
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date: 2024-08-21 DOI: 10.1007/s11634-024-00602-9
Inácio Nascimento, Raydonal Ospina, Getúlio Amorim
Abstract: Cluster Analysis techniques are a common approach to classifying objects within a dataset into distinct clusters. The clustering of geometric shapes of objects holds significant importance in various fields of study. To analyze the geometric shapes of objects, researchers often employ Statistical Shape Analysis methods, which retain crucial information after accounting for scaling, locating, and rotating an object. Consequently, several researchers have focused on adapting clustering algorithms for shape analysis. Recently, three-dimensional (3D) shape clustering has become crucial for analyzing, interpreting, and effectively utilizing 3D data across diverse industries, including medicine, robotics, civil engineering, and paleontology. In this study, we adapt the K-means, CLARANS and Hill Climbing methods using an approach based on the Bagging procedure to achieve enhanced clustering accuracy. We conduct simulation experiments for both isotropy and anisotropy scenarios, considering various dispersion variations. Furthermore, we apply the proposed approach to real datasets from relevant literature. We evaluate the obtained clusters using cluster validation measures, specifically the Rand Index and the Fowlkes-Mallows Index. Our results demonstrate substantial improvements in clustering quality when implementing the Bagging approach in conjunction with the K-means, CLARANS and Hill Climbing methods. The combination of the Bagging method and clustering algorithms provided substantial gains in the quality of the clusters.
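The two validation measures named in this abstract, the Rand Index and the Fowlkes-Mallows Index, can both be computed by counting pairs of items. The following is a minimal sketch of the textbook pair-counting definitions, not the authors' code; the function names are hypothetical and the inputs are assumed to be two equal-length label lists with at least two items.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_true, labels_pred):
    """Classify every pair of items by whether the two partitions agree on it."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # together in both partitions
        elif same_pred:
            fp += 1  # together only in the predicted partition
        elif same_true:
            fn += 1  # together only in the reference partition
        else:
            tn += 1  # separated in both partitions
    return tp, fp, fn, tn


def rand_index(labels_true, labels_pred):
    """Share of item pairs on which the two partitions agree."""
    tp, fp, fn, tn = pair_counts(labels_true, labels_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def fowlkes_mallows(labels_true, labels_pred):
    """Geometric mean of pairwise precision and pairwise recall."""
    tp, fp, fn, _ = pair_counts(labels_true, labels_pred)
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))
```

Both measures depend only on which items are grouped together, so relabeling the clusters leaves them unchanged.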
Citations: 0
The chiPower transformation: a valid alternative to logratio transformations in compositional data analysis
IF 1.4, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date: 2024-08-01 DOI: 10.1007/s11634-024-00600-x
Michael Greenacre
Abstract: The approach to analysing compositional data has been dominated by the use of logratio transformations, to ensure exact subcompositional coherence and, in some situations, exact isometry as well. A problem with this approach is that data zeros, found in most applications, have to be replaced to allow the logarithmic transformation. An alternative new approach, called the ‘chiPower’ transformation, which allows data zeros, is to combine the standardization inherent in the chi-square distance in correspondence analysis, with the essential elements of the Box-Cox power transformation. The chiPower transformation is justified because it defines between-sample distances that tend to logratio distances for strictly positive data as the power parameter tends to zero, and are then equivalent to transforming to logratios. For data with zeros, a value of the power can be identified that brings the chiPower transformation as close as possible to a logratio transformation, without having to substitute the zeros. Especially in the area of high-dimensional data, this alternative approach can present such a high level of coherence and isometry as to be a valid approach to the analysis of compositional data. Furthermore, in a supervised learning context, if the compositional variables serve as predictors of a response in a modelling framework, for example generalized linear models, then the power can be used as a tuning parameter in optimizing the accuracy of prediction through cross-validation. The chiPower-transformed variables have a straightforward interpretation, since they are identified with single compositional parts, not ratios.
Citations: 0
On some properties of Cronbach’s α coefficient for interval-valued data in questionnaires
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date: 2024-07-26 DOI: 10.1007/s11634-024-00601-w
José García-García, María Ángeles Gil, María Asunción Lubiano
Abstract: Along recent years, interval-valued rating scales have been considered as an alternative to traditional single-point psychometric tools for human evaluations, such as Likert-type or visual analogue scales. More concretely, in answering to intrinsically imprecise items in a questionnaire, interval-valued scales seem to allow capturing a richer information than conventional ones. When analyzing data from given performances of questionnaires, one of the main targets is that of ensuring the internal consistency of the items in a construct or latent variable. The most popular indicator of internal consistency, whenever answers to items are given in accordance with a numerically based/encoded scale, is the well-known Cronbach α coefficient. This paper aims to extend such a coefficient to the case of interval-valued answers and to analyze some of its main statistical properties. For this purpose, after presenting some formal preliminaries for interval-valued data, firstly Cronbach’s α coefficient is extended to the case in which the constructs of a questionnaire allow interval-valued answers to their items. The range of the potential values of the extended coefficient is then discussed. Furthermore, the asymptotic distribution of the sample Cronbach α coefficient along with its bias and consistency properties, are examined from a theoretical perspective. Finally, the preceding asymptotic distribution of the sample coefficient as well as the influence of the number of respondents to the questionnaire and the number of items in the constructs are empirically illustrated through simulation-based studies.
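For orientation, the classical single-valued Cronbach α that this paper takes as its starting point can be computed as below. This is a minimal sketch of the standard formula, α = k/(k−1) · (1 − Σ item variances / variance of total scores), using sample variances. It does not reproduce the paper's interval-valued extension, and the function name and input layout (one row of k item scores per respondent) are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Classical Cronbach's alpha for single-valued answers.

    scores: list of respondents, each a list of k numeric item scores
    (hypothetical layout; needs k >= 2 and at least two respondents).
    """
    k = len(scores[0])
    items = list(zip(*scores))  # transpose to one tuple per item
    sum_item_vars = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum_item_vars / total_var)
```

When all items move together perfectly, for example `cronbach_alpha([[1, 1], [2, 2], [3, 3]])`, the coefficient reaches its upper bound of 1.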
Citations: 0
Scalable Bayesian p-generalized probit and logistic regression
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date: 2024-07-04 DOI: 10.1007/s11634-024-00599-1
Zeyu Ding, Simon Omlor, Katja Ickstadt, Alexander Munteanu
Abstract: The logit and probit link functions are arguably the two most common choices for binary regression models. Many studies have extended the choice of link functions to avoid possible misspecification and to improve the model fit to the data. We introduce the p-generalized Gaussian distribution (p-GGD) to binary regression in a Bayesian framework. The p-GGD has received considerable attention due to its flexibility in modeling the tails, while generalizing, for instance, over the standard normal distribution where p = 2 or the Laplace distribution where p = 1. Here, we extend from maximum likelihood estimation (MLE) to Bayesian posterior estimation using Markov Chain Monte Carlo (MCMC) sampling for the model parameters β and the link function parameter p. We use simulated and real-world data to verify the effect of different parameters p on the estimation results, and how logistic regression and probit regression can be incorporated into a broader framework. To make our Bayesian methods scalable in the case of large data, we also incorporate coresets to reduce the data before running the complex and time-consuming MCMC analysis. This allows us to perform very efficient calculations while retaining the original posterior parameter distributions up to little distortions both, in practice, and with theoretical guarantees.
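The special cases mentioned in this abstract (standard normal at p = 2, Laplace at p = 1) can be checked numerically. The sketch below assumes one common parametrization of the p-generalized Gaussian density, f_p(x) = p^(1 − 1/p) / (2 Γ(1/p)) · exp(−|x|^p / p). The abstract does not state the density, so both the formula and the function name are assumptions, and the corresponding link function (the CDF) is not implemented here.

```python
from math import exp, gamma, pi, sqrt


def pgg_pdf(x, p):
    """Density of a p-generalized Gaussian under an assumed parametrization:
    f_p(x) = p**(1 - 1/p) / (2 * Gamma(1/p)) * exp(-|x|**p / p), for p > 0.
    """
    norm_const = p ** (1.0 - 1.0 / p) / (2.0 * gamma(1.0 / p))
    return norm_const * exp(-abs(x) ** p / p)


# p = 2 recovers the standard normal density.
normal_pdf = exp(-0.5 * 0.7 ** 2) / sqrt(2.0 * pi)
gap_normal = abs(pgg_pdf(0.7, 2) - normal_pdf)

# p = 1 recovers the Laplace density with unit scale.
laplace_pdf = 0.5 * exp(-1.3)
gap_laplace = abs(pgg_pdf(-1.3, 1) - laplace_pdf)
```

Under this parametrization, smaller values of p give heavier tails and larger values give lighter tails, which is the flexibility the abstract refers to.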
Citations: 0
Dirichlet compound negative multinomial mixture models and applications
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date: 2024-06-25 DOI: 10.1007/s11634-024-00598-2
Ornela Bregu, Nizar Bouguila
Abstract: In this paper, we consider an alternative parametrization of Dirichlet Compound Negative Multinomial (DCNM) using rising polynomials. The new parametrization gets rid of Gamma functions and allows us to derive the Exact Fisher Information Matrix, which brings significant improvements to model performance due to feature correlation consideration. Second, we propose to improve the computation efficiency by approximating the DCNM model as a member of the exponential family of distributions, called EDCNM. The novel EDCNM model brings several advantages as compared to the DCNM model, such as a closed-form solution for maximum likelihood estimation, higher efficiency due to computational time reduction for sparse datasets, etc. Third, we implement Agglomerative Hierarchical clustering, where Kullback–Leibler divergence is derived and used to measure the distance between two EDCNM probability distributions. Finally, we integrate the Minimum Message Length criterion in our algorithm to estimate the optimal number of components of the mixture model. The merits of our proposed models are validated via challenging real-world applications in Natural Language Processing and Image/Video Recognition. Results reveal that the exponential approximation of the DCNM model has reduced significantly the computational complexity in high-dimensional feature spaces.
Citations: 0
Natural language processing and financial markets: semi-supervised modelling of coronavirus and economic news
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date: 2024-06-19 DOI: 10.1007/s11634-024-00596-4
Carlos Moreno-Pérez, Marco Minozzo
Abstract: This paper investigates the reactions of US financial markets to press news from January 2019 to 1 May 2020. To this end, we deduce the content and uncertainty of the news by developing apposite indices from the headlines and snippets of The New York Times, using unsupervised machine learning techniques. In particular, we use Latent Dirichlet Allocation to infer the content (topics) of the articles, and Word Embedding (implemented with the Skip-gram model) and K-Means to measure their uncertainty. In this way, we arrive at the definition of a set of daily topic-specific uncertainty indices. These indices are then used to find explanations for the behavior of the US financial markets by implementing a batch of EGARCH models. In substance, we find that two topic-specific uncertainty indices, one related to COVID-19 news and the other to trade war news, explain the bulk of the movements in the financial markets from the beginning of 2019 to end-April 2020. Moreover, we find that the topic-specific uncertainty index related to the economy and the Federal Reserve is positively related to the financial markets, meaning that our index is able to capture the actions of the Federal Reserve during periods of uncertainty.
Citations: 0
Editorial for ADAC issue 2 of volume 18 (2024)
IF 1.4, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date: 2024-06-10 DOI: 10.1007/s11634-024-00597-3
Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs
Citations: 0
Clustering large mixed-type data with ordinal variables
IF 1.6, Zone 4, Computer Science
Advances in Data Analysis and Classification Pub Date: 2024-05-27 DOI: 10.1007/s11634-024-00595-5
Gero Szepannek, Rabea Aschenbruck, Adalbert Wilhelm
Abstract: One of the most frequently used algorithms for clustering data with both numeric and categorical variables is the k-prototypes algorithm, an extension of the well-known k-means clustering. Gower’s distance denotes another popular approach for dealing with mixed-type data and is suitable not only for numeric and categorical but also for ordinal variables. In the paper a modification of the k-prototypes algorithm to Gower’s distance is proposed that ensures convergence. This provides a tool that allows to take into account ordinal information for clustering and can also be used for large data. A simulation study demonstrates convergence, good clustering results as well as small runtimes.
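To make the role of Gower's distance concrete, here is a minimal sketch of a Gower-style dissimilarity between two mixed-type records. It is not the modified k-prototypes algorithm from the paper. The treatment of ordinal variables (integer ranks scaled by the rank range) is one common convention and is assumed here for illustration, as are the function name, the `kinds` codes, and the `ranges` argument.

```python
def gower_distance(x, y, kinds, ranges):
    """Gower-style dissimilarity between two records (illustrative sketch).

    kinds[i]  : "num", "cat" or "ord" (hypothetical type codes)
    ranges[i] : observed range of variable i for "num" and "ord"
                (ordinal values are assumed to be integer ranks);
                ignored for "cat"
    """
    total = 0.0
    for a, b, kind, r in zip(x, y, kinds, ranges):
        if kind == "cat":
            # Simple matching: 0 if the categories agree, 1 otherwise.
            total += 0.0 if a == b else 1.0
        else:
            # Range-scaled absolute difference, so each term lies in [0, 1].
            total += abs(a - b) / r if r > 0 else 0.0
    return total / len(x)


# Example: one numeric, one categorical and one ordinal variable.
d = gower_distance(
    (10.0, "red", 1),
    (20.0, "blue", 3),
    kinds=("num", "cat", "ord"),
    ranges=(40.0, None, 4),
)
```

Every per-variable term lies between 0 and 1, so the averaged distance is also in [0, 1] and no single variable type dominates because of its scale.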
Citations: 0