Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs
{"title":"Editorial for ADAC issue 4 of volume 18 (2024)","authors":"Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs","doi":"10.1007/s11634-024-00615-4","DOIUrl":"https://doi.org/10.1007/s11634-024-00615-4","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 4","pages":"823 - 826"},"PeriodicalIF":1.4,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142679608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Paula Brito, Andrea Cerioli, Luis Angel García-Escudero, Gilbert Saporta
{"title":"Special issue on “New methodologies in clustering and classification for complex and/or big data”","authors":"Paula Brito, Andrea Cerioli, Luis Angel García-Escudero, Gilbert Saporta","doi":"10.1007/s11634-024-00605-6","DOIUrl":"https://doi.org/10.1007/s11634-024-00605-6","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"539 - 543"},"PeriodicalIF":1.4,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142409860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francesco Bartolucci, Antonietta Mira, Stefano Peluso
{"title":"Marginal models with individual-specific effects for the analysis of longitudinal bipartite networks","authors":"Francesco Bartolucci, Antonietta Mira, Stefano Peluso","doi":"10.1007/s11634-024-00604-7","DOIUrl":"https://doi.org/10.1007/s11634-024-00604-7","url":null,"abstract":"<p>A new modeling framework for bipartite social networks arising from a sequence of partially time-ordered relational events is proposed. We directly model the joint distribution of the binary variables indicating if each single actor is involved or not in an event. The adopted parametrization is based on first- and second-order effects, formulated as in marginal models for categorical data and free higher order effects. In particular, second-order effects are log-odds ratios with meaningful interpretation from the social perspective in terms of tendency to cooperate, in contrast to first-order effects interpreted in terms of tendency of each single actor to participate in an event. These effects are parametrized on the basis of the event times, so that suitable latent trajectories of individual behaviors may be represented. Inference is based on a composite likelihood function, maximized by an algorithm with numerical complexity proportional to the square of the number of units in the network. A classification composite likelihood is used to cluster the actors, simplifying the interpretation of the data structure. The proposed approach is illustrated on simulated data and on a dataset of scientific articles published in four top statistical journals from 2003 to 2012.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"61 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142209751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Bagging to improve clustering methods in the context of three-dimensional shapes","authors":"Inácio Nascimento, Raydonal Ospina, Getúlio Amorim","doi":"10.1007/s11634-024-00602-9","DOIUrl":"https://doi.org/10.1007/s11634-024-00602-9","url":null,"abstract":"<p>Cluster Analysis techniques are a common approach to classifying objects within a dataset into distinct clusters. The clustering of geometric shapes of objects holds significant importance in various fields of study. To analyze the geometric shapes of objects, researchers often employ Statistical Shape Analysis methods, which retain crucial information after accounting for scaling, locating, and rotating an object. Consequently, several researchers have focused on adapting clustering algorithms for shape analysis. Recently, three-dimensional (3D) shape clustering has become crucial for analyzing, interpreting, and effectively utilizing 3D data across diverse industries, including medicine, robotics, civil engineering, and paleontology. In this study, we adapt the <i>K-means</i>, <i>CLARANS</i> and <i>Hill Climbing</i> methods using an approach based on the <i>Bagging</i> procedure to achieve enhanced clustering accuracy. We conduct simulation experiments for both isotropy and anisotropy scenarios, considering various dispersion variations. Furthermore, we apply the proposed approach to real datasets from relevant literature. We evaluate the obtained clusters using cluster validation measures, specifically the Rand Index and the Fowlkes-Mallows Index. Our results demonstrate substantial improvements in clustering quality when implementing the <i>Bagging</i> approach in conjunction with the <i>K-means</i>, <i>CLARANS</i> and <i>Hill Climbing</i> methods. The combination of the Bagging method and clustering algorithms provided substantial gains in the quality of the clusters.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"58 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142209752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The chiPower transformation: a valid alternative to logratio transformations in compositional data analysis","authors":"Michael Greenacre","doi":"10.1007/s11634-024-00600-x","DOIUrl":"https://doi.org/10.1007/s11634-024-00600-x","url":null,"abstract":"<div><p>The approach to analysing compositional data has been dominated by the use of logratio transformations, to ensure exact subcompositional coherence and, in some situations, exact isometry as well. A problem with this approach is that data zeros, found in most applications, have to be replaced to allow the logarithmic transformation. An alternative new approach, called the ‘chiPower’ transformation, which allows data zeros, is to combine the standardization inherent in the chi-square distance in correspondence analysis, with the essential elements of the Box-Cox power transformation. The chiPower transformation is justified because it defines between-sample distances that tend to logratio distances for strictly positive data as the power parameter tends to zero, and are then equivalent to transforming to logratios. For data with zeros, a value of the power can be identified that brings the chiPower transformation as close as possible to a logratio transformation, without having to substitute the zeros. Especially in the area of high-dimensional data, this alternative approach can present such a high level of coherence and isometry as to be a valid approach to the analysis of compositional data. Furthermore, in a supervised learning context, if the compositional variables serve as predictors of a response in a modelling framework, for example generalized linear models, then the power can be used as a tuning parameter in optimizing the accuracy of prediction through cross-validation. The chiPower-transformed variables have a straightforward interpretation, since they are identified with single compositional parts, not ratios.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"769 - 796"},"PeriodicalIF":1.4,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141886870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
José García-García, María Ángeles Gil, María Asunción Lubiano
{"title":"On some properties of Cronbach’s α coefficient for interval-valued data in questionnaires","authors":"José García-García, María Ángeles Gil, María Asunción Lubiano","doi":"10.1007/s11634-024-00601-w","DOIUrl":"https://doi.org/10.1007/s11634-024-00601-w","url":null,"abstract":"<p>In recent years, interval-valued rating scales have been considered as an alternative to traditional single-point psychometric tools for human evaluations, such as Likert-type or visual analogue scales. More concretely, in answering intrinsically imprecise items in a questionnaire, interval-valued scales seem to allow capturing richer information than conventional ones. When analyzing data from given performances of questionnaires, one of the main targets is that of ensuring the internal consistency of the items in a construct or latent variable. The most popular indicator of internal consistency, whenever answers to items are given in accordance with a numerically based/encoded scale, is the well-known Cronbach <i>α</i> coefficient. This paper aims to extend such a coefficient to the case of interval-valued answers and to analyze some of its main statistical properties. For this purpose, after presenting some formal preliminaries for interval-valued data, Cronbach’s <i>α</i> coefficient is first extended to the case in which the constructs of a questionnaire allow interval-valued answers to their items. The range of the potential values of the extended coefficient is then discussed. Furthermore, the asymptotic distribution of the sample Cronbach <i>α</i> coefficient and its bias and consistency properties are examined from a theoretical perspective. Finally, the preceding asymptotic distribution of the sample coefficient as well as the influence of the number of respondents to the questionnaire and the number of items in the constructs are empirically illustrated through simulation-based studies.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"59 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141770279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zeyu Ding, Simon Omlor, Katja Ickstadt, Alexander Munteanu
{"title":"Scalable Bayesian p-generalized probit and logistic regression","authors":"Zeyu Ding, Simon Omlor, Katja Ickstadt, Alexander Munteanu","doi":"10.1007/s11634-024-00599-1","DOIUrl":"https://doi.org/10.1007/s11634-024-00599-1","url":null,"abstract":"<p>The logit and probit link functions are arguably the two most common choices for binary regression models. Many studies have extended the choice of link functions to avoid possible misspecification and to improve the model fit to the data. We introduce the <i>p</i>-generalized Gaussian distribution (<i>p</i>-GGD) to binary regression in a Bayesian framework. The <i>p</i>-GGD has received considerable attention due to its flexibility in modeling the tails, while generalizing, for instance, over the standard normal distribution where <i>p</i> = 2 or the Laplace distribution where <i>p</i> = 1. Here, we extend from maximum likelihood estimation (MLE) to Bayesian posterior estimation using Markov Chain Monte Carlo (MCMC) sampling for the model parameters <i>β</i> and the link function parameter <i>p</i>. We use simulated and real-world data to verify the effect of different parameters <i>p</i> on the estimation results, and how logistic regression and probit regression can be incorporated into a broader framework. To make our Bayesian methods scalable in the case of large data, we also incorporate coresets to reduce the data before running the complex and time-consuming MCMC analysis. This allows us to perform very efficient calculations while retaining the original posterior parameter distributions up to small distortions, both in practice and with theoretical guarantees.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"3 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141546666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dirichlet compound negative multinomial mixture models and applications","authors":"Ornela Bregu, Nizar Bouguila","doi":"10.1007/s11634-024-00598-2","DOIUrl":"https://doi.org/10.1007/s11634-024-00598-2","url":null,"abstract":"<p>In this paper, we consider an alternative parametrization of Dirichlet Compound Negative Multinomial (DCNM) using rising polynomials. The new parametrization gets rid of Gamma functions and allows us to derive the Exact Fisher Information Matrix, which brings significant improvements to model performance due to feature correlation consideration. Second, we propose to improve the computation efficiency by approximating the DCNM model as a member of the exponential family of distributions, called EDCNM. The novel EDCNM model brings several advantages as compared to the DCNM model, such as a closed-form solution for maximum likelihood estimation, higher efficiency due to computational time reduction for sparse datasets, etc. Third, we implement Agglomerative Hierarchical clustering, where Kullback–Leibler divergence is derived and used to measure the distance between two EDCNM probability distributions. Finally, we integrate the Minimum Message Length criterion in our algorithm to estimate the optimal number of components of the mixture model. The merits of our proposed models are validated via challenging real-world applications in Natural Language Processing and Image/Video Recognition. Results reveal that the exponential approximation of the DCNM model has significantly reduced the computational complexity in high-dimensional feature spaces.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"25 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141506099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Natural language processing and financial markets: semi-supervised modelling of coronavirus and economic news","authors":"Carlos Moreno-Pérez, Marco Minozzo","doi":"10.1007/s11634-024-00596-4","DOIUrl":"https://doi.org/10.1007/s11634-024-00596-4","url":null,"abstract":"<p>This paper investigates the reactions of US financial markets to press news from January 2019 to 1 May 2020. To this end, we deduce the content and uncertainty of the news by developing apposite indices from the headlines and snippets of The New York Times, using unsupervised machine learning techniques. In particular, we use Latent Dirichlet Allocation to infer the content (topics) of the articles, and Word Embedding (implemented with the Skip-gram model) and K-Means to measure their uncertainty. In this way, we arrive at the definition of a set of daily topic-specific uncertainty indices. These indices are then used to find explanations for the behavior of the US financial markets by implementing a batch of EGARCH models. In substance, we find that two topic-specific uncertainty indices, one related to COVID-19 news and the other to trade war news, explain the bulk of the movements in the financial markets from the beginning of 2019 to end-April 2020. Moreover, we find that the topic-specific uncertainty index related to the economy and the Federal Reserve is positively related to the financial markets, meaning that our index is able to capture the actions of the Federal Reserve during periods of uncertainty.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"82 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141506100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs
{"title":"Editorial for ADAC issue 2 of volume 18 (2024)","authors":"Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs","doi":"10.1007/s11634-024-00597-3","DOIUrl":"https://doi.org/10.1007/s11634-024-00597-3","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 2","pages":"245 - 249"},"PeriodicalIF":1.4,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141366538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}