{"title":"A new model for counterfactual analysis for functional data","authors":"Emilio Carrizosa, Jasone Ramírez-Ayerbe, Dolores Romero Morales","doi":"10.1007/s11634-023-00563-5","DOIUrl":"10.1007/s11634-023-00563-5","url":null,"abstract":"<div><p>Counterfactual explanations have become a very popular interpretability tool to understand and explain how complex machine learning models make decisions for individual instances. Most of the research on counterfactual explainability focuses on tabular and image data and much less on models dealing with functional data. In this paper, a counterfactual analysis for functional data is addressed, in which the goal is to identify the samples of the dataset from which the counterfactual explanation is made of, as well as how they are combined so that the individual instance and its counterfactual are as close as possible. Our methodology can be used with different distance measures for multivariate functional data and is applicable to any score-based classifier. We illustrate our methodology using two different real-world datasets, one univariate and another multivariate.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 4","pages":"981 - 1000"},"PeriodicalIF":1.4,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-023-00563-5.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135216055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francesca Bassi, José Fernando Vera, Juan Antonio Marmolejo Martín
{"title":"Profile-based latent class distance association analyses for sparse tables:application to the attitude of European citizens towards sustainable tourism","authors":"Francesca Bassi, José Fernando Vera, Juan Antonio Marmolejo Martín","doi":"10.1007/s11634-023-00559-1","DOIUrl":"10.1007/s11634-023-00559-1","url":null,"abstract":"<div><p>Social and behavioural sciences often deal with the analysis of associations for cross-classified data. This paper focuses on the study of the patterns observed on European citizens regarding their attitude towards sustainable tourism, specifically their willingness to change travel and tourism habits to be more sustainable. The data collected the intention to comply with nine sustainable actions; answers to these questions generated individual profiles; moreover, European country belonging is reported. Therefore, unlike a variable-oriented approach, here we are interested in a person-oriented approach through profiles. Some traditional methods are limited in their performance when using profiles, for example, by sparseness of the contingency table. We removed many of these limitations by using a latent class distance association model, clustering the row profiles into classes and representing these together with the categories of the response variable in a low-dimensional space. We showed, furthermore, that an easy interpretation of associations between clusters’ centres and categories of a response variable can be incorporated in this framework in an intuitive way using unfolding. Results of the analyses outlined that citizens mostly committed to an environmentally friendly behavior live in Sweden and Romania; citizens less willing to change their habits towards a more sustainable behavior live in Belgium, Cyprus, France, Lithuania and the Netherlands. Citizens preparedness to change habits however depends also on their socio-demographic characteristics such as gender, age, occupation, type of community where living, household size, and the frequency of travelling before the Covid-19 pandemic.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 4","pages":"953 - 980"},"PeriodicalIF":1.4,"publicationDate":"2023-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-023-00559-1.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135884070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs
{"title":"Editorial for ADAC issue 4 of volume 17 (2023)","authors":"Maurizio Vichi, Andrea Cerioli, Hans A. Kestler, Akinori Okada, Claus Weihs","doi":"10.1007/s11634-023-00564-4","DOIUrl":"10.1007/s11634-023-00564-4","url":null,"abstract":"","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"17 4","pages":"823 - 827"},"PeriodicalIF":1.6,"publicationDate":"2023-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50028115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering interpretable structure in longitudinal predictors via coefficient trees","authors":"Özge Sürer, Daniel W. Apley, Edward C. Malthouse","doi":"10.1007/s11634-023-00562-6","DOIUrl":"10.1007/s11634-023-00562-6","url":null,"abstract":"<div><p>We consider the regression setting in which the response variable is not longitudinal (i.e., it is observed once for each case), but it is assumed to depend functionally on a set of predictors that are observed longitudinally, which is a specific form of functional predictors. In this situation, we often expect that the same predictor observed at nearby time points are more likely to be associated with the response in the same way. In such situations, we can exploit those aspects and discover groups of predictors that share the same (or similar) coefficient according to their temporal proximity. We propose a new algorithm called coefficient tree regression for data in which the non-longitudinal response depends on longitudinal predictors to efficiently discover the underlying temporal characteristics of the data. The approach results in a simple and highly interpretable tree structure from which the hierarchical relationships between groups of predictors that affect the response in a similar manner based on their temporal proximity can be observed, and we demonstrate with a real example that it can provide a clear and concise interpretation of the data. In numerical comparisons over a variety of examples, we show that our approach achieves substantially better predictive accuracy than existing competitors, most likely due to its inherent form of dimensionality reduction that is automatically discovered when fitting the model, in addition to having interpretability advantages and lower computational expense.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 4","pages":"911 - 951"},"PeriodicalIF":1.4,"publicationDate":"2023-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136209673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized Cramér’s coefficient via f-divergence for contingency tables","authors":"Wataru Urasaki, Tomoyuki Nakagawa, Tomotaka Momozaki, Sadao Tomizawa","doi":"10.1007/s11634-023-00560-8","DOIUrl":"10.1007/s11634-023-00560-8","url":null,"abstract":"<div><p>Various measures in two-way contingency table analysis have been proposed to express the strength of association between row and column variables in contingency tables. Tomizawa et al. (2004) proposed more general measures, including Cramér’s coefficient, using the power-divergence. In this paper, we propose measures using the <i>f</i>-divergence that has a wider class than the power-divergence. Unlike statistical hypothesis tests, these measures provide quantification of the association structure in contingency tables. The contribution of our study is proving that a measure applying a function that satisfies the condition of the <i>f</i>-divergence has desirable properties for measuring the strength of association in contingency tables. With this contribution, we can easily construct a new measure using a divergence that has essential properties for the analyst. For example, we conducted numerical experiments with a measure applying the <span>(theta)</span>-divergence. Furthermore, we can give further interpretation of the association between the row and column variables in the contingency table, which could not be obtained with the conventional one. We also show a relationship between our proposed measures and the correlation coefficient in a bivariate normal distribution of latent variables in the contingency tables.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 4","pages":"893 - 910"},"PeriodicalIF":1.4,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-023-00560-8.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135481525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mixture modeling with normalizing flows for spherical density estimation","authors":"Tin Lok James Ng, Andrew Zammit-Mangion","doi":"10.1007/s11634-023-00561-7","DOIUrl":"10.1007/s11634-023-00561-7","url":null,"abstract":"<div><p>Normalizing flows are objects used for modeling complicated probability density functions, and have attracted considerable interest in recent years. Many flexible families of normalizing flows have been developed. However, the focus to date has largely been on normalizing flows on Euclidean domains; while normalizing flows have been developed for spherical and other non-Euclidean domains, these are generally less flexible than their Euclidean counterparts. To address this shortcoming, in this work we introduce a mixture-of-normalizing-flows model to construct complicated probability density functions on the sphere. This model provides a flexible alternative to existing parametric, semiparametric, and nonparametric, finite mixture models. Model estimation is performed using the expectation maximization algorithm and a variant thereof. The model is applied to simulated data, where the benefit over the conventional (single component) normalizing flow is verified. The model is then applied to two real-world data sets of events occurring on the surface of Earth; the first relating to earthquakes, and the second to terrorist activity. In both cases, we see that the mixture-of-normalizing-flows model yields a good representation of the density of event occurrence.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 1","pages":"103 - 120"},"PeriodicalIF":1.4,"publicationDate":"2023-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135548086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parsimony and parameter estimation for mixtures of multivariate leptokurtic-normal distributions","authors":"Ryan P. Browne, Luca Bagnato, Antonio Punzo","doi":"10.1007/s11634-023-00558-2","DOIUrl":"10.1007/s11634-023-00558-2","url":null,"abstract":"<div><p>Mixtures of multivariate leptokurtic-normal distributions have been recently introduced in the clustering literature based on mixtures of elliptical heavy-tailed distributions. They have the advantage of having parameters directly related to the moments of practical interest. We derive two estimation procedures for these mixtures. The first one is based on the majorization-minimization algorithm, while the second is based on a fixed point approximation. Moreover, we introduce parsimonious forms of the considered mixtures and we use the illustrated estimation procedures to fit them. We use simulated and real data sets to investigate various aspects of the proposed models and algorithms.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"597 - 625"},"PeriodicalIF":1.4,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-023-00558-2.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135535813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stanislav Nagy, Houyem Demni, Davide Buttarazzi, Giovanni C. Porzio
{"title":"Theory of angular depth for classification of directional data","authors":"Stanislav Nagy, Houyem Demni, Davide Buttarazzi, Giovanni C. Porzio","doi":"10.1007/s11634-023-00557-3","DOIUrl":"10.1007/s11634-023-00557-3","url":null,"abstract":"<div><p>Depth functions offer an array of tools that enable the introduction of quantile- and ranking-like approaches to multivariate and non-Euclidean datasets. We investigate the potential of using depths in the problem of nonparametric supervised classification of directional data, that is classification of data that naturally live on the unit sphere of a Euclidean space. In this paper, we address the problem mainly from a theoretical side, with the final goal of offering guidelines on which angular depth function should be adopted in classifying directional data. A set of desirable properties of an angular depth is put forward. With respect to these properties, we compare and contrast the most widely used angular depth functions. Simulated and real data are eventually exploited to showcase the main implications of the discussed theoretical results, with an emphasis on potentials and limits of the often disregarded angular halfspace depth.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"627 - 662"},"PeriodicalIF":1.4,"publicationDate":"2023-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135966377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Edoardo Fibbi, Domenico Perrotta, Francesca Torti, Stefan Van Aelst, Tim Verdonck
{"title":"Co-clustering contaminated data: a robust model-based approach","authors":"Edoardo Fibbi, Domenico Perrotta, Francesca Torti, Stefan Van Aelst, Tim Verdonck","doi":"10.1007/s11634-023-00549-3","DOIUrl":"10.1007/s11634-023-00549-3","url":null,"abstract":"<div><p>The exploration and analysis of large high-dimensional data sets calls for well-thought techniques to extract the salient information from the data, such as co-clustering. Latent block models cast co-clustering in a probabilistic framework that extends finite mixture models to the two-way setting. Real-world data sets often contain anomalies which could be of interest <i>per se</i> and may make the results provided by standard, non-robust procedures unreliable. Also estimation of latent block models can be heavily affected by contaminated data. We propose an algorithm to compute robust estimates for latent block models. Experiments on both simulated and real data show that our method is able to resist high levels of contamination and can provide additional insight into the data by highlighting possible anomalies.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 1","pages":"121 - 161"},"PeriodicalIF":1.4,"publicationDate":"2023-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-023-00549-3.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136061315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuwen Zhu, Yana Melnykov, Angelina S. Kolomoytseva
{"title":"Contamination transformation matrix mixture modeling for skewed data groups with heavy tails and scatter","authors":"Xuwen Zhu, Yana Melnykov, Angelina S. Kolomoytseva","doi":"10.1007/s11634-023-00550-w","DOIUrl":"10.1007/s11634-023-00550-w","url":null,"abstract":"<div><p>Model-based clustering is a popular application of the rapidly developing area of finite mixture modeling. While there is ample work focusing on clustering multivariate data, an increasing number of advancements have been aiming at the expansion of existing theory to the matrix-variate framework. Matrix-variate Gaussian mixtures are most popular in this setting despite the potential misfit for skewed and heavy-tailed data. To overcome this lack of flexibility, a new contaminated transformation matrix mixture model is proposed. We illustrate its utility in a series of experiments on simulated data and apply to a real-life data set containing COVID-related information. The performance of the developed model is promising in all considered settings.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 1","pages":"85 - 101"},"PeriodicalIF":1.4,"publicationDate":"2023-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135741082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}