{"title":"A deep learning factor analysis model based on importance‐weighted variational inference and normalizing flow priors: Evaluation within a set of multidimensional performance assessments in youth elite soccer players","authors":"P. Kilian, Daniel Leyhr, Christopher J. Urban, O. Höner, A. Kelava","doi":"10.1002/sam.11632","DOIUrl":"https://doi.org/10.1002/sam.11632","url":null,"abstract":"Exploratory factor analysis is a widely used framework in the social and behavioral sciences. Since measurement errors are always present in human behavior data, latent factors, generating the observed data, are important to identify. While most factor analysis methods rely on linear relationships in the data‐generating process, deep learning models can provide more flexible modeling approaches. However, two problems need to be addressed. First, for interpretation, scaling assumptions are required, which can be (at least) cumbersome for deep generative models. Second, deep generative models are typically not identifiable, which is required in order to identify the underlying latent constructs. We developed a model that uses a variational autoencoder as an estimator for a complex factor analysis model based on importance‐weighted variational inference. In order to receive interpretable results and an identified model, we use a linear factor model with identification constraints in the measurement model. To maintain the flexibility of the model, we use normalizing flow latent priors. 
Within the evaluation of performance measures in a talent development program in soccer, we found more clarity in the separation of the identified underlying latent dimensions with our models compared to traditional PCA analyses.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114061712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Categorical classifiers in multiclass classification with imbalanced datasets","authors":"M. Carpita, Silvia Golia","doi":"10.1002/sam.11624","DOIUrl":"https://doi.org/10.1002/sam.11624","url":null,"abstract":"This paper discusses, in a multiclass classification setting, the choice of the so‐called categorical classifier, which is the procedure or criterion that transforms the probabilities produced by a probabilistic classifier into a single category or class. The standard choice is the Bayes Classifier (BC), but it has some limitations with rare classes. This paper studies the classification performance of the BC versus two alternatives, the Max Difference Classifier (MDC) and the Max Ratio Classifier (MRC), through an extensive simulation and some case studies. The results show that both MDC and MRC are preferable to BC in a multiclass setting with imbalanced data.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129415765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new parametric approach to gender gap with application to EUSILC data in Poland and Italy","authors":"F. Greselin, Alina Jȩdrzejczak, Kamila Trzcińska","doi":"10.1002/sam.11623","DOIUrl":"https://doi.org/10.1002/sam.11623","url":null,"abstract":"Real income distribution comparisons are of interest to policy makers across European countries. Nowadays, a crucial component of income inequality remains the discrepancy between men and women, often called the gender gap. Since the gender gap is related to the whole distribution of incomes in a population, popular single metrics are not adequate, and previous studies applied the relative distribution method, a non‐parametric approach to the comparison of distributions. Here, we propose a parametric approach for estimating the relative distribution. Then we extend it to assess the impact of selected covariates—related to the personal characteristics of the samples—on the existing gender gap in both countries. In more detail, models for income were fitted to empirical data from Poland and Italy, from the European Survey of Income and Living Conditions (wave 2018). Afterwards, their parameters were employed to obtain the estimates of relative distribution characteristics. The methods applied in the study turned out to be relevant to describe the gender gap over the entire income range. 
Finally, the results of the empirical analysis are discussed to reveal similarities and substantial differences between the countries.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122848542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semiparametric detection of changepoints in location, scale, and copula","authors":"Gaurav Agarwal, I. Eckley, P. Fearnhead","doi":"10.1002/sam.11622","DOIUrl":"https://doi.org/10.1002/sam.11622","url":null,"abstract":"This paper proposes a new method to detect changepoints in the location and scale of univariate data sequences. The proposed method assumes that the data belong to the location‐scale family of distributions and estimate the associated densities nonparametrically. Specifically, the approach does not require knowledge of the functional form of the distribution of the data sequence. As such, the approach can detect changepoints in many distributions. We also propose a new method to detect changes in the location of multivariate sequences, using the marginals and a copula to capture the dependence between variables without the influence of marginal distributions. The performance of the proposed semiparametric approach is contrasted against both other competing nonparametric and Gaussian methods, via simulation studies, as well as applications arising from health and finance.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123915218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new formulation of sparse multiple kernel k‐means clustering and its applications","authors":"Wentao Qu, Xianchao Xiu, Jun Sun, Lingchen Kong","doi":"10.1002/sam.11621","DOIUrl":"https://doi.org/10.1002/sam.11621","url":null,"abstract":"Multiple kernel k‐means (MKKM) clustering has been an important research topic in statistical machine learning and data mining over the last few decades. MKKM combines a group of prespecified base kernels to improve the clustering performance. Although many efforts have been made to improve the performance of MKKM further, the present works do not sufficiently consider the potential structure of the partition matrix. In this paper, we propose a novel sparse multiple kernel k‐means (SMKKM) clustering by introducing an ℓ1‐norm to induce the sparsity of the partition matrix. We then design an efficient alternating algorithm with curve search technology. More importantly, the convergence and complexity analysis of the designed algorithm are established based on the optimality conditions of the SMKKM. Finally, extensive numerical experiments on synthetic and benchmark datasets demonstrate that the proposed method outperforms the state‐of‐the‐art methods in terms of clustering performance and robustness.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125939428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Association rules and decision rules","authors":"A. Mokkadem, M. Pelletier, Louis Raimbault","doi":"10.1002/sam.11620","DOIUrl":"https://doi.org/10.1002/sam.11620","url":null,"abstract":"Determining association rules of significant interest is an essential task within data mining and statistical analysis. In this paper, we first precisely define the notion of association rule. For this, we introduce a general model, which includes the usual transaction model, and which allows many operations on the association rules. Then, we interpret association rules as statistical decision rules. This interpretation leads to four decisional measures, one of them being the usual confidence. Then, we give some strategies based on the use of these four decisional measures in order to select or to construct association rules with a given consequent. We finally present an experimental study to illustrate these strategies. This study is carried out in R language, with the R‐package we specifically built for association rules mining.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125667533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simplicial depth: Characterization and reconstruction","authors":"P. Laketa, Stanislav Nagy","doi":"10.1002/sam.11618","DOIUrl":"https://doi.org/10.1002/sam.11618","url":null,"abstract":"Statistical depth functions have been designed with the intention of extending nonparametric inference toward multivariate setups. As such, the depths should serve as multivariate analogues of the quantile functions known from the analysis of real‐valued data. The so‐called characterization and reconstruction questions are among the fundamental open problems of the contemporary depth research. Roughly speaking, they ask: (a) Is it possible that two different datasets, or more generally, two different probability distributions, correspond to identical depths, or does the depth function uniquely characterize the underlying distribution? (b) Knowing a depth function, can we reconstruct the corresponding distribution? For any given depth to constitute a fully‐fledged alternative to the quantile function, the depth must characterize wide classes of probability measures, and these measures must be simple to recover from their depths. We investigate these characterization/reconstruction questions for the classical simplicial depth for multivariate data. 
We show that, under mild conditions, datasets (represented by measures putting equal mass 1/n to each datum in a dataset of size n) and atomic measures are characterized by, and can be easily reconstructed from, their simplicial depth.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125919102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Share density‐based clustering of income data","authors":"Francesca Condino","doi":"10.1002/sam.11619","DOIUrl":"https://doi.org/10.1002/sam.11619","url":null,"abstract":"The Lorenz curve is a fundamental tool for analyzing income and wealth distribution and inequality. Indeed, the Lorenz curve and its derivative, the so‐called share density, provide valuable information regarding inequality. There is a widely recognized connection between the Lorenz curve and elements from information theory field. Starting from this evidence, the aim of this work is to compare the income inequality of different subgroups, by using a proper dissimilarity measure, borrowed from information theory, between parametric share densities. This measure is then considered for clustering purposes. To this end, a dynamic clustering algorithm is considered to group unconventional data, such as density functions. Finally, an application, regarding data from Survey on Households Income and Wealth (SHIW) by Bank of Italy, is shown.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132675543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lq regularization for fair artificial intelligence robust to covariate shift","authors":"Seonghyeon Kim, Sara Kim, Kunwoong Kim, Yongdai Kim","doi":"10.1002/sam.11616","DOIUrl":"https://doi.org/10.1002/sam.11616","url":null,"abstract":"It is well recognized that historical biases exist in training data against a certain sensitive group (e.g., non‐White, women) which are socially unacceptable, and these unfair biases are inherited in trained artificial intelligence (AI) models. Various learning algorithms have been proposed to remove or alleviate unfair biases in trained AI models. In this paper, we consider another type of bias in training data, the so‐called covariate shift, from the viewpoint of fair AI. Here, covariate shift means that training data do not represent the population of interest well. Covariate shift occurs when special sampling designs (e.g., stratified sampling) are used when collecting training data, or the population where training data are collected is different from the population of interest. When covariate shift exists, AI models that are fair on training data may not be fair on test data. To ensure fairness on test data, we develop computationally efficient learning algorithms robust to covariate shifts. In particular, we propose a robust fairness constraint based on the Lq norm which is a generic algorithm to be applied to various fairness AI problems without much hampering. 
By analyzing multiple benchmark datasets, we show that our proposed robust fairness AI algorithm improves existing fair AI algorithms much in terms of the fairness‐accuracy tradeoff to covariate shift and has significant computational advantages compared to other robust fair AI algorithms.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134429513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Buckley–James estimation of generalized additive accelerated lifetime model with ultrahigh‐dimensional data","authors":"Zichang Li, Xuejing Zhao","doi":"10.1002/sam.11615","DOIUrl":"https://doi.org/10.1002/sam.11615","url":null,"abstract":"High‐dimensional covariates in lifetime data are a challenge in survival analysis, especially for gene expression profiles. The objective of this paper is to propose an efficient algorithm to extend the generalized additive model to survival data with high‐dimensional covariates. The algorithm combines the generalized additive model (GAM) with Buckley–James estimation, which makes a nonparametric extension to the nonlinear model, where the GAM is exploited to illustrate the nonlinear effect of the covariates and the Buckley–James estimation is used to address the regression model with right‐censored response. In addition, we use maximal‐information‐coefficient (MIC)‐type variable screening and weighted p‐value to reduce dimension in high‐dimensional situations. The performance of the proposed algorithm is compared with three benchmark models (the Cox proportional hazards regression model, random survival forest, and BJ‐AFT) on a simulated dataset and two real survival datasets. 
The results, evaluated by concordance index (C‐index) as well as modified mean squared error (mMSE), illustrated the superiority of the proposed algorithm.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133964222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}