{"title":"Semiparametric detection of changepoints in location, scale, and copula","authors":"Gaurav Agarwal, I. Eckley, P. Fearnhead","doi":"10.1002/sam.11622","DOIUrl":"https://doi.org/10.1002/sam.11622","url":null,"abstract":"This paper proposes a new method to detect changepoints in the location and scale of univariate data sequences. The proposed method assumes that the data belong to the location‐scale family of distributions and estimate the associated densities nonparametrically. Specifically, the approach does not require knowledge of the functional form of the distribution of the data sequence. As such, the approach can detect changepoints in many distributions. We also propose a new method to detect changes in the location of multivariate sequences, using the marginals and a copula to capture the dependence between variables without the influence of marginal distributions. The performance of the proposed semiparametric approach is contrasted against both other competing nonparametric and Gaussian methods, via simulation studies, as well as applications arising from health and finance.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123915218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Association rules and decision rules","authors":"A. Mokkadem, M. Pelletier, Louis Raimbault","doi":"10.1002/sam.11620","DOIUrl":"https://doi.org/10.1002/sam.11620","url":null,"abstract":"Determining association rules of significant interest is an essential task within data mining and statistical analysis. In this paper, we first precisely define the notion of association rule. For this, we introduce a general model, which includes the usual transaction model, and which allows many operations on the association rules. Then, we interpret association rules as statistical decision rules. This interpretation leads to four decisional measures, one of them being the usual confidence. Then, we give some strategies based on the use of these four decisional measures in order to select or to construct association rules with a given consequent. We finally present an experimental study to illustrate these strategies. This study is carried out in R language, with the R‐package we specifically built for association rules mining.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125667533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lq regularization for fair artificial intelligence robust to covariate shift","authors":"Seonghyeon Kim, Sara Kim, Kunwoong Kim, Yongdai Kim","doi":"10.1002/sam.11616","DOIUrl":"https://doi.org/10.1002/sam.11616","url":null,"abstract":"It is well recognized that historical biases exist in training data against a certain sensitive group (e.g., non‐White, women) which are socially unacceptable, and these unfair biases are inherited in trained artificial intelligence (AI) models. Various learning algorithms have been proposed to remove or alleviate unfair biases in trained AI models. In this paper, we consider another type of bias in training data so‐called covariate shift in view of fair AI. Here, covariate shift means that training data do not represent the population of interest well. Covariate shift occurs when special sampling designs (e.g., stratified sampling) are used when collecting training data, or the population where training data are collected is different from the population of interest. When covariate shift exists, fair AI models on training data may not be fair in test data. To ensure fairness on test data, we develop computationally efficient learning algorithms robust to covariate shifts. In particular, we propose a robust fairness constraint based on the Lq norm which is a generic algorithm to be applied to various fairness AI problems without much hampering. By analyzing multiple benchmark datasets, we show that our proposed robust fairness AI algorithm improves existing fair AI algorithms much in terms of the fairness‐accuracy tradeoff to covariate shift and has significant computational advantages compared to other robust fair AI algorithms.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134429513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Buckley–James estimation of generalized additive accelerated lifetime model with ultrahigh‐dimensional data","authors":"Zichang Li, Xuejing Zhao","doi":"10.1002/sam.11615","DOIUrl":"https://doi.org/10.1002/sam.11615","url":null,"abstract":"High‐dimensional covariates in lifetime data is a challenge in survival analysis, especially in gene expression profile. The objective of this paper is to propose an efficient algorithm to extend the generalized additive model to survival data with high‐dimensional covariates. The algorithm is combined of generalized additive (GAM) model and Buckley–James estimation, which makes a nonparametric extension to the nonlinear model, where the GAM is exploited to illustrate the nonlinear effect of the covariates and the Buckley–James estimation is used to address the regression model with right‐censored response. In addition, we use maximal‐information‐coefficient (MIC)‐type variable screening and weighted p‐value to reduce dimension in high‐dimensional situations. The performance of the proposed algorithm is compared with the three benchmark models: Cox proportional hazards regression model, random survival forest, and BJ‐AFT on a simulated dataset and two real survival datasets. The results, evaluated by concordance index (C‐index) as well as modified mean squared error (mMSE), illustrated the superiority of the proposed algorithm.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133964222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimation of disease progression for ischemic heart disease using latent Markov with covariates","authors":"Zarina Oflaz, Ceylan Yozgatlıgil, A. S. Selcuk-Kestel","doi":"10.1002/sam.11589","DOIUrl":"https://doi.org/10.1002/sam.11589","url":null,"abstract":"Contemporaneous monitoring of disease progression, in addition to early diagnosis, is important for the treatment of patients with chronic conditions. Chronic disease‐related factors are not easily tractable, and the existing data sets do not clearly reflect them, making diagnosis difficult. The primary issue is that databases maintained by health care, insurance, or governmental organizations typically do not contain clinical information and instead focus on patient appointments and demographic profiles. Due to the lack of thorough information on potential risk factors for a single patient, investigations on the nature of disease are imprecise. We suggest the use of a latent Markov model with variables in a latent process because it enables the panel analysis of many forms of data. The purpose of this study is to evaluate unobserved factors in ischemic heart disease (IHD) using longitudinal data from electronic health records. Based on the results we designate states as healthy, light, moderate, and severe to represent stages of disease progression. This study demonstrates that gender, patient age, and hospital visit frequency are all significant factors in the development of the disease. Females acquire IHD more rapidly than males, frequently developing from moderate and severe disease. In addition, it demonstrates that individuals under the age of 20 bypass the light state of IHD and proceed directly to the moderate state.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114991576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive boosting for ordinal target variables using neural networks","authors":"Insung Um, Geonseok Lee, K. Lee","doi":"10.1002/sam.11613","DOIUrl":"https://doi.org/10.1002/sam.11613","url":null,"abstract":"Boosting has proven its superiority by increasing the diversity of base classifiers, mainly in various classification problems. In reality, target variables in classification often are formed by numerical variables, in possession of ordinal information. However, existing boosting algorithms for classification are unable to reflect such ordinal target variables, resulting in non‐optimal solutions. In this paper, we propose a novel algorithm of ordinal encoding adaptive boosting (AdaBoost) using a multi‐dimensional encoding scheme for ordinal target variables. Extending an original binary‐class AdaBoost, the proposed algorithm is equipped with a multi‐class exponential loss function. We show that it achieves the Bayes classifier and establishes forward stagewise additive modeling. We demonstrate the performance of the proposed algorithm with a base learner as a neural network. Our experiments show that it outperforms existing boosting algorithms in various ordinal datasets.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116989208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bilateral‐Weighted Online Adaptive Isolation Forest for anomaly detection in streaming data","authors":"Gabor Hannak, G. Horváth, Attila Kádár, Márk Dániel Szalai","doi":"10.1002/sam.11612","DOIUrl":"https://doi.org/10.1002/sam.11612","url":null,"abstract":"We propose a method called Bilateral‐Weighted Online Adaptive Isolation Forest (BWOAIF) for unsupervised anomaly detection based on Isolation Forest (IF), which is applicable to streaming data and able to cope with concept drift. Similar to IF, the proposed method has only few hyperparameters whose effect on the performance are easy to interpret by human intuition and therefore easy to tune. BWOAIF ingests data and classifies it as normal or anomalous, and simultaneously adapts its classifier by removing old trees as well as by creating new ones. We show that BWOAIF adapts gradually to slow concept drifts, and, at the same time, it is able to adapt fast to sudden changes of the data distribution. Numerical results show the efficacy of the proposed algorithm and its ability to learn different classes of concept drifts, such as slow/fast concept shift, concept split, concept appearance, and concept disappearance.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"157 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123468562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model selection with bootstrap validation","authors":"Rafael Savvides, Jarmo Mäkelä, K. Puolamäki","doi":"10.1002/sam.11606","DOIUrl":"https://doi.org/10.1002/sam.11606","url":null,"abstract":"Model selection is one of the most central tasks in supervised learning. Validation set methods are the standard way to accomplish this task: models are trained on training data, and the model with the smallest loss on the validation data is selected. However, it is generally not obvious how much validation data is required to make a reliable selection, which is essential when labeled data are scarce or expensive. We propose a bootstrap‐based algorithm, bootstrap validation (BSV), that uses the bootstrap to adjust the validation set size and to find the best‐performing model within a tolerance parameter specified by the user. We find that BSV works well in practice and can be used as a drop‐in replacement for validation set methods or k‐fold cross‐validation. The main advantage of BSV is that less validation data is typically needed, so more data can be used to train the model, resulting in better approximations and efficient use of validation data.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117144076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchy‐assisted gene expression regulatory network analysis","authors":"Han Yan, Sanguo Zhang, Shuangge Ma","doi":"10.1002/sam.11609","DOIUrl":"https://doi.org/10.1002/sam.11609","url":null,"abstract":"Gene expressions have been extensively studied in biomedical research. With gene expression, network analysis, which takes a system perspective and examines the interconnections among genes, has been established as highly important and meaningful. In the construction of gene expression networks, a commonly adopted technique is high‐dimensional regularized regression. Network construction can be unadjusted (which focuses on gene expressions only) and adjusted (which also incorporates regulators of gene expressions), and the two types of construction have different implications and can be equally important. In this article, we propose a variable selection hierarchy to connect the unadjusted regression‐based network construction with the adjusted construction that incorporates two or more types of regulators. This hierarchy is sensible and amounts to additional information for both constructions, thus having the potential of improving variable selection and estimation. An effective computational algorithm is developed, and extensive simulation demonstrates the superiority of the proposed construction over multiple closely relevant alternatives. The analysis of TCGA data further demonstrates the practical utility of the proposed approach.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124058475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust deep neural network surrogate models with uncertainty quantification via adversarial training","authors":"Lixiang Zhang, Jia Li","doi":"10.1002/sam.11610","DOIUrl":"https://doi.org/10.1002/sam.11610","url":null,"abstract":"Surrogate models have been used to emulate mathematical simulators of physical or biological processes for computational efficiency. High‐speed simulation is crucial for conducting uncertainty quantification (UQ) when the simulation must repeat over many randomly sampled input points (aka the Monte Carlo method). A simulator can be so computationally intensive that UQ is only feasible with a surrogate model. Recently, deep neural network (DNN) surrogate models have gained popularity for their state‐of‐the‐art emulation accuracy. However, it is well‐known that DNN is prone to severe errors when input data are perturbed in particular ways, the very phenomenon which has inspired great interest in adversarial training. In the case of surrogate models, the concern is less about a deliberate attack exploiting the vulnerability of a DNN but more of the high sensitivity of its accuracy to input directions, an issue largely ignored by researchers using emulation models. In this paper, we show the severity of this issue through empirical studies and hypothesis testing. Furthermore, we adopt methods in adversarial training to enhance the robustness of DNN surrogate models. Experiments demonstrate that our approaches significantly improve the robustness of the surrogate models without compromising emulation accuracy.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123711917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}