{"title":"Editorial: Symposium Data Science and Statistics 2022","authors":"C. Bowen, M. Grosskopf","doi":"10.6339/23-jds212edi","DOIUrl":"https://doi.org/10.6339/23-jds212edi","url":null,"abstract":"","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71321084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Bayesian High-Dimensional Classification via Random Projection with Application to Gene Expression Data","authors":"Abhisek Chakraborty","doi":"10.6339/23-jds1102","DOIUrl":"https://doi.org/10.6339/23-jds1102","url":null,"abstract":"Inspired by the impressive successes of compress sensing-based machine learning algorithms, data augmentation-based efficient Gibbs samplers for Bayesian high-dimensional classification models are developed by compressing the design matrix to a much lower dimension. Ardent care is exercised in the choice of the projection mechanism, and an adaptive voting rule is employed to reduce sensitivity to the random projection matrix. Focusing on the high-dimensional Probit regression model, we note that the naive implementation of the data augmentation-based Gibbs sampler is not robust to the presence of co-linearity in the design matrix – a setup ubiquitous in $n<p$ problems. We demonstrate that a simple fix based on joint updates of parameters in the latent space circumnavigates this issue. With a computationally efficient MCMC scheme in place, we introduce an ensemble classifier by creating R (∼25–50) projected copies of the design matrix, and subsequently running R classification models with the R projected design matrix in parallel. We combine the output from the R replications via an adaptive voting scheme. Our scheme is inherently parallelizable and capable of taking advantage of modern computing environments often equipped with multiple cores. The empirical success of our methodology is illustrated in elaborate simulations and gene expression data applications. We also extend our methodology to a high-dimensional logistic regression model and carry out numerical studies to showcase its efficacy.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identification of Optimal Combined Moderators for Time to Relapse","authors":"Bang Wang, Yu Cheng, M. Levine","doi":"10.6339/23-jds1107","DOIUrl":"https://doi.org/10.6339/23-jds1107","url":null,"abstract":"Identifying treatment effect modifiers (i.e., moderators) plays an essential role in improving treatment efficacy when substantial treatment heterogeneity exists. However, studies are often underpowered for detecting treatment effect modifiers, and exploratory analyses that examine one moderator per statistical model often yield spurious interactions. Therefore, in this work, we focus on creating an intuitive and readily implementable framework to facilitate the discovery of treatment effect modifiers and to make treatment recommendations for time-to-event outcomes. To minimize the impact of a misspecified main effect and avoid complex modeling, we construct the framework by matching the treated with the controls and modeling the conditional average treatment effect via regressing the difference in the observed outcomes of a matched pair on the averaged moderators. Inverse-probability-of-censoring weighting is used to handle censored observations. As matching is the foundation of the proposed methods, we explore different matching metrics and recommend the use of Mahalanobis distance when both continuous and categorical moderators are present. After matching, the proposed framework can be flexibly combined with popular variable selection and prediction methods such as linear regression, least absolute shrinkage and selection operator (Lasso), and random forest to create different combinations of potential moderators. The optimal combination is determined by the out-of-bag prediction error and the area under the receiver operating characteristic curve in making correct treatment recommendations. We compare the performance of various combined moderators through extensive simulations and the analysis of real trial data. Our approach can be easily implemented using existing R packages, resulting in a straightforward optimal combined moderator to make treatment recommendations.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting the Use of Generalized Least Squares in Time Series Regression Models","authors":"Yue Fang, S. Koreisha, Q. Shao","doi":"10.6339/23-jds1108","DOIUrl":"https://doi.org/10.6339/23-jds1108","url":null,"abstract":"Linear regression models are widely used in empirical studies. When serial correlation is present in the residuals, generalized least squares (GLS) estimation is commonly used to improve estimation efficiency. This paper proposes the use of an alternative estimator, the approximate generalized least squares estimators based on high-order AR(p) processes (GLS-AR). We show that GLS-AR estimators are asymptotically efficient as GLS estimators, as both the number of AR lag, p, and the number of observations, n, increase together so that $p=o({n^{1/4}})$ in the limit. The proposed GLS-AR estimators do not require the identification of the residual serial autocorrelation structure and perform more robust in finite samples than the conventional FGLS-based tests. Finally, we illustrate the usefulness of GLS-AR method by applying it to the global warming data from 1850–2012.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing the Rainfall Pattern in Honduras Through Non-Homogeneous Hidden Markov Models","authors":"Gustavo Alexis Sabillón, D. Zuanetti","doi":"10.6339/23-jds1091","DOIUrl":"https://doi.org/10.6339/23-jds1091","url":null,"abstract":"One of the major climatic interests of the last decades has been to understand and describe the rainfall patterns of specific areas of the world as functions of other climate covariates. We do it for the historical climate monitoring data from Tegucigalpa, Honduras, using non-homogeneous hidden Markov models (NHMMs), which are dynamic models usually used to identify and predict heterogeneous regimes. For estimating the NHMM in an efficient and scalable way, we propose the stochastic Expectation-Maximization (EM) algorithm and a Bayesian method, and compare their performance in synthetic data. Although these methodologies have already been used for estimating several other statistical models, it is not the case of NHMMs which are still widely fitted by the traditional EM algorithm. We observe that, under tested conditions, the performance of the Bayesian and stochastic EM algorithms is similar and discuss their slight differences. Analyzing the Honduras rainfall data set, we identify three heterogeneous rainfall periods and select temperature and humidity as relevant covariates for explaining the dynamic relation among these periods.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Network A/B Testing: Nonparametric Statistical Significance Test Based on Cluster-Level Permutation","authors":"Hongwei Shang, Xiaolin Shi, Bai Jiang","doi":"10.6339/23-jds1112","DOIUrl":"https://doi.org/10.6339/23-jds1112","url":null,"abstract":"A/B testing is widely used for comparing two versions of a product and evaluating new proposed product features. It is of great importance for decision-making and has been applied as a golden standard in the IT industry. It is essentially a form of two-sample statistical hypothesis testing. Average treatment effect (ATE) and the corresponding p-value can be obtained under certain assumptions. One key assumption in traditional A/B testing is the stable-unit-treatment-value assumption (SUTVA): there is no interference among different units. It means that the observation on one unit is unaffected by the particular assignment of treatments to the other units. Nonetheless, interference is very common in social network settings where people communicate and spread information to their neighbors. Therefore, the SUTVA assumption is violated. Analysis ignoring this network effect will lead to biased estimation of ATE. Most existing works focus mainly on the design of experiment and data analysis in order to produce estimators with good performance in regards to bias and variance. Little attention has been paid to the calculation of p-value. We work on the calculation of p-value for the ATE estimator in network A/B tests. After a brief review of existing research methods on design of experiment based on graph cluster randomization and different ATE estimation methods, we propose a permutation method for calculating p-value based on permutation test at the cluster level. The effectiveness of the method against that based on individual-level permutation is validated in a simulation study mimicking realistic settings.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71321036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Editorial: Advances in Network Data Science","authors":"Yuguo Chen, Daniel Sewell, Panpan Zhang, Xuening Zhu","doi":"10.6339/23-jds213edi","DOIUrl":"https://doi.org/10.6339/23-jds213edi","url":null,"abstract":"This special issue features nine articles on “Advances in Network Data Science”. Data science is an interdisciplinary research field utilizing scientific methods to facilitate knowledge and insights from structured and unstructured data across a broad range of domains. Network data are proliferating in many fields, and network data analysis has become a burgeoning research in the data science community. Due to the nature of heterogeneity and complexity of network data, classical statistical approaches for network model fitting face a great deal of challenges, especially for large-scale network data. Therefore, it becomes crucial to develop advanced methodological and computational tools to cope with challenges associated with massive and complex network data analyses. This special issue highlights some recent studies in the area of network data analysis, showcasing a variety of contributions in statistical methodology, two real-world applications, a software package for network generation, and a survey on handling missing values in networks. Five articles are published in the Statistical Data Science Section. Wang and Resnick (2023) employed point processes to investigate the macroscopic growth dynamics of geographically concentrated regional networks. They discovered that during the startup phase, a self-exciting point process effectively modeled the growth process, and subsequently, the growth of links could be suitably described by a non-homogeneous Poisson process. Komolafe","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71321135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FROSTY: A High-Dimensional Scale-Free Bayesian Network Learning Method","authors":"Joshua Bang, Sang-Yun Oh","doi":"10.6339/23-jds1097","DOIUrl":"https://doi.org/10.6339/23-jds1097","url":null,"abstract":"We propose a scalable Bayesian network learning algorithm based on sparse Cholesky decomposition. Our approach only requires observational data and user-specified confidence level as inputs and can estimate networks with thousands of variables. The computational complexity of the proposed method is $O({p^{3}})$ for a graph with p vertices. Extensive numerical experiments illustrate the usefulness of our method with promising results. In simulation, the initial step in our approach also improves an alternative Bayesian network structure estimation method that uses an undirected graph as an input.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessment of Projection Pursuit Index for Classifying High Dimension Low Sample Size Data in R","authors":"Zhaoxing Wu, Chunming Zhang","doi":"10.6339/23-jds1096","DOIUrl":"https://doi.org/10.6339/23-jds1096","url":null,"abstract":"Analyzing “large p small n” data is becoming increasingly paramount in a wide range of application fields. As a projection pursuit index, the Penalized Discriminant Analysis ($mathrm{PDA}$) index, built upon the Linear Discriminant Analysis ($mathrm{LDA}$) index, is devised in Lee and Cook (2010) to classify high-dimensional data with promising results. Yet, there is little information available about its performance compared with the popular Support Vector Machine ($mathrm{SVM}$). This paper conducts extensive numerical studies to compare the performance of the $mathrm{PDA}$ index with the $mathrm{LDA}$ index and $mathrm{SVM}$, demonstrating that the $mathrm{PDA}$ index is robust to outliers and able to handle high-dimensional datasets with extremely small sample sizes, few important variables, and multiple classes. Analyses of several motivating real-world datasets reveal the practical advantages and limitations of individual methods, suggesting that the $mathrm{PDA}$ index provides a useful alternative tool for classifying complex high-dimensional data. These new insights, along with the hands-on implementation of the $mathrm{PDA}$ index functions in the R package classPP, facilitate statisticians and data scientists to make effective use of both sets of classification tools.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Building a Foundation for More Flexible A/B Testing: Applications of Interim Monitoring to Large Scale Data","authors":"Wenru Zhou, Miranda Kroehl, Maxene Meier, A. Kaizer","doi":"10.6339/23-jds1099","DOIUrl":"https://doi.org/10.6339/23-jds1099","url":null,"abstract":"The use of error spending functions and stopping rules has become a powerful tool for conducting interim analyses. The implementation of an interim analysis is broadly desired not only in traditional clinical trials but also in A/B tests. Although many papers have summarized error spending approaches, limited work has been done in the context of large-scale data that assists in finding the “optimal” boundary. In this paper, we summarized fifteen boundaries that consist of five error spending functions that allow early termination for futility, difference, or both, as well as a fixed sample size design without interim monitoring. The simulation is based on a practical A/B testing problem comparing two independent proportions. We examine sample sizes across a range of values from 500 to 250,000 per arm to reflect different settings where A/B testing may be utilized. The choices of optimal boundaries are summarized using a proposed loss function that incorporates different weights for the expected sample size under a null experiment with no difference between variants, the expected sample size under an experiment with a difference in the variants, and the maximum sample size needed if the A/B test did not stop early at an interim analysis. The results are presented for simulation settings based on adequately powered, under-powered, and over-powered designs with recommendations for selecting the “optimal” design in each setting.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}