Yeyi Zhu, Ladia M Hernandez, Peter Mueller, Yongquan Dong, Michele R Forman
{"title":"Data Acquisition and Preprocessing in Studies on Humans: What Is Not Taught in Statistics Classes?","authors":"Yeyi Zhu, Ladia M Hernandez, Peter Mueller, Yongquan Dong, Michele R Forman","doi":"10.1080/00031305.2013.842498","DOIUrl":"10.1080/00031305.2013.842498","url":null,"abstract":"<p><p>The aim of this paper is to address issues in research that may be missing from statistics classes and important for (bio-)statistics students. In the context of a case study, we discuss data acquisition and preprocessing steps that fill the gap between research questions posed by subject matter scientists and statistical methodology for formal inference. Issues include participant recruitment, data collection training and standardization, variable coding, data review and verification, data cleaning and editing, and documentation. Despite the critical importance of these details in research, most of these issues are rarely discussed in an applied statistics program. One reason for the lack of more formal training is the difficulty in addressing the many challenges that can possibly arise in the course of a study in a systematic way. This article can help to bridge this gap between research questions and formal statistical inference by using an illustrative case study for a discussion. 
We hope that reading and discussing this paper and practicing data preprocessing exercises will sensitize statistics students to these important issues and achieve optimal conduct, quality control, analysis, and interpretation of a study.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"67 4","pages":"235-241"},"PeriodicalIF":1.8,"publicationDate":"2013-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3912269/pdf/nihms537499.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32104198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Overview of Current Software Procedures for Fitting Linear Mixed Models.","authors":"Brady T West, Andrzej T Galecki","doi":"10.1198/tas.2011.11077","DOIUrl":"https://doi.org/10.1198/tas.2011.11077","url":null,"abstract":"At present, there are many software procedures available that enable statisticians to fit linear mixed models (LMMs) to continuous dependent variables in clustered or longitudinal datasets. LMMs are flexible tools for analyzing relationships among variables in these types of datasets, in that a variety of covariance structures can be used depending on the subject matter under study. The explicit random effects in LMMs allow analysts to make inferences about the variability between clusters or subjects in larger hypothetical populations, and examine cluster- or subject-level variables that explain portions of this variability. These models can also be used to analyze longitudinal or clustered datasets with data that are missing at random (MAR), and can accommodate time-varying covariates in longitudinal datasets. Although the software procedures currently available have many features in common, more specific analytic aspects of fitting LMMs (e.g., crossed random effects, appropriate hypothesis testing for variance components, diagnostics, incorporating sampling weights) may only be available in selected software procedures. 
With this article, we aim to perform a comprehensive and up-to-date comparison of the current capabilities of software procedures for fitting LMMs, and provide statisticians with a guide for selecting a software procedure appropriate for their analytic goals.","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 4","pages":"274-282"},"PeriodicalIF":1.8,"publicationDate":"2012-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1198/tas.2011.11077","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31375746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
American Statistician | Pub Date: 2012-01-01 | Epub Date: 2012-03-21 | DOI: 10.1080/00031305.2012.676329
Bailey K Fosdick, Adrian E Raftery
{"title":"Estimating the Correlation in Bivariate Normal Data with Known Variances and Small Sample Sizes.","authors":"Bailey K Fosdick, Adrian E Raftery","doi":"10.1080/00031305.2012.676329","DOIUrl":"https://doi.org/10.1080/00031305.2012.676329","url":null,"abstract":"<p><p>We consider the problem of estimating the correlation in bivariate normal data when the means and variances are assumed known, with emphasis on the small sample case. We consider eight different estimators, several of them considered here for the first time in the literature. In a simulation study, we found that Bayesian estimators using the uniform and arc-sine priors outperformed several empirical and exact or approximate maximum likelihood estimators in small samples. The arc-sine prior did better for large values of the correlation. For testing whether the correlation is zero, we found that Bayesian hypothesis tests outperformed significance tests based on the empirical and exact or approximate maximum likelihood estimators considered in small samples, but that all tests performed similarly for sample size 50. These results lead us to suggest using the posterior mean with the arc-sine prior to estimate the correlation in small samples when the variances are assumed known.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 1","pages":"34-41"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.676329","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31302836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Theodore G Karrison, Mark J Ratain, Walter M Stadler, Gary L Rosner
{"title":"Estimation of Progression-Free Survival for All Treated Patients in the Randomized Discontinuation Trial Design.","authors":"Theodore G Karrison, Mark J Ratain, Walter M Stadler, Gary L Rosner","doi":"10.1080/00031305.2012.720900","DOIUrl":"https://doi.org/10.1080/00031305.2012.720900","url":null,"abstract":"<p><p>The randomized discontinuation trial (RDT) design is an enrichment-type design that has been used in a variety of diseases to evaluate the efficacy of new treatments. The RDT design seeks to select a more homogeneous group of patients, consisting of those who are more likely to show a treatment benefit if one exists. In oncology, the RDT design has been applied to evaluate the effects of cytostatic agents, that is, drugs that act primarily by slowing tumor growth rather than shrinking tumors. In the RDT design, all patients receive treatment during an initial, open-label run-in period of duration <i>T</i>. Patients with objective response (substantial tumor shrinkage) remain on therapy while those with early progressive disease are removed from the trial. Patients with stable disease (SD) are then randomized to either continue active treatment or switched to placebo. The main analysis compares outcomes, for example, progression-free survival (PFS), between the two randomized arms. As a secondary objective, investigators may seek to estimate PFS for all treated patients, measured from the time of entry into the study, by combining information from the run-in and post run-in periods. For <i>t ≤ T</i>, PFS is estimated by the observed proportion of patients who are progression-free among all patients enrolled. 
For <i>t > T</i>, the estimate can be expressed as <i>Ŝ</i>(<i>t</i>) = <i>p̂</i><sub>OR</sub> × <i>Ŝ</i><sub>OR</sub>(<i>t - T</i>) + <i>p̂</i><sub>SD</sub> × <i>Ŝ</i><sub>SD</sub>(<i>t - T</i>), where <i>p̂</i><sub>OR</sub> is the estimated probability of response during the run-in period, <i>p̂</i><sub>SD</sub> is the estimated probability of SD, and <i>Ŝ</i><sub>OR</sub>(<i>t - T</i>) and <i>Ŝ</i><sub>SD</sub>(<i>t - T</i>) are the Kaplan-Meier estimates of subsequent PFS in the responders and patients with SD randomized to continue treatment, respectively. In this article, we derive the variance of <i>Ŝ</i>(<i>t</i>), enabling the construction of confidence intervals for both <i>S</i>(<i>t</i>) and the median survival time. Simulation results indicate that the method provides accurate coverage rates. An interesting aspect of the design is that outcomes during the run-in phase have a negative multinomial distribution, something not frequently encountered in practice.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 3","pages":"155-162"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.720900","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31736474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Simulation Based Evaluation of the Asymptotic Power Formulae for Cox Models in Small Sample Cases.","authors":"Mehmet Kocak, Arzu Onar-Thomas","doi":"10.1080/00031305.2012.703873","DOIUrl":"https://doi.org/10.1080/00031305.2012.703873","url":null,"abstract":"<p><p>Cox proportional hazards (PH) models are commonly used in medical research to investigate the associations between covariates and time to event outcomes. It is frequently noted that with less than ten events per covariate, these models produce spurious results, and therefore, should not be used. Statistical literature contains asymptotic power formulae for the Cox model which can be used to determine the number of events needed to detect an association. Here we investigate via simulations the performance of these formulae in small sample settings for Cox models with 1- or 2-covariates. Our simulations indicate that, when the number of events is small, the power estimate based on the asymptotic formulae is often inflated. The discrepancy between the asymptotic and empirical power is larger for the dichotomous covariate especially in cases where allocation of sample size to its levels is unequal. 
When more than one covariate is included in the same model, the discrepancy between the asymptotic power and the empirical power is even larger, especially when a high positive correlation exists between the two covariates.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 3","pages":"173-179"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.703873","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31798842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
American Statistician | Pub Date: 2012-01-01 | Epub Date: 2012-06-12 | DOI: 10.1080/00031305.2012.671724
Robert S Poulson, Gary L Gadbury, David B Allison
{"title":"Treatment Heterogeneity and Individual Qualitative Interaction.","authors":"Robert S Poulson, Gary L Gadbury, David B Allison","doi":"10.1080/00031305.2012.671724","DOIUrl":"https://doi.org/10.1080/00031305.2012.671724","url":null,"abstract":"<p><p>Plausibility of high variability in treatment effects across individuals has been recognized as an important consideration in clinical studies. Surprisingly, little attention has been given to evaluating this variability in design of clinical trials or analyses of resulting data. High variation in a treatment's efficacy or safety across individuals (referred to herein as treatment heterogeneity) may have important consequences because the optimal treatment choice for an individual may be different from that suggested by a study of average effects. We call this an individual qualitative interaction (IQI), borrowing terminology from earlier work - referring to a qualitative interaction (QI) being present when the optimal treatment varies across \"groups\" of individuals. At least three techniques have been proposed to investigate treatment heterogeneity: techniques to detect a QI, use of measures such as the density overlap of two outcome variables under different treatments, and use of cross-over designs to observe \"individual effects.\" We elucidate underlying connections among them, their limitations and some assumptions that may be required. 
We do so under a potential outcomes framework that can add insights to results from usual data analyses and to study design features that improve the capability to more directly assess treatment heterogeneity.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"66 1","pages":"16-24"},"PeriodicalIF":1.8,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/00031305.2012.671724","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31092749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Josue G Martinez, Raymond J Carroll, Samuel Müller, Joshua N Sampson, Nilanjan Chatterjee
{"title":"Empirical Performance of Cross-Validation With Oracle Methods in a Genomics Context.","authors":"Josue G Martinez, Raymond J Carroll, Samuel Müller, Joshua N Sampson, Nilanjan Chatterjee","doi":"10.1198/tas.2011.11052","DOIUrl":"10.1198/tas.2011.11052","url":null,"abstract":"<p><p>When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving Single Nucleotide Polymorphisms (SNP), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 4","pages":"223-228"},"PeriodicalIF":1.8,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3281424/pdf/nihms355303.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30470829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
José R Zubizarreta, Caroline E Reinke, Rachel R Kelz, Jeffrey H Silber, Paul R Rosenbaum
{"title":"Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery.","authors":"José R Zubizarreta, Caroline E Reinke, Rachel R Kelz, Jeffrey H Silber, Paul R Rosenbaum","doi":"10.1198/tas.2011.11072","DOIUrl":"https://doi.org/10.1198/tas.2011.11072","url":null,"abstract":"<p><p>Matching for several nominal covariates with many levels has usually been thought to be difficult because these covariates combine to form an enormous number of interaction categories with few if any people in most such categories. Moreover, because nominal variables are not ordered, there is often no notion of a \"close substitute\" when an exact match is unavailable. In a case-control study of the risk factors for readmission within 30 days of surgery in the Medicare population, we wished to match for 47 hospitals, 15 surgical procedures grouped or nested within 5 procedure groups, two genders, or 47 × 15 × 2 = 1410 categories. In addition, we wished to match as closely as possible for the continuous variable age (65-80 years). There were 1380 readmitted patients or cases. A fractional factorial experiment may balance main effects and low-order interactions without achieving balance for high-order interactions. In an analogous fashion, we balance certain main effects and low-order interactions among the covariates; moreover, we use as many exactly matched pairs as possible. This is done by creating a match that is exact for several variables, with a close match for age, and both a \"near-exact match\" and a \"finely balanced match\" for another nominal variable, in this case a 47 × 5 = 235 category variable representing the interaction of the 47 hospitals and the five surgical procedure groups. 
The method is easily implemented in R.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 4","pages":"229-238"},"PeriodicalIF":1.8,"publicationDate":"2011-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1198/tas.2011.11072","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32832138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Classification-Based Relabeling in Mixture Models.","authors":"Andrew J Cron, Mike West","doi":"10.1198/tast.2011.10170","DOIUrl":"https://doi.org/10.1198/tast.2011.10170","url":null,"abstract":"<p><p>Effective component relabeling in Bayesian analyses of mixture models is critical to the routine use of mixtures in classification with analysis based on Markov chain Monte Carlo methods. The classification-based relabeling approach here is computationally attractive and statistically effective, and scales well with sample size and number of mixture components concordant with enabling routine analyses of increasingly large data sets. Building on the best of existing methods, practical relabeling aims to match data:component classification indicators in MCMC iterates with those of a defined reference mixture distribution. The method performs as well as or better than existing methods in small dimensional problems, while being practically superior in problems with larger data sets as the approach is scalable. We describe examples and computational benchmarks, and provide supporting code with efficient computational implementation of the algorithm that will be of use to others in practical applications of mixture models.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 1","pages":"16-20"},"PeriodicalIF":1.8,"publicationDate":"2011-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1198/tast.2011.10170","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"29927121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
American Statistician | Pub Date: 2011-01-01 | Epub Date: 2012-01-01 | DOI: 10.1198/tast.2011.08294
Bo Lu, Robert Greevy, Xinyi Xu, Cole Beck
{"title":"Optimal Nonbipartite Matching and Its Statistical Applications.","authors":"Bo Lu, Robert Greevy, Xinyi Xu, Cole Beck","doi":"10.1198/tast.2011.08294","DOIUrl":"10.1198/tast.2011.08294","url":null,"abstract":"<p><p>Matching is a powerful statistical tool in design and analysis. Conventional two-group, or bipartite, matching has been widely used in practice. However, its utility is limited to simpler designs. In contrast, nonbipartite matching is not limited to the two-group case, handling multiparty matching situations. It can be used to find the set of matches that minimize the sum of distances based on a given distance matrix. It brings greater flexibility to the matching design, such as multigroup comparisons. Thanks to improvements in computing power and freely available algorithms to solve nonbipartite problems, the cost in terms of computation time and complexity is low. This article reviews the optimal nonbipartite matching algorithm and its statistical applications, including observational studies with complex designs and an exact distribution-free test comparing two multivariate distributions. We also introduce an R package that performs optimal nonbipartite matching. We present an easily accessible web application to make nonbipartite matching freely available to general researchers.</p>","PeriodicalId":50801,"journal":{"name":"American Statistician","volume":"65 1","pages":"21-30"},"PeriodicalIF":1.8,"publicationDate":"2011-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3501247/pdf/nihms412698.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31070271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}