{"title":"Guidelines and Best Practices to Share Deidentified Data and Code","authors":"Nicholas J. Horton, Sara Stoudt","doi":"arxiv-2405.18232","DOIUrl":"https://doi.org/arxiv-2405.18232","url":null,"abstract":"In 2022, the Journal of Statistics and Data Science Education (JSDSE) instituted augmented requirements for authors to post deidentified data and code underlying their papers. These changes were prompted by an increased focus on reproducibility and open science (NASEM 2019). A recent review of data availability practices noted that \"such policies help increase the reproducibility of the published literature, as well as make a larger body of data available for reuse and re-analysis\" (PLOS ONE, 2024). JSDSE values accessibility as it endeavors to share knowledge that can improve educational approaches to teaching statistics and data science. Because institution, environment, and students differ across readers of the journal, it is especially important to facilitate the transfer of a journal article's findings to new contexts. This process may require digging into more of the details, including the deidentified data and code. Our goal is to provide our readers and authors with a review of why the requirements for code and data sharing were instituted, summarize ongoing trends and developments in open science, discuss options for data and code sharing, and share advice for authors.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"133 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141171713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Epistemology behind Covariate Adjustment","authors":"Grayson L. Baird, Stephen L. Bieber","doi":"arxiv-2405.17224","DOIUrl":"https://doi.org/arxiv-2405.17224","url":null,"abstract":"It is often asserted that to control for the effects of confounders, one should include the confounding variables of concern in a statistical model as covariates. Conversely, it is also asserted that control can only be concluded by design, where the results from an analysis can only be interpreted as evidence of an effect because the design controlled for the cause. To suggest otherwise is said to be a fallacy of cum hoc ergo propter hoc. Obviously, these two assertions create a conundrum: How can the effect of a confounder be controlled for with analysis instead of by design without committing cum hoc ergo propter hoc? The present manuscript answers this conundrum.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"97 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141173143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Logic of Counterfactuals and the Epistemology of Causal Inference","authors":"Hanti Lin","doi":"arxiv-2405.11284","DOIUrl":"https://doi.org/arxiv-2405.11284","url":null,"abstract":"The 2021 Nobel Prize in Economics recognized a theory of causal inference, which deserves more attention from philosophers. To that end, I develop a dialectic that extends the Lewis-Stalnaker debate on a logical principle called Conditional Excluded Middle (CEM). I first play the good cop for CEM, and give a new argument for it: a Quine-Putnam indispensability argument based on the Nobel-Prize winning theory. But then I switch sides and play the bad cop: I undermine that argument with a new theory of causal inference that preserves the success of the original theory but dispenses with CEM.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141147813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expected Points Above Average: A Novel NBA Player Metric Based on Bayesian Hierarchical Modeling","authors":"Benjamin Williams, Erin M. Schliep, Bailey Fosdick, Ryan Elmore","doi":"arxiv-2405.10453","DOIUrl":"https://doi.org/arxiv-2405.10453","url":null,"abstract":"Team and player evaluation in professional sport is extremely important given the financial implications of success/failure. It is especially critical to identify and retain elite shooters in the National Basketball Association (NBA), one of the premier basketball leagues worldwide, because the ultimate goal of the game is to score more points than one's opponent. To this end we propose two novel basketball metrics: \"expected points\" for team-based comparisons and \"expected points above average (EPAA)\" as a player-evaluation tool. Both metrics leverage posterior samples from a Bayesian hierarchical modeling framework to cluster teams and players based on their shooting propensities and abilities. We illustrate the concepts for the top 100 shot takers over the last decade and offer our metric as an additional tool for evaluating players.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141147846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identification of Single-Treatment Effects in Factorial Experiments","authors":"Guilherme Duarte","doi":"arxiv-2405.09797","DOIUrl":"https://doi.org/arxiv-2405.09797","url":null,"abstract":"Despite their cost, randomized controlled trials (RCTs) are widely regarded as gold-standard evidence in disciplines ranging from social science to medicine. In recent decades, researchers have increasingly sought to reduce the resource burden of repeated RCTs with factorial designs that simultaneously test multiple hypotheses, e.g. experiments that evaluate the effects of many medications or products simultaneously. Here I show that when multiple interventions are randomized in experiments, the effect any single intervention would have outside the experimental setting is not identified absent heroic assumptions, even if otherwise perfectly realistic conditions are achieved. This happens because single-treatment effects involve a counterfactual world with a single focal intervention, allowing other variables to take their natural values (which may be confounded or modified by the focal intervention). In contrast, observational studies and factorial experiments provide information about potential-outcome distributions with zero and multiple interventions, respectively. In this paper, I formalize sufficient conditions for the identifiability of those isolated quantities. I show that researchers who rely on this type of design have to justify either linearity of functional forms or -- in the nonparametric case -- specify with Directed Acyclic Graphs how variables are related in the real world. Finally, I develop nonparametric sharp bounds -- i.e., maximally informative best-/worst-case estimates consistent with limited RCT data -- that show when extrapolations about effect signs are empirically justified. These new results are illustrated with simulated data.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141062584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequential Maximal Updated Density Parameter Estimation for Dynamical Systems with Parameter Drift","authors":"Carlos del-Castillo-Negrete, Rylan Spence, Troy Butler, Clint Dawson","doi":"arxiv-2405.08307","DOIUrl":"https://doi.org/arxiv-2405.08307","url":null,"abstract":"We present a novel method for generating sequential parameter estimates and quantifying epistemic uncertainty in dynamical systems within a data-consistent (DC) framework. The DC framework differs from traditional Bayesian approaches due to the incorporation of the push-forward of an initial density, which performs selective regularization in parameter directions not informed by the data in the resulting updated density. This extends a previous study that included the linear Gaussian theory within the DC framework and introduced the maximal updated density (MUD) estimate as an alternative to both least squares and maximum a posteriori (MAP) estimates. In this work, we introduce algorithms for operational settings of MUD estimation in real or near-real time where spatio-temporal datasets arrive in packets to provide updated estimates of parameters and identify potential parameter drift. Computational diagnostics within the DC framework prove critical for evaluating (1) the quality of the DC update and MUD estimate and (2) the detection of parameter value drift. The algorithms are applied to estimate (1) wind drag parameters in a high-fidelity storm surge model, (2) thermal diffusivity field for a heat conductivity problem, and (3) changing infection and incubation rates of an epidemiological model.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140941397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting Short Response Ratings with Non-Content Related Features: A Hierarchical Modeling Approach","authors":"Aubrey Condor","doi":"arxiv-2405.08574","DOIUrl":"https://doi.org/arxiv-2405.08574","url":null,"abstract":"We explore whether the human ratings of open-ended responses can be explained with non-content related features, and if such effects vary across different mathematics-related items. When scoring is rigorously defined and rooted in a measurement framework, educators intend that the features of a response which are indicative of the respondent's level of ability are contributing to scores. However, we find that features such as response length, a grammar score of the response, and a metric relating to key phrase frequency are significant predictors for response ratings. Although our findings are not causally conclusive, they may propel us to be more critical of the way in which we assess open-ended responses, especially in high-stakes scenarios. Educators take great care to provide unbiased, consistent ratings, but it may be that extraneous features unrelated to those which were intended to be rated are being evaluated.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140941398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nested Instrumental Variables Design: Switcher Average Treatment Effect, Identification, Efficient Estimation and Generalizability","authors":"Rui Wang, Ying-Qi Zhao, Oliver Dukes, Bo Zhang","doi":"arxiv-2405.07102","DOIUrl":"https://doi.org/arxiv-2405.07102","url":null,"abstract":"Instrumental variables (IV) are a commonly used tool to estimate causal effects from non-randomized data. A prototype of an IV is a randomized trial with non-compliance where the randomized treatment assignment serves as an IV for the non-ignorable treatment received. Under a monotonicity assumption, a valid IV non-parametrically identifies the average treatment effect among a non-identifiable complier subgroup, whose generalizability is often under debate. In many studies, there could exist multiple versions of an IV, for instance, different nudges to take the same treatment in different study sites in a multi-center clinical trial. These different versions of an IV may result in different compliance rates and offer a unique opportunity to study IV estimates' generalizability. In this article, we introduce a novel nested IV assumption and study identification of the average treatment effect among two latent subgroups: always-compliers and switchers, who are defined based on the joint potential treatment received under two versions of a binary IV. We derive the efficient influence function for the SWitcher Average Treatment Effect (SWATE) and propose efficient estimators. We then propose formal statistical tests of the generalizability of IV estimates based on comparing the conditional average treatment effect among the always-compliers and that among the switchers under the nested IV framework. We apply the proposed framework and method to the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial and study the causal effect of colorectal cancer screening and its generalizability.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140941495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Strategies for Rare Population Detection and Sampling: A Methodological Approach in Liguria","authors":"G. Lancia, E. Riccomagno","doi":"arxiv-2405.01342","DOIUrl":"https://doi.org/arxiv-2405.01342","url":null,"abstract":"Economic policy sciences are constantly investigating the quality of well-being of broad sections of the population in order to describe the current interdependence between unequal living conditions, low levels of education and a lack of integration into society. Such studies are often carried out in the form of surveys, e.g. as part of the EU-SILC program. If the survey is designed at national or international level, the results of the study are often used as a reference by a broad range of public institutions. However, the sampling strategy per se may not capture enough information to provide an accurate representation of all population strata. Problems might arise from rare, or hard-to-sample, populations and the conclusion of the study may be compromised or unrealistic. We propose here a two-phase methodology to identify rare, poorly sampled populations and then resample the hard-to-sample strata. We focused our attention on the 2019 EU-SILC section concerning the Italian region of Liguria. Methods based on dispersion indices or deep learning were used to detect rare populations. A multi-frame survey was proposed as the sampling design. The results showed that factors such as citizenship, material deprivation and large families are still fundamental characteristics that are difficult to capture.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140828627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"What's So Hard about the Monty Hall Problem?","authors":"Rafael C. Alvarado","doi":"arxiv-2405.00884","DOIUrl":"https://doi.org/arxiv-2405.00884","url":null,"abstract":"The Monty Hall problem is notorious for its deceptive simplicity. Although today it is widely used as a provocative thought experiment to introduce Bayesian thinking to students of probability, in the not so distant past it was rejected by established mathematicians. This essay provides some historical background to the problem and explains why it is considered so counter-intuitive to many. It is argued that the main barrier to understanding the problem is the back-grounding of the concept of dependence in probability theory as it is commonly taught. To demonstrate this, a Bayesian solution is provided and augmented with a probabilistic graphical model (PGM) inspired by the work of Pearl (1988, 1998). Although the Bayesian approach produces the correct answer, without a representation of the dependency structure of events implied by the problem, the salient fact that motivates the problem's solution remains hidden.","PeriodicalId":501323,"journal":{"name":"arXiv - STAT - Other Statistics","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140828628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}