{"title":"NAICS Code Prediction Using Supervised Methods","authors":"C. Oehlert, Evan T. Schulz, Anne Parker","doi":"10.1080/2330443X.2022.2033654","DOIUrl":"https://doi.org/10.1080/2330443X.2022.2033654","url":null,"abstract":"Abstract When compiling industry statistics or selecting businesses for further study, researchers often rely on North American Industry Classification System (NAICS) codes. However, codes are self-reported on tax forms and reporting incorrect codes or even leaving the code blank has no tax consequences, so they are often unusable. IRS's Statistics of Income (SOI) program validates NAICS codes for businesses in the statistical samples used to produce official tax statistics for various filing populations, including sole proprietorships (those filing Form 1040 Schedule C) and corporations (those filing Forms 1120). In this article we leverage these samples to explore ways to improve NAICS code reporting for all filers in the relevant populations. For sole proprietorships, we overcame several record linkage complications to combine data from SOI samples with other administrative data. Using the SOI-validated NAICS code values as ground truth, we trained classification-tree-based models (randomForest) to predict NAICS industry sector from other tax return data, including text descriptions, for businesses which did or did not initially report a valid NAICS code. 
For both sole proprietorships and corporations, we were able to improve slightly on the accuracy of valid self-reported industry sector and correctly identify sector for over half of businesses with no informative reported NAICS code.","PeriodicalId":43397,"journal":{"name":"Statistics and Public Policy","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46698012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reconciling Evaluations of the Millennium Villages Project","authors":"A. Gelman, Shira Mitchell, J. Sachs, S. Sachs","doi":"10.1080/2330443X.2021.2019152","DOIUrl":"https://doi.org/10.1080/2330443X.2021.2019152","url":null,"abstract":"Abstract The Millennium Villages Project was an integrated rural development program carried out for a decade in 10 clusters of villages in sub-Saharan Africa starting in 2005, and in a few other sites for shorter durations. An evaluation of the 10 main sites compared to retrospectively chosen control sites estimated positive effects on a range of economic, social, and health outcomes (Mitchell et al. 2018). More recently, an outside group performed a prospective controlled (but also nonrandomized) evaluation of one of the shorter-duration sites and reported smaller or null results (Masset et al. 2020). Although these two conclusions seem contradictory, the differences can be explained by the fact that Mitchell et al. studied 10 sites where the project was implemented for 10 years, and Masset et al. studied one site with a program lasting less than 5 years, as well as differences in inference and framing. Insights from both evaluations should be valuable in considering future development efforts of this sort. 
Both studies are consistent with a larger picture of positive average impacts (compared to untreated villages) across a broad range of outcomes, but with effects varying across sites or requiring an adequate duration for impacts to be manifested.","PeriodicalId":43397,"journal":{"name":"Statistics and Public Policy","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2021-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44919973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graphical Measures Summarizing the Inequality of Income of Two Groups","authors":"Joshua Landon, Joseph Gastwirth","doi":"10.1080/2330443X.2021.2016084","DOIUrl":"https://doi.org/10.1080/2330443X.2021.2016084","url":null,"abstract":"Abstract Recently, Gastwirth proposed two transformations and of the Lorenz curve, which calculates the proportion of a population, cumulated from the poorest or middle, respectively, needed to have the same amount of income as top . Economists and policy makers are often interested in the comparative status of two groups, for example, females versus males or minority versus majority. This article adapts and extends the concept underlying the and curves to provide analogous curves comparing the relative status of two groups. Now one calculates the proportion of the minority group, cumulated from the bottom or middle needed to have the same total income as the top qth fraction of the majority group (after adjusting for sample size). The areas between these curves and the line of equality are analogous to the Gini index. The methodology is used to illustrate the change in the degree of inequality between males and females, as well as between black and white males, in the United States between 2000 and 2017, and can be used to examine disparities between the expenditures on health of minorities and white people.","PeriodicalId":43397,"journal":{"name":"Statistics and Public Policy","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2021-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46759615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating Local Prevalence of Obesity Via Survey Under Cost Constraints: Stratifying ZCTAs in Virginia’s Thomas Jefferson Health District","authors":"Benjamin J. Lobo, D. Bonds, K. Kafadar","doi":"10.1080/2330443X.2021.2016083","DOIUrl":"https://doi.org/10.1080/2330443X.2021.2016083","url":null,"abstract":"Abstract Currently, the most reliable estimate of the prevalence of obesity in Virginia’s Thomas Jefferson Health District (TJHD) comes from an annual telephone survey conducted by the Centers for Disease Control and Prevention. This district-wide estimate has limited use to decision makers who must target health interventions at a more granular level. A survey is one way of obtaining more granular estimates. This article describes the process of stratifying targeted geographic units (here, ZIP Code Tabulation Areas, or ZCTAs) prior to conducting the survey for those situations where cost considerations make it infeasible to sample each geographic unit (here, ZCTA) in the region (here, TJHD). Feature selection, allocation factor analysis, and hierarchical clustering were used to stratify ZCTAs. We describe the survey sampling strategy that we developed, by creating strata of ZCTAs; the data analysis using the R survey package; and the results. The resulting maps of obesity prevalence show stark differences in prevalence depending on the area of the health district, highlighting the importance of assessing health outcomes at a granular level. Our approach is a detailed and reproducible set of steps that can be used by others who face similar scenarios. 
Supplementary files for this article are available online.","PeriodicalId":43397,"journal":{"name":"Statistics and Public Policy","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2021-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43545360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The effect of COVID-19 vaccinations on self-reported depression and anxiety during February 2021","authors":"M. Rubinstein, A. Haviland, J. Breslau","doi":"10.1080/2330443x.2023.2190008","DOIUrl":"https://doi.org/10.1080/2330443x.2023.2190008","url":null,"abstract":"Using the COVID-19 Trends and Impacts Survey (CTIS), we examine the effect of COVID-19 vaccinations on (self-reported) feelings of depression and anxiety (\"depression\"), isolation, and worries about health, among vaccine-accepting survey respondents during February 2021. Assuming no unmeasured confounding, we estimate that vaccinations caused a -4.3 (-4.7, -3.8), -3.4 (-3.9, -2.9), and -4.8 (-5.4, -4.1) percentage point change in these outcomes, respectively. We further argue that these effects provide a lower bound on the mental health burden of the pandemic, implying that the COVID-19 pandemic was responsible for at least a 28.6 (25.3, 31.9) percent increase in feelings of depression and a 20.5 (17.3, 23.6) percent increase in feelings of isolation during February 2021 among vaccine-accepting CTIS survey respondents. We also posit a model where vaccinations affect depression through worries about health and feelings of isolation, and estimate the proportion mediated by each pathway. We find that feelings of social isolation is the stronger mediator, accounting for 41.0 (37.3, 44.7) percent of the total effect, while worries about health accounts for 9.4 (7.6, 11.1) percent of the total effect. We caution that the causal interpretation of these findings rests on strong assumptions. 
Nevertheless, as the pandemic continues, policymakers should also target interventions aimed at managing the substantial mental health burden associated with the COVID-19 pandemic.","PeriodicalId":43397,"journal":{"name":"Statistics and Public Policy","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45879409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mathematical Analysis of Redistricting in Utah","authors":"Ann King, Jacob Murri, Jake Callahan, Adrienne Russell, Tyler J. Jarvis","doi":"10.1080/2330443X.2022.2105770","DOIUrl":"https://doi.org/10.1080/2330443X.2022.2105770","url":null,"abstract":"Abstract We discuss difficulties of evaluating partisan gerrymandering in the congressional districts in Utah and the failure of many common metrics in Utah. We explain why the Republican vote share in the least-Republican district (LRVS) is a good indicator of the advantage or disadvantage each party has in the Utah congressional districts. Although the LRVS only makes sense in settings with at most one competitive district, in that setting it directly captures the extent to which a given redistricting plan gives advantage or disadvantage to the Republican and Democratic parties. We use the LRVS to evaluate the most common measures of partisan gerrymandering in the context of Utah’s 2011 congressional districts. We do this by generating large ensembles of alternative redistricting plans using Markov chain Monte Carlo methods. We also discuss the implications of this new metric and our results on the question of whether the 2011 Utah congressional plan was gerrymandered.","PeriodicalId":43397,"journal":{"name":"Statistics and Public Policy","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2021-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45183055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prevalence and Propagation of Fake News","authors":"Banafsheh Behzad, Bhavana Bheem, D. Elizondo, Deyana Marsh, Susan E. Martonosi","doi":"10.1080/2330443X.2023.2190368","DOIUrl":"https://doi.org/10.1080/2330443X.2023.2190368","url":null,"abstract":"In recent years, scholars have raised concerns on the effects that unreliable news, or \"fake news,\" has on our political sphere, and our democracy as a whole. For example, the propagation of fake news on social media is widely believed to have influenced the outcome of national elections, including the 2016 U.S. Presidential Election, and the 2020 COVID-19 pandemic. What drives the propagation of fake news on an individual level, and which interventions could effectively reduce the propagation rate? Our model disentangles bias from truthfulness of an article and examines the relationship between these two parameters and a reader's own beliefs. Using the model, we create policy recommendations for both social media platforms and individual social media users to reduce the spread of untruthful or highly biased news. We recommend that platforms sponsor unbiased truthful news, focus fact-checking efforts on mild to moderately biased news, recommend friend suggestions across the political spectrum, and provide users with reports about the political alignment of their feed. 
We recommend that individual social media users fact check news that strongly aligns with their political bias and read articles of opposing political bias.","PeriodicalId":43397,"journal":{"name":"Statistics and Public Policy","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2021-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47609705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Misuse of Statistical Reasoning: The Statistical Arguments Offered by Texas to the Supreme Court in an Attempt to Overturn the Results of the 2020 Election","authors":"W. Miao, Qing Pan, J. Gastwirth","doi":"10.1080/2330443X.2022.2050327","DOIUrl":"https://doi.org/10.1080/2330443X.2022.2050327","url":null,"abstract":"Abstract In December 2020, Texas filed a motion to the U.S. Supreme Court claiming that the four battleground states: Pennsylvania, Georgia, Michigan, and Wisconsin did not conduct their 2020 presidential elections in compliance with the Constitution. Texas supported its motion with a statistical analysis purportedly demonstrating that it was highly improbable that Biden had more votes than Trump in the four battleground states. This article points out that Texas’s claim is logically flawed and the analysis submitted violated several fundamental principles of statistics.","PeriodicalId":43397,"journal":{"name":"Statistics and Public Policy","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2021-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46560436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Changes in Crime Rates during the COVID-19 Pandemic","authors":"Mikaela Meyer, Ahmed Hassafy, G. Lewis, Prasun Shrestha, A. Haviland, D. Nagin","doi":"10.1080/2330443X.2022.2071369","DOIUrl":"https://doi.org/10.1080/2330443X.2022.2071369","url":null,"abstract":"Abstract We estimate changes in the rates of five FBI Part 1 crimes during the 2020 spring COVID-19 pandemic lockdown period and the period after the killing of George Floyd through December 2020. We use weekly crime rate data from 28 of the 70 largest cities in the United States from January 2018 to December 2020. Homicide rates were higher throughout 2020, including during early 2020 prior to March lockdowns. Auto thefts increased significantly during the summer and remainder of 2020. In contrast, robbery and larceny significantly declined during all three post-pandemic periods. Point estimates of burglary rates pointed to a decline for all four periods of 2020, but only the pre-pandemic period was statistically significant. We construct a city-level openness index to examine whether the degree of openness just prior to and during the lockdowns was associated with changing crime rates. Larceny and robbery rates both had a positive and significant association with the openness index implying lockdown restrictions reduced offense rates whereas the other three crime types had no detectable association. While opportunity theory is a tempting post hoc explanation of some of these findings, no single crime theory provides a plausible explanation of all the results. 
Supplementary materials for this article are available online.","PeriodicalId":43397,"journal":{"name":"Statistics and Public Policy","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2021-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42372574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking the Funding Line at the Swiss National Science Foundation: Bayesian Ranking and Lottery","authors":"R. Heyard, Manuela Ott, G. Salanti, M. Egger","doi":"10.1080/2330443X.2022.2086190","DOIUrl":"https://doi.org/10.1080/2330443X.2022.2086190","url":null,"abstract":"Abstract Funding agencies rely on peer review and expert panels to select the research deserving funding. Peer review has limitations, including bias against risky proposals or interdisciplinary research. The inter-rater reliability between reviewers and panels is low, particularly for proposals near the funding line. Funding agencies are also increasingly acknowledging the role of chance. The Swiss National Science Foundation (SNSF) introduced a lottery for proposals in the middle group of good but not excellent proposals. In this article, we introduce a Bayesian hierarchical model for the evaluation process. To rank the proposals, we estimate their expected ranks (ER), which incorporates both the magnitude and uncertainty of the estimated differences between proposals. A provisional funding line is defined based on ER and budget. The ER and its credible interval are used to identify proposals with similar quality and credible intervals that overlap with the provisional funding line. These proposals are entered into a lottery. We illustrate the approach for two SNSF grant schemes in career and project funding. We argue that the method could reduce bias in the evaluation process. 
R code, data and other materials for this article are available online.","PeriodicalId":43397,"journal":{"name":"Statistics and Public Policy","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2021-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43492548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}