{"title":"Improving Statistical Matching when Auxiliary Information is Available","authors":"Angelo Moretti, N. Shlomo","doi":"10.1093/jssam/smac038","DOIUrl":"https://doi.org/10.1093/jssam/smac038","url":null,"abstract":"\u0000 There is growing interest within National Statistical Institutes in combining available datasets containing information on a large variety of social domains. Statistical matching approaches can be used to integrate data sources through a common set of variables where each dataset contains different units that belong to the same target population. However, a common problem is related to the assumption of conditional independence among variables observed in different data sources. In this context, an auxiliary dataset containing all the variables jointly can be used to improve the statistical matching by providing information on the correlation structure of variables observed across different datasets. We propose modifying the prediction models from the auxiliary dataset through a calibration step and show that we can improve the outcome of statistical matching in a variety of settings. We evaluate the proposed approach via simulation and an application based on the European Union Statistics for Income and Living Conditions and Living Costs and Food Survey for the United Kingdom.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2023-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48388115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Constructing State and National Estimates of Vaccination Rates from Immunization Information Systems","authors":"T. Raghunathan, K. Kirtland, Ji Li, K. White, B. Murthy, Xia Lin, Latreace Harris, L. Gibbs-Scharf, E. Zell","doi":"10.1093/jssam/smac042","DOIUrl":"https://doi.org/10.1093/jssam/smac042","url":null,"abstract":"\u0000 Immunization Information Systems are confidential computerized population-based systems that collect data from vaccination providers on individual vaccinations administered along with limited patient-level characteristics. Through a data use agreement, Centers for Disease Control and Prevention obtains the individual-level data and aggregates the number of vaccinations for geographical statistical areas defined by the US Census Bureau (counties or equivalent statistical entities) for each vaccine included in system. Currently, 599 counties, covering 11 states, collect and report data using a uniform protocol. We combine these data with inter-decennial population counts from the Population Estimates Program in the US Census Bureau and several covariates from a variety of sources to develop model-based estimates for each of the 3,142 counties in 50 states and the District of Columbia and then aggregate to the state and national levels. We use a hierarchical Bayesian model and Markov Chain Monte Carlo methods to obtain draws from the posterior predictive distribution of the vaccination rates. We use posterior predictive checks and cross-validation to assess the goodness of fit and to validate the models. 
We also compare the model-based estimates to direct estimates from the National Immunization Surveys.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2023-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41952610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
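A drastically simplified stand-in for the hierarchical Bayesian model described above: a conjugate beta-binomial "shrinkage" estimate of a county vaccination rate, in which counties with few reported records are pulled toward the overall rate. The prior counts `a` and `b` below are hypothetical stand-ins for fitted hyperparameters, and the county figures are invented.

```python
# Beta-binomial shrinkage: with vaccinated ~ Binomial(population, p) and
# p ~ Beta(a, b), the posterior mean of p has a closed form that blends the
# county's raw rate with the prior mean a / (a + b).

def shrunken_rate(vaccinated, population, a, b):
    """Posterior mean of the county vaccination rate under a Beta(a, b) prior."""
    return (vaccinated + a) / (population + a + b)

a, b = 30.0, 70.0   # hypothetical prior: centered at 0.30, weight of 100 records
counties = [("small county", 8, 10), ("large county", 5500, 10000)]
for name, v, n in counties:
    print(name, round(shrunken_rate(v, n, a, b), 3))
```

The small county's raw rate of 0.80 is heavily shrunk toward the prior mean, while the large county's estimate stays close to its raw rate of 0.55, mirroring how hierarchical models borrow strength for sparsely observed areas.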
{"title":"An Application of Adaptive Cluster Sampling to Surveying Informal Businesses","authors":"Gemechu Aga, David C Francis, Filip Jolevski, Jorge Rodriguez Meza, Joshua Seth Wimpey","doi":"10.1093/jssam/smac037","DOIUrl":"https://doi.org/10.1093/jssam/smac037","url":null,"abstract":"Abstract Informal business activity is ubiquitous around the world, but it is nearly always uncaptured by administrative data, registries, or commercial sources. For this reason, there are rarely adequate sampling frames available for survey implementers wishing to measure the activity and characteristics of the sector. This article applies a well-established sampling method for rare and/or clustered populations, Adaptive Cluster Sampling (ACS), to a novel population of informal businesses. Generally, it shows that efficiency gains through the application of ACS, when compared to Simple Random Sampling (SRS), are large, particularly at higher levels of fieldwork effort. In particular, ACS efficiency gains over SRS remain sizable at higher values of initial starting samples, but with comparatively high expansion thresholds, which can reduce the fieldwork effort.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135794712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Interviewer Fraud Using Multilevel Models","authors":"Lukas Olbrich, Yuliya Kosyakova, J. Sakshaug, Silvia Schwanhäuser","doi":"10.1093/jssam/smac036","DOIUrl":"https://doi.org/10.1093/jssam/smac036","url":null,"abstract":"\u0000 Interviewer falsification, such as the complete or partial fabrication of interview data, has been shown to substantially affect the results of survey data. In this study, we apply a method to identify falsifying face-to-face interviewers based on the development of their behavior over the survey field period. We postulate four potential falsifier types: steady low-effort falsifiers, steady high-effort falsifiers, learning falsifiers, and sudden falsifiers. Using large-scale survey data from Germany with verified falsifications, we apply multilevel models with interviewer effects on the intercept, scale, and slope of the interview sequence to test whether falsifiers can be detected based on their dynamic behavior. In addition to identifying a rather high-effort falsifier previously detected by the survey organization, the model flagged two additional suspicious interviewers exhibiting learning behavior, who were subsequently classified as deviant by the survey organization. 
We additionally apply the analysis approach to publicly available cross-national survey data and find multiple interviewers who show behavior consistent with the postulated falsifier types.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42430170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
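The paper's multilevel models with interviewer effects on intercept, scale, and slope are beyond a short snippet, but the underlying idea (compare each interviewer's trajectory over the field period with their peers') can be caricatured as follows. All scores are fabricated, and the flagging rule, a z-score cut on per-interviewer slopes, is only illustrative.

```python
import statistics

def slope_and_spread(scores):
    """OLS slope of score on interview sequence index, plus score spread."""
    n = len(scores)
    xbar, ybar = (n - 1) / 2, statistics.fmean(scores)
    sxy = sum((x - xbar) * (y - ybar) for x, y in enumerate(scores))
    sxx = sum((x - xbar) ** 2 for x in range(n))
    return sxy / sxx, statistics.pstdev(scores)

def flag_outliers(by_interviewer, z_cut=1.5):
    """Flag interviewers whose slope is far from the peer-group average."""
    stats_ = {k: slope_and_spread(v) for k, v in by_interviewer.items()}
    slopes = [s for s, _ in stats_.values()]
    mu, sd = statistics.fmean(slopes), statistics.pstdev(slopes)
    return [k for k, (s, _) in stats_.items() if sd > 0 and abs(s - mu) / sd > z_cut]

data = {
    "A": [10, 11, 10, 12, 11, 10],   # steady effort
    "B": [11, 10, 12, 11, 10, 11],   # steady effort
    "C": [10, 9, 7, 5, 3, 1],        # "learning" pattern: effort drops over time
    "D": [11, 12, 11, 10, 11, 12],   # steady effort
}
print(flag_outliers(data))
```

Interviewer C's steadily declining trajectory produces a slope far below the peer average and is flagged, the same signature the abstract attributes to learning falsifiers.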
{"title":"Dependence-Robust Confidence Intervals for Capture-Recapture Surveys.","authors":"Jinghao Sun, Luk Van Baelen, Els Plettinckx, Forrest W Crawford","doi":"10.1093/jssam/smac031","DOIUrl":"10.1093/jssam/smac031","url":null,"abstract":"<p><p>Capture-recapture (CRC) surveys are used to estimate the size of a population whose members cannot be enumerated directly. CRC surveys have been used to estimate the number of Coronavirus Disease 2019 (COVID-19) infections, people who use drugs, sex workers, conflict casualties, and trafficking victims. When <i>k</i>-capture samples are obtained, counts of unit captures in subsets of samples are represented naturally by a <math><mrow><msup><mrow><mn>2</mn></mrow><mi>k</mi></msup></mrow></math> contingency table in which one element-the number of individuals appearing in none of the samples-remains unobserved. In the absence of additional assumptions, the population size is not identifiable (i.e., point identified). Stringent assumptions about the dependence between samples are often used to achieve point identification. However, real-world CRC surveys often use convenience samples in which the assumed dependence cannot be guaranteed, and population size estimates under these assumptions may lack empirical credibility. In this work, we apply the theory of partial identification to show that weak assumptions or qualitative knowledge about the nature of dependence between samples can be used to characterize a nontrivial confidence set for the true population size. We construct confidence sets under bounds on pairwise capture probabilities using two methods: test inversion bootstrap confidence intervals and profile likelihood confidence intervals. Simulation results demonstrate well-calibrated confidence sets for each method. 
In an extensive real-world study, we apply the new methodology to the problem of using heterogeneous survey data to estimate the number of people who inject drugs in Brussels, Belgium.</p>","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2022-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10646701/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44877571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
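For orientation, here is the classical two-sample estimator that such dependence-robust methods generalize: the Chapman variant of the Lincoln-Petersen estimator. It assumes the two capture samples are independent, which is precisely the assumption the paper relaxes. The counts below are hypothetical.

```python
import math

def chapman_estimate(n1, n2, m):
    """Chapman estimator of population size from two capture samples.
    n1, n2: sample sizes; m: units captured in both samples."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

def chapman_ci(n1, n2, m, z=1.96):
    """Approximate interval from the usual large-sample Chapman variance."""
    n_hat = chapman_estimate(n1, n2, m)
    var = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)) / ((m + 1) ** 2 * (m + 2))
    half = z * math.sqrt(var)
    return n_hat - half, n_hat + half

print(round(chapman_estimate(200, 150, 30), 2))
print(chapman_ci(200, 150, 30))
```

When dependence between samples cannot be ruled out, a single such point estimate can be badly biased; the partial-identification approach instead reports a set of population sizes compatible with bounded dependence.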
{"title":"Estimating Web Survey Mode and Panel Effects in a Nationwide Survey of Alcohol Use.","authors":"Randal ZuWallack, Matt Jans, Thomas Brassell, Kisha Bailly, James Dayton, Priscilla Martinez, Deidre Patterson, Thomas K Greenfield, Katherine J Karriker-Jaffe","doi":"10.1093/jssam/smac028","DOIUrl":"https://doi.org/10.1093/jssam/smac028","url":null,"abstract":"<p><p>Random-digit dialing (RDD) telephone surveys are challenged by declining response rates and increasing costs. Many surveys that were traditionally conducted via telephone are seeking cost-effective alternatives, such as address-based sampling (ABS) with self-administered web or mail questionnaires. At a fraction of the cost of both telephone and ABS surveys, opt-in web panels are an attractive alternative. The 2019-2020 National Alcohol Survey (NAS) employed three methods: (1) an RDD telephone survey (traditional NAS method); (2) an ABS push-to-web survey; and (3) an opt-in web panel. The study reported here evaluated differences in the three data-collection methods, which we will refer to as \"mode effects,\" on alcohol consumption and health topics. To evaluate mode effects, multivariate regression models were developed predicting these characteristics, and the presence of a mode effect on each outcome was determined by the significance of the three-level effect (RDD-telephone, ABS-web, opt-in web panel) in each model. Those results were then used to adjust for mode effects and produce a \"telephone-equivalent\" estimate for the ABS and panel data sources. The study found that ABS-web and RDD were similar for most estimates but exhibited differences for sensitive questions including getting drunk and experiencing depression. The opt-in web panel exhibited more differences between it and the other two survey modes. One notable example is the reporting of drinking alcohol at least 3-4 times per week, which was 21 percent for RDD-phone, 24 percent for ABS-web, and 34 percent for opt-in web panel. 
The regression model adjusts for mode effects, improving comparability with past surveys conducted by telephone; however, the models result in higher variance of the estimates. This method of adjusting for mode effects has broad applications to mode and sample transitions throughout the survey research industry.</p>","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2022-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10646698/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138460650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
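A toy version of the "telephone-equivalent" idea, under strong simplifying assumptions: mode offsets are estimated within demographic cells from a calibration round where modes overlap, subtracted from a later panel wave's cell means, and aggregated with population weights. The paper's actual adjustment uses multivariate regression models; every number and cell label below is invented.

```python
# Hypothetical cell-based sketch of a "telephone-equivalent" estimate.

def mode_offsets(calibration, mode, ref):
    """Per-cell offset of `mode` relative to the reference mode `ref`."""
    return {cell: means[mode] - means[ref] for cell, means in calibration.items()}

def telephone_equivalent(panel_means, offsets, pop_weights):
    """Shift panel cell means onto the reference-mode scale, then aggregate."""
    return sum(pop_weights[c] * (panel_means[c] - offsets[c]) for c in pop_weights)

# Calibration round where both modes were fielded (invented values).
calibration = {
    "age_18_39": {"optin_panel": 0.40, "rdd_phone": 0.28},
    "age_40_plus": {"optin_panel": 0.30, "rdd_phone": 0.16},
}
offsets = mode_offsets(calibration, "optin_panel", "rdd_phone")

# A later panel-only wave, adjusted onto the telephone scale.
panel_means = {"age_18_39": 0.38, "age_40_plus": 0.26}
pop_weights = {"age_18_39": 0.4, "age_40_plus": 0.6}
print(telephone_equivalent(panel_means, offsets, pop_weights))
```

As the abstract notes for the regression version, the adjustment buys comparability at the price of extra variance, since the offsets are themselves estimated.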
{"title":"A Simple Question Goes a Long Way: A Wording Experiment on Bank Account Ownership.","authors":"Marco Angrisani, Mick P Couper","doi":"10.1093/jssam/smab045","DOIUrl":"https://doi.org/10.1093/jssam/smab045","url":null,"abstract":"<p><p>Ownership of a bank account is an objective measure and should be relatively easy to elicit via survey questions. Yet, depending on the interview mode, the wording of the question and its placement within the survey may influence respondents' answers. The Health and Retirement Study (HRS) asset module, as administered online to members of the Understanding America Study (UAS), yielded substantially lower rates of reported bank account ownership than either a single question on ownership in the Current Population Survey (CPS) or the full asset module administered to HRS panelists (both interviewer-administered surveys). We designed and implemented an experiment in the UAS comparing the original HRS question eliciting bank account ownership with two alternative versions that were progressively simplified. We document strong evidence that the original question leads to systematic underestimation of bank account ownership. In contrast, the proportion of bank account owners obtained from the simplest alternative version of the question is very similar to the population benchmark estimate. We investigate treatment effect heterogeneity by cognitive ability and financial literacy. We find that questionnaire simplification affects responses of individuals with higher cognitive ability substantially less than those with lower cognitive ability. 
Our results suggest that high-quality data from surveys start from asking the right questions, which should be as simple and precise as possible and carefully adapted to the mode of interview.</p>","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9643168/pdf/smab045.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10370660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
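A hypothetical sketch of how such a wording experiment could be analyzed: a two-proportion z-test comparing reported bank account ownership under the original and the simplified question. The counts below are invented, not the study's data.

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """z statistic for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Invented arm counts: 860/1000 owners under the original wording,
# 930/1000 under the simplified wording.
z = two_prop_z(860, 1000, 930, 1000)
print(round(z, 2))
```

A z statistic this far beyond 1.96 would indicate that the wording difference in reported ownership is unlikely to be sampling noise.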
{"title":"Using Capture-Recapture Methodology to Enhance Precision of Representative Sampling-Based Case Count Estimates.","authors":"Robert H Lyles, Yuzi Zhang, Lin Ge, Cameron England, Kevin Ward, Timothy L Lash, Lance A Waller","doi":"10.1093/jssam/smab052","DOIUrl":"https://doi.org/10.1093/jssam/smab052","url":null,"abstract":"<p><p>The application of serial principled sampling designs for diagnostic testing is often viewed as an ideal approach to monitoring prevalence and case counts of infectious or chronic diseases. Considering logistics and the need for timeliness and conservation of resources, surveillance efforts can generally benefit from creative designs and accompanying statistical methods to improve the precision of sampling-based estimates and reduce the size of the necessary sample. One option is to augment the analysis with available data from other surveillance streams that identify cases from the population of interest over the same timeframe, but may do so in a highly nonrepresentative manner. We consider monitoring a closed population (e.g., a long-term care facility, patient registry, or community), and encourage the use of capture-recapture methodology to produce an alternative case total estimate to the one obtained by principled sampling. With care in its implementation, even a relatively small simple or stratified random sample not only provides its own valid estimate, but provides the only fully defensible means of justifying a second estimate based on classical capture-recapture methods. We initially propose weighted averaging of the two estimators to achieve greater precision than can be obtained using either alone, and then show how a novel single capture-recapture estimator provides a unified and preferable alternative. We develop a variant on a Dirichlet-multinomial-based credible interval to accompany our hybrid design-based case count estimates, with a view toward improved coverage properties. 
Finally, we demonstrate the benefits of the approach through simulations designed to mimic an acute infectious disease daily monitoring program or an annual surveillance program to quantify new cases within a fixed patient registry.</p>","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9643167/pdf/smab052.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9785848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
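A minimal version of the weighted-averaging step mentioned above: combine a sampling-based case count estimate with a capture-recapture estimate using inverse-variance weights, which minimizes the variance of the combination when the two estimators are approximately independent. The estimates and variances below are made up for illustration.

```python
def inverse_variance_combine(estimates_and_vars):
    """Return (combined estimate, combined variance) from (est, var) pairs."""
    weights = [1.0 / v for _, v in estimates_and_vars]
    total = sum(weights)
    est = sum(w * e for w, (e, _) in zip(weights, estimates_and_vars)) / total
    return est, 1.0 / total

# Hypothetical: sampling-based estimate 480 (var 400); CRC estimate 520 (var 900).
combined, var = inverse_variance_combine([(480.0, 400.0), (520.0, 900.0)])
print(round(combined, 1), round(var, 1))
```

The combined variance is below the smaller of the two input variances, which is the precision gain the abstract's hybrid estimator is after.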
{"title":"Empirical Best Prediction of Small Area Means Based on a Unit-Level Gamma-Poisson Model","authors":"Emily J. Berg","doi":"10.1093/jssam/smac026","DOIUrl":"https://doi.org/10.1093/jssam/smac026","url":null,"abstract":"\u0000 Existing small area estimation procedures for count data have important limitations. For instance, an M-quantile-based method is known to be less efficient than model-based procedures if the assumptions of the model hold. Also, frequentist inference procedures for Poisson generalized linear mixed models can be computationally intensive or require approximations. Furthermore, area-level models are incapable of incorporating unit-level covariates. We overcome these limitations by developing a small area estimation procedure for a unit-level gamma-Poisson model. The conjugate form of the model permits computationally simple estimation and prediction procedures. We obtain a closed-form expression for the empirical best predictor of the mean as well as a closed-form mean square error estimator. We validate the procedure through simulations. We illustrate the proposed method using a subset of data from the Iowa Seat-Belt Use survey.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2022-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41515091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Classification for Open-Ended Questions with BERT","authors":"Hyukjun Gweon, Matthias Schonlau","doi":"10.1093/jssam/smad015","DOIUrl":"https://doi.org/10.1093/jssam/smad015","url":null,"abstract":"\u0000 Manual coding of text data from open-ended questions into different categories is time consuming and expensive. Automated coding uses statistical/machine learning to train on a small subset of manually-coded text answers. Recently, pretraining a general language model on vast amounts of unrelated data and then adapting the model to the specific application has proven effective in natural language processing. Using two data sets, we empirically investigate whether BERT, the currently dominant pretrained language model, is more effective at automated coding of answers to open-ended questions than other non-pretrained statistical learning approaches. We found fine-tuning the pretrained BERT parameters is essential as otherwise BERT is not competitive. Second, we found fine-tuned BERT barely beats the non-pretrained statistical learning approaches in terms of classification accuracy when trained on 100 manually coded observations. However, BERT’s relative advantage increases rapidly when more manually coded observations (e.g., 200–400) are available for training. 
We conclude that for automatically coding answers to open-ended questions BERT is preferable to non-pretrained models such as support vector machines and boosting.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2022-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46968013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}