{"title":"A Primer on the Data Cleaning Pipeline","authors":"Rebecca C Steorts","doi":"10.1093/jssam/smad017","DOIUrl":"https://doi.org/10.1093/jssam/smad017","url":null,"abstract":"Abstract The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, among others, has grown rapidly over the past decade. With this expansion, the statistical and methodological questions around data integration, or rather merging multiple data sources, have also grown. Specifically, the science of the “data cleaning pipeline” contains four stages that allow an analyst to perform downstream tasks, predictive analyses, or statistical analyses on “cleaned data.” This article provides a review of this emerging field, introducing technical terminology and commonly used methods.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135194364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to: Improving Statistical Matching when Auxiliary Information is Available","authors":"","doi":"10.1093/jssam/smad023","DOIUrl":"https://doi.org/10.1093/jssam/smad023","url":null,"abstract":"","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135540950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Is there a Day of the Week Effect on Panel Response Rate to an Online Questionnaire Email Invitation?","authors":"Chloe Howard, Lara M. Greaves, D. Osborne, C. Sibley","doi":"10.1093/jssam/smad014","DOIUrl":"https://doi.org/10.1093/jssam/smad014","url":null,"abstract":"\u0000 Does the day of the week an email is sent inviting existing participants to complete a follow-up questionnaire for an annual online survey impact response rate? We answer this question using a preregistered experiment conducted as part of an ongoing national probability panel study in New Zealand. Across 14 consecutive days, existing participants in a panel study were randomly allocated a day of the week to receive an email inviting them to complete the next wave of the questionnaire online (N = 26,126). Valid responses included questionnaires completed within 31 days of receiving the initial invitation. Results revealed that the day the invitation was sent did not affect the likelihood of responding. These results are reassuring for researchers conducting ongoing panel studies and suggest that, once participants have joined a panel, the day of the week they are contacted does not impact their likelihood of responding to subsequent waves.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2023-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46873457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interviewer Involvement in Respondent Selection Moderates the Relationship between Response Rates and Sample Bias in Cross-National Survey Projects in Europe","authors":"M. Kołczyńska, P. Jabkowski, S. Eckman","doi":"10.1093/jssam/smad013","DOIUrl":"https://doi.org/10.1093/jssam/smad013","url":null,"abstract":"\u0000 Survey researchers and practitioners often assume that higher response rates are associated with a higher quality of survey data. However, the evidence for this claim in face-to-face surveys is mixed. To explain these mixed results, recent studies have proposed that interviewers’ involvement in respondent selection moderates the effect of response rates on data quality. Previous analyses based on data from the European Social Survey found that response rates are positively associated with data quality when interviewer involvement in respondent selection is minimal. However, the association between response rates and data quality is negative when interviewers are more involved in respondent selection through household frame creation or within-household selection of target persons. These studies have hypothesized that some interviewers deviate from prescribed selection procedures to select individuals with higher response propensities, which increase response rates while reducing data quality. We replicate these results with an extended dataset, including more recent European Social Survey rounds and three other European survey projects: the European Quality of Life Survey, European Values Study, and International Social Survey Programme. Based on our results, we recommend that surveys include procedures to verify respondent-selection practices into their fieldwork control procedures.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2023-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46265346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimation of Covid-19 Prevalence Dynamics from Pooled Data","authors":"Braden Scherting, A. Peel, R. Plowright, A. Hoegh","doi":"10.1093/jssam/smad011","DOIUrl":"https://doi.org/10.1093/jssam/smad011","url":null,"abstract":"\u0000 Estimating the prevalence of a disease, such as COVID-19, is necessary for evaluating and mitigating risks of its transmission. Estimates that consider how prevalence changes with time provide more information about these risks but are difficult to obtain due to the necessary survey intensity and commensurate testing costs. Motivated by a dataset on COVID-19, from the University of Notre Dame, we propose pooling and jointly testing multiple samples to reduce testing costs. A nonparametric, hierarchical Bayesian model is used to infer population prevalence from the pooled test results without needing to retest individuals from pools that test positive. This approach is shown to reduce uncertainty compared to individual testing at the same budget and to produce similar estimates compared to individual testing at a much higher budget through simulation studies and an analysis of COVID-19 infections at Notre Dame.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2023-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42079968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experimenting with QR Codes and Envelope Size in Push-to-Web Surveys","authors":"Kyle Endres, E. Heiden, Ki H. Park, M. Losch, K. Harland, Anne L Abbott","doi":"10.1093/jssam/smad008","DOIUrl":"https://doi.org/10.1093/jssam/smad008","url":null,"abstract":"\u0000 Survey researchers are continually evaluating approaches to increase response rates, especially those that can be implemented with little or no costs. In this study, we experimentally evaluated whether or not including Quick Response (QR) codes in mailed recruitment materials for self-administered web surveys increased web survey participation. We also assessed whether mailing these materials in a non-standard envelope size (6 × 9 inch) yielded a higher response rate than invitations mailed in a standard, #10 envelope (4.125 × 9.5 inch). These experiments were embedded in a sequential mixed-mode (dual-frame phone and web) statewide survey. Including a QR code (in addition to a URL) significantly increased the response rate compared to invitations that only included a URL in our study. As expected, a consequence of including the QR code was an elevated number of completions on smartphones or tablets among households randomly assigned to the QR code condition. The use of a larger (6 × 9 inch) envelope did not affect the overall response rate but did significantly boost the response rate for the landline sample (envelopes addressed to “STATE resident”) while having little effect for the wireless sample (envelopes addressed by name). This study suggests that incorporating both QR codes and larger (6 × 9 inch) envelopes in mail recruitment materials for web surveys is a cost-effective approach to increase web participation.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2023-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44973615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preferred Reporting Items for Complex Sample Survey Analysis (PRICSSA)","authors":"A. Seidenberg, R. Moser, B. West","doi":"10.1093/jssam/smac040","DOIUrl":"https://doi.org/10.1093/jssam/smac040","url":null,"abstract":"\u0000 Methodological issues pertaining to transparency and analytic error have been widely documented for publications featuring analysis of complex sample survey data. The availability of numerous public use datasets to researchers without adequate training in using these data likely contributes to these problems. In an effort to introduce standards for reporting analyses of survey data and promote replication, we propose the Preferred Reporting Items for Complex Sample Survey Analysis (PRICSSA), an itemized checklist to guide researchers publishing analyses using complex sample survey data. PRICSSA is modeled after other checklists (e.g., PRISMA, CONSORT) that have been widely adopted for other research designs. The PRICSSA items include a variety of survey characteristics, such as data collection dates, mode(s), response rate, and sample selection process. In addition, essential analytic information—such as sample sizes for all estimates, missing data rates and imputation methods (if applicable), disclosing if any data were deleted, specifying what survey weight and sample design variables were used along with method of variance estimation, and reporting design-adjusted standard errors/confidence intervals for all estimates—are also included. PRICSSA also recommends that authors make all corresponding software code available. Widespread adoption of PRICSSA will help improve the quality of secondary analyses of complex sample survey data through transparency and promote scientific rigor and reproducibility.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2023-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42455728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recent Advances in Data Integration","authors":"J. Sakshaug, R. Steorts","doi":"10.1093/jssam/smad009","DOIUrl":"https://doi.org/10.1093/jssam/smad009","url":null,"abstract":"\u0000 The availability of both survey and non-survey data sources, such as administrative data, social media data, and digital trace data, has grown rapidly over the past decade. With this expansion in data, the statistical, methodological, computational, and ethical challenges around integrating multiple data sources have also grown. This special issue addresses these challenges by highlighting recent innovations and applications in data integration and related topics.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2023-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48178429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Panel Conditioning in A Probability-based Longitudinal study: A Comparison of Respondents with Different Levels of Survey Experience","authors":"Fabienne Kraemer, Henning Silber, Bella Struminskaya, Matthias Sand, Michael Bosnjak, Joanna Koßmann, Bernd Weiß","doi":"10.1093/jssam/smad004","DOIUrl":"https://doi.org/10.1093/jssam/smad004","url":null,"abstract":"Abstract Learning effects due to repeated interviewing, also known as panel conditioning, are a major threat to response quality in later waves of a panel study. To date, research has not provided a clear picture regarding the circumstances, mechanisms, and dimensions of potential panel conditioning effects. In particular, the effects of conditioning frequency, that is, different levels of experience within a panel, on response quality are underexplored. Against this background, we investigated the effects of panel conditioning by using data from the GESIS Panel, a German mixed-mode probability-based panel study. Using two refreshment samples, we compared three panel cohorts with differing levels of experience on several response quality indicators related to the mechanisms of reflection, satisficing, and social desirability. Overall, we find evidence for both negative (i.e., disadvantageous for response quality) and positive (i.e., advantageous for response quality) panel conditioning. Highly experienced respondents were more likely to satisfice by speeding through the questionnaire. They also had a higher probability of refusing to answer sensitive questions than less experienced panel members. However, more experienced respondents were also more likely to optimize the response process by needing less time compared to panelists with lower experience levels (when controlling for speeding). In contrast, we did not find significant differences with respect to the number of “don’t know” responses, nondifferentiation, the selection of first response categories and mid-responses, and the number of nontriggered filter questions. Of the observed differences, speeding showed the highest magnitude with an average increase of 6.0 percentage points for highly experienced panel members compared to low experienced panelists.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135189112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Data Collection Interventions to Balance Cost and Quality in a Sequential Multimode Survey","authors":"Stephanie M Coffey, Michael R Elliott","doi":"10.1093/jssam/smad007","DOIUrl":"https://doi.org/10.1093/jssam/smad007","url":null,"abstract":"Abstract High-quality survey data collection is getting more expensive to conduct because of decreasing response rates and rising data collection costs. Responsive and adaptive designs have emerged as a framework for targeting and reallocating resources during the data collection period to improve survey data collection efficiency. Here, we report on the implementation and evaluation of a responsive design experiment in the National Survey of College Graduates that optimizes the cost-quality tradeoff by minimizing a function of data collection costs and the root mean squared error of a key survey measure, self-reported salary. We used a Bayesian framework to incorporate prior information and generate predictions of estimated response propensity, self-reported salary, and data collection costs for use in our optimization rule. At three points during the data collection process, we implement the optimization rule and identify cases for which reduced effort would have minimal effect on the mean squared error (RMSE) of mean self-reported salary while allowing us to reduce data collection costs. We find that this optimization process allowed us to reduce data collection costs by nearly 10 percent, without a statistically or practically significant increase in the RMSE of mean salary or a decrease in the unweighted response rate. This experiment demonstrates the potential for these types of designs to more effectively target data collection resources to reach survey quality goals.","PeriodicalId":17146,"journal":{"name":"Journal of Survey Statistics and Methodology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135648136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}