Semi-Automated Nonresponse Detection for Open-Text Survey Data

IF 2.7 | Zone 2, Sociology | Q2, Computer Science, Interdisciplinary Applications

Social Science Computer Review | Pub Date: 2024-05-10 | DOI: 10.1177/08944393241249720

Kristen Cibelli Hibben, Zachary Smith, Benjamin Rogers, Valerie Ryan, Paul Scanlon, Travis Hoppe

Citations: 0

Abstract

Open-ended survey questions can enable researchers to gain insights beyond more commonly used closed-ended question formats by allowing respondents an opportunity to provide information with few constraints and in their own words. Open-ended web probes are also increasingly used to inform the design and evaluation of survey questions. However, open-ended questions are more susceptible to insufficient or irrelevant responses that can be burdensome and time-consuming to identify and remove manually, often resulting in underuse of open-ended questions and, when used, potential inclusion of poor-quality data. To address these challenges, we developed and publicly released the Semi-Automated Nonresponse Detection for Survey text (SANDS), an item nonresponse detection approach based on a Bidirectional Transformer for Language Understanding model, fine-tuned using Simple Contrastive Sentence Embedding and targeted human coding, to categorize open-ended text data as valid or likely nonresponse. This approach is powerful in that it uses natural language processing as opposed to existing nonresponse detection approaches that have relied exclusively on rules or regular expressions or used bag-of-words approaches that tend to perform less well on short pieces of text, typos, or uncommon words, often prevalent in open-text survey data. This paper presents the development of SANDS and a quantitative evaluation of its performance and potential bias using open-text responses from a series of web probes as case studies. Overall, the SANDS model performed well in identifying a dataset of likely valid results to be used for quantitative or qualitative analysis, particularly on health-related data. Developed for generalizable use and accessible to others, the SANDS model can greatly improve the efficiency of identifying inadequate and irrelevant open-text responses, offering expanded opportunities for the use of open-text data to inform question design and improve survey data quality.
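For contrast with the model-based approach the abstract describes, the following is a minimal sketch of the kind of rule-based (regular-expression) nonresponse filter it mentions as an existing alternative. This is not SANDS and not the authors' code: the pattern list and the function name `is_likely_nonresponse` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical patterns for illustration only; not taken from SANDS or the paper.
NONRESPONSE_PATTERNS = [
    r"^\s*$",                                 # empty or whitespace-only
    r"^(n/?a|none|nothing|idk|no comment)$",  # stock non-answers
    r"^(i )?(don'?t|do not) know$",           # explicit "don't know"
    r"^[^a-z0-9]+$",                          # punctuation or symbols only
    r"^(.)\1{3,}$",                           # one character repeated, e.g. "aaaa"
]
_COMPILED = [re.compile(p) for p in NONRESPONSE_PATTERNS]


def is_likely_nonresponse(text: str) -> bool:
    """Flag a response when its normalized form matches any rule."""
    # Normalize: trim whitespace, lowercase, drop trailing sentence punctuation.
    normalized = text.strip().lower().rstrip(".!?")
    return any(p.search(normalized) for p in _COMPILED)


if __name__ == "__main__":
    samples = ["N/A", "I don't know.", "dont kno", "asdfgh",
               "I walk to work every day"]
    for s in samples:
        print(repr(s), is_likely_nonresponse(s))
```

In this sketch, a misspelled "dont kno" and a keyboard mash such as "asdfgh" are not flagged, because no rule anticipates them. Gaps of this kind, on short text, typos, and uncommon words, are what the abstract cites as the weakness of rule-based and bag-of-words approaches.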

Source journal

Social Science Computer Review (Social Sciences; Computer Science, Interdisciplinary Applications)

CiteScore

9.00

Self-citation rate

4.90%

Articles published

Review time

>12 weeks

About the journal:

Unique Scope: Social Science Computer Review is an interdisciplinary journal covering social science instructional and research applications of computing, as well as societal impacts of information technology. Topics include: artificial intelligence, business, computational social science theory, computer-assisted survey research, computer-based qualitative analysis, computer simulation, economic modeling, electronic modeling, electronic publishing, geographic information systems, instrumentation and research tools, public administration, social impacts of computing and telecommunications, software evaluation, world-wide web resources for social scientists.

Interdisciplinary Nature: Because the uses and impacts of computing are interdisciplinary, so is Social Science Computer Review. The journal is of direct relevance to scholars and scientists in a wide variety of disciplines. In its pages you'll find work in the following areas: sociology, anthropology, political science, economics, psychology, computer literacy, computer applications, and methodology.