Keep the noise down: On the performance of automatic speech recognition of voice-recordings in web surveys
Katharina Meitinger, Sabien van der Sluis, Matthias Schonlau
Survey Practice, published 2024-02-29
DOI: https://doi.org/10.29115/sp-2023-0022
Abstract
Voice-recordings are increasingly implemented in web surveys, but the resulting audio data need to be transcribed before analysis. Since manual coding is too time- and work-intensive, researchers often rely on automatic speech recognition (ASR) systems for the transcription of the voice-recordings. However, ASR tools might create partly incorrect transcriptions and potentially change the content of responses. If the ASR performance (i.e., accuracy and validity) differs by subgroup and contextual factors, a bias is introduced in the analysis of open-ended questions. We assessed the impact of sociodemographic and contextual factors on the accuracy and validity of ASR transcriptions with data from the Longitudinal Internet Studies for the Social Sciences (LISS) panel collected in December 2020. We find that background noise reduces the accuracy and validity of ASR transcriptions. In addition, validity improved when the respondent was alone during the survey. Fortunately, we did not find any evidence of systematic differences across subgroups (age, sex, education), devices or respondent location.
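The abstract does not say how transcription accuracy was scored. As an illustration only, the sketch below computes word error rate (WER), a widely used ASR accuracy measure, by comparing an ASR transcript against a human reference transcript. The choice of WER, the function name `word_error_rate`, and the example sentences are assumptions made for this sketch and are not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch: WER is assumed here as the accuracy measure;
    the abstract does not state which metric the authors used.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion (word missed by ASR)
                curr[j - 1] + 1,                  # insertion (extra word from ASR)
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, not data from the LISS panel.
human_transcript = "the room was very noisy"
asr_transcript = "the room was noisy"
print(word_error_rate(human_transcript, asr_transcript))
```

A lower value means the ASR output is closer to the human transcript. Note that WER can exceed 1 when the ASR output contains many inserted words, and that it measures surface accuracy only, not whether the content of a response was preserved (the validity question the abstract raises).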