Automated speech recognition bias in personnel selection: The case of automatically scored job interviews
Louis Hickman, Markus Langer, Rachel M Saef, Louis Tay
Journal of Applied Psychology, published online October 31, 2024. https://doi.org/10.1037/apl0001247
Abstract
Organizations, researchers, and software increasingly use automatic speech recognition (ASR) to transcribe speech to text. However, ASR can be less accurate for (i.e., biased against) certain demographic subgroups. This is concerning, given that the machine-learning (ML) models used to automatically score video interviews use ASR transcriptions of interviewee responses as inputs. To address these concerns, we investigate the extent of ASR bias and its effects in automatically scored interviews. Specifically, we compare the accuracy of ASR transcription for English as a second language (ESL) versus non-ESL interviewees, people of color (and Black interviewees separately) versus White interviewees, and male versus female interviewees. Then, we test whether ASR bias causes bias in ML model scores, both in terms of differential convergent correlations (i.e., subgroup differences in correlations between observed and ML scores) and differential means (i.e., shifts in subgroup differences from observed to ML scores). To do so, we apply one human and four ASR transcription methods to two samples of mock video interviews (Ns = 1,014 and 414), and then we train and test models using these different transcripts to score multiple constructs. We observed significant bias in the commercial ASR services across nearly all comparisons, with the magnitude of bias differing across the ASR services. However, the transcription bias did not translate into meaningful measurement bias for the ML interview scores, whether in terms of differential convergent correlations or means. We discuss what these results mean for the nature of bias, fairness, and validity of ML models for scoring verbal open-ended responses. (PsycInfo Database Record (c) 2024 APA, all rights reserved).
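For readers unfamiliar with the two forms of measurement bias the abstract names, the sketch below illustrates, on simulated data, the general shape of such checks: a subgroup comparison of ASR error rates, a Fisher z test for differential convergent correlations, and a comparison of subgroup standardized mean differences between observed and ML scores. This is a minimal sketch on hypothetical data, not the authors' analysis code; all variable names, distributions, and group labels are assumptions for illustration.

```python
# Hedged sketch of the bias checks described in the abstract, on simulated
# data (NOT the authors' code or data). Assumes per-interviewee: an ASR word
# error rate (WER), a human "observed" score, an ML score, a subgroup label.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 400
group = rng.choice(["ESL", "non-ESL"], size=n)       # hypothetical labels
observed = rng.normal(0, 1, n)                       # simulated human ratings
ml = 0.7 * observed + rng.normal(0, 0.7, n)          # simulated ML scores
wer = np.where(group == "ESL",                       # simulated ASR bias:
               rng.normal(0.25, 0.05, n),            # higher WER for ESL
               rng.normal(0.15, 0.05, n))

def fisher_z_test(r1, n1, r2, n2):
    """Two-sample test for a difference between independent correlations."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return z, 2 * stats.norm.sf(abs(z))

def cohens_d(x, y):
    """Standardized mean difference with pooled standard deviation."""
    sp = np.sqrt(((len(x) - 1) * x.var(ddof=1) + (len(y) - 1) * y.var(ddof=1))
                 / (len(x) + len(y) - 2))
    return (x.mean() - y.mean()) / sp

a, b = group == "ESL", group == "non-ESL"

# 1) ASR transcription bias: subgroup difference in mean word error rate.
t, p = stats.ttest_ind(wer[a], wer[b])
print(f"WER: ESL={wer[a].mean():.3f}, non-ESL={wer[b].mean():.3f}, p={p:.4f}")

# 2a) Differential convergent correlations: does r(observed, ML) differ?
r1, _ = stats.pearsonr(observed[a], ml[a])
r2, _ = stats.pearsonr(observed[b], ml[b])
z, p = fisher_z_test(r1, a.sum(), r2, b.sum())
print(f"r(obs, ML): ESL={r1:.3f}, non-ESL={r2:.3f}, p={p:.4f}")

# 2b) Differential means: does the subgroup difference (Cohen's d) shift
# from observed scores to ML scores?
print(f"d: observed={cohens_d(observed[a], observed[b]):.3f}, "
      f"ML={cohens_d(ml[a], ml[b]):.3f}")
```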
Journal Introduction
The Journal of Applied Psychology® focuses on publishing original investigations that contribute new knowledge and understanding to fields of applied psychology (excluding clinical and applied experimental or human factors, which are better suited for other APA journals). The journal primarily considers empirical and theoretical investigations that enhance understanding of cognitive, motivational, affective, and behavioral psychological phenomena in work and organizational settings. These phenomena can occur at individual, group, organizational, or cultural levels, and in various work settings such as business, education, training, health, service, government, or military institutions. The journal welcomes submissions from both public and private sector organizations, for-profit or nonprofit. It publishes several types of articles, including:
1. Rigorously conducted empirical investigations that expand conceptual understanding (original investigations or meta-analyses).
2. Theory development articles and integrative conceptual reviews that synthesize literature and generate new theories on psychological phenomena to stimulate novel research.
3. Rigorously conducted qualitative research on phenomena that are challenging to capture with quantitative methods or require inductive theory building.