Human and automatic voice comparison with regionally variable speech samples

Impact factor: 3.0 · CAS Zone 3 (Computer Science) · JCR Q2 (Acoustics)

Speech Communication · Pub Date: 2025-05-12 · DOI: 10.1016/j.specom.2025.103253

Vincent Hughes, Carmen Llamas, Thomas Kettig

Speech Communication, volume 172, Article 103253. Full text: https://www.sciencedirect.com/science/article/pii/S0167639325000688

Citations: 0

Abstract

Human and automatic voice comparison with regionally variable speech samples

In this paper, we compare and combine human and automatic voice comparison results based on short, regionally variable speech samples. Likelihood ratio-like scores were extracted for 120 pairs of same- (45) and different-speaker (75) samples from a total of 896 British English listeners. The samples contained the voices of speakers from Newcastle and Middlesbrough (in North-East England), as well as speakers of Standard Southern British English (modern RP). In addition to within-accent comparisons, the experiment included between-accent, different-speaker comparisons for Middlesbrough and Newcastle, which are perceptually and regionally proximate accents. Scores were also computed using an x-vector PLDA automatic speaker recognition (ASR) system. The ASR system (EER = 10.88%, C_llr = 0.48) outperformed the human listeners (EER = 23.55%, C_llr = 0.75) overall, and no improvement was found in the ASR output when it was fused with the listener scores. There was, unsurprisingly, considerable between-listener variability, with individual error rates varying from 0% to 100%. Performance also varied according to the regional accent of the speakers. Notably, the ASR system performed worst with the Newcastle samples, whereas the human listeners performed best with them. Human listeners were also more sensitive to high-salience between-accent comparisons, leading to almost categorical different-speaker conclusions, whereas the ASR system's performance with these samples was similar to its performance with within-accent comparisons.
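The two metrics reported in the abstract, equal error rate (EER) and log-likelihood-ratio cost (C_llr), can be illustrated with a minimal sketch. It follows the standard definitions of these metrics and is not the authors' evaluation code, which the abstract does not describe. The function names `cllr` and `eer` and all score values are hypothetical toy inputs, and the scores are assumed to be natural-log likelihood ratios.

```python
import math


def cllr(ss_llrs, ds_llrs):
    """Log-likelihood-ratio cost for natural-log LRs (assumed input scale).

    Same-speaker (SS) pairs are penalised for low LRs and
    different-speaker (DS) pairs for high LRs.
    """
    ss_cost = sum(math.log2(1 + math.exp(-x)) for x in ss_llrs) / len(ss_llrs)
    ds_cost = sum(math.log2(1 + math.exp(x)) for x in ds_llrs) / len(ds_llrs)
    return 0.5 * (ss_cost + ds_cost)


def eer(ss_scores, ds_scores):
    """Approximate equal error rate by sweeping every observed score as a threshold.

    A pair is accepted as same-speaker when its score is >= the threshold.
    The EER is taken where false rejections and false acceptances are closest.
    """
    thresholds = sorted(set(ss_scores) | set(ds_scores)) + [math.inf]
    best_gap, best_rate = math.inf, None
    for t in thresholds:
        frr = sum(s < t for s in ss_scores) / len(ss_scores)   # SS pairs rejected
        far = sum(s >= t for s in ds_scores) / len(ds_scores)  # DS pairs accepted
        gap = abs(frr - far)
        if gap < best_gap:
            best_gap, best_rate = gap, (frr + far) / 2
    return best_rate


# Toy scores only (not data from the study).
print(cllr([0.0, 0.0], [0.0, 0.0]))            # uninformative system -> 1.0
print(eer([2.0, 1.0, 3.0], [-1.0, -2.0, -3.0]))  # perfectly separated -> 0.0
```

A system whose log-LRs are all zero yields C_llr = 1, so values below 1, such as the 0.48 and 0.75 reported here, indicate that the scores carry useful information; lower values are better for both metrics.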

Source journal

Speech Communication (Engineering and Technology: Computer Science, Interdisciplinary Applications)

CiteScore: 6.80

Self-citation rate: 6.20%

Review time: 19.2 weeks

About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.