Prosodic cues strengthen human-AI voice boundaries: Listeners do not easily perceive human speakers and AI clones as the same person

Authors: Wenjun Chen, Marc D. Pell, Xiaoming Jiang
Journal: Computers in Human Behavior: Artificial Humans, Volume 7, Article 100261
DOI: 10.1016/j.chbah.2026.100261
Published: 2026-03-01 (Epub 2026-02-07)
Citations: 0
Abstract
Previous studies concluded that listeners struggle to discriminate AI from human voices, but those studies used monotone-like speech and did not examine prosodic expressiveness, a key advantage of human over AI speakers. This study explores whether prosodic expressiveness facilitates human-AI voice discrimination. We recorded human prosodic speech with confident and doubtful expressions, trained AI models to replicate these prosodic patterns, had the AI models generate new sentences, and then had human speakers produce equivalent prosodic expressions for the same sentences. In Experiment 1, 48 listeners rated humanlikeness and perceived confidence for 11,808 audio samples; AI speech was consistently rated as less humanlike regardless of prosody. We then selected 768 audio samples (AI vs. human speaker × confident vs. doubtful prosody) for Experiment 2, in which 80 listeners completed an identity discrimination task, judging whether two utterances came from the same speaker. Bayesian modeling revealed near-ceiling performance for human–human and AI–AI pairs, with inconsistent prosodies decreasing accuracy by ∼7%, whereas listeners rarely categorized an AI voice and a human voice as sharing the same identity (∼54% accuracy when prosody matched, dropping to ∼36% when it was inconsistent). Accuracy and reaction time were synchronized; in human–AI/AI–human pairs only, however, listeners relied less on distance cues once the distance between the two voices' identities exceeded a certain threshold. Overall, listeners perceived AI speech as lower in humanlikeness, and prosodic variation further promoted rejecting AI and human voices as sharing the same identity, indicating that human acceptance of AI voices as equivalent to human voices is limited.