Prosodic cues strengthen human-AI voice boundaries: Listeners do not easily perceive human speakers and AI clones as the same person

Authors: Wenjun Chen, Marc D. Pell, Xiaoming Jiang
Journal: Computers in Human Behavior: Artificial Humans, Volume 7, Article 100261
DOI: 10.1016/j.chbah.2026.100261
Published: 2026-03-01 (Epub 2026-02-07)
Citations: 0
Abstract
Previous studies concluded that listeners struggle to discriminate AI from human voices, but those studies used monotone-like speech and did not examine prosodic expressiveness, a key advantage of human over AI speakers. This study explores whether prosodic expressiveness facilitates human-AI voice discrimination. We recorded human prosodic speech with confident and doubtful expressions, trained AI models to replicate these prosodic patterns, had the AI models generate new sentences, and then had human speakers produce equivalent prosodic expressions for the same sentences. In Experiment 1, 48 listeners rated humanlikeness and perceived confidence for 11,808 audio samples; AI speech was consistently rated as less humanlike regardless of prosody. We then selected 768 audio samples (AI vs. human speaker × confident vs. doubtful prosody) for Experiment 2, in which 80 listeners completed an identity discrimination task, judging whether two utterances came from the same speaker. Bayesian modeling revealed near-ceiling performance for human–human and AI–AI pairs, with inconsistent prosodies decreasing accuracy by ∼7%, whereas listeners rarely categorized an AI voice and a human voice as sharing the same identity (∼54% accuracy when prosody matched, dropping to ∼36% when it was inconsistent). Accuracy and reaction time were synchronized; in human–AI/AI–human pairs only, however, listeners relied less on distance cues once the distance between the two voices' identities exceeded a certain threshold. Overall, listeners perceived AI speech as lower in humanlikeness, and prosodic variation further promoted rejecting AI and human voices as sharing the same identity, indicating that human acceptance of AI voices as equivalent to human voices is limited.