Automatic speech recognition and the transcription of indistinct forensic audio: how do the new generation of systems fare?

Impact factor: 1.5 · JCR quartile: Q2 (Communication)
Debbie Loakes
Journal: Frontiers in Communication
Published: 2024-02-14
DOI: 10.3389/fcomm.2024.1281407 (https://doi.org/10.3389/fcomm.2024.1281407)
Citations: 0

Abstract

This study updates an earlier study in the “Capturing Talk” research topic, which aimed to demonstrate how automatic speech recognition (ASR) systems perform on indistinct forensic-like audio compared with good-quality audio. Since then, the technology has advanced rapidly: newer systems have access to extremely large language models, and their performance has been proclaimed as human-like in accuracy. This study compares various ASR systems, including OpenAI’s Whisper, to continue testing how well automatic speech recognition works with forensic-like audio. The results show that transcription of a good-quality audio file is at ceiling for some systems, with no errors. For the poor-quality (forensic-like) audio, Whisper was the best-performing system but transcribed only 50% of the entire speech material correctly. Results for the poor-quality audio also varied across systems, with differences depending on whether a .wav or .mp3 file was used, and between earlier and later versions of the same system. Additionally, and against expectations, Whisper showed a drop in performance over a 2-month period: more material was transcribed in the later attempt, but more of it was also incorrect. This study concludes that forensic-like audio is not suitable for automatic analysis.
Source journal: CiteScore 3.30; self-citation rate 8.30%; articles published 284; review time 14 weeks