Automatic speech recognition and the transcription of indistinct forensic audio: how do the new generation of systems fare?

Impact factor: 1.5 (JCR Q2, Communication)

Frontiers in Communication. Pub date: 2024-02-14. DOI: 10.3389/fcomm.2024.1281407

Debbie Loakes

Citations: 0

Abstract

Automatic speech recognition and the transcription of indistinct forensic audio: how do the new generation of systems fare?

This study provides an update on an earlier study in the “Capturing Talk” research topic, which aimed to demonstrate how automatic speech recognition (ASR) systems work with indistinct forensic-like audio, in comparison with good-quality audio. Since that time, there has been rapid technological advancement, with newer systems having access to extremely large language models and having their performance proclaimed as being human-like in accuracy. This study compares various ASR systems, including OpenAI’s Whisper, to continue to test how well automatic speech recognition works with forensic-like audio. The results show that the transcription of a good-quality audio file is at ceiling for some systems, with no errors. For the poor-quality (forensic-like) audio, Whisper was the best-performing system but transcribed only 50% of the speech material correctly. The results for the poor-quality audio were also generally variable across the systems, with differences depending on whether a .wav or .mp3 file was used and differences between earlier and later versions of the same system. Additionally, and against expectations, Whisper showed a drop in performance over a 2-month period. While more material was transcribed in the later attempt, more was also incorrect. This study concludes that forensic-like audio is not suitable for automatic analysis.

Source journal

Frontiers in Communication (Communication)

CiteScore: 3.30

Self-citation rate: 8.30%

Articles published: 284

Review time: 14 weeks