BERSting at the screams: A benchmark for distanced, emotional and shouted speech recognition

IF 3.4 · Zone 3, Computer Science · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2025-05-16 DOI:10.1016/j.csl.2025.101815

Paige Tuttösí, Mantaj Dhillon, Luna Sang, Shane Eastwood, Poorvi Bhatia, Quang Minh Dinh, Avni Kapoor, Yewon Jin, Angelica Lim

Volume 95, Article 101815. Full text: https://www.sciencedirect.com/science/article/pii/S0885230825000403

Citations: 0

Abstract

Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance in many reported metrics. Yet they continue to struggle in complex, real-world situations, such as with distanced speech. Previous challenges have released datasets to address the issue of distanced ASR; however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 h of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors' homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were placed in 19 different positions, including behind obstructions and in a different room than the actor. This data is publicly available and can be used to evaluate a variety of speech recognition tasks, including ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades with increases in both distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks and that continued work is needed to improve the robustness of such systems for more accurate real-world use.
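As an illustration of how ASR degradation across recording conditions might be measured, here is a minimal sketch that computes word error rate (WER) per condition. The abstract does not name the metric used, so WER is an assumption here. The record fields (`reference`, `hypothesis`, `distance`) and the sample transcripts are hypothetical and are not taken from BERSt.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_condition(samples, key):
    """Corpus-level WER for each value of `key` (hypothetical field name)."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for s in samples:
        ref = s["reference"].lower().split()
        hyp = s["hypothesis"].lower().split()
        errors[s[key]] += edit_distance(ref, hyp)
        words[s[key]] += len(ref)
    # Skip conditions with no reference words to avoid division by zero.
    return {k: errors[k] / words[k] for k in errors if words[k] > 0}


# Made-up records purely for illustration; not real BERSt data.
samples = [
    {"reference": "open the door", "hypothesis": "open the door", "distance": "near"},
    {"reference": "call me later", "hypothesis": "call later", "distance": "near"},
    {"reference": "turn off the lights", "hypothesis": "turn of lights", "distance": "far"},
]

result = wer_by_condition(samples, "distance")
for condition, wer in sorted(result.items()):
    print(f"{condition}: WER = {wer:.3f}")
```

The same function could be grouped by a shout-level or emotion field instead of distance, assuming the metadata exposes such labels. Pooling errors and reference words per condition before dividing keeps long utterances from being under-weighted relative to short ones.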




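Source journal

Computer Speech and Language (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Articles published:

Review time: 22.9 weeks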

Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.