Computer Speech and Language: Latest Articles

Exploring knowledge distillation for low-resource multi-modal streaming ASR in the CHiME-8 MMCSG challenge
IF 3.1, CAS Zone 3, Computer Science
Computer Speech and Language, vol. 95, Article 101837. Pub Date: 2025-06-06. DOI: 10.1016/j.csl.2025.101837
Hongbo Lan, Ya Jiang, Jun Du, Qing Wang
Abstract: In the CHiME-8 Multi-modal Conversational Speech Recognition for Smart Glasses (MMCSG) challenge, participants were tasked with achieving real-time transcription of two-person conversations recorded with smart glasses. To address the scarcity of real-world data, we propose a knowledge distillation framework where a non-streaming teacher model, trained on augmented multi-channel audio, guides a streaming student model. Leveraging simulated data with varying overlap rates, the framework employs a logit-based Kullback–Leibler divergence loss alongside mean square error losses on hidden states and attention maps of Fast-Conformer layers to transfer knowledge from the teacher to the student, significantly improving the performance of the audio-only streaming automatic speech recognition (ASR) model. Furthermore, we exploit the synergy and complementarity of inertial measurement unit and audio data by developing a novel multi-modal streaming ASR model. Meanwhile, cross-modal distillation is performed by adopting the non-streaming audio-only teacher to guide the streaming multi-modal student. Experimental results demonstrate that our proposed multi-modal fusion and teacher-student learning framework effectively enhance the performance of streaming ASR models. Notably, our approach secured the first place in the sub-track of the CHiME-8 MMCSG challenge.
Citations: 0
Universal constituency treebanking and parsing: A pilot study
IF 3.1, CAS Zone 3, Computer Science
Computer Speech and Language, vol. 95, Article 101826. Pub Date: 2025-06-06. DOI: 10.1016/j.csl.2025.101826
Jianling Li, Meishan Zhang, Jianrong Wang, Min Zhang, Yue Zhang
Abstract: Universal language processing is crucial for developing models that work across multiple languages. However, universal constituency parsing has lagged due to the lack of annotated universal constituency (UC) treebanks. To address this, we propose two cost-effective approaches. First, we unify existing annotated language-specific treebanks using phrase label mapping to create UC trees, but this is limited to only a handful of languages. Second, we develop a novel method to convert Universal Dependency (UD) treebanks into UC treebanks using large language models (LLMs) with syntactic knowledge, enabling the construction of UC treebanks for over 150 languages. We adopt the graph-based max margin model as our baseline and introduce a language adapter to fine-tune the universal parser. Our experiments show that the language adapter maintains performance for high-resource languages and improves performance for low-resource languages. We evaluate different scales of multilingual pre-trained models, confirming the effectiveness and robustness of our approach. In summary, we conduct the first pilot study on universal constituency parsing, introducing novel methods for creating and utilizing UC treebanks, thereby advancing treebanking and parsing methodologies.
Citations: 0
Microphone array geometry-independent multi-talker distant ASR: NTT system for DASR task of the CHiME-8 challenge
IF 3.1, CAS Zone 3, Computer Science
Computer Speech and Language, vol. 95, Article 101820. Pub Date: 2025-06-04. DOI: 10.1016/j.csl.2025.101820
Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki
Abstract: In this paper, we introduce a multi-talker distant automatic speech recognition (DASR) system we designed for the DASR task (Task 1) of the CHiME-8 challenge. Our system performs speaker counting, diarization, and ASR. It handles a variety of recording conditions, from dinner parties to professional meetings and from two speakers to eight. We perform diarization first, followed by speech enhancement, and then ASR, following the challenge baseline. However, we introduced several key refinements. First, we derived a powerful speaker diarization system relying on end-to-end speaker diarization with vector clustering (EEND-VC), multi-channel speaker counting using enhanced embeddings from EEND-VC, and target-speaker voice activity detection (TS-VAD). For speech enhancement, we introduced a novel microphone selection rule to better select the most relevant microphones among the distributed microphones and investigated improvements to beamforming. Finally, for ASR, we developed several models exploiting the Whisper and WavLM speech foundation models. In this paper, we present the original results we submitted to the challenge and updated results we obtained afterward. Our strongest system achieves a 63% relative macro tcpWER improvement over the baseline and outperforms the best challenge results on the NOTSOFAR-1 meeting evaluation data among geometry-independent systems.
Citations: 0
ASVspoof 5: Design, collection and validation of resources for spoofing, deepfake, and adversarial attack detection using crowdsourced speech
IF 3.1, CAS Zone 3, Computer Science
Computer Speech and Language, vol. 95, Article 101825. Pub Date: 2025-05-28. DOI: 10.1016/j.csl.2025.101825
Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi, Myeonghun Jeong, Ge Zhu, Yongyi Zang, You Zhang, Soumi Maiti, Florian Lux, Nicolas Müller, Vishwanath Singh
Abstract: ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake attacks as well as the design of detection solutions. We introduce the ASVspoof 5 database, which is generated in a crowdsourced fashion from data collected in diverse acoustic conditions (cf. studio-quality data for earlier ASVspoof databases) and from ~2000 speakers (cf. ~100 earlier). The database contains attacks generated with 32 different algorithms, also crowdsourced, and optimised to varying degrees using new surrogate detection models. Among them are attacks generated with a mix of legacy and contemporary text-to-speech synthesis and voice conversion models, in addition to adversarial attacks which are incorporated for the first time. ASVspoof 5 protocols comprise seven speaker-disjoint partitions. They include two distinct partitions for the training of different sets of attack models, two more for the development and evaluation of surrogate detection models, and then three additional partitions which comprise the ASVspoof 5 training, development and evaluation sets. An auxiliary set of data collected from an additional 30k speakers can also be used to train speaker encoders for the implementation of attack algorithms. Also described herein is an experimental validation of the new ASVspoof 5 database using a set of automatic speaker verification and spoof/deepfake baseline detectors. With the exception of protocols and tools for the generation of spoofed/deepfake speech, the resources described in this paper, already used by participants of the ASVspoof 5 challenge in 2024, are now all freely available to the community.
Citations: 0
Aspect-level sentiment analysis based on graph convolutional networks and interactive aggregate attention
IF 3.1, CAS Zone 3, Computer Science
Computer Speech and Language, vol. 95, Article 101819. Pub Date: 2025-05-26. DOI: 10.1016/j.csl.2025.101819
Yuxin Wu, Guofeng Deng
Abstract: Sentiment analysis has always been an important task in the artificial intelligence field, and aspect-level sentiment analysis involves fine-grained sentiment analysis. Recently, graph convolutional networks (GCNs) built on sentence dependency trees have been widely used in aspect-level sentiment analysis. Because GCNs have good aggregation effects, they can efficiently aggregate the information of neighboring nodes. However, many previous studies concerning graph neural networks only focused on the information between nodes and did not effectively explore the connections between aspects and sentences or highlight the parts with high aspect relevance. To address this problem, we propose a new GCN. When constructing a dependency tree-based graph, affective information and position index information are added to each node to enhance the graph. In addition, we use an interactive aggregate attention mechanism, which utilizes the aggregated information related to the connections between aspects and sentences from the GCN to highlight the important parts so that the model can fully learn the relationships between aspects and sentences. Finally, we validate our model on four public benchmark datasets and attain improvements over the state-of-the-art methods.
Citations: 0
Towards privacy-preserving conversation analysis in everyday life: Exploring the privacy-utility trade-off
IF 3.1, CAS Zone 3, Computer Science
Computer Speech and Language, vol. 95, Article 101823. Pub Date: 2025-05-22. DOI: 10.1016/j.csl.2025.101823
Jule Pohlhausen, Francesco Nespoli, Jörg Bitzer
Abstract: Recordings in everyday life provide valuable insights for health-related applications, such as analyzing conversational behavior as an indicator of social interaction and well-being. However, these recordings require privacy preservation for both the speech content and the identities of all persons involved. This article investigates privacy-preserving features feasible for power-constrained recording devices by combining smoothing and subsampling in the frequency and time domain with a low-cost speaker anonymization technique. A speech recognition and a speaker verification system are used to evaluate privacy protection, whereas a voice activity detection and a speaker diarization model are used to assess the utility for analyzing conversations. The evaluation results demonstrate that combining speaker anonymization with the aforementioned smoothing and subsampling protects speech privacy, albeit at the expense of utility performance. Overall, our privacy-preserving methods offer various trade-offs between privacy and utility, reflecting the requirements of different application scenarios.
Citations: 0
Design choices for PixIT-based speaker-attributed ASR: Team ToTaTo at the NOTSOFAR-1 challenge
IF 3.1, CAS Zone 3, Computer Science
Computer Speech and Language, vol. 95, Article 101824. Pub Date: 2025-05-22. DOI: 10.1016/j.csl.2025.101824
Joonas Kalda, Séverin Baroudi, Martin Lebourdais, Clément Pagés, Ricard Marxer, Tanel Alumäe, Hervé Bredin
Abstract: PixIT is a recently proposed joint training framework that integrates Permutation Invariant Training (PIT) for speaker diarization and Mixture Invariant Training (MixIT) for speech separation. By leveraging diarization labels, PixIT addresses MixIT's limitations, producing aligned sources and speaker activations that enable automatic long-form separation. We investigate applications of PixIT to the speaker-attributed automatic speech recognition (SA-ASR) task based on our systems for the NOTSOFAR-1 Challenge. We explore modifications to the joint ToTaToNet by integrating advanced self-supervised learning (SSL) features and masking networks. We show that fine-tuning an ASR system on PixIT-separated sources significantly boosts downstream SA-ASR performance, outperforming standard diarization-based baselines without relying on synthetic data. We explore lightweight post-processing heuristics for reducing SA-ASR timestamp errors caused by long silences and artifacts present in file-level separated sources. We also show the potential of extracting speaker embeddings for the diarization pipeline directly from separated sources, with performance rivaling standard methods without any fine-tuning of speaker embeddings. On the NOTSOFAR-1 Challenge dataset, our PixIT-based approach outperforms the CSS-based baseline by 20% in terms of tcpWER after fine-tuning the ASR system on the separated sources. Notably, even when using the same ASR model as the baseline, our system is able to outperform it, without using any of the provided domain-specific synthetic data. These advancements position PixIT as a robust and flexible solution for real-world SA-ASR.
Citations: 0
Towards decoupling frontend enhancement and backend recognition in monaural robust ASR
IF 3.1, CAS Zone 3, Computer Science
Computer Speech and Language, vol. 95, Article 101821. Pub Date: 2025-05-20. DOI: 10.1016/j.csl.2025.101821
Yufeng Yang, Ashutosh Pandey, DeLiang Wang
Abstract: It has been shown that the intelligibility of noisy speech can be improved by speech enhancement (SE) algorithms. However, monaural SE has not been established as an effective frontend for automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with an ARN (attentive recurrent network) time-domain model, a TF-CrossNet time-frequency domain model, and an MP-SENet magnitude-phase based enhancement model. The proposed systems decouple frontend enhancement and backend ASR, with the latter trained only on clean speech. Results on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora demonstrate that ARN-, TF-CrossNet-, and MP-SENet-enhanced speech all translate to improved ASR results in noisy and reverberant environments, and generalize well to real acoustic scenarios. The proposed system outperforms the baselines trained on corrupted speech directly. Furthermore, it cuts the previous best word error rate (WER) on CHiME-2 by 28.4% relative, reaching a 5.6% WER, and achieves 3.3/4.4% WER on single-channel CHiME-4 simulated/real test data without training on CHiME-4. We also observe consistent improvements using noise-robust Whisper as the backend ASR model.
Citations: 0
An item response theory framework to evaluate automatic speech recognition systems against speech difficulty
IF 3.1, CAS Zone 3, Computer Science
Computer Speech and Language, vol. 95, Article 101817. Pub Date: 2025-05-16. DOI: 10.1016/j.csl.2025.101817
Chaina Santos Oliveira, Ricardo B.C. Prudêncio
Abstract: Evaluating the performance of Automatic Speech Recognition (ASR) systems is very relevant for selecting good techniques and understanding their advantages and limitations. ASR systems are usually evaluated by adopting test sets of audio speeches, ideally with different difficulty levels. In this sense, it is important to analyse whether a system under test correctly transcribes easy test speeches while being robust to the most difficult ones. In this paper, a novel framework is proposed for evaluating ASR systems, which covers two complementary issues: (1) to measure the difficulty of each test speech; and (2) to analyse each ASR system's performance against the difficulty level. Regarding the first issue, the framework measures speech difficulty by adopting Item Response Theory (IRT). Regarding the second issue, the Recognizer Characteristic Curve (RCC) is proposed, which is a plot of the ASR system's performance versus speech difficulty. ASR performance is further analysed by a two-dimensional plot, in which speech difficulty is decomposed by IRT into sentence difficulty and speaker quality. In the experiments, the proposed framework was applied to a test set produced using text-to-speech tools, with diverse speakers and sentences. Additionally, noise injection was applied to produce test items with even higher difficulty levels. Noise injection actually increases difficulty and generates a wide variety of speeches to assess ASR performance. However, it is essential to note that high noise levels can lead to an unreliable evaluation. The proposed plots were helpful both for identifying robust ASR systems and for choosing the noise level that results in both diversity and reliability.
Citations: 0
BERSting at the screams: A benchmark for distanced, emotional and shouted speech recognition
IF 3.1, CAS Zone 3, Computer Science
Computer Speech and Language, vol. 95, Article 101815. Pub Date: 2025-05-16. DOI: 10.1016/j.csl.2025.101815
Paige Tuttösí, Mantaj Dhillon, Luna Sang, Shane Eastwood, Poorvi Bhatia, Quang Minh Dinh, Avni Kapoor, Yewon Jin, Angelica Lim
Abstract: Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance on many reported metrics. Yet they continue to struggle in complex, real-world situations, such as distanced speech. Previous challenges have released datasets to address the issue of distanced ASR; however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 h of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors' homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were placed in 19 different positions, including obstructions and being in a different room than the actor. This data is publicly available and can be used to evaluate a variety of speech recognition tasks, including ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for the ASR and SER tasks, and find that ASR degrades both with increasing distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks, and continued work is needed to improve the robustness of such systems for more accurate real-world use.
Citations: 0