Toward fast meeting transcription: NAIST system for CHiME-8 NOTSOFAR-1 task and its analysis

IF 3.1 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Yuta Hirano, Mau Nguyen, Kakeru Azuma, Jan Meyer Saragih, Sakriani Sakti
Computer Speech and Language, vol. 95, Article 101836. Published 2025-07-08. DOI: 10.1016/j.csl.2025.101836
https://www.sciencedirect.com/science/article/pii/S0885230825000610

Abstract

This paper reports on the NAIST system submitted to the CHiME-8 challenge’s NOTSOFAR-1 (Natural Office Talkers in Settings of Far-field Audio Recordings) task, including results and analyses from several additional experiments. While fast processing is crucial for real-world applications, the CHiME-7 challenge focused solely on reducing error rate and neglected practical aspects of system performance such as inference speed. This research therefore aims to develop a practical system that improves recognition accuracy while simultaneously reducing inference time. To this end, we propose enhancing the baseline architecture by modifying both the continuous speech separation (CSS) and automatic speech recognition (ASR) modules. Specifically, the ASR module was built on a WavLM Large feature extractor and a Zipformer transducer. Furthermore, we applied dereverberation using block-wise weighted prediction error (WPE) as preprocessing for the speech separation module. Compared to the baseline system, the proposed system achieved a relative reduction in tcpWER (time-constrained minimum-permutation word error rate) of 11.6% on the single-channel track and 18.7% on the multi-channel track. Moreover, the proposed system runs up to six times faster than the baseline while achieving a lower tcpWER. Based on findings from our system development, we also report how system performance changes with the amount of ASR training data, as well as how the maximum word-length setting in the transducer-based ASR module affects the subsequent diarization system.
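The two headline figures in the abstract, a relative tcpWER reduction and a speed-up factor, are both ratios against the baseline. As a minimal sketch of how such figures are typically computed, the Python below defines two helper functions; the function names and the input numbers are hypothetical illustrations and are not taken from the paper.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Relative reduction of an error rate, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


def speedup(baseline_seconds: float, proposed_seconds: float) -> float:
    """How many times faster the proposed system is than the baseline."""
    if proposed_seconds <= 0:
        raise ValueError("proposed_seconds must be positive")
    return baseline_seconds / proposed_seconds


# Hypothetical values for illustration only; they are NOT results from the paper.
baseline_wer, proposed_wer = 40.0, 35.0
print(f"relative reduction: {relative_reduction(baseline_wer, proposed_wer):.1%}")
print(f"speed-up: {speedup(600.0, 100.0):.1f}x")
```

A relative reduction is distinct from an absolute one: under this definition, an 11.6% relative reduction means the proposed error rate is 88.4% of the baseline error rate, whatever the baseline value is.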
Source journal
Computer Speech and Language (Engineering and Technology: Computer Science, Artificial Intelligence)
CiteScore: 11.30
Self-citation rate: 4.70%
Articles published: 80
Review time: 22.9 weeks
Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.