Yuta Hirano, Mau Nguyen, Kakeru Azuma, Jan Meyer Saragih, Sakriani Sakti
{"title":"Toward fast meeting transcription: NAIST system for CHiME-8 NOTSOFAR-1 task and its analysis","authors":"Yuta Hirano , Mau Nguyen , Kakeru Azuma , Jan Meyer Saragih , Sakriani Sakti","doi":"10.1016/j.csl.2025.101836","DOIUrl":null,"url":null,"abstract":"<div><div>This paper reports on the NAIST system submitted to the CHIME-8 challenge’s NOTSOFAR-1 (Natural Office Talkers in Settings of Far-field Audio Recordings) task, including results and analyses from several additional experiments. While fast processing is crucial for real-world applications, the CHIME-7 challenge focused solely on reducing error rate, neglecting the practical aspects of system performance such as inference speed. Therefore, this research aims to develop a practical system by improving recognition accuracy while simultaneously reducing inference speed. To address this challenge, we propose enhancing the baseline module architecture by modifying both the CSS and ASR modules. Specifically, the ASR module was built based on a WavLM large feature extractor and a Zipformer transducer. Furthermore, we employed reverberation removal using block-wise weighted prediction error (WPE) as preprocessing for the speech separation module. The proposed system achieved a relative reduction in tcpWER of 11.6% for single-channel tracks and 18.7% for multi-channel tracks compared to the baseline system. Moreover, the proposed system operates up to six times faster than the baseline system while achieving superior tcpWER results. 
We also report on the observed changes in system performance due to variations in the amount of training data for the ASR model, as well the impact of the maximum word-length setting in the transducer-based ASR module on the subsequent diarization system, based on findings from our system development.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101836"},"PeriodicalIF":3.1000,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230825000610","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
This paper reports on the NAIST system submitted to the CHiME-8 challenge’s NOTSOFAR-1 (Natural Office Talkers in Settings of Far-field Audio Recordings) task, including results and analyses from several additional experiments. While fast processing is crucial for real-world applications, the CHiME-7 challenge focused solely on reducing the error rate, neglecting practical aspects of system performance such as inference speed. Therefore, this research aims to develop a practical system by improving recognition accuracy while simultaneously reducing inference time. To address this challenge, we propose enhancing the baseline module architecture by modifying both the CSS (continuous speech separation) and ASR modules. Specifically, the ASR module was built on a WavLM Large feature extractor and a Zipformer transducer. Furthermore, we employed reverberation removal using block-wise weighted prediction error (WPE) as preprocessing for the speech separation module. The proposed system achieved a relative reduction in tcpWER of 11.6% for single-channel tracks and 18.7% for multi-channel tracks compared to the baseline system. Moreover, the proposed system operates up to six times faster than the baseline system while achieving superior tcpWER results. We also report on the observed changes in system performance due to variations in the amount of training data for the ASR model, as well as the impact of the maximum word-length setting in the transducer-based ASR module on the subsequent diarization system, based on findings from our system development.
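The "relative reduction in tcpWER" reported in the abstract follows the standard relative error-rate reduction formula. The sketch below illustrates the computation; the tcpWER values used are hypothetical placeholders chosen so that the result matches the reported 11.6% single-channel figure, not numbers taken from the paper.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Relative error-rate reduction in percent: 100 * (baseline - proposed) / baseline."""
    return 100.0 * (baseline - proposed) / baseline

# Hypothetical tcpWER values (%) for illustration only.
baseline_tcpwer = 35.5
proposed_tcpwer = 31.38

print(f"relative tcpWER reduction: "
      f"{relative_reduction(baseline_tcpwer, proposed_tcpwer):.1f}%")
```

Note that a relative reduction compares against the baseline error rate itself, so a fixed relative figure corresponds to different absolute tcpWER gaps depending on where the baseline sits.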
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.