Computer Speech and Language: Latest Publications

Towards explainable spoofed speech attribution and detection: A probabilistic approach for characterizing speech synthesizer components
IF 3.1 | CAS Tier 3, Computer Science
Computer Speech and Language, Volume 95, Article 101840. Pub Date: 2025-06-11. DOI: 10.1016/j.csl.2025.101840
Jagabandhu Mishra, Manasi Chhibber, Hye-jin Shim, Tomi H. Kinnunen
Abstract: We propose an explainable probabilistic framework for characterizing spoofed speech by decomposing it into probabilistic attribute embeddings. Unlike raw high-dimensional countermeasure embeddings, which lack interpretability, the proposed probabilistic attribute embeddings aim to detect specific speech synthesizer components, represented through high-level attributes and their corresponding values. We use these probabilistic embeddings with four classifier back-ends to address two downstream tasks: spoofing detection and spoofing attack attribution. The former is the well-known bonafide-spoof detection task, whereas the latter seeks to identify the source method (generator) of a spoofed utterance. We additionally use Shapley values, a widely used technique in machine learning, to quantify the relative contribution of each attribute value to the decision-making process in each task. Results on the ASVspoof2019 dataset demonstrate the substantial role of waveform generator, conversion model outputs, and inputs in spoofing detection; and inputs, speaker, and duration modeling in spoofing attack attribution. In the detection task, the probabilistic attribute embeddings achieve 99.7% balanced accuracy and 0.22% equal error rate (EER), closely matching the performance of raw embeddings (99.9% balanced accuracy and 0.22% EER). Similarly, in the attribution task, our embeddings achieve 90.23% balanced accuracy and 2.07% EER, compared to 90.16% and 2.11% with raw embeddings. These results demonstrate that the proposed framework is both inherently explainable by design and capable of achieving performance comparable to raw CM embeddings.
Citations: 0
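As a point of reference for the EER and balanced-accuracy figures quoted above, here is a minimal sketch (not the authors' code) of how both metrics can be computed from detector scores with NumPy and scikit-learn; the score arrays are synthetic placeholders.

```python
# Minimal sketch (not the authors' code): equal error rate (EER) and balanced
# accuracy computed from detector scores with NumPy and scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, balanced_accuracy_score

def compute_eer(labels, scores):
    """EER: the operating point where false-acceptance and miss rates are equal."""
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = bonafide, 0 = spoof
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))    # closest crossing of the two error curves
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy scores standing in for countermeasure (CM) back-end outputs.
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])
scores = np.concatenate([rng.normal(2.0, 1.0, 500), rng.normal(-2.0, 1.0, 500)])

print(f"EER: {100 * compute_eer(labels, scores):.2f}%")
print(f"Balanced accuracy: {100 * balanced_accuracy_score(labels, (scores > 0).astype(int)):.2f}%")
```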
Raw acoustic-articulatory multimodal dysarthric speech recognition
IF 3.1 | CAS Tier 3, Computer Science
Computer Speech and Language, Volume 95, Article 101839. Pub Date: 2025-06-10. DOI: 10.1016/j.csl.2025.101839
Zhengjun Yue, Erfan Loweimi, Zoran Cvetkovic, Jon Barker, Heidi Christensen
Abstract: Automatic speech recognition (ASR) for dysarthric speech is challenging. The acoustic characteristics of dysarthric speech are highly variable and there are often fewer distinguishing cues between phonetic tokens. Multimodal ASR utilises data from other modalities to facilitate the task when a single acoustic modality proves insufficient. Articulatory information, which encapsulates knowledge about the speech production process, may constitute such a complementary modality. Although multimodal acoustic-articulatory ASR has received increasing attention recently, incorporating real articulatory data is under-explored for dysarthric speech recognition. This paper investigates the effectiveness of multimodal acoustic modelling using real dysarthric speech articulatory information in combination with acoustic features, especially raw signal representations, which are more informative than classic features and lead to representations tailored to dysarthric ASR. In particular, various raw acoustic-articulatory multimodal dysarthric speech recognition systems are developed and compared with similar systems built on hand-crafted features. Furthermore, the difference between dysarthric and typical speech in terms of articulatory information is systematically analysed using a statistical space distribution indicator called Maximum Articulator Motion Range (MAMR). Additionally, we used mutual information analysis to investigate the robustness and phonetic information content of the articulatory features, offering insights that support feature selection and the ASR results. Experimental results on the widely used TORGO dysarthric speech dataset show that combining the articulatory and raw acoustic features at the empirically found optimal fusion level achieves a notable performance gain, leading to up to 7.6% and 12.8% relative word error rate (WER) reduction for dysarthric and typical speech, respectively.
Citations: 0
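The abstract introduces MAMR as a statistical indicator of articulatory space coverage. The following is a hedged sketch of one plausible reading of such a statistic, a robust per-channel motion range over EMA trajectories; the paper's exact definition may differ, and all arrays are synthetic.

```python
# Hedged sketch of a MAMR-style statistic: a robust per-channel motion range of
# electromagnetic articulography (EMA) trajectories. The paper's exact
# definition may differ; all arrays below are synthetic.
import numpy as np

def articulator_motion_range(ema, percentile=99):
    """ema: (frames, channels) articulator trajectories (e.g. tongue tip x/y, lips).
    Returns the per-channel range; a high percentile guards against outlier frames."""
    hi = np.percentile(ema, percentile, axis=0)
    lo = np.percentile(ema, 100 - percentile, axis=0)
    return hi - lo

# Toy trajectories: typical speech tends to cover a wider articulatory space.
rng = np.random.default_rng(1)
typical = rng.normal(0.0, 1.0, size=(2000, 6))
dysarthric = rng.normal(0.0, 0.6, size=(2000, 6))
print("typical MAMR:   ", articulator_motion_range(typical).round(2))
print("dysarthric MAMR:", articulator_motion_range(dysarthric).round(2))
```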
Sentiment analysis for live video comments with variational residual representations
IF 3.1 | CAS Tier 3, Computer Science
Computer Speech and Language, Volume 95, Article 101838. Pub Date: 2025-06-09. DOI: 10.1016/j.csl.2025.101838
Changfan Luo, Ling Fang, Bensheng Qiu
Abstract: Live video comments (LVC) are valuable for public opinion analysis, communication, and user engagement. Analyzing the sentiment in LVC is crucial for understanding their content, especially when strong emotions are involved. However, compared with normal text, LVC are more strongly real-time, more context-dependent, and prone to cross-modal misalignment. Conventional sentiment analysis methods rely solely on textual information and explicit context, yet current multi-modal sentiment analysis models are insufficient to discriminate context and align multi-modal information. To address these challenges, we propose a novel variational residual fusion network based on a variational autoencoder for sentiment analysis of LVC. In particular, an autofilter module is introduced in the encoder to filter out useful surrounding comments as contextual information for the target comment. A residual fusion module is embedded between the encoder and decoder to discriminate the most relevant visual information, facilitating the alignment of multi-modal information and thereby enhancing the learning of the target comment representation. Furthermore, our method follows a multi-task learning scheme to help the model reinforce the representation of the target comments and improve the effectiveness of sentiment analysis. Extensive experiments demonstrate the effectiveness of the proposed framework.
Citations: 0
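A hedged PyTorch sketch of the two generic building blocks named in the abstract, a variational latent obtained with the reparameterization trick and a residual fusion of textual and visual representations, is given below. The module name, dimensions, and gating mechanism are illustrative assumptions, not the authors' architecture.

```python
# Hedged sketch: a VAE-style latent (reparameterization trick) with a residual
# fusion of a text representation and a visual one. Names and dimensions are
# illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn

class VariationalResidualFusion(nn.Module):
    def __init__(self, text_dim=256, vis_dim=512, latent_dim=128):
        super().__init__()
        self.to_mu = nn.Linear(text_dim, latent_dim)
        self.to_logvar = nn.Linear(text_dim, latent_dim)
        self.vis_proj = nn.Linear(vis_dim, latent_dim)
        self.gate = nn.Sequential(nn.Linear(2 * latent_dim, latent_dim), nn.Sigmoid())

    def forward(self, text_feat, vis_feat):
        mu, logvar = self.to_mu(text_feat), self.to_logvar(text_feat)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        v = self.vis_proj(vis_feat)
        g = self.gate(torch.cat([z, v], dim=-1))   # how much visual evidence to admit
        fused = z + g * v                          # residual fusion around the latent
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return fused, kl

fused, kl = VariationalResidualFusion()(torch.randn(4, 256), torch.randn(4, 512))
print(fused.shape, float(kl))
```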
Exploring knowledge distillation for low-resource multi-modal streaming ASR in the CHiME-8 MMCSG challenge
IF 3.1 | CAS Tier 3, Computer Science
Computer Speech and Language, Volume 95, Article 101837. Pub Date: 2025-06-06. DOI: 10.1016/j.csl.2025.101837
Hongbo Lan, Ya Jiang, Jun Du, Qing Wang
Abstract: In the CHiME-8 Multi-modal Conversational Speech Recognition for Smart Glasses (MMCSG) challenge, participants were tasked with achieving real-time transcription of two-person conversations recorded with smart glasses. To address the scarcity of real-world data, we propose a knowledge distillation framework in which a non-streaming teacher model, trained on augmented multi-channel audio, guides a streaming student model. Leveraging simulated data with varying overlap rates, the framework employs a logit-based Kullback–Leibler divergence loss alongside mean square error losses on hidden states and attention maps of Fast-Conformer layers to transfer knowledge from the teacher to the student, significantly improving the performance of the audio-only streaming automatic speech recognition (ASR) model. Furthermore, we exploit the synergy and complementarity of inertial measurement unit and audio data by developing a novel multi-modal streaming ASR model. Meanwhile, cross-modal distillation is performed by adopting the non-streaming audio-only teacher to guide the streaming multi-modal student. Experimental results demonstrate that our proposed multi-modal fusion and teacher-student learning framework effectively enhances the performance of streaming ASR models. Notably, our approach secured first place in the sub-track of the CHiME-8 MMCSG challenge.
Citations: 0
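The distillation objective described above combines a logit-based KL divergence with MSE terms on hidden states and attention maps. The sketch below illustrates that combination in PyTorch; the temperature, loss weights, and tensor shapes are illustrative assumptions rather than the challenge system's actual settings.

```python
# Hedged sketch of a teacher-student distillation loss: temperature-scaled KL
# on logits plus MSE on hidden states and attention maps. Weights and shapes
# are illustrative, not the system's configuration.
import torch
import torch.nn.functional as F

def distillation_loss(s_logits, t_logits, s_hidden, t_hidden, s_attn, t_attn,
                      T=2.0, w_kl=1.0, w_hid=0.5, w_attn=0.5):
    kl = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T     # soften distributions, rescale gradient
    mse_hidden = F.mse_loss(s_hidden, t_hidden)      # match intermediate representations
    mse_attn = F.mse_loss(s_attn, t_attn)            # match attention maps
    return w_kl * kl + w_hid * mse_hidden + w_attn * mse_attn

loss = distillation_loss(torch.randn(8, 100, 500), torch.randn(8, 100, 500),
                         torch.randn(8, 100, 512), torch.randn(8, 100, 512),
                         torch.rand(8, 8, 100, 100), torch.rand(8, 8, 100, 100))
print(float(loss))
```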
Universal constituency treebanking and parsing: A pilot study
IF 3.1 | CAS Tier 3, Computer Science
Computer Speech and Language, Volume 95, Article 101826. Pub Date: 2025-06-06. DOI: 10.1016/j.csl.2025.101826
Jianling Li, Meishan Zhang, Jianrong Wang, Min Zhang, Yue Zhang
Abstract: Universal language processing is crucial for developing models that work across multiple languages. However, universal constituency parsing has lagged due to the lack of annotated universal constituency (UC) treebanks. To address this, we propose two cost-effective approaches. First, we unify existing annotated language-specific treebanks using phrase label mapping to create UC trees, but this is limited to only a handful of languages. Second, we develop a novel method to convert Universal Dependency (UD) treebanks into UC treebanks using large language models (LLMs) with syntactic knowledge, enabling the construction of UC treebanks for over 150 languages. We adopt the graph-based max margin model as our baseline and introduce a language adapter to fine-tune the universal parser. Our experiments show that the language adapter maintains performance for high-resource languages and improves performance for low-resource languages. We evaluate different scales of multilingual pre-trained models, confirming the effectiveness and robustness of our approach. In summary, we conduct the first pilot study on universal constituency parsing, introducing novel methods for creating and utilizing UC treebanks, thereby advancing treebanking and parsing methodologies.
Citations: 0
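The first approach mentioned in the abstract unifies existing treebanks through phrase label mapping. The snippet below is a toy illustration of such a mapping applied to an NLTK constituency tree; the universal label names in PHRASE_LABEL_MAP are hypothetical, not the paper's actual scheme.

```python
# Toy illustration of phrase label mapping on a constituency tree.
# PHRASE_LABEL_MAP is a hypothetical mapping, not the paper's label set.
from nltk.tree import Tree

PHRASE_LABEL_MAP = {   # hypothetical language-specific -> universal phrase labels
    "NP": "NounP", "VP": "VerbP", "PP": "AdpP", "S": "Clause", "SBAR": "Clause",
}

def to_universal(tree):
    if isinstance(tree, str):                       # leaf token, keep as-is
        return tree
    label = PHRASE_LABEL_MAP.get(tree.label(), tree.label())
    return Tree(label, [to_universal(child) for child in tree])

ptb = Tree.fromstring(
    "(S (NP (DT the) (NN parser)) (VP (VBZ works) (PP (IN on) (NP (NNS treebanks)))))")
print(to_universal(ptb))
```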
Microphone array geometry-independent multi-talker distant ASR: NTT system for DASR task of the CHiME-8 challenge
IF 3.1 | CAS Tier 3, Computer Science
Computer Speech and Language, Volume 95, Article 101820. Pub Date: 2025-06-04. DOI: 10.1016/j.csl.2025.101820
Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki
Abstract: In this paper, we introduce a multi-talker distant automatic speech recognition (DASR) system we designed for the DASR task (Task 1) of the CHiME-8 challenge. Our system performs speaker counting, diarization, and ASR. It handles a variety of recording conditions, from dinner parties to professional meetings, and from two speakers to eight. We perform diarization first, followed by speech enhancement, and then ASR, as in the challenge baseline. However, we introduced several key refinements. First, we derived a powerful speaker diarization pipeline relying on end-to-end speaker diarization with vector clustering (EEND-VC), multi-channel speaker counting using enhanced embeddings from EEND-VC, and target-speaker voice activity detection (TS-VAD). For speech enhancement, we introduced a novel microphone selection rule to better select the most relevant microphones among the distributed microphones and investigated improvements to beamforming. Finally, for ASR, we developed several models exploiting the Whisper and WavLM speech foundation models. In this paper, we present the original results we submitted to the challenge and updated results we obtained afterward. Our strongest system achieves a 63% relative macro tcpWER improvement over the baseline and outperforms the challenge best results on the NOTSOFAR-1 meeting evaluation data among geometry-independent systems.
Citations: 0
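For the microphone-selection step, the sketch below shows a generic envelope-variance channel-ranking heuristic, included only to illustrate the idea of selecting informative channels among distributed microphones; it is not the novel selection rule proposed in the paper, and all signals are synthetic.

```python
# Generic channel-selection heuristic (not the paper's rule): rank distributed
# channels by the variance of their log-energy envelope and keep the top k.
import numpy as np

def select_microphones(multichannel_audio, k=3, frame=1024, hop=512):
    """multichannel_audio: (channels, samples). Returns indices of the k best channels."""
    scores = []
    for ch in multichannel_audio:
        frames = np.lib.stride_tricks.sliding_window_view(ch, frame)[::hop]
        envelope = np.log(np.mean(frames ** 2, axis=1) + 1e-10)
        scores.append(np.var(envelope))      # lively envelopes ~ closer/cleaner speech
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(2)
audio = rng.normal(0, 1, size=(8, 16000 * 5))          # 8 channels, 5 s at 16 kHz
t = np.arange(16000 * 5) / 16000
audio[1] *= 1 + 0.9 * np.sin(2 * np.pi * 3 * t)        # channel 1: speech-like slow modulation
print("selected channels:", select_microphones(audio, k=3))
```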
ASVspoof 5: Design, collection and validation of resources for spoofing, deepfake, and adversarial attack detection using crowdsourced speech
IF 3.1 | CAS Tier 3, Computer Science
Computer Speech and Language, Volume 95, Article 101825. Pub Date: 2025-05-28. DOI: 10.1016/j.csl.2025.101825
Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi, Myeonghun Jeong, Ge Zhu, Yongyi Zang, You Zhang, Soumi Maiti, Florian Lux, Nicolas Müller, Vishwanath Singh
Abstract: ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake attacks as well as the design of detection solutions. We introduce the ASVspoof 5 database, which is generated in a crowdsourced fashion from data collected in diverse acoustic conditions (cf. studio-quality data for earlier ASVspoof databases) and from ∼2000 speakers (cf. ∼100 earlier). The database contains attacks generated with 32 different algorithms, also crowdsourced, and optimised to varying degrees using new surrogate detection models. Among them are attacks generated with a mix of legacy and contemporary text-to-speech synthesis and voice conversion models, in addition to adversarial attacks which are incorporated for the first time. ASVspoof 5 protocols comprise seven speaker-disjoint partitions. They include two distinct partitions for the training of different sets of attack models, two more for the development and evaluation of surrogate detection models, and then three additional partitions which comprise the ASVspoof 5 training, development and evaluation sets. An auxiliary set of data collected from an additional 30k speakers can also be used to train speaker encoders for the implementation of attack algorithms. Also described herein is an experimental validation of the new ASVspoof 5 database using a set of automatic speaker verification and spoof/deepfake baseline detectors. With the exception of protocols and tools for the generation of spoofed/deepfake speech, the resources described in this paper, already used by participants of the ASVspoof 5 challenge in 2024, are now all freely available to the community.
Citations: 0
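The ASVspoof 5 protocols are built from speaker-disjoint partitions, meaning every speaker appears in exactly one partition. The sketch below illustrates that protocol property in the abstract sense only; the partition names and proportions are invented for illustration and do not reflect the actual ASVspoof 5 splits.

```python
# Hedged sketch of speaker-disjoint partitioning: shuffle speakers, then assign
# each speaker (and all of their utterances) to exactly one partition.
# Partition names and fractions below are illustrative, not the ASVspoof 5 protocol.
import random

def speaker_disjoint_split(utterances_by_speaker, partitions, seed=0):
    """utterances_by_speaker: dict speaker_id -> list of utterance ids.
    partitions: dict partition_name -> fraction of speakers (fractions sum to 1)."""
    speakers = sorted(utterances_by_speaker)
    random.Random(seed).shuffle(speakers)
    split, start = {}, 0
    for i, (name, frac) in enumerate(partitions.items()):
        end = len(speakers) if i == len(partitions) - 1 else start + round(frac * len(speakers))
        split[name] = [u for spk in speakers[start:end] for u in utterances_by_speaker[spk]]
        start = end
    return split

data = {f"spk{i:04d}": [f"spk{i:04d}_utt{j}" for j in range(3)] for i in range(2000)}
fractions = {"attack-train-1": 0.15, "attack-train-2": 0.15, "surrogate-dev": 0.1,
             "surrogate-eval": 0.1, "cm-train": 0.2, "cm-dev": 0.1, "cm-eval": 0.2}
print({name: len(utts) for name, utts in speaker_disjoint_split(data, fractions).items()})
```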
Aspect-level sentiment analysis based on graph convolutional networks and interactive aggregate attention
IF 3.1 | CAS Tier 3, Computer Science
Computer Speech and Language, Volume 95, Article 101819. Pub Date: 2025-05-26. DOI: 10.1016/j.csl.2025.101819
Yuxin Wu, Guofeng Deng
Abstract: Sentiment analysis has long been an important task in artificial intelligence, and aspect-level sentiment analysis is its fine-grained variant. Recently, graph convolutional networks (GCNs) built on sentence dependency trees have been widely used in aspect-level sentiment analysis. Because GCNs have good aggregation properties, they can efficiently aggregate the information of neighboring nodes. However, many previous studies of graph neural networks focused only on the information between nodes and did not effectively explore the connections between aspects and sentences or highlight the parts with high aspect relevance. To address this problem, we propose a new GCN. When constructing a dependency tree-based graph, affective information and position index information are added to each node to enhance the graph. In addition, we use an interactive aggregate attention mechanism, which utilizes the aggregated information related to the connections between aspects and sentences from the GCN to highlight the important parts, so that the model can fully learn the relationships between aspects and sentences. Finally, we validate our model on four public benchmark datasets and attain improvements over the state-of-the-art methods.
Citations: 0
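At the core of the model described above is graph convolution over a sentence's dependency tree. The following is a minimal PyTorch sketch of one such aggregation step; the affective and position enrichment and the interactive aggregate attention are not reproduced here, and all shapes are illustrative.

```python
# Minimal sketch of one graph-convolution step over a dependency adjacency
# matrix; the paper's node enrichment and attention mechanism are not shown.
import torch
import torch.nn as nn

class DependencyGCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, adj):
        """h: (batch, tokens, dim) node states; adj: (batch, tokens, tokens) dependency edges."""
        adj = adj + torch.eye(adj.size(-1), device=adj.device)   # add self-loops
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)           # row-normalize
        return torch.relu(self.linear((adj / deg) @ h))          # aggregate neighbours

tokens, dim = 6, 64
h = torch.randn(2, tokens, dim)
adj = torch.zeros(2, tokens, tokens)
adj[:, 0, 1] = adj[:, 1, 0] = 1.0    # e.g. an aspect word linked to its opinion word
print(DependencyGCNLayer(dim)(h, adj).shape)
```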
Towards privacy-preserving conversation analysis in everyday life: Exploring the privacy-utility trade-off
IF 3.1 | CAS Tier 3, Computer Science
Computer Speech and Language, Volume 95, Article 101823. Pub Date: 2025-05-22. DOI: 10.1016/j.csl.2025.101823
Jule Pohlhausen, Francesco Nespoli, Jörg Bitzer
Abstract: Recordings in everyday life provide valuable insights for health-related applications, such as analyzing conversational behavior as an indicator of social interaction and well-being. However, these recordings require privacy preservation of both the speech content and the speaker's identity of all persons involved. This article investigates privacy-preserving features feasible for power-constrained recording devices by combining smoothing and subsampling in the frequency and time domain with a low-cost speaker anonymization technique. A speech recognition and a speaker verification system are used to evaluate privacy protection, whereas a voice activity detection and a speaker diarization model are used to assess the utility for analyzing conversations. The evaluation results demonstrate that combining speaker anonymization with the aforementioned smoothing and subsampling protects speech privacy, albeit at the expense of utility performance. Overall, our privacy-preserving methods offer various trade-offs between privacy and utility, reflecting the requirements of different application scenarios.
Citations: 0
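The privacy-preserving features combine smoothing and subsampling in the frequency and time domains. The sketch below shows the general idea on a power spectrogram, collapsing frequency bins into broad bands and averaging frames over coarse time blocks, so that fine-grained content becomes harder to recover while coarse activity patterns remain; all parameters are illustrative, not the paper's configuration.

```python
# Hedged sketch of coarse, privacy-oriented features: smooth in frequency
# (collapse bins into broad bands) and subsample in time (average frame blocks).
# Parameters are illustrative and do not reproduce the paper's feature set.
import numpy as np

def coarse_features(signal, sr=16000, frame=400, hop=160, freq_bands=8, time_factor=25):
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1)) ** 2
    # Smooth in frequency: collapse FFT bins into a few broad bands.
    bands = np.array_split(spec, freq_bands, axis=1)
    banded = np.stack([b.mean(axis=1) for b in bands], axis=1)
    # Subsample in time: average over blocks of frames (here ~0.25 s per block).
    n = (banded.shape[0] // time_factor) * time_factor
    return banded[:n].reshape(-1, time_factor, freq_bands).mean(axis=1)

rng = np.random.default_rng(3)
print(coarse_features(rng.normal(0, 1, 16000 * 10)).shape)   # (time blocks, frequency bands)
```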
Design choices for PixIT-based speaker-attributed ASR: Team ToTaTo at the NOTSOFAR-1 challenge
IF 3.1 | CAS Tier 3, Computer Science
Computer Speech and Language, Volume 95, Article 101824. Pub Date: 2025-05-22. DOI: 10.1016/j.csl.2025.101824
Joonas Kalda, Séverin Baroudi, Martin Lebourdais, Clément Pagés, Ricard Marxer, Tanel Alumäe, Hervé Bredin
Abstract: PixIT is a recently proposed joint training framework that integrates Permutation Invariant Training (PIT) for speaker diarization and Mixture Invariant Training (MixIT) for speech separation. By leveraging diarization labels, PixIT addresses MixIT's limitations, producing aligned sources and speaker activations that enable automatic long-form separation. We investigate applications of PixIT to the speaker-attributed automatic speech recognition (SA-ASR) task based on our systems for the NOTSOFAR-1 Challenge. We explore modifications to the joint ToTaToNet by integrating advanced self-supervised learning (SSL) features and masking networks. We show that fine-tuning an ASR system on PixIT-separated sources significantly boosts downstream SA-ASR performance, outperforming standard diarization-based baselines without relying on synthetic data. We explore lightweight post-processing heuristics for reducing SA-ASR timestamp errors caused by long silences and artifacts present in file-level separated sources. We also show the potential of extracting speaker embeddings for the diarization pipeline directly from separated sources, with performance rivaling standard methods without any fine-tuning of speaker embeddings. On the NOTSOFAR-1 Challenge dataset, our PixIT-based approach outperforms the CSS-based baseline by 20% in terms of tcpWER after fine-tuning the ASR system on the separated sources. Notably, even when using the same ASR model as the baseline, our system is able to outperform it, without using any of the provided domain-specific synthetic data. These advancements position PixIT as a robust and flexible solution for real-world SA-ASR.
Citations: 0
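PixIT combines permutation-invariant training (PIT) for diarization with mixture-invariant training (MixIT) for separation. The sketch below illustrates only the PIT ingredient for a two-speaker case, with a simple MSE criterion standing in for whatever loss the actual system uses; shapes and data are synthetic.

```python
# Hedged sketch of permutation-invariant training (PIT): score every assignment
# of model outputs to references and back-propagate the cheapest one. A plain
# MSE stands in for the actual training criterion.
import itertools
import torch
import torch.nn.functional as F

def pit_mse_loss(estimates, references):
    """estimates, references: (batch, speakers, samples). Min loss over output permutations."""
    n_spk = estimates.size(1)
    losses = []
    for perm in itertools.permutations(range(n_spk)):
        permuted = estimates[:, list(perm)]
        losses.append(F.mse_loss(permuted, references, reduction="none").mean(dim=(1, 2)))
    return torch.stack(losses, dim=1).min(dim=1).values.mean()

est = torch.randn(4, 2, 16000, requires_grad=True)   # two estimated sources per mixture
ref = torch.randn(4, 2, 16000)                       # two reference sources per mixture
print(float(pit_mse_loss(est, ref)))
```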