{"title":"A Mixture-of-Experts model for multimodal emotion recognition in conversations","authors":"Soumya Dutta , Smruthi Balaji , Sriram Ganapathy","doi":"10.1016/j.csl.2026.101965","DOIUrl":"10.1016/j.csl.2026.101965","url":null,"abstract":"<div><div>Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose <strong>Mi</strong>xture of <strong>S</strong>peech-<strong>T</strong>ext <strong>E</strong>xperts for <strong>R</strong>ecognition of <strong>E</strong>motions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts – speech-only, text-only, and cross-modal – using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets – IEMOCAP, MELD, and MOSI – show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101965"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dialogue summarization with topic enhancement and factual consistency contrast","authors":"Liu Zhanghui, Zhang Wentao, Chen Yuzhong, Lin Yixin","doi":"10.1016/j.csl.2026.101958","DOIUrl":"10.1016/j.csl.2026.101958","url":null,"abstract":"<div><div>Abstractive dialogue summarization aims to capture key information from dialogues and generate a brief textual summary. However, due to the dynamic interactive characteristics of dialogues, the key information is very scattered in the context of the dialogue, and there are numerous instances of implicit semantic information, which frequently leads to the generation of summaries with inconsistent facts. Despite recent efforts in modelling the context of the dialogue, abstractive dialogue summarization still faces some significant challenges. The first challenge is how to effectively identify the key information in the dialogue and incorporate additional feature information. The second challenge is how to effectively improve the factual consistency between generated summaries and dialogues. To address these challenges, we propose a dialogue summarization model with Topic Enhancement and Factual Consistency Contrast (DS_DTEFCC). Firstly, we design a topic-enhanced context encoder to discover latent topic features and identify key information in the context of dialogues. This context encoder enables DS_DTEFCC to achieve effective fusion between topic features and dialogue representations. Secondly, we propose a novel auxiliary task to enhance factual consistency for the primary dialogue summarization task. In this auxiliary task, we introduce a fact enhancement strategy and a fact perturbation strategy to construct positive and negative samples, respectively. Furthermore, by constructing factual contrastive triplets for contrastive learning, this auxiliary task can effectively assist DS_DTEFCC in generating summaries with factual consistency. Experimental results on two public benchmark datasets demonstrate that DS_DTEFCC achieves significant performance improvement over other state-of-the-art baseline models.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101958"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A dual-branch parallel network for speech enhancement and restoration","authors":"Da-Hee Yang , Dail Kim , Joon-Hyuk Chang , Jeonghwan Choi , Han-Gil Moon","doi":"10.1016/j.csl.2026.101959","DOIUrl":"10.1016/j.csl.2026.101959","url":null,"abstract":"<div><div>We present a novel general speech restoration model, <strong>DBP-Net</strong> (dual-branch parallel network), designed to effectively handle complex real-world distortions including noise, reverberation, and bandwidth degradation. Unlike prior approaches that rely on a single processing path or separate models for enhancement and restoration, DBP-Net introduces a unified architecture with <em>dual parallel branches</em>—a masking-based branch for distortion suppression and a mapping-based branch for spectrum reconstruction. A key innovation behind DBP-Net lies in the <em>parameter sharing</em> between the two branches and a <em>cross-branch skip fusion</em>, where the output of the masking branch is explicitly fused into the mapping branch. This design enables DBP-Net to simultaneously leverage complementary learning strategies – suppression and generation – within a lightweight framework. Experimental results show that DBP-Net significantly outperforms existing baselines in comprehensive speech restoration tasks while maintaining a compact model size. These findings suggest that DBP-Net offers an effective and scalable solution for unified speech enhancement and restoration in diverse distortion scenarios.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101959"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147386291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UGR-MINDVOICE: A multimodal EEG-audio dataset for overt and covert Iberian Spanish speech production","authors":"Ibon Vales Cortina , Owais Mujtaba Khanday , Marc Ouellet , José L. Pérez-Córdoba , Pablo Rodríguez-San Esteban , Laura Miccoli , Alberto Galdón , Gonzalo Olivares Granados , Jose A. Gonzalez-Lopez","doi":"10.1016/j.csl.2026.101964","DOIUrl":"10.1016/j.csl.2026.101964","url":null,"abstract":"<div><div>We present UGR-MINDVOICE, the University of Granada (UGR) multimodal electroencephalography (EEG) and audio dataset for overt and covert speech in Iberian Spanish intended for basic neuroscience and brain-computer interface (BCI) research. The dataset features EEG and audio recordings from 15 native Spanish speakers engaged in both overt and covert speech production tasks. This dataset is unique in its inclusion of all Spanish phonemes and a diverse set of words spanning various semantic categories and different usage frequencies. Validation of the dataset confirmed the presence of robust sensory event-related potentials, including the visual P100 and the auditory N1 (N100), indicating reliable early perceptual processing and sustained participant attention to both visual and auditory stimuli. Additionally, the EEG data were classified into rest, covert speech, and overt speech conditions with an accuracy of 81.40%, demonstrating active participant engagement in the tasks. By providing synchronised EEG and audio data for overt speech, along with EEG data for the same stimuli during covert speech, UGR-MINDVOICE constitutes a valuable resource for advancing research in basic neuroscience and brain-computer interfaces, particularly in the domain of silent speech communication. The full dataset is openly available on the Open Science Framework (OSF) (<span><span>https://osf.io/6sh5d</span><svg><path></path></svg></span>), and all accompanying code and analysis scripts are provided in a public GitHub repository (<span><span>https://github.com/owaismujtaba/mind-voice</span><svg><path></path></svg></span>).</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101964"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147386297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cro-MTVITS: An end-to-end cross-lingual speech synthesis model for Mandarin and multi-dialect Tibetan based on VITS","authors":"Weizhao Zhang , Mengjuan Wang , Junzhi Li , Hongwu Yang","doi":"10.1016/j.csl.2026.101956","DOIUrl":"10.1016/j.csl.2026.101956","url":null,"abstract":"<div><div>Cross-lingual speech synthesis is a key research focus in speech synthesis, allowing a single model to generate speech in multiple languages for one speaker. In China, while Mandarin is the official language, approximately 4 million people speak Tibetan as their native language. Previous Mandarin–Tibetan cross-lingual researches have largely concentrated on the Lhasa dialect, often overlooking the Kham and Amdo dialects, and have relied on autoregressive models, which still produce speech quality inferior to that of major languages. To address these challenges, we propose Cro-MTVITS, an end-to-end cross-lingual speech synthesis model for Mandarin and multi-dialect Tibetan. Firstly, we constructed a large-scale multi-dialect Tibetan corpus covering Lhasa, Kham, and Amdo dialects, totaling 52.2 h. Then, we developed a baseline model based on VITS, incorporating speaker and language embeddings into the text encoder, posterior encoder, decoder, stochastic duration predictor (SDP) and flow to enable cross-lingual synthesis. Finally, we enhanced this baseline model with an improved posterior encoder, SDP, and pre-trained language and speech models, yielding significant performance gains. Cro-MTVITS consistently achieved higher mean opinion score (MOS) values than the VITS baseline across all languages and scenarios, with improvements ranging from 0.07 to 0.21 points. Statistical tests confirmed that Cro-MTVITS significantly outperforms the baseline. Overall, experimental results demonstrate that our model surpasses the baseline in both subjective and objective evaluations, enabling high-quality cross-lingual speech synthesis between Mandarin and multi-dialect Tibetan. The synthesized speech samples can be found on demos<span><span><sup>1</sup></span></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101956"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deepening graph-based approaches for Portuguese open information extraction with LLM augmentation","authors":"Gabriel Silva , Mário Rodrigues , António Teixeira , Marlene Amorim","doi":"10.1016/j.csl.2026.101963","DOIUrl":"10.1016/j.csl.2026.101963","url":null,"abstract":"<div><div>Utilizing richer information, such as structural and syntactic details, can enhance Natural Language Processing (NLP) tasks like Open Information Extraction (Open IE), particularly for languages with limited resources like Portuguese. Knowledge Graphs (KGs) offer a robust solution by unifying diverse annotations and enabling the application of Graph Machine Learning (Graph ML).</div><div>This paper presents an advanced framework for Portuguese Open IE, integrating KGs and Graph ML with Large Language Model (LLM) augmentation. Our framework employs a three-stage process: (1) initial Knowledge Graph (KG) construction from text, followed by (2) Predicate Extraction and (3) Subject/Object Extraction, both leveraging GraphSAGE models. Large Language Models (LLMs) (DeepSeek) are used for augmentation when Graph ML predictions are absent or for refining/validating extractions.</div><div>We present two versions of a system that was evaluated on a Portuguese dataset. Automatic evaluation (word-based) for the best version of the system yielded an F1-score of 64.9% for Predicate extraction and 89.7% for Subject/Object extraction. The final end-to-end performance of the system is an F1-score of 58.2%.</div><div>A human evaluation was conducted on 51 Portuguese sentences (yielding 100 triples) by two annotators, achieving a substantial agreement (Cohen’s Kappa of 0.67). The system extracted an average of 1.84 triples per sentence, with 53.9% deemed correct. Notably, this version significantly reduced invalid/wrong extractions to 6.6% from 31.7% in the previous version, demonstrating improved Precision while maintaining the ability to extract multiple meaningful triples.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101963"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147386292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging synthetic speech: TTS-driven data augmentation for effective dysarthric speech recognition","authors":"P. Vijayalakshmi , Anushiya Rachel Gladston , B. Ramani , M.P. Actlin Jeeva , K. Anantha Krishnan , T. Lavanya , T. Nagarajan","doi":"10.1016/j.csl.2026.101961","DOIUrl":"10.1016/j.csl.2026.101961","url":null,"abstract":"<div><div>Dysarthria is a neuro-motor speech disorder that impairs a person’s ability to communicate. This necessitates a communication aid to enable interaction with both individuals and computers, typically in the form of an automatic speech recognition (ASR) system. However, conventional ASR systems exhibit high word error rates (WER) when applied to dysarthric speech necessitating a dysarthric ASR (DASR) system. In the current work, DASR systems are developed using SSN TDSC (Tamil Dysarthric Speech Corpus) dataset, targeting mild and moderate dysarthria. Initially, a baseline DASR system is developed with original dysarthric speech data resulting in WER of 9.71% for mild and 19.54 % for moderate dysarthria respectively. In order to develop a DASR system with low WER enormous amount of dysarthric speech data is required. However, recording several hours of speech data from dysarthric speakers is difficult owing to their medical condition. To address this data scarcity, we explore data augmentation using text-to-speech (TTS) synthesis to generate additional dysarthric speech data. In this study, various TTS models, namely, hidden Markov model-based TTS (HTS), FastSpeech2 and Tacotron2 are used for synthesizing dysarthric speech. The current work focuses on identifying the properties that the synthetic speech must exhibit to aid in improving the performance of DASR systems and to derive the required amount of dysarthric speech data. Based on the subjective and objective evaluations carried out on the synthetic speech, FastSpeech2 outperforms the other TTS models considered in terms of preserving the dysarthric speech properties. Training the DASR systems using FastSpeech2-derived augmented data resulted in reduced WERs of 3.49% for mild and 13.17% for moderate dysarthria. Further experiments revealed that a reduction in WER (2.67% & 8.32% for mild and moderate dysarthria) is achieved when moderate amount of augmented data from multiple synthesizers (Fastspeech2 & Tacotron2) is used for training. These results demonstrate the effectiveness of TTS-based data augmentation in improving DASR performance.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101961"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TriTSP: A triangular joint reasoning networks for target–stance prediction","authors":"JiaYu Zhang, HongLi Zhang, ChunYu Liu, ZeShu Tian, Chao Meng, YuXiang Ma","doi":"10.1016/j.csl.2026.101962","DOIUrl":"10.1016/j.csl.2026.101962","url":null,"abstract":"<div><div>Target–stance prediction is a novel task evolved from the traditional stance detection task, aiming to predict the pair of target and stance from each tweet. The target–stance prediction task is currently solved by the two-stage method. Although this method effectively alleviates the dependence on manually labeled target information, the errors generated in the first-stage target identification task will directly have a negative impact on the performance of the second-stage stance detection task, resulting in obvious error cascades. Moreover, it is difficult to establish effective feature interactions between the two subtasks. To tackle the above problems, we propose a triangular joint reasoning model named TriTSP. The proposed model unifies the target features and stance features in the joint prediction manner to capture the correlations and interactions between them. Furthermore, inspired by the way humans express stances, we incorporate expanded stance triangle framework into our model to infer the specified target–stance pair through the explicit pairs contained in social media. Our proposed model not only eliminates error cascades, but also effectively improves the performance of the target–stance prediction task. Experiments on two benchmark datasets demonstrate that our proposed model has significant advantages over the current state-of-the-art models.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101962"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring efficient attention strategies in conformer-based sound event detection","authors":"Sara Barahona, Juan Ignacio Alvarez-Trejos, Alicia Lozano-Diez, Daniel Ramos, Doroteo T. Toledano","doi":"10.1016/j.csl.2026.101967","DOIUrl":"10.1016/j.csl.2026.101967","url":null,"abstract":"<div><div>Sound Event Detection (SED) requires models that can accurately localize and classify overlapping audio events within complex acoustic environments. Conformer-based architectures have demonstrated promising performance by leveraging self-attention to capture long-range dependencies. However, this global attention can be accumulated across layers, which can blur local temporal boundaries and reduce detection accuracy, especially for short or closely spaced events. While increasing the input sequence length can help recover temporal detail, the quadratic complexity of Conformers’ self-attention significantly increases computational costs. To address this, we propose integrating the Efficient Conformer architecture, which introduces subsampling along the input sequence length, effectively reducing the temporal dimension within blocks. This design enables processing longer input sequences at finer temporal resolution, enhancing localization accuracy without extending output length. Using the DCASE Challenge 2023 Task 4 benchmark, system performance is evaluated via the threshold-independent Polyphonic Sound Detection Score (PSDS), measuring both localization precision (PSDS1) and class robustness (PSDS2). Experiments on the DESED validation dataset demonstrate that the Efficient Conformer not only improves temporal resolution and long-range dependency modeling, but also outperforms standard Conformer and Convolutional Recurrent Neural Network (CRNN) baselines in PSDS2. Additionally, we explore lightweight attention mechanisms employing squeeze-and-excitation blocks to emulate frequency-axis translation invariance of Frequency Dynamic Convolutions (FDY). Our approach achieves performance comparable to heavier models like FDY+Conformer, while reducing computational cost by over 69%, showing promising results for Conformer-based systems in terms of precision and model efficiency.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101967"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LaFresCat: A studio-quality Catalan multi-accent speech dataset for text-to-speech synthesis","authors":"Alex Peiró-Lilja , Carme Armentano-Oller , José Giraldo , Wendy Elvira-García , Ignasi Esquerra , Rodolfo Zevallos , Cristina España-Bonet , Martí Llopart-Font , Baybars Külebi , Mireia Farrús","doi":"10.1016/j.csl.2026.101945","DOIUrl":"10.1016/j.csl.2026.101945","url":null,"abstract":"<div><div>Current text-to-speech (TTS) systems are capable of learning the phonetics of a language accurately given that the speech data used to train such models covers all phonetic phenomena. For languages with different varieties, this includes all their richness and accents. This is the case of Catalan, a mid-resourced language with several dialects or accents. Although there are various publicly available corpora, there is a lack of high-quality open-access data for speech technologies covering its variety of accents. Common Voice includes recordings of Catalan speakers from different regions; however, accent labeling has been shown to be inaccurate, and artificially enhanced samples may be unsuitable for TTS. To address these limitations, we present LaFresCat, the first studio-quality Catalan multi-accent dataset. LaFresCat comprises 3.5 h of professionally recording speech covering four of the most prominent Catalan accents: Balearic, Central, North-Western, and Valencian. In this work, we provide a detailed description of the dataset design: utterances were selected to be phonetically balanced, detailed speaker instructions were provided, native speakers from the regions corresponding to the Catalan accents were hired, and the recordings were formatted and post-processed. The resulting dataset, LaFresCat, is publicly available. To preliminarily evaluate the dataset, we trained and assessed a lightweight flow-based TTS system, which is also provided as a by-product. We also analyzed LaFresCat samples and the corresponding TTS-generated samples at the phonetic level, employing expert annotations and Pillai scores to quantify acoustic vowel overlap. Preliminary results suggest a significant improvement in predicted mean opinion score (UTMOS), with an increase of 0.42 points when the TTS system is fine-tuned on LaFresCat rather than trained from scratch, starting from a pre-trained version based on Central Catalan data from Common Voice. Subsequent human expert annotations achieved nearly 90% accuracy in accent classification for LaFresCat recordings. However, although the TTS tends to homogenize pronunciation, it still learns distinct dialectal patterns. This assessment offers key insights for establishing a baseline to guide future evaluations of Catalan multi-accent TTS systems and further studies of LaFresCat.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101945"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146147489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}