{"title":"Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data","authors":"Jing Xu, Daxin Tan, Jiaqi Wang, Xiao Chen","doi":"arxiv-2409.10969","DOIUrl":"https://doi.org/arxiv-2409.10969","url":null,"abstract":"While large language models (LLMs) have been explored in the speech domain\u0000for both generation and recognition tasks, their applications are predominantly\u0000confined to the monolingual scenario, with limited exploration in multilingual\u0000and code-switched (CS) contexts. Additionally, speech generation and\u0000recognition tasks are often handled separately, such as VALL-E and Qwen-Audio.\u0000In this paper, we propose a MutltiLingual MultiTask (MLMT) model, integrating\u0000multilingual speech generation and recognition tasks within the single LLM.\u0000Furthermore, we develop an effective data construction approach that splits and\u0000concatenates words from different languages to equip LLMs with CS synthesis\u0000ability without relying on CS data. The experimental results demonstrate that\u0000our model outperforms other baselines with a comparable data scale.\u0000Furthermore, our data construction approach not only equips LLMs with CS speech\u0000synthesis capability with comparable speaker consistency and similarity to any\u0000given speaker, but also improves the performance of LLMs in multilingual speech\u0000generation and recognition tasks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy","authors":"Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Yuki Uranishi","doi":"arxiv-2409.10848","DOIUrl":"https://doi.org/arxiv-2409.10848","url":null,"abstract":"Audio-driven 3D facial animation has made immersive progress both in research\u0000and application developments. The newest approaches focus on Transformer-based\u0000methods and diffusion-based methods, however, there is still gap in the\u0000vividness and emotional expression between the generated animation and real\u0000human face. To tackle this limitation, we propose 3DFacePolicy, a diffusion\u0000policy model for 3D facial animation prediction. This method generates variable\u0000and realistic human facial movements by predicting the 3D vertex trajectory on\u0000the 3D facial template with diffusion policy instead of facial generation for\u0000every frame. It takes audio and vertex states as observations to predict the\u0000vertex trajectory and imitate real human facial expressions, which keeps the\u0000continuous and natural flow of human emotions. The experiments show that our\u0000approach is effective in variable and dynamic facial motion synthesizing.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spontaneous Informal Speech Dataset for Punctuation Restoration","authors":"Xing Yi Liu, Homayoon Beigi","doi":"arxiv-2409.11241","DOIUrl":"https://doi.org/arxiv-2409.11241","url":null,"abstract":"Presently, punctuation restoration models are evaluated almost solely on\u0000well-structured, scripted corpora. On the other hand, real-world ASR systems\u0000and post-processing pipelines typically apply towards spontaneous speech with\u0000significant irregularities, stutters, and deviations from perfect grammar. To\u0000address this discrepancy, we introduce SponSpeech, a punctuation restoration\u0000dataset derived from informal speech sources, which includes punctuation and\u0000casing information. In addition to publicly releasing the dataset, we\u0000contribute a filtering pipeline that can be used to generate more data. Our\u0000filtering pipeline examines the quality of both speech audio and transcription\u0000text. We also carefully construct a ``challenging\" test set, aimed at\u0000evaluating models' ability to leverage audio information to predict otherwise\u0000grammatically ambiguous punctuation. SponSpeech is available at\u0000https://github.com/GitHubAccountAnonymous/PR, along with all code for dataset\u0000building and model runs.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Room impulse response prototyping using receiver distance estimations for high quality room equalisation algorithms","authors":"James Brooks-Park, Martin Bo Møller, Jan Østergaard, Søren Bech, Steven van de Par","doi":"arxiv-2409.10131","DOIUrl":"https://doi.org/arxiv-2409.10131","url":null,"abstract":"Room equalisation aims to increase the quality of loudspeaker reproduction in\u0000reverberant environments, compensating for colouration caused by imperfect room\u0000reflections and frequency dependant loudspeaker directivity. A common technique\u0000in the field of room equalisation, is to invert a prototype Room Impulse\u0000Response (RIR). Rather than inverting a single RIR at the listening position, a\u0000prototype response is composed of several responses distributed around the\u0000listening area. This paper proposes a method of impulse response prototyping,\u0000using estimated receiver positions, to form a weighted average prototype\u0000response. A method of receiver distance estimation is described, supporting the\u0000implementation of the prototype RIR. The proposed prototyping method is\u0000compared to other methods by measuring their post equalisation spectral\u0000deviation at several positions in a simulated room.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement","authors":"Wenze Ren, Haibin Wu, Yi-Cheng Lin, Xuanjun Chen, Rong Chao, Kuo-Hsuan Hung, You-Jin Li, Wen-Yuan Ting, Hsin-Min Wang, Yu Tsao","doi":"arxiv-2409.10376","DOIUrl":"https://doi.org/arxiv-2409.10376","url":null,"abstract":"In multichannel speech enhancement, effectively capturing spatial and\u0000spectral information across different microphones is crucial for noise\u0000reduction. Traditional methods, such as CNN or LSTM, attempt to model the\u0000temporal dynamics of full-band and sub-band spectral and spatial features.\u0000However, these approaches face limitations in fully modeling complex temporal\u0000dependencies, especially in dynamic acoustic environments. To overcome these\u0000challenges, we modify the current advanced model McNet by introducing an\u0000improved version of Mamba, a state-space model, and further propose MCMamba.\u0000MCMamba has been completely reengineered to integrate full-band and narrow-band\u0000spatial information with sub-band and full-band spectral features, providing a\u0000more comprehensive approach to modeling spatial and spectral information. Our\u0000experimental results demonstrate that MCMamba significantly improves the\u0000modeling of spatial and spectral features in multichannel speech enhancement,\u0000outperforming McNet and achieving state-of-the-art performance on the CHiME-3\u0000dataset. Additionally, we find that Mamba performs exceptionally well in\u0000modeling spectral information.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating Training Objectives for Generative Speech Enhancement","authors":"Julius Richter, Danilo de Oliveira, Timo Gerkmann","doi":"arxiv-2409.10753","DOIUrl":"https://doi.org/arxiv-2409.10753","url":null,"abstract":"Generative speech enhancement has recently shown promising advancements in\u0000improving speech quality in noisy environments. Multiple diffusion-based\u0000frameworks exist, each employing distinct training objectives and learning\u0000techniques. This paper aims at explaining the differences between these\u0000frameworks by focusing our investigation on score-based generative models and\u0000Schr\"odinger bridge. We conduct a series of comprehensive experiments to\u0000compare their performance and highlight differing training behaviors.\u0000Furthermore, we propose a novel perceptual loss function tailored for the\u0000Schr\"odinger bridge framework, demonstrating enhanced performance and improved\u0000perceptual quality of the enhanced speech signals. All experimental code and\u0000pre-trained models are publicly available to facilitate further research and\u0000development in this.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"oboVox Far Field Speaker Recognition: A Novel Data Augmentation Approach with Pretrained Models","authors":"Muhammad Sudipto Siam Dip, Md Anik Hasan, Sapnil Sarker Bipro, Md Abdur Raiyan, Mohammod Abdul Motin","doi":"arxiv-2409.10240","DOIUrl":"https://doi.org/arxiv-2409.10240","url":null,"abstract":"In this study, we address the challenge of speaker recognition using a novel\u0000data augmentation technique of adding noise to enrollment files. This technique\u0000efficiently aligns the sources of test and enrollment files, enhancing\u0000comparability. Various pre-trained models were employed, with the resnet model\u0000achieving the highest DCF of 0.84 and an EER of 13.44. The augmentation\u0000technique notably improved these results to 0.75 DCF and 12.79 EER for the\u0000resnet model. Comparative analysis revealed the superiority of resnet over\u0000models such as ECPA, Mel-spectrogram, Payonnet, and Titanet large. Results,\u0000along with different augmentation schemes, contribute to the success of RoboVox\u0000far-field speaker recognition in this paper","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RF-GML: Reference-Free Generative Machine Listener","authors":"Arijit Biswas, Guanxin Jiang","doi":"arxiv-2409.10210","DOIUrl":"https://doi.org/arxiv-2409.10210","url":null,"abstract":"This paper introduces a novel reference-free (RF) audio quality metric called\u0000the RF-Generative Machine Listener (RF-GML), designed to evaluate coded mono,\u0000stereo, and binaural audio at a 48 kHz sample rate. RF-GML leverages transfer\u0000learning from a state-of-the-art full-reference (FR) Generative Machine\u0000Listener (GML) with minimal architectural modifications. The term \"generative\"\u0000refers to the model's ability to generate an arbitrary number of simulated\u0000listening scores. Unlike existing RF models, RF-GML accurately predicts\u0000subjective quality scores across diverse content types and codecs. Extensive\u0000evaluations demonstrate its superiority in rating unencoded audio and\u0000distinguishing different levels of coding artifacts. RF-GML's performance and\u0000versatility make it a valuable tool for coded audio quality assessment and\u0000monitoring in various applications, all without the need for a reference\u0000signal.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages","authors":"Ming-Hao Hsu, Kuan Po Huang, Hung-yi Lee","doi":"arxiv-2409.10429","DOIUrl":"https://doi.org/arxiv-2409.10429","url":null,"abstract":"This paper presents Meta-Whisper, a novel approach to improve automatic\u0000speech recognition (ASR) for low-resource languages using the Whisper model. By\u0000leveraging Meta In-Context Learning (Meta-ICL) and a k-Nearest Neighbors (KNN)\u0000algorithm for sample selection, Meta-Whisper enhances Whisper's ability to\u0000recognize speech in unfamiliar languages without extensive fine-tuning.\u0000Experiments on the ML-SUPERB dataset show that Meta-Whisper significantly\u0000reduces the Character Error Rate (CER) for low-resource languages compared to\u0000the original Whisper model. This method offers a promising solution for\u0000developing more adaptable multilingual ASR systems, particularly for languages\u0000with limited resources.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems","authors":"Hitesh Tulsiani, David M. Chan, Shalini Ghosh, Garima Lalwani, Prabhat Pandey, Ankish Bansal, Sri Garimella, Ariya Rastrow, Björn Hoffmeister","doi":"arxiv-2409.10515","DOIUrl":"https://doi.org/arxiv-2409.10515","url":null,"abstract":"Dialog systems, such as voice assistants, are expected to engage with users\u0000in complex, evolving conversations. Unfortunately, traditional automatic speech\u0000recognition (ASR) systems deployed in such applications are usually trained to\u0000recognize each turn independently and lack the ability to adapt to the\u0000conversational context or incorporate user feedback. In this work, we introduce\u0000a general framework for ASR in dialog systems that can go beyond learning from\u0000single-turn utterances and learn over time how to adapt to both explicit\u0000supervision and implicit user feedback present in multi-turn conversations. We\u0000accomplish that by leveraging advances in student-teacher learning and\u0000context-aware dialog processing, and designing contrastive self-supervision\u0000approaches with Ohm, a new online hard-negative mining approach. We show that\u0000leveraging our new framework compared to traditional training leads to relative\u0000WER reductions of close to 10% in real-world dialog systems, and up to 26% on\u0000public synthetic data.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}