{"title":"Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text","authors":"Hongfei Xue, Wei Ren, Xuelong Geng, Kun Wei, Longhao Li, Qijie Shao, Linju Yang, Kai Diao, Lei Xie","doi":"arxiv-2409.11214","DOIUrl":"https://doi.org/arxiv-2409.11214","url":null,"abstract":"Integrating audio encoders with LLMs through connectors has enabled these\u0000models to process and comprehend audio modalities, significantly enhancing\u0000speech-to-text tasks, including automatic speech recognition (ASR) and\u0000automatic speech translation (AST). However, these methods often overlook the\u0000critical aspect of language adaptation in multilingual settings, relying\u0000instead on multilingual data without adequately addressing language\u0000differences. To address this gap, we propose the Ideal-LLM model, which employs\u0000dual multilingual encoders to enrich language feature information and utilizes\u0000a language-adapted connector to target the adaptation of each language\u0000specifically. By leveraging the complementary strengths of Whisper and MMS\u0000encoders, our approach ensures richer multilingual representations.\u0000Additionally, the language-adapted connector enhances modal transformation via\u0000a language weight selector tailored for each language. Experimental results\u0000demonstrate that Ideal-LLM significantly improves ASR performance, achieving a\u000032.6% relative reduction in average word error rates compared to the standard\u0000speech encoder integrated with LLMs and yields an average BLEU score of 36.78\u0000for AST task.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning","authors":"Ilaria Manco, Justin Salamon, Oriol Nieto","doi":"arxiv-2409.11498","DOIUrl":"https://doi.org/arxiv-2409.11498","url":null,"abstract":"Audio-text contrastive models have become a powerful approach in music\u0000representation learning. Despite their empirical success, however, little is\u0000known about the influence of key design choices on the quality of music-text\u0000representations learnt through this framework. In this work, we expose these\u0000design choices within the constraints of limited data and computation budgets,\u0000and establish a more solid understanding of their impact grounded in empirical\u0000observations along three axes: the choice of base encoders, the level of\u0000curation in training data, and the use of text augmentation. We find that data\u0000curation is the single most important factor for music-text contrastive\u0000training in resource-constrained scenarios. Motivated by this insight, we\u0000introduce two novel techniques, Augmented View Dropout and TextSwap, which\u0000increase the diversity and descriptiveness of text inputs seen in training.\u0000Through our experiments we demonstrate that these are effective at boosting\u0000performance across different pre-training regimes, model architectures, and\u0000downstream data distributions, without incurring higher computational costs or\u0000requiring additional training data.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer","authors":"Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu","doi":"arxiv-2409.10819","DOIUrl":"https://doi.org/arxiv-2409.10819","url":null,"abstract":"Latent diffusion models have shown promising results in text-to-audio (T2A)\u0000generation tasks, yet previous models have encountered difficulties in\u0000generation quality, computational cost, diffusion sampling, and data\u0000preparation. In this paper, we introduce EzAudio, a transformer-based T2A\u0000diffusion model, to handle these challenges. Our approach includes several key\u0000innovations: (1) We build the T2A model on the latent space of a 1D waveform\u0000Variational Autoencoder (VAE), avoiding the complexities of handling 2D\u0000spectrogram representations and using an additional neural vocoder. (2) We\u0000design an optimized diffusion transformer architecture specifically tailored\u0000for audio latent representations and diffusion modeling, which enhances\u0000convergence speed, training stability, and memory usage, making the training\u0000process easier and more efficient. (3) To tackle data scarcity, we adopt a\u0000data-efficient training strategy that leverages unlabeled data for learning\u0000acoustic dependencies, audio caption data annotated by audio-language models\u0000for text-to-audio alignment learning, and human-labeled data for fine-tuning.\u0000(4) We introduce a classifier-free guidance (CFG) rescaling method that\u0000simplifies EzAudio by achieving strong prompt alignment while preserving great\u0000audio quality when using larger CFG scores, eliminating the need to struggle\u0000with finding the optimal CFG score to balance this trade-off. EzAudio surpasses\u0000existing open-source models in both objective metrics and subjective\u0000evaluations, delivering realistic listening experiences while maintaining a\u0000streamlined model structure, low training costs, and an easy-to-follow training\u0000pipeline. Code, data, and pre-trained models are released at:\u0000https://haidog-yaqub.github.io/EzAudio-Page/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection","authors":"Hsi-Che Lin, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee","doi":"arxiv-2409.10985","DOIUrl":"https://doi.org/arxiv-2409.10985","url":null,"abstract":"Speech Emotion Recognition (SER) is a crucial component in developing\u0000general-purpose AI agents capable of natural human-computer interaction.\u0000However, building robust multilingual SER systems remains challenging due to\u0000the scarcity of labeled data in languages other than English and Chinese. In\u0000this paper, we propose an approach to enhance SER performance in low SER\u0000resource languages by leveraging data from high-resource languages.\u0000Specifically, we employ expressive Speech-to-Speech translation (S2ST) combined\u0000with a novel bootstrapping data selection pipeline to generate labeled data in\u0000the target language. Extensive experiments demonstrate that our method is both\u0000effective and generalizable across different upstream models and languages. Our\u0000results suggest that this approach can facilitate the development of more\u0000scalable and robust multilingual SER systems.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LC-Protonets: Multi-label Few-shot learning for world music audio tagging","authors":"Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos","doi":"arxiv-2409.11264","DOIUrl":"https://doi.org/arxiv-2409.11264","url":null,"abstract":"We introduce Label-Combination Prototypical Networks (LC-Protonets) to\u0000address the problem of multi-label few-shot classification, where a model must\u0000generalize to new classes based on only a few available examples. Extending\u0000Prototypical Networks, LC-Protonets generate one prototype per label\u0000combination, derived from the power set of labels present in the limited\u0000training items, rather than one prototype per label. Our method is applied to\u0000automatic audio tagging across diverse music datasets, covering various\u0000cultures and including both modern and traditional music, and is evaluated\u0000against existing approaches in the literature. The results demonstrate a\u0000significant performance improvement in almost all domains and training setups\u0000when using LC-Protonets for multi-label classification. In addition to training\u0000a few-shot learning model from scratch, we explore the use of a pre-trained\u0000model, obtained via supervised learning, to embed items in the feature space.\u0000Fine-tuning improves the generalization ability of all methods, yet\u0000LC-Protonets achieve high-level performance even without fine-tuning, in\u0000contrast to the comparative approaches. We finally analyze the scalability of\u0000the proposed method, providing detailed quantitative metrics from our\u0000experiments. The implementation and experimental setup are made publicly\u0000available, offering a benchmark for future research.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Resolution Speech Restoration with Latent Diffusion Model","authors":"Tushar Dhyani, Florian Lux, Michele Mancusi, Giorgio Fabbro, Fritz Hohl, Ngoc Thang Vu","doi":"arxiv-2409.11145","DOIUrl":"https://doi.org/arxiv-2409.11145","url":null,"abstract":"Traditional speech enhancement methods often oversimplify the task of\u0000restoration by focusing on a single type of distortion. Generative models that\u0000handle multiple distortions frequently struggle with phone reconstruction and\u0000high-frequency harmonics, leading to breathing and gasping artifacts that\u0000reduce the intelligibility of reconstructed speech. These models are also\u0000computationally demanding, and many solutions are restricted to producing\u0000outputs in the wide-band frequency range, which limits their suitability for\u0000professional applications. To address these challenges, we propose Hi-ResLDM, a\u0000novel generative model based on latent diffusion designed to remove multiple\u0000distortions and restore speech recordings to studio quality, sampled at 48kHz.\u0000We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and\u0000Conditional Flow Matching (CFM) components, demonstrating superior performance\u0000in regenerating high-frequency-band details. Hi-ResLDM not only excels in\u0000non-instrusive metrics but is also consistently preferred in human evaluation\u0000and performs competitively on intrusive evaluations, making it ideal for\u0000high-resolution speech restoration.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection","authors":"Gabriel Bibbó, Thomas Deacon, Arshdeep Singh, Mark D. Plumbley","doi":"arxiv-2409.11262","DOIUrl":"https://doi.org/arxiv-2409.11262","url":null,"abstract":"This paper presents a residential audio dataset to support sound event\u0000detection research for smart home applications aimed at promoting wellbeing for\u0000older adults. The dataset is constructed by deploying audio recording systems\u0000in the homes of 8 participants aged 55-80 years for a 7-day period. Acoustic\u0000characteristics are documented through detailed floor plans and construction\u0000material information to enable replication of the recording environments for AI\u0000model deployment. A novel automated speech removal pipeline is developed, using\u0000pre-trained audio neural networks to detect and remove segments containing\u0000spoken voice, while preserving segments containing other sound events. The\u0000resulting dataset consists of privacy-compliant audio recordings that\u0000accurately capture the soundscapes and activities of daily living within\u0000residential spaces. The paper details the dataset creation methodology, the\u0000speech removal pipeline utilizing cascaded model architectures, and an analysis\u0000of the vocal label distribution to validate the speech removal process. This\u0000dataset enables the development and benchmarking of sound event detection\u0000models tailored specifically for in-home applications.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Source Disentanglement in Neural Audio Codec","authors":"Xiaoyu Bie, Xubo Liu, Gaël Richard","doi":"arxiv-2409.11228","DOIUrl":"https://doi.org/arxiv-2409.11228","url":null,"abstract":"Neural audio codecs have significantly advanced audio compression by\u0000efficiently converting continuous audio signals into discrete tokens. These\u0000codecs preserve high-quality sound and enable sophisticated sound generation\u0000through generative models trained on these tokens. However, existing neural\u0000codec models are typically trained on large, undifferentiated audio datasets,\u0000neglecting the essential discrepancies between sound domains like speech,\u0000music, and environmental sound effects. This oversight complicates data\u0000modeling and poses additional challenges to the controllability of sound\u0000generation. To tackle these issues, we introduce the Source-Disentangled Neural\u0000Audio Codec (SD-Codec), a novel approach that combines audio coding and source\u0000separation. By jointly learning audio resynthesis and separation, SD-Codec\u0000explicitly assigns audio signals from different domains to distinct codebooks,\u0000sets of discrete representations. Experimental results indicate that SD-Codec\u0000not only maintains competitive resynthesis quality but also, supported by the\u0000separation results, demonstrates successful disentanglement of different\u0000sources in the latent space, thereby enhancing interpretability in audio codec\u0000and providing potential finer control over the audio generation process.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech Recognition for Analysis of Police Radio Communication","authors":"Tejes Srivastava, Ju-Chieh Chou, Priyank Shroff, Karen Livescu, Christopher Graziul","doi":"arxiv-2409.10858","DOIUrl":"https://doi.org/arxiv-2409.10858","url":null,"abstract":"Police departments around the world use two-way radio for coordination. These\u0000broadcast police communications (BPC) are a unique source of information about\u0000everyday police activity and emergency response. Yet BPC are not transcribed,\u0000and their naturalistic audio properties make automatic transcription\u0000challenging. We collect a corpus of roughly 62,000 manually transcribed radio\u0000transmissions (~46 hours of audio) to evaluate the feasibility of automatic\u0000speech recognition (ASR) using modern recognition models. We evaluate the\u0000performance of off-the-shelf speech recognizers, models fine-tuned on BPC data,\u0000and customized end-to-end models. We find that both human and machine\u0000transcription is challenging in this domain. Large off-the-shelf ASR models\u0000perform poorly, but fine-tuned models can reach the approximate range of human\u0000performance. Our work suggests directions for future work, including analysis\u0000of short utterances and potential miscommunication in police radio\u0000interactions. We make our corpus and data annotation pipeline available to\u0000other researchers, to enable further research on recognition and analysis of\u0000police communication.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"96 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Spatially-Aware Language and Audio Embedding","authors":"Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia","doi":"arxiv-2409.11369","DOIUrl":"https://doi.org/arxiv-2409.11369","url":null,"abstract":"Humans can picture a sound scene given an imprecise natural language\u0000description. For example, it is easy to imagine an acoustic environment given a\u0000phrase like \"the lion roar came from right behind me!\". For a machine to have\u0000the same degree of comprehension, the machine must know what a lion is\u0000(semantic attribute), what the concept of \"behind\" is (spatial attribute) and\u0000how these pieces of linguistic information align with the semantic and spatial\u0000attributes of the sound (what a roar sounds like when its coming from behind).\u0000State-of-the-art audio foundation models which learn to map between audio\u0000scenes and natural textual descriptions, are trained on non-spatial audio and\u0000text pairs, and hence lack spatial awareness. In contrast, sound event\u0000localization and detection models are limited to recognizing sounds from a\u0000fixed number of classes, and they localize the source to absolute position\u0000(e.g., 0.2m) rather than a position described using natural language (e.g.,\u0000\"next to me\"). To address these gaps, we present ELSA a spatially aware-audio\u0000and text embedding model trained using multimodal contrastive learning. ELSA\u0000supports non-spatial audio, spatial audio, and open vocabulary text captions\u0000describing both the spatial and semantic components of sound. To train ELSA:\u0000(a) we spatially augment the audio and captions of three open-source audio\u0000datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture\u0000the semantics of non-spatial audio, and the semantics and spatial attributes of\u0000spatial audio using contrastive learning. ELSA is competitive with\u0000state-of-the-art for both semantic retrieval and 3D source localization. In\u0000particular, ELSA achieves +2.8% mean audio-to-text and text-to-audio R@1 above\u0000the baseline, and outperforms by -11.6{deg} mean-absolute-error in 3D source\u0000localization over the baseline.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}