Manasi Chhibber, Jagabandhu Mishra, Hyejin Shim, Tomi H. Kinnunen
{"title":"An Explainable Probabilistic Attribute Embedding Approach for Spoofed Speech Characterization","authors":"Manasi Chhibber, Jagabandhu Mishra, Hyejin Shim, Tomi H. Kinnunen","doi":"arxiv-2409.11027","DOIUrl":"https://doi.org/arxiv-2409.11027","url":null,"abstract":"We propose a novel approach for spoofed speech characterization through explainable probabilistic attribute embeddings. In contrast to high-dimensional raw embeddings extracted from a spoofing countermeasure (CM) whose dimensions are not easy to interpret, the probabilistic attributes are designed to gauge the presence or absence of sub-components that make up a specific spoofing attack. These attributes are then applied to two downstream tasks: spoofing detection and attack attribution. To extend interpretability to the back-end as well, we adopt a decision tree classifier. Our experiments on the ASVspoof2019 dataset with spoof CM embeddings extracted from three models (AASIST, Rawboost-AASIST, SSL-AASIST) suggest that the performance of the attribute embeddings is on par with the original raw spoof CM embeddings for both tasks. The best performance achieved with the proposed approach for spoofing detection and attack attribution, in terms of accuracy, is 99.7% and 99.2%, respectively, compared to 99.7% and 94.7% using the raw CM embeddings. To analyze the relative contribution of each attribute, we estimate their Shapley values. Attributes related to acoustic feature prediction, waveform generation (vocoder), and speaker modeling are found to be important for spoofing detection, while duration modeling, vocoder, and input type play a role in spoofing attack attribution.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley
{"title":"PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing","authors":"Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley","doi":"arxiv-2409.10831","DOIUrl":"https://doi.org/arxiv-2409.10831","url":null,"abstract":"The recent explosion of generative AI-Music systems has raised numerous concerns over data copyright, licensing music from musicians, and the conflict between open-source AI and large prestige companies. Such issues highlight the need for publicly available, copyright-free musical data, of which there is a large shortage, particularly for symbolic music data. To alleviate this issue, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it the largest available copyright-free symbolic music dataset to our knowledge. PDMX additionally includes a wealth of both tag and user interaction metadata, allowing us to efficiently analyze the dataset and filter for high-quality user-generated scores. Given the additional metadata afforded by our data collection process, we conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality. Examples can be found at https://pnlong.github.io/PDMX.demo/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"210 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hsi-Che Lin, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee
{"title":"Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection","authors":"Hsi-Che Lin, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee","doi":"arxiv-2409.10985","DOIUrl":"https://doi.org/arxiv-2409.10985","url":null,"abstract":"Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. However, building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. In this paper, we propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages. Specifically, we employ expressive Speech-to-Speech translation (S2ST) combined with a novel bootstrapping data selection pipeline to generate labeled data in the target language. Extensive experiments demonstrate that our method is both effective and generalizable across different upstream models and languages. Our results suggest that this approach can facilitate the development of more scalable and robust multilingual SER systems.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LC-Protonets: Multi-label Few-shot learning for world music audio tagging","authors":"Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos","doi":"arxiv-2409.11264","DOIUrl":"https://doi.org/arxiv-2409.11264","url":null,"abstract":"We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification, where a model must generalize to new classes based on only a few available examples. Extending Prototypical Networks, LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items, rather than one prototype per label. Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music, and is evaluated against existing approaches in the literature. The results demonstrate a significant performance improvement in almost all domains and training setups when using LC-Protonets for multi-label classification. In addition to training a few-shot learning model from scratch, we explore the use of a pre-trained model, obtained via supervised learning, to embed items in the feature space. Fine-tuning improves the generalization ability of all methods, yet LC-Protonets achieve high-level performance even without fine-tuning, in contrast to the comparative approaches. We finally analyze the scalability of the proposed method, providing detailed quantitative metrics from our experiments. The implementation and experimental setup are made publicly available, offering a benchmark for future research.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tushar Dhyani, Florian Lux, Michele Mancusi, Giorgio Fabbro, Fritz Hohl, Ngoc Thang Vu
{"title":"High-Resolution Speech Restoration with Latent Diffusion Model","authors":"Tushar Dhyani, Florian Lux, Michele Mancusi, Giorgio Fabbro, Fritz Hohl, Ngoc Thang Vu","doi":"arxiv-2409.11145","DOIUrl":"https://doi.org/arxiv-2409.11145","url":null,"abstract":"Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48 kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high-resolution speech restoration.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gabriel Bibbó, Thomas Deacon, Arshdeep Singh, Mark D. Plumbley
{"title":"The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection","authors":"Gabriel Bibbó, Thomas Deacon, Arshdeep Singh, Mark D. Plumbley","doi":"arxiv-2409.11262","DOIUrl":"https://doi.org/arxiv-2409.11262","url":null,"abstract":"This paper presents a residential audio dataset to support sound event detection research for smart home applications aimed at promoting wellbeing for older adults. The dataset is constructed by deploying audio recording systems in the homes of 8 participants aged 55-80 years for a 7-day period. Acoustic characteristics are documented through detailed floor plans and construction material information to enable replication of the recording environments for AI model deployment. A novel automated speech removal pipeline is developed, using pre-trained audio neural networks to detect and remove segments containing spoken voice, while preserving segments containing other sound events. The resulting dataset consists of privacy-compliant audio recordings that accurately capture the soundscapes and activities of daily living within residential spaces. The paper details the dataset creation methodology, the speech removal pipeline utilizing cascaded model architectures, and an analysis of the vocal label distribution to validate the speech removal process. This dataset enables the development and benchmarking of sound event detection models tailored specifically for in-home applications.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Source Disentanglement in Neural Audio Codec","authors":"Xiaoyu Bie, Xubo Liu, Gaël Richard","doi":"arxiv-2409.11228","DOIUrl":"https://doi.org/arxiv-2409.11228","url":null,"abstract":"Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks (sets of discrete representations). Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codecs and potentially providing finer control over the audio generation process.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tejes Srivastava, Ju-Chieh Chou, Priyank Shroff, Karen Livescu, Christopher Graziul
{"title":"Speech Recognition for Analysis of Police Radio Communication","authors":"Tejes Srivastava, Ju-Chieh Chou, Priyank Shroff, Karen Livescu, Christopher Graziul","doi":"arxiv-2409.10858","DOIUrl":"https://doi.org/arxiv-2409.10858","url":null,"abstract":"Police departments around the world use two-way radio for coordination. These broadcast police communications (BPC) are a unique source of information about everyday police activity and emergency response. Yet BPC are not transcribed, and their naturalistic audio properties make automatic transcription challenging. We collect a corpus of roughly 62,000 manually transcribed radio transmissions (~46 hours of audio) to evaluate the feasibility of automatic speech recognition (ASR) using modern recognition models. We evaluate the performance of off-the-shelf speech recognizers, models fine-tuned on BPC data, and customized end-to-end models. We find that both human and machine transcription are challenging in this domain. Large off-the-shelf ASR models perform poorly, but fine-tuned models can reach the approximate range of human performance. Our work suggests directions for future work, including analysis of short utterances and potential miscommunication in police radio interactions. We make our corpus and data annotation pipeline available to other researchers, to enable further research on recognition and analysis of police communication.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"96 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia
{"title":"Learning Spatially-Aware Language and Audio Embedding","authors":"Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia","doi":"arxiv-2409.11369","DOIUrl":"https://doi.org/arxiv-2409.11369","url":null,"abstract":"Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like \"the lion roar came from right behind me!\". For a machine to have the same degree of comprehension, the machine must know what a lion is (semantic attribute), what the concept of \"behind\" is (spatial attribute) and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it's coming from behind). State-of-the-art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs, and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2m) rather than a position described using natural language (e.g., \"next to me\"). To address these gaps, we present ELSA, a spatially aware audio and text embedding model trained using multimodal contrastive learning. ELSA supports non-spatial audio, spatial audio, and open vocabulary text captions describing both the spatial and semantic components of sound. To train ELSA: (a) we spatially augment the audio and captions of three open-source audio datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture the semantics of non-spatial audio, and the semantics and spatial attributes of spatial audio using contrastive learning. ELSA is competitive with state-of-the-art for both semantic retrieval and 3D source localization. In particular, ELSA achieves +2.8% mean audio-to-text and text-to-audio R@1 above the baseline, and outperforms the baseline by -11.6° mean absolute error in 3D source localization.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francesco Nespoli, Daniel Barreda, Patrick A. Naylor
{"title":"Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora","authors":"Francesco Nespoli, Daniel Barreda, Patrick A. Naylor","doi":"arxiv-2409.11107","DOIUrl":"https://doi.org/arxiv-2409.11107","url":null,"abstract":"In recent years, automatic speech recognition (ASR) models have greatly improved transcription performance both in clean, low-noise acoustic conditions and in reverberant environments. However, all these systems rely on the availability of hundreds of hours of labelled training data in specific acoustic conditions. When such a training dataset is not available, the performance of the system is heavily impacted. For example, this happens when a specific acoustic environment or a particular population of speakers is under-represented in the training dataset. Specifically, in this paper we investigate the effect of accented speech data on an off-the-shelf ASR system. Furthermore, we suggest a strategy based on zero-shot text-to-speech to augment the accented speech corpora. We show that this augmentation method is able to mitigate the loss in performance of the ASR system on accented data by up to 5% word error rate reduction (WERR). In conclusion, we demonstrate that by combining a modest fraction of real data with synthetically generated data, the ASR system exhibits superior performance compared to a model trained exclusively on authentic accented speech, with up to 14% WERR.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}