{"title":"Discrete Unit based Masking for Improving Disentanglement in Voice Conversion","authors":"Philip H. Lee, Ismail Rasim Ulgen, Berrak Sisman","doi":"arxiv-2409.11560","DOIUrl":"https://doi.org/arxiv-2409.11560","url":null,"abstract":"Voice conversion (VC) aims to modify the speaker's identity while preserving\u0000the linguistic content. Commonly, VC methods use an encoder-decoder\u0000architecture, where disentangling the speaker's identity from linguistic\u0000information is crucial. However, the disentanglement approaches used in these\u0000methods are limited as the speaker features depend on the phonetic content of\u0000the utterance, compromising disentanglement. This dependency is amplified with\u0000attention-based methods. To address this, we introduce a novel masking\u0000mechanism in the input before speaker encoding, masking certain discrete speech\u0000units that correspond highly with phoneme classes. Our work aims to reduce the\u0000phonetic dependency of speaker features by restricting access to some phonetic\u0000information. Furthermore, since our approach is at the input level, it is\u0000applicable to any encoder-decoder based VC framework. Our approach improves\u0000disentanglement and conversion performance across multiple VC methods, showing\u0000significant effectiveness, particularly in attention-based method, with 44%\u0000relative improvement in objective intelligibility.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey","authors":"Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D. Yao, Shi-Xiong Zhang, Sambit Sahu","doi":"arxiv-2409.11564","DOIUrl":"https://doi.org/arxiv-2409.11564","url":null,"abstract":"Preference tuning is a crucial process for aligning deep generative models\u0000with human preferences. This survey offers a thorough overview of recent\u0000advancements in preference tuning and the integration of human feedback. The\u0000paper is organized into three main sections: 1) introduction and preliminaries:\u0000an introduction to reinforcement learning frameworks, preference tuning tasks,\u0000models, and datasets across various modalities: language, speech, and vision,\u0000as well as different policy approaches, 2) in-depth examination of each\u0000preference tuning approach: a detailed analysis of the methods used in\u0000preference tuning, and 3) applications, discussion, and future directions: an\u0000exploration of the applications of preference tuning in downstream tasks,\u0000including evaluation methods for different modalities, and an outlook on future\u0000research directions. Our objective is to present the latest methodologies in\u0000preference tuning and model alignment, enhancing the understanding of this\u0000field for researchers and practitioners. We hope to encourage further\u0000engagement and innovation in this area.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses","authors":"Yufeng Yang, Desh Raj, Ju Lin, Niko Moritz, Junteng Jia, Gil Keren, Egor Lakomkin, Yiteng Huang, Jacob Donley, Jay Mahadeokar, Ozlem Kalinli","doi":"arxiv-2409.11494","DOIUrl":"https://doi.org/arxiv-2409.11494","url":null,"abstract":"The growing popularity of multi-channel wearable devices, such as smart\u0000glasses, has led to a surge of applications such as targeted speech recognition\u0000and enhanced hearing. However, current approaches to solve these tasks use\u0000independently trained models, which may not benefit from large amounts of\u0000unlabeled data. In this paper, we propose M-BEST-RQ, the first multi-channel\u0000speech foundation model for smart glasses, which is designed to leverage\u0000large-scale self-supervised learning (SSL) in an array-geometry agnostic\u0000approach. While prior work on multi-channel speech SSL only evaluated on\u0000simulated settings, we curate a suite of real downstream tasks to evaluate our\u0000model, namely (i) conversational automatic speech recognition (ASR), (ii)\u0000spherical active source localization, and (iii) glasses wearer voice activity\u0000detection, which are sourced from the MMCSG and EasyCom datasets. We show that\u0000a general-purpose M-BEST-RQ encoder is able to match or surpass supervised\u0000models across all tasks. For the conversational ASR task in particular, using\u0000only 8 hours of labeled speech, our model outperforms a supervised ASR baseline\u0000that is trained on 2000 hours of labeled data, which demonstrates the\u0000effectiveness of our approach.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of pretrained language models on music understanding","authors":"Yannis Vasilakis, Rachel Bittner, Johan Pauwels","doi":"arxiv-2409.11449","DOIUrl":"https://doi.org/arxiv-2409.11449","url":null,"abstract":"Music-text multimodal systems have enabled new approaches to Music\u0000Information Research (MIR) applications such as audio-to-text and text-to-audio\u0000retrieval, text-based song generation, and music captioning. Despite the\u0000reported success, little effort has been put into evaluating the musical\u0000knowledge of Large Language Models (LLM). In this paper, we demonstrate that\u0000LLMs suffer from 1) prompt sensitivity, 2) inability to model negation (e.g.\u0000'rock song without guitar'), and 3) sensitivity towards the presence of\u0000specific words. We quantified these properties as a triplet-based accuracy,\u0000evaluating the ability to model the relative similarity of labels in a\u0000hierarchical ontology. We leveraged the Audioset ontology to generate triplets\u0000consisting of an anchor, a positive (relevant) label, and a negative (less\u0000relevant) label for the genre and instruments sub-tree. We evaluated the\u0000triplet-based musical knowledge for six general-purpose Transformer-based\u0000models. The triplets obtained through this methodology required filtering, as\u0000some were difficult to judge and therefore relatively uninformative for\u0000evaluation purposes. Despite the relatively high accuracy reported,\u0000inconsistencies are evident in all six models, suggesting that off-the-shelf\u0000LLMs need adaptation to music before use.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models","authors":"Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul","doi":"arxiv-2409.10999","DOIUrl":"https://doi.org/arxiv-2409.10999","url":null,"abstract":"Audio language models can understand audio inputs and perform a range of\u0000audio-related tasks based on instructions, such as speech recognition and audio\u0000captioning, where the instructions are usually textual prompts. Audio language\u0000models are mostly initialized from pre-trained audio encoders and large\u0000language models (LLMs). Although these pre-trained components were developed to\u0000support multiple languages, audio-language models are trained predominantly on\u0000English data, which may limit their usability to only English instructions or\u0000English speech inputs. First, this paper examines the performance of existing\u0000audio language models in an underserved language using Thai as an example. This\u0000paper demonstrates that, despite being built on multilingual backbones, audio\u0000language models do not exhibit cross-lingual emergent abilities to low-resource\u0000languages. Second, this paper studies data mixture for developing audio\u0000language models that are optimized for a target language as well as English. In\u0000addition. this paper integrates audio comprehension and speech\u0000instruction-following capabilities into a single unified model. Our experiments\u0000provide insights into data mixture for enhancing instruction-following\u0000capabilities in both a low-resource language and English. Our model,\u0000Typhoon-Audio, outperforms existing open-source audio language models by a\u0000considerable margin, and it is comparable to state-of-the-art Gemini-1.5-Pro in\u0000both English and Thai languages.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation","authors":"Jaime Garcia-Martinez, David Diaz-Guerra, Archontis Politis, Tuomas Virtanen, Julio J. Carabias-Orti, Pedro Vera-Candeas","doi":"arxiv-2409.10995","DOIUrl":"https://doi.org/arxiv-2409.10995","url":null,"abstract":"Recent advancements in music source separation have significantly progressed,\u0000particularly in isolating vocals, drums, and bass elements from mixed tracks.\u0000These developments owe much to the creation and use of large-scale, multitrack\u0000datasets dedicated to these specific components. However, the challenge of\u0000extracting similarly sounding sources from orchestra recordings has not been\u0000extensively explored, largely due to a scarcity of comprehensive and clean (i.e\u0000bleed-free) multitrack datasets. In this paper, we introduce a novel multitrack\u0000dataset called SynthSOD, developed using a set of simulation techniques to\u0000create a realistic (i.e. using high-quality soundfonts), musically motivated,\u0000and heterogeneous training set comprising different dynamics, natural tempo\u0000changes, styles, and conditions. Moreover, we demonstrate the application of a\u0000widely used baseline music separation model trained on our synthesized dataset\u0000w.r.t to the well-known EnsembleSet, and evaluate its performance under both\u0000synthetic and real-world conditions.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WER We Stand: Benchmarking Urdu ASR Models","authors":"Samee Arif, Aamina Jamal Khan, Mustafa Abbas, Agha Ali Raza, Awais Athar","doi":"arxiv-2409.11252","DOIUrl":"https://doi.org/arxiv-2409.11252","url":null,"abstract":"This paper presents a comprehensive evaluation of Urdu Automatic Speech\u0000Recognition (ASR) models. We analyze the performance of three ASR model\u0000families: Whisper, MMS, and Seamless-M4T using Word Error Rate (WER), along\u0000with a detailed examination of the most frequent wrong words and error types\u0000including insertions, deletions, and substitutions. Our analysis is conducted\u0000using two types of datasets, read speech and conversational speech. Notably, we\u0000present the first conversational speech dataset designed for benchmarking Urdu\u0000ASR models. We find that seamless-large outperforms other ASR models on the\u0000read speech dataset, while whisper-large performs best on the conversational\u0000speech dataset. Furthermore, this evaluation highlights the complexities of\u0000assessing ASR models for low-resource languages like Urdu using quantitative\u0000metrics alone and emphasizes the need for a robust Urdu text normalization\u0000system. Our findings contribute valuable insights for developing robust ASR\u0000systems for low-resource languages like Urdu.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text","authors":"Hongfei Xue, Wei Ren, Xuelong Geng, Kun Wei, Longhao Li, Qijie Shao, Linju Yang, Kai Diao, Lei Xie","doi":"arxiv-2409.11214","DOIUrl":"https://doi.org/arxiv-2409.11214","url":null,"abstract":"Integrating audio encoders with LLMs through connectors has enabled these\u0000models to process and comprehend audio modalities, significantly enhancing\u0000speech-to-text tasks, including automatic speech recognition (ASR) and\u0000automatic speech translation (AST). However, these methods often overlook the\u0000critical aspect of language adaptation in multilingual settings, relying\u0000instead on multilingual data without adequately addressing language\u0000differences. To address this gap, we propose the Ideal-LLM model, which employs\u0000dual multilingual encoders to enrich language feature information and utilizes\u0000a language-adapted connector to target the adaptation of each language\u0000specifically. By leveraging the complementary strengths of Whisper and MMS\u0000encoders, our approach ensures richer multilingual representations.\u0000Additionally, the language-adapted connector enhances modal transformation via\u0000a language weight selector tailored for each language. Experimental results\u0000demonstrate that Ideal-LLM significantly improves ASR performance, achieving a\u000032.6% relative reduction in average word error rates compared to the standard\u0000speech encoder integrated with LLMs and yields an average BLEU score of 36.78\u0000for AST task.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning","authors":"Ilaria Manco, Justin Salamon, Oriol Nieto","doi":"arxiv-2409.11498","DOIUrl":"https://doi.org/arxiv-2409.11498","url":null,"abstract":"Audio-text contrastive models have become a powerful approach in music\u0000representation learning. Despite their empirical success, however, little is\u0000known about the influence of key design choices on the quality of music-text\u0000representations learnt through this framework. In this work, we expose these\u0000design choices within the constraints of limited data and computation budgets,\u0000and establish a more solid understanding of their impact grounded in empirical\u0000observations along three axes: the choice of base encoders, the level of\u0000curation in training data, and the use of text augmentation. We find that data\u0000curation is the single most important factor for music-text contrastive\u0000training in resource-constrained scenarios. Motivated by this insight, we\u0000introduce two novel techniques, Augmented View Dropout and TextSwap, which\u0000increase the diversity and descriptiveness of text inputs seen in training.\u0000Through our experiments we demonstrate that these are effective at boosting\u0000performance across different pre-training regimes, model architectures, and\u0000downstream data distributions, without incurring higher computational costs or\u0000requiring additional training data.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer","authors":"Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu","doi":"arxiv-2409.10819","DOIUrl":"https://doi.org/arxiv-2409.10819","url":null,"abstract":"Latent diffusion models have shown promising results in text-to-audio (T2A)\u0000generation tasks, yet previous models have encountered difficulties in\u0000generation quality, computational cost, diffusion sampling, and data\u0000preparation. In this paper, we introduce EzAudio, a transformer-based T2A\u0000diffusion model, to handle these challenges. Our approach includes several key\u0000innovations: (1) We build the T2A model on the latent space of a 1D waveform\u0000Variational Autoencoder (VAE), avoiding the complexities of handling 2D\u0000spectrogram representations and using an additional neural vocoder. (2) We\u0000design an optimized diffusion transformer architecture specifically tailored\u0000for audio latent representations and diffusion modeling, which enhances\u0000convergence speed, training stability, and memory usage, making the training\u0000process easier and more efficient. (3) To tackle data scarcity, we adopt a\u0000data-efficient training strategy that leverages unlabeled data for learning\u0000acoustic dependencies, audio caption data annotated by audio-language models\u0000for text-to-audio alignment learning, and human-labeled data for fine-tuning.\u0000(4) We introduce a classifier-free guidance (CFG) rescaling method that\u0000simplifies EzAudio by achieving strong prompt alignment while preserving great\u0000audio quality when using larger CFG scores, eliminating the need to struggle\u0000with finding the optimal CFG score to balance this trade-off. EzAudio surpasses\u0000existing open-source models in both objective metrics and subjective\u0000evaluations, delivering realistic listening experiences while maintaining a\u0000streamlined model structure, low training costs, and an easy-to-follow training\u0000pipeline. Code, data, and pre-trained models are released at:\u0000https://haidog-yaqub.github.io/EzAudio-Page/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}