Title: Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens
Authors: Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg
arXiv: 2409.06656 (2024-09-10)
Abstract: We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challenge. Most prior end-to-end diarization systems employ permutation invariant loss (PIL), which optimizes for the permutation that yields the lowest error. In contrast, we introduce Sort Loss, which enables a diarization model to autonomously resolve permutation, with or without PIL. We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained exclusively with PIL. Crucially, we present a streamlined multispeaker ASR architecture that leverages Sortformer as a speaker supervision model, embedding speaker label estimation within the ASR encoder state using a sinusoidal kernel function. This approach resolves the speaker permutation problem through sorted objectives, effectively bridging speaker-label timestamps and speaker tokens. In our experiments, we show that the proposed multispeaker ASR architecture, enhanced with speaker supervision, improves performance via adapter techniques. Code and trained models will be made publicly available via the NVIDIA NeMo framework.
{"title":"A Two-Stage Band-Split Mamba-2 Network for Music Separation","authors":"Jinglin Bai, Yuan Fang, Jiajie Wang, Xueliang Zhang","doi":"arxiv-2409.06245","DOIUrl":"https://doi.org/arxiv-2409.06245","url":null,"abstract":"Music source separation (MSS) aims to separate mixed music into its distinct\u0000tracks, such as vocals, bass, drums, and more. MSS is considered to be a\u0000challenging audio separation task due to the complexity of music signals.\u0000Although the RNN and Transformer architecture are not perfect, they are\u0000commonly used to model the music sequence for MSS. Recently, Mamba-2 has\u0000already demonstrated high efficiency in various sequential modeling tasks, but\u0000its superiority has not been investigated in MSS. This paper applies Mamba-2\u0000with a two-stage strategy, which introduces residual mapping based on the mask\u0000method, effectively compensating for the details absent in the mask and further\u0000improving separation performance. Experiments confirm the superiority of\u0000bidirectional Mamba-2 and the effectiveness of the two-stage network in MSS.\u0000The source code is publicly accessible at\u0000https://github.com/baijinglin/TS-BSmamba2.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Authors: Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
arXiv: 2409.06666 (2024-09-10)
Abstract: Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226 ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.
Title: Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models
Authors: Xin Jing, Kun Zhou, Andreas Triantafyllopoulos, Björn W. Schuller
arXiv: 2409.06451 (2024-09-10)
Abstract: While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine-grained control over the emotion rendering of the output speech remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, the diffusion model is trained to generate emotional embeddings based on textual emotional style descriptions. Our framework first trains on reference audio using the audio encoder, then fine-tunes a diffusion model to process textual inputs from ParaCLAP's text encoder. During inference, speech attributes such as pitch, jitter, and loudness are manipulated using only textual conditioning. Our experiments demonstrate that ParaEVITS effectively controls emotion rendering without compromising speech quality. Speech demos are publicly available.
{"title":"Spectral oversubtraction? An approach for speech enhancement after robot ego speech filtering in semi-real-time","authors":"Yue Li, Koen V. Hindriks, Florian A. Kunneman","doi":"arxiv-2409.06274","DOIUrl":"https://doi.org/arxiv-2409.06274","url":null,"abstract":"Spectral subtraction, widely used for its simplicity, has been employed to\u0000address the Robot Ego Speech Filtering (RESF) problem for detecting speech\u0000contents of human interruption from robot's single-channel microphone\u0000recordings when it is speaking. However, this approach suffers from\u0000oversubtraction in the fundamental frequency range (FFR), leading to degraded\u0000speech content recognition. To address this, we propose a Two-Mask\u0000Conformer-based Metric Generative Adversarial Network (CMGAN) to enhance the\u0000detected speech and improve recognition results. Our model compensates for\u0000oversubtracted FFR values with high-frequency information and long-term\u0000features and then de-noises the new spectrogram. In addition, we introduce an\u0000incremental processing method that allows semi-real-time audio processing with\u0000streaming input on a network trained on long fixed-length input. Evaluations of\u0000two datasets, including one with unseen noise, demonstrate significant\u0000improvements in recognition accuracy and the effectiveness of the proposed\u0000two-mask approach and incremental processing, enhancing the robustness of the\u0000proposed RESF pipeline in real-world HRI scenarios.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention-Based Beamformer For Multi-Channel Speech Enhancement","authors":"Jinglin Bai, Hao Li, Xueliang Zhang, Fei Chen","doi":"arxiv-2409.06456","DOIUrl":"https://doi.org/arxiv-2409.06456","url":null,"abstract":"Minimum Variance Distortionless Response (MVDR) is a classical adaptive\u0000beamformer that theoretically ensures the distortionless transmission of\u0000signals in the target direction. Its performance in noise reduction actually\u0000depends on the accuracy of the noise spatial covariance matrix (SCM) estimate.\u0000Although recent deep learning has shown remarkable performance in multi-channel\u0000speech enhancement, the property of distortionless response still makes MVDR\u0000highly popular in real applications. In this paper, we propose an\u0000attention-based mechanism to calculate the speech and noise SCM and then apply\u0000MVDR to obtain the enhanced speech. Moreover, a deep learning architecture\u0000using the inplace convolution operator and frequency-independent LSTM has\u0000proven effective in facilitating SCM estimation. The model is optimized in an\u0000end-to-end manner. Experimental results indicate that the proposed method is\u0000extremely effective in tracking moving or stationary speakers under non-causal\u0000and causal conditions, outperforming other baselines. It is worth mentioning\u0000that our model has only 0.35 million parameters, making it easy to be deployed\u0000on edge devices.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Multi-Source Music Generation with Latent Diffusion
Authors: Zhongweiyang Xu, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury
arXiv: 2409.06190 (2024-09-10)
Abstract: Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) has been proposed to model music as a mixture of multiple instrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use a single diffusion model to generate consistent music sources, which are then mixed to form the music. Despite its capabilities, MSDM is unable to generate songs with rich melodies and often produces empty sounds. Moreover, its waveform diffusion introduces significant Gaussian noise artifacts that compromise audio quality. In response, we introduce a multi-source latent diffusion model (MSLDM) that employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source's unique characteristics in a source latent, which our diffusion model then models jointly. This approach significantly enhances both total and partial music generation by leveraging the VAE's latent compression and noise robustness. The compressed source latents also make generation more efficient. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, showcasing its practical and enhanced applicability in music generation systems. We also emphasize that modeling sources is more effective than directly modeling the music mixture. Code and models are available at https://github.com/XZWY/MSLDM. Demos are available at https://xzwy.github.io/MSLDMDemo.
{"title":"SpeechTaxi: On Multilingual Semantic Speech Classification","authors":"Lennart Keller, Goran Glavaš","doi":"arxiv-2409.06372","DOIUrl":"https://doi.org/arxiv-2409.06372","url":null,"abstract":"Recent advancements in multilingual speech encoding as well as transcription\u0000raise the question of the most effective approach to semantic speech\u0000classification. Concretely, can (1) end-to-end (E2E) classifiers obtained by\u0000fine-tuning state-of-the-art multilingual speech encoders (MSEs) match or\u0000surpass the performance of (2) cascading (CA), where speech is first\u0000transcribed into text and classification is delegated to a text-based\u0000classifier. To answer this, we first construct SpeechTaxi, an 80-hour\u0000multilingual dataset for semantic speech classification of Bible verses,\u0000covering 28 diverse languages. We then leverage SpeechTaxi to conduct a wide\u0000range of experiments comparing E2E and CA in monolingual semantic speech\u0000classification as well as in cross-lingual transfer. We find that E2E based on\u0000MSEs outperforms CA in monolingual setups, i.e., when trained on in-language\u0000data. However, MSEs seem to have poor cross-lingual transfer abilities, with\u0000E2E substantially lagging CA both in (1) zero-shot transfer to languages unseen\u0000in training and (2) multilingual training, i.e., joint training on multiple\u0000languages. Finally, we devise a novel CA approach based on transcription to\u0000Romanized text as a language-agnostic intermediate representation and show that\u0000it represents a robust solution for languages without native ASR support. Our\u0000SpeechTaxi dataset is publicly available at: https://huggingface.co/\u0000datasets/LennartKeller/SpeechTaxi/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Janssen 2.0: Audio Inpainting in the Time-frequency Domain","authors":"Ondřej Mokrý, Peter Balušík, Pavel Rajmic","doi":"arxiv-2409.06392","DOIUrl":"https://doi.org/arxiv-2409.06392","url":null,"abstract":"The paper focuses on inpainting missing parts of an audio signal spectrogram.\u0000First, a recent successful approach based on an untrained neural network is\u0000revised and its several modifications are proposed, improving the\u0000signal-to-noise ratio of the restored audio. Second, the Janssen algorithm, the\u0000autoregression-based state-of-the-art for time-domain audio inpainting, is\u0000adapted for the time-frequency setting. This novel method, coined Janssen-TF,\u0000is compared to the neural network approach using both objective metrics and a\u0000subjective listening test, proving Janssen-TF to be superior in all the\u0000considered measures.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Exploring Differences between Human Perception and Model Inference in Audio Event Recognition
Authors: Yizhou Tan, Yanru Wu, Yuanbo Hou, Xin Xu, Hui Bu, Shengchen Li, Dick Botteldooren, Mark D. Plumbley
arXiv: 2409.06580 (2024-09-10)
Abstract: Audio Event Recognition (AER) traditionally focuses on detecting and identifying audio events. Most existing AER models tend to detect all potential events without considering their varying significance across different contexts, so their outputs often diverge considerably from human auditory perception. Although this is a critical and significant issue, it has not been extensively studied by the Detection and Classification of Acoustic Scenes and Events (DCASE) community, because solving it is time-consuming and labour-intensive. To address this issue, this paper introduces the concept of semantic importance in AER, focusing on the differences between human perception and model inference. The paper constructs a Multi-Annotated Foreground Audio Event Recognition (MAFAR) dataset, which comprises audio recordings labelled by 10 professional annotators. Through labelling frequency and variance, the MAFAR dataset facilitates the quantification of semantic importance and the analysis of human perception. By comparing human annotations with the predictions of an ensemble of pre-trained models, this paper uncovers a significant gap between human perception and model inference in both the semantic identification and the existence detection of audio events. Experimental results reveal that human perception tends to ignore subtle or trivial events during semantic identification, while model inference is easily affected by noisy events. Meanwhile, in existence detection, models are usually more sensitive than humans.