{"title":"Data Efficient Acoustic Scene Classification using Teacher-Informed Confusing Class Instruction","authors":"Jin Jie Sean Yeo, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan","doi":"arxiv-2409.11964","DOIUrl":"https://doi.org/arxiv-2409.11964","url":null,"abstract":"In this technical report, we describe the SNTL-NTU team's submission for Task\u00001 Data-Efficient Low-Complexity Acoustic Scene Classification of the detection\u0000and classification of acoustic scenes and events (DCASE) 2024 challenge. Three\u0000systems are introduced to tackle training splits of different sizes. For small\u0000training splits, we explored reducing the complexity of the provided baseline\u0000model by reducing the number of base channels. We introduce data augmentation\u0000in the form of mixup to increase the diversity of training samples. For the\u0000larger training splits, we use FocusNet to provide confusing class information\u0000to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models\u0000and baseline models trained on the original sampling rate of 44.1 kHz. We use\u0000Knowledge Distillation to distill the ensemble model to the baseline student\u0000model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile\u0000development dataset yielded the highest average testing accuracy of (62.21,\u000059.82, 56.81, 53.03, 47.97)% on split (100, 50, 25, 10, 5)% respectively over\u0000the three systems.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference","authors":"Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee","doi":"arxiv-2409.12117","DOIUrl":"https://doi.org/arxiv-2409.12117","url":null,"abstract":"Large language models (LLMs) have significantly advanced audio processing\u0000through audio codecs that convert audio into discrete tokens, enabling the\u0000application of language modeling techniques to audio data. However, audio\u0000codecs often operate at high frame rates, resulting in slow training and\u0000inference, especially for autoregressive models. To address this challenge, we\u0000present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that\u0000leverages finite scalar quantization and adversarial training with large speech\u0000language models to achieve high-quality audio compression with a 1.89 kbps\u0000bitrate and 21.5 frames per second. We demonstrate that our novel codec can\u0000make the inference of LLM-based text-to-speech models around three times faster\u0000while improving intelligibility and producing quality comparable to previous\u0000models.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models","authors":"EverestAI, :, Sijin Chen, Yuan Feng, Laipeng He, Tianwei He, Wendi He, Yanni Hu, Bin Lin, Yiting Lin, Pengfei Tan, Chengwei Tian, Chen Wang, Zhicheng Wang, Ruoye Xie, Jingjing Yin, Jianhao Ye, Jixun Yao, Quanlei Yan, Yuguang Yang","doi":"arxiv-2409.12139","DOIUrl":"https://doi.org/arxiv-2409.12139","url":null,"abstract":"With the advent of the big data and large language model era, zero-shot\u0000personalized rapid customization has emerged as a significant trend. In this\u0000report, we introduce Takin AudioLLM, a series of techniques and models, mainly\u0000including Takin TTS, Takin VC, and Takin Morphing, specifically designed for\u0000audiobook production. These models are capable of zero-shot speech production,\u0000generating high-quality speech that is nearly indistinguishable from real human\u0000speech and facilitating individuals to customize the speech content according\u0000to their own needs. Specifically, we first introduce Takin TTS, a neural codec\u0000language model that builds upon an enhanced neural speech codec and a\u0000multi-task training framework, capable of generating high-fidelity natural\u0000speech in a zero-shot way. For Takin VC, we advocate an effective content and\u0000timbre joint modeling approach to improve the speaker similarity, while\u0000advocating for a conditional flow matching based decoder to further enhance its\u0000naturalness and expressiveness. Last, we propose the Takin Morphing system with\u0000highly decoupled and advanced timbre and prosody modeling approaches, which\u0000enables individuals to customize speech production with their preferred timbre\u0000and prosody in a precise and controllable manner. Extensive experiments\u0000validate the effectiveness and robustness of our Takin AudioLLM series models.\u0000For detailed demos, please refer to https://takinaudiollm.github.io.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pareto Data Framework: Steps Towards Resource-Efficient Decision Making Using Minimum Viable Data (MVD)","authors":"Tashfain Ahmed, Josh Siegel","doi":"arxiv-2409.12112","DOIUrl":"https://doi.org/arxiv-2409.12112","url":null,"abstract":"This paper introduces the Pareto Data Framework, an approach for identifying\u0000and selecting the Minimum Viable Data (MVD) required for enabling machine\u0000learning applications on constrained platforms such as embedded systems, mobile\u0000devices, and Internet of Things (IoT) devices. We demonstrate that strategic\u0000data reduction can maintain high performance while significantly reducing\u0000bandwidth, energy, computation, and storage costs. The framework identifies\u0000Minimum Viable Data (MVD) to optimize efficiency across resource-constrained\u0000environments without sacrificing performance. It addresses common inefficient\u0000practices in an IoT application such as overprovisioning of sensors and\u0000overprecision, and oversampling of signals, proposing scalable solutions for\u0000optimal sensor selection, signal extraction and transmission, and data\u0000representation. An experimental methodology demonstrates effective acoustic\u0000data characterization after downsampling, quantization, and truncation to\u0000simulate reduced-fidelity sensors and network and storage constraints; results\u0000shows that performance can be maintained up to 95% with sample rates reduced\u0000by 75% and bit depths and clip length reduced by 50% which translates into\u0000substantial cost and resource reduction. These findings have implications on\u0000the design and development of constrained systems. The paper also discusses\u0000broader implications of the framework, including the potential to democratize\u0000advanced AI technologies across IoT applications and sectors such as\u0000agriculture, transportation, and manufacturing to improve access and multiply\u0000the benefits of data-driven insights.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper","authors":"Jiaming Zhou, Shiwan Zhao, Jiabei He, Hui Wang, Wenjia Zeng, Yong Chen, Haoqin Sun, Aobo Kong, Yong Qin","doi":"arxiv-2409.11889","DOIUrl":"https://doi.org/arxiv-2409.11889","url":null,"abstract":"State-of-the-art models like OpenAI's Whisper exhibit strong performance in\u0000multilingual automatic speech recognition (ASR), but they still face challenges\u0000in accurately recognizing diverse subdialects. In this paper, we propose\u0000M2R-whisper, a novel multi-stage and multi-scale retrieval augmentation\u0000approach designed to enhance ASR performance in low-resource settings. Building\u0000on the principles of in-context learning (ICL) and retrieval-augmented\u0000techniques, our method employs sentence-level ICL in the pre-processing stage\u0000to harness contextual information, while integrating token-level k-Nearest\u0000Neighbors (kNN) retrieval as a post-processing step to further refine the final\u0000output distribution. By synergistically combining sentence-level and\u0000token-level retrieval strategies, M2R-whisper effectively mitigates various\u0000types of recognition errors. Experiments conducted on Mandarin and subdialect\u0000datasets, including AISHELL-1 and KeSpeech, demonstrate substantial\u0000improvements in ASR accuracy, all achieved without any parameter updates.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"METEOR: Melody-aware Texture-controllable Symbolic Orchestral Music Generation","authors":"Dinh-Viet-Toan Le, Yi-Hsuan Yang","doi":"arxiv-2409.11753","DOIUrl":"https://doi.org/arxiv-2409.11753","url":null,"abstract":"Western music is often characterized by a homophonic texture, in which the\u0000musical content can be organized into a melody and an accompaniment. In\u0000orchestral music, in particular, the composer can select specific\u0000characteristics for each instrument's part within the accompaniment, while also\u0000needing to adapt the melody to suit the capabilities of the instruments\u0000performing it. In this work, we propose METEOR, a model for Melody-aware\u0000Texture-controllable Orchestral music generation. This model performs symbolic\u0000multi-track music style transfer with a focus on melodic fidelity. We allow\u0000bar- and track-level controllability of the accompaniment with various textural\u0000attributes while keeping a homophonic texture. We show that the model can\u0000achieve controllability performances similar to strong baselines while greatly\u0000improve melodic fidelity.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations","authors":"Haopeng Geng, Daisuke Saito, Minematsu Nobuaki","doi":"arxiv-2409.11742","DOIUrl":"https://doi.org/arxiv-2409.11742","url":null,"abstract":"Evaluating speech intelligibility is a critical task in computer-aided\u0000language learning systems. Traditional methods often rely on word error rates\u0000(WER) provided by automatic speech recognition (ASR) as intelligibility scores.\u0000However, this approach has significant limitations due to notable differences\u0000between human speech recognition (HSR) and ASR. A promising alternative is to\u0000involve a native (L1) speaker in shadowing what nonnative (L2) speakers say.\u0000Breakdowns or mispronunciations in the L1 speaker's shadowing utterance can\u0000serve as indicators for assessing L2 speech intelligibility. In this study, we\u0000propose a speech generation system that simulates the L1 shadowing process\u0000using voice conversion (VC) techniques and latent speech representations. Our\u0000experimental results demonstrate that this method effectively replicates the L1\u0000shadowing process, offering an innovative tool to evaluate L2 speech\u0000intelligibility. Notably, systems that utilize self-supervised speech\u0000representations (S3R) show a higher degree of similarity to real L1 shadowing\u0000utterances in both linguistic accuracy and naturalness.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SALT: Standardized Audio event Label Taxonomy","authors":"Paraskevas StamatiadisIDS, S2A, LTCI, Michel OlveraIDS, S2A, LTCI, Slim EssidIDS, S2A, LTCI","doi":"arxiv-2409.11746","DOIUrl":"https://doi.org/arxiv-2409.11746","url":null,"abstract":"Machine listening systems often rely on fixed taxonomies to organize and\u0000label audio data, key for training and evaluating deep neural networks (DNNs)\u0000and other supervised algorithms. However, such taxonomies face significant\u0000constraints: they are composed of application-dependent predefined categories,\u0000which hinders the integration of new or varied sounds, and exhibits limited\u0000cross-dataset compatibility due to inconsistent labeling standards. To overcome\u0000these limitations, we introduce SALT: Standardized Audio event Label Taxonomy.\u0000Building upon the hierarchical structure of AudioSet's ontology, our taxonomy\u0000extends and standardizes labels across 24 publicly available environmental\u0000sound datasets, allowing the mapping of class labels from diverse datasets to a\u0000unified system. Our proposal comes with a new Python package designed for\u0000navigating and utilizing this taxonomy, easing cross-dataset label searching\u0000and hierarchical exploration. Notably, our package allows effortless data\u0000aggregation from diverse sources, hence easy experimentation with combined\u0000datasets.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems","authors":"Anusha Prakash, Hema A Murthy","doi":"arxiv-2409.11915","DOIUrl":"https://doi.org/arxiv-2409.11915","url":null,"abstract":"Sentences in Indian languages are generally longer than those in English.\u0000Indian languages are also considered to be phrase-based, wherein semantically\u0000complete phrases are concatenated to make up sentences. Long utterances lead to\u0000poor training of text-to-speech models and result in poor prosody during\u0000synthesis. In this work, we explore an inter-pausal unit (IPU) based approach\u0000in the end-to-end (E2E) framework, focusing on synthesising\u0000conversational-style text. We consider both autoregressive Tacotron2 and\u0000non-autoregressive FastSpeech2 architectures in our study and perform\u0000experiments with three Indian languages, namely, Hindi, Tamil and Telugu. With\u0000the IPU-based Tacotron2 approach, we see a reduction in insertion and deletion\u0000errors in the synthesised audio, providing an alternative approach to the\u0000FastSpeech(2) network in terms of error reduction. The IPU-based approach\u0000requires less computational resources and produces prosodically richer\u0000synthesis compared to conventional sentence-based systems.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Large Language Models By Layerwise Attention Shortcuts","authors":"Prateek Verma, Mert Pilanci","doi":"arxiv-2409.10870","DOIUrl":"https://doi.org/arxiv-2409.10870","url":null,"abstract":"Transformer architectures are the backbone of the modern AI revolution.\u0000However, they are based on simply stacking the same blocks in dozens of layers\u0000and processing information sequentially from one block to another. In this\u0000paper, we propose to challenge this and introduce adaptive computations for\u0000LLM-like setups, which allow the final layer to attend to all of the\u0000intermediate layers as it deems fit through the attention mechanism, thereby\u0000introducing computational textbf{attention shortcuts}. These shortcuts can\u0000thus make the architecture depth and context adaptive. We showcase four\u0000different datasets, namely acoustic tokens, natural language, and symbolic\u0000music, and we achieve superior performance for GPT-like architecture. We give\u0000evidence via attention maps that the models learn complex dependencies across\u0000layers that are adaptive in context and depth depending on the input tokens.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}