Title: GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis
Authors: Weizhi Liu, Yue Li, Dongdong Lin, Hui Tian, Haizhou Li
arXiv - CS - Sound, 2024-07-15, https://doi.org/arxiv-2407.10471
Abstract: Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, preemptively regulating the creation and dissemination of synthesized content. This paper therefore pioneers a generative robust audio watermarking method (Groot), presenting a paradigm for proactively supervising synthesized audio and its source diffusion models. In this paradigm, watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded within the audio can subsequently be retrieved by a lightweight decoder. The experimental results highlight Groot's outstanding performance, particularly in terms of robustness, surpassing that of leading state-of-the-art methods. Beyond its impressive resilience against individual post-processing attacks, Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%.

Title: LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis
Authors: Zhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang
arXiv - CS - Sound, 2024-07-15, https://doi.org/arxiv-2407.10468
Abstract: Latent diffusion models have shown promising results in audio generation, making notable advancements over traditional methods. However, while their performance is impressive on short audio clips, it faces challenges when extended to longer audio sequences. These challenges stem from the model's self-attention mechanism and its training predominantly on 10-second clips, which complicates extension to longer audio without adaptation. In response to these issues, we introduce LiteFocus, a novel approach that enhances the inference of existing audio latent diffusion models in long audio synthesis. Observing the attention patterns in self-attention, we employ a dual sparse form of attention calculation, designated as same-frequency focus and cross-frequency compensation, which curtails the attention computation under same-frequency constraints while enhancing audio quality through cross-frequency refillment. LiteFocus reduces the inference time of a diffusion-based TTA model by 1.99x when synthesizing 80-second audio clips, while also obtaining improved audio quality.

Title: Guitar Chord Diagram Suggestion for Western Popular Music
Authors: Alexandre d'Hooge (LaBRI, SCRIME), Louis Bigo (LaBRI, SCRIME), Ken Déguernel, Nicolas Martin
arXiv - CS - Sound, 2024-07-15, https://doi.org/arxiv-2407.14260
Abstract: Chord diagrams are used by guitar players to show where and how to play a chord on the fretboard. They are useful to beginners learning chords or for sharing the hand positions required to play a song. However, the diagrams presented on guitar learning tools are usually selected from an existing database and rarely represent the actual positions used by performers. In this paper, we propose a tool which suggests a chord diagram for a chord label, taking into account the diagram of the previous chord. Based on statistical analysis of the DadaGP and mySongBook datasets, we show that some chord diagrams are over-represented in western popular music and that some chords can be played in more than 20 different ways. We argue that taking context into account can improve the variety and the quality of chord diagram suggestion, and compare this approach with a model taking only the current chord label into account. We show that adding previous context improves the F1-score on this task by up to 27% and reduces the propensity of the model to suggest standard open chords. We also define the notion of texture in the context of chord diagrams and show through a variety of metrics that our model improves texture consistency with the previous diagram.

{"title":"Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion","authors":"Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng","doi":"arxiv-2407.10373","DOIUrl":"https://doi.org/arxiv-2407.10373","url":null,"abstract":"Visual acoustic matching (VAM) is pivotal for enhancing the immersive\u0000experience, and the task of dereverberation is effective in improving audio\u0000intelligibility. Existing methods treat each task independently, overlooking\u0000the inherent reciprocity between them. Moreover, these methods depend on paired\u0000training data, which is challenging to acquire, impeding the utilization of\u0000extensive unpaired data. In this paper, we introduce MVSD, a mutual learning\u0000framework based on diffusion models. MVSD considers the two tasks\u0000symmetrically, exploiting the reciprocal relationship to facilitate learning\u0000from inverse tasks and overcome data scarcity. Furthermore, we employ the\u0000diffusion model as foundational conditional converters to circumvent the\u0000training instability and over-smoothing drawbacks of conventional GAN\u0000architectures. Specifically, MVSD employs two converters: one for VAM called\u0000reverberator and one for dereverberation called dereverberator. The\u0000dereverberator judges whether the reverberation audio generated by reverberator\u0000sounds like being in the conditional visual scenario, and vice versa. By\u0000forming a closed loop, these two converters can generate informative feedback\u0000signals to optimize the inverse tasks, even with easily acquired one-way\u0000unpaired data. Extensive experiments on two standard benchmarks, i.e.,\u0000SoundSpaces-Speech and Acoustic AVSpeech, exhibit that our framework can\u0000improve the performance of the reverberator and dereverberator and better match\u0000specified visual scenarios.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"73 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DDFAD: Dataset Distillation Framework for Audio Data","authors":"Wenbo Jiang, Rui Zhang, Hongwei Li, Xiaoyuan Liu, Haomiao Yang, Shui Yu","doi":"arxiv-2407.10446","DOIUrl":"https://doi.org/arxiv-2407.10446","url":null,"abstract":"Deep neural networks (DNNs) have achieved significant success in numerous\u0000applications. The remarkable performance of DNNs is largely attributed to the\u0000availability of massive, high-quality training datasets. However, processing\u0000such massive training data requires huge computational and storage resources.\u0000Dataset distillation is a promising solution to this problem, offering the\u0000capability to compress a large dataset into a smaller distilled dataset. The\u0000model trained on the distilled dataset can achieve comparable performance to\u0000the model trained on the whole dataset. While dataset distillation has been demonstrated in image data, none have\u0000explored dataset distillation for audio data. In this work, for the first time,\u0000we propose a Dataset Distillation Framework for Audio Data (DDFAD).\u0000Specifically, we first propose the Fused Differential MFCC (FD-MFCC) as\u0000extracted features for audio data. After that, the FD-MFCC is distilled through\u0000the matching training trajectory distillation method. Finally, we propose an\u0000audio signal reconstruction algorithm based on the Griffin-Lim Algorithm to\u0000reconstruct the audio signal from the distilled FD-MFCC. Extensive experiments\u0000demonstrate the effectiveness of DDFAD on various audio datasets. In addition,\u0000we show that DDFAD has promising application prospects in many applications,\u0000such as continual learning and neural architecture search.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features","authors":"Jing Luo, Xinyu Yang, Dorien Herremans","doi":"arxiv-2407.10462","DOIUrl":"https://doi.org/arxiv-2407.10462","url":null,"abstract":"Controllable music generation promotes the interaction between humans and\u0000composition systems by projecting the users' intent on their desired music. The\u0000challenge of introducing controllability is an increasingly important issue in\u0000the symbolic music generation field. When building controllable generative\u0000popular multi-instrument music systems, two main challenges typically present\u0000themselves, namely weak controllability and poor music quality. To address\u0000these issues, we first propose spatiotemporal features as powerful and\u0000fine-grained controls to enhance the controllability of the generative model.\u0000In addition, an efficient music representation called REMI_Track is designed to\u0000convert multitrack music into multiple parallel music sequences and shorten the\u0000sequence length of each track with Byte Pair Encoding (BPE) techniques.\u0000Subsequently, we release BandControlNet, a conditional model based on parallel\u0000Transformers, to tackle the multiple music sequences and generate high-quality\u0000music samples that are conditioned to the given spatiotemporal control\u0000features. More concretely, the two specially designed modules of\u0000BandControlNet, namely structure-enhanced self-attention (SE-SA) and\u0000Cross-Track Transformer (CTT), are utilized to strengthen the resulting musical\u0000structure and inter-track harmony modeling respectively. Experimental results\u0000tested on two popular music datasets of different lengths demonstrate that the\u0000proposed BandControlNet outperforms other conditional music generation models\u0000on most objective metrics in terms of fidelity and inference speed and shows\u0000great robustness in generating long music samples. The subjective evaluations\u0000show BandControlNet trained on short datasets can generate music with\u0000comparable quality to state-of-the-art models, while outperforming them\u0000significantly using longer datasets.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"185 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
Authors: Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
arXiv - CS - Sound, 2024-07-15, https://doi.org/arxiv-2407.10387
Abstract: Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the generated sound onsets should match the visual actions that are aligned with them, otherwise unnatural synchronization artifacts arise. Recent works have progressed from conditioning sound generators on still images to conditioning them on video features, focusing on quality and semantic matching while ignoring synchronization, or sacrificing some amount of quality to focus solely on improving synchronization. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results while remaining competitive with the state of the art of non-codec generative audio models. Sample videos and generated audios are available at https://maskvat.github.io .

Title: Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification
Authors: Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie
arXiv - CS - Sound, 2024-07-14, https://doi.org/arxiv-2407.10048
Abstract: Trained on 680,000 hours of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.

{"title":"The Interpretation Gap in Text-to-Music Generation Models","authors":"Yongyi Zang, Yixiao Zhang","doi":"arxiv-2407.10328","DOIUrl":"https://doi.org/arxiv-2407.10328","url":null,"abstract":"Large-scale text-to-music generation models have significantly enhanced music\u0000creation capabilities, offering unprecedented creative freedom. However, their\u0000ability to collaborate effectively with human musicians remains limited. In\u0000this paper, we propose a framework to describe the musical interaction process,\u0000which includes expression, interpretation, and execution of controls. Following\u0000this framework, we argue that the primary gap between existing text-to-music\u0000models and musicians lies in the interpretation stage, where models lack the\u0000ability to interpret controls from musicians. We also propose two strategies to\u0000address this gap and call on the music information retrieval community to\u0000tackle the interpretation challenge to improve human-AI musical collaboration.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks
Authors: Lucca Emmanuel Pineli Simões, Lucas Brandão Rodrigues, Rafaela Mota Silva, Gustavo Rodrigues da Silva
arXiv - CS - Sound, 2024-07-10, https://doi.org/arxiv-2407.08658
Abstract: This paper presents the development and comparative evaluation of three voice command pipelines for controlling a Tello drone, using speech recognition and deep learning techniques. The aim is to enhance human-machine interaction by enabling intuitive voice control of drone actions. The pipelines developed include: (1) a traditional Speech-to-Text (STT) followed by a Large Language Model (LLM) approach, (2) a direct voice-to-function mapping model, and (3) a Siamese neural network-based system. Each pipeline was evaluated based on inference time, accuracy, efficiency, and flexibility. Detailed methodologies, dataset preparation, and evaluation metrics are provided, offering a comprehensive analysis of each pipeline's strengths and applicability across different scenarios.
