Title: GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis
Authors: Weizhi Liu, Yue Li, Dongdong Lin, Hui Tian, Haizhou Li
arXiv - CS - Sound, 2024-07-15, https://doi.org/arxiv-2407.10471
Abstract: Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, preemptively regulating the creation and dissemination of synthesized content. This paper therefore pioneers a generative robust audio watermarking method (Groot), presenting a paradigm for proactively supervising synthesized audio and its source diffusion models. In this paradigm, watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded within the audio can subsequently be retrieved by a lightweight decoder. The experimental results highlight Groot's outstanding performance, particularly in terms of robustness, surpassing that of leading state-of-the-art methods. Beyond its impressive resilience against individual post-processing attacks, Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%.

Title: LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis
Authors: Zhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang
arXiv - CS - Sound, 2024-07-15, https://doi.org/arxiv-2407.10468
Abstract: Latent diffusion models have shown promising results in audio generation, making notable advancements over traditional methods. However, while their performance is impressive on short audio clips, it faces challenges when extended to longer audio sequences. These challenges stem from the model's self-attention mechanism and its training predominantly on 10-second clips, which complicates extension to longer audio without adaptation. In response to these issues, we introduce LiteFocus, a novel approach that enhances the inference of existing audio latent diffusion models in long audio synthesis. Observing the attention patterns in self-attention, we employ a dual sparse form of attention calculation, designated as same-frequency focus and cross-frequency compensation, which curtails the attention computation under same-frequency constraints while enhancing audio quality through cross-frequency refillment. LiteFocus reduces the inference time of a diffusion-based TTA model by 1.99x when synthesizing 80-second audio clips, while also obtaining improved audio quality.

Title: Guitar Chord Diagram Suggestion for Western Popular Music
Authors: Alexandre d'Hooge (LaBRI, SCRIME), Louis Bigo (LaBRI, SCRIME), Ken Déguernel, Nicolas Martin
arXiv - CS - Sound, 2024-07-15, https://doi.org/arxiv-2407.14260
Abstract: Chord diagrams are used by guitar players to show where and how to play a chord on the fretboard. They are useful to beginners learning chords or for sharing the hand positions required to play a song. However, the diagrams presented on guitar learning tools are usually selected from an existing database and rarely represent the actual positions used by performers. In this paper, we propose a tool which suggests a chord diagram for a chord label, taking into account the diagram of the previous chord. Based on statistical analysis of the DadaGP and mySongBook datasets, we show that some chord diagrams are over-represented in western popular music and that some chords can be played in more than 20 different ways. We argue that taking context into account can improve the variety and the quality of chord diagram suggestion, and compare this approach with a model taking only the current chord label into account. We show that adding previous context improves the F1-score on this task by up to 27% and reduces the propensity of the model to suggest standard open chords. We also define the notion of texture in the context of chord diagrams and show through a variety of metrics that our model improves texture consistency with the previous diagram.

{"title":"Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion","authors":"Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng","doi":"arxiv-2407.10373","DOIUrl":"https://doi.org/arxiv-2407.10373","url":null,"abstract":"Visual acoustic matching (VAM) is pivotal for enhancing the immersive\u0000experience, and the task of dereverberation is effective in improving audio\u0000intelligibility. Existing methods treat each task independently, overlooking\u0000the inherent reciprocity between them. Moreover, these methods depend on paired\u0000training data, which is challenging to acquire, impeding the utilization of\u0000extensive unpaired data. In this paper, we introduce MVSD, a mutual learning\u0000framework based on diffusion models. MVSD considers the two tasks\u0000symmetrically, exploiting the reciprocal relationship to facilitate learning\u0000from inverse tasks and overcome data scarcity. Furthermore, we employ the\u0000diffusion model as foundational conditional converters to circumvent the\u0000training instability and over-smoothing drawbacks of conventional GAN\u0000architectures. Specifically, MVSD employs two converters: one for VAM called\u0000reverberator and one for dereverberation called dereverberator. The\u0000dereverberator judges whether the reverberation audio generated by reverberator\u0000sounds like being in the conditional visual scenario, and vice versa. By\u0000forming a closed loop, these two converters can generate informative feedback\u0000signals to optimize the inverse tasks, even with easily acquired one-way\u0000unpaired data. Extensive experiments on two standard benchmarks, i.e.,\u0000SoundSpaces-Speech and Acoustic AVSpeech, exhibit that our framework can\u0000improve the performance of the reverberator and dereverberator and better match\u0000specified visual scenarios.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"73 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DDFAD: Dataset Distillation Framework for Audio Data","authors":"Wenbo Jiang, Rui Zhang, Hongwei Li, Xiaoyuan Liu, Haomiao Yang, Shui Yu","doi":"arxiv-2407.10446","DOIUrl":"https://doi.org/arxiv-2407.10446","url":null,"abstract":"Deep neural networks (DNNs) have achieved significant success in numerous\u0000applications. The remarkable performance of DNNs is largely attributed to the\u0000availability of massive, high-quality training datasets. However, processing\u0000such massive training data requires huge computational and storage resources.\u0000Dataset distillation is a promising solution to this problem, offering the\u0000capability to compress a large dataset into a smaller distilled dataset. The\u0000model trained on the distilled dataset can achieve comparable performance to\u0000the model trained on the whole dataset. While dataset distillation has been demonstrated in image data, none have\u0000explored dataset distillation for audio data. In this work, for the first time,\u0000we propose a Dataset Distillation Framework for Audio Data (DDFAD).\u0000Specifically, we first propose the Fused Differential MFCC (FD-MFCC) as\u0000extracted features for audio data. After that, the FD-MFCC is distilled through\u0000the matching training trajectory distillation method. Finally, we propose an\u0000audio signal reconstruction algorithm based on the Griffin-Lim Algorithm to\u0000reconstruct the audio signal from the distilled FD-MFCC. Extensive experiments\u0000demonstrate the effectiveness of DDFAD on various audio datasets. In addition,\u0000we show that DDFAD has promising application prospects in many applications,\u0000such as continual learning and neural architecture search.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features","authors":"Jing Luo, Xinyu Yang, Dorien Herremans","doi":"arxiv-2407.10462","DOIUrl":"https://doi.org/arxiv-2407.10462","url":null,"abstract":"Controllable music generation promotes the interaction between humans and\u0000composition systems by projecting the users' intent on their desired music. The\u0000challenge of introducing controllability is an increasingly important issue in\u0000the symbolic music generation field. When building controllable generative\u0000popular multi-instrument music systems, two main challenges typically present\u0000themselves, namely weak controllability and poor music quality. To address\u0000these issues, we first propose spatiotemporal features as powerful and\u0000fine-grained controls to enhance the controllability of the generative model.\u0000In addition, an efficient music representation called REMI_Track is designed to\u0000convert multitrack music into multiple parallel music sequences and shorten the\u0000sequence length of each track with Byte Pair Encoding (BPE) techniques.\u0000Subsequently, we release BandControlNet, a conditional model based on parallel\u0000Transformers, to tackle the multiple music sequences and generate high-quality\u0000music samples that are conditioned to the given spatiotemporal control\u0000features. More concretely, the two specially designed modules of\u0000BandControlNet, namely structure-enhanced self-attention (SE-SA) and\u0000Cross-Track Transformer (CTT), are utilized to strengthen the resulting musical\u0000structure and inter-track harmony modeling respectively. Experimental results\u0000tested on two popular music datasets of different lengths demonstrate that the\u0000proposed BandControlNet outperforms other conditional music generation models\u0000on most objective metrics in terms of fidelity and inference speed and shows\u0000great robustness in generating long music samples. The subjective evaluations\u0000show BandControlNet trained on short datasets can generate music with\u0000comparable quality to state-of-the-art models, while outperforming them\u0000significantly using longer datasets.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"185 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
Authors: Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
arXiv - CS - Sound, 2024-07-15, https://doi.org/arxiv-2407.10387
Abstract: Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the generated sound onsets should match the visual actions that are aligned with them, otherwise unnatural synchronization artifacts arise. Recent works have progressed from conditioning sound generators on still images to conditioning them on video features, focusing on quality and semantic matching while ignoring synchronization, or sacrificing some amount of quality to focus solely on improving synchronization. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results while remaining competitive with the state of the art of non-codec generative audio models. Sample videos and generated audios are available at https://maskvat.github.io .

Title: Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification
Authors: Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie
arXiv - CS - Sound, 2024-07-14, https://doi.org/arxiv-2407.10048
Abstract: Trained on 680,000 hours of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.

{"title":"The Interpretation Gap in Text-to-Music Generation Models","authors":"Yongyi Zang, Yixiao Zhang","doi":"arxiv-2407.10328","DOIUrl":"https://doi.org/arxiv-2407.10328","url":null,"abstract":"Large-scale text-to-music generation models have significantly enhanced music\u0000creation capabilities, offering unprecedented creative freedom. However, their\u0000ability to collaborate effectively with human musicians remains limited. In\u0000this paper, we propose a framework to describe the musical interaction process,\u0000which includes expression, interpretation, and execution of controls. Following\u0000this framework, we argue that the primary gap between existing text-to-music\u0000models and musicians lies in the interpretation stage, where models lack the\u0000ability to interpret controls from musicians. We also propose two strategies to\u0000address this gap and call on the music information retrieval community to\u0000tackle the interpretation challenge to improve human-AI musical collaboration.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks
Authors: Lucca Emmanuel Pineli Simões, Lucas Brandão Rodrigues, Rafaela Mota Silva, Gustavo Rodrigues da Silva
arXiv - CS - Sound, 2024-07-10, https://doi.org/arxiv-2407.08658
Abstract: This paper presents the development and comparative evaluation of three voice command pipelines for controlling a Tello drone, using speech recognition and deep learning techniques. The aim is to enhance human-machine interaction by enabling intuitive voice control of drone actions. The pipelines developed include: (1) a traditional Speech-to-Text (STT) followed by a Large Language Model (LLM) approach, (2) a direct voice-to-function mapping model, and (3) a Siamese neural network-based system. Each pipeline was evaluated based on inference time, accuracy, efficiency, and flexibility. Detailed methodologies, dataset preparation, and evaluation metrics are provided, offering a comprehensive analysis of each pipeline's strengths and applicability across different scenarios.
