{"title":"DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models","authors":"Tzu-Quan Lin, Hung-yi Lee, Hao Tang","doi":"arxiv-2406.05464","DOIUrl":"https://doi.org/arxiv-2406.05464","url":null,"abstract":"Self-supervised speech models have been shown to be useful for various tasks, but\u0000their large size limits their use in devices with low computing power and memory.\u0000In this work, we explore early exit, an approach for reducing latency by\u0000exiting the forward process of a network early. Most approaches to early exit\u0000need a separate early exit model for each task, with some even requiring\u0000fine-tuning of the entire pretrained model. We introduce Data Adaptive\u0000Self-Supervised Early Exit (DAISY), an approach that decides when to exit based\u0000on the self-supervised loss, eliminating the need for multiple rounds of\u0000training and fine-tuning. DAISY matches the performance of HuBERT on the\u0000MiniSUPERB benchmark, but with much faster inference times. Our analysis of the\u0000adaptivity of DAISY shows that the model exits early (using fewer layers) on\u0000clean data and exits late (using more layers) on noisy data, dynamically\u0000adjusting the computational cost of inference based on the noise level of each\u0000sample.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation","authors":"Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim","doi":"arxiv-2406.05515","DOIUrl":"https://doi.org/arxiv-2406.05515","url":null,"abstract":"Acoustic context effects, where surrounding changes in pitch, rate or timbre\u0000influence the perception of a sound, are well documented in speech perception,\u0000but how they interact with language background remains unclear. Using a\u0000reverse-correlation approach, we systematically varied the pitch and speech\u0000rate in phrases around different pairs of vowels for second language (L2)\u0000speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a\u0000data-driven manner, the prosodic profiles that bias their perception. Testing\u0000English and French speakers (n=25), we showed that vowel perception is in fact\u0000influenced by conflicting effects from the surrounding pitch and speech rate: a\u0000congruent proximal effect 0.2s pre-target and a distal contrastive effect up to\u00001s before; and found that L1 and L2 speakers exhibited strikingly similar\u0000prosodic profiles in perception. We provide a novel method to investigate\u0000acoustic context effects across stimuli, timescales, and acoustic domains.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the Benefits of Tokenization of Discrete Acoustic Units","authors":"Avihu Dekel, Raul Fernandez","doi":"arxiv-2406.05547","DOIUrl":"https://doi.org/arxiv-2406.05547","url":null,"abstract":"Tokenization algorithms that merge the units of a base vocabulary into\u0000larger, variable-rate units have become standard in natural language processing\u0000tasks. This idea, however, has been mostly overlooked when the vocabulary\u0000consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based\u0000representation that is playing an increasingly important role due to the\u0000success of discrete language-modeling techniques. In this paper, we showcase\u0000the advantages of tokenization of phonetic units and of DAUs on three\u0000prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised\u0000speech generation using DAU language modeling. We demonstrate that tokenization\u0000yields significant improvements in terms of performance, as well as training\u0000and inference speed, across all three tasks. We also offer theoretical insights\u0000to provide some explanation for the superior performance observed.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling","authors":"Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo","doi":"arxiv-2406.04321","DOIUrl":"https://doi.org/arxiv-2406.04321","url":null,"abstract":"In this work, we systematically study music generation conditioned solely on\u0000the video. First, we present a large-scale dataset comprising 190K video-music\u0000pairs, including various genres such as movie trailers, advertisements, and\u0000documentaries. Furthermore, we propose VidMuse, a simple framework for\u0000generating music aligned with video inputs. VidMuse stands out by producing\u0000high-fidelity music that is both acoustically and semantically aligned with the\u0000video. By incorporating local and global visual cues, VidMuse enables the\u0000creation of musically coherent audio tracks that consistently match the video\u0000content through Long-Short-Term modeling. Through extensive experiments,\u0000VidMuse outperforms existing models in terms of audio quality, diversity, and\u0000audio-visual alignment. The code and datasets will be available at\u0000https://github.com/ZeyueT/VidMuse/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Operational Latent Spaces","authors":"Scott H. Hawley, Austin R. Tackett","doi":"arxiv-2406.02699","DOIUrl":"https://doi.org/arxiv-2406.02699","url":null,"abstract":"We investigate the construction of latent spaces through self-supervised\u0000learning to support semantically meaningful operations. Analogous to\u0000operational amplifiers, these \"operational latent spaces\" (OpLaS) not only\u0000demonstrate semantic structure such as clustering but also support common\u0000transformational operations with inherent semantic meaning. Some operational\u0000latent spaces are found to have arisen \"unintentionally\" in the progress toward\u0000some (other) self-supervised learning objective, in which unintended but still\u0000useful properties are discovered among the relationships of points in the\u0000space. Other spaces may be constructed \"intentionally\" by developers\u0000stipulating certain kinds of clustering or transformations intended to produce\u0000the desired structure. We focus on the intentional creation of operational\u0000latent spaces via self-supervised learning, including the introduction of\u0000rotation operators via a novel \"FiLMR\" layer, which can be used to enable\u0000ring-like symmetries found in some musical constructions.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Searching For Music Mixing Graphs: A Pruning Approach","authors":"Sungho Lee, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji","doi":"arxiv-2406.01049","DOIUrl":"https://doi.org/arxiv-2406.01049","url":null,"abstract":"Music mixing is compositional -- experts combine multiple audio processors to\u0000achieve a cohesive mix from dry source tracks. We propose a method to reverse\u0000engineer this process from the input and output audio. First, we create a\u0000mixing console that applies all available processors to every chain. Then,\u0000after the initial console parameter optimization, we alternate between removing\u0000redundant processors and fine-tuning. We achieve this through differentiable\u0000implementation of both processors and pruning. Consequently, we find a sparse\u0000mixing graph that achieves nearly identical matching quality of the full mixing\u0000console. We apply this procedure to dry-mix pairs from various datasets and\u0000collect graphs that also can be used to train neural networks for music mixing\u0000applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141254791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation","authors":"Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan","doi":"arxiv-2405.20289","DOIUrl":"https://doi.org/arxiv-2405.20289","url":null,"abstract":"Controllable music generation methods are critical for human-centered\u0000AI-based music creation, but are currently limited by speed, quality, and\u0000control design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in\u0000particular, offers state-of-the-art results, but is over 10x slower than\u0000real-time, limiting practical use. We propose Distilled Diffusion\u0000Inference-Time T-Optimization (or DITTO-2), a new method to speed up\u0000inference-time optimization-based control and unlock faster-than-real-time\u0000generation for a wide variety of applications such as music inpainting,\u0000outpainting, intensity, melody, and musical structure control. Our method works\u0000by (1) distilling a pre-trained diffusion model for fast sampling via an\u0000efficient, modified consistency or consistency trajectory distillation process,\u0000(2) performing inference-time optimization using our distilled model with\u0000one-step sampling as an efficient surrogate optimization task, and (3) running a\u0000final multi-step sampling generation (decoding) using our estimated noise\u0000latents for best-quality, fast, controllable generation. Through thorough\u0000evaluation, we find our method not only speeds up generation over 10-20x, but\u0000also improves control adherence and generation quality.\u0000Furthermore, we apply our approach to a new application of maximizing text\u0000adherence (CLAP score) and show we can convert an unconditional diffusion model\u0000without text inputs into a model that yields state-of-the-art text control.\u0000Sound examples can be found at https://ditto-music.github.io/ditto2/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141187830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLMs Meet Multimodal Generation and Editing: A Survey","authors":"Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen","doi":"arxiv-2405.19334","DOIUrl":"https://doi.org/arxiv-2405.19334","url":null,"abstract":"With the recent advancement in large language models (LLMs), there is a\u0000growing interest in combining LLMs with multimodal learning. Previous surveys\u0000of multimodal large language models (MLLMs) mainly focus on understanding. This\u0000survey elaborates on multimodal generation across different domains, including\u0000image, video, 3D, and audio, where we highlight the notable advancements with\u0000milestone works in these fields. Specifically, we exhaustively investigate the\u0000key technical components behind methods and multimodal datasets utilized in\u0000these studies. Moreover, we dig into tool-augmented multimodal agents that can\u0000use existing generative models for human-computer interaction. Lastly, we also\u0000comprehensively discuss the advancement in AI safety and investigate emerging\u0000applications as well as future prospects. Our work provides a systematic and\u0000insightful overview of multimodal generation, which is expected to advance the\u0000development of Artificial Intelligence for Generative Content (AIGC) and world\u0000models. A curated list of all related papers can be found at\u0000https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141528501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings","authors":"Tariq Adnan, Abdelrahman Abdelkader, Zipei Liu, Ekram Hossain, Sooyong Park, MD Saiful Islam, Ehsan Hoque","doi":"arxiv-2405.17206","DOIUrl":"https://doi.org/arxiv-2405.17206","url":null,"abstract":"We present a framework to recognize Parkinson's disease (PD) through an\u0000English pangram utterance speech collected using a web application from diverse\u0000recording settings and environments, including participants' homes. Our dataset\u0000includes a global cohort of 1306 participants, including 392 diagnosed with PD.\u0000Leveraging the diversity of the dataset, spanning various demographic\u0000properties (such as age, sex, and ethnicity), we used deep learning embeddings\u0000derived from semi-supervised models such as Wav2Vec 2.0, WavLM, and ImageBind\u0000representing the speech dynamics associated with PD. Our novel fusion model for\u0000PD classification, which aligns different speech embeddings into a cohesive\u0000feature space, demonstrated superior performance over standard\u0000concatenation-based fusion models and other baselines (including models built\u0000on traditional acoustic features). In a randomized data split configuration,\u0000the model achieved an Area Under the Receiver Operating Characteristic Curve\u0000(AUROC) of 88.94% and an accuracy of 85.65%. Rigorous statistical analysis\u0000confirmed that our model performs equitably across various demographic\u0000subgroups in terms of sex, ethnicity, and age, and remains robust regardless of\u0000disease duration. Furthermore, our model, when tested on two entirely unseen\u0000test datasets collected from clinical settings and from a PD care center,\u0000maintained AUROC scores of 82.12% and 78.44%, respectively. This affirms the\u0000model's robustness and its potential to enhance accessibility and health\u0000equity in real-world applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141166241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing DMI Interactions by Integrating Haptic Feedback for Intricate Vibrato Technique","authors":"Ziyue Piao, Christian Frisson, Bavo Van Kerrebroeck, Marcelo M. Wanderley","doi":"arxiv-2405.10502","DOIUrl":"https://doi.org/arxiv-2405.10502","url":null,"abstract":"This paper investigates the integration of force feedback in Digital Musical\u0000Instruments (DMI), specifically evaluating the reproduction of intricate\u0000vibrato techniques using haptic feedback controllers. We introduce our system\u0000for vibrato modulation using force feedback, composed of Bend-aid (a web-based\u0000sequencer platform using pre-designed haptic feedback models) and TorqueTuner\u0000(an open-source 1 Degree-of-Freedom (DoF) rotary haptic device for generating\u0000programmable haptic effects). We designed a formal user study to assess the\u0000impact of each haptic mode on user experience in a vibrato mimicry task. Twenty\u0000musically trained participants rated their user experience for the three haptic\u0000modes (Smooth, Detent, and Spring) using four Likert-scale scores: comfort,\u0000flexibility, ease of control, and helpfulness for the task. Finally, we asked\u0000participants to share their reflections. Our research indicates that while the\u0000Spring mode can help with light vibrato, preferences for haptic modes vary\u0000based on musical training background. This emphasizes the need for adaptable\u0000task interfaces and flexible haptic feedback in DMI design.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}