{"title":"VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling","authors":"Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo","doi":"arxiv-2406.04321","DOIUrl":"https://doi.org/arxiv-2406.04321","url":null,"abstract":"In this work, we systematically study music generation conditioned solely on\u0000the video. First, we present a large-scale dataset comprising 190K video-music\u0000pairs, including various genres such as movie trailers, advertisements, and\u0000documentaries. Furthermore, we propose VidMuse, a simple framework for\u0000generating music aligned with video inputs. VidMuse stands out by producing\u0000high-fidelity music that is both acoustically and semantically aligned with the\u0000video. By incorporating local and global visual cues, VidMuse enables the\u0000creation of musically coherent audio tracks that consistently match the video\u0000content through Long-Short-Term modeling. Through extensive experiments,\u0000VidMuse outperforms existing models in terms of audio quality, diversity, and\u0000audio-visual alignment. The code and datasets will be available at\u0000https://github.com/ZeyueT/VidMuse/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Operational Latent Spaces","authors":"Scott H. Hawley, Austin R. Tackett","doi":"arxiv-2406.02699","DOIUrl":"https://doi.org/arxiv-2406.02699","url":null,"abstract":"We investigate the construction of latent spaces through self-supervised\u0000learning to support semantically meaningful operations. Analogous to\u0000operational amplifiers, these \"operational latent spaces\" (OpLaS) not only\u0000demonstrate semantic structure such as clustering but also support common\u0000transformational operations with inherent semantic meaning. Some operational\u0000latent spaces are found to have arisen \"unintentionally\" in the progress toward\u0000some (other) self-supervised learning objective, in which unintended but still\u0000useful properties are discovered among the relationships of points in the\u0000space. Other spaces may be constructed \"intentionally\" by developers\u0000stipulating certain kinds of clustering or transformations intended to produce\u0000the desired structure. We focus on the intentional creation of operational\u0000latent spaces via self-supervised learning, including the introduction of\u0000rotation operators via a novel \"FiLMR\" layer, which can be used to enable\u0000ring-like symmetries found in some musical constructions.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Searching For Music Mixing Graphs: A Pruning Approach","authors":"Sungho Lee, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji","doi":"arxiv-2406.01049","DOIUrl":"https://doi.org/arxiv-2406.01049","url":null,"abstract":"Music mixing is compositional -- experts combine multiple audio processors to\u0000achieve a cohesive mix from dry source tracks. We propose a method to reverse\u0000engineer this process from the input and output audio. First, we create a\u0000mixing console that applies all available processors to every chain. Then,\u0000after the initial console parameter optimization, we alternate between removing\u0000redundant processors and fine-tuning. We achieve this through differentiable\u0000implementation of both processors and pruning. Consequently, we find a sparse\u0000mixing graph that achieves nearly identical matching quality of the full mixing\u0000console. We apply this procedure to dry-mix pairs from various datasets and\u0000collect graphs that also can be used to train neural networks for music mixing\u0000applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141254791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation","authors":"Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan","doi":"arxiv-2405.20289","DOIUrl":"https://doi.org/arxiv-2405.20289","url":null,"abstract":"Controllable music generation methods are critical for human-centered\u0000AI-based music creation, but are currently limited by speed, quality, and\u0000control design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in\u0000particular, offers state-of-the-art results, but is over 10x slower than\u0000real-time, limiting practical use. We propose Distilled Diffusion\u0000Inference-Time T -Optimization (or DITTO-2), a new method to speed up\u0000inference-time optimization-based control and unlock faster-than-real-time\u0000generation for a wide-variety of applications such as music inpainting,\u0000outpainting, intensity, melody, and musical structure control. Our method works\u0000by (1) distilling a pre-trained diffusion model for fast sampling via an\u0000efficient, modified consistency or consistency trajectory distillation process\u0000(2) performing inference-time optimization using our distilled model with\u0000one-step sampling as an efficient surrogate optimization task and (3) running a\u0000final multi-step sampling generation (decoding) using our estimated noise\u0000latents for best-quality, fast, controllable generation. Through thorough\u0000evaluation, we find our method not only speeds up generation over 10-20x, but\u0000simultaneously improves control adherence and generation quality all at once.\u0000Furthermore, we apply our approach to a new application of maximizing text\u0000adherence (CLAP score) and show we can convert an unconditional diffusion model\u0000without text inputs into a model that yields state-of-the-art text control.\u0000Sound examples can be found at https://ditto-music.github.io/ditto2/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141187830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLMs Meet Multimodal Generation and Editing: A Survey","authors":"Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen","doi":"arxiv-2405.19334","DOIUrl":"https://doi.org/arxiv-2405.19334","url":null,"abstract":"With the recent advancement in large language models (LLMs), there is a\u0000growing interest in combining LLMs with multimodal learning. Previous surveys\u0000of multimodal large language models (MLLMs) mainly focus on understanding. This\u0000survey elaborates on multimodal generation across different domains, including\u0000image, video, 3D, and audio, where we highlight the notable advancements with\u0000milestone works in these fields. Specifically, we exhaustively investigate the\u0000key technical components behind methods and multimodal datasets utilized in\u0000these studies. Moreover, we dig into tool-augmented multimodal agents that can\u0000use existing generative models for human-computer interaction. Lastly, we also\u0000comprehensively discuss the advancement in AI safety and investigate emerging\u0000applications as well as future prospects. Our work provides a systematic and\u0000insightful overview of multimodal generation, which is expected to advance the\u0000development of Artificial Intelligence for Generative Content (AIGC) and world\u0000models. A curated list of all related papers can be found at\u0000https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"214 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141528501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings","authors":"Tariq Adnan, Abdelrahman Abdelkader, Zipei Liu, Ekram Hossain, Sooyong Park, MD Saiful Islam, Ehsan Hoque","doi":"arxiv-2405.17206","DOIUrl":"https://doi.org/arxiv-2405.17206","url":null,"abstract":"We present a framework to recognize Parkinson's disease (PD) through an\u0000English pangram utterance speech collected using a web application from diverse\u0000recording settings and environments, including participants' homes. Our dataset\u0000includes a global cohort of 1306 participants, including 392 diagnosed with PD.\u0000Leveraging the diversity of the dataset, spanning various demographic\u0000properties (such as age, sex, and ethnicity), we used deep learning embeddings\u0000derived from semi-supervised models such as Wav2Vec 2.0, WavLM, and ImageBind\u0000representing the speech dynamics associated with PD. Our novel fusion model for\u0000PD classification, which aligns different speech embeddings into a cohesive\u0000feature space, demonstrated superior performance over standard\u0000concatenation-based fusion models and other baselines (including models built\u0000on traditional acoustic features). In a randomized data split configuration,\u0000the model achieved an Area Under the Receiver Operating Characteristic Curve\u0000(AUROC) of 88.94% and an accuracy of 85.65%. Rigorous statistical analysis\u0000confirmed that our model performs equitably across various demographic\u0000subgroups in terms of sex, ethnicity, and age, and remains robust regardless of\u0000disease duration. Furthermore, our model, when tested on two entirely unseen\u0000test datasets collected from clinical settings and from a PD care center,\u0000maintained AUROC scores of 82.12% and 78.44%, respectively. This affirms the\u0000model's robustness and it's potential to enhance accessibility and health\u0000equity in real-world applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141166241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing DMI Interactions by Integrating Haptic Feedback for Intricate Vibrato Technique","authors":"Ziyue Piao, Christian Frisson, Bavo Van Kerrebroeck, Marcelo M. Wanderley","doi":"arxiv-2405.10502","DOIUrl":"https://doi.org/arxiv-2405.10502","url":null,"abstract":"This paper investigates the integration of force feedback in Digital Musical\u0000Instruments (DMI), specifically evaluating the reproduction of intricate\u0000vibrato techniques using haptic feedback controllers. We introduce our system\u0000for vibrato modulation using force feedback, composed of Bend-aid (a web-based\u0000sequencer platform using pre-designed haptic feedback models) and TorqueTuner\u0000(an open-source 1 Degree-of-Freedom (DoF) rotary haptic device for generating\u0000programmable haptic effects). We designed a formal user study to assess the\u0000impact of each haptic mode on user experience in a vibrato mimicry task. Twenty\u0000musically trained participants rated their user experience for the three haptic\u0000modes (Smooth, Detent, and Spring) using four Likert-scale scores: comfort,\u0000flexibility, ease of control, and helpfulness for the task. Finally, we asked\u0000participants to share their reflections. Our research indicates that while the\u0000Spring mode can help with light vibrato, preferences for haptic modes vary\u0000based on musical training background. This emphasizes the need for adaptable\u0000task interfaces and flexible haptic feedback in DMI design.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparative Study of Recurrent Neural Networks for Virtual Analog Audio Effects Modeling","authors":"Riccardo Simionato, Stefano Fasciani","doi":"arxiv-2405.04124","DOIUrl":"https://doi.org/arxiv-2405.04124","url":null,"abstract":"Analog electronic circuits are at the core of an important category of\u0000musical devices. The nonlinear features of their electronic components give\u0000analog musical devices a distinctive timbre and sound quality, making them\u0000highly desirable. Artificial neural networks have rapidly gained popularity for\u0000the emulation of analog audio effects circuits, particularly recurrent\u0000networks. While neural approaches have been successful in accurately modeling\u0000distortion circuits, they require architectural improvements that account for\u0000parameter conditioning and low latency response. In this article, we explore\u0000the application of recent machine learning advancements for virtual analog\u0000modeling. We compare State Space models and Linear Recurrent Units against the\u0000more common Long Short Term Memory networks. These have shown promising ability\u0000in sequence to sequence modeling tasks, showing a notable improvement in signal\u0000history encoding. Our comparative study uses these black box neural modeling\u0000techniques with a variety of audio effects. We evaluate the performance and\u0000limitations using multiple metrics aiming to assess the models' ability to\u0000accurately replicate energy envelopes, frequency contents, and transients in\u0000the audio signal. To incorporate control parameters we employ the Feature wise\u0000Linear Modulation method. Long Short Term Memory networks exhibit better\u0000accuracy in emulating distortions and equalizers, while the State Space model,\u0000followed by Long Short Term Memory networks when integrated in an encoder\u0000decoder structure, outperforms others in emulating saturation and compression.\u0000When considering long time variant characteristics, the State Space model\u0000demonstrates the greatest accuracy. The Long Short Term Memory and, in\u0000particular, Linear Recurrent Unit networks present more tendency to introduce\u0000audio artifacts.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140927682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transhuman Ansambl - Voice Beyond Language","authors":"Lucija Ivsic, Jon McCormack, Vince Dziekan","doi":"arxiv-2405.03134","DOIUrl":"https://doi.org/arxiv-2405.03134","url":null,"abstract":"In this paper we present the design and development of the Transhuman\u0000Ansambl, a novel interactive singing-voice interface which senses its\u0000environment and responds to vocal input with vocalisations using human voice.\u0000Designed for live performance with a human performer and as a standalone sound\u0000installation, the ansambl consists of sixteen bespoke virtual singers arranged\u0000in a circle. When performing live, the virtual singers listen to the human\u0000performer and respond to their singing by reading pitch, intonation and volume\u0000cues. In a standalone sound installation mode, singers use ultrasonic distance\u0000sensors to sense audience presence. Developed as part of the 1st author's\u0000practice-based PhD and artistic practice as a live performer, this work employs\u0000the singing-voice to explore voice interactions in HCI beyond language, and\u0000innovative ways of live performing. How is technology supporting the effect of\u0000intimacy produced through voice? Does the act of surrounding the audience with\u0000responsive virtual singers challenge the traditional roles of\u0000performer-listener? To answer these questions, we draw upon the 1st author's\u0000experience with the system, and the interdisciplinary field of voice studies\u0000that consider the voice as the sound medium independent of language, capable of\u0000enacting a reciprocal connection between bodies.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"161 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Determined Multichannel Blind Source Separation with Clustered Source Model","authors":"Jianyu Wang, Shanzheng Guan","doi":"arxiv-2405.03118","DOIUrl":"https://doi.org/arxiv-2405.03118","url":null,"abstract":"The independent low-rank matrix analysis (ILRMA) method stands out as a\u0000prominent technique for multichannel blind audio source separation. It\u0000leverages nonnegative matrix factorization (NMF) and nonnegative canonical\u0000polyadic decomposition (NCPD) to model source parameters. While it effectively\u0000captures the low-rank structure of sources, the NMF model overlooks\u0000inter-channel dependencies. On the other hand, NCPD preserves intrinsic\u0000structure but lacks interpretable latent factors, making it challenging to\u0000incorporate prior information as constraints. To address these limitations, we\u0000introduce a clustered source model based on nonnegative block-term\u0000decomposition (NBTD). This model defines blocks as outer products of vectors\u0000(clusters) and matrices (for spectral structure modeling), offering\u0000interpretable latent vectors. Moreover, it enables straightforward integration\u0000of orthogonality constraints to ensure independence among source images.\u0000Experimental results demonstrate that our proposed method outperforms ILRMA and\u0000its extensions in anechoic conditions and surpasses the original ILRMA in\u0000simulated reverberant environments.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}