arXiv - CS - Sound: Latest Papers

Quranic Audio Dataset: Crowdsourced and Labeled Recitation from Non-Arabic Speakers
arXiv - CS - Sound Pub Date : 2024-05-04 DOI: arxiv-2405.02675
Raghad Salameh, Mohamad Al Mdfaa, Nursultan Askarbekuly, Manuel Mazzara
{"title":"Quranic Audio Dataset: Crowdsourced and Labeled Recitation from Non-Arabic Speakers","authors":"Raghad Salameh, Mohamad Al Mdfaa, Nursultan Askarbekuly, Manuel Mazzara","doi":"arxiv-2405.02675","DOIUrl":"https://doi.org/arxiv-2405.02675","url":null,"abstract":"This paper addresses the challenge of learning to recite the Quran for\u0000non-Arabic speakers. We explore the possibility of crowdsourcing a carefully\u0000annotated Quranic dataset, on top of which AI models can be built to simplify\u0000the learning process. In particular, we use the volunteer-based crowdsourcing\u0000genre and implement a crowdsourcing API to gather audio assets. We integrated\u0000the API into an existing mobile application called NamazApp to collect audio\u0000recitations. We developed a crowdsourcing platform called Quran Voice for\u0000annotating the gathered audio assets. As a result, we have collected around\u00007000 Quranic recitations from a pool of 1287 participants across more than 11\u0000non-Arabic countries, and we have annotated 1166 recitations from the dataset\u0000in six categories. We have achieved a crowd accuracy of 0.77, an inter-rater\u0000agreement of 0.63 between the annotators, and 0.89 between the labels assigned\u0000by the algorithm and the expert judgments.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
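The entry above reports an inter-rater agreement of 0.63 between annotators without naming the statistic. A minimal sketch, assuming Cohen's kappa over two annotators' categorical labels (the label values below are invented for illustration):

```python
# Sketch: pairwise inter-rater agreement on categorical recitation labels.
# Assumes Cohen's kappa; the abstract does not specify the agreement statistic.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same six categories (0-5).
annotator_a = [0, 2, 2, 5, 1, 3, 3, 0, 4, 2]
annotator_b = [0, 2, 1, 5, 1, 3, 2, 0, 4, 2]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa between annotators: {kappa:.2f}")
```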
Toward end-to-end interpretable convolutional neural networks for waveform signals
arXiv - CS - Sound Pub Date : 2024-05-03 DOI: arxiv-2405.01815
Linh Vu, Thu Tran, Wern-Han Lim, Raphael Phan
{"title":"Toward end-to-end interpretable convolutional neural networks for waveform signals","authors":"Linh Vu, Thu Tran, Wern-Han Lim, Raphael Phan","doi":"arxiv-2405.01815","DOIUrl":"https://doi.org/arxiv-2405.01815","url":null,"abstract":"This paper introduces a novel convolutional neural networks (CNN) framework\u0000tailored for end-to-end audio deep learning models, presenting advancements in\u0000efficiency and explainability. By benchmarking experiments on three standard\u0000speech emotion recognition datasets with five-fold cross-validation, our\u0000framework outperforms Mel spectrogram features by up to seven percent. It can\u0000potentially replace the Mel-Frequency Cepstral Coefficients (MFCC) while\u0000remaining lightweight. Furthermore, we demonstrate the efficiency and\u0000interpretability of the front-end layer using the PhysioNet Heart Sound\u0000Database, illustrating its ability to handle and capture intricate long\u0000waveform patterns. Our contributions offer a portable solution for building\u0000efficient and interpretable models for raw waveform data.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
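The abstract above describes a learnable front-end layer that operates directly on raw waveforms in place of Mel/MFCC features. A minimal PyTorch sketch of that general idea (the single strided convolution and the layer sizes are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class WaveformFrontEnd(nn.Module):
    """Learnable filterbank-style front end applied to raw audio (sketch only)."""
    def __init__(self, n_filters=64, kernel_size=401, stride=160):
        super().__init__()
        # One strided 1-D convolution stands in for a fixed Mel filterbank.
        self.conv = nn.Conv1d(1, n_filters, kernel_size, stride=stride,
                              padding=kernel_size // 2)

    def forward(self, waveform):            # waveform: (batch, samples)
        x = waveform.unsqueeze(1)           # -> (batch, 1, samples)
        x = self.conv(x)
        return torch.log1p(x.abs())         # compressed magnitude, spectrogram-like

front_end = WaveformFrontEnd()
features = front_end(torch.randn(2, 16000))  # two 1-second clips at 16 kHz
print(features.shape)                        # (2, 64, 100) frame-level features
```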
Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models
arXiv - CS - Sound Pub Date : 2024-05-03 DOI: arxiv-2405.02179
Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva
{"title":"Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models","authors":"Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva","doi":"arxiv-2405.02179","DOIUrl":"https://doi.org/arxiv-2405.02179","url":null,"abstract":"Generalization is a main issue for current audio deepfake detectors, which\u0000struggle to provide reliable results on out-of-distribution data. Given the\u0000speed at which more and more accurate synthesis methods are developed, it is\u0000very important to design techniques that work well also on data they were not\u0000trained for. In this paper we study the potential of large-scale pre-trained\u0000models for audio deepfake detection, with special focus on generalization\u0000ability. To this end, the detection problem is reformulated in a speaker\u0000verification framework and fake audios are exposed by the mismatch between the\u0000voice sample under test and the voice of the claimed identity. With this\u0000paradigm, no fake speech sample is necessary in training, cutting off any link\u0000with the generation method at the root, and ensuring full generalization\u0000ability. Features are extracted by general-purpose large pre-trained models,\u0000with no need for training or fine-tuning on specific fake detection or speaker\u0000verification datasets. At detection time only a limited set of voice fragments\u0000of the identity under test is required. Experiments on several datasets\u0000widespread in the community show that detectors based on pre-trained models\u0000achieve excellent performance and show strong generalization ability, rivaling\u0000supervised methods on in-distribution data and largely overcoming them on\u0000out-of-distribution data.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
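The detection paradigm described above compares the speaker embedding of the sample under test with reference embeddings of the claimed identity; a large mismatch exposes the sample as fake. A minimal sketch using cosine similarity over pre-computed embeddings (the random embeddings and the threshold are placeholders, not the paper's setup):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_fake(test_embedding, reference_embeddings, threshold=0.6):
    """Flag the sample as fake if it is too dissimilar from all enrolled references.

    Real embeddings would come from a general-purpose pre-trained speaker model;
    here they are random placeholders and the threshold is purely illustrative.
    """
    scores = [cosine_similarity(test_embedding, ref) for ref in reference_embeddings]
    return max(scores) < threshold, max(scores)

rng = np.random.default_rng(0)
references = [rng.standard_normal(192) for _ in range(5)]   # enrolled voice fragments
test = rng.standard_normal(192)                             # sample under test
fake, score = is_fake(test, references)
print(f"best similarity = {score:.2f}, flagged as fake = {fake}")
```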
GMP-ATL: Gender-augmented Multi-scale Pseudo-label Enhanced Adaptive Transfer Learning for Speech Emotion Recognition via HuBERT
arXiv - CS - Sound Pub Date : 2024-05-03 DOI: arxiv-2405.02151
Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao
{"title":"GMP-ATL: Gender-augmented Multi-scale Pseudo-label Enhanced Adaptive Transfer Learning for Speech Emotion Recognition via HuBERT","authors":"Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao","doi":"arxiv-2405.02151","DOIUrl":"https://doi.org/arxiv-2405.02151","url":null,"abstract":"The continuous evolution of pre-trained speech models has greatly advanced\u0000Speech Emotion Recognition (SER). However, there is still potential for\u0000enhancement in the performance of these methods. In this paper, we present\u0000GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning),\u0000a novel HuBERT-based adaptive transfer learning framework for SER.\u0000Specifically, GMP-ATL initially employs the pre-trained HuBERT, implementing\u0000multi-task learning and multi-scale k-means clustering to acquire frame-level\u0000gender-augmented multi-scale pseudo-labels. Then, to fully leverage both\u0000obtained frame-level and utterance-level emotion labels, we incorporate model\u0000retraining and fine-tuning methods to further optimize GMP-ATL. Experiments on\u0000IEMOCAP show that our GMP-ATL achieves superior recognition performance, with a\u0000WAR of 80.0% and a UAR of 82.0%, surpassing state-of-the-art unimodal SER\u0000methods, while also yielding comparable results with multimodal SER approaches.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
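GMP-ATL derives frame-level pseudo-labels by clustering features at several granularities. A rough sketch of multi-scale k-means over frame-level features (the random features and cluster counts are chosen for illustration; the paper's actual features come from HuBERT):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frame_features = rng.standard_normal((500, 768))    # placeholder for HuBERT frame outputs

# "Multi-scale": run k-means at several cluster counts and keep one label set per scale.
pseudo_labels = {}
for k in (50, 100, 200):                             # illustrative scales
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(frame_features)
    pseudo_labels[k] = km.labels_                    # frame-level pseudo-label per scale

print({k: labels[:5] for k, labels in pseudo_labels.items()})
```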
Can We Identify Unknown Audio Recording Environments in Forensic Scenarios?
arXiv - CS - Sound Pub Date : 2024-05-03 DOI: arxiv-2405.02119
Denise Moussa, Germans Hirsch, Christian Riess
{"title":"Can We Identify Unknown Audio Recording Environments in Forensic Scenarios?","authors":"Denise Moussa, Germans Hirsch, Christian Riess","doi":"arxiv-2405.02119","DOIUrl":"https://doi.org/arxiv-2405.02119","url":null,"abstract":"Audio recordings may provide important evidence in criminal investigations.\u0000One such case is the forensic association of the recorded audio to the\u0000recording location. For example, a voice message may be the only investigative\u0000cue to narrow down the candidate sites for a crime. Up to now, several works\u0000provide tools for closed-set recording environment classification under\u0000relatively clean recording conditions. However, in forensic investigations, the\u0000candidate locations are case-specific. Thus, closed-set tools are not\u0000applicable without retraining on a sufficient amount of training samples for\u0000each case and respective candidate set. In addition, a forensic tool has to\u0000deal with audio material from uncontrolled sources with variable properties and\u0000quality. In this work, we therefore attempt a major step towards practical forensic\u0000application scenarios. We propose a representation learning framework called\u0000EnvId, short for environment identification. EnvId avoids case-specific\u0000retraining. Instead, it is the first tool for robust few-shot classification of\u0000unseen environment locations. We demonstrate that EnvId can handle forensically\u0000challenging material. It provides good quality predictions even under unseen\u0000signal degradations, environment characteristics or recording position\u0000mismatches. Our code and datasets will be made publicly available upon acceptance.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
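EnvId performs few-shot classification of unseen recording environments from learned representations; the abstract does not detail the classifier. A minimal sketch of one common few-shot scheme, nearest-class-prototype matching over embeddings (the embeddings here are random placeholders, not EnvId features):

```python
import numpy as np

def nearest_prototype(query, support_embeddings, support_labels):
    """Assign the query to the class whose mean support embedding is closest."""
    prototypes = {
        label: np.mean([e for e, l in zip(support_embeddings, support_labels) if l == label], axis=0)
        for label in set(support_labels)
    }
    return min(prototypes, key=lambda label: np.linalg.norm(query - prototypes[label]))

rng = np.random.default_rng(1)
support = [rng.standard_normal(128) for _ in range(6)]   # a few labelled clips per candidate site
labels = ["kitchen", "kitchen", "garage", "garage", "car", "car"]
query = support[2] + 0.05 * rng.standard_normal(128)     # noisy clip from one of the sites
print(nearest_prototype(query, support, labels))         # expected: "garage"
```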
Joint sentiment analysis of lyrics and audio in music
arXiv - CS - Sound Pub Date : 2024-05-03 DOI: arxiv-2405.01988
Lea Schaab, Anna Kruspe
{"title":"Joint sentiment analysis of lyrics and audio in music","authors":"Lea Schaab, Anna Kruspe","doi":"arxiv-2405.01988","DOIUrl":"https://doi.org/arxiv-2405.01988","url":null,"abstract":"Sentiment or mood can express themselves on various levels in music. In\u0000automatic analysis, the actual audio data is usually analyzed, but the lyrics\u0000can also play a crucial role in the perception of moods. We first evaluate\u0000various models for sentiment analysis based on lyrics and audio separately. The\u0000corresponding approaches already show satisfactory results, but they also\u0000exhibit weaknesses, the causes of which we examine in more detail. Furthermore,\u0000different approaches to combining the audio and lyrics results are proposed and\u0000evaluated. Considering both modalities generally leads to improved performance.\u0000We investigate misclassifications and (also intentional) contradictions between\u0000audio and lyrics sentiment more closely, and identify possible causes. Finally,\u0000we address fundamental problems in this research area, such as high\u0000subjectivity, lack of data, and inconsistency in emotion taxonomies.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
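One straightforward way to combine audio and lyrics predictions, in the spirit of the combination approaches evaluated above, is late fusion of class probabilities. A toy sketch (the probabilities and the weighting are purely illustrative; the paper's fusion strategies may differ):

```python
import numpy as np

def late_fusion(audio_probs, lyrics_probs, audio_weight=0.5):
    """Weighted average of per-class probabilities from the two modalities."""
    fused = audio_weight * np.asarray(audio_probs) + (1 - audio_weight) * np.asarray(lyrics_probs)
    return fused / fused.sum()

classes = ["negative", "neutral", "positive"]
audio = [0.6, 0.3, 0.1]    # hypothetical audio-only sentiment distribution
lyrics = [0.1, 0.2, 0.7]   # hypothetical lyrics-only distribution (a deliberate contradiction)
fused = late_fusion(audio, lyrics, audio_weight=0.4)
print(classes[int(np.argmax(fused))], fused.round(2))
```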
Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets
arXiv - CS - Sound Pub Date : 2024-05-03 DOI: arxiv-2405.02132
Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie
{"title":"Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets","authors":"Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie","doi":"arxiv-2405.02132","DOIUrl":"https://doi.org/arxiv-2405.02132","url":null,"abstract":"Large Language Models (LLMs) have demonstrated unparalleled effectiveness in\u0000various NLP tasks, and integrating LLMs with automatic speech recognition (ASR)\u0000is becoming a mainstream paradigm. Building upon this momentum, our research\u0000delves into an in-depth examination of this paradigm on a large open-source\u0000Chinese dataset. Specifically, our research aims to evaluate the impact of\u0000various configurations of speech encoders, LLMs, and projector modules in the\u0000context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we\u0000introduce a three-stage training approach, expressly developed to enhance the\u0000model's ability to align auditory and textual information. The implementation\u0000of this approach, alongside the strategic integration of ASR components,\u0000enabled us to achieve the SOTA performance on the AISHELL-1, Test_Net, and\u0000Test_Meeting test sets. Our analysis presents an empirical foundation for\u0000future research in LLM-based ASR systems and offers insights into optimizing\u0000performance using Chinese datasets. We will publicly release all scripts used\u0000for data preparation, training, inference, and scoring, as well as pre-trained\u0000models and training logs to promote reproducible research.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
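In the speech encoder → projector → LLM paradigm examined above, the projector maps (and typically downsamples) encoder frames into the LLM's embedding space. A minimal PyTorch sketch (the dimensions and the frame-stacking downsampler are assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Downsample speech-encoder frames and project them to the LLM embedding size."""
    def __init__(self, encoder_dim=1024, llm_dim=4096, downsample=4):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim * downsample, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frames):                       # frames: (batch, T, encoder_dim)
        b, t, d = frames.shape
        t = t - (t % self.downsample)                # drop trailing frames
        stacked = frames[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(stacked)                    # (batch, T/downsample, llm_dim)

tokens = Projector()(torch.randn(2, 100, 1024))      # pseudo-tokens fed to the LLM
print(tokens.shape)                                  # torch.Size([2, 25, 4096])
```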
Real-time multichannel deep speech enhancement in hearing aids: Comparing monaural and binaural processing in complex acoustic scenarios
arXiv - CS - Sound Pub Date : 2024-05-03 DOI: arxiv-2405.01967
Nils L. Westhausen, Hendrik Kayser, Theresa Jansen, Bernd T. Meyer
{"title":"Real-time multichannel deep speech enhancement in hearing aids: Comparing monaural and binaural processing in complex acoustic scenarios","authors":"Nils L. Westhausen, Hendrik Kayser, Theresa Jansen, Bernd T. Meyer","doi":"arxiv-2405.01967","DOIUrl":"https://doi.org/arxiv-2405.01967","url":null,"abstract":"Deep learning has the potential to enhance speech signals and increase their\u0000intelligibility for users of hearing aids. Deep models suited for real-world\u0000application should feature a low computational complexity and low processing\u0000delay of only a few milliseconds. In this paper, we explore deep speech\u0000enhancement that matches these requirements and contrast monaural and binaural\u0000processing algorithms in two complex acoustic scenes. Both algorithms are\u0000evaluated with objective metrics and in experiments with hearing-impaired\u0000listeners performing a speech-in-noise test. Results are compared to two\u0000traditional enhancement strategies, i.e., adaptive differential microphone\u0000processing and binaural beamforming. While in diffuse noise, all algorithms\u0000perform similarly, the binaural deep learning approach performs best in the\u0000presence of spatial interferers. Through a post-analysis, this can be\u0000attributed to improvements at low SNRs and to precise spatial filtering.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
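A processing delay of only a few milliseconds constrains the block sizes of any frame-based enhancement algorithm. A small sketch of how frame length and look-ahead translate into algorithmic latency (the values and the simple frame-plus-look-ahead approximation are illustrative, not the paper's settings):

```python
def algorithmic_latency_ms(frame_len, lookahead, sample_rate=16000):
    """Approximate latency of block-based processing: one full frame plus look-ahead samples."""
    return 1000.0 * (frame_len + lookahead) / sample_rate

# Illustrative configurations at 16 kHz.
for frame_len, lookahead in [(512, 0), (128, 0), (32, 16)]:
    print(f"frame={frame_len:4d}, lookahead={lookahead:3d} -> "
          f"{algorithmic_latency_ms(frame_len, lookahead):5.1f} ms")
```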
Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model
arXiv - CS - Sound Pub Date : 2024-05-02 DOI: arxiv-2405.01730
Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman
{"title":"Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model","authors":"Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman","doi":"arxiv-2405.01730","DOIUrl":"https://doi.org/arxiv-2405.01730","url":null,"abstract":"Expressive voice conversion (VC) conducts speaker identity conversion for\u0000emotional speakers by jointly converting speaker identity and emotional style.\u0000Emotional style modeling for arbitrary speakers in expressive VC has not been\u0000extensively explored. Previous approaches have relied on vocoders for speech\u0000reconstruction, which makes speech quality heavily dependent on the performance\u0000of vocoders. A major challenge of expressive VC lies in emotion prosody\u0000modeling. To address these challenges, this paper proposes a fully end-to-end\u0000expressive VC framework based on a conditional denoising diffusion\u0000probabilistic model (DDPM). We utilize speech units derived from\u0000self-supervised speech models as content conditioning, along with deep features\u0000extracted from speech emotion recognition and speaker verification systems to\u0000model emotional style and speaker identity. Objective and subjective\u0000evaluations show the effectiveness of our framework. Codes and samples are\u0000publicly available.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
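The framework above conditions a denoising diffusion model on content units, emotion features, and speaker features. A toy sketch of one conditional DDPM training step in PyTorch (the network, noise schedule, and feature dimensions are placeholders; this is not the paper's model):

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Predicts the added noise from the noisy target plus conditioning features."""
    def __init__(self, feat_dim=80, cond_dim=256 + 128 + 192, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, noisy, cond, t):
        return self.net(torch.cat([noisy, cond, t], dim=-1))

# Placeholder conditioning: content units, emotion embedding, speaker embedding.
content, emotion, speaker = torch.randn(8, 256), torch.randn(8, 128), torch.randn(8, 192)
cond = torch.cat([content, emotion, speaker], dim=-1)

x0 = torch.randn(8, 80)                              # clean acoustic frame (placeholder)
t = torch.rand(8, 1)                                 # continuous diffusion time in [0, 1]
alpha = torch.cos(0.5 * torch.pi * t)                # toy noise schedule
noise = torch.randn_like(x0)
x_t = alpha * x0 + (1 - alpha**2).sqrt() * noise     # forward-noised sample

model = ConditionalDenoiser()
loss = nn.functional.mse_loss(model(x_t, cond, t), noise)
loss.backward()
print(f"denoising loss: {loss.item():.3f}")
```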
USAT: A Universal Speaker-Adaptive Text-to-Speech Approach
arXiv - CS - Sound Pub Date : 2024-04-28 DOI: arxiv-2404.18094
Wenbin Wang, Yang Song, Sanjay Jha
{"title":"USAT: A Universal Speaker-Adaptive Text-to-Speech Approach","authors":"Wenbin Wang, Yang Song, Sanjay Jha","doi":"arxiv-2404.18094","DOIUrl":"https://doi.org/arxiv-2404.18094","url":null,"abstract":"Conventional text-to-speech (TTS) research has predominantly focused on\u0000enhancing the quality of synthesized speech for speakers in the training\u0000dataset. The challenge of synthesizing lifelike speech for unseen,\u0000out-of-dataset speakers, especially those with limited reference data, remains\u0000a significant and unresolved problem. While zero-shot or few-shot\u0000speaker-adaptive TTS approaches have been explored, they have many limitations.\u0000Zero-shot approaches tend to suffer from insufficient generalization\u0000performance to reproduce the voice of speakers with heavy accents. While\u0000few-shot methods can reproduce highly varying accents, they bring a significant\u0000storage burden and the risk of overfitting and catastrophic forgetting. In\u0000addition, prior approaches only provide either zero-shot or few-shot\u0000adaptation, constraining their utility across varied real-world scenarios with\u0000different demands. Besides, most current evaluations of speaker-adaptive TTS\u0000are conducted only on datasets of native speakers, inadvertently neglecting a\u0000vast portion of non-native speakers with diverse accents. Our proposed\u0000framework unifies both zero-shot and few-shot speaker adaptation strategies,\u0000which we term as \"instant\" and \"fine-grained\" adaptations based on their\u0000merits. To alleviate the insufficient generalization performance observed in\u0000zero-shot speaker adaptation, we designed two innovative discriminators and\u0000introduced a memory mechanism for the speech decoder. To prevent catastrophic\u0000forgetting and reduce storage implications for few-shot speaker adaptation, we\u0000designed two adapters and a unique adaptation procedure.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140842460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
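The "fine-grained" few-shot adaptation described above relies on lightweight adapters so that per-speaker storage stays small and the backbone is not overwritten. A minimal sketch of a generic bottleneck residual adapter in PyTorch (the structure and sizes are generic assumptions, not the paper's two adapter designs):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual module added to a frozen TTS backbone; only this part is trained per speaker."""
    def __init__(self, dim=512, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()
        nn.init.zeros_(self.up.weight)       # start as identity so the backbone is unchanged at first
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden):
        return hidden + self.up(self.act(self.down(hidden)))

adapter = BottleneckAdapter()
out = adapter(torch.randn(2, 100, 512))      # (batch, frames, hidden) from a frozen decoder layer
trainable = sum(p.numel() for p in adapter.parameters())
print(out.shape, f"trainable parameters per speaker: {trainable}")
```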