arXiv - CS - Sound: Latest Publications

Guitar Chord Diagram Suggestion for Western Popular Music
arXiv - CS - Sound Pub Date: 2024-07-15 DOI: arxiv-2407.14260
Alexandre d'Hooge (LaBRI, SCRIME), Louis Bigo (LaBRI, SCRIME), Ken Déguernel, Nicolas Martin
{"title":"Guitar Chord Diagram Suggestion for Western Popular Music","authors":"Alexandre d'HoogeLaBRI, SCRIME, Louis BigoLaBRI, SCRIME, Ken Déguernel, Nicolas Martin","doi":"arxiv-2407.14260","DOIUrl":"https://doi.org/arxiv-2407.14260","url":null,"abstract":"Chord diagrams are used by guitar players to show where and how to play a\u0000chord on the fretboard. They are useful to beginners learning chords or for\u0000sharing the hand positions required to play a song.However, the diagrams\u0000presented on guitar learning toolsare usually selected from an existing\u0000databaseand rarely represent the actual positions used by performers.In this\u0000paper, we propose a tool which suggests a chord diagram for achord label,taking\u0000into account the diagram of the previous chord.Based on statistical analysis of\u0000the DadaGP and mySongBook datasets, we show that some chord diagrams are\u0000over-represented in western popular musicand that some chords can be played in\u0000more than 20 different ways.We argue that taking context into account can\u0000improve the variety and the quality of chord diagram suggestion, and compare\u0000this approach with a model taking only the current chord label into account.We\u0000show that adding previous context improves the F1-score on this task by up to\u000027% and reduces the propensity of the model to suggest standard open chords.We\u0000also define the notion of texture in the context of chord diagrams andshow\u0000through a variety of metrics that our model improves textureconsistencywith the\u0000previous diagram.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141745581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion
arXiv - CS - Sound Pub Date: 2024-07-15 DOI: arxiv-2407.10373
Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng
{"title":"Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion","authors":"Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng","doi":"arxiv-2407.10373","DOIUrl":"https://doi.org/arxiv-2407.10373","url":null,"abstract":"Visual acoustic matching (VAM) is pivotal for enhancing the immersive\u0000experience, and the task of dereverberation is effective in improving audio\u0000intelligibility. Existing methods treat each task independently, overlooking\u0000the inherent reciprocity between them. Moreover, these methods depend on paired\u0000training data, which is challenging to acquire, impeding the utilization of\u0000extensive unpaired data. In this paper, we introduce MVSD, a mutual learning\u0000framework based on diffusion models. MVSD considers the two tasks\u0000symmetrically, exploiting the reciprocal relationship to facilitate learning\u0000from inverse tasks and overcome data scarcity. Furthermore, we employ the\u0000diffusion model as foundational conditional converters to circumvent the\u0000training instability and over-smoothing drawbacks of conventional GAN\u0000architectures. Specifically, MVSD employs two converters: one for VAM called\u0000reverberator and one for dereverberation called dereverberator. The\u0000dereverberator judges whether the reverberation audio generated by reverberator\u0000sounds like being in the conditional visual scenario, and vice versa. By\u0000forming a closed loop, these two converters can generate informative feedback\u0000signals to optimize the inverse tasks, even with easily acquired one-way\u0000unpaired data. Extensive experiments on two standard benchmarks, i.e.,\u0000SoundSpaces-Speech and Acoustic AVSpeech, exhibit that our framework can\u0000improve the performance of the reverberator and dereverberator and better match\u0000specified visual scenarios.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DDFAD: Dataset Distillation Framework for Audio Data
arXiv - CS - Sound Pub Date: 2024-07-15 DOI: arxiv-2407.10446
Wenbo Jiang, Rui Zhang, Hongwei Li, Xiaoyuan Liu, Haomiao Yang, Shui Yu
{"title":"DDFAD: Dataset Distillation Framework for Audio Data","authors":"Wenbo Jiang, Rui Zhang, Hongwei Li, Xiaoyuan Liu, Haomiao Yang, Shui Yu","doi":"arxiv-2407.10446","DOIUrl":"https://doi.org/arxiv-2407.10446","url":null,"abstract":"Deep neural networks (DNNs) have achieved significant success in numerous\u0000applications. The remarkable performance of DNNs is largely attributed to the\u0000availability of massive, high-quality training datasets. However, processing\u0000such massive training data requires huge computational and storage resources.\u0000Dataset distillation is a promising solution to this problem, offering the\u0000capability to compress a large dataset into a smaller distilled dataset. The\u0000model trained on the distilled dataset can achieve comparable performance to\u0000the model trained on the whole dataset. While dataset distillation has been demonstrated in image data, none have\u0000explored dataset distillation for audio data. In this work, for the first time,\u0000we propose a Dataset Distillation Framework for Audio Data (DDFAD).\u0000Specifically, we first propose the Fused Differential MFCC (FD-MFCC) as\u0000extracted features for audio data. After that, the FD-MFCC is distilled through\u0000the matching training trajectory distillation method. Finally, we propose an\u0000audio signal reconstruction algorithm based on the Griffin-Lim Algorithm to\u0000reconstruct the audio signal from the distilled FD-MFCC. Extensive experiments\u0000demonstrate the effectiveness of DDFAD on various audio datasets. In addition,\u0000we show that DDFAD has promising application prospects in many applications,\u0000such as continual learning and neural architecture search.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features
arXiv - CS - Sound Pub Date: 2024-07-15 DOI: arxiv-2407.10462
Jing Luo, Xinyu Yang, Dorien Herremans
{"title":"BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features","authors":"Jing Luo, Xinyu Yang, Dorien Herremans","doi":"arxiv-2407.10462","DOIUrl":"https://doi.org/arxiv-2407.10462","url":null,"abstract":"Controllable music generation promotes the interaction between humans and\u0000composition systems by projecting the users' intent on their desired music. The\u0000challenge of introducing controllability is an increasingly important issue in\u0000the symbolic music generation field. When building controllable generative\u0000popular multi-instrument music systems, two main challenges typically present\u0000themselves, namely weak controllability and poor music quality. To address\u0000these issues, we first propose spatiotemporal features as powerful and\u0000fine-grained controls to enhance the controllability of the generative model.\u0000In addition, an efficient music representation called REMI_Track is designed to\u0000convert multitrack music into multiple parallel music sequences and shorten the\u0000sequence length of each track with Byte Pair Encoding (BPE) techniques.\u0000Subsequently, we release BandControlNet, a conditional model based on parallel\u0000Transformers, to tackle the multiple music sequences and generate high-quality\u0000music samples that are conditioned to the given spatiotemporal control\u0000features. More concretely, the two specially designed modules of\u0000BandControlNet, namely structure-enhanced self-attention (SE-SA) and\u0000Cross-Track Transformer (CTT), are utilized to strengthen the resulting musical\u0000structure and inter-track harmony modeling respectively. Experimental results\u0000tested on two popular music datasets of different lengths demonstrate that the\u0000proposed BandControlNet outperforms other conditional music generation models\u0000on most objective metrics in terms of fidelity and inference speed and shows\u0000great robustness in generating long music samples. The subjective evaluations\u0000show BandControlNet trained on short datasets can generate music with\u0000comparable quality to state-of-the-art models, while outperforming them\u0000significantly using longer datasets.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
arXiv - CS - Sound Pub Date: 2024-07-15 DOI: arxiv-2407.10387
Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
{"title":"Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity","authors":"Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà","doi":"arxiv-2407.10387","DOIUrl":"https://doi.org/arxiv-2407.10387","url":null,"abstract":"Video-to-audio (V2A) generation leverages visual-only video features to\u0000render plausible sounds that match the scene. Importantly, the generated sound\u0000onsets should match the visual actions that are aligned with them, otherwise\u0000unnatural synchronization artifacts arise. Recent works have explored the\u0000progression of conditioning sound generators on still images and then video\u0000features, focusing on quality and semantic matching while ignoring\u0000synchronization, or by sacrificing some amount of quality to focus on improving\u0000synchronization only. In this work, we propose a V2A generative model, named\u0000MaskVAT, that interconnects a full-band high-quality general audio codec with a\u0000sequence-to-sequence masked generative model. This combination allows modeling\u0000both high audio quality, semantic matching, and temporal synchronicity at the\u0000same time. Our results show that, by combining a high-quality codec with the\u0000proper pre-trained audio-visual features and a sequence-to-sequence parallel\u0000structure, we are able to yield highly synchronized results on one hand, whilst\u0000being competitive with the state of the art of non-codec generative audio\u0000models. Sample videos and generated audios are available at\u0000https://maskvat.github.io .","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification
arXiv - CS - Sound Pub Date: 2024-07-14 DOI: arxiv-2407.10048
Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie
{"title":"Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification","authors":"Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie","doi":"arxiv-2407.10048","DOIUrl":"https://doi.org/arxiv-2407.10048","url":null,"abstract":"Trained on 680,000 hours of massive speech data, Whisper is a multitasking,\u0000multilingual speech foundation model demonstrating superior performance in\u0000automatic speech recognition, translation, and language identification.\u0000However, its applicability in speaker verification (SV) tasks remains\u0000unexplored, particularly in low-data-resource scenarios where labeled speaker\u0000data in specific domains are limited. To fill this gap, we propose a\u0000lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV.\u0000Given that Whisper is not specifically optimized for SV tasks, we introduce a\u0000representation selection module to quantify the speaker-specific\u0000characteristics contained in each layer of Whisper and select the top-k layers\u0000with prominent discriminative speaker features. To aggregate pivotal\u0000speaker-related features while diminishing non-speaker redundancies across the\u0000selected top-k distinct layers of Whisper, we design a multi-layer aggregation\u0000module in Whisper-SV to integrate multi-layer representations into a singular,\u0000compacted representation for SV. In the multi-layer aggregation module, we\u0000employ convolutional layers with shortcut connections among different layers to\u0000refine speaker characteristics derived from multi-layer representations from\u0000Whisper. In addition, an attention aggregation layer is used to reduce\u0000non-speaker interference and amplify speaker-specific cues for SV tasks.\u0000Finally, a simple classification module is used for speaker classification.\u0000Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV\u0000achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively,\u0000showing superior performance in low-data-resource SV scenarios.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Interpretation Gap in Text-to-Music Generation Models
arXiv - CS - Sound Pub Date: 2024-07-14 DOI: arxiv-2407.10328
Yongyi Zang, Yixiao Zhang
{"title":"The Interpretation Gap in Text-to-Music Generation Models","authors":"Yongyi Zang, Yixiao Zhang","doi":"arxiv-2407.10328","DOIUrl":"https://doi.org/arxiv-2407.10328","url":null,"abstract":"Large-scale text-to-music generation models have significantly enhanced music\u0000creation capabilities, offering unprecedented creative freedom. However, their\u0000ability to collaborate effectively with human musicians remains limited. In\u0000this paper, we propose a framework to describe the musical interaction process,\u0000which includes expression, interpretation, and execution of controls. Following\u0000this framework, we argue that the primary gap between existing text-to-music\u0000models and musicians lies in the interpretation stage, where models lack the\u0000ability to interpret controls from musicians. We also propose two strategies to\u0000address this gap and call on the music information retrieval community to\u0000tackle the interpretation challenge to improve human-AI musical collaboration.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks
arXiv - CS - Sound Pub Date: 2024-07-10 DOI: arxiv-2407.08658
Lucca Emmanuel Pineli Simões, Lucas Brandão Rodrigues, Rafaela Mota Silva, Gustavo Rodrigues da Silva
{"title":"Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks","authors":"Lucca Emmanuel Pineli Simões, Lucas Brandão Rodrigues, Rafaela Mota Silva, Gustavo Rodrigues da Silva","doi":"arxiv-2407.08658","DOIUrl":"https://doi.org/arxiv-2407.08658","url":null,"abstract":"This paper presents the development and comparative evaluation of three voice\u0000command pipelines for controlling a Tello drone, using speech recognition and\u0000deep learning techniques. The aim is to enhance human-machine interaction by\u0000enabling intuitive voice control of drone actions. The pipelines developed\u0000include: (1) a traditional Speech-to-Text (STT) followed by a Large Language\u0000Model (LLM) approach, (2) a direct voice-to-function mapping model, and (3) a\u0000Siamese neural network-based system. Each pipeline was evaluated based on\u0000inference time, accuracy, efficiency, and flexibility. Detailed methodologies,\u0000dataset preparation, and evaluation metrics are provided, offering a\u0000comprehensive analysis of each pipeline's strengths and applicability across\u0000different scenarios.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141609941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Benchmark for Multi-speaker Anonymization
arXiv - CS - Sound Pub Date: 2024-07-08 DOI: arxiv-2407.05608
Xiaoxiao Miao, Ruijie Tao, Chang Zeng, Xin Wang
{"title":"A Benchmark for Multi-speaker Anonymization","authors":"Xiaoxiao Miao, Ruijie Tao, Chang Zeng, Xin Wang","doi":"arxiv-2407.05608","DOIUrl":"https://doi.org/arxiv-2407.05608","url":null,"abstract":"Privacy-preserving voice protection approaches primarily suppress\u0000privacy-related information derived from paralinguistic attributes while\u0000preserving the linguistic content. Existing solutions focus on single-speaker\u0000scenarios. However, they lack practicality for real-world applications, i.e.,\u0000multi-speaker scenarios. In this paper, we present an initial attempt to\u0000provide a multi-speaker anonymization benchmark by defining the task and\u0000evaluation protocol, proposing benchmarking solutions, and discussing the\u0000privacy leakage of overlapping conversations. Specifically, ideal multi-speaker\u0000anonymization should preserve the number of speakers and the turn-taking\u0000structure of the conversation, ensuring accurate context conveyance while\u0000maintaining privacy. To achieve that, a cascaded system uses speaker\u0000diarization to aggregate the speech of each speaker and speaker anonymization\u0000to conceal speaker privacy and preserve speech content. Additionally, we\u0000propose two conversation-level speaker vector anonymization methods to improve\u0000the utility further. Both methods aim to make the original and corresponding\u0000pseudo-speaker identities of each speaker unlinkable while preserving or even\u0000improving the distinguishability among pseudo-speakers in a conversation. The\u0000first method minimizes the differential similarity across speaker pairs in the\u0000original and anonymized conversations to maintain original speaker\u0000relationships in the anonymized version. The other method minimizes the\u0000aggregated similarity across anonymized speakers to achieve better\u0000differentiation between speakers. Experiments conducted on both non-overlap\u0000simulated and real-world datasets demonstrate the effectiveness of the\u0000multi-speaker anonymization system with the proposed speaker anonymizers.\u0000Additionally, we analyzed overlapping speech regarding privacy leakage and\u0000provide potential solutions.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141575856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MERGE -- A Bimodal Dataset for Static Music Emotion Recognition
arXiv - CS - Sound Pub Date: 2024-07-08 DOI: arxiv-2407.06060
Pedro Lima Louro, Hugo Redinho, Ricardo Santos, Ricardo Malheiro, Renato Panda, Rui Pedro Paiva
{"title":"MERGE -- A Bimodal Dataset for Static Music Emotion Recognition","authors":"Pedro Lima Louro, Hugo Redinho, Ricardo Santos, Ricardo Malheiro, Renato Panda, Rui Pedro Paiva","doi":"arxiv-2407.06060","DOIUrl":"https://doi.org/arxiv-2407.06060","url":null,"abstract":"The Music Emotion Recognition (MER) field has seen steady developments in\u0000recent years, with contributions from feature engineering, machine learning,\u0000and deep learning. The landscape has also shifted from audio-centric systems to\u0000bimodal ensembles that combine audio and lyrics. However, a severe lack of\u0000public and sizeable bimodal databases has hampered the development and\u0000improvement of bimodal audio-lyrics systems. This article proposes three new\u0000audio, lyrics, and bimodal MER research datasets, collectively called MERGE,\u0000created using a semi-automatic approach. To comprehensively assess the proposed\u0000datasets and establish a baseline for benchmarking, we conducted several\u0000experiments for each modality, using feature engineering, machine learning, and\u0000deep learning methodologies. In addition, we propose and validate fixed\u0000train-validate-test splits. The obtained results confirm the viability of the\u0000proposed datasets, achieving the best overall result of 79.21% F1-score for\u0000bimodal classification using a deep neural network.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141576058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0