arXiv - CS - Sound: Latest Publications

Benchmarking Sub-Genre Classification For Mainstage Dance Music
arXiv - CS - Sound Pub Date : 2024-09-10 DOI: arxiv-2409.06690
Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li
{"title":"Benchmarking Sub-Genre Classification For Mainstage Dance Music","authors":"Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li","doi":"arxiv-2409.06690","DOIUrl":"https://doi.org/arxiv-2409.06690","url":null,"abstract":"Music classification, with a wide range of applications, is one of the most\u0000prominent tasks in music information retrieval. To address the absence of\u0000comprehensive datasets and high-performing methods in the classification of\u0000mainstage dance music, this work introduces a novel benchmark comprising a new\u0000dataset and a baseline. Our dataset extends the number of sub-genres to cover\u0000most recent mainstage live sets by top DJs worldwide in music festivals. A\u0000continuous soft labeling approach is employed to account for tracks that span\u0000multiple sub-genres, preserving the inherent sophistication. For the baseline,\u0000we developed deep learning models that outperform current state-of-the-art\u0000multimodel language models, which struggle to identify house music sub-genres,\u0000emphasizing the need for specialized models trained on fine-grained datasets.\u0000Our benchmark is applicable to serve for application scenarios such as music\u0000recommendation, DJ set curation, and interactive multimedia, where we also\u0000provide video demos. Our code is on\u0000url{https://anonymous.4open.science/r/Mainstage-EDM-Benchmark/}.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
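The abstract above does not spell out how the continuous soft labels enter training; as a minimal sketch (assuming PyTorch, a placeholder sub-genre list, and invented label values), a classifier can simply be trained against a soft target distribution instead of a single hard class:

```python
import torch
import torch.nn.functional as F

SUB_GENRES = ["big_room", "progressive_house", "future_house", "techno"]  # placeholder taxonomy

# A track spanning several sub-genres gets a continuous soft label that sums to 1.
soft_label = torch.tensor([0.6, 0.3, 0.1, 0.0])               # hypothetical annotation
logits = torch.randn(1, len(SUB_GENRES), requires_grad=True)  # stand-in for a model's output

# Cross-entropy against the soft target distribution rather than a one-hot class.
loss = torch.sum(-soft_label * F.log_softmax(logits, dim=-1))
loss.backward()
print(f"soft-label cross-entropy: {loss.item():.3f}")
```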
Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks
arXiv - CS - Sound Pub Date : 2024-09-09 DOI: arxiv-2409.05319
Kai Li, Khalid Zaman, Xingfeng Li, Masato Akagi, Masashi Unoki
{"title":"Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks","authors":"Kai Li, Khalid Zaman, Xingfeng Li, Masato Akagi, Masashi Unoki","doi":"arxiv-2409.05319","DOIUrl":"https://doi.org/arxiv-2409.05319","url":null,"abstract":"Early detection of factory machinery malfunctions is crucial in industrial\u0000applications. In machine anomalous sound detection (ASD), different machines\u0000exhibit unique vibration-frequency ranges based on their physical properties.\u0000Meanwhile, the human auditory system is adept at tracking both temporal and\u0000spectral dynamics of machine sounds. Consequently, integrating the\u0000computational auditory models of the human auditory system with\u0000machine-specific properties can be an effective approach to machine ASD. We\u0000first quantified the frequency importances of four types of machines using the\u0000Fisher ratio (F-ratio). The quantified frequency importances were then used to\u0000design machine-specific non-uniform filterbanks (NUFBs), which extract the log\u0000non-uniform spectrum (LNS) feature. The designed NUFBs have a narrower\u0000bandwidth and higher filter distribution density in frequency regions with\u0000relatively high F-ratios. Finally, spectral and temporal modulation\u0000representations derived from the LNS feature were proposed. These proposed LNS\u0000feature and modulation representations are input into an autoencoder\u0000neural-network-based detector for ASD. The quantification results from the\u0000training set of the Malfunctioning Industrial Machine Investigation and\u0000Inspection dataset with a signal-to-noise (SNR) of 6 dB reveal that the\u0000distinguishing information between normal and anomalous sounds of different\u0000machines is encoded non-uniformly in the frequency domain. By highlighting\u0000these important frequency regions using NUFBs, the LNS feature can\u0000significantly enhance performance using the metric of AUC (area under the\u0000receiver operating characteristic curve) under various SNR conditions.\u0000Furthermore, modulation representations can further improve performance.\u0000Specifically, temporal modulation is effective for fans, pumps, and sliders,\u0000while spectral modulation is particularly effective for valves.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
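The paper describes the F-ratio only at a conceptual level; a rough sketch (NumPy, with log-magnitude spectrogram frames grouped by class, and class definitions that are an assumption here) of scoring per-bin frequency importance as between-class over within-class variance could look like this:

```python
import numpy as np

def fisher_ratio(spectra_by_class):
    """Per-frequency-bin F-ratio: variance of the class means of a bin's energy
    divided by the average within-class variance of that bin.

    spectra_by_class: list of (num_frames, num_bins) arrays of log-magnitude spectra,
    one array per class (e.g. per machine type, or normal vs. anomalous clips)."""
    class_means = np.stack([s.mean(axis=0) for s in spectra_by_class])  # (num_classes, num_bins)
    class_vars = np.stack([s.var(axis=0) for s in spectra_by_class])    # (num_classes, num_bins)
    between = class_means.var(axis=0)                                   # (num_bins,)
    within = class_vars.mean(axis=0) + 1e-12
    return between / within

# Hypothetical usage: bins with high F-ratio would get denser, narrower filters in the NUFB.
rng = np.random.default_rng(0)
classes = [rng.normal(loc=i, scale=1.0, size=(200, 257)) for i in range(4)]
importance = fisher_ratio(classes)
print(importance.argsort()[-5:])  # indices of the five most informative frequency bins
```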
Harmonic Reasoning in Large Language Models
arXiv - CS - Sound Pub Date : 2024-09-09 DOI: arxiv-2409.05521
Anna Kruspe
{"title":"Harmonic Reasoning in Large Language Models","authors":"Anna Kruspe","doi":"arxiv-2409.05521","DOIUrl":"https://doi.org/arxiv-2409.05521","url":null,"abstract":"Large Language Models (LLMs) are becoming very popular and are used for many\u0000different purposes, including creative tasks in the arts. However, these models\u0000sometimes have trouble with specific reasoning tasks, especially those that\u0000involve logical thinking and counting. This paper looks at how well LLMs\u0000understand and reason when dealing with musical tasks like figuring out notes\u0000from intervals and identifying chords and scales. We tested GPT-3.5 and GPT-4o\u0000to see how they handle these tasks. Our results show that while LLMs do well\u0000with note intervals, they struggle with more complicated tasks like recognizing\u0000chords and scales. This points out clear limits in current LLM abilities and\u0000shows where we need to make them better, which could help improve how they\u0000think and work in both artistic and other complex areas. We also provide an\u0000automatically generated benchmark data set for the described tasks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
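As an illustration of the kind of automatically generated benchmark item described above (note-from-interval questions), pitch-class arithmetic over semitones is enough to produce prompts and check answers; the question wording and interval set below are purely hypothetical, and enharmonic spelling is simplified:

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
INTERVALS = {"minor third": 3, "major third": 4, "perfect fifth": 7, "octave": 12}

def note_from_interval(root: str, interval: str) -> str:
    """Return the note reached by going up the given interval from the root."""
    return NOTES[(NOTES.index(root) + INTERVALS[interval]) % 12]

def make_question(root: str, interval: str) -> tuple[str, str]:
    prompt = f"Which note is a {interval} above {root}?"  # hypothetical phrasing
    return prompt, note_from_interval(root, interval)

q, answer = make_question("E", "perfect fifth")
print(q, "->", answer)  # Which note is a perfect fifth above E? -> B
```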
Evaluation of real-time transcriptions using end-to-end ASR models
arXiv - CS - Sound Pub Date : 2024-09-09 DOI: arxiv-2409.05674
Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso
{"title":"Evaluation of real-time transcriptions using end-to-end ASR models","authors":"Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso","doi":"arxiv-2409.05674","DOIUrl":"https://doi.org/arxiv-2409.05674","url":null,"abstract":"Automatic Speech Recognition (ASR) or Speech-to-text (STT) has greatly\u0000evolved in the last few years. Traditional architectures based on pipelines\u0000have been replaced by joint end-to-end (E2E) architectures that simplify and\u0000streamline the model training process. In addition, new AI training methods,\u0000such as weak-supervised learning have reduced the need for high-quality audio\u0000datasets for model training. However, despite all these advancements, little to\u0000no research has been done on real-time transcription. In real-time scenarios,\u0000the audio is not pre-recorded, and the input audio must be fragmented to be\u0000processed by the ASR systems. To achieve real-time requirements, these\u0000fragments must be as short as possible to reduce latency. However, audio cannot\u0000be split at any point as dividing an utterance into two separate fragments will\u0000generate an incorrect transcription. Also, shorter fragments provide less\u0000context for the ASR model. For this reason, it is necessary to design and test\u0000different splitting algorithms to optimize the quality and delay of the\u0000resulting transcription. In this paper, three audio splitting algorithms are\u0000evaluated with different ASR models to determine their impact on both the\u0000quality of the transcription and the end-to-end delay. The algorithms are\u0000fragmentation at fixed intervals, voice activity detection (VAD), and\u0000fragmentation with feedback. The results are compared to the performance of the\u0000same model, without audio fragmentation, to determine the effects of this\u0000division. The results show that VAD fragmentation provides the best quality\u0000with the highest delay, whereas fragmentation at fixed intervals provides the\u0000lowest quality and the lowest delay. The newly proposed feedback algorithm\u0000exchanges a 2-4% increase in WER for a reduction of 1.5-2s delay, respectively,\u0000to the VAD splitting.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
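The paper does not spell out its VAD implementation in the abstract; a simplified sketch (NumPy, with a plain short-term-energy detector rather than a production VAD, and thresholds that are illustrative only) of cutting a recording into fragments at silent regions might look like this:

```python
import numpy as np

def split_on_silence(audio, sr, frame_ms=30, energy_thresh=1e-4, min_silence_frames=10):
    """Split a mono signal into fragments at runs of low-energy frames.

    Returns a list of (start_sample, end_sample) pairs."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    energies = np.array([np.mean(audio[i * frame_len:(i + 1) * frame_len] ** 2)
                         for i in range(n_frames)])
    voiced = energies > energy_thresh

    segments, start, silence_run = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i * frame_len
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:  # pause long enough: close the fragment
                segments.append((start, (i - silence_run + 1) * frame_len))
                start, silence_run = None, 0
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments

sr = 16000
audio = np.concatenate([np.random.randn(sr), np.zeros(sr), np.random.randn(sr)]) * 0.1
print(split_on_silence(audio, sr))  # roughly two fragments separated by the silent second
```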
PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification
arXiv - CS - Sound Pub Date : 2024-09-09 DOI: arxiv-2409.05799
Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj
{"title":"PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification","authors":"Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj","doi":"arxiv-2409.05799","DOIUrl":"https://doi.org/arxiv-2409.05799","url":null,"abstract":"Speaker verification systems are crucial for authenticating identity through\u0000voice. Traditionally, these systems focus on comparing feature vectors,\u0000overlooking the speech's content. However, this paper challenges this by\u0000highlighting the importance of phonetic dominance, a measure of the frequency\u0000or duration of phonemes, as a crucial cue in speaker verification. A novel\u0000Phoneme Debiasing Attention Framework (PDAF) is introduced, integrating with\u0000existing attention frameworks to mitigate biases caused by phonetic dominance.\u0000PDAF adjusts the weighting for each phoneme and influences feature extraction,\u0000allowing for a more nuanced analysis of speech. This approach paves the way for\u0000more accurate and reliable identity authentication through voice. Furthermore,\u0000by employing various weighting strategies, we evaluate the influence of\u0000phonetic features on the efficacy of the speaker verification system.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
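PDAF itself is only described at a high level above; the sketch below (PyTorch, with invented phoneme IDs and a simple inverse-frequency weighting scheme) merely illustrates the general idea of re-weighting frame features by phoneme before pooling them into a speaker embedding, and is not the paper's actual framework:

```python
import torch

def phoneme_debiased_pooling(frames, phoneme_ids, num_phonemes):
    """Pool frame-level features into one utterance embedding, down-weighting
    phonemes that dominate the utterance (illustrative inverse-frequency scheme).

    frames: (T, D) frame features; phoneme_ids: (T,) aligned phoneme labels."""
    counts = torch.bincount(phoneme_ids, minlength=num_phonemes).float()
    inv_freq = 1.0 / counts.clamp(min=1.0)       # rarer phonemes get larger weights
    weights = inv_freq[phoneme_ids]
    weights = weights / weights.sum()
    return (weights.unsqueeze(1) * frames).sum(dim=0)  # (D,)

frames = torch.randn(200, 192)              # hypothetical 200 frames, 192-dim features
phoneme_ids = torch.randint(0, 40, (200,))  # hypothetical alignment over 40 phonemes
embedding = phoneme_debiased_pooling(frames, phoneme_ids, num_phonemes=40)
print(embedding.shape)  # torch.Size([192])
```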
Evaluating Neural Networks Architectures for Spring Reverb Modelling
arXiv - CS - Sound Pub Date : 2024-09-08 DOI: arxiv-2409.04953
Francesco Papaleo, Xavier Lizarraga-Seijas, Frederic Font
{"title":"Evaluating Neural Networks Architectures for Spring Reverb Modelling","authors":"Francesco Papaleo, Xavier Lizarraga-Seijas, Frederic Font","doi":"arxiv-2409.04953","DOIUrl":"https://doi.org/arxiv-2409.04953","url":null,"abstract":"Reverberation is a key element in spatial audio perception, historically\u0000achieved with the use of analogue devices, such as plate and spring reverb, and\u0000in the last decades with digital signal processing techniques that have allowed\u0000different approaches for Virtual Analogue Modelling (VAM). The\u0000electromechanical functioning of the spring reverb makes it a nonlinear system\u0000that is difficult to fully emulate in the digital domain with white-box\u0000modelling techniques. In this study, we compare five different neural network\u0000architectures, including convolutional and recurrent models, to assess their\u0000effectiveness in replicating the characteristics of this audio effect. The\u0000evaluation is conducted on two datasets at sampling rates of 16 kHz and 48 kHz.\u0000This paper specifically focuses on neural audio architectures that offer\u0000parametric control, aiming to advance the boundaries of current black-box\u0000modelling techniques in the domain of spring reverberation.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
From Computation to Consumption: Exploring the Compute-Energy Link for Training and Testing Neural Networks for SED Systems
arXiv - CS - Sound Pub Date : 2024-09-08 DOI: arxiv-2409.05080
Constance Douwes, Romain Serizel
{"title":"From Computation to Consumption: Exploring the Compute-Energy Link for Training and Testing Neural Networks for SED Systems","authors":"Constance Douwes, Romain Serizel","doi":"arxiv-2409.05080","DOIUrl":"https://doi.org/arxiv-2409.05080","url":null,"abstract":"The massive use of machine learning models, particularly neural networks, has\u0000raised serious concerns about their environmental impact. Indeed, over the last\u0000few years we have seen an explosion in the computing costs associated with\u0000training and deploying these systems. It is, therefore, crucial to understand\u0000their energy requirements in order to better integrate them into the evaluation\u0000of models, which has so far focused mainly on performance. In this paper, we\u0000study several neural network architectures that are key components of sound\u0000event detection systems, using an audio tagging task as an example. We measure\u0000the energy consumption for training and testing small to large architectures\u0000and establish complex relationships between the energy consumption, the number\u0000of floating-point operations, the number of parameters, and the GPU/memory\u0000utilization.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
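The abstract does not specify the measurement setup; one common way to approximate GPU energy for a training or test run is to poll board power draw via NVML and integrate it over time. The sketch below (assuming the third-party pynvml package and an NVIDIA GPU; the paper may well use a different method such as an external wattmeter) shows the idea:

```python
import threading
import time

import pynvml  # NVIDIA Management Library bindings; assumes an NVIDIA GPU is present

def measure_gpu_energy_joules(run_fn, poll_interval=0.1):
    """Approximate the GPU energy used by run_fn() by polling board power and integrating."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    done = threading.Event()
    energy = [0.0]  # accumulated joules

    def poll():
        last = time.time()
        while not done.is_set():
            time.sleep(poll_interval)
            now = time.time()
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # reported in milliwatts
            energy[0] += watts * (now - last)
            last = now

    worker = threading.Thread(target=poll)
    worker.start()
    try:
        run_fn()  # e.g. one training epoch or a full test pass
    finally:
        done.set()
        worker.join()
        pynvml.nvmlShutdown()
    return energy[0]

# Hypothetical usage:
#   joules = measure_gpu_energy_joules(lambda: train_one_epoch(model, loader))
#   print(f"~{joules / 3600:.2f} Wh")
```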
Clustering of Indonesian and Western Gamelan Orchestras through Machine Learning of Performance Parameters
arXiv - CS - Sound Pub Date : 2024-09-03 DOI: arxiv-2409.03713
Simon Linke, Gerrit Wendt, Rolf Bader
{"title":"Clustering of Indonesian and Western Gamelan Orchestras through Machine Learning of Performance Parameters","authors":"Simon Linke, Gerrit Wendt, Rolf Bader","doi":"arxiv-2409.03713","DOIUrl":"https://doi.org/arxiv-2409.03713","url":null,"abstract":"Indonesian and Western gamelan ensembles are investigated with respect to\u0000performance differences. Thereby, the often exotistic history of this music in\u0000the West might be reflected in contemporary tonal system, articulation, or\u0000large-scale form differences. Analyzing recordings of four Western and five\u0000Indonesian orchestras with respect to tonal systems and timbre features and\u0000using self-organizing Kohonen map (SOM) as a machine learning algorithm, a\u0000clear clustering between Indonesian and Western ensembles appears using certain\u0000psychoacoustic features. These point to a reduced articulation and large-scale\u0000form variability of Western ensembles compared to Indonesian ones. The SOM also\u0000clusters the ensembles with respect to their tonal systems, but no clusters\u0000between Indonesian and Western ensembles can be found in this respect.\u0000Therefore, a clear analogy between lower articulatory variability and\u0000large-scale form variation and a more exostistic, mediative and calm\u0000performance expectation and reception of gamelan in the West therefore appears.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
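The abstract does not name an SOM implementation; a minimal sketch (using the third-party MiniSom package, with placeholder psychoacoustic feature vectors invented here) of mapping ensemble recordings onto a small Kohonen grid could look like this:

```python
import numpy as np
from minisom import MiniSom  # assumes the `minisom` package is installed

rng = np.random.default_rng(0)
# Placeholder feature matrix: one row per recording, columns standing in for
# psychoacoustic descriptors (e.g. roughness, sharpness, articulation statistics).
features = rng.normal(size=(9, 12))
labels = ["indonesian"] * 5 + ["western"] * 4

som = MiniSom(4, 4, input_len=features.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(features, num_iteration=500)

# Each recording lands on a grid cell; clustering shows up as groups of same-label
# recordings occupying neighbouring cells on the map.
for label, vec in zip(labels, features):
    print(label, som.winner(vec))
```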
Sample-Efficient Diffusion for Text-To-Speech Synthesis
arXiv - CS - Sound Pub Date : 2024-09-01 DOI: arxiv-2409.03717
Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu
{"title":"Sample-Efficient Diffusion for Text-To-Speech Synthesis","authors":"Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu","doi":"arxiv-2409.03717","DOIUrl":"https://doi.org/arxiv-2409.03717","url":null,"abstract":"This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm\u0000for effective speech synthesis in modest data regimes through latent diffusion.\u0000It is based on a novel diffusion architecture, that we call U-Audio Transformer\u0000(U-AT), that efficiently scales to long sequences and operates in the latent\u0000space of a pre-trained audio autoencoder. Conditioned on character-aware\u0000language model representations, SESD achieves impressive results despite\u0000training on less than 1k hours of speech - far less than current\u0000state-of-the-art systems. In fact, it synthesizes more intelligible speech than\u0000the state-of-the-art auto-regressive model, VALL-E, while using less than 2%\u0000the training data.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation
arXiv - CS - Sound Pub Date : 2024-08-27 DOI: arxiv-2408.15002
Elona Shatri, George Fazekas
{"title":"Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation","authors":"Elona Shatri, George Fazekas","doi":"arxiv-2408.15002","DOIUrl":"https://doi.org/arxiv-2408.15002","url":null,"abstract":"Optical Music Recognition (OMR) automates the transcription of musical\u0000notation from images into machine-readable formats like MusicXML, MEI, or MIDI,\u0000significantly reducing the costs and time of manual transcription. This study\u0000explores knowledge discovery in OMR by applying instance segmentation using\u0000Mask R-CNN to enhance the detection and delineation of musical symbols in sheet\u0000music. Unlike Optical Character Recognition (OCR), OMR must handle the\u0000intricate semantics of Common Western Music Notation (CWMN), where symbol\u0000meanings depend on shape, position, and context. Our approach leverages\u0000instance segmentation to manage the density and overlap of musical symbols,\u0000facilitating more precise information retrieval from music scores. Evaluations\u0000on the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with\u0000our method achieving a mean Average Precision (mAP) of up to 59.70% in dense\u0000symbol environments, achieving comparable results to object detection.\u0000Furthermore, using traditional computer vision techniques, we add a parallel\u0000step for staff detection to infer the pitch for the recognised symbols. This\u0000study emphasises the role of pixel-wise segmentation in advancing accurate\u0000music symbol recognition, contributing to knowledge discovery in OMR. Our\u0000findings indicate that instance segmentation provides more precise\u0000representations of musical symbols, particularly in densely populated scores,\u0000advancing OMR technology. We make our implementation, pre-processing scripts,\u0000trained models, and evaluation results publicly available to support further\u0000research and development.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
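The authors release their own implementation; purely as orientation, the standard torchvision recipe for adapting a pre-trained Mask R-CNN to a custom set of symbol classes (the class count below is a placeholder, not the paper's) looks roughly like this:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 72  # placeholder: background + music-symbol categories of the target dataset

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box-classification head for the new label set.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Replace the mask head so per-instance masks are predicted for the same classes.
in_channels_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels_mask, 256, NUM_CLASSES)

# Training then follows the usual torchvision detection loop:
# for images, targets in data_loader:   # targets carry boxes, labels, and masks
#     losses = model(images, targets)
#     sum(losses.values()).backward()
```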