AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
Jinzuomu Zhong, Korin Richmond, Zhiba Su, Siqi Sun
arXiv:2409.09098 (arXiv - CS - Sound, 2024-09-13)
Abstract: While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation, which unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS through a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) Accent Identification (AID) with a 0.56 F1 score on unseen speakers. In the second stage, we condition the ZS-TTS system on speaker-agnostic accent embeddings extracted by the pretrained AID model. The proposed system achieves higher accent fidelity on inherent and cross-accent generation, and enables unseen accent generation.
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation
Ye Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang, Dongya Jia, Feihu La, Duc Le, Bochen Li, Chumin Li, Hui Li, Xingxing Li, Shouda Liu, Wei-Tsung Lu, Yiqing Lu, Andrew Shaw, Janne Spijkervet, Yakun Sun, Bo Wang, Ju-Chiang Wang, Yuping Wang, Yuxuan Wang, Ling Xu, Yifeng Yang, Chao Yao, Shuo Zhang, Yang Zhang, Yilin Zhang, Hang Zhao, Ziyi Zhao, Dejian Zhong, Shicen Zhou, Pei Zou
arXiv:2409.09214 (arXiv - CS - Sound, 2024-09-13)
Abstract: We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio. We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music.
Benchmarking Sub-Genre Classification For Mainstage Dance Music
Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li
arXiv:2409.06690 (arXiv - CS - Sound, 2024-09-10)
Abstract: Music classification, with a wide range of applications, is one of the most prominent tasks in music information retrieval. To address the absence of comprehensive datasets and high-performing methods for the classification of mainstage dance music, this work introduces a novel benchmark comprising a new dataset and a baseline. Our dataset extends the number of sub-genres to cover the most recent mainstage live sets by top DJs worldwide at music festivals. A continuous soft labeling approach is employed to account for tracks that span multiple sub-genres, preserving their inherent sophistication. For the baseline, we developed deep learning models that outperform current state-of-the-art multimodal language models, which struggle to identify house music sub-genres, emphasizing the need for specialized models trained on fine-grained datasets. Our benchmark can serve application scenarios such as music recommendation, DJ set curation, and interactive multimedia, for which we also provide video demos. Our code is at https://anonymous.4open.science/r/Mainstage-EDM-Benchmark/.
Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks
Kai Li, Khalid Zaman, Xingfeng Li, Masato Akagi, Masashi Unoki
arXiv:2409.05319 (arXiv - CS - Sound, 2024-09-09)
Abstract: Early detection of factory machinery malfunctions is crucial in industrial applications. In machine anomalous sound detection (ASD), different machines exhibit unique vibration-frequency ranges based on their physical properties. Meanwhile, the human auditory system is adept at tracking both the temporal and spectral dynamics of machine sounds. Consequently, integrating computational models of the human auditory system with machine-specific properties can be an effective approach to machine ASD. We first quantified the frequency importance of four types of machines using the Fisher ratio (F-ratio). The quantified frequency importance was then used to design machine-specific non-uniform filterbanks (NUFBs), which extract the log non-uniform spectrum (LNS) feature. The designed NUFBs have a narrower bandwidth and a higher filter distribution density in frequency regions with relatively high F-ratios. Finally, spectral and temporal modulation representations derived from the LNS feature were proposed. The proposed LNS feature and modulation representations are input to an autoencoder neural-network-based detector for ASD. Quantification results on the training set of the Malfunctioning Industrial Machine Investigation and Inspection dataset at a signal-to-noise ratio (SNR) of 6 dB reveal that the information distinguishing normal from anomalous sounds of different machines is encoded non-uniformly in the frequency domain. By highlighting these important frequency regions with NUFBs, the LNS feature significantly enhances performance in terms of AUC (area under the receiver operating characteristic curve) under various SNR conditions. Furthermore, the modulation representations further improve performance: temporal modulation is effective for fans, pumps, and sliders, while spectral modulation is particularly effective for valves.
{"title":"Harmonic Reasoning in Large Language Models","authors":"Anna Kruspe","doi":"arxiv-2409.05521","DOIUrl":"https://doi.org/arxiv-2409.05521","url":null,"abstract":"Large Language Models (LLMs) are becoming very popular and are used for many\u0000different purposes, including creative tasks in the arts. However, these models\u0000sometimes have trouble with specific reasoning tasks, especially those that\u0000involve logical thinking and counting. This paper looks at how well LLMs\u0000understand and reason when dealing with musical tasks like figuring out notes\u0000from intervals and identifying chords and scales. We tested GPT-3.5 and GPT-4o\u0000to see how they handle these tasks. Our results show that while LLMs do well\u0000with note intervals, they struggle with more complicated tasks like recognizing\u0000chords and scales. This points out clear limits in current LLM abilities and\u0000shows where we need to make them better, which could help improve how they\u0000think and work in both artistic and other complex areas. We also provide an\u0000automatically generated benchmark data set for the described tasks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of real-time transcriptions using end-to-end ASR models
Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso
arXiv:2409.05674 (arXiv - CS - Sound, 2024-09-09)
Abstract: Automatic Speech Recognition (ASR), or Speech-to-Text (STT), has greatly evolved in the last few years. Traditional pipeline-based architectures have been replaced by joint end-to-end (E2E) architectures that simplify and streamline model training. In addition, new AI training methods, such as weakly supervised learning, have reduced the need for high-quality audio datasets. However, despite all these advancements, little to no research has been done on real-time transcription. In real-time scenarios, the audio is not pre-recorded, and the input audio must be fragmented to be processed by the ASR systems. To meet real-time requirements, these fragments must be as short as possible to reduce latency. However, audio cannot be split at arbitrary points, since dividing an utterance into two separate fragments produces an incorrect transcription, and shorter fragments provide less context for the ASR model. For this reason, it is necessary to design and test different splitting algorithms to optimize the quality and delay of the resulting transcription. In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both transcription quality and end-to-end delay: fragmentation at fixed intervals, voice activity detection (VAD), and fragmentation with feedback. The results are compared to the performance of the same models without audio fragmentation to determine the effects of this division. The results show that VAD fragmentation provides the best quality with the highest delay, whereas fragmentation at fixed intervals provides the lowest quality and the lowest delay. Relative to VAD splitting, the newly proposed feedback algorithm trades a 2-4% increase in WER for a 1.5-2 s reduction in delay.
PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification
Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj
arXiv:2409.05799 (arXiv - CS - Sound, 2024-09-09)
Abstract: Speaker verification systems are crucial for authenticating identity through voice. Traditionally, these systems focus on comparing feature vectors, overlooking the speech's content. However, this paper challenges this by highlighting the importance of phonetic dominance, a measure of the frequency or duration of phonemes, as a crucial cue in speaker verification. A novel Phoneme Debiasing Attention Framework (PDAF) is introduced, integrating with existing attention frameworks to mitigate biases caused by phonetic dominance. PDAF adjusts the weighting for each phoneme and influences feature extraction, allowing for a more nuanced analysis of speech. This approach paves the way for more accurate and reliable identity authentication through voice. Furthermore, by employing various weighting strategies, we evaluate the influence of phonetic features on the efficacy of the speaker verification system.
Evaluating Neural Networks Architectures for Spring Reverb Modelling
Francesco Papaleo, Xavier Lizarraga-Seijas, Frederic Font
arXiv:2409.04953 (arXiv - CS - Sound, 2024-09-08)
Abstract: Reverberation is a key element in spatial audio perception, historically achieved with analogue devices such as plate and spring reverb, and in recent decades with digital signal processing techniques that have enabled different approaches to Virtual Analogue Modelling (VAM). The electromechanical functioning of the spring reverb makes it a nonlinear system that is difficult to fully emulate in the digital domain with white-box modelling techniques. In this study, we compare five different neural network architectures, including convolutional and recurrent models, to assess their effectiveness in replicating the characteristics of this audio effect. The evaluation is conducted on two datasets at sampling rates of 16 kHz and 48 kHz. This paper specifically focuses on neural audio architectures that offer parametric control, aiming to advance the boundaries of current black-box modelling techniques in the domain of spring reverberation.
{"title":"From Computation to Consumption: Exploring the Compute-Energy Link for Training and Testing Neural Networks for SED Systems","authors":"Constance Douwes, Romain Serizel","doi":"arxiv-2409.05080","DOIUrl":"https://doi.org/arxiv-2409.05080","url":null,"abstract":"The massive use of machine learning models, particularly neural networks, has\u0000raised serious concerns about their environmental impact. Indeed, over the last\u0000few years we have seen an explosion in the computing costs associated with\u0000training and deploying these systems. It is, therefore, crucial to understand\u0000their energy requirements in order to better integrate them into the evaluation\u0000of models, which has so far focused mainly on performance. In this paper, we\u0000study several neural network architectures that are key components of sound\u0000event detection systems, using an audio tagging task as an example. We measure\u0000the energy consumption for training and testing small to large architectures\u0000and establish complex relationships between the energy consumption, the number\u0000of floating-point operations, the number of parameters, and the GPU/memory\u0000utilization.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"73 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering of Indonesian and Western Gamelan Orchestras through Machine Learning of Performance Parameters","authors":"Simon Linke, Gerrit Wendt, Rolf Bader","doi":"arxiv-2409.03713","DOIUrl":"https://doi.org/arxiv-2409.03713","url":null,"abstract":"Indonesian and Western gamelan ensembles are investigated with respect to\u0000performance differences. Thereby, the often exotistic history of this music in\u0000the West might be reflected in contemporary tonal system, articulation, or\u0000large-scale form differences. Analyzing recordings of four Western and five\u0000Indonesian orchestras with respect to tonal systems and timbre features and\u0000using self-organizing Kohonen map (SOM) as a machine learning algorithm, a\u0000clear clustering between Indonesian and Western ensembles appears using certain\u0000psychoacoustic features. These point to a reduced articulation and large-scale\u0000form variability of Western ensembles compared to Indonesian ones. The SOM also\u0000clusters the ensembles with respect to their tonal systems, but no clusters\u0000between Indonesian and Western ensembles can be found in this respect.\u0000Therefore, a clear analogy between lower articulatory variability and\u0000large-scale form variation and a more exostistic, mediative and calm\u0000performance expectation and reception of gamelan in the West therefore appears.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}