AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
Jinzuomu Zhong, Korin Richmond, Zhiba Su, Siqi Sun
arXiv:2409.09098 (arXiv - CS - Sound, 2024-09-13)
Abstract: While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation, which unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS through a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) Accent Identification (AID) with a 0.56 F1 score on unseen speakers. In the second stage, we condition the ZS-TTS system on speaker-agnostic accent embeddings extracted by the pretrained AID model. The proposed system achieves higher accent fidelity on inherent and cross-accent generation, and enables unseen accent generation.
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation
Ye Bai, Haonan Chen, Jitong Chen, Zhuo Chen, Yi Deng, Xiaohong Dong, Lamtharn Hantrakul, Weituo Hao, Qingqing Huang, Zhongyi Huang, Dongya Jia, Feihu La, Duc Le, Bochen Li, Chumin Li, Hui Li, Xingxing Li, Shouda Liu, Wei-Tsung Lu, Yiqing Lu, Andrew Shaw, Janne Spijkervet, Yakun Sun, Bo Wang, Ju-Chiang Wang, Yuping Wang, Yuxuan Wang, Ling Xu, Yifeng Yang, Chao Yao, Shuo Zhang, Yang Zhang, Yilin Zhang, Hang Zhao, Ziyi Zhao, Dejian Zhong, Shicen Zhou, Pei Zou
arXiv:2409.09214 (arXiv - CS - Sound, 2024-09-13)
Abstract: We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio. We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music.
Benchmarking Sub-Genre Classification For Mainstage Dance Music
Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li
arXiv:2409.06690 (arXiv - CS - Sound, 2024-09-10)
Abstract: Music classification, with a wide range of applications, is one of the most prominent tasks in music information retrieval. To address the absence of comprehensive datasets and high-performing methods for the classification of mainstage dance music, this work introduces a novel benchmark comprising a new dataset and a baseline. Our dataset extends the number of sub-genres to cover the most recent mainstage live sets by top DJs worldwide at music festivals. A continuous soft labeling approach is employed to account for tracks that span multiple sub-genres, preserving their inherent sophistication. For the baseline, we developed deep learning models that outperform current state-of-the-art multimodal language models, which struggle to identify house music sub-genres, emphasizing the need for specialized models trained on fine-grained datasets. Our benchmark can serve application scenarios such as music recommendation, DJ set curation, and interactive multimedia, for which we also provide video demos. Our code is at https://anonymous.4open.science/r/Mainstage-EDM-Benchmark/.
Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks
Kai Li, Khalid Zaman, Xingfeng Li, Masato Akagi, Masashi Unoki
arXiv:2409.05319 (arXiv - CS - Sound, 2024-09-09)
Abstract: Early detection of factory machinery malfunctions is crucial in industrial applications. In machine anomalous sound detection (ASD), different machines exhibit unique vibration-frequency ranges based on their physical properties. Meanwhile, the human auditory system is adept at tracking both the temporal and spectral dynamics of machine sounds. Consequently, integrating computational models of the human auditory system with machine-specific properties can be an effective approach to machine ASD. We first quantified the frequency importance of four types of machines using the Fisher ratio (F-ratio). The quantified frequency importance was then used to design machine-specific non-uniform filterbanks (NUFBs), which extract the log non-uniform spectrum (LNS) feature. The designed NUFBs have a narrower bandwidth and a higher filter distribution density in frequency regions with relatively high F-ratios. Finally, spectral and temporal modulation representations derived from the LNS feature were proposed. The proposed LNS feature and modulation representations are input to an autoencoder neural-network-based detector for ASD. Quantification results on the training set of the Malfunctioning Industrial Machine Investigation and Inspection dataset at a signal-to-noise ratio (SNR) of 6 dB reveal that the information distinguishing normal from anomalous sounds of different machines is encoded non-uniformly in the frequency domain. By highlighting these important frequency regions with NUFBs, the LNS feature significantly enhances performance in terms of AUC (area under the receiver operating characteristic curve) under various SNR conditions. Furthermore, the modulation representations further improve performance: temporal modulation is effective for fans, pumps, and sliders, while spectral modulation is particularly effective for valves.
{"title":"Harmonic Reasoning in Large Language Models","authors":"Anna Kruspe","doi":"arxiv-2409.05521","DOIUrl":"https://doi.org/arxiv-2409.05521","url":null,"abstract":"Large Language Models (LLMs) are becoming very popular and are used for many\u0000different purposes, including creative tasks in the arts. However, these models\u0000sometimes have trouble with specific reasoning tasks, especially those that\u0000involve logical thinking and counting. This paper looks at how well LLMs\u0000understand and reason when dealing with musical tasks like figuring out notes\u0000from intervals and identifying chords and scales. We tested GPT-3.5 and GPT-4o\u0000to see how they handle these tasks. Our results show that while LLMs do well\u0000with note intervals, they struggle with more complicated tasks like recognizing\u0000chords and scales. This points out clear limits in current LLM abilities and\u0000shows where we need to make them better, which could help improve how they\u0000think and work in both artistic and other complex areas. We also provide an\u0000automatically generated benchmark data set for the described tasks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of real-time transcriptions using end-to-end ASR models
Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso
arXiv:2409.05674 (arXiv - CS - Sound, 2024-09-09)
Abstract: Automatic Speech Recognition (ASR), or Speech-to-Text (STT), has greatly evolved in the last few years. Traditional pipeline-based architectures have been replaced by joint end-to-end (E2E) architectures that simplify and streamline model training. In addition, new AI training methods, such as weakly supervised learning, have reduced the need for high-quality audio datasets. However, despite all these advancements, little to no research has been done on real-time transcription. In real-time scenarios, the audio is not pre-recorded, and the input audio must be fragmented to be processed by the ASR systems. To meet real-time requirements, these fragments must be as short as possible to reduce latency. However, audio cannot be split at arbitrary points, since dividing an utterance into two separate fragments produces an incorrect transcription, and shorter fragments provide less context for the ASR model. For this reason, it is necessary to design and test different splitting algorithms to optimize the quality and delay of the resulting transcription. In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both transcription quality and end-to-end delay: fragmentation at fixed intervals, voice activity detection (VAD), and fragmentation with feedback. The results are compared to the performance of the same models without audio fragmentation to determine the effects of this division. The results show that VAD fragmentation provides the best quality with the highest delay, whereas fragmentation at fixed intervals provides the lowest quality and the lowest delay. Relative to VAD splitting, the newly proposed feedback algorithm trades a 2-4% increase in WER for a 1.5-2 s reduction in delay.
PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification
Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj
arXiv:2409.05799 (arXiv - CS - Sound, 2024-09-09)
Abstract: Speaker verification systems are crucial for authenticating identity through voice. Traditionally, these systems focus on comparing feature vectors, overlooking the speech's content. However, this paper challenges this by highlighting the importance of phonetic dominance, a measure of the frequency or duration of phonemes, as a crucial cue in speaker verification. A novel Phoneme Debiasing Attention Framework (PDAF) is introduced, integrating with existing attention frameworks to mitigate biases caused by phonetic dominance. PDAF adjusts the weighting for each phoneme and influences feature extraction, allowing for a more nuanced analysis of speech. This approach paves the way for more accurate and reliable identity authentication through voice. Furthermore, by employing various weighting strategies, we evaluate the influence of phonetic features on the efficacy of the speaker verification system.
Evaluating Neural Networks Architectures for Spring Reverb Modelling
Francesco Papaleo, Xavier Lizarraga-Seijas, Frederic Font
arXiv:2409.04953 (arXiv - CS - Sound, 2024-09-08)
Abstract: Reverberation is a key element in spatial audio perception, historically achieved with analogue devices such as plate and spring reverb, and in recent decades with digital signal processing techniques that have enabled different approaches to Virtual Analogue Modelling (VAM). The electromechanical functioning of the spring reverb makes it a nonlinear system that is difficult to fully emulate in the digital domain with white-box modelling techniques. In this study, we compare five different neural network architectures, including convolutional and recurrent models, to assess their effectiveness in replicating the characteristics of this audio effect. The evaluation is conducted on two datasets at sampling rates of 16 kHz and 48 kHz. This paper specifically focuses on neural audio architectures that offer parametric control, aiming to advance the boundaries of current black-box modelling techniques in the domain of spring reverberation.
{"title":"From Computation to Consumption: Exploring the Compute-Energy Link for Training and Testing Neural Networks for SED Systems","authors":"Constance Douwes, Romain Serizel","doi":"arxiv-2409.05080","DOIUrl":"https://doi.org/arxiv-2409.05080","url":null,"abstract":"The massive use of machine learning models, particularly neural networks, has\u0000raised serious concerns about their environmental impact. Indeed, over the last\u0000few years we have seen an explosion in the computing costs associated with\u0000training and deploying these systems. It is, therefore, crucial to understand\u0000their energy requirements in order to better integrate them into the evaluation\u0000of models, which has so far focused mainly on performance. In this paper, we\u0000study several neural network architectures that are key components of sound\u0000event detection systems, using an audio tagging task as an example. We measure\u0000the energy consumption for training and testing small to large architectures\u0000and establish complex relationships between the energy consumption, the number\u0000of floating-point operations, the number of parameters, and the GPU/memory\u0000utilization.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"73 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering of Indonesian and Western Gamelan Orchestras through Machine Learning of Performance Parameters","authors":"Simon Linke, Gerrit Wendt, Rolf Bader","doi":"arxiv-2409.03713","DOIUrl":"https://doi.org/arxiv-2409.03713","url":null,"abstract":"Indonesian and Western gamelan ensembles are investigated with respect to\u0000performance differences. Thereby, the often exotistic history of this music in\u0000the West might be reflected in contemporary tonal system, articulation, or\u0000large-scale form differences. Analyzing recordings of four Western and five\u0000Indonesian orchestras with respect to tonal systems and timbre features and\u0000using self-organizing Kohonen map (SOM) as a machine learning algorithm, a\u0000clear clustering between Indonesian and Western ensembles appears using certain\u0000psychoacoustic features. These point to a reduced articulation and large-scale\u0000form variability of Western ensembles compared to Indonesian ones. The SOM also\u0000clusters the ensembles with respect to their tonal systems, but no clusters\u0000between Indonesian and Western ensembles can be found in this respect.\u0000Therefore, a clear analogy between lower articulatory variability and\u0000large-scale form variation and a more exostistic, mediative and calm\u0000performance expectation and reception of gamelan in the West therefore appears.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}