{"title":"VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling","authors":"Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo","doi":"arxiv-2406.04321","DOIUrl":"https://doi.org/arxiv-2406.04321","url":null,"abstract":"In this work, we systematically study music generation conditioned solely on\u0000the video. First, we present a large-scale dataset comprising 190K video-music\u0000pairs, including various genres such as movie trailers, advertisements, and\u0000documentaries. Furthermore, we propose VidMuse, a simple framework for\u0000generating music aligned with video inputs. VidMuse stands out by producing\u0000high-fidelity music that is both acoustically and semantically aligned with the\u0000video. By incorporating local and global visual cues, VidMuse enables the\u0000creation of musically coherent audio tracks that consistently match the video\u0000content through Long-Short-Term modeling. Through extensive experiments,\u0000VidMuse outperforms existing models in terms of audio quality, diversity, and\u0000audio-visual alignment. The code and datasets will be available at\u0000https://github.com/ZeyueT/VidMuse/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Operational Latent Spaces","authors":"Scott H. Hawley, Austin R. Tackett","doi":"arxiv-2406.02699","DOIUrl":"https://doi.org/arxiv-2406.02699","url":null,"abstract":"We investigate the construction of latent spaces through self-supervised\u0000learning to support semantically meaningful operations. Analogous to\u0000operational amplifiers, these \"operational latent spaces\" (OpLaS) not only\u0000demonstrate semantic structure such as clustering but also support common\u0000transformational operations with inherent semantic meaning. Some operational\u0000latent spaces are found to have arisen \"unintentionally\" in the progress toward\u0000some (other) self-supervised learning objective, in which unintended but still\u0000useful properties are discovered among the relationships of points in the\u0000space. Other spaces may be constructed \"intentionally\" by developers\u0000stipulating certain kinds of clustering or transformations intended to produce\u0000the desired structure. We focus on the intentional creation of operational\u0000latent spaces via self-supervised learning, including the introduction of\u0000rotation operators via a novel \"FiLMR\" layer, which can be used to enable\u0000ring-like symmetries found in some musical constructions.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Searching For Music Mixing Graphs: A Pruning Approach","authors":"Sungho Lee, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji","doi":"arxiv-2406.01049","DOIUrl":"https://doi.org/arxiv-2406.01049","url":null,"abstract":"Music mixing is compositional -- experts combine multiple audio processors to\u0000achieve a cohesive mix from dry source tracks. We propose a method to reverse\u0000engineer this process from the input and output audio. First, we create a\u0000mixing console that applies all available processors to every chain. Then,\u0000after the initial console parameter optimization, we alternate between removing\u0000redundant processors and fine-tuning. We achieve this through differentiable\u0000implementation of both processors and pruning. Consequently, we find a sparse\u0000mixing graph that achieves nearly identical matching quality of the full mixing\u0000console. We apply this procedure to dry-mix pairs from various datasets and\u0000collect graphs that also can be used to train neural networks for music mixing\u0000applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141254791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation","authors":"Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan","doi":"arxiv-2405.20289","DOIUrl":"https://doi.org/arxiv-2405.20289","url":null,"abstract":"Controllable music generation methods are critical for human-centered\u0000AI-based music creation, but are currently limited by speed, quality, and\u0000control design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in\u0000particular, offers state-of-the-art results, but is over 10x slower than\u0000real-time, limiting practical use. We propose Distilled Diffusion\u0000Inference-Time T -Optimization (or DITTO-2), a new method to speed up\u0000inference-time optimization-based control and unlock faster-than-real-time\u0000generation for a wide-variety of applications such as music inpainting,\u0000outpainting, intensity, melody, and musical structure control. Our method works\u0000by (1) distilling a pre-trained diffusion model for fast sampling via an\u0000efficient, modified consistency or consistency trajectory distillation process\u0000(2) performing inference-time optimization using our distilled model with\u0000one-step sampling as an efficient surrogate optimization task and (3) running a\u0000final multi-step sampling generation (decoding) using our estimated noise\u0000latents for best-quality, fast, controllable generation. Through thorough\u0000evaluation, we find our method not only speeds up generation over 10-20x, but\u0000simultaneously improves control adherence and generation quality all at once.\u0000Furthermore, we apply our approach to a new application of maximizing text\u0000adherence (CLAP score) and show we can convert an unconditional diffusion model\u0000without text inputs into a model that yields state-of-the-art text control.\u0000Sound examples can be found at https://ditto-music.github.io/ditto2/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141187830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLMs Meet Multimodal Generation and Editing: A Survey","authors":"Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen","doi":"arxiv-2405.19334","DOIUrl":"https://doi.org/arxiv-2405.19334","url":null,"abstract":"With the recent advancement in large language models (LLMs), there is a\u0000growing interest in combining LLMs with multimodal learning. Previous surveys\u0000of multimodal large language models (MLLMs) mainly focus on understanding. This\u0000survey elaborates on multimodal generation across different domains, including\u0000image, video, 3D, and audio, where we highlight the notable advancements with\u0000milestone works in these fields. Specifically, we exhaustively investigate the\u0000key technical components behind methods and multimodal datasets utilized in\u0000these studies. Moreover, we dig into tool-augmented multimodal agents that can\u0000use existing generative models for human-computer interaction. Lastly, we also\u0000comprehensively discuss the advancement in AI safety and investigate emerging\u0000applications as well as future prospects. Our work provides a systematic and\u0000insightful overview of multimodal generation, which is expected to advance the\u0000development of Artificial Intelligence for Generative Content (AIGC) and world\u0000models. A curated list of all related papers can be found at\u0000https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"214 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141528501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings","authors":"Tariq Adnan, Abdelrahman Abdelkader, Zipei Liu, Ekram Hossain, Sooyong Park, MD Saiful Islam, Ehsan Hoque","doi":"arxiv-2405.17206","DOIUrl":"https://doi.org/arxiv-2405.17206","url":null,"abstract":"We present a framework to recognize Parkinson's disease (PD) through an\u0000English pangram utterance speech collected using a web application from diverse\u0000recording settings and environments, including participants' homes. Our dataset\u0000includes a global cohort of 1306 participants, including 392 diagnosed with PD.\u0000Leveraging the diversity of the dataset, spanning various demographic\u0000properties (such as age, sex, and ethnicity), we used deep learning embeddings\u0000derived from semi-supervised models such as Wav2Vec 2.0, WavLM, and ImageBind\u0000representing the speech dynamics associated with PD. Our novel fusion model for\u0000PD classification, which aligns different speech embeddings into a cohesive\u0000feature space, demonstrated superior performance over standard\u0000concatenation-based fusion models and other baselines (including models built\u0000on traditional acoustic features). In a randomized data split configuration,\u0000the model achieved an Area Under the Receiver Operating Characteristic Curve\u0000(AUROC) of 88.94% and an accuracy of 85.65%. Rigorous statistical analysis\u0000confirmed that our model performs equitably across various demographic\u0000subgroups in terms of sex, ethnicity, and age, and remains robust regardless of\u0000disease duration. Furthermore, our model, when tested on two entirely unseen\u0000test datasets collected from clinical settings and from a PD care center,\u0000maintained AUROC scores of 82.12% and 78.44%, respectively. This affirms the\u0000model's robustness and it's potential to enhance accessibility and health\u0000equity in real-world applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141166241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing DMI Interactions by Integrating Haptic Feedback for Intricate Vibrato Technique","authors":"Ziyue Piao, Christian Frisson, Bavo Van Kerrebroeck, Marcelo M. Wanderley","doi":"arxiv-2405.10502","DOIUrl":"https://doi.org/arxiv-2405.10502","url":null,"abstract":"This paper investigates the integration of force feedback in Digital Musical\u0000Instruments (DMI), specifically evaluating the reproduction of intricate\u0000vibrato techniques using haptic feedback controllers. We introduce our system\u0000for vibrato modulation using force feedback, composed of Bend-aid (a web-based\u0000sequencer platform using pre-designed haptic feedback models) and TorqueTuner\u0000(an open-source 1 Degree-of-Freedom (DoF) rotary haptic device for generating\u0000programmable haptic effects). We designed a formal user study to assess the\u0000impact of each haptic mode on user experience in a vibrato mimicry task. Twenty\u0000musically trained participants rated their user experience for the three haptic\u0000modes (Smooth, Detent, and Spring) using four Likert-scale scores: comfort,\u0000flexibility, ease of control, and helpfulness for the task. Finally, we asked\u0000participants to share their reflections. Our research indicates that while the\u0000Spring mode can help with light vibrato, preferences for haptic modes vary\u0000based on musical training background. This emphasizes the need for adaptable\u0000task interfaces and flexible haptic feedback in DMI design.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparative Study of Recurrent Neural Networks for Virtual Analog Audio Effects Modeling","authors":"Riccardo Simionato, Stefano Fasciani","doi":"arxiv-2405.04124","DOIUrl":"https://doi.org/arxiv-2405.04124","url":null,"abstract":"Analog electronic circuits are at the core of an important category of\u0000musical devices. The nonlinear features of their electronic components give\u0000analog musical devices a distinctive timbre and sound quality, making them\u0000highly desirable. Artificial neural networks have rapidly gained popularity for\u0000the emulation of analog audio effects circuits, particularly recurrent\u0000networks. While neural approaches have been successful in accurately modeling\u0000distortion circuits, they require architectural improvements that account for\u0000parameter conditioning and low latency response. In this article, we explore\u0000the application of recent machine learning advancements for virtual analog\u0000modeling. We compare State Space models and Linear Recurrent Units against the\u0000more common Long Short Term Memory networks. These have shown promising ability\u0000in sequence to sequence modeling tasks, showing a notable improvement in signal\u0000history encoding. Our comparative study uses these black box neural modeling\u0000techniques with a variety of audio effects. We evaluate the performance and\u0000limitations using multiple metrics aiming to assess the models' ability to\u0000accurately replicate energy envelopes, frequency contents, and transients in\u0000the audio signal. To incorporate control parameters we employ the Feature wise\u0000Linear Modulation method. Long Short Term Memory networks exhibit better\u0000accuracy in emulating distortions and equalizers, while the State Space model,\u0000followed by Long Short Term Memory networks when integrated in an encoder\u0000decoder structure, outperforms others in emulating saturation and compression.\u0000When considering long time variant characteristics, the State Space model\u0000demonstrates the greatest accuracy. The Long Short Term Memory and, in\u0000particular, Linear Recurrent Unit networks present more tendency to introduce\u0000audio artifacts.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140927682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transhuman Ansambl - Voice Beyond Language","authors":"Lucija Ivsic, Jon McCormack, Vince Dziekan","doi":"arxiv-2405.03134","DOIUrl":"https://doi.org/arxiv-2405.03134","url":null,"abstract":"In this paper we present the design and development of the Transhuman\u0000Ansambl, a novel interactive singing-voice interface which senses its\u0000environment and responds to vocal input with vocalisations using human voice.\u0000Designed for live performance with a human performer and as a standalone sound\u0000installation, the ansambl consists of sixteen bespoke virtual singers arranged\u0000in a circle. When performing live, the virtual singers listen to the human\u0000performer and respond to their singing by reading pitch, intonation and volume\u0000cues. In a standalone sound installation mode, singers use ultrasonic distance\u0000sensors to sense audience presence. Developed as part of the 1st author's\u0000practice-based PhD and artistic practice as a live performer, this work employs\u0000the singing-voice to explore voice interactions in HCI beyond language, and\u0000innovative ways of live performing. How is technology supporting the effect of\u0000intimacy produced through voice? Does the act of surrounding the audience with\u0000responsive virtual singers challenge the traditional roles of\u0000performer-listener? To answer these questions, we draw upon the 1st author's\u0000experience with the system, and the interdisciplinary field of voice studies\u0000that consider the voice as the sound medium independent of language, capable of\u0000enacting a reciprocal connection between bodies.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"161 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Determined Multichannel Blind Source Separation with Clustered Source Model","authors":"Jianyu Wang, Shanzheng Guan","doi":"arxiv-2405.03118","DOIUrl":"https://doi.org/arxiv-2405.03118","url":null,"abstract":"The independent low-rank matrix analysis (ILRMA) method stands out as a\u0000prominent technique for multichannel blind audio source separation. It\u0000leverages nonnegative matrix factorization (NMF) and nonnegative canonical\u0000polyadic decomposition (NCPD) to model source parameters. While it effectively\u0000captures the low-rank structure of sources, the NMF model overlooks\u0000inter-channel dependencies. On the other hand, NCPD preserves intrinsic\u0000structure but lacks interpretable latent factors, making it challenging to\u0000incorporate prior information as constraints. To address these limitations, we\u0000introduce a clustered source model based on nonnegative block-term\u0000decomposition (NBTD). This model defines blocks as outer products of vectors\u0000(clusters) and matrices (for spectral structure modeling), offering\u0000interpretable latent vectors. Moreover, it enables straightforward integration\u0000of orthogonality constraints to ensure independence among source images.\u0000Experimental results demonstrate that our proposed method outperforms ILRMA and\u0000its extensions in anechoic conditions and surpasses the original ILRMA in\u0000simulated reverberant environments.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}