Title: Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens
Authors: Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg
arXiv: 2409.06656 (2024-09-10)
Abstract: We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challenge. Most prior end-to-end diarization systems employ permutation invariant loss (PIL), which optimizes for the permutation that yields the lowest error. In contrast, we introduce Sort Loss, which enables a diarization model to autonomously resolve permutation, with or without PIL. We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained exclusively with PIL. Crucially, we present a streamlined multispeaker ASR architecture that leverages Sortformer as a speaker supervision model, embedding speaker label estimation within the ASR encoder state using a sinusoidal kernel function. This approach resolves the speaker permutation problem through sorted objectives, effectively bridging speaker-label timestamps and speaker tokens. In our experiments, we show that the proposed multispeaker ASR architecture, enhanced with speaker supervision, improves performance via adapter techniques. Code and trained models will be made publicly available via the NVIDIA NeMo framework.
{"title":"A Two-Stage Band-Split Mamba-2 Network for Music Separation","authors":"Jinglin Bai, Yuan Fang, Jiajie Wang, Xueliang Zhang","doi":"arxiv-2409.06245","DOIUrl":"https://doi.org/arxiv-2409.06245","url":null,"abstract":"Music source separation (MSS) aims to separate mixed music into its distinct\u0000tracks, such as vocals, bass, drums, and more. MSS is considered to be a\u0000challenging audio separation task due to the complexity of music signals.\u0000Although the RNN and Transformer architecture are not perfect, they are\u0000commonly used to model the music sequence for MSS. Recently, Mamba-2 has\u0000already demonstrated high efficiency in various sequential modeling tasks, but\u0000its superiority has not been investigated in MSS. This paper applies Mamba-2\u0000with a two-stage strategy, which introduces residual mapping based on the mask\u0000method, effectively compensating for the details absent in the mask and further\u0000improving separation performance. Experiments confirm the superiority of\u0000bidirectional Mamba-2 and the effectiveness of the two-stage network in MSS.\u0000The source code is publicly accessible at\u0000https://github.com/baijinglin/TS-BSmamba2.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Authors: Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
arXiv: 2409.06666 (2024-09-10)
Abstract: Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226 ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.
Title: Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models
Authors: Xin Jing, Kun Zhou, Andreas Triantafyllopoulos, Björn W. Schuller
arXiv: 2409.06451 (2024-09-10)
Abstract: While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine-grained control over the emotion rendering of the output speech remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, the diffusion model is trained to generate emotional embeddings based on textual emotional style descriptions. Our framework first trains on reference audio using the audio encoder, then fine-tunes a diffusion model to process textual inputs from ParaCLAP's text encoder. During inference, speech attributes such as pitch, jitter, and loudness are manipulated using only textual conditioning. Our experiments demonstrate that ParaEVITS effectively controls emotion rendering without compromising speech quality. Speech demos are publicly available.
{"title":"Spectral oversubtraction? An approach for speech enhancement after robot ego speech filtering in semi-real-time","authors":"Yue Li, Koen V. Hindriks, Florian A. Kunneman","doi":"arxiv-2409.06274","DOIUrl":"https://doi.org/arxiv-2409.06274","url":null,"abstract":"Spectral subtraction, widely used for its simplicity, has been employed to\u0000address the Robot Ego Speech Filtering (RESF) problem for detecting speech\u0000contents of human interruption from robot's single-channel microphone\u0000recordings when it is speaking. However, this approach suffers from\u0000oversubtraction in the fundamental frequency range (FFR), leading to degraded\u0000speech content recognition. To address this, we propose a Two-Mask\u0000Conformer-based Metric Generative Adversarial Network (CMGAN) to enhance the\u0000detected speech and improve recognition results. Our model compensates for\u0000oversubtracted FFR values with high-frequency information and long-term\u0000features and then de-noises the new spectrogram. In addition, we introduce an\u0000incremental processing method that allows semi-real-time audio processing with\u0000streaming input on a network trained on long fixed-length input. Evaluations of\u0000two datasets, including one with unseen noise, demonstrate significant\u0000improvements in recognition accuracy and the effectiveness of the proposed\u0000two-mask approach and incremental processing, enhancing the robustness of the\u0000proposed RESF pipeline in real-world HRI scenarios.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention-Based Beamformer For Multi-Channel Speech Enhancement","authors":"Jinglin Bai, Hao Li, Xueliang Zhang, Fei Chen","doi":"arxiv-2409.06456","DOIUrl":"https://doi.org/arxiv-2409.06456","url":null,"abstract":"Minimum Variance Distortionless Response (MVDR) is a classical adaptive\u0000beamformer that theoretically ensures the distortionless transmission of\u0000signals in the target direction. Its performance in noise reduction actually\u0000depends on the accuracy of the noise spatial covariance matrix (SCM) estimate.\u0000Although recent deep learning has shown remarkable performance in multi-channel\u0000speech enhancement, the property of distortionless response still makes MVDR\u0000highly popular in real applications. In this paper, we propose an\u0000attention-based mechanism to calculate the speech and noise SCM and then apply\u0000MVDR to obtain the enhanced speech. Moreover, a deep learning architecture\u0000using the inplace convolution operator and frequency-independent LSTM has\u0000proven effective in facilitating SCM estimation. The model is optimized in an\u0000end-to-end manner. Experimental results indicate that the proposed method is\u0000extremely effective in tracking moving or stationary speakers under non-causal\u0000and causal conditions, outperforming other baselines. It is worth mentioning\u0000that our model has only 0.35 million parameters, making it easy to be deployed\u0000on edge devices.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Multi-Source Music Generation with Latent Diffusion
Authors: Zhongweiyang Xu, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury
arXiv: 2409.06190 (2024-09-10)
Abstract: Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) has been proposed to model music as a mixture of multiple instrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use a single diffusion model to generate consistent music sources, which are then mixed to form the music. Despite its capabilities, MSDM is unable to generate songs with rich melodies and often produces empty sounds. Moreover, its waveform diffusion introduces significant Gaussian noise artifacts that compromise audio quality. In response, we introduce a multi-source latent diffusion model (MSLDM) that employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source's unique characteristics in a source latent, which our diffusion model then models jointly. This approach significantly enhances both total and partial music generation by leveraging the VAE's latent compression and noise robustness. The compressed source latents also make generation more efficient. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, showcasing its practical and enhanced applicability in music generation systems. We also emphasize that modeling sources is more effective than directly modeling the music mixture. Code and models are available at https://github.com/XZWY/MSLDM. Demos are available at https://xzwy.github.io/MSLDMDemo.
{"title":"SpeechTaxi: On Multilingual Semantic Speech Classification","authors":"Lennart Keller, Goran Glavaš","doi":"arxiv-2409.06372","DOIUrl":"https://doi.org/arxiv-2409.06372","url":null,"abstract":"Recent advancements in multilingual speech encoding as well as transcription\u0000raise the question of the most effective approach to semantic speech\u0000classification. Concretely, can (1) end-to-end (E2E) classifiers obtained by\u0000fine-tuning state-of-the-art multilingual speech encoders (MSEs) match or\u0000surpass the performance of (2) cascading (CA), where speech is first\u0000transcribed into text and classification is delegated to a text-based\u0000classifier. To answer this, we first construct SpeechTaxi, an 80-hour\u0000multilingual dataset for semantic speech classification of Bible verses,\u0000covering 28 diverse languages. We then leverage SpeechTaxi to conduct a wide\u0000range of experiments comparing E2E and CA in monolingual semantic speech\u0000classification as well as in cross-lingual transfer. We find that E2E based on\u0000MSEs outperforms CA in monolingual setups, i.e., when trained on in-language\u0000data. However, MSEs seem to have poor cross-lingual transfer abilities, with\u0000E2E substantially lagging CA both in (1) zero-shot transfer to languages unseen\u0000in training and (2) multilingual training, i.e., joint training on multiple\u0000languages. Finally, we devise a novel CA approach based on transcription to\u0000Romanized text as a language-agnostic intermediate representation and show that\u0000it represents a robust solution for languages without native ASR support. Our\u0000SpeechTaxi dataset is publicly available at: https://huggingface.co/\u0000datasets/LennartKeller/SpeechTaxi/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Janssen 2.0: Audio Inpainting in the Time-frequency Domain","authors":"Ondřej Mokrý, Peter Balušík, Pavel Rajmic","doi":"arxiv-2409.06392","DOIUrl":"https://doi.org/arxiv-2409.06392","url":null,"abstract":"The paper focuses on inpainting missing parts of an audio signal spectrogram.\u0000First, a recent successful approach based on an untrained neural network is\u0000revised and its several modifications are proposed, improving the\u0000signal-to-noise ratio of the restored audio. Second, the Janssen algorithm, the\u0000autoregression-based state-of-the-art for time-domain audio inpainting, is\u0000adapted for the time-frequency setting. This novel method, coined Janssen-TF,\u0000is compared to the neural network approach using both objective metrics and a\u0000subjective listening test, proving Janssen-TF to be superior in all the\u0000considered measures.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Exploring Differences between Human Perception and Model Inference in Audio Event Recognition
Authors: Yizhou Tan, Yanru Wu, Yuanbo Hou, Xin Xu, Hui Bu, Shengchen Li, Dick Botteldooren, Mark D. Plumbley
arXiv: 2409.06580 (2024-09-10)
Abstract: Audio Event Recognition (AER) traditionally focuses on detecting and identifying audio events. Most existing AER models tend to detect all potential events without considering their varying significance across different contexts, so their outputs often diverge considerably from human auditory perception. Although this is a critical and significant issue, it has not been extensively studied by the Detection and Classification of Acoustic Scenes and Events (DCASE) community, because solving it is time-consuming and labour-intensive. To address this issue, this paper introduces the concept of semantic importance in AER, focusing on the differences between human perception and model inference. The paper constructs a Multi-Annotated Foreground Audio Event Recognition (MAFAR) dataset, which comprises audio recordings labelled by 10 professional annotators. Through labelling frequency and variance, the MAFAR dataset facilitates the quantification of semantic importance and the analysis of human perception. By comparing human annotations with the predictions of an ensemble of pre-trained models, this paper uncovers a significant gap between human perception and model inference in both the semantic identification and the existence detection of audio events. Experimental results reveal that human perception tends to ignore subtle or trivial events during semantic identification, while model inference is easily affected by noisy events. Meanwhile, in existence detection, models are usually more sensitive than humans.