{"title":"Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment","authors":"Tien-Hong Lo, Meng-Ting Tsai, Berlin Chen","doi":"arxiv-2409.07151","DOIUrl":"https://doi.org/arxiv-2409.07151","url":null,"abstract":"Second language (L2) learners can improve their pronunciation by imitating golden speech, especially when that speech aligns with their own speech characteristics. This study explores the hypothesis that learner-specific golden speech generated with zero-shot text-to-speech (ZS-TTS) techniques can be harnessed as an effective metric for measuring the pronunciation proficiency of L2 learners. Building on this exploration, the contributions of this study are at least two-fold: 1) the design and development of a systematic framework for assessing the ability of a synthesis model to generate golden speech, and 2) in-depth investigations of the effectiveness of using golden speech in automatic pronunciation assessment (APA). Comprehensive experiments conducted on the L2-ARCTIC and Speechocean762 benchmark datasets suggest that our proposed modeling can yield significant performance improvements over several prior arts with respect to various assessment metrics. To our knowledge, this study is the first to explore the role of golden speech in both ZS-TTS and APA, offering a promising regime for computer-assisted pronunciation training (CAPT).","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos","authors":"Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang","doi":"arxiv-2409.07450","DOIUrl":"https://doi.org/arxiv-2409.07450","url":null,"abstract":"We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture the fine-grained visual cues needed for realistic background music generation, we introduce a new temporal video encoder architecture that allows us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior dataset used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm","authors":"Yuning Wu, Jiatong Shi, Yifeng Yu, Yuxun Tang, Tao Qian, Yueqian Lin, Jionghao Han, Xinyi Bai, Shinji Watanabe, Qin Jin","doi":"arxiv-2409.07226","DOIUrl":"https://doi.org/arxiv-2409.07226","url":null,"abstract":"This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from SSL models and audio codecs, which offer significant advantages in versatility and intelligence, supporting multi-format inputs and adaptable data processing workflows for various SVS models. The toolkit features automatic music score error detection and correction, as well as a perception auto-evaluation module that imitates human subjective evaluation scores. Muskits-ESPnet is available at https://github.com/espnet/espnet.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing CTC-Based Visual Speech Recognition","authors":"Hendrik Laux, Anke Schmeink","doi":"arxiv-2409.07210","DOIUrl":"https://doi.org/arxiv-2409.07210","url":null,"abstract":"This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). Building upon our knowledge distillation framework from a pre-trained Automatic Speech Recognition (ASR) model, we introduce two key improvements: a stabilized video preprocessing technique and feature normalization in the distillation process. These improvements yield substantial performance gains on the LRS2 and LRS3 benchmarks, positioning LiteVSR2 as the current best CTC-based VSR model without increasing the volume of training data or computational resources utilized. Furthermore, we explore the scalability of our approach by examining performance metrics across varying model complexities and training data volumes. LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy, thereby demonstrating the potential for resource-efficient advancements in VSR technology.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Mamba in Speech Processing by Self-Supervised Models","authors":"Xiangyu Zhang, Jianbo Ma, Mostafa Shahin, Beena Ahmed, Julien Epps","doi":"arxiv-2409.07273","DOIUrl":"https://doi.org/arxiv-2409.07273","url":null,"abstract":"The Mamba-based model has demonstrated outstanding performance across tasks in computer vision, natural language processing, and speech processing. However, in the realm of speech processing, the Mamba-based model's performance varies across different tasks. For instance, in tasks such as speech enhancement and spectrum reconstruction, the Mamba model performs well when used independently. However, for tasks like speech recognition, additional modules are required to surpass the performance of attention-based models. We propose the hypothesis that the Mamba-based model excels in \"reconstruction\" tasks within speech processing, whereas for \"classification\" tasks such as speech recognition, additional modules are necessary to accomplish the \"reconstruction\" step. To validate our hypothesis, we analyze previous Mamba-based speech models from an information-theoretic perspective. Furthermore, we leveraged the properties of HuBERT in our study. We trained a Mamba-based HuBERT model, and the mutual information patterns, along with the model's performance metrics, confirmed our assumptions.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Suite for Acoustic Language Model Evaluation","authors":"Gallil Maimon, Amit Roth, Yossi Adi","doi":"arxiv-2409.07437","DOIUrl":"https://doi.org/arxiv-2409.07437","url":null,"abstract":"Speech language models have recently demonstrated great potential as universal speech processing systems. Such models have the ability to model the rich acoustic information existing in audio signals, beyond spoken content, such as emotion, background noise, etc. Despite this, evaluation benchmarks that assess awareness of a wide range of acoustic aspects are lacking. To help bridge this gap, we introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity, and room impulse response. The proposed benchmarks evaluate both the consistency of the inspected element and how well it matches the spoken text. We follow a modelling-based approach, measuring whether a model gives correct samples higher scores than incorrect ones. This approach makes the benchmark fast to compute even for large models. We evaluated several speech language models on SALMon, thus highlighting the strengths and weaknesses of each evaluated method. Code and data are publicly available at https://pages.cs.huji.ac.il/adiyoss-lab/salmon/ .","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages","authors":"Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee","doi":"arxiv-2409.07259","DOIUrl":"https://doi.org/arxiv-2409.07259","url":null,"abstract":"In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language. ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio with a sampling rate of 44.1 kHz. Alongside ManaTTS, we also generated the VirgoolInformal dataset, spanning over 5 hours of audio, to evaluate Persian speech recognition models used for forced alignment. The datasets are supported by a fully transparent, MIT-licensed pipeline that includes tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method specifically designed for low-resource languages, addressing a crucial need in the field. With this dataset, we trained a Tacotron2-based TTS model, achieving a Mean Opinion Score (MOS) of 3.76, which is remarkably close to the MOS of 3.86 for utterances generated by the same vocoder from natural spectrograms, and the MOS of 4.01 for the natural waveform, demonstrating the exceptional quality and effectiveness of the corpus.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack","authors":"Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac","doi":"arxiv-2409.07390","DOIUrl":"https://doi.org/arxiv-2409.07390","url":null,"abstract":"The advancements in generative AI have enabled the improvement of audio synthesis models, including text-to-speech and voice conversion. This raises concerns about potential misuse in social manipulation and political interference, as synthetic speech has become indistinguishable from natural human speech. Several speech-generation programs are utilized for malicious purposes, especially impersonating individuals through phone calls. Therefore, detecting fake audio is crucial to maintain social security and safeguard the integrity of information. Recent research has proposed a D-CAPTCHA system based on the challenge-response protocol to differentiate fake phone calls from real ones. In this work, we study the resilience of this system and introduce a more robust version, D-CAPTCHA++, to defend against fake calls. Specifically, we first expose the vulnerability of the D-CAPTCHA system under a transferable imperceptible adversarial attack. Second, we mitigate this vulnerability by improving the robustness of the system, applying adversarial training to the D-CAPTCHA deepfake detectors and task classifiers.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction","authors":"Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E. Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, Yu Tsao","doi":"arxiv-2409.07001","DOIUrl":"https://doi.org/arxiv-2409.07001","url":null,"abstract":"We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of \"zoomed-in\" high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion with a large variety of systems, listeners, and languages. The third track was semi-supervised quality prediction for noisy, clean, and enhanced speech, where a very small amount of labeled training data was provided. Among the eight teams from both academia and industry, we found that many were able to outperform the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations like spectrograms and pitch histograms. These results showed that the challenge has advanced the field of subjective speech rating prediction.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition","authors":"Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Batthacharya","doi":"arxiv-2409.07165","DOIUrl":"https://doi.org/arxiv-2409.07165","url":null,"abstract":"Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and limits the deployment of ASR on constrained devices. SummaryMixing is a promising linear-time complexity alternative to self-attention for non-streaming speech recognition that, for the first time, preserves or outperforms the accuracy of self-attention models. Unfortunately, the original definition of SummaryMixing is not suited to streaming speech recognition. Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. It shows that this new linear-time complexity speech encoder outperforms self-attention in both scenarios while requiring less compute and memory during training and decoding.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}