Cross-Domain Audio Deepfake Detection: Dataset and Analysis
Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang
arXiv:2404.04904 (arXiv - CS - Sound, 2024-04-07)

Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech data generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate our models' outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect detection accuracy, necessitating further research.
HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks
Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria
arXiv:2404.04645 (arXiv - CS - Sound, 2024-04-06)

Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While TTS architectures that train and test on the same set of speakers have seen significant improvements, out-of-domain speaker performance still faces enormous limitations. Domain adaptation to a new set of speakers can be achieved by fine-tuning the whole model for each new domain, which is parameter-inefficient. Adapters provide a parameter-efficient alternative; although popular in NLP, they have so far brought little improvement to speech synthesis. In this work, we present HyperTTS, which comprises a small learnable network, a "hypernetwork", that generates the parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and make them dynamic. Extensive evaluations in two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS against baselines from different studies. Promising results on the dynamic adaptation of Adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.
Multi-Task Learning for Lung Sound and Lung Disease Classification
Suma K V, Deepali Koppad, Preethi Kumar, Neha A Kantikar, Surabhi Ramesh
arXiv:2404.03908 (arXiv - CS - Sound, 2024-04-05)

In recent years, advances in deep learning techniques have considerably enhanced the efficiency and accuracy of medical diagnostics. In this work, a novel approach using multi-task learning (MTL) for the simultaneous classification of lung sounds and lung diseases is proposed. The proposed model leverages MTL with four different deep learning backbones, namely a 2D CNN, ResNet50, MobileNet and DenseNet, to extract relevant features from the lung sound recordings. The ICBHI 2017 Respiratory Sound Database was employed in the current study. The MTL MobileNet model performed better than the other models considered, with an accuracy of 74% for lung sound analysis and 91% for lung disease classification. The experimental results demonstrate the efficacy of our approach in classifying both lung sounds and lung diseases concurrently. Using the demographic data of the patients from the database, a risk-level computation for Chronic Obstructive Pulmonary Disease is also carried out. For this computation, three machine learning algorithms, namely Logistic Regression, SVM and a Random Forest classifier, were employed; among these, the Random Forest classifier had the highest accuracy of 92%. This work helps considerably reduce the physician's burden of not just diagnosing the pathology but also effectively communicating the possible causes or outcomes to the patient.
{"title":"The NES Video-Music Database: A Dataset of Symbolic Video Game Music Paired with Gameplay Videos","authors":"Igor Cardoso, Rubens O. Moraes, Lucas N. Ferreira","doi":"arxiv-2404.04420","DOIUrl":"https://doi.org/arxiv-2404.04420","url":null,"abstract":"Neural models are one of the most popular approaches for music generation,\u0000yet there aren't standard large datasets tailored for learning music directly\u0000from game data. To address this research gap, we introduce a novel dataset\u0000named NES-VMDB, containing 98,940 gameplay videos from 389 NES games, each\u0000paired with its original soundtrack in symbolic format (MIDI). NES-VMDB is\u0000built upon the Nintendo Entertainment System Music Database (NES-MDB),\u0000encompassing 5,278 music pieces from 397 NES games. Our approach involves\u0000collecting long-play videos for 389 games of the original dataset, slicing them\u0000into 15-second-long clips, and extracting the audio from each clip.\u0000Subsequently, we apply an audio fingerprinting algorithm (similar to Shazam) to\u0000automatically identify the corresponding piece in the NES-MDB dataset.\u0000Additionally, we introduce a baseline method based on the Controllable Music\u0000Transformer to generate NES music conditioned on gameplay clips. We evaluated\u0000this approach with objective metrics, and the results showed that the\u0000conditional CMT improves musical structural quality when compared to its\u0000unconditional counterpart. Moreover, we used a neural classifier to predict the\u0000game genre of the generated pieces. Results showed that the CMT generator can\u0000learn correlations between gameplay videos and game genres, but further\u0000research has to be conducted to achieve human-level performance.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders","authors":"Yu Pan, Lei Ma, Jianjun Zhao","doi":"arxiv-2404.02702","DOIUrl":"https://doi.org/arxiv-2404.02702","url":null,"abstract":"Neural speech codec has recently gained widespread attention in generative\u0000speech modeling domains, like voice conversion, text-to-speech synthesis, etc.\u0000However, ensuring high-fidelity audio reconstruction of speech codecs under\u0000high compression rates remains an open and challenging issue. In this paper, we\u0000propose PromptCodec, a novel end-to-end neural speech codec model using\u0000disentangled representation learning based feature-aware prompt encoders. By\u0000incorporating additional feature representations from prompt encoders,\u0000PromptCodec can distribute the speech information requiring processing and\u0000enhance its capabilities. Moreover, a simple yet effective adaptive feature\u0000weighted fusion approach is introduced to integrate features of different\u0000encoders. Meanwhile, we propose a novel disentangled representation learning\u0000strategy based on cosine distance to optimize PromptCodec's encoders to ensure\u0000their efficiency, thereby further improving the performance of PromptCodec.\u0000Experiments on LibriTTS demonstrate that our proposed PromptCodec consistently\u0000outperforms state-of-the-art neural speech codec models under all different\u0000bitrate conditions while achieving impressive performance with low bitrates.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140602683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synthesizing Soundscapes: Leveraging Text-to-Audio Models for Environmental Sound Classification","authors":"Francesca Ronchini, Luca Comanducci, Fabio Antonacci","doi":"arxiv-2403.17864","DOIUrl":"https://doi.org/arxiv-2403.17864","url":null,"abstract":"In the past few years, text-to-audio models have emerged as a significant\u0000advancement in automatic audio generation. Although they represent impressive\u0000technological progress, the effectiveness of their use in the development of\u0000audio applications remains uncertain. This paper aims to investigate these\u0000aspects, specifically focusing on the task of classification of environmental\u0000sounds. This study analyzes the performance of two different environmental\u0000classification systems when data generated from text-to-audio models is used\u0000for training. Two cases are considered: a) when the training dataset is\u0000augmented by data coming from two different text-to-audio models; and b) when\u0000the training dataset consists solely of synthetic audio generated. In both\u0000cases, the performance of the classification task is tested on real data.\u0000Results indicate that text-to-audio models are effective for dataset\u0000augmentation, whereas the performance of the models drops when relying on only\u0000generated audio.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speaker Distance Estimation in Enclosures from Single-Channel Audio
Michael Neri, Archontis Politis, Daniel Krause, Marco Carli, Tuomas Virtanen
arXiv:2403.17514 (arXiv - CS - Sound, 2024-03-26)

Distance estimation from audio plays a crucial role in various applications, such as acoustic scene analysis, sound source localization, and room modeling. Most studies predominantly center on employing a classification approach, where distances are discretized into distinct categories, enabling smoother model training and achieving higher accuracy but imposing restrictions on the precision of the obtained sound source position. Towards this direction, in this paper we propose a novel approach for continuous distance estimation from audio signals using a convolutional recurrent neural network with an attention module. The attention mechanism enables the model to focus on relevant temporal and spectral features, enhancing its ability to capture fine-grained distance-related information. To evaluate the effectiveness of our proposed method, we conduct extensive experiments using audio recordings in controlled environments with three levels of realism (synthetic room impulse response, measured response with convolved speech, and real recordings) on four datasets (our synthetic dataset, QMULTIMIT, VoiceHome-2, and STARSS23). Experimental results show that the model achieves an absolute error of 0.11 meters in a noiseless synthetic scenario. Moreover, the results showed an absolute error of about 1.30 meters in the hybrid scenario. The algorithm's performance in the real scenario, where unpredictable environmental factors and noise are prevalent, yields an absolute error of approximately 0.50 meters. For reproducible research purposes we make model, code, and synthetic datasets available at https://github.com/michaelneri/audio-distance-estimation.
{"title":"Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks","authors":"Yang Ai, Zhen-Hua Ling","doi":"arxiv-2403.17378","DOIUrl":"https://doi.org/arxiv-2403.17378","url":null,"abstract":"This paper presents a novel neural speech phase prediction model which\u0000predicts wrapped phase spectra directly from amplitude spectra. The proposed\u0000model is a cascade of a residual convolutional network and a parallel\u0000estimation architecture. The parallel estimation architecture is a core module\u0000for direct wrapped phase prediction. This architecture consists of two parallel\u0000linear convolutional layers and a phase calculation formula, imitating the\u0000process of calculating the phase spectra from the real and imaginary parts of\u0000complex spectra and strictly restricting the predicted phase values to the\u0000principal value interval. To avoid the error expansion issue caused by phase\u0000wrapping, we design anti-wrapping training losses defined between the predicted\u0000wrapped phase spectra and natural ones by activating the instantaneous phase\u0000error, group delay error and instantaneous angular frequency error using an\u0000anti-wrapping function. We mathematically demonstrate that the anti-wrapping\u0000function should possess three properties, namely parity, periodicity and\u0000monotonicity. We also achieve low-latency streamable phase prediction by\u0000combining causal convolutions and knowledge distillation training strategies.\u0000For both analysis-synthesis and specific speech generation tasks, experimental\u0000results show that our proposed neural speech phase prediction model outperforms\u0000the iterative phase estimation algorithms and neural network-based phase\u0000prediction methods in terms of phase prediction precision, efficiency and\u0000robustness. Compared with HiFi-GAN-based waveform reconstruction method, our\u0000proposed model also shows outstanding efficiency advantages while ensuring the\u0000quality of synthesized speech. To the best of our knowledge, we are the first\u0000to directly predict speech phase spectra from amplitude spectra only via neural\u0000networks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep functional multiple index models with an application to SER","authors":"Matthieu Saumard, Abir El Haj, Thibault Napoleon","doi":"arxiv-2403.17562","DOIUrl":"https://doi.org/arxiv-2403.17562","url":null,"abstract":"Speech Emotion Recognition (SER) plays a crucial role in advancing\u0000human-computer interaction and speech processing capabilities. We introduce a\u0000novel deep-learning architecture designed specifically for the functional data\u0000model known as the multiple-index functional model. Our key innovation lies in\u0000integrating adaptive basis layers and an automated data transformation search\u0000within the deep learning framework. Simulations for this new model show good\u0000performances. This allows us to extract features tailored for chunk-level SER,\u0000based on Mel Frequency Cepstral Coefficients (MFCCs). We demonstrate the\u0000effectiveness of our approach on the benchmark IEMOCAP database, achieving good\u0000performance compared to existing methods.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detection of Deepfake Environmental Audio","authors":"Hafsa Ouajdi, Oussama Hadder, Modan Tailleur, Mathieu Lagrange, Laurie M. Heller","doi":"arxiv-2403.17529","DOIUrl":"https://doi.org/arxiv-2403.17529","url":null,"abstract":"With the ever-rising quality of deep generative models, it is increasingly\u0000important to be able to discern whether the audio data at hand have been\u0000recorded or synthesized. Although the detection of fake speech signals has been\u0000studied extensively, this is not the case for the detection of fake\u0000environmental audio. We propose a simple and efficient pipeline for detecting fake environmental\u0000sounds based on the CLAP audio embedding. We evaluate this detector using audio\u0000data from the 2023 DCASE challenge task on Foley sound synthesis. Our experiments show that fake sounds generated by 44 state-of-the-art\u0000synthesizers can be detected on average with 98% accuracy. We show that using\u0000an audio embedding learned on environmental audio is beneficial over a standard\u0000VGGish one as it provides a 10% increase in detection performance. Informal\u0000listening to Incorrect Negative examples demonstrates audible features of fake\u0000sounds missed by the detector such as distortion and implausible background\u0000noise.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}