{"title":"Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model","authors":"Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman","doi":"arxiv-2405.01730","DOIUrl":"https://doi.org/arxiv-2405.01730","url":null,"abstract":"Expressive voice conversion (VC) conducts speaker identity conversion for\u0000emotional speakers by jointly converting speaker identity and emotional style.\u0000Emotional style modeling for arbitrary speakers in expressive VC has not been\u0000extensively explored. Previous approaches have relied on vocoders for speech\u0000reconstruction, which makes speech quality heavily dependent on the performance\u0000of vocoders. A major challenge of expressive VC lies in emotion prosody\u0000modeling. To address these challenges, this paper proposes a fully end-to-end\u0000expressive VC framework based on a conditional denoising diffusion\u0000probabilistic model (DDPM). We utilize speech units derived from\u0000self-supervised speech models as content conditioning, along with deep features\u0000extracted from speech emotion recognition and speaker verification systems to\u0000model emotional style and speaker identity. Objective and subjective\u0000evaluations show the effectiveness of our framework. Codes and samples are\u0000publicly available.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"USAT: A Universal Speaker-Adaptive Text-to-Speech Approach","authors":"Wenbin Wang, Yang Song, Sanjay Jha","doi":"arxiv-2404.18094","DOIUrl":"https://doi.org/arxiv-2404.18094","url":null,"abstract":"Conventional text-to-speech (TTS) research has predominantly focused on\u0000enhancing the quality of synthesized speech for speakers in the training\u0000dataset. The challenge of synthesizing lifelike speech for unseen,\u0000out-of-dataset speakers, especially those with limited reference data, remains\u0000a significant and unresolved problem. While zero-shot or few-shot\u0000speaker-adaptive TTS approaches have been explored, they have many limitations.\u0000Zero-shot approaches tend to suffer from insufficient generalization\u0000performance to reproduce the voice of speakers with heavy accents. While\u0000few-shot methods can reproduce highly varying accents, they bring a significant\u0000storage burden and the risk of overfitting and catastrophic forgetting. In\u0000addition, prior approaches only provide either zero-shot or few-shot\u0000adaptation, constraining their utility across varied real-world scenarios with\u0000different demands. Besides, most current evaluations of speaker-adaptive TTS\u0000are conducted only on datasets of native speakers, inadvertently neglecting a\u0000vast portion of non-native speakers with diverse accents. Our proposed\u0000framework unifies both zero-shot and few-shot speaker adaptation strategies,\u0000which we term as \"instant\" and \"fine-grained\" adaptations based on their\u0000merits. To alleviate the insufficient generalization performance observed in\u0000zero-shot speaker adaptation, we designed two innovative discriminators and\u0000introduced a memory mechanism for the speech decoder. To prevent catastrophic\u0000forgetting and reduce storage implications for few-shot speaker adaptation, we\u0000designed two adapters and a unique adaptation procedure.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140842460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition","authors":"Houtan Ghaffari, Paul Devos","doi":"arxiv-2404.17252","DOIUrl":"https://doi.org/arxiv-2404.17252","url":null,"abstract":"Transferring the weights of a pre-trained model to assist another task has\u0000become a crucial part of modern deep learning, particularly in data-scarce\u0000scenarios. Pre-training refers to the initial step of training models outside\u0000the current task of interest, typically on another dataset. It can be done via\u0000supervised models using human-annotated datasets or self-supervised models\u0000trained on unlabeled datasets. In both cases, many pre-trained models are\u0000available to fine-tune for the task of interest. Interestingly, research has\u0000shown that pre-trained models from ImageNet can be helpful for audio tasks\u0000despite being trained on image datasets. Hence, it's unclear whether in-domain\u0000models would be advantageous compared to competent out-domain models, such as\u0000convolutional neural networks from ImageNet. Our experiments will demonstrate\u0000the usefulness of in-domain models and datasets for bird species recognition by\u0000leveraging VICReg, a recent and powerful self-supervised method.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140811973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech Technology Services for Oral History Research","authors":"Christoph Draxler, Henk van den Heuvel, Arjan van Hessen, Pavel Ircing, Jan Lehečka","doi":"arxiv-2405.02333","DOIUrl":"https://doi.org/arxiv-2405.02333","url":null,"abstract":"Oral history is about oral sources of witnesses and commentors on historical\u0000events. Speech technology is an important instrument to process such recordings\u0000in order to obtain transcription and further enhancements to structure the oral\u0000account In this contribution we address the transcription portal and the\u0000webservices associated with speech processing at BAS, speech solutions\u0000developed at LINDAT, how to do it yourself with Whisper, remaining challenges,\u0000and future developments.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Music Style Transfer With Diffusion Model","authors":"Hong Huang, Yuyi Wang, Luyao Li, Jun Lin","doi":"arxiv-2404.14771","DOIUrl":"https://doi.org/arxiv-2404.14771","url":null,"abstract":"Previous studies on music style transfer have mainly focused on one-to-one\u0000style conversion, which is relatively limited. When considering the conversion\u0000between multiple styles, previous methods required designing multiple modes to\u0000disentangle the complex style of the music, resulting in large computational\u0000costs and slow audio generation. The existing music style transfer methods\u0000generate spectrograms with artifacts, leading to significant noise in the\u0000generated audio. To address these issues, this study proposes a music style\u0000transfer framework based on diffusion models (DM) and uses spectrogram-based\u0000methods to achieve multi-to-multi music style transfer. The GuideDiff method is\u0000used to restore spectrograms to high-fidelity audio, accelerating audio\u0000generation speed and reducing noise in the generated audio. Experimental\u0000results show that our model has good performance in multi-mode music style\u0000transfer compared to the baseline and can generate high-quality audio in\u0000real-time on consumer-grade GPUs.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140806306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vector Signal Reconstruction Sparse and Parametric Approach of direction of arrival Using Single Vector Hydrophone","authors":"Jiabin Guo","doi":"arxiv-2404.15160","DOIUrl":"https://doi.org/arxiv-2404.15160","url":null,"abstract":"This article discusses the application of single vector hydrophones in the\u0000field of underwater acoustic signal processing for Direction Of Arrival (DOA)\u0000estimation. Addressing the limitations of traditional DOA estimation methods in\u0000multi-source environments and under noise interference, this study introduces a\u0000Vector Signal Reconstruction Sparse and Parametric Approach (VSRSPA). This\u0000method involves reconstructing the signal model of a single vector hydrophone,\u0000converting its covariance matrix into a Toeplitz structure suitable for the\u0000Sparse and Parametric Approach (SPA) algorithm. The process then optimizes it\u0000using the SPA algorithm to achieve more accurate DOA estimation. Through\u0000detailed simulation analysis, this research has confirmed the performance of\u0000the proposed algorithm in single and dual-target DOA estimation scenarios,\u0000especially under various signal-to-noise ratio(SNR) conditions. The simulation\u0000results show that, compared to traditional DOA estimation methods, this\u0000algorithm has significant advantages in estimation accuracy and resolution,\u0000particularly in multi-source signals and low SNR environments. The contribution\u0000of this study lies in providing an effective new method for DOA estimation with\u0000single vector hydrophones in complex environments, introducing new research\u0000directions and solutions in the field of vector hydrophone signal processing.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DensePANet: An improved generative adversarial network for photoacoustic tomography image reconstruction from sparse data","authors":"Hesam Hakimnejad, Zohreh Azimifar, Narjes Goshtasbi","doi":"arxiv-2404.13101","DOIUrl":"https://doi.org/arxiv-2404.13101","url":null,"abstract":"Image reconstruction is an essential step of every medical imaging method,\u0000including Photoacoustic Tomography (PAT), which is a promising modality of\u0000imaging, that unites the benefits of both ultrasound and optical imaging\u0000methods. Reconstruction of PAT images using conventional methods results in\u0000rough artifacts, especially when applied directly to sparse PAT data. In recent\u0000years, generative adversarial networks (GANs) have shown a powerful performance\u0000in image generation as well as translation, rendering them a smart choice to be\u0000applied to reconstruction tasks. In this study, we proposed an end-to-end\u0000method called DensePANet to solve the problem of PAT image reconstruction from\u0000sparse data. The proposed model employs a novel modification of UNet in its\u0000generator, called FD-UNet++, which considerably improves the reconstruction\u0000performance. We evaluated the method on various in-vivo and simulated datasets.\u0000Quantitative and qualitative results show the better performance of our model\u0000over other prevalent deep learning techniques.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos","authors":"Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman","doi":"arxiv-2404.05206","DOIUrl":"https://doi.org/arxiv-2404.05206","url":null,"abstract":"We propose a novel self-supervised embedding to learn how actions sound from\u0000narrated in-the-wild egocentric videos. Whereas existing methods rely on\u0000curated data with known audio-visual correspondence, our multimodal\u0000contrastive-consensus coding (MC3) embedding reinforces the associations\u0000between audio, language, and vision when all modality pairs agree, while\u0000diminishing those associations when any one pair does not. We show our approach\u0000can successfully discover how the long tail of human actions sound from\u0000egocentric video, outperforming an array of recent multimodal embedding\u0000techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal\u0000tasks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SpeechAlign: Aligning Speech Generation to Human Preferences","authors":"Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu","doi":"arxiv-2404.05600","DOIUrl":"https://doi.org/arxiv-2404.05600","url":null,"abstract":"Speech language models have significantly advanced in generating realistic\u0000speech, with neural codec language models standing out. However, the\u0000integration of human feedback to align speech outputs to human preferences is\u0000often neglected. This paper addresses this gap by first analyzing the\u0000distribution gap in codec language models, highlighting how it leads to\u0000discrepancies between the training and inference phases, which negatively\u0000affects performance. Then we explore leveraging learning from human feedback to\u0000bridge the distribution gap. We introduce SpeechAlign, an iterative\u0000self-improvement strategy that aligns speech language models to human\u0000preferences. SpeechAlign involves constructing a preference codec dataset\u0000contrasting golden codec tokens against synthetic tokens, followed by\u0000preference optimization to improve the codec language model. This cycle of\u0000improvement is carried out iteratively to steadily convert weak models to\u0000strong ones. Through both subjective and objective evaluations, we show that\u0000SpeechAlign can bridge the distribution gap and facilitating continuous\u0000self-improvement of the speech language model. Moreover, SpeechAlign exhibits\u0000robust generalization capabilities and works for smaller models. Code and\u0000models will be available at https://github.com/0nutation/SpeechGPT.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Test-Time Training for Depression Detection","authors":"Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore","doi":"arxiv-2404.05071","DOIUrl":"https://doi.org/arxiv-2404.05071","url":null,"abstract":"Previous works on depression detection use datasets collected in similar\u0000environments to train and test the models. In practice, however, the train and\u0000test distributions cannot be guaranteed to be identical. Distribution shifts\u0000can be introduced due to variations such as recording environment (e.g.,\u0000background noise) and demographics (e.g., gender, age, etc). Such\u0000distributional shifts can surprisingly lead to severe performance degradation\u0000of the depression detection models. In this paper, we analyze the application\u0000of test-time training (TTT) to improve robustness of models trained for\u0000depression detection. When compared to regular testing of the models, we find\u0000TTT can significantly improve the robustness of the model under a variety of\u0000distributional shifts introduced due to: (a) background-noise, (b) gender-bias,\u0000and (c) data collection and curation procedure (i.e., train and test samples\u0000are from separate datasets).","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}