{"title":"Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model","authors":"Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman","doi":"arxiv-2405.01730","DOIUrl":"https://doi.org/arxiv-2405.01730","url":null,"abstract":"Expressive voice conversion (VC) conducts speaker identity conversion for\u0000emotional speakers by jointly converting speaker identity and emotional style.\u0000Emotional style modeling for arbitrary speakers in expressive VC has not been\u0000extensively explored. Previous approaches have relied on vocoders for speech\u0000reconstruction, which makes speech quality heavily dependent on the performance\u0000of vocoders. A major challenge of expressive VC lies in emotion prosody\u0000modeling. To address these challenges, this paper proposes a fully end-to-end\u0000expressive VC framework based on a conditional denoising diffusion\u0000probabilistic model (DDPM). We utilize speech units derived from\u0000self-supervised speech models as content conditioning, along with deep features\u0000extracted from speech emotion recognition and speaker verification systems to\u0000model emotional style and speaker identity. Objective and subjective\u0000evaluations show the effectiveness of our framework. Codes and samples are\u0000publicly available.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"USAT: A Universal Speaker-Adaptive Text-to-Speech Approach","authors":"Wenbin Wang, Yang Song, Sanjay Jha","doi":"arxiv-2404.18094","DOIUrl":"https://doi.org/arxiv-2404.18094","url":null,"abstract":"Conventional text-to-speech (TTS) research has predominantly focused on\u0000enhancing the quality of synthesized speech for speakers in the training\u0000dataset. The challenge of synthesizing lifelike speech for unseen,\u0000out-of-dataset speakers, especially those with limited reference data, remains\u0000a significant and unresolved problem. While zero-shot or few-shot\u0000speaker-adaptive TTS approaches have been explored, they have many limitations.\u0000Zero-shot approaches tend to suffer from insufficient generalization\u0000performance to reproduce the voice of speakers with heavy accents. While\u0000few-shot methods can reproduce highly varying accents, they bring a significant\u0000storage burden and the risk of overfitting and catastrophic forgetting. In\u0000addition, prior approaches only provide either zero-shot or few-shot\u0000adaptation, constraining their utility across varied real-world scenarios with\u0000different demands. Besides, most current evaluations of speaker-adaptive TTS\u0000are conducted only on datasets of native speakers, inadvertently neglecting a\u0000vast portion of non-native speakers with diverse accents. Our proposed\u0000framework unifies both zero-shot and few-shot speaker adaptation strategies,\u0000which we term as \"instant\" and \"fine-grained\" adaptations based on their\u0000merits. To alleviate the insufficient generalization performance observed in\u0000zero-shot speaker adaptation, we designed two innovative discriminators and\u0000introduced a memory mechanism for the speech decoder. To prevent catastrophic\u0000forgetting and reduce storage implications for few-shot speaker adaptation, we\u0000designed two adapters and a unique adaptation procedure.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140842460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition","authors":"Houtan Ghaffari, Paul Devos","doi":"arxiv-2404.17252","DOIUrl":"https://doi.org/arxiv-2404.17252","url":null,"abstract":"Transferring the weights of a pre-trained model to assist another task has\u0000become a crucial part of modern deep learning, particularly in data-scarce\u0000scenarios. Pre-training refers to the initial step of training models outside\u0000the current task of interest, typically on another dataset. It can be done via\u0000supervised models using human-annotated datasets or self-supervised models\u0000trained on unlabeled datasets. In both cases, many pre-trained models are\u0000available to fine-tune for the task of interest. Interestingly, research has\u0000shown that pre-trained models from ImageNet can be helpful for audio tasks\u0000despite being trained on image datasets. Hence, it's unclear whether in-domain\u0000models would be advantageous compared to competent out-domain models, such as\u0000convolutional neural networks from ImageNet. Our experiments will demonstrate\u0000the usefulness of in-domain models and datasets for bird species recognition by\u0000leveraging VICReg, a recent and powerful self-supervised method.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140811973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech Technology Services for Oral History Research","authors":"Christoph Draxler, Henk van den Heuvel, Arjan van Hessen, Pavel Ircing, Jan Lehečka","doi":"arxiv-2405.02333","DOIUrl":"https://doi.org/arxiv-2405.02333","url":null,"abstract":"Oral history is about oral sources of witnesses and commentors on historical\u0000events. Speech technology is an important instrument to process such recordings\u0000in order to obtain transcription and further enhancements to structure the oral\u0000account In this contribution we address the transcription portal and the\u0000webservices associated with speech processing at BAS, speech solutions\u0000developed at LINDAT, how to do it yourself with Whisper, remaining challenges,\u0000and future developments.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Music Style Transfer With Diffusion Model","authors":"Hong Huang, Yuyi Wang, Luyao Li, Jun Lin","doi":"arxiv-2404.14771","DOIUrl":"https://doi.org/arxiv-2404.14771","url":null,"abstract":"Previous studies on music style transfer have mainly focused on one-to-one\u0000style conversion, which is relatively limited. When considering the conversion\u0000between multiple styles, previous methods required designing multiple modes to\u0000disentangle the complex style of the music, resulting in large computational\u0000costs and slow audio generation. The existing music style transfer methods\u0000generate spectrograms with artifacts, leading to significant noise in the\u0000generated audio. To address these issues, this study proposes a music style\u0000transfer framework based on diffusion models (DM) and uses spectrogram-based\u0000methods to achieve multi-to-multi music style transfer. The GuideDiff method is\u0000used to restore spectrograms to high-fidelity audio, accelerating audio\u0000generation speed and reducing noise in the generated audio. Experimental\u0000results show that our model has good performance in multi-mode music style\u0000transfer compared to the baseline and can generate high-quality audio in\u0000real-time on consumer-grade GPUs.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140806306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vector Signal Reconstruction Sparse and Parametric Approach of direction of arrival Using Single Vector Hydrophone","authors":"Jiabin Guo","doi":"arxiv-2404.15160","DOIUrl":"https://doi.org/arxiv-2404.15160","url":null,"abstract":"This article discusses the application of single vector hydrophones in the\u0000field of underwater acoustic signal processing for Direction Of Arrival (DOA)\u0000estimation. Addressing the limitations of traditional DOA estimation methods in\u0000multi-source environments and under noise interference, this study introduces a\u0000Vector Signal Reconstruction Sparse and Parametric Approach (VSRSPA). This\u0000method involves reconstructing the signal model of a single vector hydrophone,\u0000converting its covariance matrix into a Toeplitz structure suitable for the\u0000Sparse and Parametric Approach (SPA) algorithm. The process then optimizes it\u0000using the SPA algorithm to achieve more accurate DOA estimation. Through\u0000detailed simulation analysis, this research has confirmed the performance of\u0000the proposed algorithm in single and dual-target DOA estimation scenarios,\u0000especially under various signal-to-noise ratio(SNR) conditions. The simulation\u0000results show that, compared to traditional DOA estimation methods, this\u0000algorithm has significant advantages in estimation accuracy and resolution,\u0000particularly in multi-source signals and low SNR environments. The contribution\u0000of this study lies in providing an effective new method for DOA estimation with\u0000single vector hydrophones in complex environments, introducing new research\u0000directions and solutions in the field of vector hydrophone signal processing.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DensePANet: An improved generative adversarial network for photoacoustic tomography image reconstruction from sparse data","authors":"Hesam Hakimnejad, Zohreh Azimifar, Narjes Goshtasbi","doi":"arxiv-2404.13101","DOIUrl":"https://doi.org/arxiv-2404.13101","url":null,"abstract":"Image reconstruction is an essential step of every medical imaging method,\u0000including Photoacoustic Tomography (PAT), which is a promising modality of\u0000imaging, that unites the benefits of both ultrasound and optical imaging\u0000methods. Reconstruction of PAT images using conventional methods results in\u0000rough artifacts, especially when applied directly to sparse PAT data. In recent\u0000years, generative adversarial networks (GANs) have shown a powerful performance\u0000in image generation as well as translation, rendering them a smart choice to be\u0000applied to reconstruction tasks. In this study, we proposed an end-to-end\u0000method called DensePANet to solve the problem of PAT image reconstruction from\u0000sparse data. The proposed model employs a novel modification of UNet in its\u0000generator, called FD-UNet++, which considerably improves the reconstruction\u0000performance. We evaluated the method on various in-vivo and simulated datasets.\u0000Quantitative and qualitative results show the better performance of our model\u0000over other prevalent deep learning techniques.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos","authors":"Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman","doi":"arxiv-2404.05206","DOIUrl":"https://doi.org/arxiv-2404.05206","url":null,"abstract":"We propose a novel self-supervised embedding to learn how actions sound from\u0000narrated in-the-wild egocentric videos. Whereas existing methods rely on\u0000curated data with known audio-visual correspondence, our multimodal\u0000contrastive-consensus coding (MC3) embedding reinforces the associations\u0000between audio, language, and vision when all modality pairs agree, while\u0000diminishing those associations when any one pair does not. We show our approach\u0000can successfully discover how the long tail of human actions sound from\u0000egocentric video, outperforming an array of recent multimodal embedding\u0000techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal\u0000tasks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SpeechAlign: Aligning Speech Generation to Human Preferences","authors":"Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu","doi":"arxiv-2404.05600","DOIUrl":"https://doi.org/arxiv-2404.05600","url":null,"abstract":"Speech language models have significantly advanced in generating realistic\u0000speech, with neural codec language models standing out. However, the\u0000integration of human feedback to align speech outputs to human preferences is\u0000often neglected. This paper addresses this gap by first analyzing the\u0000distribution gap in codec language models, highlighting how it leads to\u0000discrepancies between the training and inference phases, which negatively\u0000affects performance. Then we explore leveraging learning from human feedback to\u0000bridge the distribution gap. We introduce SpeechAlign, an iterative\u0000self-improvement strategy that aligns speech language models to human\u0000preferences. SpeechAlign involves constructing a preference codec dataset\u0000contrasting golden codec tokens against synthetic tokens, followed by\u0000preference optimization to improve the codec language model. This cycle of\u0000improvement is carried out iteratively to steadily convert weak models to\u0000strong ones. Through both subjective and objective evaluations, we show that\u0000SpeechAlign can bridge the distribution gap and facilitating continuous\u0000self-improvement of the speech language model. Moreover, SpeechAlign exhibits\u0000robust generalization capabilities and works for smaller models. Code and\u0000models will be available at https://github.com/0nutation/SpeechGPT.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Test-Time Training for Depression Detection","authors":"Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore","doi":"arxiv-2404.05071","DOIUrl":"https://doi.org/arxiv-2404.05071","url":null,"abstract":"Previous works on depression detection use datasets collected in similar\u0000environments to train and test the models. In practice, however, the train and\u0000test distributions cannot be guaranteed to be identical. Distribution shifts\u0000can be introduced due to variations such as recording environment (e.g.,\u0000background noise) and demographics (e.g., gender, age, etc). Such\u0000distributional shifts can surprisingly lead to severe performance degradation\u0000of the depression detection models. In this paper, we analyze the application\u0000of test-time training (TTT) to improve robustness of models trained for\u0000depression detection. When compared to regular testing of the models, we find\u0000TTT can significantly improve the robustness of the model under a variety of\u0000distributional shifts introduced due to: (a) background-noise, (b) gender-bias,\u0000and (c) data collection and curation procedure (i.e., train and test samples\u0000are from separate datasets).","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}