arXiv - CS - Sound: Latest Papers

Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition
arXiv - CS - Sound Pub Date : 2024-04-26 DOI: arxiv-2404.17252
Houtan Ghaffari, Paul Devos
{"title":"Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition","authors":"Houtan Ghaffari, Paul Devos","doi":"arxiv-2404.17252","DOIUrl":"https://doi.org/arxiv-2404.17252","url":null,"abstract":"Transferring the weights of a pre-trained model to assist another task has\u0000become a crucial part of modern deep learning, particularly in data-scarce\u0000scenarios. Pre-training refers to the initial step of training models outside\u0000the current task of interest, typically on another dataset. It can be done via\u0000supervised models using human-annotated datasets or self-supervised models\u0000trained on unlabeled datasets. In both cases, many pre-trained models are\u0000available to fine-tune for the task of interest. Interestingly, research has\u0000shown that pre-trained models from ImageNet can be helpful for audio tasks\u0000despite being trained on image datasets. Hence, it's unclear whether in-domain\u0000models would be advantageous compared to competent out-domain models, such as\u0000convolutional neural networks from ImageNet. Our experiments will demonstrate\u0000the usefulness of in-domain models and datasets for bird species recognition by\u0000leveraging VICReg, a recent and powerful self-supervised method.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140811973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
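The study above leverages VICReg for in-domain self-supervised pre-training. As a rough illustration of the general VICReg objective only — not the authors' implementation, and with the commonly cited default loss weights assumed — a minimal PyTorch sketch of the variance-invariance-covariance loss on embeddings of two augmented views of a spectrogram could look like this:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Generic VICReg loss on two batches of embeddings of shape (N, D).

    sim_w/var_w/cov_w follow the defaults reported for VICReg; the
    bird-recognition study may use different weights or projector sizes.
    """
    n, d = z_a.shape

    # Invariance term: mean-squared error between the two views.
    sim_loss = F.mse_loss(z_a, z_b)

    # Variance term: hinge loss keeping each dimension's std above 1.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var_loss = torch.mean(F.relu(1.0 - std_a)) + torch.mean(F.relu(1.0 - std_b))

    # Covariance term: penalize off-diagonal covariance to decorrelate dims.
    z_a_c = z_a - z_a.mean(dim=0)
    z_b_c = z_b - z_b.mean(dim=0)
    cov_a = (z_a_c.T @ z_a_c) / (n - 1)
    cov_b = (z_b_c.T @ z_b_c) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov_loss = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d

    return sim_w * sim_loss + var_w * var_loss + cov_w * cov_loss
```

The variance and covariance terms are what allow this family of methods to avoid collapsed representations without requiring negative pairs.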
Speech Technology Services for Oral History Research
arXiv - CS - Sound Pub Date : 2024-04-26 DOI: arxiv-2405.02333
Christoph Draxler, Henk van den Heuvel, Arjan van Hessen, Pavel Ircing, Jan Lehečka
{"title":"Speech Technology Services for Oral History Research","authors":"Christoph Draxler, Henk van den Heuvel, Arjan van Hessen, Pavel Ircing, Jan Lehečka","doi":"arxiv-2405.02333","DOIUrl":"https://doi.org/arxiv-2405.02333","url":null,"abstract":"Oral history is about oral sources of witnesses and commentors on historical\u0000events. Speech technology is an important instrument to process such recordings\u0000in order to obtain transcription and further enhancements to structure the oral\u0000account In this contribution we address the transcription portal and the\u0000webservices associated with speech processing at BAS, speech solutions\u0000developed at LINDAT, how to do it yourself with Whisper, remaining challenges,\u0000and future developments.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
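The contribution explicitly covers do-it-yourself transcription with Whisper. A minimal sketch using the open-source openai-whisper package is below; the file name and model size are placeholder assumptions, and the BAS portal and LINDAT services described in the paper are separate systems not shown here:

```python
# pip install openai-whisper  (also requires ffmpeg on the system path)
import whisper

# "base" is a small multilingual checkpoint; larger models ("medium",
# "large") are slower but typically more accurate for oral-history audio.
model = whisper.load_model("base")

# Transcribe a hypothetical interview recording; Whisper detects the
# language automatically unless one is specified.
result = model.transcribe("interview_001.wav")

print(result["text"])             # full transcript
for seg in result["segments"]:    # time-stamped segments for structuring
    print(f'{seg["start"]:8.2f}s  {seg["text"]}')
```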
Music Style Transfer With Diffusion Model
arXiv - CS - Sound Pub Date : 2024-04-23 DOI: arxiv-2404.14771
Hong Huang, Yuyi Wang, Luyao Li, Jun Lin
{"title":"Music Style Transfer With Diffusion Model","authors":"Hong Huang, Yuyi Wang, Luyao Li, Jun Lin","doi":"arxiv-2404.14771","DOIUrl":"https://doi.org/arxiv-2404.14771","url":null,"abstract":"Previous studies on music style transfer have mainly focused on one-to-one\u0000style conversion, which is relatively limited. When considering the conversion\u0000between multiple styles, previous methods required designing multiple modes to\u0000disentangle the complex style of the music, resulting in large computational\u0000costs and slow audio generation. The existing music style transfer methods\u0000generate spectrograms with artifacts, leading to significant noise in the\u0000generated audio. To address these issues, this study proposes a music style\u0000transfer framework based on diffusion models (DM) and uses spectrogram-based\u0000methods to achieve multi-to-multi music style transfer. The GuideDiff method is\u0000used to restore spectrograms to high-fidelity audio, accelerating audio\u0000generation speed and reducing noise in the generated audio. Experimental\u0000results show that our model has good performance in multi-mode music style\u0000transfer compared to the baseline and can generate high-quality audio in\u0000real-time on consumer-grade GPUs.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140806306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
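The framework builds on diffusion models over spectrograms. Purely as background, the sketch below shows the standard DDPM forward-noising and reverse-denoising updates that such a framework rests on; the noise schedule, tensor shapes, and noise-prediction network are assumptions, and the paper's GuideDiff vocoder stage is not reproduced:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule (assumption)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise):
    """Forward process: noise a clean (mel-)spectrogram batch x0 to step t.
    x0: (B, C, H, W), t: (B,) integer timesteps, noise: same shape as x0."""
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

@torch.no_grad()
def p_sample(model, x_t, t):
    """One reverse step at integer timestep t: predict noise, partially denoise."""
    eps = model(x_t, t)                          # noise-prediction network (assumed)
    a, a_bar, b = alphas[t], alpha_bar[t], betas[t]
    mean = (x_t - b / (1 - a_bar).sqrt() * eps) / a.sqrt()
    if t == 0:
        return mean
    return mean + b.sqrt() * torch.randn_like(x_t)
```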
Vector Signal Reconstruction Sparse and Parametric Approach of Direction of Arrival Using Single Vector Hydrophone
arXiv - CS - Sound Pub Date : 2024-04-21 DOI: arxiv-2404.15160
Jiabin Guo
{"title":"Vector Signal Reconstruction Sparse and Parametric Approach of direction of arrival Using Single Vector Hydrophone","authors":"Jiabin Guo","doi":"arxiv-2404.15160","DOIUrl":"https://doi.org/arxiv-2404.15160","url":null,"abstract":"This article discusses the application of single vector hydrophones in the\u0000field of underwater acoustic signal processing for Direction Of Arrival (DOA)\u0000estimation. Addressing the limitations of traditional DOA estimation methods in\u0000multi-source environments and under noise interference, this study introduces a\u0000Vector Signal Reconstruction Sparse and Parametric Approach (VSRSPA). This\u0000method involves reconstructing the signal model of a single vector hydrophone,\u0000converting its covariance matrix into a Toeplitz structure suitable for the\u0000Sparse and Parametric Approach (SPA) algorithm. The process then optimizes it\u0000using the SPA algorithm to achieve more accurate DOA estimation. Through\u0000detailed simulation analysis, this research has confirmed the performance of\u0000the proposed algorithm in single and dual-target DOA estimation scenarios,\u0000especially under various signal-to-noise ratio(SNR) conditions. The simulation\u0000results show that, compared to traditional DOA estimation methods, this\u0000algorithm has significant advantages in estimation accuracy and resolution,\u0000particularly in multi-source signals and low SNR environments. The contribution\u0000of this study lies in providing an effective new method for DOA estimation with\u0000single vector hydrophones in complex environments, introducing new research\u0000directions and solutions in the field of vector hydrophone signal processing.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
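A key step in VSRSPA is forcing the covariance estimate into a Toeplitz structure before running SPA. The NumPy sketch below shows one generic way to do this — averaging the sample covariance along its diagonals — which is a standard rectification, not necessarily the paper's exact reconstruction; the synthetic 4-channel example is for illustration only and is not a physical vector-hydrophone model:

```python
import numpy as np

def sample_covariance(X):
    """X: (M channels, N snapshots) complex data matrix."""
    M, N = X.shape
    return (X @ X.conj().T) / N

def toeplitzify(R):
    """Project a Hermitian covariance estimate onto Toeplitz structure
    by averaging each diagonal (a standard rectification step)."""
    M = R.shape[0]
    r = np.array([np.mean(np.diag(R, k)) for k in range(M)])
    T = np.zeros_like(R)
    for i in range(M):
        for j in range(M):
            T[i, j] = r[j - i] if j >= i else np.conj(r[i - j])
    return T

# Synthetic demo: one narrowband source plus noise on a 4-channel sensor.
rng = np.random.default_rng(0)
M, N = 4, 200
steering = np.exp(1j * np.pi * np.arange(M) * np.sin(np.deg2rad(30.0)))
s = rng.standard_normal(N) + 1j * rng.standard_normal(N)
noise = 0.1 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
X = np.outer(steering, s) + noise
R_toep = toeplitzify(sample_covariance(X))
```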
DensePANet: An improved generative adversarial network for photoacoustic tomography image reconstruction from sparse data
arXiv - CS - Sound Pub Date : 2024-04-19 DOI: arxiv-2404.13101
Hesam Hakimnejad, Zohreh Azimifar, Narjes Goshtasbi
{"title":"DensePANet: An improved generative adversarial network for photoacoustic tomography image reconstruction from sparse data","authors":"Hesam Hakimnejad, Zohreh Azimifar, Narjes Goshtasbi","doi":"arxiv-2404.13101","DOIUrl":"https://doi.org/arxiv-2404.13101","url":null,"abstract":"Image reconstruction is an essential step of every medical imaging method,\u0000including Photoacoustic Tomography (PAT), which is a promising modality of\u0000imaging, that unites the benefits of both ultrasound and optical imaging\u0000methods. Reconstruction of PAT images using conventional methods results in\u0000rough artifacts, especially when applied directly to sparse PAT data. In recent\u0000years, generative adversarial networks (GANs) have shown a powerful performance\u0000in image generation as well as translation, rendering them a smart choice to be\u0000applied to reconstruction tasks. In this study, we proposed an end-to-end\u0000method called DensePANet to solve the problem of PAT image reconstruction from\u0000sparse data. The proposed model employs a novel modification of UNet in its\u0000generator, called FD-UNet++, which considerably improves the reconstruction\u0000performance. We evaluated the method on various in-vivo and simulated datasets.\u0000Quantitative and qualitative results show the better performance of our model\u0000over other prevalent deep learning techniques.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
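DensePANet couples an FD-UNet++ generator with an adversarial objective for sparse-data reconstruction. As a schematic only — the architectures, loss weights, and optimizer settings below are placeholders rather than the paper's design — a typical image-to-image reconstruction GAN alternates a discriminator step with a generator step that mixes an adversarial term and a pixel-wise L1 term:

```python
import torch
import torch.nn.functional as F

def gan_reconstruction_step(G, D, opt_g, opt_d, sparse_in, full_target, l1_w=100.0):
    """One training step of a pix2pix-style reconstruction GAN.

    sparse_in:   images reconstructed from sparse PAT data (network input)
    full_target: reference images from dense data (supervision target)
    """
    # --- Discriminator update: real vs. generated reconstructions.
    fake = G(sparse_in).detach()
    real_logits = D(full_target)
    fake_logits = D(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- Generator update: fool D while staying close to the target (L1).
    fake = G(sparse_in)
    fake_logits = D(fake)
    g_adv = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    g_loss = g_adv + l1_w * F.l1_loss(fake, full_target)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```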
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
arXiv - CS - Sound Pub Date : 2024-04-08 DOI: arxiv-2404.05206
Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman
{"title":"SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos","authors":"Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman","doi":"arxiv-2404.05206","DOIUrl":"https://doi.org/arxiv-2404.05206","url":null,"abstract":"We propose a novel self-supervised embedding to learn how actions sound from\u0000narrated in-the-wild egocentric videos. Whereas existing methods rely on\u0000curated data with known audio-visual correspondence, our multimodal\u0000contrastive-consensus coding (MC3) embedding reinforces the associations\u0000between audio, language, and vision when all modality pairs agree, while\u0000diminishing those associations when any one pair does not. We show our approach\u0000can successfully discover how the long tail of human actions sound from\u0000egocentric video, outperforming an array of recent multimodal embedding\u0000techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal\u0000tasks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
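MC3 builds on contrastive alignment across audio, language, and vision. The sketch below shows only the generic pairwise InfoNCE backbone for three modalities; the consensus weighting that strengthens or weakens pairs depending on whether all modalities agree is the paper's contribution and is not reproduced here:

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings of shape (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def trimodal_contrastive_loss(audio_z, text_z, video_z):
    """Sum of pairwise contrastive losses over the three modality pairs."""
    return (info_nce(audio_z, text_z)
            + info_nce(audio_z, video_z)
            + info_nce(text_z, video_z))
```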
SpeechAlign: Aligning Speech Generation to Human Preferences
arXiv - CS - Sound Pub Date : 2024-04-08 DOI: arxiv-2404.05600
Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
{"title":"SpeechAlign: Aligning Speech Generation to Human Preferences","authors":"Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu","doi":"arxiv-2404.05600","DOIUrl":"https://doi.org/arxiv-2404.05600","url":null,"abstract":"Speech language models have significantly advanced in generating realistic\u0000speech, with neural codec language models standing out. However, the\u0000integration of human feedback to align speech outputs to human preferences is\u0000often neglected. This paper addresses this gap by first analyzing the\u0000distribution gap in codec language models, highlighting how it leads to\u0000discrepancies between the training and inference phases, which negatively\u0000affects performance. Then we explore leveraging learning from human feedback to\u0000bridge the distribution gap. We introduce SpeechAlign, an iterative\u0000self-improvement strategy that aligns speech language models to human\u0000preferences. SpeechAlign involves constructing a preference codec dataset\u0000contrasting golden codec tokens against synthetic tokens, followed by\u0000preference optimization to improve the codec language model. This cycle of\u0000improvement is carried out iteratively to steadily convert weak models to\u0000strong ones. Through both subjective and objective evaluations, we show that\u0000SpeechAlign can bridge the distribution gap and facilitating continuous\u0000self-improvement of the speech language model. Moreover, SpeechAlign exhibits\u0000robust generalization capabilities and works for smaller models. Code and\u0000models will be available at https://github.com/0nutation/SpeechGPT.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
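SpeechAlign performs preference optimization over pairs of golden versus synthetic codec-token sequences. One widely used preference-optimization objective is DPO, shown below purely as an illustration — the paper may use a different objective, and the sequence log-probability inputs are assumed to be computed elsewhere:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss over sequence log-probabilities.

    logp_*     : summed token log-probs under the model being trained
    ref_logp_* : the same quantities under a frozen reference model
    'chosen'   : golden codec-token sequences; 'rejected': synthetic ones.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

The frozen reference model keeps the optimized policy from drifting too far from its starting distribution, which matches the paper's goal of iteratively converting a weak model into a stronger one.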
Test-Time Training for Depression Detection
arXiv - CS - Sound Pub Date : 2024-04-07 DOI: arxiv-2404.05071
Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore
{"title":"Test-Time Training for Depression Detection","authors":"Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore","doi":"arxiv-2404.05071","DOIUrl":"https://doi.org/arxiv-2404.05071","url":null,"abstract":"Previous works on depression detection use datasets collected in similar\u0000environments to train and test the models. In practice, however, the train and\u0000test distributions cannot be guaranteed to be identical. Distribution shifts\u0000can be introduced due to variations such as recording environment (e.g.,\u0000background noise) and demographics (e.g., gender, age, etc). Such\u0000distributional shifts can surprisingly lead to severe performance degradation\u0000of the depression detection models. In this paper, we analyze the application\u0000of test-time training (TTT) to improve robustness of models trained for\u0000depression detection. When compared to regular testing of the models, we find\u0000TTT can significantly improve the robustness of the model under a variety of\u0000distributional shifts introduced due to: (a) background-noise, (b) gender-bias,\u0000and (c) data collection and curation procedure (i.e., train and test samples\u0000are from separate datasets).","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
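Test-time training adapts the model on unlabeled test data before predicting. The sketch below uses entropy minimization over normalization-layer parameters (a TENT-style recipe) as a stand-in unsupervised objective; the paper's actual TTT auxiliary task and the parameters it updates are not specified in the abstract, so treat every choice here as an assumption:

```python
import torch

def test_time_adapt(model, test_batch, steps=1, lr=1e-4):
    """Adapt a trained classifier on an unlabeled test batch, then predict.

    Assumes the model contains BatchNorm1d or LayerNorm layers; only their
    affine parameters are updated, a common way to limit drift at test time.
    """
    params = [p for m in model.modules()
              if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.LayerNorm))
              for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=lr)

    model.train()
    for _ in range(steps):
        logits = model(test_batch)
        probs = logits.softmax(dim=-1)
        # Prediction-entropy objective: encourage confident predictions.
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()

    model.eval()
    with torch.no_grad():
        return model(test_batch).argmax(dim=-1)
```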
Cross-Domain Audio Deepfake Detection: Dataset and Analysis
arXiv - CS - Sound Pub Date : 2024-04-07 DOI: arxiv-2404.04904
Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang
{"title":"Cross-Domain Audio Deepfake Detection: Dataset and Analysis","authors":"Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang","doi":"arxiv-2404.04904","DOIUrl":"https://doi.org/arxiv-2404.04904","url":null,"abstract":"Audio deepfake detection (ADD) is essential for preventing the misuse of\u0000synthetic voices that may infringe on personal rights and privacy. Recent\u0000zero-shot text-to-speech (TTS) models pose higher risks as they can clone\u0000voices with a single utterance. However, the existing ADD datasets are\u0000outdated, leading to suboptimal generalization of detection models. In this\u0000paper, we construct a new cross-domain ADD dataset comprising over 300 hours of\u0000speech data that is generated by five advanced zero-shot TTS models. To\u0000simulate real-world scenarios, we employ diverse attack methods and audio\u0000prompts from different datasets. Experiments show that, through novel\u0000attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve\u0000equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate\u0000our models' outstanding few-shot ADD ability by fine-tuning with just one\u0000minute of target-domain data. Nonetheless, neural codec compressors greatly\u0000affect the detection accuracy, necessitating further research.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
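The reported detectors fine-tune Wav2Vec2-large and Whisper-medium. A minimal sketch of setting up a Wav2Vec2-based binary real/fake classifier with Hugging Face transformers is below; the checkpoint name, sampling rate, and classification head are illustrative, and the paper's attack-augmented training pipeline is not shown:

```python
# pip install transformers torchaudio
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

model_name = "facebook/wav2vec2-large"   # illustrative checkpoint choice
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name, num_labels=2)

def forward_pass(waveform, label):
    """waveform: 1-D float tensor at 16 kHz; label: 0 = bona fide, 1 = spoof."""
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    out = model(**inputs, labels=torch.tensor([label]))
    return out.loss, out.logits   # cross-entropy loss and class logits
```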
HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks
arXiv - CS - Sound Pub Date : 2024-04-06 DOI: arxiv-2404.04645
Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria
{"title":"HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks","authors":"Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria","doi":"arxiv-2404.04645","DOIUrl":"https://doi.org/arxiv-2404.04645","url":null,"abstract":"Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal\u0000from the text domain to the speech domain. While developing TTS architectures\u0000that train and test on the same set of speakers has seen significant\u0000improvements, out-of-domain speaker performance still faces enormous\u0000limitations. Domain adaptation on a new set of speakers can be achieved by\u0000fine-tuning the whole model for each new domain, thus making it\u0000parameter-inefficient. This problem can be solved by Adapters that provide a\u0000parameter-efficient alternative to domain adaptation. Although famous in NLP,\u0000speech synthesis has not seen much improvement from Adapters. In this work, we\u0000present HyperTTS, which comprises a small learnable network, \"hypernetwork\",\u0000that generates parameters of the Adapter blocks, allowing us to condition\u0000Adapters on speaker representations and making them dynamic. Extensive\u0000evaluations of two domain adaptation settings demonstrate its effectiveness in\u0000achieving state-of-the-art performance in the parameter-efficient regime. We\u0000also compare different variants of HyperTTS, comparing them with baselines in\u0000different studies. Promising results on the dynamic adaptation of adapter\u0000parameters using hypernetworks open up new avenues for domain-generic\u0000multi-speaker TTS systems. The audio samples and code are available at\u0000https://github.com/declare-lab/HyperTTS.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
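HyperTTS has a hypernetwork generate the Adapter parameters from a speaker representation. The toy module below illustrates that idea — hidden size, bottleneck width, speaker-embedding dimension, and where the adapter sits in the TTS backbone are all assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    """A bottleneck adapter whose weights are generated from a speaker embedding."""

    def __init__(self, d_model=256, bottleneck=32, spk_dim=128):
        super().__init__()
        self.d_model, self.bottleneck = d_model, bottleneck
        n_params = 2 * d_model * bottleneck + bottleneck + d_model
        # The hypernetwork: a small MLP mapping speaker embedding -> adapter weights.
        self.hyper = nn.Sequential(
            nn.Linear(spk_dim, 256), nn.ReLU(), nn.Linear(256, n_params)
        )

    def forward(self, h, spk_emb):
        """h: (B, T, d_model) hidden states; spk_emb: (B, spk_dim)."""
        B, T, D = h.shape
        p = self.hyper(spk_emb)                                               # (B, n_params)
        i = 0
        w_down = p[:, i:i + D * self.bottleneck].view(B, D, self.bottleneck); i += D * self.bottleneck
        b_down = p[:, i:i + self.bottleneck].view(B, 1, self.bottleneck);     i += self.bottleneck
        w_up   = p[:, i:i + self.bottleneck * D].view(B, self.bottleneck, D); i += self.bottleneck * D
        b_up   = p[:, i:i + D].view(B, 1, D)
        z = torch.relu(torch.bmm(h, w_down) + b_down)   # speaker-conditioned down-projection
        return h + torch.bmm(z, w_up) + b_up            # residual up-projection
```

Because only the hypernetwork is trained, the backbone stays frozen and the adapter weights become a function of the speaker representation, which is the parameter-efficient, dynamic behavior the abstract describes.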