RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification
June-Woo Kim, Miika Toikkanen, Sangmin Bae, Minseok Kim, Ho-Young Jung
arXiv:2405.02996 (https://arxiv.org/abs/2405.02996), arXiv - CS - Sound, 2024-05-05

Abstract: Recent advancements in AI have democratized its deployment as a healthcare assistant. While pretrained models from large-scale visual and audio datasets have demonstrably generalized to this task, surprisingly, no studies have explored pretrained speech models, which, being trained on human-originated sounds, would intuitively bear a closer resemblance to lung sounds. This paper explores the efficacy of pretrained speech models for respiratory sound classification. We find that there is a characterization gap between speech and lung sound samples, and that data augmentation is essential to bridge this gap. However, the most widely used augmentation technique for audio and speech, SpecAugment, requires a 2-dimensional spectrogram format and cannot be applied to models pretrained on speech waveforms. To address this, we propose RepAugment, an input-agnostic representation-level augmentation technique that not only outperforms SpecAugment but is also suitable for respiratory sound classification with waveform-pretrained models. Experimental results show that our approach outperforms SpecAugment, with gains of up to 7.14% in the accuracy of minority disease classes.
Steered Response Power for Sound Source Localization: A Tutorial Review
Eric Grinstein, Elisa Tengan, Bilgesu Çakmak, Thomas Dietzen, Leonardo Nunes, Toon van Waterschoot, Mike Brookes, Patrick A. Naylor
arXiv:2405.02991 (https://arxiv.org/abs/2405.02991), arXiv - CS - Sound, 2024-05-05

Abstract: Over the last three decades, the Steered Response Power (SRP) method has been widely used for the task of Sound Source Localization (SSL), owing to its satisfactory localization performance in moderately reverberant and noisy scenarios. Many works have analyzed and extended the original SRP method to reduce its computational cost, to allow it to locate multiple sources, or to improve its performance in adverse environments. In this work, we review over 200 papers on the SRP method and its variants, with an emphasis on the SRP-PHAT method. We also present eXtensible-SRP, or X-SRP, a generalized and modularized version of the SRP algorithm that allows the reviewed extensions to be implemented. We provide a Python implementation of the algorithm, which includes selected extensions from the literature.
Quranic Audio Dataset: Crowdsourced and Labeled Recitation from Non-Arabic Speakers
Raghad Salameh, Mohamad Al Mdfaa, Nursultan Askarbekuly, Manuel Mazzara
arXiv:2405.02675 (https://arxiv.org/abs/2405.02675), arXiv - CS - Sound, 2024-05-04

Abstract: This paper addresses the challenge of learning to recite the Quran for non-Arabic speakers. We explore the possibility of crowdsourcing a carefully annotated Quranic dataset, on top of which AI models can be built to simplify the learning process. In particular, we adopt volunteer-based crowdsourcing and implement a crowdsourcing API to gather audio assets. We integrated the API into an existing mobile application called NamazApp to collect audio recitations, and developed a crowdsourcing platform called Quran Voice for annotating the gathered audio assets. As a result, we have collected around 7000 Quranic recitations from a pool of 1287 participants across more than 11 non-Arabic countries, and we have annotated 1166 recitations from the dataset in six categories. We achieved a crowd accuracy of 0.77, an inter-rater agreement of 0.63 between the annotators, and an agreement of 0.89 between the labels assigned by the algorithm and the expert judgments.
{"title":"Toward end-to-end interpretable convolutional neural networks for waveform signals","authors":"Linh Vu, Thu Tran, Wern-Han Lim, Raphael Phan","doi":"arxiv-2405.01815","DOIUrl":"https://doi.org/arxiv-2405.01815","url":null,"abstract":"This paper introduces a novel convolutional neural networks (CNN) framework\u0000tailored for end-to-end audio deep learning models, presenting advancements in\u0000efficiency and explainability. By benchmarking experiments on three standard\u0000speech emotion recognition datasets with five-fold cross-validation, our\u0000framework outperforms Mel spectrogram features by up to seven percent. It can\u0000potentially replace the Mel-Frequency Cepstral Coefficients (MFCC) while\u0000remaining lightweight. Furthermore, we demonstrate the efficiency and\u0000interpretability of the front-end layer using the PhysioNet Heart Sound\u0000Database, illustrating its ability to handle and capture intricate long\u0000waveform patterns. Our contributions offer a portable solution for building\u0000efficient and interpretable models for raw waveform data.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"111 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models
Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva
arXiv:2405.02179 (https://arxiv.org/abs/2405.02179), arXiv - CS - Sound, 2024-05-03

Abstract: Generalization is a main issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which ever more accurate synthesis methods are developed, it is very important to design techniques that also work well on data they were not trained for. In this paper, we study the potential of large-scale pre-trained models for audio deepfake detection, with a special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework, and fake audio is exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake-detection or speaker-verification datasets. At detection time, only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widespread in the community show that detectors based on pre-trained models achieve excellent performance and strong generalization ability, rivaling supervised methods on in-distribution data and largely outperforming them on out-of-distribution data.
GMP-ATL: Gender-augmented Multi-scale Pseudo-label Enhanced Adaptive Transfer Learning for Speech Emotion Recognition via HuBERT
Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao
arXiv:2405.02151 (https://arxiv.org/abs/2405.02151), arXiv - CS - Sound, 2024-05-03

Abstract: The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, there is still room to improve the performance of these methods. In this paper, we present GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning), a novel HuBERT-based adaptive transfer learning framework for SER. Specifically, GMP-ATL initially employs the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level gender-augmented multi-scale pseudo-labels. Then, to fully leverage both the obtained frame-level and utterance-level emotion labels, we incorporate model retraining and fine-tuning methods to further optimize GMP-ATL. Experiments on IEMOCAP show that GMP-ATL achieves superior recognition performance, with a WAR of 80.0% and a UAR of 82.0%, surpassing state-of-the-art unimodal SER methods while also yielding results comparable to multimodal SER approaches.
{"title":"Can We Identify Unknown Audio Recording Environments in Forensic Scenarios?","authors":"Denise Moussa, Germans Hirsch, Christian Riess","doi":"arxiv-2405.02119","DOIUrl":"https://doi.org/arxiv-2405.02119","url":null,"abstract":"Audio recordings may provide important evidence in criminal investigations.\u0000One such case is the forensic association of the recorded audio to the\u0000recording location. For example, a voice message may be the only investigative\u0000cue to narrow down the candidate sites for a crime. Up to now, several works\u0000provide tools for closed-set recording environment classification under\u0000relatively clean recording conditions. However, in forensic investigations, the\u0000candidate locations are case-specific. Thus, closed-set tools are not\u0000applicable without retraining on a sufficient amount of training samples for\u0000each case and respective candidate set. In addition, a forensic tool has to\u0000deal with audio material from uncontrolled sources with variable properties and\u0000quality. In this work, we therefore attempt a major step towards practical forensic\u0000application scenarios. We propose a representation learning framework called\u0000EnvId, short for environment identification. EnvId avoids case-specific\u0000retraining. Instead, it is the first tool for robust few-shot classification of\u0000unseen environment locations. We demonstrate that EnvId can handle forensically\u0000challenging material. It provides good quality predictions even under unseen\u0000signal degradations, environment characteristics or recording position\u0000mismatches. Our code and datasets will be made publicly available upon acceptance.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"247 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint sentiment analysis of lyrics and audio in music","authors":"Lea Schaab, Anna Kruspe","doi":"arxiv-2405.01988","DOIUrl":"https://doi.org/arxiv-2405.01988","url":null,"abstract":"Sentiment or mood can express themselves on various levels in music. In\u0000automatic analysis, the actual audio data is usually analyzed, but the lyrics\u0000can also play a crucial role in the perception of moods. We first evaluate\u0000various models for sentiment analysis based on lyrics and audio separately. The\u0000corresponding approaches already show satisfactory results, but they also\u0000exhibit weaknesses, the causes of which we examine in more detail. Furthermore,\u0000different approaches to combining the audio and lyrics results are proposed and\u0000evaluated. Considering both modalities generally leads to improved performance.\u0000We investigate misclassifications and (also intentional) contradictions between\u0000audio and lyrics sentiment more closely, and identify possible causes. Finally,\u0000we address fundamental problems in this research area, such as high\u0000subjectivity, lack of data, and inconsistency in emotion taxonomies.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets
Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie
arXiv:2405.02132 (https://arxiv.org/abs/2405.02132), arXiv - CS - Sound, 2024-05-03

Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building on this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, we aim to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech-foundation-encoder-plus-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve state-of-the-art performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance with Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs, to promote reproducible research.
Real-time multichannel deep speech enhancement in hearing aids: Comparing monaural and binaural processing in complex acoustic scenarios
Nils L. Westhausen, Hendrik Kayser, Theresa Jansen, Bernd T. Meyer
arXiv:2405.01967 (https://arxiv.org/abs/2405.01967), arXiv - CS - Sound, 2024-05-03

Abstract: Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature low computational complexity and a processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that meets these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While all algorithms perform similarly in diffuse noise, the binaural deep learning approach performs best in the presence of spatial interferers. A post-analysis attributes this to improvements at low SNRs and to precise spatial filtering.