RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification
June-Woo Kim, Miika Toikkanen, Sangmin Bae, Minseok Kim, Ho-Young Jung
arXiv:2405.02996 (https://arxiv.org/abs/2405.02996), arXiv - CS - Sound, 2024-05-05

Abstract: Recent advancements in AI have democratized its deployment as a healthcare assistant. While pretrained models from large-scale visual and audio datasets have demonstrably generalized to this task, surprisingly, no studies have explored pretrained speech models, which, being trained on human-originated sounds, would intuitively bear a closer resemblance to lung sounds. This paper explores the efficacy of pretrained speech models for respiratory sound classification. We find that there is a characterization gap between speech and lung sound samples, and that data augmentation is essential to bridge this gap. However, the most widely used augmentation technique for audio and speech, SpecAugment, requires a 2-dimensional spectrogram format and cannot be applied to models pretrained on speech waveforms. To address this, we propose RepAugment, an input-agnostic representation-level augmentation technique that not only outperforms SpecAugment but is also suitable for respiratory sound classification with waveform-pretrained models. Experimental results show that our approach outperforms SpecAugment, with gains of up to 7.14% in the accuracy of minority disease classes.
Steered Response Power for Sound Source Localization: A Tutorial Review
Eric Grinstein, Elisa Tengan, Bilgesu Çakmak, Thomas Dietzen, Leonardo Nunes, Toon van Waterschoot, Mike Brookes, Patrick A. Naylor
arXiv:2405.02991 (https://arxiv.org/abs/2405.02991), arXiv - CS - Sound, 2024-05-05

Abstract: Over the last three decades, the Steered Response Power (SRP) method has been widely used for the task of Sound Source Localization (SSL), owing to its satisfactory localization performance in moderately reverberant and noisy scenarios. Many works have analyzed and extended the original SRP method to reduce its computational cost, to allow it to locate multiple sources, or to improve its performance in adverse environments. In this work, we review over 200 papers on the SRP method and its variants, with an emphasis on the SRP-PHAT method. We also present eXtensible-SRP, or X-SRP, a generalized and modularized version of the SRP algorithm that allows the reviewed extensions to be implemented. We provide a Python implementation of the algorithm, which includes selected extensions from the literature.
Quranic Audio Dataset: Crowdsourced and Labeled Recitation from Non-Arabic Speakers
Raghad Salameh, Mohamad Al Mdfaa, Nursultan Askarbekuly, Manuel Mazzara
arXiv:2405.02675 (https://arxiv.org/abs/2405.02675), arXiv - CS - Sound, 2024-05-04

Abstract: This paper addresses the challenge of learning to recite the Quran for non-Arabic speakers. We explore the possibility of crowdsourcing a carefully annotated Quranic dataset, on top of which AI models can be built to simplify the learning process. In particular, we adopt volunteer-based crowdsourcing and implement a crowdsourcing API to gather audio assets. We integrated the API into an existing mobile application called NamazApp to collect audio recitations, and developed a crowdsourcing platform called Quran Voice for annotating the gathered audio assets. As a result, we have collected around 7000 Quranic recitations from a pool of 1287 participants across more than 11 non-Arabic countries, and we have annotated 1166 recitations from the dataset in six categories. We achieved a crowd accuracy of 0.77, an inter-rater agreement of 0.63 between the annotators, and an agreement of 0.89 between the labels assigned by the algorithm and the expert judgments.
{"title":"Toward end-to-end interpretable convolutional neural networks for waveform signals","authors":"Linh Vu, Thu Tran, Wern-Han Lim, Raphael Phan","doi":"arxiv-2405.01815","DOIUrl":"https://doi.org/arxiv-2405.01815","url":null,"abstract":"This paper introduces a novel convolutional neural networks (CNN) framework\u0000tailored for end-to-end audio deep learning models, presenting advancements in\u0000efficiency and explainability. By benchmarking experiments on three standard\u0000speech emotion recognition datasets with five-fold cross-validation, our\u0000framework outperforms Mel spectrogram features by up to seven percent. It can\u0000potentially replace the Mel-Frequency Cepstral Coefficients (MFCC) while\u0000remaining lightweight. Furthermore, we demonstrate the efficiency and\u0000interpretability of the front-end layer using the PhysioNet Heart Sound\u0000Database, illustrating its ability to handle and capture intricate long\u0000waveform patterns. Our contributions offer a portable solution for building\u0000efficient and interpretable models for raw waveform data.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"111 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models
Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva
arXiv:2405.02179 (https://arxiv.org/abs/2405.02179), arXiv - CS - Sound, 2024-05-03

Abstract: Generalization is a main issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which ever more accurate synthesis methods are developed, it is very important to design techniques that also work well on data they were not trained for. In this paper, we study the potential of large-scale pre-trained models for audio deepfake detection, with a special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework, and fake audio is exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake-detection or speaker-verification datasets. At detection time, only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widespread in the community show that detectors based on pre-trained models achieve excellent performance and strong generalization ability, rivaling supervised methods on in-distribution data and largely outperforming them on out-of-distribution data.
GMP-ATL: Gender-augmented Multi-scale Pseudo-label Enhanced Adaptive Transfer Learning for Speech Emotion Recognition via HuBERT
Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao
arXiv:2405.02151 (https://arxiv.org/abs/2405.02151), arXiv - CS - Sound, 2024-05-03

Abstract: The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, there is still room to improve the performance of these methods. In this paper, we present GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning), a novel HuBERT-based adaptive transfer learning framework for SER. Specifically, GMP-ATL initially employs the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level gender-augmented multi-scale pseudo-labels. Then, to fully leverage both the obtained frame-level and utterance-level emotion labels, we incorporate model retraining and fine-tuning methods to further optimize GMP-ATL. Experiments on IEMOCAP show that GMP-ATL achieves superior recognition performance, with a WAR of 80.0% and a UAR of 82.0%, surpassing state-of-the-art unimodal SER methods while also yielding results comparable to multimodal SER approaches.
{"title":"Can We Identify Unknown Audio Recording Environments in Forensic Scenarios?","authors":"Denise Moussa, Germans Hirsch, Christian Riess","doi":"arxiv-2405.02119","DOIUrl":"https://doi.org/arxiv-2405.02119","url":null,"abstract":"Audio recordings may provide important evidence in criminal investigations.\u0000One such case is the forensic association of the recorded audio to the\u0000recording location. For example, a voice message may be the only investigative\u0000cue to narrow down the candidate sites for a crime. Up to now, several works\u0000provide tools for closed-set recording environment classification under\u0000relatively clean recording conditions. However, in forensic investigations, the\u0000candidate locations are case-specific. Thus, closed-set tools are not\u0000applicable without retraining on a sufficient amount of training samples for\u0000each case and respective candidate set. In addition, a forensic tool has to\u0000deal with audio material from uncontrolled sources with variable properties and\u0000quality. In this work, we therefore attempt a major step towards practical forensic\u0000application scenarios. We propose a representation learning framework called\u0000EnvId, short for environment identification. EnvId avoids case-specific\u0000retraining. Instead, it is the first tool for robust few-shot classification of\u0000unseen environment locations. We demonstrate that EnvId can handle forensically\u0000challenging material. It provides good quality predictions even under unseen\u0000signal degradations, environment characteristics or recording position\u0000mismatches. Our code and datasets will be made publicly available upon acceptance.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"247 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint sentiment analysis of lyrics and audio in music","authors":"Lea Schaab, Anna Kruspe","doi":"arxiv-2405.01988","DOIUrl":"https://doi.org/arxiv-2405.01988","url":null,"abstract":"Sentiment or mood can express themselves on various levels in music. In\u0000automatic analysis, the actual audio data is usually analyzed, but the lyrics\u0000can also play a crucial role in the perception of moods. We first evaluate\u0000various models for sentiment analysis based on lyrics and audio separately. The\u0000corresponding approaches already show satisfactory results, but they also\u0000exhibit weaknesses, the causes of which we examine in more detail. Furthermore,\u0000different approaches to combining the audio and lyrics results are proposed and\u0000evaluated. Considering both modalities generally leads to improved performance.\u0000We investigate misclassifications and (also intentional) contradictions between\u0000audio and lyrics sentiment more closely, and identify possible causes. Finally,\u0000we address fundamental problems in this research area, such as high\u0000subjectivity, lack of data, and inconsistency in emotion taxonomies.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140886339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets
Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie
arXiv:2405.02132 (https://arxiv.org/abs/2405.02132), arXiv - CS - Sound, 2024-05-03

Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building on this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, we aim to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech-foundation-encoder-plus-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve state-of-the-art performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance with Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs, to promote reproducible research.
Real-time multichannel deep speech enhancement in hearing aids: Comparing monaural and binaural processing in complex acoustic scenarios
Nils L. Westhausen, Hendrik Kayser, Theresa Jansen, Bernd T. Meyer
arXiv:2405.01967 (https://arxiv.org/abs/2405.01967), arXiv - CS - Sound, 2024-05-03

Abstract: Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature low computational complexity and a processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that meets these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While all algorithms perform similarly in diffuse noise, the binaural deep learning approach performs best in the presence of spatial interferers. A post-analysis attributes this to improvements at low SNRs and to precise spatial filtering.