{"title":"Continuous lipreading based on acoustic temporal alignments","authors":"David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos","doi":"10.1186/s13636-024-00345-7","DOIUrl":"https://doi.org/10.1186/s13636-024-00345-7","url":null,"abstract":"Visual speech recognition (VSR) is a challenging task that has received increasing interest during the last few decades. Current state of the art employs powerful end-to-end architectures based on deep learning which depend on large amounts of data and high computational resources for their estimation. We address the task of VSR for data scarcity scenarios with limited computational resources by using traditional approaches based on hidden Markov models. We present a novel learning strategy that employs information obtained from previous acoustic temporal alignments to improve the visual system performance. Furthermore, we studied multiple visual speech representations and how image resolution or frame rate affect its performance. All these experiments were conducted on the limited data VLRF corpus, a database which offers an audio-visual support to address continuous speech recognition in Spanish. The results show that our approach significantly outperforms the best results achieved on the task to date.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"9 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140881737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mi-Go: tool which uses YouTube as data source for evaluating general-purpose speech recognition machine learning models","authors":"Tomasz Wojnar, Jarosław Hryszko, Adam Roman","doi":"10.1186/s13636-024-00343-9","DOIUrl":"https://doi.org/10.1186/s13636-024-00343-9","url":null,"abstract":"This article introduces Mi-Go, a tool aimed at evaluating the performance and adaptability of general-purpose speech recognition machine learning models across diverse real-world scenarios. The tool leverages YouTube as a rich and continuously updated data source, accounting for multiple languages, accents, dialects, speaking styles, and audio quality levels. To demonstrate the effectiveness of the tool, an experiment was conducted, by using Mi-Go to evaluate state-of-the-art automatic speech recognition machine learning models. The evaluation involved a total of 141 randomly selected YouTube videos. The results underscore the utility of YouTube as a valuable data source for evaluation of speech recognition models, ensuring their robustness, accuracy, and adaptability to diverse languages and acoustic conditions. Additionally, by contrasting the machine-generated transcriptions against human-made subtitles, the Mi-Go tool can help pinpoint potential misuse of YouTube subtitles, like search engine optimization.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"43 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140835484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the power of pure attention mechanisms in blind room parameter estimation","authors":"Chunxi Wang, Maoshen Jia, Meiran Li, Changchun Bao, Wenyu Jin","doi":"10.1186/s13636-024-00344-8","DOIUrl":"https://doi.org/10.1186/s13636-024-00344-8","url":null,"abstract":"Dynamic parameterization of acoustic environments has drawn widespread attention in the field of audio processing. Precise representation of local room acoustic characteristics is crucial when designing audio filters for various audio rendering applications. Key parameters in this context include reverberation time (RT $$_{60}$$ ) and geometric room volume. In recent years, neural networks have been extensively applied in the task of blind room parameter estimation. However, there remains a question of whether pure attention mechanisms can achieve superior performance in this task. To address this issue, this study employs blind room parameter estimation based on monaural noisy speech signals. Various model architectures are investigated, including a proposed attention-based model. This model is a convolution-free Audio Spectrogram Transformer, utilizing patch splitting, attention mechanisms, and cross-modality transfer learning from a pretrained Vision Transformer. Experimental results suggest that the proposed attention mechanism-based model, relying purely on attention mechanisms without using convolution, exhibits significantly improved performance across various room parameter estimation tasks, especially with the help of dedicated pretraining and data augmentation schemes. Additionally, the model demonstrates more advantageous adaptability and robustness when handling variable-length audio inputs compared to existing methods.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"10 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140805125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust acoustic reflector localization using a modified EM algorithm","authors":"Usama Saqib, Mads Græsbøll Christensen, Jesper Rindom Jensen","doi":"10.1186/s13636-024-00340-y","DOIUrl":"https://doi.org/10.1186/s13636-024-00340-y","url":null,"abstract":"In robotics, echolocation has been used to detect acoustic reflectors, e.g., walls, as it aids the robotic platform to navigate in darkness and also helps detect transparent surfaces. However, the transfer function or response of an acoustic system, e.g., loudspeakers/emitters, contributes to non-ideal behavior within the acoustic systems that can contribute to a phase lag due to propagation delay. This non-ideal response can hinder the performance of a time-of-arrival (TOA) estimator intended for acoustic reflector localization especially when the estimation of multiple reflections is required. In this paper, we, therefore, propose a robust expectation-maximization (EM) algorithm that takes into account the response of acoustic systems to enhance the TOA estimation accuracy when estimating multiple reflections when the robot is placed in a corner of a room. A non-ideal transfer function is built with two parameters, which are estimated recursively within the estimator. To test the proposed method, a hardware proof-of-concept setup was built with two different designs. The experimental results show that the proposed method could detect an acoustic reflector up to a distance of 1.6 m with $$60%$$ accuracy under the signal-to-noise ratio (SNR) of 0 dB. Compared to the state-of-the-art EM algorithm, our proposed method provides improved performance when estimating TOA by $$10%$$ under a low SNR value.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"9 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140616242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement","authors":"Zehua Zhang, Lu Zhang, Xuyi Zhuang, Yukun Qian, Mingjiang Wang","doi":"10.1186/s13636-024-00341-x","DOIUrl":"https://doi.org/10.1186/s13636-024-00341-x","url":null,"abstract":"Speech signals are often distorted by reverberation and noise, with a widely distributed signal-to-noise ratio (SNR). To address this, our study develops robust, deep neural network (DNN)-based speech enhancement methods. We reproduce several DNN-based monaural speech enhancement methods and outline a strategy for constructing datasets. This strategy, validated through experimental reproductions, has effectively enhanced the denoising efficiency and robustness of the models. Then, we propose a causal speech enhancement system named Supervised Attention Multi-Scale Temporal Convolutional Network (SA-MSTCN). SA-MSTCN extracts the complex compressed spectrum (CCS) for input encoding and employs complex ratio masking (CRM) for output decoding. The supervised attention module, a lightweight addition to SA-MSTCN, guides feature extraction. Experiment results show that the supervised attention module effectively improves noise reduction performance with a minor increase in computational cost. The multi-scale temporal convolutional network refines the perceptual field and better reconstructs the speech signal. Overall, SA-MSTCN not only achieves state-of-the-art speech quality and intelligibility compared to other methods but also maintains stable denoising performance across various environments.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"299 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction: DeepDet: YAMNet with BottleNeck Attention Module (BAM) for TTS synthesis detection","authors":"Rabbia Mahum, Aun Irtaza, Ali Javed, Haitham A. Mahmoud, Haseeb Hassan","doi":"10.1186/s13636-024-00342-w","DOIUrl":"https://doi.org/10.1186/s13636-024-00342-w","url":null,"abstract":"<p><b>Correction</b><b>:</b> <b>EURASIP J. Audio Speech Music Process 2024, 18 (2024)</b></p><p><b>https://doi.org/10.1186/s13636-024-00335-9</b></p><br/><p>Following publication of the original article [1], we have been notified that:</p><p>-Equation 9 was missing from the paper, therefore all equations have been renumbered.</p><p>-The title should be modified from “DeepDet: YAMNet with BottleNeck Attention Module (BAM) TTS synthesis detection” to “DeepDet: YAMNet with BottleNeck Attention Module (BAM) for TTS synthesis detection”.</p><p>-The Acknowledgements section needs to include the following statement:</p><p>The authors extend their appreciation to King Saud University for funding this work through Researchers Supporting Project number (RSPD2024R1006), King Saud University, Riyadh, Saudi Arabia.</p><p>-The below text in the Funding section has been removed:</p><p>The authors extend their appreciation to the Deputyship for Research and Innovation, “Ministry of Education” in Saudi Arabia for funding this research (IFKSUOR3–561–2).</p><p>The original article has been corrected.</p><ol data-track-component=\"outbound reference\"><li data-counter=\"1.\"><p>Mahum et al., DeepDet: YAMNet with BottleNeck Attention Module (BAM) for TTS synthesis detection. EURASIP J. Audio Speech Music Process. <b>2024</b>, 18 (2024). https://doi.org/10.1186/s13636-024-00335-9</p><p>Article Google Scholar </p></li></ol><p>Download references<svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" role=\"img\" width=\"16\"><use xlink:href=\"#icon-eds-i-download-medium\" xmlns:xlink=\"http://www.w3.org/1999/xlink\"></use></svg></p><h3>Authors and Affiliations</h3><ol><li><p>Computer Science Department, UET Taxila, Taxila, Pakistan</p><p>Rabbia Mahum & Aun Irtaza</p></li><li><p>Software Engineering Department, UET Taxila, Taxila, Pakistan</p><p>Ali Javed</p></li><li><p>Industrial Engineering Department, College of Engineering, King Saud University, 11421, Riyadh, Saudi Arabia</p><p>Haitham A. Mahmoud</p></li><li><p>College of Big Data and Internet, Shenzhen Technology University (SZTU), Shenzhen, China</p><p>Haseeb Hassan</p></li></ol><span>Authors</span><ol><li><span>Rabbia Mahum</span>View author publications<p>You can also search for this author in <span>PubMed<span> </span>Google Scholar</span></p></li><li><span>Aun Irtaza</span>View author publications<p>You can also search for this author in <span>PubMed<span> </span>Google Scholar</span></p></li><li><span>Ali Javed</span>View author publications<p>You can also search for this author in <span>PubMed<span> </span>Google Scholar</span></p></li><li><span>Haitham A. 
Mahmoud</span>View author publications<p>You can also search for this author in <span>PubMed<span> </span>Google Scholar</span></p></li><li><span>Haseeb Hassan</span>View author publications<p>You can also search for this author in <span>PubMed<span> </span>Google Scholar</span></p></li></ol><h3>Corresponding author</h3><p>Correspondence to Rabbia Mahum.</p><p><b>O","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"271 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepDet: YAMNet with BottleNeck Attention Module (BAM) for TTS synthesis detection","authors":"Rabbia Mahum, Aun Irtaza, Ali Javed, Haitham A. Mahmoud, Haseeb Hassan","doi":"10.1186/s13636-024-00335-9","DOIUrl":"https://doi.org/10.1186/s13636-024-00335-9","url":null,"abstract":"Spoofed speeches are becoming a big threat to society due to advancements in artificial intelligence techniques. Therefore, there must be an automated spoofing detector that can be integrated into automatic speaker verification (ASV) systems. In this study, we recommend a novel and robust model, named DeepDet, based on deep-layered architecture, to categorize speech into two classes: spoofed and bonafide. DeepDet is an improved model based on Yet Another Mobile Network (YAMNet) employing a customized MobileNet combined with a bottleneck attention module (BAM). First, we convert audio into mel-spectrograms that consist of time–frequency representations on mel-scale. Second, we trained our deep layered model using the extracted mel-spectrograms on a Logical Access (LA) set, including synthesized speeches and voice conversions of the ASVspoof-2019 dataset. In the end, we classified the audios, utilizing our trained binary classifier. More precisely, we utilized the power of layered architecture and guided attention that can discern the spoofed speech from bonafide samples. Our proposed improved model employs depth-wise linearly separate convolutions, which makes our model lighter weight than existing techniques. Furthermore, we implemented extensive experiments to assess the performance of the suggested model using the ASVspoof 2019 corpus. We attained an equal error rate (EER) of 0.042% on Logical Access (LA), whereas 0.43% on Physical Access (PA) attacks. Therefore, the performance of the proposed model is significant on the ASVspoof 2019 dataset and indicates the effectiveness of the DeepDet over existing spoofing detectors. Additionally, our proposed model is robust enough that can identify the unseen spoofed audios and classifies the several attacks accurately.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"1 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-rate modulation encoding via unsupervised learning for audio event detection","authors":"Sandeep Reddy Kothinti, Mounya Elhilali","doi":"10.1186/s13636-024-00339-5","DOIUrl":"https://doi.org/10.1186/s13636-024-00339-5","url":null,"abstract":"Technologies in healthcare, smart homes, security, ecology, and entertainment all deploy audio event detection (AED) in order to detect sound events in an audio recording. Effective AED techniques rely heavily on supervised or semi-supervised models to capture the wide range of dynamics spanned by sound events in order to achieve temporally precise boundaries and accurate event classification. These methods require extensive collections of labeled or weakly labeled in-domain data, which is costly and labor-intensive. Importantly, these approaches do not fully leverage the inherent variability and range of dynamics across sound events, aspects that can be effectively identified through unsupervised methods. The present work proposes an approach based on multi-rate autoencoders that are pretrained in an unsupervised way to leverage unlabeled audio data and ultimately learn the rich temporal dynamics inherent in natural sound events. This approach utilizes parallel autoencoders that achieve decompositions of the modulation spectrum along different bands. In addition, we introduce a rate-selective temporal contrastive loss to align the training objective with event detection metrics. Optimizing the configuration of multi-rate encoders and the temporal contrastive loss leads to notable improvements in domestic sound event detection in the context of the DCASE challenge.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"55 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synthesis of soundfields through irregular loudspeaker arrays based on convolutional neural networks","authors":"Luca Comanducci, Fabio Antonacci, Augusto Sarti","doi":"10.1186/s13636-024-00337-7","DOIUrl":"https://doi.org/10.1186/s13636-024-00337-7","url":null,"abstract":"Most soundfield synthesis approaches deal with extensive and regular loudspeaker arrays, which are often not suitable for home audio systems, due to physical space constraints. In this article, we propose a technique for soundfield synthesis through more easily deployable irregular loudspeaker arrays, i.e., where the spacing between loudspeakers is not constant, based on deep learning. The input are the driving signals obtained through a plane wave decomposition-based technique. While the considered driving signals are able to correctly reproduce the soundfield with a regular array, they show degraded performances when using irregular setups. Through a complex-valued convolutional neural network (CNN), we modify the driving signals in order to compensate the errors in the reproduction of the desired soundfield. Since no ground truth driving signals are available for the compensated ones, we train the model by calculating the loss between the desired soundfield at a number of control points and the one obtained through the driving signals estimated by the network. The proposed model must be retrained for each irregular loudspeaker array configuration. Numerical results show better reproduction accuracy with respect to the plane wave decomposition-based technique, pressure-matching approach, and linear optimizers for driving signal compensation.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"61 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140313198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An end-to-end approach for blindly rendering a virtual sound source in an audio augmented reality environment","authors":"Shivam Saini, Isaac Engel, Jürgen Peissig","doi":"10.1186/s13636-024-00338-6","DOIUrl":"https://doi.org/10.1186/s13636-024-00338-6","url":null,"abstract":"Audio augmented reality (AAR), a prominent topic in the field of audio, requires understanding the listening environment of the user for rendering an authentic virtual auditory object. Reverberation time ( $$RT_{60}$$ ) is a predominant metric for the characterization of room acoustics and numerous approaches have been proposed to estimate it blindly from a reverberant speech signal. However, a single $$RT_{60}$$ value may not be sufficient to correctly describe and render the acoustics of a room. This contribution presents a method for the estimation of multiple room acoustic parameters required to render close-to-accurate room acoustics in an unknown environment. It is shown how these parameters can be estimated blindly using an audio transformer that can be deployed on a mobile device. Furthermore, the paper also discusses the use of the estimated room acoustic parameters to find a similar room from a dataset of real BRIRs that can be further used for rendering the virtual audio source. Additionally, a novel binaural room impulse response (BRIR) augmentation technique to overcome the limitation of inadequate data is proposed. Finally, the proposed method is validated perceptually by means of a listening test.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"117 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140313191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}