{"title":"A survey of technologies for automatic Dysarthric speech recognition","authors":"Zhaopeng Qian, Kejing Xiao, Chongchong Yu","doi":"10.1186/s13636-023-00318-2","DOIUrl":"https://doi.org/10.1186/s13636-023-00318-2","url":null,"abstract":"Abstract Speakers with dysarthria often struggle to accurately pronounce words and effectively communicate with others. Automatic speech recognition (ASR) is a powerful tool for extracting the content from speakers with dysarthria. However, the narrow concept of ASR typically only covers technologies that process acoustic modality signals. In this paper, we broaden the scope of this concept that the generalized concept of ASR for dysarthric speech. Our survey discussed the systems encompassed acoustic modality processing, articulatory movements processing and audio-visual modality fusion processing in the application of recognizing dysarthric speech. Contrary to previous surveys on dysarthric speech recognition, we have conducted a systematic review of the advancements in this field. In particular, we introduced state-of-the-art technologies to supplement the survey of recent research during the era of multi-modality fusion in dysarthric speech recognition. Our survey found that audio-visual fusion technologies perform better than traditional ASR technologies in the task of dysarthric speech recognition. However, training audio-visual fusion models requires more computing resources, and the available data corpus for dysarthric speech is limited. Despite these challenges, state-of-the-art technologies show promising potential for further improving the accuracy of dysarthric speech recognition in the future.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"43 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135042425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling","authors":"Kavya Manohar, Jayan A R, Rajeev Rajan","doi":"10.1186/s13636-023-00313-7","DOIUrl":"https://doi.org/10.1186/s13636-023-00313-7","url":null,"abstract":"Abstract This article presents the research work on improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling. The speech recognition system is built using a deep neural network–hidden Markov model (DNN-HMM)-based automatic speech recognition (ASR). We propose a novel method, syllable-byte pair encoding (S-BPE), that combines linguistically informed syllable tokenization with the data-driven tokenization method of byte pair encoding (BPE). The proposed method ensures words are always segmented at valid pronunciation boundaries. On a text corpus that has been divided into tokens using the proposed method, we construct statistical n-gram language models and assess the modeling effectiveness in terms of both information-theoretic and corpus linguistic metrics. A comparative study of the proposed method with other data-driven (BPE, Morfessor, and Unigram), linguistic (Syllable), and baseline (Word) tokenization algorithms is also presented. Pronunciation lexicons of subword tokenized units are built with pronunciation described as graphemes. We develop ASR systems employing the subword tokenized language models and pronunciation lexicons. The resulting ASR models are comprehensively evaluated to answer the research questions regarding the impact of subword tokenization algorithms on language modeling complexity and on ASR performance. Our study highlights the strong performance of the hybrid S-BPE tokens, achieving a notable 10.6% word error rate (WER), which represents a substantial 16.8% improvement over the baseline word-level ASR system. The ablation study has revealed that the performance of S-BPE segmentation, which initially underperformed compared to syllable tokens with lower amounts of textual data for language modeling, exhibited steady improvement with the increase in LM training data. The extensive ablation study indicates that there is a limited advantage in raising the n-gram order of the language model beyond $$n=3$$ <mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\"> <mml:mrow> <mml:mi>n</mml:mi> <mml:mo>=</mml:mo> <mml:mn>3</mml:mn> </mml:mrow> </mml:math> . Such an increase results in considerable model size growth without significant improvements in WER. The implementation of the algorithm and all associated experiments are available under an open license, allowing for reproduction, adaptation, and reuse.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"66 s7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135773441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robustness of ad hoc microphone clustering using speaker embeddings: evaluation under realistic and challenging scenarios","authors":"Stijn Kindt, Jenthe Thienpondt, Luca Becker, Nilesh Madhu","doi":"10.1186/s13636-023-00310-w","DOIUrl":"https://doi.org/10.1186/s13636-023-00310-w","url":null,"abstract":"Abstract Speaker embeddings, from the ECAPA-TDNN speaker verification network, were recently introduced as features for the task of clustering microphones in ad hoc arrays. Our previous work demonstrated that, in comparison to signal-based Mod-MFCC features, using speaker embeddings yielded a more robust and logical clustering of the microphones around the sources of interest. This work aims to further establish speaker embeddings as a robust feature for ad hoc microphone clustering by addressing open and additional questions of practical interest, arising from our prior work. Specifically, whereas our initial work made use of simulated data based on shoe-box acoustics models, we now present a more thorough analysis in more realistic settings. Furthermore, we investigate additional important considerations such as the choice of the distance metric used in the fuzzy C-means clustering; the minimal time range across which data need to be aggregated to obtain robust clusters; and the performance of the features in increasingly more challenging situations, and with multiple speakers. We also contrast the results on the basis of several metrics for quantifying the quality of such ad hoc clusters. Results indicate that the speaker embeddings are robust to short inference times, and deliver logical and useful clusters, even when the sources are very close to each other.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"15 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135814221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision","authors":"Hao Huang, Lin Wang, Jichen Yang, Ying Hu, Liang He","doi":"10.1186/s13636-023-00312-8","DOIUrl":"https://doi.org/10.1186/s13636-023-00312-8","url":null,"abstract":"Abstract Non-parallel data voice conversion (VC) has achieved considerable breakthroughs due to self-supervised pre-trained representation (SSPR) being used in recent years. Features extracted by the pre-trained model are expected to contain more content information. However, in common VC with SSPR, there is no special implementation to remove speaker information in the content representation extraction by SSPR, which prevents further purification of the speaker information from SSPR representation. Moreover, in conventional VC, Mel-spectrogram is often selected as the reconstructed acoustic feature, which is not consistent with the input of the content encoder and results in some information lost. Motivated by the above, we proposed W2VC to settle the issues. W2VC consists of three parts: (1) We reconstruct feature from WavLM representation (WLMR) that is more consistent with the input of content encoder; (2) Connectionist temporal classification (CTC) is used to align content representation and text context from phoneme level, content encoder plus gradient reversal layer (GRL) based speaker classifier are used to remove speaker information in the content representation extraction; (3) WLMR-based HiFi-GAN is trained to convert WLMR to waveform speech. VC experimental results show that GRL can purify well the content information of the self-supervised model. The GRL purification and CTC supervision on the content encoder are complementary in improving the VC performance. Moreover, the synthesized speech using the WLMR retrained vocoder achieves better results in both subjective and objective evaluation. The proposed method is evaluated on the VCTK and CMU databases. It is shown the method achieves 8.901 in objective MCD, 4.45 in speech naturalness, and 3.62 in speaker similarity of subjective MOS score, which is superior to the baseline.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136233736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"YuYin: a multi-task learning model of multi-modal e-commerce background music recommendation","authors":"Le Ma, Xinda Wu, Ruiyuan Tang, Chongjun Zhong, Kejun Zhang","doi":"10.1186/s13636-023-00306-6","DOIUrl":"https://doi.org/10.1186/s13636-023-00306-6","url":null,"abstract":"Abstract Appropriate background music in e-commerce advertisements can help stimulate consumption and build product image. However, many factors like emotion and product category should be taken into account, which makes manually selecting music time-consuming and require professional knowledge and it becomes crucial to automatically recommend music for video. For there is no e-commerce advertisements dataset, we first establish a large-scale e-commerce advertisements dataset Commercial-98K, which covers major e-commerce categories. Then, we proposed a video-music retrieval model YuYin to learn the correlation between video and music. We introduce a weighted fusion module (WFM) to fuse emotion features and audio features from music to get a more fine-grained music representation. Considering the similarity of music in the same product category, YuYin is trained by multi-task learning to explore the correlation between video and music by cross-matching video, music, and tag as well as a category prediction task. We conduct extensive experiments to prove YuYin achieves a remarkable improvement in video-music retrieval on Commercial-98K.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135729848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transformer-based autoencoder with ID constraint for unsupervised anomalous sound detection","authors":"Jian Guan, Youde Liu, Qiuqiang Kong, Feiyang Xiao, Qiaoxi Zhu, Jiantong Tian, Wenwu Wang","doi":"10.1186/s13636-023-00308-4","DOIUrl":"https://doi.org/10.1186/s13636-023-00308-4","url":null,"abstract":"Abstract Unsupervised anomalous sound detection (ASD) aims to detect unknown anomalous sounds of devices when only normal sound data is available. The autoencoder (AE) and self-supervised learning based methods are two mainstream methods. However, the AE-based methods could be limited as the feature learned from normal sounds can also fit with anomalous sounds, reducing the ability of the model in detecting anomalies from sound. The self-supervised methods are not always stable and perform differently, even for machines of the same type. In addition, the anomalous sound may be short-lived, making it even harder to distinguish from normal sound. This paper proposes an ID-constrained Transformer-based autoencoder (IDC-TransAE) architecture with weighted anomaly score computation for unsupervised ASD. Machine ID is employed to constrain the latent space of the Transformer-based autoencoder (TransAE) by introducing a simple ID classifier to learn the difference in the distribution for the same machine type and enhance the ability of the model in distinguishing anomalous sound. Moreover, weighted anomaly score computation is introduced to highlight the anomaly scores of anomalous events that only appear for a short time. Experiments performed on DCASE 2020 Challenge Task2 development dataset demonstrate the effectiveness and superiority of our proposed method.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135853424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Battling with the low-resource condition for snore sound recognition: introducing a meta-learning strategy","authors":"Jingtan Li, Mengkai Sun, Zhonghao Zhao, Xingcan Li, Gaigai Li, Chen Wu, Kun Qian, Bin Hu, Yoshiharu Yamamoto, Björn W. Schuller","doi":"10.1186/s13636-023-00309-3","DOIUrl":"https://doi.org/10.1186/s13636-023-00309-3","url":null,"abstract":"Abstract Snoring affects 57 % of men, 40 % of women, and 27 % of children in the USA. Besides, snoring is highly correlated with obstructive sleep apnoea (OSA), which is characterised by loud and frequent snoring. OSA is also closely associated with various life-threatening diseases such as sudden cardiac arrest and is regarded as a grave medical ailment. Preliminary studies have shown that in the USA, OSA affects over 34 % of men and 14 % of women. In recent years, polysomnography has increasingly been used to diagnose OSA. However, due to its drawbacks such as being time-consuming and costly, intelligent audio analysis of snoring has emerged as an alternative method. Considering the higher demand for identifying the excitation location of snoring in clinical practice, we utilised the Munich-Passau Snore Sound Corpus (MPSSC) snoring database which classifies the snoring excitation location into four categories. Nonetheless, the problem of small samples remains in the MPSSC database due to factors such as privacy concerns and difficulties in accurate labelling. In fact, accurately labelled medical data that can be used for machine learning is often scarce, especially for rare diseases. In view of this, Model-Agnostic Meta-Learning (MAML), a small sample method based on meta-learning, is used to classify snore signals with less resources in this work. The experimental results indicate that even when using only the ESC-50 dataset (non-snoring sound signals) as the data for meta-training, we are able to achieve an unweighted average recall of 60.2 % on the test dataset after fine-tuning on just 36 instances of snoring from the development part of the MPSSC dataset. While our results only exceed the baseline by 4.4 %, they still demonstrate that even with fine-tuning on a few instances of snoring, our model can outperform the baseline. This implies that the MAML algorithm can effectively tackle the low-resource problem even with limited data resources.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135855849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep encoder/decoder dual-path neural network for speech separation in noisy reverberation environments","authors":"Chunxi Wang, Maoshen Jia, Xinfeng Zhang","doi":"10.1186/s13636-023-00307-5","DOIUrl":"https://doi.org/10.1186/s13636-023-00307-5","url":null,"abstract":"Abstract In recent years, the speaker-independent, single-channel speech separation problem has made significant progress with the development of deep neural networks (DNNs). However, separating the speech of each interested speaker from an environment that includes the speech of other speakers, background noise, and room reverberation remains challenging. In order to solve this problem, a speech separation method for a noisy reverberation environment is proposed. Firstly, the time-domain end-to-end network structure of a deep encoder/decoder dual-path neural network is introduced in this paper for speech separation. Secondly, to make the model not fall into local optimum during training, a loss function stretched optimal scale-invariant signal-to-noise ratio (SOSISNR) was proposed, inspired by the scale-invariant signal-to-noise ratio (SISNR). At the same time, in order to make the training more appropriate to the human auditory system, the joint loss function is extended based on short-time objective intelligibility (STOI). Thirdly, an alignment operation is proposed to reduce the influence of time delay caused by reverberation on separation performance. Combining the above methods, the subjective and objective evaluation metrics show that this study has better separation performance in complex sound field environments compared to the baseline methods.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135924888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech emotion recognition based on Graph-LSTM neural network","authors":"Yan Li, Yapeng Wang, Xu Yang, Sio-Kei Im","doi":"10.1186/s13636-023-00303-9","DOIUrl":"https://doi.org/10.1186/s13636-023-00303-9","url":null,"abstract":"Abstract Currently, Graph Neural Networks have been extended to the field of speech signal processing. It is the more compact and flexible way to represent speech sequences by graphs. However, the structures of the relationships in recent studies are tend to be relatively uncomplicated. Moreover, the graph convolution module exhibits limitations that impede its adaptability to intricate application scenarios. In this study, we establish the speech-graph using feature similarity and introduce a novel architecture for graph neural network that leverages an LSTM aggregator and weighted pooling. The unweighted accuracy of 65.39% and the weighted accuracy of 71.83% are obtained on the IEMOCAP dataset, achieving the performance comparable to or better than existing graph baselines. This method can improve the interpretability of the model to some extent, and identify speech emotion features effectively.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136209211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An acoustic echo canceller optimized for hands-free speech telecommunication in large vehicle cabins","authors":"Amin Saremi, Balaji Ramkumar, Ghazaleh Ghaffari, Zonghua Gu","doi":"10.1186/s13636-023-00305-7","DOIUrl":"https://doi.org/10.1186/s13636-023-00305-7","url":null,"abstract":"Abstract Acoustic echo cancelation (AEC) is a system identification problem that has been addressed by various techniques and most commonly by normalized least mean square (NLMS) adaptive algorithms. However, performing a successful AEC in large commercial vehicles has proved complicated due to the size and challenging variations in the acoustic characteristics of their cabins. Here, we present a wideband fully linear time domain NLMS algorithm for AEC that is enhanced by a statistical double-talk detector (DTD) and a voice activity detector (VAD). The proposed solution was tested in four main Volvo truck models, with various cabin geometries, using standard Swedish hearing-in-noise (HINT) sentences in the presence and absence of engine noise. The results show that the proposed solution achieves a high echo return loss enhancement (ERLE) of at least 25 dB with a fast convergence time, fulfilling ITU G.168 requirements. The presented solution was particularly developed to provide a practical compromise between accuracy and computational cost to allow its real-time implementation on commercial digital signal processors (DSPs). A real-time implementation of the solution was coded in C on an ARM Cortex M-7 DSP. The algorithmic latency was measured at less than 26 ms for processing each 50-ms buffer indicating the computational feasibility of the proposed solution for real-time implementation on common DSPs and embedded systems with limited computational and memory resources. MATLAB source codes and related audio files are made available online for reference and further development.","PeriodicalId":49309,"journal":{"name":"Journal on Audio Speech and Music Processing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135254957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}