Computer Speech and Language: Latest Articles

MECOS: A bilingual Manipuri–English spontaneous code-switching speech corpus for automatic speech recognition
IF 4.3, Tier 3, Computer Science
Computer Speech and Language Pub Date : 2024-02-20 DOI: 10.1016/j.csl.2024.101627
Naorem Karline Singh, Yambem Jina Chanu, Hoomexsun Pangsatabam
Abstract: In this study, we introduce a new code-switched speech database with 57 h of annotated, spontaneous Manipuri–English speech. Manipuri is an official language of India and is primarily spoken in the north-eastern Indian state of Manipur. Most native speakers of Manipuri today are bilingual and frequently code-switch in everyday conversation. Recordings were gathered from YouTube after carefully assessing the amount of code-switched speech in each video; the database contains 21,339 utterances and 291,731 instances of code switching. Given the code-switching nature of the data, a dedicated annotation procedure is used: the data are manually annotated in the Meitei Mayek Unicode script for Manipuri and the Roman alphabet for English. The transcriptions include speaker information, non-speech information, and the corresponding annotation. The aim of this research is to construct an automatic speech recognition (ASR) system as well as to offer a thorough analysis and description of the speech corpus. We believe this is the first work to build an ASR system for Manipuri–English code-switched speech. To evaluate performance, ASR systems based on a hybrid deep neural network and hidden Markov model (DNN–HMM), a time delay neural network (TDNN), a hybrid TDNN and long short-term memory network (TDNN–LSTM), and three end-to-end (E2E) models, i.e., a hybrid connectionist temporal classification and attention model (CTC–Attention), Conformer, and wav2vec XLSR, are developed for the Manipuri–English language pair. In comparison to the other models, the pure TDNN produces clearly superior results.
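The corpus annotation mixes Meitei Mayek script for Manipuri with the Roman alphabet for English, so code-switch points are visible at the script level. Below is a minimal, illustrative Python sketch (not part of MECOS) of counting such switch points; the helper names and the use of the standard Meetei Mayek Unicode blocks are assumptions made here for illustration only.

```python
# Minimal sketch (not from the paper): counting code-switch points in a
# transcription that mixes Meitei Mayek and Roman script. The Unicode ranges
# below are the standard Meetei Mayek blocks; all names are hypothetical.

def script_of(ch: str) -> str:
    """Return a coarse script label for a single character."""
    cp = ord(ch)
    if 0xABC0 <= cp <= 0xABFF or 0xAAE0 <= cp <= 0xAAFF:   # Meetei Mayek (+ extensions)
        return "manipuri"
    if ch.isascii() and ch.isalpha():                        # Roman letters -> English
        return "english"
    return "other"

def count_switch_points(utterance: str) -> int:
    """Count transitions between Manipuri-script and Roman-script tokens."""
    labels = []
    for token in utterance.split():
        scripts = {script_of(c) for c in token} - {"other"}
        if len(scripts) == 1:
            labels.append(scripts.pop())
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)

# Hypothetical mixed utterance: Meetei Mayek letters given as escapes.
sample = "\uABC0\uABC2 office \uABC3\uABC4 because"
print(count_switch_points(sample))   # 3 switch points
```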
Citations: 0
Translating scientific abstracts in the bio-medical domain with structure-aware models
IF 4.3, Tier 3, Computer Science
Computer Speech and Language Pub Date : 2024-02-09 DOI: 10.1016/j.csl.2024.101623
Sadaf Abdul Rauf, François Yvon
Abstract: Machine translation (MT) technologies have improved in many ways and generate usable outputs for a growing number of domains and language pairs. Yet most sentence-based MT systems struggle with contextual dependencies, processing small chunks of text, typically sentences, in isolation from their textual context. This is likely to cause systematic errors or inconsistencies when processing long documents. While various attempts have been made to handle extended contexts in translation, the relevance of these contextual cues, especially those related to structural organization, and the extent to which they affect translation quality remain under-explored. In this work, we explore ways to take these structural aspects into account by integrating document structure as an extra conditioning context. Our experiments on biomedical abstracts, which are usually structured in a rigid way, suggest that this type of structural information can be useful for MT and for document structure prediction. We also present in detail the impact of structural information on MT output and assess the degree to which structural information can be learned from the data.
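One simple way to realize the kind of structural conditioning described above is to prepend a section tag to each source sentence before it is translated. The sketch below is an assumption about how such conditioning could look, not the authors' implementation; the tag set and helper names are invented for illustration.

```python
# Minimal sketch (assumed, not the paper's code): expose document structure to
# a sentence-level MT system by prefixing each source sentence with the tag of
# the structured-abstract section it belongs to.

from typing import List, Tuple

SECTION_TAGS = ["<BACKGROUND>", "<METHODS>", "<RESULTS>", "<CONCLUSIONS>"]  # hypothetical tag set

def tag_abstract(sentences: List[Tuple[str, str]]) -> List[str]:
    """sentences: list of (section_label, source_sentence) pairs."""
    tagged = []
    for section, sent in sentences:
        tag = f"<{section.upper()}>"
        assert tag in SECTION_TAGS, f"unknown section: {section}"
        tagged.append(f"{tag} {sent}")
    return tagged

abstract = [
    ("background", "Le cancer du sein reste la première cause de mortalité."),
    ("methods", "Nous avons mené une étude rétrospective sur 120 patientes."),
]
for line in tag_abstract(abstract):
    print(line)   # each tagged line is then fed to the structure-aware MT model
```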
Citations: 0
A novel Chinese–Tibetan mixed-language rumor detector with multi-extractor representations
IF 4.3, Tier 3, Computer Science
Computer Speech and Language Pub Date : 2024-02-07 DOI: 10.1016/j.csl.2024.101625
Lisu Yu, Fei Li, Lixin Yu, Wei Li, Zhicheng Dong, Donghong Cai, Zhen Wang
Abstract: Rumors can propagate easily through social media, posing potential threats to both individual and public health. Most existing approaches focus on single-language rumor detection, which leads to unsatisfactory performance when they are applied to mixed-language rumor detection. Moreover, the type of language mixing (word-level or sentence-level) poses a great challenge for mixed-language rumor detection. Focusing on a mixed Chinese–Tibetan setting, this paper first provides a Chinese–Tibetan mixed-language rumor detection dataset (Weibo_Ch_Ti) comprising 1,617 non-rumor tweets and 1,456 rumor tweets across the two mixed-language types. It then proposes an effective model with multiple extractors, "MER-CTRD" for short. The model consists of three main extractors: the Multi-task Extractor helps the model extract feature representations of different mixed-language types adaptively; the Rich-semantic Extractor enriches the semantic feature representations of Tibetan in the Chinese–Tibetan mixed language; and the Fusion-feature Extractor fuses the mean and disparity semantic features of Chinese and Tibetan to complement the feature representations of the mixed language. Experiments on Weibo_Ch_Ti show that the proposed model improves accuracy by about 3%–12% over the baseline models, indicating its effectiveness in the Chinese–Tibetan mixed-language rumor detection scenario.
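As a rough illustration of the fusion idea attributed to the Fusion-feature Extractor, the PyTorch sketch below combines the mean and the difference (disparity) of Chinese and Tibetan sentence-level features; the shapes and names are assumptions, not the released MER-CTRD code.

```python
# Minimal sketch (assumed shapes/names): fuse mean and disparity features of
# Chinese and Tibetan sentence representations into one joint vector.

import torch

def fuse_features(h_zh: torch.Tensor, h_bo: torch.Tensor) -> torch.Tensor:
    """h_zh, h_bo: (batch, dim) sentence-level features for Chinese / Tibetan."""
    mean_feat = 0.5 * (h_zh + h_bo)        # shared semantics
    disp_feat = h_zh - h_bo                # cross-language disparity
    return torch.cat([mean_feat, disp_feat], dim=-1)   # (batch, 2*dim)

h_zh = torch.randn(8, 256)
h_bo = torch.randn(8, 256)
print(fuse_features(h_zh, h_bo).shape)     # torch.Size([8, 512])
```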
Citations: 0
Single-channel speech enhancement using colored spectrograms
IF 4.3, Tier 3, Computer Science
Computer Speech and Language Pub Date : 2024-02-07 DOI: 10.1016/j.csl.2024.101626
Sania Gul, Muhammad Salman Khan, Muhammad Fazeel
Abstract: Speech enhancement concerns the processes required to remove unwanted background sounds from the target speech to improve its quality and intelligibility. In this paper, a novel approach for single-channel speech enhancement using colored spectrograms is presented. We propose a deep neural network (DNN) architecture adapted from the pix2pix generative adversarial network (GAN) and train it on colored spectrograms of speech to denoise them. After denoising, the colors of the spectrograms are translated to magnitudes of the short-time Fourier transform (STFT) using a shallow regression neural network. These estimated STFT magnitudes are then combined with the noisy phases to obtain the enhanced speech. The results show an improvement of almost 0.84 points in the perceptual evaluation of speech quality (PESQ) and 1% in short-term objective intelligibility (STOI) over the unprocessed noisy data. The gain in quality and intelligibility over the unprocessed signal is almost equal to that achieved by the baseline methods used for comparison, but at a much lower computational cost. The proposed solution matches the PESQ score of the baseline that achieves the highest PESQ (trained on grayscale spectrograms) at almost 10 times lower computational cost, and falls only 1% short in STOI of another baseline, based on a convolutional neural network GAN (CNN-GAN), that produces the most intelligible speech, at 28 times lower computational cost.
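The final reconstruction step, combining the estimated STFT magnitudes with the phase of the noisy input, can be written compactly with librosa. The sketch below is a minimal illustration with assumed STFT parameters, not the paper's full pipeline; the estimated magnitude would normally come from the color-to-magnitude regression network.

```python
# Minimal sketch: recombine an estimated STFT magnitude with the noisy phase
# and invert to a waveform. STFT parameters here are assumptions.

import numpy as np
import librosa

def reconstruct(noisy_wav: np.ndarray, est_magnitude: np.ndarray,
                n_fft: int = 512, hop_length: int = 128) -> np.ndarray:
    noisy_stft = librosa.stft(noisy_wav, n_fft=n_fft, hop_length=hop_length)
    noisy_phase = np.angle(noisy_stft)
    enhanced_stft = est_magnitude * np.exp(1j * noisy_phase)
    return librosa.istft(enhanced_stft, hop_length=hop_length)

# Stand-in magnitude (the noisy magnitude itself) keeps the example runnable;
# in the described system it would be predicted from the denoised colored spectrogram.
noisy = np.random.randn(16000).astype(np.float32)
mag = np.abs(librosa.stft(noisy, n_fft=512, hop_length=128))
enhanced = reconstruct(noisy, mag)
```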
Citations: 0
A method of phonemic annotation for Chinese dialects based on a deep learning model with adaptive temporal attention and a feature disentangling structure
IF 4.3, Tier 3, Computer Science
Computer Speech and Language Pub Date : 2024-02-05 DOI: 10.1016/j.csl.2024.101624
Bowen Jiang, Qianhui Dong, Guojin Liu
Abstract: Phonemic annotation aims at labeling a speech fragment with phonemic symbols. Because the phonetic features of speech vary greatly across languages and their dialects, phonemic symbols are an important means of describing and writing down the phonetic system of a language, and it is therefore worthwhile to develop an automatic and effective method for this task. In this paper, we first establish a Chinese dataset in which each datum consists of an original speech signal and the corresponding manually annotated phonemic characters. We then propose a deep learning model that performs automatic phonemic annotation for speech fragments spoken in diverse Chinese dialects. The overall structure of the model is a many-to-many deep bi-directional gated recurrent unit (GRU) network, and an adaptive temporal attention mechanism connects the encoder and decoder modules to adaptively prevent loss of features. Meanwhile, a feature disentangling structure based on a generative adversarial network (GAN) is adopted to attenuate the interference with the phonemic annotation task caused by unrelated tone features in the original speech signal and to further improve annotation performance. Extensive experimental results verify the superiority of our model and the proposed strategies on the constructed dataset.
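The backbone described above is a deep bi-directional GRU over acoustic frames. A bare-bones PyTorch encoder of that kind is sketched below with assumed dimensions; it is not the authors' released model and omits the attention and GAN-based disentangling components.

```python
# Minimal sketch (assumed dimensions): a deep bidirectional GRU encoder over
# acoustic frames, whose outputs would feed the attention/decoder stage.

import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, layers=3):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=layers,
                          batch_first=True, bidirectional=True)

    def forward(self, frames):                 # frames: (batch, time, feat_dim)
        outputs, _ = self.gru(frames)          # (batch, time, 2*hidden)
        return outputs

enc = BiGRUEncoder()
print(enc(torch.randn(4, 200, 80)).shape)      # torch.Size([4, 200, 512])
```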
Citations: 0
LeBenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of French speech
IF 4.3, Tier 3, Computer Science
Computer Speech and Language Pub Date : 2024-02-03 DOI: 10.1016/j.csl.2024.101622
Titouan Parcollet, Ha Nguyen, Solène Evain, Marcely Zanon Boito, Adrien Pupier, Salima Mdhaffar, Hang Le, Sina Alisamir, Natalia Tomashenko, Marco Dinarelli, Shucong Zhang, Alexandre Allauzen, Maximin Coavoux, Yannick Estève, Mickael Rouvier, Jerôme Goulian, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Laurent Besacier
Abstract: Self-supervised learning (SSL) is at the origin of unprecedented improvements in many domains, including computer vision and natural language processing. Speech processing has benefited drastically from SSL, as most current domain-related tasks are now approached with pre-trained models. This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale corpora with up to 14,000 h of heterogeneous speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community, and an evaluation protocol made of six downstream tasks to complement existing benchmarks. LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for speech, with an investigation of frozen versus fine-tuned downstream models, task-agnostic versus task-specific pre-trained models, and a discussion of the carbon footprint of large-scale model training. Overall, the newly introduced models trained on 14,000 h of French speech outperform multilingual and previous LeBenchmark SSL models across the benchmark, but also required up to four times more energy for pre-training.
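Since the released checkpoints are wav2vec 2.0 models, they can be used as frozen feature extractors through the transformers library. The sketch below assumes a Hugging Face Hub identifier of the form shown; substitute the actual LeBenchmark checkpoint name you intend to use.

```python
# Minimal sketch: extract frame-level SSL features from a pre-trained
# wav2vec 2.0 checkpoint. The model identifier is an assumed hub name.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "LeBenchmark/wav2vec2-FR-7K-large"          # assumed hub identifier
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

wave = torch.randn(16000)                           # 1 s of 16 kHz audio (placeholder)
inputs = extractor(wave.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state    # (1, frames, hidden_dim)
print(features.shape)
```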
Citations: 0
Spectral–temporal saliency masks and modulation tensorgrams for generalizable COVID-19 detection
IF 4.3, Tier 3, Computer Science
Computer Speech and Language Pub Date : 2024-02-01 DOI: 10.1016/j.csl.2024.101620
Yi Zhu, Tiago H. Falk
Abstract: Speech-based COVID-19 detection systems have gained popularity as they represent an easy-to-use and low-cost solution that is well suited to at-home long-term monitoring of patients with persistent symptoms. Recently, however, the limited generalization capability of existing deep neural network based systems to unseen datasets has been raised as a serious concern, as has their limited interpretability. In this study, we aim to develop an interpretable and generalizable speech-based COVID-19 detection system. First, we propose the use of a 3-dimensional modulation frequency tensor (called the modulation tensorgram representation, MTR) as input to a convolutional recurrent neural network for COVID-19 detection. The MTR representation is known to capture long-term dynamics of speech correlated with articulation and respiration, hence being a potential candidate for characterizing COVID-19 speech. The customized network explores both the spectral and the temporal patterns of the MTR to learn the underlying COVID-19 speech pattern. Next, we design a spectro-temporal saliency masking to aggregate regions of the MTR related to COVID-19, thus helping to further improve the generalizability and interpretability of the model. Experiments are conducted on three public datasets, and the results show the proposed solution consistently outperforming two benchmark systems in within-, across-, and unseen-dataset tests. The learned salient regions have been shown to correlate with whispered speech and vocal hoarseness, which explains the increased generalizability. Furthermore, our model relies on a small number of parameters, thus offering a promising solution for on-device remote monitoring of COVID-19 infection.
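A simplified modulation-frequency representation, in the spirit of the modulation tensorgram described above, can be obtained by a second spectral analysis along the time axis of the STFT magnitude. The numpy/librosa sketch below is a coarse illustration with assumed parameters, not the exact MTR computation used in the paper.

```python
# Minimal sketch: a simplified modulation spectrum, taking an FFT over time
# for each acoustic-frequency bin of the STFT magnitude. Parameters are assumed.

import numpy as np
import librosa

def modulation_spectrum(wav, sr=16000, n_fft=512, hop=160):
    mag = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop))   # (freq, time)
    mag = mag - mag.mean(axis=1, keepdims=True)                    # remove per-bin DC
    mod = np.abs(np.fft.rfft(mag, axis=1))                         # FFT along time
    return mod                                                     # (acoustic freq, modulation freq)

wav = np.random.randn(3 * 16000).astype(np.float32)
print(modulation_spectrum(wav).shape)
```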
Citations: 0
Effective infant cry signal analysis and reasoning using IARO based leaky Bi-LSTM model
IF 4.3, Tier 3, Computer Science
Computer Speech and Language Pub Date : 2024-01-24 DOI: 10.1016/j.csl.2024.101621
B.M. Mala, Smita Sandeep Darandale
Abstract: Recognizing particular emotions or needs from an infant's cry is a difficult pattern recognition problem, as the cry carries no verbal information. In this article, an automated model is introduced for effective recognition of infant cries. First, infant cry signals are collected from the Baby Chillanto (BC) dataset and the Donate a Cry Corpus (DCC) dataset. The acquired signals are converted into feature vectors using nine techniques, namely zero crossing rate (ZCR), acoustic features, audio features, amplitude, energy, root mean square (RMS), statistical moments, autocorrelation, and Mel-frequency cepstral coefficients (MFCCs). Because the resulting feature vectors are multi-dimensional, a simulated annealing algorithm (SAA) is employed to select informative feature vectors, which are then passed to a leaky bi-directional long short-term memory (Bi-LSTM) model for classifying the types of infant cries. Specifically, in the leaky Bi-LSTM model the conventional activation functions (tanh and sigmoid) are replaced with the leaky rectified linear unit (leaky ReLU), which significantly mitigates the vanishing gradient problem and improves convergence during training, a property that is vital for signal classification tasks. Furthermore, an improved artificial rabbits optimization (IARO) algorithm is proposed to choose optimal hyper-parameters of the leaky Bi-LSTM model, reducing the complexity and training time of the classifier. In the IARO algorithm, selective opposition and Lévy flight strategies are integrated with the conventional ARO algorithm to enhance the dynamics and diversity of the population, along with the model's tracking efficiency. The empirical investigation shows that the proposed IARO-based leaky Bi-LSTM model achieves 99.66% and 95.92% classification accuracy on the BC and DCC datasets, respectively, the best results among the compared conventional recognition models.
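Several of the hand-crafted descriptors listed above (ZCR, RMS energy, MFCCs) are available in librosa. The sketch below shows one plausible way to stack them into a summary feature vector, with assumed parameter values rather than the authors' exact front-end.

```python
# Minimal sketch (assumed parameters): extract a few of the listed descriptors
# with librosa and concatenate them into one summary vector per cry signal.

import numpy as np
import librosa

def cry_features(wav: np.ndarray, sr: int = 16000) -> np.ndarray:
    zcr = librosa.feature.zero_crossing_rate(wav).mean()
    rms = librosa.feature.rms(y=wav).mean()
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13).mean(axis=1)
    return np.concatenate([[zcr, rms], mfcc])      # 15-dimensional summary vector

wav = np.random.randn(2 * 16000).astype(np.float32)
print(cry_features(wav).shape)                      # (15,)
```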
Citations: 0
Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances
IF 4.3, Tier 3, Computer Science
Computer Speech and Language Pub Date : 2024-01-18 DOI: 10.1016/j.csl.2024.101619
Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi
Abstract: Conventional automatic speaker verification systems can usually be decomposed into a front-end model, such as a time delay neural network (TDNN), for extracting speaker embeddings, and a back-end model, such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural-network-based neural PLDA (NPLDA), for similarity scoring. However, the sequential optimization of the front-end and back-end models may lead to a local minimum, which theoretically prevents the whole system from achieving the best optimization. Although some methods have been proposed for jointly optimizing the two models, such as the generalized end-to-end (GE2E) model and the NPLDA E2E model, most of them have not fully investigated how to model the intra-relationship between multiple enrollment utterances. In this paper, we propose a new E2E joint method for speaker verification, especially designed for the practical scenario of multiple enrollment utterances. To leverage the intra-relationship among multiple enrollment utterances, our model is equipped with frame-level and utterance-level attention mechanisms. Additionally, focal loss is utilized to balance the importance of positive and negative samples within a mini-batch and to focus on the difficult samples during training. We also utilize several data augmentation techniques, including conventional noise augmentation using the MUSAN and RIRs datasets and a unique speaker-embedding-level mixup strategy for better optimization.
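Two of the ingredients described above, utterance-level attention over multiple enrollment embeddings and a focal loss over trial scores, are sketched in PyTorch below. The shapes, the cosine scoring, and the loss form are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch (assumed shapes/scoring): attention pooling of several
# enrollment embeddings plus a binary focal loss over verification trials.

import torch
import torch.nn as nn
import torch.nn.functional as F

class UtterancePool(nn.Module):
    def __init__(self, dim=192):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, enroll):                     # enroll: (batch, n_utts, dim)
        w = torch.softmax(self.score(enroll), dim=1)
        return (w * enroll).sum(dim=1)             # (batch, dim)

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    p = torch.sigmoid(logits)
    pt = torch.where(targets == 1, p, 1 - p)
    at = torch.where(targets == 1, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    return (-at * (1 - pt) ** gamma * torch.log(pt.clamp_min(1e-8))).mean()

pool = UtterancePool()
enroll = torch.randn(8, 3, 192)                    # 3 enrollment utterances per trial
test = torch.randn(8, 192)
logits = F.cosine_similarity(pool(enroll), test) * 10.0   # scaled similarity score
print(focal_loss(logits, torch.randint(0, 2, (8,)).float()))
```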
Citations: 0
Scale-aware dual-branch complex convolutional recurrent network for monaural speech enhancement
IF 4.3, Tier 3, Computer Science
Computer Speech and Language Pub Date : 2024-01-13 DOI: 10.1016/j.csl.2024.101618
Yihao Li, Meng Sun, Xiongwei Zhang, Hugo Van hamme
Abstract: A key step in single-channel speech enhancement is the orthogonal separation of speech and noise. In this paper, a dual-branch complex convolutional recurrent network (DBCCRN) is proposed to separate the complex spectrograms of speech and noise simultaneously. To model both local and global information, we incorporate conformer modules into our network. The orthogonality of the outputs of the two branches can be improved by optimizing Signal-to-Noise Ratio (SNR) related losses. However, we found that models trained with two existing versions of SI-SNR yield enhanced speech at a very different scale from that of its clean counterpart, and the SNR loss leads to a shrunken amplitude of the enhanced speech as well. A solution to this problem is simply to normalize the output, but this only works for off-line processing, not for streaming; when streaming speech enhancement is required, the scale error degrades speech quality. From an analytical inspection of the weaknesses of models trained with SNR and SI-SNR losses, a new loss function called scale-aware SNR (SA-SNR) is proposed to cope with the scale variations of the enhanced speech. SA-SNR improves over SI-SNR by introducing an extra regularization term that encourages the model to produce signals of a similar scale to the input, which has little influence on the perceptual quality of the enhanced speech. In addition, the commonly used evaluation recipe for speech enhancement may not be sufficient to comprehensively reflect the performance of speech enhancement methods using SI-SNR losses, where amplitude variations of the input speech should be carefully considered; a new evaluation recipe called ScaleError is therefore introduced. Experiments show that our proposed method improves over the existing baselines on the evaluation sets of the Voice Bank corpus, DEMAND, and the Interspeech 2020 Deep Noise Suppression Challenge, obtaining higher scores for PESQ, STOI, SSNR, CSIG, CBAK and COVL.
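The sketch below shows SI-SNR in its usual form together with an illustrative scale penalty. The exact form of the SA-SNR regularizer is an assumption here, not the paper's formula; it is included only to show how a scale term removes the amplitude ambiguity that SI-SNR alone leaves open.

```python
# Minimal sketch: standard SI-SNR plus an assumed scale-regularization term
# that penalizes estimates whose norm deviates from the reference norm.

import torch

def si_snr(est, ref, eps=1e-8):
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10((proj.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))

def scale_aware_loss(est, ref, lam=0.1):
    # Assumed regularizer: relative deviation of the estimate's scale from the reference's.
    scale_penalty = (est.norm(dim=-1) / (ref.norm(dim=-1) + 1e-8) - 1.0).abs()
    return (-si_snr(est, ref) + lam * scale_penalty).mean()

est = torch.randn(4, 16000)
ref = torch.randn(4, 16000)
print(scale_aware_loss(est, ref))
```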
Citations: 0