Computer Speech and Language: Latest Articles

Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances
IF 4.3 | CAS Zone 3 | Computer Science
Computer Speech and Language Pub Date: 2024-01-18 DOI: 10.1016/j.csl.2024.101619
Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi
{"title":"Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances","authors":"Chang Zeng ,&nbsp;Xiaoxiao Miao ,&nbsp;Xin Wang ,&nbsp;Erica Cooper ,&nbsp;Junichi Yamagishi","doi":"10.1016/j.csl.2024.101619","DOIUrl":"10.1016/j.csl.2024.101619","url":null,"abstract":"<div><p>Conventional automatic speaker verification systems can usually be decomposed into a front-end model such as time delay neural network (TDNN) for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural network-based neural PLDA (NPLDA) for similarity scoring. However, the sequential optimization of the front-end and back-end models may lead to a local minimum, which theoretically prevents the whole system from achieving the best optimization. Although some methods have been proposed for jointly optimizing the two models, such as the generalized end-to-end (GE2E) model and NPLDA E2E model, most of these methods have not fully investigated how to model the intra-relationship between multiple enrollment utterances. In this paper, we propose a new E2E joint method for speaker verification especially designed for the practical scenario of multiple enrollment utterances. To leverage the intra-relationship among multiple enrollment utterances, our model comes equipped with frame-level and utterance-level attention mechanisms. Additionally, focal loss is utilized to balance the importance of positive and negative samples within a mini-batch and focus on the difficult samples during the training process. We also utilize several data augmentation techniques, including conventional noise augmentation using MUSAN and RIRs datasets and a unique speaker embedding-level mixup strategy for better optimization.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000020/pdfft?md5=ef4d8f62c6e421e3a3accd1ee4ea9a64&pid=1-s2.0-S0885230824000020-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139507969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
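The focal loss named in the abstract is the standard Lin et al. formulation: easy trials are down-weighted so training concentrates on hard positive and negative samples. Below is a minimal PyTorch sketch for binary verification scores; the alpha and gamma values are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy verification trials so
    training focuses on hard positive/negative pairs."""
    # Per-trial binary cross-entropy, kept unreduced
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Example: scores for 4 verification trials (target = 1 means same speaker)
scores = torch.tensor([2.3, -1.1, 0.2, -3.0])
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = binary_focal_loss(scores, labels)
```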
Scale-aware dual-branch complex convolutional recurrent network for monaural speech enhancement
IF 4.3 | CAS Zone 3 | Computer Science
Computer Speech and Language Pub Date: 2024-01-13 DOI: 10.1016/j.csl.2024.101618
Yihao Li, Meng Sun, Xiongwei Zhang, Hugo Van hamme
{"title":"Scale-aware dual-branch complex convolutional recurrent network for monaural speech enhancement","authors":"Yihao Li ,&nbsp;Meng Sun ,&nbsp;Xiongwei Zhang ,&nbsp;Hugo Van hamme","doi":"10.1016/j.csl.2024.101618","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101618","url":null,"abstract":"<div><p><span><span><span>A key step to single channel speech enhancement is the orthogonal separation of speech and noise. In this paper, a dual branch complex convolutional recurrent network<span> (DBCCRN) is proposed to separate the complex spectrograms of speech and noises simultaneously. To model both local and global information, we incorporate </span></span>conformer<span><span> modules into our network. The orthogonality of the outputs of the two branches can be improved by optimizing the Signal-to-Noise Ratio (SNR) related losses. However, we found the models trained by two existing versions of SI-SNRs will yield enhanced speech at a very different scale from that of its clean counterpart. SNR loss will lead to a shrink amplitude of enhanced speech as well. A solution to this problem is to simply normalize the output, but it only works for off-line processing, not for the streaming one. When streaming speech enhancement is required, the error scale will lead to the degradation of speech quality. From an analytical inspection of the weakness of the models trained by SNR and SI-SNR losses, a new loss function called scale-aware SNR (SA-SNR) is proposed to cope with the scale variations of the enhanced speech. SA-SNR improves over SI-SNR by introducing an extra </span>regularization term that encourages the model to produce signals of similar scale as the input, which has little influence on the </span></span>perceptual quality of the enhanced speech. In addition, the commonly used evaluation recipe for speech enhancement may not be sufficient to comprehensively reflect the performance of the speech enhancement methods using SI-SNR losses, where amplitude variations of input speech should be carefully considered. A new evaluation recipe called </span><em>ScaleError</em> is introduced. Experiments show that our proposed method improves over the existing baselines on the evaluation sets of the <em>voice bank corpus, DEMAND</em> and <span><em>the Interspeech 2020 Deep </em><em>Noise Suppression</em><em> Challenge</em></span>, by obtaining higher scores for PESQ, STOI, SSNR, CSIG, CBAK and COVL.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139487878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
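SA-SNR adds a scale regularizer to SI-SNR; the exact form of that term is not given in this listing, so the penalty below (absolute log energy ratio between output and input) and the weight lam are assumptions. A minimal PyTorch sketch:

```python
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR: projects the estimate onto the reference,
    so any rescaling of `est` leaves the value unchanged."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))

def sa_snr_loss(est, ref, lam=0.1):
    """Hypothetical SA-SNR: SI-SNR plus a penalty on the energy ratio
    between estimate and input, pushing the output toward the input scale."""
    scale_penalty = (est.pow(2).sum(-1) / ref.pow(2).sum(-1)).log().abs()
    return -si_snr(est, ref).mean() + lam * scale_penalty.mean()

ref = torch.randn(4, 16000)                       # batch of clean references
est = 0.3 * ref + 0.05 * torch.randn_like(ref)    # rescaled "enhanced" output
loss = sa_snr_loss(est, ref)                      # penalized for the 0.3x scale
```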
A tag-based methodology for the detection of user repair strategies in task-oriented conversational agents
IF 4.3 | CAS Zone 3 | Computer Science
Computer Speech and Language Pub Date: 2024-01-08 DOI: 10.1016/j.csl.2023.101603
Francesca Alloatti, Francesca Grasso, Roger Ferrod, Giovanni Siragusa, Luigi Di Caro, Federica Cena
{"title":"A tag-based methodology for the detection of user repair strategies in task-oriented conversational agents","authors":"Francesca Alloatti ,&nbsp;Francesca Grasso ,&nbsp;Roger Ferrod ,&nbsp;Giovanni Siragusa ,&nbsp;Luigi Di Caro ,&nbsp;Federica Cena","doi":"10.1016/j.csl.2023.101603","DOIUrl":"https://doi.org/10.1016/j.csl.2023.101603","url":null,"abstract":"<div><p><span><span>Mutual comprehension is a crucial component that makes a conversation succeed. While it can be easily reached through the cooperation of the parties in human–human dialogues, such cooperation is often lacking in human–computer interaction due to technical problems, leading to broken conversations. Our goal is to work towards an effective detection of breakdowns in a conversation between humans and Conversational Agents (CA), as well as the different repair strategies users adopt when such communication problems occur. In this work, we propose a novel tag system designed to map and classify users’ repair attempts while interacting with a CA. We subsequently present a set of </span>Machine Learning models</span><span><sup>1</sup></span> trained to automatize the detection of such repair strategies. The tags are employed in a manual annotation exercise, performed on a publicly available dataset <span><sup>2</sup></span><span> of text-based task-oriented conversations. The batch of annotated data was then used to train the neural network-based classifiers. The analysis of the annotations provides interesting insights about users’ behaviour when dealing with breakdowns in a task-oriented dialogue system<span>. The encouraging results obtained from neural models confirm the possibility of automatically recognizing occurrences of misunderstanding between users and CAs on the fly.</span></span></p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139406502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
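The setup described here, mapping a user's turn to a repair-strategy tag, is an ordinary supervised text-classification problem. The sketch below uses a TF-IDF baseline rather than the paper's neural models, and the tag names and toy turns are invented for illustration; the paper defines its own tag inventory.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical tags and toy annotated user turns (not the paper's data)
train_turns = [
    "I said TWO tickets",        # user repeats with emphasis
    "okay forget it",            # user abandons the task
    "I mean the later train",    # user rephrases the request
    "book two tickets",          # plain restatement
]
train_tags = ["repeat", "give_up", "rephrase", "repeat"]

# Bag-of-ngrams features feeding a linear classifier over repair tags
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_turns, train_tags)
print(clf.predict(["no no, THREE tickets"]))
```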
TTK: A toolkit for Tunisian linguistic analysis
IF 4.3 | CAS Zone 3 | Computer Science
Computer Speech and Language Pub Date: 2024-01-03 DOI: 10.1016/j.csl.2023.101617
Asma Mekki, Inès Zribi, Mariem Ellouze, Lamia Hadrich Belguith
{"title":"TTK: A toolkit for Tunisian linguistic analysis","authors":"Asma Mekki,&nbsp;Inès Zribi,&nbsp;Mariem Ellouze,&nbsp;Lamia Hadrich Belguith","doi":"10.1016/j.csl.2023.101617","DOIUrl":"10.1016/j.csl.2023.101617","url":null,"abstract":"<div><p><span><span>Over the last two decades, many efforts have been made to provide resources to support the Arabic Natural Language Processing (NLP). Some of these resources target specific NLP tasks such as word tokenization, </span>parsing, or </span>sentiment analysis<span><span>, while others attempt to tackle numerous tasks at once. In this paper, we present ¡¡TTK¿¿, a toolkit for Tunisian linguistic analysis. It consists of a collection of linguistic analysis tools for orthographic normalization, sentence boundaries detection, word tokenization, morphological analysis, parsing and </span>named entity recognition. This paper focuses on the design and implementation of TTK tools.</span></p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
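TTK is described as a chain of analysis stages. The sketch below shows one plausible way such a pipeline composes, with each stage enriching a shared analysis object; every function here is a hypothetical stand-in, since TTK's real API is not shown in this listing.

```python
from dataclasses import dataclass, field

@dataclass
class Analysis:
    text: str
    sentences: list = field(default_factory=list)
    tokens: list = field(default_factory=list)

def normalize(a: Analysis) -> Analysis:
    a.text = a.text.strip().lower()      # stand-in for orthographic rules
    return a

def split_sentences(a: Analysis) -> Analysis:
    a.sentences = [s.strip() for s in a.text.split(".") if s.strip()]
    return a

def tokenize(a: Analysis) -> Analysis:
    a.tokens = [t for s in a.sentences for t in s.split()]
    return a

PIPELINE = [normalize, split_sentences, tokenize]

def analyze(text: str) -> Analysis:
    a = Analysis(text)
    for stage in PIPELINE:               # each stage enriches the analysis
        a = stage(a)
    return a

print(analyze("First sentence. Second one.").tokens)
```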
Enhanced local knowledge with proximity values and syntax-clusters for aspect-level sentiment analysis
IF 4.3 | CAS Zone 3 | Computer Science
Computer Speech and Language Pub Date: 2023-12-28 DOI: 10.1016/j.csl.2023.101616
Pengfei Chen, Biqing Zeng, Yuwu Lu, Yun Xue, Fei Fan, Mayi Xu, Lingcong Feng
{"title":"Enhanced local knowledge with proximity values and syntax-clusters for aspect-level sentiment analysis","authors":"Pengfei Chen ,&nbsp;Biqing Zeng ,&nbsp;Yuwu Lu ,&nbsp;Yun Xue ,&nbsp;Fei Fan ,&nbsp;Mayi Xu ,&nbsp;Lingcong Feng","doi":"10.1016/j.csl.2023.101616","DOIUrl":"10.1016/j.csl.2023.101616","url":null,"abstract":"<div><p>Aspect-level sentiment analysis (ALSA) aims to extract the polarity of different aspect terms in a sentence. Previous works leveraging traditional dependency syntax parsing<span> trees (DSPT) to encode contextual syntactic<span> information had obtained state-of-the-art results. However, these works may not be able to learn fine-grained syntactic knowledge efficiently, which makes them difficult to take advantage of local context. Furthermore, these works failed to exploit the dependency relation from DSPT sufficiently. To solve these problems, we propose a novel method to enhance local knowledge by using extensions of Local Context Network based on Proximity Values (LCPV) and Syntax-clusters Attention (SCA), named LCSA. LCPV first gets the induced trees from pre-trained models and generates the syntactic proximity values between context word and aspect to adaptively determine the extent of local context. Our improved SCA further extracts fine-grained knowledge, which not only focuses on the essential clusters for the target aspect term but also guides the model to learn essential words inside each cluster in DSPT. Extensive experimental results on multiple benchmark datasets demonstrate that LCSA is highly robust and achieves state-of-the-art performance for ALSA.</span></span></p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2023-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139071559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
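LCPV's proximity values weight each context word by how close it sits to the aspect term in the induced dependency tree. A minimal sketch under assumptions: plain BFS hop counts on the tree and a simple 1/(1+d) decay, which is not necessarily the paper's formula.

```python
from collections import deque

def tree_distances(edges, start):
    """BFS hop counts from `start` over undirected dependency edges."""
    adj = {}
    for head, dep in edges:
        adj.setdefault(head, []).append(dep)
        adj.setdefault(dep, []).append(head)
    dist, queue = {start: 0}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Toy parse of "the battery life is great": (head, dependent) token indices
edges = [(2, 0), (2, 1), (4, 2), (4, 3)]
dist = tree_distances(edges, start=2)            # aspect word: "life"
proximity = {tok: 1.0 / (1 + d) for tok, d in dist.items()}
print(proximity)   # nearer words get larger local-context weights
```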
Enhancing accuracy and privacy in speech-based depression detection through speaker disentanglement
IF 4.3 | CAS Zone 3 | Computer Science
Computer Speech and Language Pub Date: 2023-12-26 DOI: 10.1016/j.csl.2023.101605
Vijay Ravi, Jinhan Wang, Jonathan Flint, Abeer Alwan
{"title":"Enhancing accuracy and privacy in speech-based depression detection through speaker disentanglement","authors":"Vijay Ravi ,&nbsp;Jinhan Wang ,&nbsp;Jonathan Flint ,&nbsp;Abeer Alwan","doi":"10.1016/j.csl.2023.101605","DOIUrl":"10.1016/j.csl.2023.101605","url":null,"abstract":"<div><p>Speech signals are valuable biomarkers for assessing an individual’s mental health, including identifying Major Depressive Disorder (MDD) automatically. A frequently used approach in this regard is to employ features related to speaker identity, such as speaker-embeddings. However, over-reliance on speaker identity features in mental health screening systems can compromise patient privacy. Moreover, some aspects of speaker identity may not be relevant for depression detection and could serve as a bias factor that hampers system performance. To overcome these limitations, we propose disentangling speaker-identity information from depression-related information. Specifically, we present four distinct disentanglement methods to achieve this — adversarial speaker identification (SID)-loss maximization (ADV), SID-loss equalization with variance (LEV), SID-loss equalization using Cross-Entropy (LECE) and SID-loss equalization using KL divergence (LEKLD). Our experiments, which incorporated diverse input features and model architectures, have yielded improved F1 scores for MDD detection and voice-privacy attributes, as quantified by Gain in Voice Distinctiveness (<span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>V</mi><mi>D</mi></mrow></msub></math></span>) and De-Identification Scores (DeID). On the DAIC-WOZ dataset (English), LECE using ComparE16 features results in the best F1-Scores of 80% which represents the audio-only SOTA depression detection F1-Score along with a <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>V</mi><mi>D</mi></mrow></msub></math></span> of −1.1 dB and a DeID of 85%. On the EATD dataset (Mandarin), ADV using raw-audio signal achieves an F1-Score of 72.38% surpassing multi-modal SOTA along with a <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>V</mi><mi>D</mi></mrow></msub></math></span> of −0.89 dB dB and a DeID of 51.21%. By reducing the dependence on speaker-identity-related features, our method offers a promising direction for speech-based depression detection that preserves patient privacy.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2023-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230823001249/pdfft?md5=7acff7dbe3c70a9a6ae6cde978bd02e2&pid=1-s2.0-S0885230823001249-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139052205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
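ADV maximizes the speaker-ID loss while the depression loss is minimized. The usual mechanism for this kind of adversarial objective is a gradient reversal layer, sketched below in PyTorch; reading ADV as gradient reversal is our assumption based on the description, not a detail confirmed by the abstract.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Identity on the forward pass, negated gradient on the backward
        # pass: the shared encoder is pushed to *remove* speaker identity.
        return -ctx.lam * grad_out, None

def grl(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Training-step shape (heads and losses omitted):
#   z = encoder(audio)
#   depression_loss = ce(depression_head(z), mdd_labels)   # minimized normally
#   sid_loss = ce(speaker_head(grl(z)), speaker_labels)    # maximized w.r.t. encoder
#   total = depression_loss + sid_loss
```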
SecNLP: An NLP classification model watermarking framework based on multi-task learning
IF 4.3 | CAS Zone 3 | Computer Science
Computer Speech and Language Pub Date: 2023-12-23 DOI: 10.1016/j.csl.2023.101606
Long Dai, Jiarong Mao, Liaoran Xu, Xuefeng Fan, Xiaoyi Zhou
{"title":"SecNLP: An NLP classification model watermarking framework based on multi-task learning","authors":"Long Dai,&nbsp;Jiarong Mao,&nbsp;Liaoran Xu,&nbsp;Xuefeng Fan,&nbsp;Xiaoyi Zhou","doi":"10.1016/j.csl.2023.101606","DOIUrl":"10.1016/j.csl.2023.101606","url":null,"abstract":"<div><p><span>The popularity of ChatGPT demonstrates the immense commercial value of natural language processing (NLP) technology. However, NLP models like ChatGPT are vulnerable to piracy and redistribution, which can harm the economic interests of model owners. Existing NLP model </span>watermarking schemes<span> struggle to balance robustness and covertness. Typically, robust watermarks require embedding more information, which compromises their covertness; conversely, covert watermarks are challenging to embed more information, which affects their robustness. This paper is proposed to use multi-task learning (MTL) to address the conflict between robustness and covertness. Specifically, a covert trigger set is established to implement remote verification of the watermark model, and a covert auxiliary network is designed to enhance the watermark model’s robustness. The proposed watermarking framework is evaluated on two benchmark datasets and three mainstream NLP models. Compared with existing schemes, the framework not only has excellent covertness and robustness but also has a lower false positive rate and can effectively resist fraudulent ownership claims by adversaries.</span></p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2023-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139030523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
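Remote verification with a covert trigger set reduces to querying the suspect model on secret inputs and testing whether it reproduces the secret labels far above chance. The sketch below illustrates that protocol; the threshold and the black-box API shape are assumptions, not SecNLP's actual interface.

```python
import random

def verify_watermark(predict, trigger_set, threshold=0.9):
    """`predict` is any black-box classifier (input -> label); ownership is
    claimed if trigger accuracy clears the (assumed) threshold."""
    hits = sum(predict(x) == y for x, y in trigger_set)
    accuracy = hits / len(trigger_set)
    return accuracy >= threshold, accuracy

# Toy demo: a "stolen" model that memorized the trigger labels vs. a
# clean model answering at chance over 3 classes.
trigger_set = [(f"trigger-{i}", i % 3) for i in range(20)]
stolen = dict(trigger_set)
print(verify_watermark(lambda x: stolen[x], trigger_set))          # (True, 1.0)
print(verify_watermark(lambda x: random.randrange(3), trigger_set))  # usually False
```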
Contextual emotion detection using ensemble deep learning
IF 4.3 | CAS Zone 3 | Computer Science
Computer Speech and Language Pub Date: 2023-12-20 DOI: 10.1016/j.csl.2023.101604
Asalah Thiab, Luay Alawneh, Mohammad AL-Smadi
{"title":"Contextual emotion detection using ensemble deep learning","authors":"Asalah Thiab ,&nbsp;Luay Alawneh ,&nbsp;Mohammad AL-Smadi","doi":"10.1016/j.csl.2023.101604","DOIUrl":"10.1016/j.csl.2023.101604","url":null,"abstract":"<div><p><span><span>Emotion detection from online textual information is gaining more attention due to its usefulness in understanding users’ behaviors and their desires. This is driven by the large amounts of texts from different sources such as social media and shopping websites. Recent studies investigated the benefits of deep learning in the detection of emotions from textual conversations. In this paper, we study the performance of several deep learning and transformer-based models in the classification of emotions in English conversations. Further, we apply </span>ensemble learning using a majority voting technique to improve the overall classification performance. We evaluated our proposed models on the SemEval 2019 Task 3 public dataset that categorizes emotions as </span><em>Happy</em>, <em>Angry</em>, <em>Sad</em>, and <em>Others</em>. The results show that our models can successfully distinguish the three main classes of emotions and separate them from <em>Others</em> in a highly imbalanced dataset. The transformer-based models achieved a micro-averaged F1-score of up to 75.55%, whereas the RNN-based models only reached 67.03%. Further, we show that the ensemble model significantly improves the overall performance and achieves a micro-averaged F1-score of 77.07%.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2023-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139030524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
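The ensemble step is plain majority voting over the member models' per-sample predictions. A minimal sketch, with toy labels standing in for the paper's transformer and RNN outputs:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one label list per model, aligned by sample index."""
    voted = []
    for sample_preds in zip(*predictions):
        label, _ = Counter(sample_preds).most_common(1)[0]
        voted.append(label)
    return voted

model_a = ["happy", "sad", "others", "angry"]
model_b = ["happy", "others", "others", "angry"]
model_c = ["sad", "sad", "others", "others"]
print(majority_vote([model_a, model_b, model_c]))
# -> ['happy', 'sad', 'others', 'angry']
```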
Improving self-supervised learning model for audio spoofing detection with layer-conditioned embedding fusion
IF 4.3 | CAS Zone 3 | Computer Science
Computer Speech and Language Pub Date: 2023-12-18 DOI: 10.1016/j.csl.2023.101599
Souvik Sinha, Spandan Dey, Goutam Saha
{"title":"Improving self-supervised learning model for audio spoofing detection with layer-conditioned embedding fusion","authors":"Souvik Sinha,&nbsp;Spandan Dey,&nbsp;Goutam Saha","doi":"10.1016/j.csl.2023.101599","DOIUrl":"10.1016/j.csl.2023.101599","url":null,"abstract":"<div><p>The application of voice recognition<span><span> systems has increased by a great deal with technology. This has allowed adversaries to falsely claim access to these systems by spoofing the identity of a target speaker. The existing supervised learning (SL)-based countermeasures<span> are yet to provide a complete solution against the newly evolving spoofing attacks. To tackle this problem, we explore self-supervised learning (SSL)-based frameworks. At first, we implement widely used SSL frameworks, where our target is identifying spoofed speech. We report a considerable performance improvement over the SL state-of-the-art baseline as a whole. Then, we perform an attack-wise comparative analysis between SL and SSL frameworks. While the SSL performs better in most cases, there are certain attacks where the SL outperforms it. Hence, we hypothesize that there is scope to jointly utilize information effectively included by both these models for better performance. To do that, we first perform conventional weighted score fusion between the SL and best-performing SSL models, which reduces the </span></span>EER, outperforming both the state-of-the-art SL and best-performing SSL framework. Then, we propose an embedding fusion scheme that minimizes the embedding distribution between the selected SL and SSL representations. To select the appropriate layers, we perform a comprehensive statistical analysis. The proposed fusion scheme outperforms the score fusion method and shows that the SSL performance can be improved by effectively including learned knowledge from the SL framework. The final EER achieved on the ASVspoof 2019 logical access (LA) database is 0.177%, a significant improvement over our baseline. Using the ASVspoof 2021 LA as a blind evaluation dataset, our proposed embedding fusion scheme reduces the EER to 2.666%.</span></p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138746029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
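Weighted score fusion, the first fusion step described, combines the two systems' per-trial scores with a convex weight tuned on development data. A minimal sketch, using a simple threshold-sweep EER as the tuning criterion; this is an approximation, and the paper's exact recipe is not given in this listing.

```python
import numpy as np

def eer(scores, labels):
    """Approximate equal error rate via a threshold sweep
    (labels: 1 = bona fide, 0 = spoof; higher score = more bona fide)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    tgt, non = scores[labels == 1], scores[labels == 0]
    best = 1.0
    for t in np.unique(scores):
        far = np.mean(non >= t)   # spoofed trials accepted
        frr = np.mean(tgt < t)    # bona fide trials rejected
        best = min(best, max(far, frr))
    return best

def fuse(s_sl, s_ssl, w):
    """Convex weighted combination of the two systems' scores."""
    return w * np.asarray(s_sl) + (1 - w) * np.asarray(s_ssl)

def pick_weight(s_sl, s_ssl, labels, grid=np.linspace(0, 1, 21)):
    """Pick the fusion weight with the lowest dev-set EER."""
    return min(grid, key=lambda w: eer(fuse(s_sl, s_ssl, w), labels))
```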
A novel channel estimate for noise robust speech recognition
IF 4.3 | CAS Zone 3 | Computer Science
Computer Speech and Language Pub Date: 2023-12-16 DOI: 10.1016/j.csl.2023.101598
Geoffroy Vanderreydt, Kris Demuynck
{"title":"A novel channel estimate for noise robust speech recognition","authors":"Geoffroy Vanderreydt,&nbsp;Kris Demuynck","doi":"10.1016/j.csl.2023.101598","DOIUrl":"10.1016/j.csl.2023.101598","url":null,"abstract":"<div><p>We propose a novel technique to estimate the channel characteristics for robust speech recognition<span>. The method focuses on reliable time–frequency speech patches which are highly independent of the noise condition. Combined with a root-based approximation<span> of the logarithm in the MFCC computation, this reduces the variance caused by the noise on the spectral features<span>, and therefore also the constrain on the acoustic model in a multi-style training setup. We show that compared to the standard mean normalization, the proposed method estimates the channel equally well under clean conditions and better under noisy conditions. When integrated in the feature extraction pipeline, we show improvements in speech recognition accuracy on noisy speech and a status quo on clean speech. Our experiments reveal that this method helps the most for generative models that need to model the complex noise variability, and less so for discriminative models, which can learn to ignore noise instead of accurately modeling it. Our approach outperforms the state of the art on the noisy Aurora4 task.</span></span></span></p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138745942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
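A root-based approximation of the logarithm replaces log compression of the mel filterbank energies with a power (root) nonlinearity, which stays bounded near zero energy and so limits the noise-induced variance that log compression amplifies. The exponent below is an illustrative choice, not necessarily the paper's value.

```python
import numpy as np

def compress(mel_energies, mode="root", r=15):
    """Compress mel filterbank energies before the DCT step of MFCCs."""
    if mode == "log":
        return np.log(np.maximum(mel_energies, 1e-10))
    return np.power(mel_energies, 1.0 / r)   # root approximation of log

e = np.array([1e-6, 1e-3, 1.0, 10.0])
print(compress(e, "log"))    # huge dynamic range, very sensitive near zero
print(compress(e, "root"))   # compressed and bounded near zero
```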