2021 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Spoofprint: A New Paradigm for Spoofing Attacks Detection
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383572
Tianxiang Chen, E. Khoury
Abstract: With the development of voice spoofing techniques, voice spoofing attacks have become one of the main threats to automatic speaker verification (ASV) systems. Traditionally, researchers treat this problem as a binary classification task: a binary classifier is trained using machine learning (including deep learning) algorithms to determine whether a given audio clip is bonafide or spoofed. This approach is effective at detecting attacks generated by known voice spoofing techniques. In practical scenarios, however, new types of spoofing technologies emerge rapidly. It is impossible to include all of them in the training dataset, so it is desirable that the detection system generalize to unseen spoofing techniques. In this paper, we propose a new paradigm for spoofing attack detection called Spoofprint. Instead of using a binary classifier to detect spoofed audio, Spoofprint follows a paradigm similar to ASV systems, with an enrollment phase and a verification phase. We evaluate performance on the original and noisy versions of the ASVspoof 2019 logical access (LA) dataset. The results show that the proposed Spoofprint paradigm is effective at detecting unknown types of attacks and is often superior to the latest state-of-the-art.
Citations: 2
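As a rough illustration of the enrollment/verification idea (not the authors' actual model), the sketch below averages a speaker's bonafide embeddings into a voiceprint at enrollment and accepts a test clip only if its cosine similarity clears a threshold; the embedding dimension, extractor, and threshold are all assumed placeholders.

```python
import numpy as np

def enroll(embeddings):
    """Average a speaker's bonafide utterance embeddings into one voiceprint."""
    centroid = embeddings.mean(axis=0)
    return centroid / np.linalg.norm(centroid)

def verify(voiceprint, test_embedding, threshold=0.7):
    """Accept the test clip as bonafide if it lies close to the voiceprint."""
    test = test_embedding / np.linalg.norm(test_embedding)
    score = float(voiceprint @ test)     # cosine similarity
    return score >= threshold

rng = np.random.default_rng(0)           # random stand-ins for real embeddings
voiceprint = enroll(rng.normal(size=(3, 192)))   # three enrollment clips
print(verify(voiceprint, rng.normal(size=192)))
```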
Automated Scoring of Spontaneous Speech from Young Learners of English Using Transformers
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383553
Xinhao Wang, Keelan Evanini, Yao Qian, Matthew David Mulholland
Abstract: This study explores the use of Transformer-based models for the automated assessment of children's non-native spontaneous speech. Traditional approaches to this task have relied heavily on delivery features (e.g., fluency), whereas the goal of the current study is to build automated scoring models based solely on transcriptions, in order to see how well they capture additional aspects of speaking proficiency (e.g., content appropriateness, vocabulary, and grammar) despite the high word error rate (WER) of automatic speech recognition (ASR) on children's non-native spontaneous speech. Transformer-based models are built using both manual transcriptions and ASR hypotheses, and versions of the models that incorporate the prompt text are investigated in order to measure content appropriateness more directly. Two baseline systems are used for comparison: an attention-based Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) and a Support Vector Regressor (SVR) with manually engineered content-related features. Experimental results demonstrate the effectiveness of the Transformer-based models: the automated prompt-aware model using ASR hypotheses achieves a Pearson correlation coefficient (r) of 0.835 with holistic proficiency scores provided by human experts, outperforming both the attention-based RNN-LSTM baseline (r = 0.791) and the SVR baseline (r = 0.767).
Citations: 10
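A minimal sketch of what a prompt-aware Transformer regressor could look like, assuming a BERT encoder from the Hugging Face transformers library with a linear regression head; the paper only says "Transformer-based", so the checkpoint, head, and example texts are illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

class PromptAwareScorer(torch.nn.Module):
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, **enc):
        pooled = self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] vector
        return self.head(pooled).squeeze(-1)                  # scalar score

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = PromptAwareScorer()
# Prompt and response enter as a sentence pair so the encoder can relate them.
enc = tok("Describe the picture of a school playground.",  # prompt (assumed)
          "there is childrens play in the ground",          # ASR hypothesis
          return_tensors="pt", truncation=True)
score = model(**enc)       # would be trained with MSE against human scores
```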
A Conditional Cycle Emotion Gan for Cross Corpus Speech Emotion Recognition
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383512
Bo-Hao Su, Chi-Chun Lee
Abstract: Speech emotion recognition (SER) is important for enabling personalized services and multimedia applications, and it has become a prevalent research topic given its potential to create a better user experience across many modern technologies. However, the highly contextualized scenarios and expensive emotion labeling required cause a severe mismatch between already limited-in-scale speech emotion corpora, which hinders the wide adoption of SER. In this work, instead of conventionally learning a common feature space between corpora, we take a novel approach: we enhance the variability of the source (labeled) corpus in a target (unlabeled) data-aware way by generating synthetic source-domain data with a conditional cycle emotion generative adversarial network (CCEmoGAN). Note that no labeled target samples are used during the whole training process. We evaluate our framework on cross-corpus emotion recognition tasks and obtain three-class valence recognition accuracies of 47.56% and 50.11%, and activation accuracies of 51.13% and 65.7%, when transferring from IEMOCAP to the CIT dataset and from IEMOCAP to the MSP-IMPROV dataset, respectively. The benefit of increasing target-domain-aware variability in the source domain for improving emotion discriminability in cross-corpus emotion recognition is further visualized in our augmented data space.
Citations: 10
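A hedged sketch of the generator-side objective of a conditional cycle GAN in the standard CycleGAN form: an adversarial term pushes generated features toward the target domain while a cycle-consistency term preserves the source content, both conditioned on the emotion label. The tiny CondNet modules and loss weight are placeholders, not CCEmoGAN's architecture.

```python
import torch
import torch.nn.functional as F

class CondNet(torch.nn.Module):
    """Tiny placeholder for a conditional generator/discriminator."""
    def __init__(self, feat_dim=64, n_emotions=3, out_dim=64):
        super().__init__()
        self.net = torch.nn.Linear(feat_dim + n_emotions, out_dim)
    def forward(self, x, emo_onehot):
        return self.net(torch.cat([x, emo_onehot], dim=-1))

def ccemogan_loss(x_src, emo, G_st, G_ts, D_t, lam_cyc=10.0):
    fake_tgt = G_st(x_src, emo)        # source -> target-style features
    rec_src = G_ts(fake_tgt, emo)      # map back to the source domain
    logits = D_t(fake_tgt, emo)        # conditional discriminator on target
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    cyc = F.l1_loss(rec_src, x_src)    # cycle consistency preserves content
    return adv + lam_cyc * cyc         # generator objective (one direction)

G_st, G_ts, D_t = CondNet(), CondNet(), CondNet(out_dim=1)
x = torch.randn(8, 64)                 # a batch of source-domain features
emo = F.one_hot(torch.randint(0, 3, (8,)), 3).float()
print(ccemogan_loss(x, emo, G_st, G_ts, D_t).item())
```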
RNN Based Incremental Online Spoken Language Understanding
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383614
P. G. Shivakumar, Naveen Kumar, P. Georgiou, Shrikanth S. Narayanan
Abstract: Spoken language understanding (SLU) typically comprises an automatic speech recognition (ASR) module followed by a natural language understanding (NLU) module. The two modules process signals in a blocking, sequential fashion: the NLU often has to wait for the ASR to finish processing an utterance, potentially leading to high latencies that make the spoken interaction less natural. In this paper, we propose recurrent neural network (RNN) based incremental processing for the SLU task of intent detection. The proposed methodology offers lower latencies than a typical SLU system without any significant reduction in accuracy. We introduce and analyze different recurrent neural network architectures for incremental and online processing of ASR transcripts and compare them to existing offline systems. A lexical end-of-sentence (EOS) detector is proposed to segment the stream of transcripts into sentences for intent classification. Intent detection experiments are conducted on the benchmark ATIS, Snips, and Facebook multilingual task-oriented dialog datasets, modified to emulate a continuous incremental stream of words with no utterance demarcation. We also analyze the prospects of early intent detection, before EOS, with our proposed system.
Citations: 3
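A toy version of the incremental setup, assuming a GRU that consumes one ASR word at a time, an intent head that can be read out at any step, and a lexical EOS head that commits the prediction and resets the state; dimensions and heads are illustrative, not the paper's configuration.

```python
import torch

class IncrementalSLU(torch.nn.Module):
    def __init__(self, vocab=10000, n_intents=7, dim=128):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.rnn = torch.nn.GRUCell(dim, dim)
        self.intent = torch.nn.Linear(dim, n_intents)
        self.eos = torch.nn.Linear(dim, 2)      # lexical end-of-sentence head

    def step(self, word_id, h):
        h = self.rnn(self.emb(word_id), h)
        return self.intent(h), self.eos(h), h

model = IncrementalSLU()
h = torch.zeros(1, 128)
for w in torch.tensor([[42], [17], [5]]):       # stand-in word ids from ASR
    intent_logits, eos_logits, h = model.step(w, h)
    if eos_logits.argmax(-1).item() == 1:       # sentence boundary detected
        print("intent:", intent_logits.argmax(-1).item())
        h = torch.zeros(1, 128)                 # reset for the next sentence
```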
Improving Convolutional Recurrent Neural Networks for Speech Emotion Recognition
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383513
Patrick Meyer, Ziyi Xu, T. Fingscheidt
Abstract: Deep learning has increased interest in speech emotion recognition (SER) and has put forth diverse structures and methods to improve performance. In recent years it has turned out that applying SER to a (log-mel) spectrogram, thereby treating SER as an image recognition task, is a promising approach. Following the trend of using a convolutional neural network (CNN) in combination with a bidirectional long short-term memory (BLSTM) layer and subsequent fully connected layers, in this work we advance the performance of this topology through several contributions: we integrate a multi-kernel-width CNN, propose a BLSTM output summarization function, apply an enhanced feature representation, and introduce an effective training method. To provide insight into the proposed methods, we evaluate the impact of each modification separately in an ablation study. Based on our modifications, we obtain top results for this type of topology on IEMOCAP, with an unweighted average recall of 64.5% on average.
Citations: 8
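A sketch of the topology family being improved, under assumed hyperparameters: parallel convolutions with different kernel widths over a log-mel spectrogram, a BLSTM, and a simple mean-plus-max pooling over time standing in for the paper's proposed summarization function.

```python
import torch

class MultiKernelCRNN(torch.nn.Module):
    def __init__(self, n_mels=40, n_classes=4, ch=16, widths=(3, 5, 9)):
        super().__init__()
        # Parallel convolutions, one per kernel width along the time axis.
        self.convs = torch.nn.ModuleList(
            torch.nn.Conv2d(1, ch, kernel_size=(3, w), padding=(1, w // 2))
            for w in widths)
        self.blstm = torch.nn.LSTM(len(widths) * ch * n_mels, 64,
                                   batch_first=True, bidirectional=True)
        self.out = torch.nn.Linear(2 * 64 * 2, n_classes)

    def forward(self, spec):                    # spec: (B, 1, n_mels, T)
        feats = torch.cat([c(spec) for c in self.convs], dim=1)
        B, C, F_, T = feats.shape
        seq = feats.permute(0, 3, 1, 2).reshape(B, T, C * F_)
        h, _ = self.blstm(seq)                  # (B, T, 2*64)
        # Summarize the BLSTM outputs with mean and max pooling over time.
        summary = torch.cat([h.mean(dim=1), h.max(dim=1).values], dim=1)
        return self.out(summary)

logits = MultiKernelCRNN()(torch.randn(2, 1, 40, 120))  # two utterances
print(logits.shape)                                     # (2, 4)
```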
Real-Time Independent Vector Analysis with a Deep-Learning-Based Source Model
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383599
Fang Kang, Feiran Yang, Jun Yang
Abstract: In this paper, we present a real-time blind source separation (BSS) algorithm that unifies independent vector analysis (IVA) as a spatial model with a deep neural network (DNN) as a source model. Auxiliary-function-based IVA (Aux-IVA) is used to update the demixing matrix, and the required time-varying variance of the speech source is estimated by a DNN. The DNN provides a more accurate source model, which in turn helps optimize the spatial model. In addition, because the DNN estimates the source variance instead of the source power spectrogram, its size can be reduced significantly. Experimental results show that the joint use of the model-based and data-driven approaches provides a more efficient solution than either approach alone in terms of convergence rate and source separation performance.
Citations: 6
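For orientation, here is one standard Aux-IVA demixing update per frequency bin, with the per-source, per-frame variance r supplied externally (a DNN in the paper; random values in this sketch). Shapes and names are illustrative assumptions.

```python
import numpy as np

def auxiva_update(X, W, r):
    """One demixing update. X: (F, T, M) mixture STFT, W: (F, M, M)
    demixing matrices, r: (M, T) time-varying variance per source."""
    F_, T, M = X.shape
    for k in range(M):
        phi = 1.0 / np.maximum(r[k], 1e-8)        # per-frame contrast weight
        for f in range(F_):
            Xf = X[f]                              # (T, M)
            # Weighted spatial covariance of the mixture for source k.
            V = (phi[:, None] * Xf).T @ Xf.conj() / T
            w = np.linalg.solve(W[f] @ V, np.eye(M)[:, k])
            w = w / np.sqrt((w.conj() @ V @ w).real)   # scale normalization
            W[f, k] = w.conj()                     # row k holds w_k^H
    return W

# Driver with random data standing in for real multichannel STFT frames.
rng = np.random.default_rng(0)
F_, T, M = 5, 100, 2
X = rng.normal(size=(F_, T, M)) + 1j * rng.normal(size=(F_, T, M))
W = np.tile(np.eye(M, dtype=complex), (F_, 1, 1))
W = auxiva_update(X, W, rng.uniform(0.1, 1.0, size=(M, T)))
print(W[0])
```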
Developing Neural Representations for Robust Child-Adult Diarization
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383488
Suchitra Krishnamachari, Manoj Kumar, So Hyun Kim, C. Lord, Shrikanth S. Narayanan
Abstract: Automated processing and analysis of child speech has long been acknowledged as a harder problem than understanding speech by adults. Specifically, conversations between a child and an adult involve spontaneous speech, which often compounds the idiosyncrasies associated with child speech. In this work, we improve the task of speaker diarization (determining who spoke when) from audio of child-adult conversations in naturalistic settings. We select conversations from the autism diagnosis and intervention domains, where speaker diarization is an important step towards computational behavioral analysis in support of clinical research and decision making. We train deep speaker embeddings using publicly available child speech and adult speech corpora, unlike predominant state-of-the-art models, which typically use only adult speech for speaker embedding training. We demonstrate significant relative reductions in diarization error rate (DER) on DIHARD II (dev) sessions containing child speech (22.88%) and on two internal corpora representing interactions involving children with autism: excerpts from ADOS Mod3 sessions (33.7%) and a combination of full-length ADOS and BOSCC sessions (44.99%). Further, we validate our improvements in identifying the child speaker (who typically has short speaking time) using the recall measure. Finally, we analyze the effect of fundamental frequency augmentation and the effects of child age and gender on speaker diarization performance.
Citations: 4
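An illustrative two-speaker diarization back-end, assuming speaker embeddings have already been extracted for fixed sliding windows: agglomerative clustering with cosine distance assigns each window to child or adult, and consecutive windows with the same label are merged into segments. This is a generic stand-in, not the paper's pipeline.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(embeddings, hop_s=0.75):
    """embeddings: (n_windows, dim), one speaker embedding per window."""
    labels = AgglomerativeClustering(
        n_clusters=2, metric="cosine", linkage="average").fit_predict(embeddings)
    segments, start = [], 0.0          # merge equal-label runs into segments
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[i - 1]:
            segments.append((round(start, 2), round(i * hop_s, 2),
                             int(labels[i - 1])))
            start = i * hop_s
    return segments                    # [(start_s, end_s, speaker_id), ...]

print(diarize(np.random.default_rng(1).normal(size=(40, 192))))
```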
Optimized Prediction of Fluency of L2 English Based on Interpretable Network Using Quantity of Phonation and Quality of Pronunciation
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383458
Yan Shen, A. Yasukagawa, D. Saito, N. Minematsu, Kazuya Saito
Abstract: This paper presents results of a joint project between an engineering team at one university and an educational team at another to develop an online fluency assessment system for Japanese learners of English. A picture description corpus of English spoken by 90 learners and 10 native speakers was used, in which each speaker's fluency was rated manually by 10 other native raters. The assessment system was built to predict the averaged manual scores. In developing the system, a special focus was placed on two separate aims: training the system analytically, so that teachers can know and discuss which speech features contribute more to fluency prediction, and training it technically, so that teachers' knowledge can be incorporated and the system can be further optimized using an interpretable network. Experiments showed that quality-of-pronunciation features are much more helpful than quantity-of-phonation features, and the optimized system reached an extremely high correlation of 0.956 with the averaged manual scores, higher than the maximum inter-rater correlation (0.910).
Citations: 5
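A minimal interpretable baseline in the spirit of this analysis, on made-up placeholder features and synthetic scores: a linear model fit on quantity-of-phonation and quality-of-pronunciation features whose weights can be read directly as contributions, evaluated with the Pearson correlation used in the paper. The paper's interpretable network is more elaborate than this.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
names = ["speech_rate", "pause_ratio",      # quantity-of-phonation features
         "gop_mean", "vowel_accuracy"]      # quality-of-pronunciation features
X = rng.normal(size=(100, 4))               # one feature row per speaker
y = X @ np.array([0.2, -0.3, 0.9, 0.7]) + rng.normal(scale=0.3, size=100)

model = Ridge(alpha=1.0).fit(X, y)          # linear, hence directly readable
for name, w in zip(names, model.coef_):
    print(f"{name:16s} weight = {w:+.2f}")  # which features matter most
r, _ = pearsonr(model.predict(X), y)        # agreement with manual scores
print(f"Pearson r = {r:.3f}")
```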
Improving Speaker Recognition with Quality Indicators
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383627
H. Rao, Kedar Phatak, E. Khoury
Abstract: Nuisance factors such as short duration, noise, and transmission conditions still pose accuracy challenges to state-of-the-art automatic speaker verification (ASV) systems. To address this problem, we propose a no-reference system that consumes quality indicators encapsulating information about speech duration, acoustic events, and codec artifacts. These quality indicators serve as estimates of how close a given speech utterance is to a high-quality speech segment uttered by the same speaker. When fused with a baseline ASV system, the proposed measures are found to improve speaker recognition performance. An experimental study carried out on a modified version of the NIST SRE 2019 dataset shows a relative decrease of 9.6% in equal error rate (EER) compared to the baseline.
Citations: 0
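A generic sketch of score-level fusion on synthetic data: the baseline ASV score is stacked with assumed quality indicators (duration, noise, codec cues) and a logistic regression produces the fused score, evaluated by EER. The indicator set and data are placeholders, not the paper's features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def eer(labels, scores):
    """Equal error rate: operating point where false accepts == false rejects."""
    fpr, tpr, _ = roc_curve(labels, scores)
    i = np.nanargmin(np.abs(fpr - (1 - tpr)))
    return (fpr[i] + (1 - tpr[i])) / 2

rng = np.random.default_rng(0)
n = 2000
labels = rng.integers(0, 2, n)                       # 1 = target trial
asv = rng.normal(loc=1.5 * labels)                   # baseline trial scores
quality = rng.normal(size=(n, 3)) + 0.3 * labels[:, None]  # synthetic cues
X = np.column_stack([asv, quality])

fused = LogisticRegression().fit(X, labels).decision_function(X)
print(f"EER baseline {eer(labels, asv):.3f} -> fused {eer(labels, fused):.3f}")
```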
Investigation of Node Pruning Criteria for Neural Networks Model Compression with Non-Linear Function and Non-Uniform Network Topology
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383593
K. Nakadai, Yosuke Fukumoto, Ryu Takeda
Abstract: This paper investigates node-pruning-based compression for non-uniform deep learning models such as acoustic models in automatic speech recognition (ASR). Node pruning for small-footprint ASR has been well studied, but most studies assume a sigmoid activation function and uniform or simple fully connected neural networks without bypass connections. We propose a node pruning method that can be applied to non-sigmoid functions such as ReLU and that can deal with topology-related issues such as bypass connections. To handle non-sigmoid functions, we extend a node entropy technique to estimate node activity. To cope with non-uniform network topology, we propose three criteria: inter-layer pairing, no bypass-connection pruning, and layer-based pruning-rate configuration. The proposed method, a combination of these four techniques and criteria, was applied to compress a Kaldi acoustic model with ReLU as the non-linear function, time delay neural networks (TDNN), and bypass connections inspired by residual networks. Experimental results showed that the proposed method achieved a 31% speed increase while maintaining comparable ASR accuracy by taking network topology into consideration.
Citations: 1
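A toy node-pruning pass for the simple fully connected case (the paper's criteria for bypass connections and inter-layer pairing go beyond this sketch): hidden units are ranked by an activity statistic computed over data, with variance here standing in for the paper's entropy-based measure, and the weakest units are dropped by slicing both the layer and its successor.

```python
import torch

def prune_hidden_units(fc1, fc2, X, keep_ratio=0.7):
    with torch.no_grad():
        acts = torch.relu(fc1(X))                    # (N, hidden) activations
        score = acts.var(dim=0)                      # low variance ~ inactive
        keep = score.argsort(descending=True)[: int(keep_ratio * len(score))]
        # Rebuild fc1 with fewer output units, and fc2 with fewer inputs.
        new1 = torch.nn.Linear(fc1.in_features, len(keep))
        new1.weight.copy_(fc1.weight[keep]); new1.bias.copy_(fc1.bias[keep])
        new2 = torch.nn.Linear(len(keep), fc2.out_features)
        new2.weight.copy_(fc2.weight[:, keep]); new2.bias.copy_(fc2.bias)
    return new1, new2

fc1, fc2 = torch.nn.Linear(40, 256), torch.nn.Linear(256, 10)
fc1, fc2 = prune_hidden_units(fc1, fc2, torch.randn(512, 40))
print(fc1, fc2)   # 256 hidden units reduced to 179
```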