2021 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Detecting Expressions with Multimodal Transformers
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2020-11-30 | DOI: 10.1109/SLT48900.2021.9383573
Srinivas Parthasarathy, Shiva Sundaram
{"title":"Detecting Expressions with Multimodal Transformers","authors":"Srinivas Parthasarathy, Shiva Sundaram","doi":"10.1109/SLT48900.2021.9383573","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383573","url":null,"abstract":"Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person’s audio-visual expression that includes tone of the voice and facial expression serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of user’s expression. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared to current state of the art. Next, we propose the transformer architecture with encoder layers that better integrate audio-visual features for expressions tracking. Performance on the Aff-Wild2 database shows that the proposed methods perform better than baseline architecture with recurrent layers with absolute gains approximately 2% for arousal and valence descriptors. Further, multimodal architectures show significant improvements over models trained on single modalities with gains of up to 3.6%. Ablation studies show the significance of the visual modality for the expression detection on the Aff-Wild2 database.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130925345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 15
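The encoder-layer fusion described in the abstract lends itself to a compact illustration. The sketch below is not the authors' implementation: the shared model dimension, the additive fusion of projected audio and visual features, the layer counts, and the tanh regression head are all illustrative assumptions.

```python
# Minimal sketch of a transformer-based audio-visual model for frame-level
# arousal/valence regression. Feature dimensions and the fusion strategy are
# assumptions for illustration, not the published setup.
import torch
import torch.nn as nn

class AVExpressionTransformer(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=512, d_model=256,
                 nhead=4, num_layers=4):
        super().__init__()
        # Project each modality to a shared model dimension, then fuse additively.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.head = nn.Linear(d_model, 2)  # arousal and valence per frame

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, time, audio_dim); visual_feats: (batch, time, visual_dim)
        fused = self.audio_proj(audio_feats) + self.visual_proj(visual_feats)
        encoded = self.encoder(fused)
        return torch.tanh(self.head(encoded))  # values in [-1, 1]

model = AVExpressionTransformer()
audio = torch.randn(2, 100, 40)    # e.g. 100 frames of log-mel features
visual = torch.randn(2, 100, 512)  # e.g. per-frame face embeddings
print(model(audio, visual).shape)  # torch.Size([2, 100, 2])
```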
Look Who’s Not Talking
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2020-11-30 | DOI: 10.1109/SLT48900.2021.9383502
Youngki Kwon, Hee-Soo Heo, Jaesung Huh, Bong-Jin Lee, Joon Son Chung
{"title":"Look Who’s Not Talking","authors":"Youngki Kwon, Hee-Soo Heo, Jaesung Huh, Bong-Jin Lee, Joon Son Chung","doi":"10.1109/SLT48900.2021.9383502","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383502","url":null,"abstract":"The objective of this work is speaker diarisation of speech recordings ‘in the wild’. The ability to determine speech segments is a crucial part of diarisation systems, accounting for a large proportion of errors. In this paper, we present a simple but effective solution for speech activity detection based on the speaker embeddings. In particular, we discover that the norm of the speaker embedding is an extremely effective indicator of speech activity. The method does not require an independent model for speech activity detection, therefore allows speaker diarisation to be performed using a unified representation for both speaker modelling and speech activity detection. We perform a number of experiments on in-house and public datasets, in which our method outperforms popular baselines.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125979622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 26
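The core observation, that the norm of a frame-level speaker embedding indicates speech activity, can be illustrated in a few lines. The sketch below assumes embeddings are already extracted; the embedding network and the threshold value are placeholders, not details from the paper.

```python
# Minimal sketch: threshold the L2 norm of per-frame speaker embeddings to
# obtain a speech activity mask. The random "embeddings" only mimic the
# property that silence-like frames have small norm.
import numpy as np

def speech_activity_from_embeddings(embeddings, threshold):
    """embeddings: (num_frames, embed_dim) array of per-frame speaker embeddings.
    Returns a boolean mask marking frames judged to contain speech."""
    norms = np.linalg.norm(embeddings, axis=1)
    return norms > threshold

rng = np.random.default_rng(0)
speech = rng.normal(0.0, 1.0, size=(50, 256))    # larger-norm frames
silence = rng.normal(0.0, 0.05, size=(50, 256))  # near-zero frames
frames = np.vstack([speech, silence])
mask = speech_activity_from_embeddings(frames, threshold=5.0)
print(mask[:5], mask[-5:])  # first frames True, last frames False
```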
Transformer-Based Online Speech Recognition with Decoder-end Adaptive Computation Steps
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2020-11-27 | DOI: 10.1109/SLT48900.2021.9383613
Mohan Li, Catalin Zorila, R. Doddipatla
{"title":"Transformer-Based Online Speech Recognition with Decoder-end Adaptive Computation Steps","authors":"Mohan Li, Catalin Zorila, R. Doddipatla","doi":"10.1109/SLT48900.2021.9383613","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383613","url":null,"abstract":"Transformer-based end-to-end (E2E) automatic speech recognition (ASR) systems have recently gained wide popularity, and are shown to outperform E2E models based on recurrent structures on a number of ASR tasks. However, like other E2E models, Transformer ASR also requires the full input sequence for calculating the attentions on both encoder and decoder, leading to increased latency and posing a challenge for online ASR. The paper proposes Decoder-end Adaptive Computation Steps (DACS) algorithm to address the issue of latency and facilitate online ASR. The proposed algorithm streams the decoding of Transformer ASR by triggering an output after the confidence acquired from the encoder states reaches a certain threshold. Unlike other monotonic attention mechanisms that risk visiting the entire encoder states for each output step, the paper introduces a maximum look-ahead step into the DACS algorithm to prevent from reaching the end of speech too fast. A Chunkwise en-coder is adopted in our system to handle real-time speech inputs. The proposed online Transformer ASR system has been evaluated on Wall Street Journal (WSJ) and AIShell-1 datasets, yielding 5.5% word error rate (WER) and 7.1% character error rate (CER) respectively, with only a minor decay in performance when compared to the offline systems.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124490317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
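The stopping rule can be sketched as the abstract describes it: confidence accumulated over encoder states triggers an output once it crosses a threshold, with a maximum look-ahead as a safeguard. The synthetic confidence values, the threshold of 1.0, and the look-ahead of 16 frames below are assumptions for illustration only, not the paper's configuration.

```python
# Illustrative DACS-style halting: accumulate per-frame confidence from a
# starting encoder position until a threshold or the look-ahead limit is hit.
import numpy as np

def dacs_stop_position(confidences, start, threshold=1.0, max_lookahead=16):
    """Return the encoder frame index at which the current output token is
    emitted, scanning monotonically from `start`."""
    total = 0.0
    for offset, c in enumerate(confidences[start:start + max_lookahead]):
        total += c
        if total >= threshold:
            return start + offset
    # Halt at the look-ahead limit if the threshold was never reached.
    return min(start + max_lookahead, len(confidences)) - 1

rng = np.random.default_rng(1)
conf = rng.uniform(0.0, 0.4, size=100)  # synthetic per-frame confidences
pos = 0
for step in range(5):
    pos = dacs_stop_position(conf, pos)
    print(f"output step {step}: halted at encoder frame {pos}")
    pos += 1  # the next output step continues from the following frame
```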
Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2020-11-26 | DOI: 10.1109/SLT48900.2021.9383532
Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao
{"title":"Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion","authors":"Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao","doi":"10.1109/SLT48900.2021.9383532","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383532","url":null,"abstract":"In this paper, we propose an end-to-end speech recognition network based on Nvidia’s previous QuartzNet [1] model. We try to promote the model performance, and design three components: (1) Multi-Resolution Convolution Module, re-places the original 1D time-channel separable convolution with multi-stream convolutions. Each stream has a unique dilated stride on convolutional operations. (2) Channel-Wise Attention Module, calculates the attention weight of each convolutional stream by spatial channel-wise pooling. (3) Multi-Layer Feature Fusion Module, reweights each convolutional block by global multi-layer feature maps. Our experiments demonstrate that Multi-QuartzNet model achieves CER 6.77% on AISHELL-1 data set, which outperforms original QuartzNet and is close to state-of-art result.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130174079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
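A rough picture of components (1) and (2), parallel convolution streams with distinct dilations weighted by a pooled channel-wise attention, is sketched below. Channel sizes, kernel width, dilation values, and the exact gating form are illustrative assumptions rather than the published Multi-QuartzNet configuration.

```python
# Sketch of multi-resolution convolution: parallel depthwise-separable 1D
# convolution streams with different dilations, combined by attention
# weights computed from globally pooled features.
import torch
import torch.nn as nn

class MultiResolutionConv(nn.Module):
    def __init__(self, channels=256, kernel_size=33, dilations=(1, 2, 3)):
        super().__init__()
        self.streams = nn.ModuleList()
        for d in dilations:
            pad = (kernel_size - 1) // 2 * d  # keep the time length unchanged
            self.streams.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size, padding=pad,
                          dilation=d, groups=channels),   # depthwise (time-channel separable)
                nn.Conv1d(channels, channels, kernel_size=1),  # pointwise
            ))
        # Attention over streams from globally pooled channel features.
        self.attn = nn.Linear(channels, len(dilations))

    def forward(self, x):
        # x: (batch, channels, time)
        outs = torch.stack([s(x) for s in self.streams], dim=1)     # (B, S, C, T)
        weights = torch.softmax(self.attn(x.mean(dim=-1)), dim=-1)  # (B, S)
        return (outs * weights[:, :, None, None]).sum(dim=1)

block = MultiResolutionConv()
feats = torch.randn(4, 256, 200)  # e.g. 200 frames of encoder features
print(block(feats).shape)         # torch.Size([4, 256, 200])
```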
VOXLINGUA107: A Dataset for Spoken Language Recognition
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2020-11-25 | DOI: 10.1109/SLT48900.2021.9383459
Jörgen Valk, Tanel Alumäe
{"title":"VOXLINGUA107: A Dataset for Spoken Language Recognition","authors":"Jörgen Valk, Tanel Alumäe","doi":"10.1109/SLT48900.2021.9383459","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383459","url":null,"abstract":"This paper investigates the use of automatically collected web audio data for the task of spoken language recognition. We generate semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages. Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech. Post-filtering is used to remove segments from the database that are likely not in the given language, increasing the proportion of correctly labeled segments to 98%, based on crowd-sourced verification. The size of the resulting training set (VoxLingua107) is 6628 hours (62 hours per language on the average) and it is accompanied by an evaluation set of 1609 verified utterances. We use the data to build language recognition models for several spoken language identification tasks. Experiments show that using the automatically retrieved training data gives competitive results to using hand-labeled proprietary datasets. The dataset is publicly available1.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114763083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 94
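The post-filtering step can be sketched as a simple confidence threshold over a language-identification score. The language_id_score function below is a hypothetical stand-in for whatever language-ID model is used; the paper's actual filtering criteria and threshold are not reproduced here.

```python
# Sketch of post-filtering: keep only retrieved segments whose estimated
# probability of being in the target language clears a threshold.
def post_filter(segments, target_language, language_id_score, threshold=0.5):
    """segments: list of audio segment identifiers (e.g. file paths).
    language_id_score(segment, lang) -> probability in [0, 1] (assumed API).
    Returns only the segments likely to be in `target_language`."""
    kept = []
    for seg in segments:
        if language_id_score(seg, target_language) >= threshold:
            kept.append(seg)
    return kept

# Toy usage with a dummy scorer standing in for a real language-ID model.
def dummy_scorer(segment, lang):
    return 0.9 if segment.endswith("_et.wav") else 0.2

candidates = ["clip1_et.wav", "clip2_ru.wav", "clip3_et.wav"]
print(post_filter(candidates, "et", dummy_scorer))  # ['clip1_et.wav', 'clip3_et.wav']
```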
Synth2Aug: Cross-Domain Speaker Recognition with TTS Synthesized Speech
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2020-11-24 | DOI: 10.1109/SLT48900.2021.9383525
Yiling Huang, Yutian Chen, Jason W. Pelecanos, Quan Wang
{"title":"Synth2Aug: Cross-Domain Speaker Recognition with TTS Synthesized Speech","authors":"Yiling Huang, Yutian Chen, Jason W. Pelecanos, Quan Wang","doi":"10.1109/SLT48900.2021.9383525","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383525","url":null,"abstract":"In recent years, Text-To-Speech (TTS) has been used as a data augmentation technique for speech recognition to help complement inadequacies in the training data. Correspondingly, we investigate the use of a multi-speaker TTS system to synthesize speech in support of speaker recognition. In this study we focus the analysis on tasks where a relatively small number of speakers is available for training. We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance and can be combined effectively with multi-style training. Additionally, we explore the effectiveness of different types of text transcripts used for TTS synthesis. Results suggest that matching the textual content of the target domain is a good practice, and if that is not feasible, a transcript with a sufficiently large vocabulary is recommended.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126455566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
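The augmentation recipe amounts to extending a small real corpus with TTS output generated from (ideally target-domain) transcripts. The sketch below uses a hypothetical synthesize(text, speaker_id) function as a stand-in for a multi-speaker TTS system; nothing here reflects the paper's actual TTS setup or transcript selection.

```python
# Sketch of assembling a training set from real utterances plus utterances
# synthesized by a (hypothetical) multi-speaker TTS front end.
import random

def build_training_set(real_utts, transcripts, speakers, synthesize, n_synth=1000):
    """real_utts: list of (waveform, speaker_id) pairs from the real corpus.
    transcripts: pool of text prompts, ideally matching the target domain.
    synthesize(text, speaker_id) -> waveform (assumed API).
    Returns the combined real + synthetic training list."""
    synthetic = []
    for _ in range(n_synth):
        text = random.choice(transcripts)
        speaker = random.choice(speakers)
        synthetic.append((synthesize(text, speaker), speaker))
    return real_utts + synthetic

# Toy usage with a dummy synthesizer standing in for a real TTS model.
dummy_tts = lambda text, spk: f"<waveform of '{text}' in voice {spk}>"
combined = build_training_set(
    real_utts=[("wav1", "spk_a"), ("wav2", "spk_b")],
    transcripts=["turn on the lights", "set a timer for five minutes"],
    speakers=["tts_spk_1", "tts_spk_2"],
    synthesize=dummy_tts,
    n_synth=3,
)
print(len(combined))  # 5
```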
Tight Integrated End-to-End Training for Cascaded Speech Translation
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2020-11-24 | DOI: 10.1109/SLT48900.2021.9383462
Parnia Bahar, Tobias Bieschke, R. Schlüter, H. Ney
{"title":"Tight Integrated End-to-End Training for Cascaded Speech Translation","authors":"Parnia Bahar, Tobias Bieschke, R. Schlüter, H. Ney","doi":"10.1109/SLT48900.2021.9383462","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383462","url":null,"abstract":"A cascaded speech translation model relies on discrete and non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation between source speech and target text. Such modeling suffers from error propagation between ASR and MT models. Direct speech translation is an alternative method to avoid error propagation; however, its performance is often behind the cascade system. To use an intermediate representation and preserve the end-to-end trainability, previous studies have proposed using two-stage models by passing the hidden vectors of the recognizer into the decoder of the MT model and ignoring the MT encoder. This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model by optimizing all parameters of ASR and MT models jointly without ignoring any learned parameters. It is a tightly integrated method that passes renormalized source word posterior distributions as a soft decision instead of one-hot vectors and enables backpropagation. Therefore, it provides both transcriptions and translations and achieves strong consistency between them. Our experiments on four tasks with different data scenarios show that the model outperforms cascade models up to 1.8% in BLEU and 2.0% in TER and is superior compared to direct models.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"9 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123669920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 20
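The "soft decision" interface, feeding renormalized ASR word posteriors rather than one-hot tokens into the MT source embeddings so that gradients flow across the cascade, can be shown in a few lines. Dimensions and the temperature-softmax renormalization below are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: the expected source embedding under the ASR posterior keeps the
# ASR-to-MT interface differentiable, unlike hard argmax decisions.
import torch
import torch.nn.functional as F

vocab_size, embed_dim, src_len = 1000, 512, 20
embedding_matrix = torch.randn(vocab_size, embed_dim, requires_grad=True)

asr_logits = torch.randn(src_len, vocab_size, requires_grad=True)  # from the ASR decoder
posteriors = F.softmax(asr_logits / 1.5, dim=-1)  # renormalized soft decision (temperature assumed)

# Hard (one-hot) path for contrast: not differentiable w.r.t. the ASR logits.
hard_ids = asr_logits.argmax(dim=-1)
hard_embeddings = embedding_matrix[hard_ids]

# Soft path: expected embedding under the posterior; gradients reach the ASR side.
soft_embeddings = posteriors @ embedding_matrix   # (src_len, embed_dim)

loss = soft_embeddings.sum()   # stand-in for the downstream MT loss
loss.backward()
print(asr_logits.grad is not None)  # True: the cascade is end-to-end trainable
```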
Acoustic Span Embeddings for Multilingual Query-by-Example Search
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2020-11-24 | DOI: 10.1109/SLT48900.2021.9383545
Yushi Hu, Shane Settle, Karen Livescu
{"title":"Acoustic Span Embeddings for Multilingual Query-by-Example Search","authors":"Yushi Hu, Shane Settle, Karen Livescu","doi":"10.1109/SLT48900.2021.9383545","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383545","url":null,"abstract":"Query-by-example (QbE) speech search is the task of matching spoken queries to utterances within a search collection. In low-or zero-resource settings, QbE search is often addressed with approaches based on dynamic time warping (DTW). Recent work has found that methods based on acoustic word embeddings (AWEs) can improve both performance and search speed. However, prior work on AWE-based QbE has primarily focused on English data and with single-word queries. In this work, we generalize AWE training to spans of words, producing acoustic span embeddings (ASE), and explore the application of ASE to QbE with arbitrary-length queries in multiple unseen languages. We consider the commonly used setting where we have access to labeled data in other languages (in our case, several low-resource languages) distinct from the unseen test languages. We evaluate our approach on the QUESST 2015 QbE tasks, finding that multilingual ASE-based search is much faster than DTW-based search and outperforms the best previously published results on this task.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126971836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
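The speed argument is easy to see once spans are embedded: matching reduces to a single similarity computation per query instead of a DTW alignment per pair. The sketch below uses random vectors in place of trained ASE embeddings and plain cosine similarity as the scoring function.

```python
# Sketch of embedding-based QbE search: rank candidate spans by cosine
# similarity to the query embedding.
import numpy as np

def qbe_search(query_emb, span_embs, top_k=5):
    """query_emb: (dim,) vector; span_embs: (num_spans, dim) matrix.
    Returns indices of the top_k most similar spans by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    s = span_embs / np.linalg.norm(span_embs, axis=1, keepdims=True)
    scores = s @ q
    return np.argsort(-scores)[:top_k]

rng = np.random.default_rng(0)
collection = rng.normal(size=(10000, 256))            # embeddings of candidate spans
query = collection[42] + 0.1 * rng.normal(size=256)   # a noisy copy of span 42
print(qbe_search(query, collection))                  # span 42 should rank first
```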
A Light Transformer For Speech-To-Intent Applications
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2020-11-24 | DOI: 10.1109/SLT48900.2021.9383559
Pu Wang, H. V. hamme
{"title":"A Light Transformer For Speech-To-Intent Applications","authors":"Pu Wang, H. V. hamme","doi":"10.1109/SLT48900.2021.9383559","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383559","url":null,"abstract":"Spoken language understanding (SLU) systems can make life more agreeable, safer (e.g. in a car) or can increase the independence of physically challenged users. However, due to the many sources of variation in speech, a well-trained system is hard to transfer to other conditions like a different language or to speech impaired users. A remedy is to design a user-taught SLU system that can learn fully from scratch from users’ demonstrations, which in turn requires that the system’s model quickly converges after only a few training samples. In this paper, we propose a light transformer structure by using a simplified relative position encoding with the goal to reduce the model size and improve efficiency. The light transformer works as an alternative speech encoder for an existing user-taught multitask SLU system. Experimental results on three datasets with challenging speech conditions prove our approach outperforms the existed system and other state-of-art models with half of the original model size and training time.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"210 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132676078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
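A generic form of relative position encoding, a learned bias indexed by the clipped query-key distance and added to the attention logits, is sketched below. It is not the specific simplification proposed in the paper; the bias parameterization, single-head attention, and clipping distance are assumptions.

```python
# Sketch of self-attention with a simplified relative position term added
# to the attention logits.
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    def __init__(self, d_model=128, max_dist=16):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.scale = d_model ** -0.5
        self.max_dist = max_dist
        # One learnable bias per clipped relative distance in [-max_dist, max_dist].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_dist + 1))

    def forward(self, x):
        # x: (batch, time, d_model)
        T = x.size(1)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(1, 2) * self.scale             # (B, T, T)
        pos = torch.arange(T)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_dist, self.max_dist)
        logits = logits + self.rel_bias[rel + self.max_dist]    # add the relative bias
        attn = torch.softmax(logits, dim=-1)
        return attn @ v

layer = RelPosSelfAttention()
frames = torch.randn(2, 50, 128)   # e.g. 50 acoustic frames
print(layer(frames).shape)         # torch.Size([2, 50, 128])
```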
Integration of Variational Autoencoder and Spatial Clustering for Adaptive Multi-Channel Neural Speech Separation
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2020-11-24 | DOI: 10.1109/SLT48900.2021.9383612
Kateřina Žmolíková, Marc Delcroix, L. Burget, T. Nakatani, J. Černocký
{"title":"Integration of Variational Autoencoder and Spatial Clustering for Adaptive Multi-Channel Neural Speech Separation","authors":"Kateřina Žmolíková, Marc Delcroix, L. Burget, T. Nakatani, J. Černocký","doi":"10.1109/SLT48900.2021.9383612","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383612","url":null,"abstract":"In this paper, we propose a method combining variational autoencoder model of speech with a spatial clustering approach for multi-channel speech separation. The advantage of integrating spatial clustering with a spectral model was shown in several works. As the spectral model, previous works used either factorial generative models of the mixed speech or discriminative neural networks. In our work, we combine the strengths of both approaches, by building a factorial model based on a generative neural network, a variational autoencoder. By doing so, we can exploit the modeling power of neural networks, but at the same time, keep a structured model. Such a model can be advantageous when adapting to new noise conditions as only the noise part of the model needs to be modified. We show experimentally, that our model significantly outperforms previous factorial model based on Gaussian mixture model (DOLPHIN), performs comparably to integration of permutation invariant training with spatial clustering, and enables us to easily adapt to new noise conditions.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130673461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
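The spectral half of the method is a variational autoencoder over speech spectra. The sketch below shows only that generic building block, with the usual reconstruction-plus-KL objective; the multi-channel spatial clustering and the integration between the two models are not shown, and all dimensions are illustrative.

```python
# Minimal sketch of a VAE over log-spectrogram frames: encode to a Gaussian
# latent, sample with the reparameterization trick, decode, and train with
# reconstruction + KL divergence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameVAE(nn.Module):
    def __init__(self, n_freq=257, latent_dim=32, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_freq))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    recon_term = F.mse_loss(recon, x, reduction="sum")
    kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + kl_term

model = FrameVAE()
frames = torch.randn(64, 257)            # a batch of log-spectrogram frames
recon, mu, logvar = model(frames)
print(vae_loss(recon, frames, mu, logvar).item() > 0)  # True
```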