2021 IEEE Spoken Language Technology Workshop (SLT)最新文献_第9页

Detecting Expressions with Multimodal Transformers 用多模态变压器检测表达式

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2020-11-30 DOI: 10.1109/SLT48900.2021.9383573

Srinivas Parthasarathy, Shiva Sundaram

{"title":"Detecting Expressions with Multimodal Transformers","authors":"Srinivas Parthasarathy, Shiva Sundaram","doi":"10.1109/SLT48900.2021.9383573","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383573","url":null,"abstract":"Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person’s audio-visual expression that includes tone of the voice and facial expression serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of user’s expression. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared to current state of the art. Next, we propose the transformer architecture with encoder layers that better integrate audio-visual features for expressions tracking. Performance on the Aff-Wild2 database shows that the proposed methods perform better than baseline architecture with recurrent layers with absolute gains approximately 2% for arousal and valence descriptors. Further, multimodal architectures show significant improvements over models trained on single modalities with gains of up to 3.6%. Ablation studies show the significance of the visual modality for the expression detection on the Aff-Wild2 database.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130925345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 15

Look Who’s Not Talking 看看谁没说话

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2020-11-30 DOI: 10.1109/SLT48900.2021.9383502

Youngki Kwon, Hee-Soo Heo, Jaesung Huh, Bong-Jin Lee, Joon Son Chung

引用次数: 26

Transformer-Based Online Speech Recognition with Decoder-end Adaptive Computation Steps 基于变压器的在线语音识别解码器端自适应计算步骤

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2020-11-27 DOI: 10.1109/SLT48900.2021.9383613

Mohan Li, Catalin Zorila, R. Doddipatla

{"title":"Transformer-Based Online Speech Recognition with Decoder-end Adaptive Computation Steps","authors":"Mohan Li, Catalin Zorila, R. Doddipatla","doi":"10.1109/SLT48900.2021.9383613","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383613","url":null,"abstract":"Transformer-based end-to-end (E2E) automatic speech recognition (ASR) systems have recently gained wide popularity, and are shown to outperform E2E models based on recurrent structures on a number of ASR tasks. However, like other E2E models, Transformer ASR also requires the full input sequence for calculating the attentions on both encoder and decoder, leading to increased latency and posing a challenge for online ASR. The paper proposes Decoder-end Adaptive Computation Steps (DACS) algorithm to address the issue of latency and facilitate online ASR. The proposed algorithm streams the decoding of Transformer ASR by triggering an output after the confidence acquired from the encoder states reaches a certain threshold. Unlike other monotonic attention mechanisms that risk visiting the entire encoder states for each output step, the paper introduces a maximum look-ahead step into the DACS algorithm to prevent from reaching the end of speech too fast. A Chunkwise en-coder is adopted in our system to handle real-time speech inputs. The proposed online Transformer ASR system has been evaluated on Wall Street Journal (WSJ) and AIShell-1 datasets, yielding 5.5% word error rate (WER) and 7.1% character error rate (CER) respectively, with only a minor decay in performance when compared to the offline systems.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124490317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10

Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion 多石英网:多分辨率卷积语音识别与多层特征融合

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2020-11-26 DOI: 10.1109/SLT48900.2021.9383532

Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao

引用次数: 8

VOXLINGUA107: A Dataset for Spoken Language Recognition VOXLINGUA107:口语识别数据集

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2020-11-25 DOI: 10.1109/SLT48900.2021.9383459

Jörgen Valk, Tanel Alumäe

引用次数: 94

Synth2Aug: Cross-Domain Speaker Recognition with TTS Synthesized Speech 基于TTS合成语音的跨域说话人识别

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2020-11-24 DOI: 10.1109/SLT48900.2021.9383525

Yiling Huang, Yutian Chen, Jason W. Pelecanos, Quan Wang

引用次数: 9

Tight Integrated End-to-End Training for Cascaded Speech Translation 紧密集成的端到端级联语音翻译训练

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2020-11-24 DOI: 10.1109/SLT48900.2021.9383462

Parnia Bahar, Tobias Bieschke, R. Schlüter, H. Ney

{"title":"Tight Integrated End-to-End Training for Cascaded Speech Translation","authors":"Parnia Bahar, Tobias Bieschke, R. Schlüter, H. Ney","doi":"10.1109/SLT48900.2021.9383462","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383462","url":null,"abstract":"A cascaded speech translation model relies on discrete and non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation between source speech and target text. Such modeling suffers from error propagation between ASR and MT models. Direct speech translation is an alternative method to avoid error propagation; however, its performance is often behind the cascade system. To use an intermediate representation and preserve the end-to-end trainability, previous studies have proposed using two-stage models by passing the hidden vectors of the recognizer into the decoder of the MT model and ignoring the MT encoder. This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model by optimizing all parameters of ASR and MT models jointly without ignoring any learned parameters. It is a tightly integrated method that passes renormalized source word posterior distributions as a soft decision instead of one-hot vectors and enables backpropagation. Therefore, it provides both transcriptions and translations and achieves strong consistency between them. Our experiments on four tasks with different data scenarios show that the model outperforms cascade models up to 1.8% in BLEU and 2.0% in TER and is superior compared to direct models.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"9 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123669920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 20

Acoustic Span Embeddings for Multilingual Query-by-Example Search 多语言按例查询搜索的声跨嵌入

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2020-11-24 DOI: 10.1109/SLT48900.2021.9383545

Yushi Hu, Shane Settle, Karen Livescu

引用次数: 5

A Light Transformer For Speech-To-Intent Applications 用于语音到意图应用的光变压器

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2020-11-24 DOI: 10.1109/SLT48900.2021.9383559

Pu Wang, H. V. hamme

引用次数: 4

Integration of Variational Autoencoder and Spatial Clustering for Adaptive Multi-Channel Neural Speech Separation 基于变分自编码器和空间聚类的自适应多通道神经语音分离

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2020-11-24 DOI: 10.1109/SLT48900.2021.9383612

Kateřina Žmolíková, Marc Delcroix, L. Burget, T. Nakatani, J. Černocký

{"title":"Integration of Variational Autoencoder and Spatial Clustering for Adaptive Multi-Channel Neural Speech Separation","authors":"Kateřina Žmolíková, Marc Delcroix, L. Burget, T. Nakatani, J. Černocký","doi":"10.1109/SLT48900.2021.9383612","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383612","url":null,"abstract":"In this paper, we propose a method combining variational autoencoder model of speech with a spatial clustering approach for multi-channel speech separation. The advantage of integrating spatial clustering with a spectral model was shown in several works. As the spectral model, previous works used either factorial generative models of the mixed speech or discriminative neural networks. In our work, we combine the strengths of both approaches, by building a factorial model based on a generative neural network, a variational autoencoder. By doing so, we can exploit the modeling power of neural networks, but at the same time, keep a structured model. Such a model can be advantageous when adapting to new noise conditions as only the noise part of the model needs to be modified. We show experimentally, that our model significantly outperforms previous factorial model based on Gaussian mixture model (DOLPHIN), performs comparably to integration of permutation invariant training with spatial clustering, and enables us to easily adapt to new noise conditions.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130673461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3