{"title":"Attention-Based Speech Recognition Using Gaze Information","authors":"Osamu Segawa, Tomoki Hayashi, K. Takeda","doi":"10.1109/ASRU46091.2019.9004030","DOIUrl":null,"url":null,"abstract":"We assume that there is a correlation between an utterance and a corresponding gaze object, and propose a new paradigm of multi-modal end-to-end speech recognition using multimodal information, namely, utterances and corresponding gaze points. In our method, the system extracts acoustic features and corresponding images around gaze points, and inputs the information into the proposed attention-based multiple encoder-decoder networks. This makes it possible to integrate the two different modalities, and the performance of speech recognition is improved. To evaluate the proposed method, we prepared a simulation task of power-line control operations, and built a corpus that contains utterances and corresponding gaze points in the operations. We conducted an experimental evaluation using this corpus, and the results showed the reduction in the CER, suggesting the effectiveness of the proposed method in which acoustic features and gaze information are integrated.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU46091.2019.9004030","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We assume that there is a correlation between an utterance and the object being gazed at, and propose a new paradigm of multi-modal, end-to-end speech recognition that uses two sources of information: utterances and the corresponding gaze points. In our method, the system extracts acoustic features together with images of the regions around the gaze points, and feeds this information into the proposed attention-based multiple encoder-decoder network. This makes it possible to integrate the two modalities and improves speech recognition performance. To evaluate the proposed method, we prepared a simulated power-line control task and built a corpus containing the utterances and corresponding gaze points recorded during the operations. An experimental evaluation on this corpus showed a reduction in the character error rate (CER), suggesting the effectiveness of the proposed method, in which acoustic features and gaze information are integrated.
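To make the idea of an attention-based multiple encoder-decoder network concrete, the following is a minimal PyTorch sketch of one plausible realization: one encoder over acoustic frames, one over image features cropped around gaze points, and a decoder that attends to both and fuses the two context vectors at every step. All module names, dimensions, and the fusion scheme here are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a dual-encoder, attention-based decoder that fuses an
# acoustic encoding with an encoding of gaze-region image features.
# Dimensions, fusion scheme, and module names are assumptions for illustration.
import torch
import torch.nn as nn


class DualEncoderAttentionDecoder(nn.Module):
    def __init__(self, n_mels=80, img_feat_dim=512, hid=256, vocab=100):
        super().__init__()
        # Speech encoder: BiLSTM over log-mel frames.
        self.speech_enc = nn.LSTM(n_mels, hid, batch_first=True, bidirectional=True)
        # Gaze encoder: BiLSTM over per-frame image features around gaze points.
        self.gaze_enc = nn.LSTM(img_feat_dim, hid, batch_first=True, bidirectional=True)
        # Decoder consumes both attention contexts plus the previous output.
        self.dec = nn.LSTMCell(hid * 4 + vocab, hid * 2)
        self.out = nn.Linear(hid * 2, vocab)
        self.vocab = vocab

    def attend(self, query, keys):
        # Dot-product attention: query (B, 2*hid), keys (B, T, 2*hid) -> context (B, 2*hid).
        scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)   # (B, T)
        weights = torch.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)

    def forward(self, speech, gaze_imgs, max_len=50):
        # speech: (B, T_s, n_mels); gaze_imgs: (B, T_g, img_feat_dim)
        h_s, _ = self.speech_enc(speech)
        h_g, _ = self.gaze_enc(gaze_imgs)
        B = speech.size(0)
        state = (speech.new_zeros(B, self.dec.hidden_size),
                 speech.new_zeros(B, self.dec.hidden_size))
        prev = speech.new_zeros(B, self.vocab)   # previous-step output logits
        logits = []
        for _ in range(max_len):
            c_s = self.attend(state[0], h_s)     # acoustic context
            c_g = self.attend(state[0], h_g)     # gaze context
            state = self.dec(torch.cat([c_s, c_g, prev], dim=-1), state)
            prev = self.out(state[0])
            logits.append(prev)
        return torch.stack(logits, dim=1)        # (B, max_len, vocab)
```

In this sketch the modality fusion happens simply by concatenating the two per-step attention contexts before the decoder cell; the paper's actual network may combine the encoders differently (e.g., separate attention weights, hierarchical attention, or learned gating).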