Deep End-to-End Representation Learning for Food Type Recognition from Speech

Proceedings of the 20th ACM International Conference on Multimodal Interaction
Pub Date: 2018-10-02
DOI: 10.1145/3242969.3243683

Benjamin Sertolli, N. Cummins, A. Şengür, Björn Schuller

Citations: 2

Abstract

Using a Convolutional Neural Network (CNN) pre-trained on one task as a feature extractor for another task is standard practice in many image classification paradigms. To date, however, comparatively few works have explored this technique for speech classification tasks. Herein, we utilise a pre-trained end-to-end Automatic Speech Recognition CNN as a feature extractor for the task of food-type recognition from speech. We also explore the benefits of Compact Bilinear Pooling for combining multiple feature representations extracted from the CNN. The key results presented indicate the suitability of this approach. When combined with a Recurrent Neural Network classifier, our strongest system achieves an unweighted average recall of 73.3% on the test set of the iHEARu-EAT database for a seven-class food-type classification task.
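The reported metric, unweighted average recall (UAR), is the mean of the per-class recalls, so each class contributes equally regardless of how many samples it has. The following is a minimal sketch of that definition in plain Python. It is not the authors' evaluation code, and the function name and toy labels are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    totals = defaultdict(int)  # number of samples per true class
    hits = defaultdict(int)    # correctly predicted samples per true class
    for truth, guess in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == guess:
            hits[truth] += 1

    # Recall for a class = correct predictions / samples of that class.
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced two-class toy data (not taken from iHEARu-EAT).
y_true = ["apple", "apple", "apple", "apple", "banana", "banana"]
y_pred = ["apple", "apple", "apple", "banana", "banana", "apple"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

On this toy data the per-class recalls are 3/4 and 1/2, giving a UAR of 0.625, while plain accuracy is 4/6. The gap shows why UAR is preferred when class sizes differ.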

