Multi-modal action segmentation in the kitchen with a feature fusion approach

International Conference on Quality Control by Artificial Vision. Pub Date: 2021-07-16. DOI: 10.1117/12.2591752

Shunsuke Kogure, Y. Aoki

Citations: 0

Abstract

In this paper, we propose a multi-modal action segmentation approach that uses three modalities, (i) video, (ii) audio, and (iii) thermal data, to classify cooking behavior in the kitchen. We assume that all three modalities carry features related to cooking. However, no public dataset contains all three modalities, so we built our own dataset with frame-level annotations. We then examined how useful multi-modal features are for action segmentation and analyzed the effect of each modality using three evaluation metrics. Compared with using images alone, accuracy, edit distance, and F1 score improved by up to about 1%, 2%, and 8%, respectively.
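The abstract does not say how the three modalities are fused or exactly how the metrics are computed. The following is a minimal sketch under stated assumptions: fusion is taken to be early fusion by concatenating time-aligned per-frame feature vectors (one common feature-fusion scheme, not necessarily the authors'), and the metrics follow definitions widely used in action-segmentation benchmarks: frame-wise accuracy, a normalized segmental edit score (higher is better), and segmental F1 at an IoU threshold (0.5 here is an assumed value). All function names are hypothetical.

```python
def fuse_features(video, audio, thermal):
    """Hypothetical early fusion: concatenate per-frame feature vectors.

    Assumes the three streams are already aligned to the same frame rate.
    """
    if not (len(video) == len(audio) == len(thermal)):
        raise ValueError("modalities must have the same number of frames")
    return [list(v) + list(a) + list(t) for v, a, t in zip(video, audio, thermal)]


def to_segments(labels):
    """Collapse frame labels into (label, start, end) segments; end is exclusive."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def frame_accuracy(pred, gt):
    """Percentage of frames whose predicted label equals the ground truth."""
    return 100.0 * sum(p == g for p, g in zip(pred, gt)) / len(gt)


def edit_score(pred, gt):
    """100 * (1 - normalized Levenshtein distance) over segment label sequences."""
    p = [s[0] for s in to_segments(pred)]
    g = [s[0] for s in to_segments(gt)]
    if not p and not g:
        return 100.0
    d = [[0] * (len(g) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        d[i][0] = i
    for j in range(len(g) + 1):
        d[0][j] = j
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 100.0 * (1 - d[len(p)][len(g)] / max(len(p), len(g)))


def f1_at_k(pred, gt, overlap=0.5):
    """Segmental F1: a predicted segment is a true positive if its best-IoU
    ground-truth segment of the same label reaches `overlap` and is unmatched."""
    p_segs, g_segs = to_segments(pred), to_segments(gt)
    hit = [False] * len(g_segs)
    tp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(g_segs):
            if g_label != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            iou = inter / (max(pe, ge) - min(ps, gs))
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= overlap and not hit[best_j]:
            hit[best_j] = True
            tp += 1
    precision = tp / len(p_segs) if p_segs else 0.0
    recall = tp / len(g_segs) if g_segs else 0.0
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```

For example, with ground truth `["cut"] * 4 + ["stir"] * 4` and a prediction that switches to `"stir"` one frame early, frame accuracy is 87.5 while the edit score and F1 stay at 100, because the segment order and overlaps are preserved. This illustrates why segment-level metrics can move differently from frame accuracy, as the reported 1% vs. 8% gains suggest.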
