{"title":"用于细粒度烹饪活动识别的食材序列变换信息学习","authors":"Atsushi Okamoto, Katsufumi Inoue, M. Yoshioka","doi":"10.1145/3552485.3554940","DOIUrl":null,"url":null,"abstract":"The goal of our research is to recognize the fine-grained cooking activities (e.g., dicing or mincing in cutting) in the egocentric videos from the sequential transformation of ingredients that are processed by the camera-wearer; these types of activities are classified according to the state of ingredients after processing, and we often utilize the same cooking utensils and similar motions in such activities. Due to the above conditions, the recognition of such activities is a challenging task in computer vision and multimedia analysis. To tackle this problem, we need to perceive the sequential state transformation of ingredients precisely. In this research, to realize this, we propose a new GAN-based network whose characteristic points are 1) we crop images around the ingredient as a preprocessing to remove the environmental information, 2) we generate intermediate images from the past and future images to obtain the sequential information in the generator network, 3) the adversarial network is employed as a discriminator to classify whether the input image is generated one or not, and 4) we employ the temporally coherent network to check the temporal smoothness of input images and to predict cooking activities by comparing the original sequential images and the generated ones. To investigate the effectiveness of our proposed method, for the first step, we especially focus on \"\\textitcutting activities \". From the experimental results with our originally prepared dataset, in this paper, we report the effectiveness of our proposed method.","PeriodicalId":338126,"journal":{"name":"Proceedings of the 1st International Workshop on Multimedia for Cooking, Eating, and related APPlications","volume":"272 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Learning Sequential Transformation Information of Ingredients for Fine-Grained Cooking Activity Recognition\",\"authors\":\"Atsushi Okamoto, Katsufumi Inoue, M. Yoshioka\",\"doi\":\"10.1145/3552485.3554940\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The goal of our research is to recognize the fine-grained cooking activities (e.g., dicing or mincing in cutting) in the egocentric videos from the sequential transformation of ingredients that are processed by the camera-wearer; these types of activities are classified according to the state of ingredients after processing, and we often utilize the same cooking utensils and similar motions in such activities. Due to the above conditions, the recognition of such activities is a challenging task in computer vision and multimedia analysis. To tackle this problem, we need to perceive the sequential state transformation of ingredients precisely. 
In this research, to realize this, we propose a new GAN-based network whose characteristic points are 1) we crop images around the ingredient as a preprocessing to remove the environmental information, 2) we generate intermediate images from the past and future images to obtain the sequential information in the generator network, 3) the adversarial network is employed as a discriminator to classify whether the input image is generated one or not, and 4) we employ the temporally coherent network to check the temporal smoothness of input images and to predict cooking activities by comparing the original sequential images and the generated ones. To investigate the effectiveness of our proposed method, for the first step, we especially focus on \\\"\\\\textitcutting activities \\\". From the experimental results with our originally prepared dataset, in this paper, we report the effectiveness of our proposed method.\",\"PeriodicalId\":338126,\"journal\":{\"name\":\"Proceedings of the 1st International Workshop on Multimedia for Cooking, Eating, and related APPlications\",\"volume\":\"272 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 1st International Workshop on Multimedia for Cooking, Eating, and related APPlications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3552485.3554940\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st International Workshop on Multimedia for Cooking, Eating, and related APPlications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3552485.3554940","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Learning Sequential Transformation Information of Ingredients for Fine-Grained Cooking Activity Recognition
The goal of our research is to recognize fine-grained cooking activities (e.g., dicing or mincing within cutting) in egocentric videos from the sequential transformation of the ingredients processed by the camera wearer. These activities are classified according to the state of the ingredients after processing, and they often involve the same cooking utensils and similar motions. Under these conditions, recognizing such activities is a challenging task in computer vision and multimedia analysis. To tackle this problem, we need to perceive the sequential state transformation of ingredients precisely. To realize this, we propose a new GAN-based network with the following characteristics: 1) as preprocessing, we crop images around the ingredient to remove environmental information; 2) the generator network produces intermediate images from past and future frames to capture sequential information; 3) an adversarial network is employed as a discriminator to classify whether an input image is generated or real; and 4) a temporally coherent network checks the temporal smoothness of the input images and predicts cooking activities by comparing the original image sequences with the generated ones. To investigate the effectiveness of the proposed method, as a first step, we focus in particular on "cutting activities". In this paper, we report the effectiveness of the proposed method through experimental results on our originally prepared dataset.
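The abstract outlines a four-part pipeline but gives no implementation details. Below is a minimal PyTorch sketch of how the described components could fit together, assuming crop preprocessing has already been applied upstream: an intermediate-frame generator fed past and future crops, a frame-level adversarial discriminator, and a temporal-coherence network that compares original and generated sequences to predict the activity. All module names, layer configurations, and tensor shapes are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of the described GAN-based pipeline. Every layer size and
# module name below is an assumption for illustration; the paper's actual
# network design and training procedure may differ.
import torch
import torch.nn as nn

class IntermediateFrameGenerator(nn.Module):
    """Generates an intermediate frame from a past and a future frame
    (both already cropped around the ingredient)."""
    def __init__(self):
        super().__init__()
        # Past and future RGB crops are stacked along the channel axis.
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, past, future):
        return self.net(torch.cat([past, future], dim=1))

class FrameDiscriminator(nn.Module):
    """Classifies a single frame as original (real) or generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1),
        )

    def forward(self, frame):
        return self.net(frame)  # raw real-vs-generated logit

class TemporalCoherenceClassifier(nn.Module):
    """Checks temporal smoothness and predicts the fine-grained activity by
    comparing the original and generated sequences (here paired by
    concatenation along the channel axis at each time step)."""
    def __init__(self, num_classes):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, original_seq, generated_seq):
        # original_seq, generated_seq: (batch, time, 3, H, W)
        b, t = original_seq.shape[:2]
        paired = torch.cat([original_seq, generated_seq], dim=2)
        feats = self.encoder(paired.flatten(0, 1)).view(b, t, -1)
        _, h = self.rnn(feats)
        return self.head(h[-1])  # activity logits

# Toy forward pass with random "cropped" frames (3 cutting classes assumed).
past, future = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
gen = IntermediateFrameGenerator()
mid = gen(past, future)                      # generated intermediate frame
real_or_fake = FrameDiscriminator()(mid)     # adversarial signal
orig_seq = torch.rand(2, 5, 3, 64, 64)
gen_seq = torch.rand(2, 5, 3, 64, 64)        # would come from the generator
logits = TemporalCoherenceClassifier(3)(orig_seq, gen_seq)
print(mid.shape, real_or_fake.shape, logits.shape)
```

In a full training loop, the generator and discriminator would presumably be optimized adversarially while the temporal-coherence network supplies the activity-recognition loss; how those losses are weighted is likewise not specified in the abstract and would be a design choice.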