Frame Selection for Producing Recipe with Pictures from an Execution Video of a Recipe

Taichi Nishimura, Atsushi Hashimoto, Yoko Yamakata, Shinsuke Mori
{"title":"Frame Selection for Producing Recipe with Pictures from an Execution Video of a Recipe","authors":"Taichi Nishimura, Atsushi Hashimoto, Yoko Yamakata, Shinsuke Mori","doi":"10.1145/3326458.3326928","DOIUrl":null,"url":null,"abstract":"In cooking procedure instruction, text format plays an important role in conveying quantitative information accurately, such as time and quantity. On the other hand, image format can smoothly convey qualitative information (e.g., the target food state of a procedure) at a glance. Our goal is to produce multimedia recipes, which have texts and corresponding pictures, for chefs to better understand the procedures. The system takes a procedural text and its unedited execution video as the input and outputs selected frames for instructions in the text. We assume that a frame suits to an instruction when they share key objects. Under this assumption, we extract the information of key objects using named entity recognizer from the text and object detection from the frame, and we convert them into feature vectors and calculate their cosine similarity. To enhance the measurement, we also calculate the scene importance based on the latest changes in object appearance, and aggregate it to the cosine similarity. Finally we align the instruction sequence and the frame sequence using the Viterbi algorithm referring to this suitability and get the frame selection for each instruction. We implemented our method and tested it on a dataset consisting of text recipes and their execution videos. In the experiments we compared the automatic alignment results with those by human annotators. The precision, recall, and F-measure showed that the proposed approach made a steady improvement in this challenging problem of selecting pictures from an unedited video.","PeriodicalId":184771,"journal":{"name":"Proceedings of the 11th Workshop on Multimedia for Cooking and Eating Activities","volume":"109 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 11th Workshop on Multimedia for Cooking and Eating Activities","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3326458.3326928","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

In cooking procedure instructions, the text format plays an important role in accurately conveying quantitative information such as time and quantity. The image format, on the other hand, can convey qualitative information (e.g., the target food state of a procedure) at a glance. Our goal is to produce multimedia recipes, which combine texts with corresponding pictures, so that chefs can better understand the procedures. The system takes a procedural text and its unedited execution video as input and outputs a selected frame for each instruction in the text. We assume that a frame suits an instruction when they share key objects. Under this assumption, we extract key objects from the text with a named entity recognizer and from the frame with an object detector, convert them into feature vectors, and calculate their cosine similarity. To enhance this measurement, we also calculate a scene importance based on the latest changes in object appearance and aggregate it with the cosine similarity. Finally, we align the instruction sequence and the frame sequence with the Viterbi algorithm, referring to this suitability, and obtain the frame selection for each instruction. We implemented our method and tested it on a dataset consisting of text recipes and their execution videos. In the experiments we compared the automatic alignment results with those produced by human annotators. The precision, recall, and F-measure showed that the proposed approach made a steady improvement on this challenging problem of selecting pictures from an unedited video.
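
The abstract describes the pipeline only at a high level, so the sketch below illustrates how its two core pieces could fit together: a suitability score from cosine similarity over bag-of-object vectors, and a Viterbi-style monotonic alignment of instructions to frames. This is not the authors' implementation; the binary object vectors, the multiplicative per-frame importance weight (standing in for the paper's scene-importance term), and the strictly increasing frame order are all assumptions made for illustration.

```python
# Minimal sketch of the suitability-plus-alignment idea from the abstract.
# Assumptions (not from the paper): object mentions per instruction and detected
# objects per frame are given as plain string lists; suitability is cosine
# similarity over binary bag-of-object vectors; scene importance is approximated
# by an optional per-frame weight; alignment is a monotonic Viterbi-style DP.
import numpy as np

def bag_of_objects(objects, vocab):
    """Binary bag-of-objects vector over a shared vocabulary."""
    v = np.zeros(len(vocab))
    for obj in objects:
        if obj in vocab:
            v[vocab[obj]] = 1.0
    return v

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0

def align(instr_objects, frame_objects, frame_importance=None):
    """Pick one frame per instruction, in temporal order, maximizing total suitability."""
    vocab = {o: i for i, o in enumerate(sorted({o for s in instr_objects + frame_objects for o in s}))}
    I = [bag_of_objects(s, vocab) for s in instr_objects]
    F = [bag_of_objects(s, vocab) for s in frame_objects]
    n, m = len(I), len(F)
    if frame_importance is None:
        frame_importance = np.ones(m)   # stand-in for the paper's scene-importance term
    # Suitability: cosine similarity between instruction and frame object vectors,
    # weighted by the frame's importance (an assumed aggregation, not the paper's).
    S = np.array([[cosine(I[i], F[j]) * frame_importance[j] for j in range(m)]
                  for i in range(n)])

    # Viterbi-style DP over monotonic alignments: dp[i][j] is the best total score
    # when instruction i is assigned frame j and earlier instructions use earlier frames.
    dp = np.full((n, m), -np.inf)
    back = np.zeros((n, m), dtype=int)
    dp[0] = S[0]
    for i in range(1, n):
        best_prev, arg_prev = -np.inf, 0
        for j in range(m):
            back[i, j] = arg_prev
            dp[i, j] = S[i, j] + best_prev
            if dp[i - 1, j] > best_prev:    # frames strictly before j+1 now include j
                best_prev, arg_prev = dp[i - 1, j], j
    # Trace back the chosen frame index for each instruction.
    j = int(np.argmax(dp[-1]))
    chosen = [j]
    for i in range(n - 1, 0, -1):
        j = int(back[i, j])
        chosen.append(j)
    return list(reversed(chosen))

if __name__ == "__main__":
    # Toy example with hand-written object lists; in practice these would come
    # from a named entity recognizer and an object detector, as in the abstract.
    instructions = [["onion", "knife"], ["onion", "pan"], ["pan", "egg"]]
    frames = [["onion"], ["onion", "knife"], ["pan", "onion"], ["egg", "pan"], ["plate"]]
    print(align(instructions, frames))  # -> [1, 2, 3]
```

The strictly increasing frame order in the dynamic program encodes the natural assumption that instructions are executed in the order they appear in the recipe, which is what makes a Viterbi-style alignment applicable here.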