Frame Selection for Producing Recipe with Pictures from an Execution Video of a Recipe
Taichi Nishimura, Atsushi Hashimoto, Yoko Yamakata, Shinsuke Mori
Proceedings of the 11th Workshop on Multimedia for Cooking and Eating Activities, 2019-06-05. DOI: 10.1145/3326458.3326928
Citations: 5
Abstract
In cooking procedure instructions, the text format plays an important role in conveying quantitative information, such as time and quantity, accurately. On the other hand, the image format can convey qualitative information (e.g., the target food state of a procedure) at a glance. Our goal is to produce multimedia recipes, which pair instruction texts with corresponding pictures, so that chefs can better understand the procedures. The system takes a procedural text and its unedited execution video as input and outputs a selected frame for each instruction in the text. We assume that a frame suits an instruction when they share key objects. Under this assumption, we extract key objects from the text with a named entity recognizer and from the frame with an object detector, convert them into feature vectors, and calculate their cosine similarity. To enhance this measure, we also calculate a scene importance score based on the latest changes in object appearance and aggregate it with the cosine similarity. Finally, we align the instruction sequence and the frame sequence with the Viterbi algorithm based on this suitability, obtaining the frame selected for each instruction. We implemented our method and tested it on a dataset consisting of text recipes and their execution videos. In the experiments, we compared the automatic alignment results with those produced by human annotators. The precision, recall, and F-measure show that the proposed approach makes steady progress on the challenging problem of selecting pictures from an unedited video.
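The following is a minimal sketch of the pipeline described in the abstract, not the authors' implementation: it assumes a bag-of-objects feature vector, a simple weighted-sum aggregation of the scene-importance score, and a monotone Viterbi alignment in which frame indices never move backwards across consecutive instructions. The function names (to_feature_vector, suitability, viterbi_align) and the weight parameter are illustrative placeholders.

```python
# Hypothetical sketch of frame selection by instruction-frame alignment.
import numpy as np

def to_feature_vector(objects, vocabulary):
    """Bag-of-objects vector over a fixed key-object vocabulary (assumed representation)."""
    vec = np.zeros(len(vocabulary))
    for obj in objects:
        if obj in vocabulary:
            vec[vocabulary.index(obj)] += 1.0
    return vec

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def suitability(instruction_vec, frame_vec, importance, weight=0.5):
    """Cosine similarity of key-object vectors aggregated with a scene-importance
    score; a simple weighted sum is assumed here."""
    return cosine_similarity(instruction_vec, frame_vec) + weight * importance

def viterbi_align(score):
    """Monotone alignment of instructions (rows) to frames (columns):
    each instruction gets one frame, and frame indices are non-decreasing."""
    n_instr, n_frames = score.shape
    dp = np.full((n_instr, n_frames), -np.inf)
    back = np.zeros((n_instr, n_frames), dtype=int)
    dp[0] = score[0]
    for i in range(1, n_instr):
        best = np.maximum.accumulate(dp[i - 1])   # best previous score up to frame j
        argbest = np.zeros(n_frames, dtype=int)   # frame index achieving that best
        for j in range(1, n_frames):
            argbest[j] = argbest[j - 1] if dp[i - 1, argbest[j - 1]] >= dp[i - 1, j] else j
        dp[i] = best + score[i]
        back[i] = argbest
    # Backtrace: pick the best frame for the last instruction, then follow back-pointers.
    selection = [int(np.argmax(dp[-1]))]
    for i in range(n_instr - 1, 0, -1):
        selection.append(int(back[i, selection[-1]]))
    return selection[::-1]

# Usage: score[i, j] holds the suitability of frame j for instruction i;
# viterbi_align(score) returns one selected frame index per instruction.
```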