A Machine-Learning Pipeline for Semantic-Aware and Contexts-Rich Video Description Method
Yichiet Aun, Y. Khaw, Ming-Lee Gan, Ley-Ter Tin
DOI: 10.1145/3549179.3549182
Proceedings of the 2022 International Conference on Pattern Recognition and Intelligent Systems, 2022-07-29

Abstract: Video description (VD) methods use machine learning to automatically generate sentences that describe video content. Global-description-based VD (gVD) methods generate a global description that conveys the big picture of a video scene, but they lack fine-grained entity information. Meanwhile, modern entity-based VD (eVD) methods use deep learning to train ML models, such as an object detection model (YOLOv3), a human activity model (CNN), and a location tracking model (DeepSORT), to resolve the individual entities that make up complete sentences. However, existing eVD methods are limited in the types of entities they support, so they generate sentences that are context-deprived and too incomplete to clearly describe video scenes. In addition, the entities resolved by eVD are isolated, since they are inferred from different ML models, resulting in sentences that are not semantically cohesive, contextually or grammatically. In this paper, a two-stage ML pipeline (teVD) is proposed for holistic and semantic-aware VD sentence generation. First, an ML pipeline is designed to aggregate several high-performing ML models for resolving fine-grained entities, improving the accuracy of the resolved entities. Second, the components in the entity set are 'stitched' together using an entity trimming method to (1) remove shadow entities and (2) re-arrange entities based on linguistic rules, generating video descriptions that are context-aware and less ambiguous.
The experimental results show that teVD improves the quality of generated sentences for short videos, achieving a BLEU score of 48.01 and a METEOR score of 32.80 on the MSVD dataset.
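The two-stage idea described in the abstract — aggregating entities resolved by separate models, then trimming shadow (duplicate) entities and re-arranging the survivors by linguistic rules — can be illustrated with a minimal sketch. This is not the authors' implementation; the `Entity` structure, the role ordering, and the confidence-based deduplication rule are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    label: str        # e.g. "person", "running", "park"
    role: str         # "subject" | "action" | "location"
    confidence: float
    frame: int

# Stage 1 (sketch): aggregate entities resolved by separate models,
# e.g. an object detector, an activity classifier, and a tracker.
def aggregate(*model_outputs):
    entities = []
    for output in model_outputs:
        entities.extend(output)
    return entities

# Stage 2 (sketch): entity trimming.
# (1) Remove "shadow" entities: duplicate detections of the same
#     (label, role) pair, keeping the highest-confidence instance.
# (2) Re-arrange the survivors by a simple linguistic rule
#     (subject -> action -> location) to form a description.
ROLE_ORDER = {"subject": 0, "action": 1, "location": 2}

def trim_and_stitch(entities):
    best = {}
    for e in entities:
        key = (e.label, e.role)
        if key not in best or e.confidence > best[key].confidence:
            best[key] = e
    ordered = sorted(best.values(), key=lambda e: ROLE_ORDER[e.role])
    return " ".join(e.label for e in ordered)

detections = [
    Entity("person", "subject", 0.91, frame=3),
    Entity("person", "subject", 0.74, frame=4),   # shadow duplicate
    Entity("running", "action", 0.88, frame=3),
    Entity("park", "location", 0.80, frame=3),
]
print(trim_and_stitch(aggregate(detections)))  # -> "person running park"
```

The deduplication key and the fixed subject-action-location ordering stand in for the paper's linguistic rules, which are not detailed in the abstract; a real implementation would order entities using richer grammatical constraints.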