A Machine-Learning Pipeline for Semantic-Aware and Contexts-Rich Video Description Method
Yichiet Aun, Y. Khaw, Ming-Lee Gan, Ley-Ter Tin
DOI: 10.1145/3549179.3549182
Proceedings of the 2022 International Conference on Pattern Recognition and Intelligent Systems, 2022-07-29

Abstract: Video description (VD) methods use machine learning to automatically generate sentences that describe video content. Global-description-based VD (gVD) methods generate a global description that conveys the big picture of a video scene, but they lack fine-grained entity information. Meanwhile, modern entity-based VD (eVD) methods use deep learning to train ML models, such as an object detection model (YOLOv3), a human activity model (CNN), and a location tracking model (DeepSORT), to resolve the individual entities that make up complete sentences. However, existing eVD methods are limited in the types of entities they support, so they generate sentences that are context-deprived and too incomplete to clearly describe video scenes. In addition, the entities resolved by eVD are isolated, since they are inferred from different ML models, resulting in sentences that are not semantically cohesive, contextually or grammatically. In this paper, a two-stage ML pipeline (teVD) is proposed for holistic and semantic-aware VD sentence generation. First, an ML pipeline is designed to aggregate several high-performing ML models for resolving fine-grained entities, improving the accuracy of the resolved entities. Second, the components in the entity set are 'stitched' together using an entity trimming method to (1) remove shadow entities and (2) re-arrange entities based on linguistic rules, generating video descriptions that are context-aware and less ambiguous.
The experimental results show that teVD improves the quality of generated sentences for short videos, achieving a BLEU score of 48.01 and a METEOR score of 32.80 on the MSVD dataset.
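The two-stage idea described in the abstract — aggregating entities resolved by separate models, then trimming shadow (duplicate) entities and re-arranging the survivors by linguistic rules — can be illustrated with a minimal sketch. This is not the authors' implementation; the `Entity` structure, the role ordering, and the confidence-based deduplication rule are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    label: str        # e.g. "person", "running", "park"
    role: str         # "subject" | "action" | "location"
    confidence: float
    frame: int

# Stage 1 (sketch): aggregate entities resolved by separate models,
# e.g. an object detector, an activity classifier, and a tracker.
def aggregate(*model_outputs):
    entities = []
    for output in model_outputs:
        entities.extend(output)
    return entities

# Stage 2 (sketch): entity trimming.
# (1) Remove "shadow" entities: duplicate detections of the same
#     (label, role) pair, keeping the highest-confidence instance.
# (2) Re-arrange the survivors by a simple linguistic rule
#     (subject -> action -> location) to form a description.
ROLE_ORDER = {"subject": 0, "action": 1, "location": 2}

def trim_and_stitch(entities):
    best = {}
    for e in entities:
        key = (e.label, e.role)
        if key not in best or e.confidence > best[key].confidence:
            best[key] = e
    ordered = sorted(best.values(), key=lambda e: ROLE_ORDER[e.role])
    return " ".join(e.label for e in ordered)

detections = [
    Entity("person", "subject", 0.91, frame=3),
    Entity("person", "subject", 0.74, frame=4),   # shadow duplicate
    Entity("running", "action", 0.88, frame=3),
    Entity("park", "location", 0.80, frame=3),
]
print(trim_and_stitch(aggregate(detections)))  # -> "person running park"
```

The deduplication key and the fixed subject-action-location ordering stand in for the paper's linguistic rules, which are not detailed in the abstract; a real implementation would order entities using richer grammatical constraints.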