Fine-grained Video Captioning via Precise Key Point Positioning
Yunjie Zhang, Tiangyang Xu, Xiaoning Song, Zhenghua Feng, Xiaojun Wu
Proceedings of the 4th on Person in Context Workshop, 2022-10-14. DOI: 10.1145/3552455.3555817
In recent years, a variety of strong dense video captioning models have emerged. However, most of these models focus on global features and salient events in the video. In the makeup dataset used in this competition, the video content is highly similar across clips, differing only in fine details. Because existing models lack the ability to attend to such fine-grained features, they generate poor captions for this data. Motivated by this, this paper proposes a key point detection algorithm that locates facial and hand key points in synchronization with video frame extraction, and fuses the detected auxiliary features into the existing features, so that the existing video captioning system can attend to fine-grained details. To further improve caption quality, we use the TSP model to extract more effective video features. Our model outperforms the baseline.
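The abstract does not specify how the detected key points are "encapsulated" into the existing features. One minimal sketch, assuming a simple per-frame concatenation fusion (the function name, feature dimensions, and key point counts below are illustrative assumptions, not the authors' actual design):

```python
import numpy as np

def fuse_features(clip_feats, face_kpts, hand_kpts):
    """Append flattened key point coordinates to per-frame video features.

    clip_feats: (T, D) array of existing per-frame features
                (e.g., from a TSP-style backbone).
    face_kpts:  (T, Nf, 2) array of face key point (x, y) coordinates.
    hand_kpts:  (T, Nh, 2) array of hand key point (x, y) coordinates.
    Returns a (T, D + 2*Nf + 2*Nh) fused feature array.
    """
    T = clip_feats.shape[0]
    # Flatten each frame's key points into one auxiliary feature vector.
    aux = np.concatenate(
        [face_kpts.reshape(T, -1), hand_kpts.reshape(T, -1)], axis=1
    )
    # Concatenate auxiliary key point features onto the clip features.
    return np.concatenate([clip_feats, aux], axis=1)

# Toy example: 8 frames, 512-d clip features,
# 68 face key points and 21 hand key points per frame.
fused = fuse_features(
    np.zeros((8, 512)), np.zeros((8, 68, 2)), np.zeros((8, 21, 2))
)
print(fused.shape)  # (8, 690) since 512 + 68*2 + 21*2 = 690
```

In practice the fused features would be fed to the captioning decoder in place of the original clip features; the paper's actual fusion mechanism may differ (e.g., learned projection or attention rather than concatenation).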