Surbhi Madan, Shreya Ghosh, Lownish Rai Sookha, M. A. Ganaie, Ramanathan Subramanian, Abhinav Dhall, Tom Gedeon
arXiv - CS - Multimedia, published 2024-09-10, arXiv:2409.06224
MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding
Estimating the Most Important Person (MIP) in a social event setting is a
challenging problem, mainly due to contextual complexity and the scarcity of labeled
data. Moreover, the causal aspects of MIP estimation are subjective
and diverse. To this end, we address the problem by annotating a
large-scale 'in-the-wild' dataset for identifying human perceptions of the
'Most Important Person (MIP)' in an image. The paper provides a detailed
description of our proposed Multimodal Large Language Model (MLLM) based data
annotation strategy and a thorough data quality analysis. Further, we perform
a comprehensive benchmarking of the proposed dataset using state-of-the-art
MIP localization methods, which show a significant drop in performance compared
to existing datasets. This drop shows that existing MIP
localization algorithms need to be more robust to 'in-the-wild'
situations. We believe the proposed dataset will play a vital role in building
next-generation social situation understanding methods. The code and data
are available at https://github.com/surbhimadan92/MIP-GAF.