What Makes a Maze Look Like a Maze?
Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu
arXiv:2409.08202 [cs.CV], September 12, 2024
Abstract
A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules that explain what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas: dependency-graph descriptions of abstract concepts that decompose them into more primitive symbols. DSG uses large language models to extract schemas, then hierarchically grounds the schema's components, from concrete to abstract, onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG against other reasoning methods on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.
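
To make the schema idea more concrete, the sketch below shows one minimal way such a dependency graph and its hierarchical, concrete-to-abstract grounding could be organized. It is an illustrative assumption based only on the abstract, not the authors' implementation: the SchemaNode class, the maze/walls/paths decomposition, and the ground() placeholder (standing in for a vision-language model query) are all hypothetical.

```python
# Illustrative sketch only -- not the paper's code. Assumes a schema is a
# dependency graph whose nodes are concept components and whose edges point
# from a component to the more primitive components it depends on.
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    name: str                                        # e.g. "walls", "paths", "maze"
    depends_on: list = field(default_factory=list)   # more primitive components


def topological_order(nodes):
    """Order nodes so each appears after its dependencies, i.e. concrete
    components come before the abstract components built on top of them."""
    ordered, seen = [], set()

    def visit(node):
        if node.name in seen:
            return
        for dep in node.depends_on:
            visit(dep)
        seen.add(node.name)
        ordered.append(node)

    for node in nodes:
        visit(node)
    return ordered


def ground(node, image, groundings):
    """Placeholder (hypothetical) for querying a vision-language model about
    one component, conditioned on the groundings of its dependencies."""
    context = {dep.name: groundings[dep.name] for dep in node.depends_on}
    return f"grounding of '{node.name}' in {image} given {sorted(context)}"


if __name__ == "__main__":
    # Hypothetical schema for "maze": the abstract concept depends on
    # primitive components that are grounded directly in the image.
    walls = SchemaNode("walls")
    paths = SchemaNode("paths")
    maze = SchemaNode("maze", depends_on=[walls, paths])

    image = "path/to/maze_image.jpg"   # stand-in for a real image
    groundings = {}
    for node in topological_order([maze]):
        groundings[node.name] = ground(node, image, groundings)
    print(groundings)
```

Under this reading, the primitive components ("walls", "paths") are grounded first, and the abstract concept ("maze") is grounded last, conditioned on its dependencies; the resulting groundings would then be available to augment downstream question answering.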