{"title":"Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance","authors":"Jaehoon Joo, Taejin Jeong, Seongjae Hwang","doi":"arxiv-2409.12099","DOIUrl":null,"url":null,"abstract":"Understanding how humans process visual information is one of the crucial\nsteps for unraveling the underlying mechanism of brain activity. Recently, this\ncuriosity has motivated the fMRI-to-image reconstruction task; given the fMRI\ndata from visual stimuli, it aims to reconstruct the corresponding visual\nstimuli. Surprisingly, leveraging powerful generative models such as the Latent\nDiffusion Model (LDM) has shown promising results in reconstructing complex\nvisual stimuli such as high-resolution natural images from vision datasets.\nDespite the impressive structural fidelity of these reconstructions, they often\nlack details of small objects, ambiguous shapes, and semantic nuances.\nConsequently, the incorporation of additional semantic knowledge, beyond mere\nvisuals, becomes imperative. In light of this, we exploit how modern LDMs\neffectively incorporate multi-modal guidance (text guidance, visual guidance,\nand image layout) for structurally and semantically plausible image\ngenerations. Specifically, inspired by the two-streams hypothesis suggesting\nthat perceptual and semantic information are processed in different brain\nregions, our framework, Brain-Streams, maps fMRI signals from these brain\nregions to appropriate embeddings. That is, by extracting textual guidance from\nsemantic information regions and visual guidance from perceptual information\nregions, Brain-Streams provides accurate multi-modal guidance to LDMs. We\nvalidate the reconstruction ability of Brain-Streams both quantitatively and\nqualitatively on a real fMRI dataset comprising natural image stimuli and fMRI\ndata.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12099","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Understanding how humans process visual information is a crucial step toward unraveling the underlying mechanisms of brain activity. This curiosity has recently motivated the fMRI-to-image reconstruction task: given fMRI data recorded while a subject views a visual stimulus, the goal is to reconstruct that stimulus. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli, such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often miss the details of small objects, ambiguous shapes, and semantic nuances. Consequently, incorporating additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit the ability of modern LDMs to incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generation. Specifically, inspired by the two-streams hypothesis, which suggests that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings: it extracts textual guidance from regions carrying semantic information and visual guidance from regions carrying perceptual information, thereby providing accurate multi-modal guidance to the LDM. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and the corresponding fMRI recordings.
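
To make the two-stream mapping concrete, below is a minimal sketch of the idea described in the abstract: voxels from semantic-information regions are regressed to a text-guidance embedding, and voxels from perceptual-information regions to a visual-guidance embedding, both of which would then condition an LDM at generation time. All names, voxel counts, embedding dimensions, and the MLP architecture here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VoxelToEmbedding(nn.Module):
    """Small MLP mapping flattened fMRI voxels from one ROI group to a
    target embedding space (architecture is a placeholder assumption)."""
    def __init__(self, n_voxels: int, embed_dim: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_voxels, hidden),
            nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class BrainStreamsSketch(nn.Module):
    """Two parallel 'streams': semantic-region voxels -> text-guidance
    embedding, perceptual-region voxels -> visual-guidance embedding.
    The two embeddings would condition an LDM (e.g. via cross-attention
    and an image-guidance pathway); the LDM itself is omitted here."""
    def __init__(self, n_sem_voxels: int, n_per_voxels: int,
                 text_dim: int = 768, vis_dim: int = 1024):
        super().__init__()
        self.text_head = VoxelToEmbedding(n_sem_voxels, text_dim)
        self.vis_head = VoxelToEmbedding(n_per_voxels, vis_dim)

    def forward(self, sem_voxels, per_voxels):
        return self.text_head(sem_voxels), self.vis_head(per_voxels)

# Usage with hypothetical voxel counts for the two ROI groups.
model = BrainStreamsSketch(n_sem_voxels=6000, n_per_voxels=9000)
sem = torch.randn(4, 6000)  # batch of responses from semantic regions
per = torch.randn(4, 9000)  # batch from perceptual (early visual) regions
text_emb, vis_emb = model(sem, per)
print(text_emb.shape, vis_emb.shape)  # torch.Size([4, 768]) torch.Size([4, 1024])
```

In practice, such heads would be trained to regress the embeddings a pretrained encoder (e.g. a text or image encoder used by the LDM) produces for each stimulus, so that at test time the fMRI-derived embeddings can stand in for the real ones during guided generation.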