{"title":"Spurious reconstruction from brain activity","authors":"Ken Shirakawa , Yoshihiro Nagano , Misato Tanaka , Shuntaro C. Aoki , Yusuke Muraki , Kei Majima , Yukiyasu Kamitani","doi":"10.1016/j.neunet.2025.107515","DOIUrl":null,"url":null,"abstract":"<div><div>Advances in brain decoding, particularly in visual image reconstruction, have sparked discussions about the societal implications and ethical considerations of neurotechnology. As reconstruction methods aim to recover visual experiences from brain activity and achieve prediction beyond training samples (zero-shot prediction), it is crucial to assess their capabilities and limitations to inform public expectations and regulations. Our case study of recent text-guided reconstruction methods, which leverage a large-scale dataset (Natural Scenes Dataset, NSD) and text-to-image diffusion models, reveals critical limitations in their generalizability, demonstrated by poor reconstructions on a different dataset. UMAP visualization of the text features from NSD images shows limited diversity with overlapping semantic and visual clusters between training and test sets. We identify that clustered training samples can lead to “output dimension collapse,” restricting predictable output feature dimensions. While diverse training data improves generalization over the entire feature space without requiring exponential scaling, text features alone prove insufficient for mapping to the visual space. Our findings suggest that the apparent realism in current text-guided reconstructions stems from a combination of classification into trained categories and inauthentic image generation (hallucination) through diffusion models, rather than genuine visual reconstruction. We argue that careful selection of datasets and target features, coupled with rigorous evaluation methods, is essential for achieving authentic visual image reconstruction. These insights underscore the importance of grounding interdisciplinary discussions in a thorough understanding of the technology’s current capabilities and limitations to ensure responsible development.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107515"},"PeriodicalIF":6.0000,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025003946","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Advances in brain decoding, particularly in visual image reconstruction, have sparked discussions about the societal implications and ethical considerations of neurotechnology. As reconstruction methods aim to recover visual experiences from brain activity and achieve prediction beyond training samples (zero-shot prediction), it is crucial to assess their capabilities and limitations to inform public expectations and regulations. Our case study of recent text-guided reconstruction methods, which leverage a large-scale dataset (Natural Scenes Dataset, NSD) and text-to-image diffusion models, reveals critical limitations in their generalizability, demonstrated by poor reconstructions on a different dataset. UMAP visualization of the text features from NSD images shows limited diversity with overlapping semantic and visual clusters between training and test sets. We identify that clustered training samples can lead to “output dimension collapse,” restricting predictable output feature dimensions. While diverse training data improves generalization over the entire feature space without requiring exponential scaling, text features alone prove insufficient for mapping to the visual space. Our findings suggest that the apparent realism in current text-guided reconstructions stems from a combination of classification into trained categories and inauthentic image generation (hallucination) through diffusion models, rather than genuine visual reconstruction. We argue that careful selection of datasets and target features, coupled with rigorous evaluation methods, is essential for achieving authentic visual image reconstruction. These insights underscore the importance of grounding interdisciplinary discussions in a thorough understanding of the technology’s current capabilities and limitations to ensure responsible development.
期刊介绍:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.