Promoting the Shift From Pixel-Level Correlations to Object Semantics Learning by Rethinking Computer Vision Benchmark Data Sets

IF 2.1; Zone 4, Computer Science; Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neural Computation Pub Date : 2024-07-19 DOI:10.1162/neco_a_01677

Maria Osório;Andreas Wichert

Neural Computation, volume 36, issue 8, pages 1626-1642. Journal Article. https://ieeexplore.ieee.org/document/10661262/

Citations: 0

Abstract

In computer vision research, convolutional neural networks (CNNs) have demonstrated remarkable capabilities at extracting patterns from raw pixel data, achieving state-of-the-art recognition accuracy. However, they differ significantly from human visual perception: they prioritize pixel-level correlations and statistical patterns and often overlook object semantics. To explore this difference, we propose an approach that isolates core visual features crucial for human perception and object recognition: color, texture, and shape. In experiments on three benchmarks (Fruits 360, CIFAR-10, and Fashion MNIST), each visual feature is individually input into a neural network. Results reveal data set-dependent variations in classification accuracy, highlighting that deep learning models tend to learn pixel-level correlations instead of fundamental visual features. To validate this observation, we used various combinations of concatenated visual features as input for a neural network on the CIFAR-10 data set. CNNs excel at learning statistical patterns in images, achieving exceptional performance when training and test data share similar distributions. To substantiate this point, we trained a CNN on the CIFAR-10 data set and evaluated its performance on the "dog" class from CIFAR-10 and on an equivalent number of examples from the Stanford Dogs data set. The CNN's poor performance on Stanford Dogs images underlines the disparity between deep learning and human visual perception, highlighting the need for models that learn object semantics. Specialized benchmark data sets with controlled variations hold promise for aligning learned representations with human cognition in computer vision research.
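The abstract does not say how color, texture, and shape are extracted, so the following is only a minimal sketch of the general idea, with every extractor chosen here as an assumption. Color is approximated by a per-channel intensity histogram, shape by a thresholded gradient-magnitude edge map, and texture by a magnitude-weighted histogram of gradient orientations. The function names, bin counts, threshold, and the synthetic test image are all hypothetical and are not taken from the paper.

```python
import numpy as np

def to_gray(img):
    """Luminance of an H x W x 3 float image with values in [0, 1]."""
    return img @ np.array([0.299, 0.587, 0.114])

def gradients(gray):
    """Finite-difference gradients along columns (gx) and rows (gy)."""
    gy, gx = np.gradient(gray)
    return gx, gy

def color_feature(img, bins=8):
    """Assumed color descriptor: per-channel histogram, no spatial layout."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0.0, 1.0))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / h.sum()

def shape_feature(img, thresh=0.1):
    """Assumed shape descriptor: binary edge map that keeps outlines only."""
    gx, gy = gradients(to_gray(img))
    return (np.hypot(gx, gy) > thresh).astype(np.float64)

def texture_feature(img, bins=8):
    """Assumed texture descriptor: orientation histogram weighted by magnitude."""
    gx, gy = gradients(to_gray(img))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)  # angles in [-pi, pi]
    h, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    total = h.sum()
    return h / total if total > 0 else h

# Synthetic 32 x 32 image (the CIFAR-10 resolution): a red square on black.
img = np.zeros((32, 32, 3))
img[8:24, 8:24, 0] = 1.0

color = color_feature(img)      # one feature fed to a network on its own
shape = shape_feature(img)
texture = texture_feature(img)

# Concatenated variant, analogous to the combined-feature experiments.
combined = np.concatenate([color, shape.ravel(), texture])
```

Each descriptor could be passed to a classifier separately, or concatenated as in the last line, to compare how much each cue contributes on a given data set.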


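Source journal

Neural Computation (Engineering and Technology, Computer Science: Artificial Intelligence)

CiteScore

6.30

Self-citation rate

3.40%

Articles published

Review time

3.0 months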

Journal description: Neural Computation is uniquely positioned at the crossroads between neuroscience and TMCS and welcomes the submission of original papers from all areas of TMCS, including: Advanced experimental design; Analysis of chemical sensor data; Connectomic reconstructions; Analysis of multielectrode and optical recordings; Genetic data for cell identity; Analysis of behavioral data; Multiscale models; Analysis of molecular mechanisms; Neuroinformatics; Analysis of brain imaging data; Neuromorphic engineering; Principles of neural coding, computation, circuit dynamics, and plasticity; Theories of brain function.