Pixel shuffling is all you need: spatially aware convmixer for dense prediction tasks

IF 7.5 | Zone 1, Computer Science | JCR Q1, Computer Science, Artificial Intelligence

Pattern Recognition Pub Date : 2024-10-09 DOI:10.1016/j.patcog.2024.111068

Hatem Ibrahem, Ahmed Salem, Hyun-Soo Kang

Pattern Recognition, Volume 158, Article 111068. Full text: https://www.sciencedirect.com/science/article/pii/S0031320324008197

Citations: 0

Abstract

ConvMixer is an extremely simple model that can outperform state-of-the-art convolution-based and vision-transformer-based methods by mixing the input image patches with a standard convolution. This global patch mixing is valid for classification tasks, but it cannot be used for dense prediction tasks because the spatial information of the image is lost in the mixing process. We propose pixel shuffling as a more efficient image patching technique that preserves spatial information. We downsample the input image with pixel shuffle downsampling, which produces the same form as image patches, so that ConvMixer can be extended to dense prediction tasks. We show that pixel shuffle downsampling is more efficient than standard image patching, as it outperforms the original ConvMixer architecture on the CIFAR10 and ImageNet-1k classification tasks. We also propose spatially-aware ConvMixer (SA-ConvMixer) architectures based on efficient pixel shuffle downsampling and upsampling operations for semantic segmentation and monocular depth estimation. We performed extensive experiments to test the proposed architectures on several datasets: Pascal VOC2012, Cityscapes, and ADE20k for semantic segmentation, and NYU-depthV2 and Cityscapes for depth estimation. We show that SA-ConvMixer is efficient enough to reach relatively high accuracy on many tasks in a few training epochs (150∼400). The proposed SA-ConvMixer achieves an ImageNet-1K Top-1 classification accuracy of 87.02%, a mean intersection over union (mIOU) of 87.1% on the PASCAL VOC2012 semantic segmentation task, and an absolute relative error of 0.096 on the NYU depthv2 depth estimation task. The implementation code of the proposed method is available at: https://github.com/HatemHosam/SA-ConvMixer/.
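The pixel shuffle downsampling and upsampling described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see their repository for that); the function names `pixel_unshuffle` and `pixel_shuffle`, the height-width-channel layout, and the channel ordering are assumptions chosen for illustration. The sketch shows that each r×r block of pixels is moved into the channel axis of one output position, so the patch grid keeps its spatial arrangement and the operation can be inverted exactly.

```python
import numpy as np


def pixel_unshuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Pixel shuffle downsampling (space-to-depth).

    Maps an (H, W, C) array to (H/r, W/r, C*r*r). The channel ordering
    used here is an illustrative assumption, not the paper's.
    """
    h, w, c = x.shape
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    x = x.reshape(h // r, r, w // r, r, c)   # split each axis into (block, offset)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/r, W/r, r, r, C)
    return x.reshape(h // r, w // r, r * r * c)


def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Pixel shuffle upsampling (depth-to-space), inverse of pixel_unshuffle."""
    h, w, c = x.shape
    if c % (r * r):
        raise ValueError("channel count must be divisible by r*r")
    x = x.reshape(h, w, r, r, c // (r * r))  # recover the in-block offsets
    x = x.transpose(0, 2, 1, 3, 4)           # (H, r, W, r, C)
    return x.reshape(h * r, w * r, c // (r * r))


# Hypothetical 4x4 RGB input, downsampled by a factor of 2.
image = np.arange(4 * 4 * 3).reshape(4, 4, 3)
patches = pixel_unshuffle(image, 2)          # shape (2, 2, 12)
restored = pixel_shuffle(patches, 2)         # shape (4, 4, 3)
```

Because the rearrangement only moves values and never averages or discards them, `restored` equals `image` element for element, which is the property that makes the operation usable for dense prediction outputs at full resolution.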


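Source journal

Pattern Recognition (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Publication volume: 683

Review time: 5.6 months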

Journal description: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.