{"title":"Pixel shuffling is all you need: spatially aware convmixer for dense prediction tasks","authors":"Hatem Ibrahem, Ahmed Salem, Hyun-Soo Kang","doi":"10.1016/j.patcog.2024.111068","DOIUrl":null,"url":null,"abstract":"<div><div>ConvMixer is an extremely simple model that could perform better than the state-of-the-art convolutional-based and vision transformer-based methods thanks to mixing the input image patches using a standard convolution. The global mixing process of the patches is only valid for the classification tasks, but it cannot be used for dense prediction tasks as the spatial information of the image is lost in the mixing process. We propose a more efficient technique for image patching, known as pixel shuffling, as it can preserve spatial information. We downsample the input image using the pixel shuffle downsampling in the same form of image patches so that the ConvMixer can be extended for the dense prediction tasks. This paper proves that pixel shuffle downsampling is more efficient than the standard image patching as it outperforms the original ConvMixer architecture in the CIFAR10 and ImageNet-1k classification tasks. We also suggest spatially-aware ConvMixer architectures based on efficient pixel shuffle downsampling and upsampling operations for semantic segmentation and monocular depth estimation. We performed extensive experiments to test the proposed architectures on several datasets; Pascal VOC2012, Cityscapes, and ADE20k for semantic segmentation, NYU-depthV2, and Cityscapes for depth estimation. We show that SA-ConvMixer is efficient enough to get relatively high accuracy at many tasks in a few training epochs (150<span><math><mo>∼</mo></math></span>400). The proposed SA-ConvMixer could achieve an ImageNet-1K Top-1 classification accuracy of 87.02%, mean intersection over union (mIOU) of 87.1% in the PASCAL VOC2012 semantic segmentation task, and absolute relative error of 0.096 in the NYU depthv2 depth estimation task. The implementation code of the proposed method is available at: <span><span>https://github.com/HatemHosam/SA-ConvMixer/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"158 ","pages":"Article 111068"},"PeriodicalIF":7.5000,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320324008197","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
ConvMixer is an extremely simple model that could perform better than the state-of-the-art convolutional-based and vision transformer-based methods thanks to mixing the input image patches using a standard convolution. The global mixing process of the patches is only valid for the classification tasks, but it cannot be used for dense prediction tasks as the spatial information of the image is lost in the mixing process. We propose a more efficient technique for image patching, known as pixel shuffling, as it can preserve spatial information. We downsample the input image using the pixel shuffle downsampling in the same form of image patches so that the ConvMixer can be extended for the dense prediction tasks. This paper proves that pixel shuffle downsampling is more efficient than the standard image patching as it outperforms the original ConvMixer architecture in the CIFAR10 and ImageNet-1k classification tasks. We also suggest spatially-aware ConvMixer architectures based on efficient pixel shuffle downsampling and upsampling operations for semantic segmentation and monocular depth estimation. We performed extensive experiments to test the proposed architectures on several datasets; Pascal VOC2012, Cityscapes, and ADE20k for semantic segmentation, NYU-depthV2, and Cityscapes for depth estimation. We show that SA-ConvMixer is efficient enough to get relatively high accuracy at many tasks in a few training epochs (150400). The proposed SA-ConvMixer could achieve an ImageNet-1K Top-1 classification accuracy of 87.02%, mean intersection over union (mIOU) of 87.1% in the PASCAL VOC2012 semantic segmentation task, and absolute relative error of 0.096 in the NYU depthv2 depth estimation task. The implementation code of the proposed method is available at: https://github.com/HatemHosam/SA-ConvMixer/.
期刊介绍:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.