Guowei Zhang, Wuzhi Li, Yutong Tang, Shuixuan Chen, Li Wang
The Visual Computer, published 2024-08-28. DOI: 10.1007/s00371-024-03602-0 (https://doi.org/10.1007/s00371-024-03602-0)
Lightweight CNN-ViT with cross-module representational constraint for express parcel detection
An express parcel (EP) detection model must be deployed on edge devices with limited computing capability, so a lightweight and efficient object detection model is essential. In this work, we introduce CMViT, a novel lightweight CNN-ViT with a cross-module representational constraint designed specifically for EP detection. In CMViT, we draw on the concept of cross-attention from multimodal models and propose a new cross-module attention (CMA) encoder. Local features are provided by the proposed lightweight shuffle block (LSBlock), and the CMA encoder flexibly connects local and global features from the hybrid CNN-ViT model through self-attention, constructing a robust dependency between local and global features and thereby effectively enhancing the model's receptive field. Furthermore, LSBlock provides effective guidance and constraints for the CMA encoder, avoiding unnecessary attention to redundant information and reducing computational cost. In EP detection, compared with YOLOv8s, CMViT achieves 99% mean accuracy with 25% of the input resolution, 54.5% of the parameters, and 14.7% of the FLOPs, showing superior performance and promising applications. On the more challenging COCO object detection dataset, CMViT achieves 28.8 mAP at 2.2G MAdds, outperforming MobileViT by 4% in accuracy while consuming less computation. Code is available at: https://github.com/Acc2386/CMViT.
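To make the cross-attention idea mentioned in the abstract concrete, the following is a minimal sketch of generic scaled dot-product cross-attention in plain Python. It is not the authors' CMA encoder: the function names, the toy inputs, and the choice to take queries from global tokens and keys/values from local features are all assumptions made for illustration, and the actual design in the CMViT repository may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: n_q vectors of dimension d   (assumed here: global tokens)
    keys:    n_k vectors of dimension d   (assumed here: local features)
    values:  n_k vectors of dimension d_v (assumed here: local features)
    Returns (outputs, weights): n_q fused vectors and the attention weights.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    d_v = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by 1/sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(d_v)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical toy inputs: 2 "global" tokens attend over 3 "local" features.
global_tokens = [[1.0, 0.0], [0.0, 1.0]]
local_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = cross_attention(global_tokens, local_features, local_features)
```

In this sketch each output row is a convex combination of the local feature vectors, so the keys and values determine what the queries can attend to. That is one loose way to read the abstract's statement that LSBlock provides guidance and constraints for the CMA encoder, though the paper's actual mechanism is not specified here.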