采用跨模块表征约束的轻量级 CNN-ViT,用于快递包裹检测

Guowei Zhang, Wuzhi Li, Yutong Tang, Shuixuan Chen, Li Wang
{"title":"采用跨模块表征约束的轻量级 CNN-ViT,用于快递包裹检测","authors":"Guowei Zhang, Wuzhi Li, Yutong Tang, Shuixuan Chen, Li Wang","doi":"10.1007/s00371-024-03602-0","DOIUrl":null,"url":null,"abstract":"<p>The express parcel(EP) detection model needs to be deployed on edge devices with limited computing capabilities, hence a lightweight and efficient object detection model is essential. In this work, we introduce a novel lightweight CNN-ViT with cross-module representational constraint designed specifically for EP detection—CMViT. In CMViT, we draw on the concept of cross-attention from multimodal models and propose a new cross-module attention(CMA) encoder. Local features are provided by the proposed lightweight shuffle block(LSBlock), and CMA encoder flexibly connects local and global features from the hybrid CNN-ViT model through self-attention, constructing a robust dependency between local and global features, thereby effectively enhancing the model’s receptive field. Furthermore, LSBlock provides effective guidance and constraints for CMA encoder, avoiding unnecessary attention to redundant information and reducing computational cost. In EP detection, compared to YOLOv8s, CMViT achieves 99% mean accuracy with a 25% input resolution, 54.5% of the parameters, and 14.7% of the FLOPs, showing superior performance and promising applications. In more challenging object detection tasks, CMViT exhibits exceptional performance, achieving 28.8 mAP and 2.2G MAdds on COCO dataset, thus outperforming MobileViT by 4% in accuracy while consuming less computational power. Code is available at: https://github.com/Acc2386/CMViT.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Lightweight CNN-ViT with cross-module representational constraint for express parcel detection\",\"authors\":\"Guowei Zhang, Wuzhi Li, Yutong Tang, Shuixuan Chen, Li Wang\",\"doi\":\"10.1007/s00371-024-03602-0\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The express parcel(EP) detection model needs to be deployed on edge devices with limited computing capabilities, hence a lightweight and efficient object detection model is essential. In this work, we introduce a novel lightweight CNN-ViT with cross-module representational constraint designed specifically for EP detection—CMViT. In CMViT, we draw on the concept of cross-attention from multimodal models and propose a new cross-module attention(CMA) encoder. Local features are provided by the proposed lightweight shuffle block(LSBlock), and CMA encoder flexibly connects local and global features from the hybrid CNN-ViT model through self-attention, constructing a robust dependency between local and global features, thereby effectively enhancing the model’s receptive field. Furthermore, LSBlock provides effective guidance and constraints for CMA encoder, avoiding unnecessary attention to redundant information and reducing computational cost. In EP detection, compared to YOLOv8s, CMViT achieves 99% mean accuracy with a 25% input resolution, 54.5% of the parameters, and 14.7% of the FLOPs, showing superior performance and promising applications. In more challenging object detection tasks, CMViT exhibits exceptional performance, achieving 28.8 mAP and 2.2G MAdds on COCO dataset, thus outperforming MobileViT by 4% in accuracy while consuming less computational power. Code is available at: https://github.com/Acc2386/CMViT.</p>\",\"PeriodicalId\":501186,\"journal\":{\"name\":\"The Visual Computer\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"The Visual Computer\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s00371-024-03602-0\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Visual Computer","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s00371-024-03602-0","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

快递包裹(EP)检测模型需要部署在计算能力有限的边缘设备上,因此一个轻量级、高效的物体检测模型至关重要。在这项工作中,我们介绍了一种新型的轻量级 CNN-ViT,它具有跨模块表示约束,专为 EP 检测而设计--CMViT。在 CMViT 中,我们借鉴了多模态模型中交叉注意力的概念,并提出了一种新的跨模块注意力(CMA)编码器。本地特征由提出的轻量级洗牌块(LSBlock)提供,而 CMA 编码器通过自我注意灵活地连接了 CNN-ViT 混合模型的本地特征和全局特征,在本地特征和全局特征之间构建了稳健的依赖关系,从而有效地增强了模型的感受野。此外,LSBlock 还为 CMA 编码器提供了有效的指导和约束,避免了对冗余信息的不必要关注,降低了计算成本。在 EP 检测中,与 YOLOv8s 相比,CMViT 以 25% 的输入分辨率、54.5% 的参数和 14.7% 的 FLOPs 实现了 99% 的平均准确率,表现出卓越的性能和广阔的应用前景。在更具挑战性的物体检测任务中,CMViT 表现出卓越的性能,在 COCO 数据集上实现了 28.8 mAP 和 2.2G MAdds,从而在准确度上比 MobileViT 高出 4%,同时消耗更少的计算能力。代码见:https://github.com/Acc2386/CMViT。
本文章由计算机程序翻译,如有差异,请以英文原文为准。

Lightweight CNN-ViT with cross-module representational constraint for express parcel detection

Lightweight CNN-ViT with cross-module representational constraint for express parcel detection

The express parcel(EP) detection model needs to be deployed on edge devices with limited computing capabilities, hence a lightweight and efficient object detection model is essential. In this work, we introduce a novel lightweight CNN-ViT with cross-module representational constraint designed specifically for EP detection—CMViT. In CMViT, we draw on the concept of cross-attention from multimodal models and propose a new cross-module attention(CMA) encoder. Local features are provided by the proposed lightweight shuffle block(LSBlock), and CMA encoder flexibly connects local and global features from the hybrid CNN-ViT model through self-attention, constructing a robust dependency between local and global features, thereby effectively enhancing the model’s receptive field. Furthermore, LSBlock provides effective guidance and constraints for CMA encoder, avoiding unnecessary attention to redundant information and reducing computational cost. In EP detection, compared to YOLOv8s, CMViT achieves 99% mean accuracy with a 25% input resolution, 54.5% of the parameters, and 14.7% of the FLOPs, showing superior performance and promising applications. In more challenging object detection tasks, CMViT exhibits exceptional performance, achieving 28.8 mAP and 2.2G MAdds on COCO dataset, thus outperforming MobileViT by 4% in accuracy while consuming less computational power. Code is available at: https://github.com/Acc2386/CMViT.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信