End-to-end multitasking network for smart container product positioning and segmentation

IF 1; Zone 4, Computer Science; Q4 ENGINEERING, ELECTRICAL & ELECTRONIC

Journal of Electronic Imaging Pub Date : 2024-09-01 DOI:10.1117/1.jei.33.5.053009

Wenzhong Shen, Xuejian Cai

Citations: 0

Abstract

Current commodity identification systems for smart coolers first locate the item being purchased and then perform feature extraction and matching. However, this approach is often inaccurate because the detection box includes background, leading to missed detections and misidentifications. To address these issues, we propose an end-to-end You Only Look Once (YOLO) algorithm for detection and segmentation. In the backbone network, we combine deformable convolution with a channel-to-pixel (C2f) module to enhance the model’s feature extraction capability. In the neck network, we introduce an optimized feature fusion structure based on the weighted bi-directional feature pyramid. To further enhance the model’s understanding of both global and local context, a triple feature encoding module fuses multi-scale features. The convolutional block attention module is connected to the improved C2f module to strengthen the network’s attention to channel and spatial information in commodity images. A supplementary segmentation branch is incorporated into the head of the network, allowing it not only to detect targets within the image but also to generate a precise segmentation mask for each detected object, thereby adding multi-task capability. Compared with YOLOv8, for box and mask respectively, precision increases by 3% and 4.7%, recall increases by 2.8% and 4.7%, and mean average precision (mAP) increases by 4.9% and 14%. The model runs at 119 frames per second, which meets the demand for real-time detection. Comparative and ablation studies confirm the accuracy and performance of the proposed algorithm, providing a foundation for fine-grained commodity identification.
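The abstract does not spell out how the weighted bi-directional feature pyramid combines its inputs. As an illustration only, the sketch below shows the "fast normalized fusion" rule commonly associated with weighted bi-directional feature pyramids: each input feature gets a learnable non-negative weight, and the weights are normalized by their sum. The function name `fast_normalized_fusion`, the flat-list feature representation, and the `eps` value are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence


def fast_normalized_fusion(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-4,  # assumed small constant to avoid division by zero
) -> List[float]:
    """Fuse equally sized feature vectors as sum(w_i * f_i) / (eps + sum(w_i)).

    Hypothetical helper: real networks apply this to feature maps that have
    already been resized to a common resolution.
    """
    if not features or len(features) != len(weights):
        raise ValueError("need one weight per feature vector")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all feature vectors must have the same length")

    # ReLU keeps every weight non-negative so the normalization is stable.
    clipped = [max(0.0, w) for w in weights]
    total = sum(clipped) + eps

    # Weighted sum of the inputs at each position, divided by the weight total.
    return [
        sum(w * f[j] for w, f in zip(clipped, features)) / total
        for j in range(length)
    ]


if __name__ == "__main__":
    shallow = [1.0, 2.0]  # toy stand-in for a high-resolution feature
    deep = [3.0, 4.0]     # toy stand-in for an upsampled low-resolution feature
    print(fast_normalized_fusion([shallow, deep], [1.0, 1.0]))
```

With equal weights the result is close to the element-wise mean of the inputs; a weight that training drives below zero is clipped and its input stops contributing.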

Source journal

Journal of Electronic Imaging (Engineering & Technology: Imaging Science & Photographic Technology)

CiteScore

1.70

Self-citation rate

27.30%

Articles published

341

Review time

4.0 months

About the journal: The Journal of Electronic Imaging publishes peer-reviewed papers in all technology areas that make up the field of electronic imaging and are normally considered in the design, engineering, and applications of electronic imaging systems.