VMamba-Crowd: Bridging multi-scale features from Visual Mamba for weakly-supervised crowd counting

Zhanqiang Huo, Chunxin Yuan, Kunwei Zhang, Yingxu Qiao, Fen Luo

Pattern Recognition Letters, Volume 197, Pages 297-303. DOI: 10.1016/j.patrec.2025.08.005. Published 2025-08-29.
Weakly-supervised crowd counting requires only count-level annotations rather than location-level annotations, which has made it a new research hotspot in the field of crowd counting. Currently, most deep-learning-based weakly-supervised crowd counting networks use CNNs and/or Transformers to extract features and build global context, overlooking multi-scale feature fusion and therefore yielding suboptimal feature representation and utilization. The more recent Mamba model, leveraging its selective state-space mechanism, excels at feature extraction in image processing tasks, particularly at capturing multi-scale features without relying on self-attention. In this paper, we introduce carefully selected multi-scale features extracted from Visual Mamba into the weakly-supervised crowd counting task for the first time and propose the VMamba-Crowd model. Specifically, the Adjacent-scale Progressive Bridging Module (APBM) progressively facilitates interactions between adjacent high-level semantic and low-level detail information across both the channel and spatial dimensions. The Mixed Regression Bridging Module (MRBM) performs a secondary mixed regression to bridge multi-scale global feature information. Extensive experiments demonstrate that VMamba-Crowd surpasses most existing weakly-supervised crowd counting networks and achieves competitive performance compared with fully-supervised ones. In particular, cross-dataset experiments confirm that our weakly-supervised method has remarkable generalization ability.
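The abstract's two ingredients — gated interaction between adjacent-scale feature maps along channel and spatial dimensions, and a count-level regression head — can be illustrated with a toy sketch. This is a minimal NumPy illustration of the general idea only, not the paper's actual APBM/MRBM design: the function names, the specific gating scheme (coarse features gate channels, fine features gate spatial positions), and the pooled linear head are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def bridge_adjacent_scales(fine, coarse):
    """Toy adjacent-scale bridging (hypothetical, not the paper's APBM):
    the coarse (high-level semantic) map gates the fine (low-level detail)
    map along the channel axis, while the fine map gates the upsampled
    coarse map along the spatial axes."""
    coarse_up = upsample2x(coarse)                       # (C, H, W)
    # Channel attention from the coarse branch: global average pool + sigmoid.
    ch_attn = sigmoid(coarse_up.mean(axis=(1, 2)))[:, None, None]
    # Spatial attention from the fine branch: mean over channels + sigmoid.
    sp_attn = sigmoid(fine.mean(axis=0))[None, :, :]
    return fine * ch_attn + coarse_up * sp_attn          # fused (C, H, W)

def regress_count(feat, w, b):
    # Count-level head: global average pool followed by a linear layer,
    # producing a single scalar crowd count (no location supervision).
    pooled = feat.mean(axis=(1, 2))                      # (C,)
    return float(pooled @ w + b)

C, H, W = 8, 16, 16
fine = rng.standard_normal((C, H, W))          # detailed, high-resolution scale
coarse = rng.standard_normal((C, H // 2, W // 2))  # semantic, low-resolution scale
fused = bridge_adjacent_scales(fine, coarse)
w, b = rng.standard_normal(C), 0.5
count = regress_count(fused, w, b)
print(fused.shape, count)
```

In a real multi-stage backbone this bridging would be applied progressively from the deepest scale upward, so each fused map carries both semantic context and spatial detail before the final regression.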
About the journal:
Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.