VMamba-Crowd: Bridging multi-scale features from Visual Mamba for weakly-supervised crowd counting

Zhanqiang Huo, Chunxin Yuan, Kunwei Zhang, Yingxu Qiao, Fen Luo

Pattern Recognition Letters, Volume 197, Pages 297-303. DOI: 10.1016/j.patrec.2025.08.005. Published 2025-08-29.
Weakly-supervised crowd counting requires only count-level annotations rather than location-level annotations, which has made it a new research hotspot in the field of crowd counting. Currently, most deep-learning-based weakly-supervised crowd counting networks use CNNs and/or Transformers to extract features and build global context, overlooking multi-scale feature fusion and therefore yielding suboptimal feature representation and utilization. The more recent Mamba model, leveraging its selective state-space mechanism, excels at feature extraction in image processing tasks, particularly at capturing multi-scale features without relying on self-attention. In this paper, we introduce carefully selected multi-scale features extracted from Visual Mamba into the weakly-supervised crowd counting task for the first time and propose the VMamba-Crowd model. Specifically, the Adjacent-scale Progressive Bridging Module (APBM) progressively facilitates interactions between adjacent high-level semantic and low-level detail information across both the channel and spatial dimensions. The Mixed Regression Bridging Module (MRBM) performs a secondary mixed regression to bridge multi-scale global feature information. Extensive experiments demonstrate that VMamba-Crowd surpasses most existing weakly-supervised crowd counting networks and achieves competitive performance compared with fully-supervised ones. In particular, cross-dataset experiments confirm that our weakly-supervised method has remarkable generalization ability.
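The abstract's two ingredients — gated interaction between adjacent-scale feature maps along channel and spatial dimensions, and a count-level regression head — can be illustrated with a toy sketch. This is a minimal NumPy illustration of the general idea only, not the paper's actual APBM/MRBM design: the function names, the specific gating scheme (coarse features gate channels, fine features gate spatial positions), and the pooled linear head are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def bridge_adjacent_scales(fine, coarse):
    """Toy adjacent-scale bridging (hypothetical, not the paper's APBM):
    the coarse (high-level semantic) map gates the fine (low-level detail)
    map along the channel axis, while the fine map gates the upsampled
    coarse map along the spatial axes."""
    coarse_up = upsample2x(coarse)                       # (C, H, W)
    # Channel attention from the coarse branch: global average pool + sigmoid.
    ch_attn = sigmoid(coarse_up.mean(axis=(1, 2)))[:, None, None]
    # Spatial attention from the fine branch: mean over channels + sigmoid.
    sp_attn = sigmoid(fine.mean(axis=0))[None, :, :]
    return fine * ch_attn + coarse_up * sp_attn          # fused (C, H, W)

def regress_count(feat, w, b):
    # Count-level head: global average pool followed by a linear layer,
    # producing a single scalar crowd count (no location supervision).
    pooled = feat.mean(axis=(1, 2))                      # (C,)
    return float(pooled @ w + b)

C, H, W = 8, 16, 16
fine = rng.standard_normal((C, H, W))          # detailed, high-resolution scale
coarse = rng.standard_normal((C, H // 2, W // 2))  # semantic, low-resolution scale
fused = bridge_adjacent_scales(fine, coarse)
w, b = rng.standard_normal(C), 0.5
count = regress_count(fused, w, b)
print(fused.shape, count)
```

In a real multi-stage backbone this bridging would be applied progressively from the deepest scale upward, so each fused map carries both semantic context and spatial detail before the final regression.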
About the journal:
Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.