Heterogeneous Dual-Attentional Network for WiFi and Video-Fused Multi-Modal Crowd Counting
Lifei Hao; Baoqi Huang; Bing Jia; Guoqiang Mao
IEEE Transactions on Mobile Computing, vol. 23, no. 12, pp. 14233-14247, published 2024-08-15. DOI: 10.1109/TMC.2024.3444469
https://ieeexplore.ieee.org/document/10637758/
Crowd counting aims to estimate the number of individuals in a target area. Mainstream vision-based methods suffer from limited coverage and from the difficulty of multi-camera collaboration, which limits their scalability, whereas emerging WiFi-based methods yield only coarse results because of signal randomness. To overcome the inherent limitations of unimodal approaches and exploit the advantages of multi-modal sensing, this paper presents a WiFi and video-fused multi-modal paradigm built on a heterogeneous dual-attentional network, which jointly models the intra- and inter-modality relationships of global WiFi measurements and local videos to achieve accurate and stable large-scale crowd counting. First, a flexible hybrid sensing network is constructed to capture synchronized multi-modal measurements that characterize the same crowd at different scales and from different perspectives. Second, differential preprocessing, heterogeneous feature extractors, and self-attention mechanisms are applied in sequence to extract and refine modality-independent, crowd-related features. Third, a cross-attention mechanism is employed to deeply fuse the two modalities and generalize the matching relationships between them. Extensive real-world experiments show that, compared to the best WiFi unimodal baseline, our method reduces the error by 26.2%, improves stability by 48.43%, and achieves an accuracy of about 88% in large-scale crowd counting when videos from two cameras are included.
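The dual-attention idea in the abstract (self-attention within each modality, then cross-attention between them) can be illustrated with a minimal NumPy sketch. This is not the authors' network: the token counts, the feature width `d`, the random inputs, and the choice of WiFi features as queries over video features are all assumptions made for illustration, and the learned projection matrices and feature extractors of a real model are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: q is (n_q, d), k and v are (n_k, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 16                            # hypothetical shared feature width
wifi = rng.normal(size=(8, d))    # hypothetical: 8 WiFi feature tokens
video = rng.normal(size=(20, d))  # hypothetical: 20 video feature tokens

# Intra-modality step: each modality attends to itself.
wifi_sa, _ = attention(wifi, wifi, wifi)
video_sa, _ = attention(video, video, video)

# Inter-modality step (assumed direction): WiFi tokens query the video tokens.
fused, cross_weights = attention(wifi_sa, video_sa, video_sa)

print(fused.shape, cross_weights.shape)
```

Each row of `cross_weights` describes how strongly one WiFi token draws on each video token. A trained model would feed `fused` to a regression head to produce the count, which this sketch leaves out.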
About the journal:
IEEE Transactions on Mobile Computing addresses key technical issues related to various aspects of mobile computing. This includes (a) architectures, (b) support services, (c) algorithm/protocol design and analysis, (d) mobile environments, (e) mobile communication systems, (f) applications, and (g) emerging technologies. Topics of interest span a wide range, covering aspects like mobile networks and hosts, mobility management, multimedia, operating system support, power management, online and mobile environments, security, scalability, reliability, and emerging technologies such as wearable computers, body area networks, and wireless sensor networks. The journal serves as a comprehensive platform for advancements in mobile computing research.