Why Existing Multimodal Crowd Counting Datasets Can Lead to Unfulfilled Expectations in Real-World Applications

Computer Science Research Notes. Pub Date: 2023-07-01. DOI: 10.24132/csrn.3301.5

Martin Thißen, Elke Hergenröther



Abstract

More information leads to better decisions and predictions, right? Confirming this hypothesis, several studies concluded that the simultaneous use of optical and thermal images leads to better predictions in crowd counting. However, the way multimodal models extract enriched features from both modalities is not yet fully understood. Since the use of multimodal data usually increases the complexity, inference time, and memory requirements of the models, it is relevant to examine the differences and advantages of multimodal compared to monomodal models. In this work, all available multimodal datasets for crowd counting are used to investigate the differences between monomodal and multimodal models. To do so, we designed a monomodal architecture that considers the current state of research on monomodal crowd counting. In addition, several multimodal architectures have been developed using different multimodal learning strategies. The key components of the monomodal architecture are also used in the multimodal architectures to be able to answer whether multimodal models perform better in crowd counting in general. Surprisingly, no general answer to this question can be derived from the existing datasets. We found that the existing datasets hold a bias toward thermal images. This was determined by analyzing the relationship between the brightness of optical images and crowd count as well as examining the annotations made for each dataset. Since answering this question is important for future real-world applications of crowd counting, this paper establishes criteria for a potential dataset suitable for answering whether multimodal models perform better in crowd counting in general.
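The abstract does not spell out how the relationship between optical-image brightness and crowd count was analyzed. As a minimal sketch of one way such a check could be set up, the example below computes a mean luma per image and its Pearson correlation with the annotated count. The function names, the Rec. 601 luma weights, the choice of Pearson correlation, and the synthetic data are all assumptions for illustration, not the authors' procedure or results.

```python
import numpy as np


def mean_brightness(image):
    """Mean luma of an RGB image (H x W x 3, values 0-255).

    Uses Rec. 601 luma weights; this choice is an assumption, the source
    does not say how brightness was measured.
    """
    rgb = np.asarray(image, dtype=np.float64)
    luma = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return float(luma.mean())


def brightness_count_correlation(images, counts):
    """Pearson correlation between per-image brightness and annotated count."""
    brightness = np.array([mean_brightness(img) for img in images])
    counts = np.asarray(counts, dtype=np.float64)
    return float(np.corrcoef(brightness, counts)[0, 1])


# Synthetic stand-in data (NOT taken from any real dataset): uniform gray
# frames whose hypothetical annotated count shrinks as the frame gets brighter.
levels = [10, 60, 120, 200]
images = [np.full((4, 4, 3), v, dtype=np.uint8) for v in levels]
counts = [80, 55, 25, 5]

r = brightness_count_correlation(images, counts)
print(f"Pearson r = {r:.3f}")
```

On a real dataset, `images` would be the optical frames and `counts` the number of annotated heads per frame. A strong correlation in either direction would hint that lighting conditions and crowd size are entangled in the data, which is the kind of dependency that could favor one modality over the other.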

