Computer Vision and Image Understanding: Latest Articles

M3A: A multimodal misinformation dataset for media authenticity analysis
IF 4.3, Zone 3, Computer Science
Computer Vision and Image Understanding Pub Date : 2024-10-15 DOI: 10.1016/j.cviu.2024.104205
Abstract: With the development of various generative models, misinformation in news media becomes more deceptive and easier to create, posing a significant problem. However, existing datasets for misinformation study often have limited modalities, constrained sources, and a narrow range of topics. These limitations make it difficult to train models that can effectively combat real-world misinformation. To address this, we propose a comprehensive, large-scale Multimodal Misinformation dataset for Media Authenticity Analysis (M³A), featuring broad sources and fine-grained annotations for topics and sentiments. To curate M³A, we collect genuine news content from 60 renowned news outlets worldwide and generate fake samples using multiple techniques. These include altering named entities in texts, swapping modalities between samples, creating new modalities, and misrepresenting movie content as news. M³A contains 708K genuine news samples and over 6M fake news samples, spanning text, images, audio, and video. M³A provides detailed multi-class labels, crucial for various misinformation detection tasks, including out-of-context detection and deepfake detection. For each task, we offer extensive benchmarks using state-of-the-art models, aiming to enhance the development of robust misinformation detection systems.
Citations: 0
Region-aware image-based human action retrieval with transformers
IF 4.3, Zone 3, Computer Science
Computer Vision and Image Understanding Pub Date : 2024-10-14 DOI: 10.1016/j.cviu.2024.104202
Abstract: Human action understanding is a fundamental and challenging task in computer vision. Although there is extensive research in this area, most works focus on action recognition, while action retrieval has received less attention. In this paper, we focus on the neglected but important task of image-based action retrieval, which aims to find images that depict the same action as a query image. We establish benchmarks for this task and set up important baseline methods for fair comparison. We present a Transformer-based model that learns rich action representations from three aspects: the anchored person, contextual regions, and the global image. A fusion transformer is designed to model the relationships among different features and effectively fuse them into an action representation. Experiments on both the Stanford-40 and PASCAL VOC 2012 Action datasets show that the proposed method significantly outperforms previous approaches for image-based action retrieval.
Citations: 0
A simple but effective vision transformer framework for visible–infrared person re-identification
IF 4.3, Zone 3, Computer Science
Computer Vision and Image Understanding Pub Date : 2024-10-11 DOI: 10.1016/j.cviu.2024.104192
Abstract: In the context of visible–infrared person re-identification (VI-ReID), the acquisition of a robust visual representation is paramount. Existing approaches predominantly rely on convolutional neural networks (CNNs), which are guided by intricately designed loss functions to extract features. In contrast, the vision transformer (ViT), a potent visual backbone, has often yielded subpar results in VI-ReID. We contend that the prevailing training methodologies and insights derived from CNNs do not seamlessly apply to ViT, leading to the underutilization of its potential in VI-ReID. One notable limitation is ViT’s appetite for extensive data, exemplified by the JFT-300M dataset, to surpass CNNs. Consequently, ViT struggles to transfer its knowledge from visible to infrared images due to inadequate training data. Even the largest available dataset, SYSU-MM01, proves insufficient for ViT to glean a robust representation of infrared images. This predicament is exacerbated when ViT is trained on the smaller RegDB dataset, where slight data flow modifications drastically affect performance, in stark contrast to CNN behavior. These observations lead us to conjecture that the CNN-inspired paradigm impedes ViT’s progress in VI-ReID. In light of these challenges, we undertake comprehensive ablation studies to shed new light on ViT’s applicability in VI-ReID. We propose a straightforward yet effective framework, named “Idformer”, to train a high-performing ViT for VI-ReID. Idformer serves as a robust baseline that can be further enhanced with carefully designed techniques akin to those used for CNNs. Remarkably, our method attains competitive results even in the absence of auxiliary information, achieving 78.58%/76.99% Rank-1/mAP on the SYSU-MM01 dataset, as well as 96.82%/91.83% Rank-1/mAP on the RegDB dataset. The code will be made publicly accessible.
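For readers unfamiliar with the Rank-1/mAP pairs quoted in this abstract, the following is a minimal sketch of how the two retrieval metrics are commonly computed from a query-to-gallery distance matrix. The function name `rank1_and_map` and its arguments are illustrative assumptions, not taken from the paper, and the official SYSU-MM01 and RegDB protocols add dataset-specific gallery rules that are not modeled here.

```python
def rank1_and_map(dist, query_ids, gallery_ids):
    """Return (Rank-1 accuracy, mean average precision) for a retrieval run.

    dist[i][j] is the distance between query i and gallery item j
    (smaller means more similar). Hypothetical helper for illustration.
    """
    rank1_hits = 0
    average_precisions = []
    for row, qid in zip(dist, query_ids):
        # Rank gallery items from most to least similar to this query.
        order = sorted(range(len(gallery_ids)), key=lambda j: row[j])
        matches = [gallery_ids[j] == qid for j in order]
        if not any(matches):
            continue  # a query with no true match in the gallery is skipped
        rank1_hits += int(matches[0])
        # Average precision: mean of precision@k over the ranks k that hold a match.
        hits = 0
        precisions = []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precisions.append(hits / rank)
        average_precisions.append(sum(precisions) / len(precisions))
    if not average_precisions:
        raise ValueError("no query has a matching gallery item")
    n = len(average_precisions)
    return rank1_hits / n, sum(average_precisions) / n


# Toy run: two queries against a three-item gallery.
dist = [[0.1, 0.9, 0.4],
        [0.2, 0.8, 0.3]]
rank1, mean_ap = rank1_and_map(dist, query_ids=[1, 2], gallery_ids=[1, 2, 1])
```

In the toy run the first query ranks both of its matches first, while the second query finds its only match at rank 3, so Rank-1 is 0.5 and mAP is the mean of 1 and 1/3. Published figures such as those above are these values expressed as percentages.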
Citations: 0
An end-to-end tracking framework via multi-view and temporal feature aggregation
IF 4.3, Zone 3, Computer Science
Computer Vision and Image Understanding Pub Date : 2024-10-10 DOI: 10.1016/j.cviu.2024.104203
Abstract: Multi-view pedestrian tracking has frequently been used to cope with the challenges of occlusion and limited fields-of-view in single-view tracking. However, there are few end-to-end methods in this field. Many existing algorithms detect pedestrians in individual views, cluster projected detections in a top view and then track them. The others track pedestrians in individual views and then associate the projected tracklets in a top view. In this paper, an end-to-end framework is proposed for multi-view tracking, in which both multi-view and temporal aggregations of feature maps are applied. The multi-view aggregation projects the per-view feature maps to a top view, uses a transformer encoder to output encoded feature maps and then uses a CNN to calculate a pedestrian occupancy map. The temporal aggregation uses another CNN to estimate position offsets from the encoded feature maps in consecutive frames. Our experiments have demonstrated that this end-to-end framework outperforms the state-of-the-art online algorithms for multi-view pedestrian tracking.
Citations: 0
A fast differential network with adaptive reference sample for gaze estimation
IF 4.3, Zone 3, Computer Science
Computer Vision and Image Understanding Pub Date : 2024-10-09 DOI: 10.1016/j.cviu.2024.104156
Abstract: Most non-invasive gaze estimation methods do not consider the inter-individual differences in anatomical structure, but directly regress the gaze direction from the appearance image information, which limits the accuracy of individual-independent gaze estimation networks. In addition, existing gaze estimation methods tend to consider only how to improve the model’s generalization performance, ignoring the crucial issue of efficiency, which leads to bulky models that are difficult to deploy and have questionable cost-effectiveness in practical use. This paper makes the following contributions: (1) A differential network for gaze estimation using adaptive reference samples is proposed, which can adaptively select reference samples based on scene and individual characteristics. (2) Knowledge distillation is used to transfer the knowledge structure of robust teacher networks into lightweight networks so that our networks can execute quickly and at low computational cost, dramatically increasing the prospect and value of applying gaze estimation. (3) Integrating the above innovations, a novel fast differential neural network (Diff-Net) named FDAR-Net is constructed and achieves excellent results on MPIIGaze, UTMultiview and EyeDiap.
Citations: 0
A semantic segmentation method integrated convolutional nonlinear spiking neural model with Transformer
IF 4.3, Zone 3, Computer Science
Computer Vision and Image Understanding Pub Date : 2024-10-09 DOI: 10.1016/j.cviu.2024.104196
Abstract: Semantic segmentation is a critical task in computer vision, with significant applications in areas like autonomous driving and medical imaging. Transformer-based methods have gained considerable attention recently because of their strength in capturing global information. However, these methods often sacrifice detailed information due to the lack of mechanisms for local interactions. Similarly, convolutional neural network (CNN) methods struggle to capture global context due to the inherent limitations of convolutional kernels. To overcome these challenges, this paper introduces a novel Transformer-based semantic segmentation method called NSNPFormer, which leverages the nonlinear spiking neural P (NSNP) system, a computational model inspired by the spiking mechanisms of biological neurons. The NSNPFormer employs an encoding–decoding structure with two convolutional NSNP components and a residual connection channel. The convolutional NSNP components facilitate nonlinear local feature extraction and block-level feature fusion. Meanwhile, the residual connection channel helps prevent the loss of feature information during the decoding process. Evaluations on the ADE20K and Pascal Context datasets show that NSNPFormer achieves mIoU scores of 53.7 and 58.06, respectively, highlighting its effectiveness in semantic segmentation tasks.
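The mIoU scores reported in this abstract follow the usual mean intersection-over-union definition. The sketch below shows that definition on flattened label lists. The function name `mean_iou` and the `ignore_index` parameter are assumptions made for illustration. Benchmark toolkits typically accumulate the counts over the whole validation set and report the result as a percentage.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean intersection-over-union over the classes that occur.

    pred and target are flat sequences of integer class labels in
    [0, num_classes). Hypothetical helper for illustration.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not count toward any class
        if p == t:
            intersection[t] += 1  # true positive for class t
            union[t] += 1
        else:
            union[t] += 1  # false negative for the true class
            union[p] += 1  # false positive for the predicted class
    # Classes that never appear in pred or target are left out of the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no labeled pixels to evaluate")
    return sum(ious) / len(ious)


# Toy run with three classes and six pixels.
score = mean_iou(pred=[0, 0, 1, 1, 2, 2], target=[0, 1, 1, 1, 2, 0], num_classes=3)
```

In the toy run the per-class IoUs are 1/3, 2/3 and 1/2, so `score` is 0.5, which a benchmark table would print as 50.0.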
Citations: 0
MT-DSNet: Mix-mask teacher–student strategies and dual dynamic selection plug-in module for fine-grained image recognition
IF 4.3, Zone 3, Computer Science
Computer Vision and Image Understanding Pub Date : 2024-10-08 DOI: 10.1016/j.cviu.2024.104201
Abstract: The fine-grained image recognition (FGIR) task aims to classify and distinguish subtle differences between subcategories with visually similar appearances, such as bird species and the makes or models of vehicles. However, subtle interclass differences and significant intraclass variances lead to poor model recognition performance. To address these challenges, we developed a mixed-mask teacher–student cooperative training strategy. A mixed masked image is generated and embedded into a knowledge distillation network by replacing one image’s visible marker with another’s masked marker. Collaborative reinforcement between teachers and students is used to improve the recognition performance of the network. We chose the classic transformer architecture as a baseline to better explore the contextual relationships between features. Additionally, we suggest a dual dynamic selection plug-in for choosing features with discriminative capabilities in the spatial and channel dimensions and filtering out irrelevant interference information to efficiently handle background and noise features in fine-grained images. The proposed feature suppression module is used to enhance the differences between different features, thereby motivating the network to mine more discriminative features. We validated our method using two datasets: CUB-200-2011 and Stanford Cars. The experimental results show that the proposed MT-DSNet can significantly improve the feature representation for FGIR tasks. Moreover, by applying it to different fine-grained networks, the FGIR accuracy can be improved without changing the original network structure. We hope that this work provides a promising approach for improving the feature representation of networks in the future.
Citations: 0
Hyperspectral image classification with token fusion on GPU
IF 4.3, Zone 3, Computer Science
Computer Vision and Image Understanding Pub Date : 2024-10-05 DOI: 10.1016/j.cviu.2024.104198
Abstract: Hyperspectral images capture material nuances with spectral data, vital for remote sensing. Transformers have become a mainstream approach for tackling the challenges posed by high-dimensional hyperspectral data with complex structures. However, a major challenge they face when processing hyperspectral images is the presence of a large number of redundant tokens, which leads to a significant increase in computational load, adding to the model’s computational burden and affecting inference speed. Therefore, we propose a token fusion algorithm tailored to the operational characteristics of the hyperspectral image and pure transformer network, aimed at enhancing the final accuracy and throughput of the model. The token fusion algorithm introduces a token merging step between the attention mechanism and the multi-layer perceptron module in each Transformer layer. Experiments on four hyperspectral image datasets demonstrate that our token fusion algorithm can significantly improve inference speed without any training, while only causing a slight decrease in the pure transformer network’s classification accuracy.
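The abstract does not say how tokens are matched or merged, so the following is only a generic sketch of what a training-free token merging step can look like: tokens are split into two alternating sets, each token in the first set is paired with its most similar token in the second set, and the `r` most similar pairs are averaged. The names `cosine` and `merge_tokens`, the cosine similarity measure, and the averaging rule are all assumptions for illustration, not the paper's algorithm.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, r):
    """Merge the r most similar (A, B) token pairs by averaging.

    tokens is a list of equal-length float lists. Set A holds the tokens
    at even positions, set B those at odd positions. Returns a new list
    with up to r fewer tokens, in the original order.
    """
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx:
        return [list(t) for t in tokens]
    # For every A token, find its most similar partner in B.
    best = []
    for i in a_idx:
        j = max(b_idx, key=lambda k: cosine(tokens[i], tokens[k]))
        best.append((cosine(tokens[i], tokens[j]), i, j))
    best.sort(reverse=True)  # most similar pairs first
    merged_into = {}  # B index -> A indices folded into it
    dropped = set()
    for _, i, j in best[:min(r, len(best))]:
        merged_into.setdefault(j, []).append(i)
        dropped.add(i)
    out = []
    for idx, tok in enumerate(tokens):
        if idx in dropped:
            continue
        group = [tok] + [tokens[i] for i in merged_into.get(idx, [])]
        out.append([sum(vals) / len(group) for vals in zip(*group)])
    return out


# Four 2-D tokens; the first two are nearly parallel and get merged.
reduced = merge_tokens([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.2, 1.0]], r=1)
```

In a Transformer layer, a step like this would sit between the attention output and the MLP input, as the abstract describes, so that the MLP and all later layers process fewer tokens.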
Citations: 0
Exploring event-based human pose estimation with 3D event representations
IF 4.3, Zone 3, Computer Science
Computer Vision and Image Understanding Pub Date : 2024-10-05 DOI: 10.1016/j.cviu.2024.104189
Abstract: Human pose estimation is a fundamental and appealing task in computer vision. Although traditional cameras are commonly applied, their reliability decreases in scenarios under high dynamic range or heavy motion blur, where event cameras offer a robust solution. Predominant event-based methods accumulate events into frames, ignoring the asynchronous and high temporal resolution that is crucial for distinguishing distinct actions. To address this issue and to unlock the 3D potential of event information, we introduce two 3D event representations: the Rasterized Event Point Cloud (RasEPC) and the Decoupled Event Voxel (DEV). The RasEPC aggregates events within concise temporal slices at identical positions, preserving their 3D attributes along with statistical information, thereby significantly reducing memory and computational demands. Meanwhile, the DEV representation discretizes events into voxels and projects them across three orthogonal planes, utilizing decoupled event attention to retrieve 3D cues from the 2D planes. Furthermore, we develop and release EV-3DPW, a synthetic event-based dataset crafted to facilitate training and quantitative analysis in outdoor scenes. Our methods are tested on the DHP19 public dataset, MMHPSD dataset, and our EV-3DPW dataset, with further qualitative validation via a derived driving scene dataset EV-JAAD and an outdoor collection vehicle. Our code and dataset have been made publicly available at https://github.com/MasterHow/EventPointPose.
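To make the idea of projecting voxelized events onto three orthogonal planes concrete, here is a rough sketch that bins events in time and accumulates event counts on the xy, xt and yt planes. It is not the authors' DEV implementation. The function name `project_events`, the count-based accumulation, and the omission of event polarity are simplifying assumptions.

```python
def project_events(events, width, height, duration, bins):
    """Accumulate event counts on the xy, xt and yt planes.

    events is an iterable of (x, y, t) with integer pixel coordinates
    0 <= x < width, 0 <= y < height and timestamps 0 <= t <= duration.
    Summing a (bins, height, width) voxel grid of counts along each axis
    gives the same three maps. Hypothetical helper for illustration.
    """
    xy = [[0] * width for _ in range(height)]  # indexed [y][x]
    xt = [[0] * width for _ in range(bins)]    # indexed [time bin][x]
    yt = [[0] * height for _ in range(bins)]   # indexed [time bin][y]
    for x, y, t in events:
        # Map the timestamp to a temporal bin, clamping t == duration.
        b = min(int(t / duration * bins), bins - 1)
        xy[y][x] += 1
        xt[b][x] += 1
        yt[b][y] += 1
    return xy, xt, yt


# Three events on a 2x2 sensor, split into two temporal bins.
xy, xt, yt = project_events(
    [(0, 0, 0.0), (1, 0, 0.5), (1, 1, 0.9)],
    width=2, height=2, duration=1.0, bins=2,
)
```

The xt and yt maps keep the timing information that a plain frame accumulation (the xy map alone) discards, which is the motivation the abstract gives for using 3D representations.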
Citations: 0
AWADA: Foreground-focused adversarial learning for cross-domain object detection
IF 4.3, Zone 3, Computer Science
Computer Vision and Image Understanding Pub Date : 2024-10-05 DOI: 10.1016/j.cviu.2024.104153
Abstract: Object detection networks have achieved impressive results, but it can be challenging to replicate this success in practical applications due to a lack of relevant data specific to the task. Typically, additional data sources are used to support the training process. However, the domain gaps between these data sources present a challenge. Adversarial image-to-image style transfer is often used to bridge this gap, but it is not directly connected to the object detection task and can be unstable. We propose AWADA, a framework that combines attention-weighted adversarial domain adaptation connecting style transfer and object detection. By using object detector proposals to create attention maps for foreground objects, we focus the style transfer on these regions and stabilize the training process. Our results demonstrate that AWADA can reach state-of-the-art unsupervised domain adaptation performance in three commonly used benchmarks.
Citations: 0