Image and Vision Computing: Latest Articles

EVA-02: A visual representation for neon genesis
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing Pub Date: 2024-07-17 DOI: 10.1016/j.imavis.2024.105171
Abstract: We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open and accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on the ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest and best open-sourced CLIP with only ~1/6 of the parameters and ~1/6 of the image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community at https://github.com/baaivision/EVA/tree/master/EVA-02.
Citations: 0
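To make the pre-training objective above concrete, here is a minimal PyTorch sketch of masked image modeling against frozen CLIP vision features. This is not the authors' code; the tensor shapes, ~40% mask ratio, and cosine-distance loss are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

def mim_feature_loss(student_tokens, teacher_tokens, mask):
    """Cosine-distance loss computed on masked patches only.
    student_tokens, teacher_tokens: (B, N, D); mask: (B, N) bool, True = masked."""
    s = F.normalize(student_tokens[mask], dim=-1)
    t = F.normalize(teacher_tokens[mask], dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

B, N, D = 2, 196, 768                                 # batch, patch tokens, feature dim (assumed)
student = torch.randn(B, N, D, requires_grad=True)    # student ViT outputs
teacher = torch.randn(B, N, D)                        # frozen CLIP vision-encoder targets
mask = torch.rand(B, N) < 0.4                         # ~40% of patches masked (assumed ratio)
loss = mim_feature_loss(student, teacher, mask)
loss.backward()
```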
Exploring global context and position-aware representation for group activity recognition
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing Pub Date: 2024-07-15 DOI: 10.1016/j.imavis.2024.105181
Abstract: This paper explores context and position information in the scene for group activity understanding. Previous group activity recognition methods strive to reason on individual features without considering information in the scene. Besides correlations among actors, we argue that integrating the scene context simultaneously can afford more useful and supplementary cues. Therefore, we propose a new network, termed Contextual Transformer Network (CTN), to incorporate global contextual information into individual representations. In addition, the position of individuals also plays a vital role in group activity understanding. Unlike previous methods that explore correlations among individuals semantically, we propose Clustered Position Embedding (CPE) to integrate the spatial structure of actors and produce position-aware representations. Experimental results on two widely used datasets for sports video and social activity (i.e., the Volleyball and Collective Activity datasets) show that the proposed method outperforms state-of-the-art approaches. In particular, when using ResNet-18 as the backbone, our method achieves 93.6/93.9% MCA/MPCA on the Volleyball dataset and 95.4/96.3% MCA/MPCA on the Collective Activity dataset.
Citations: 0
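The idea of injecting global scene context into per-actor features can be illustrated with a small sketch (not the paper's CTN implementation): a global scene token is prepended to the actor tokens and a standard transformer encoder lets every actor attend to it. Feature dimensions, head counts, and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class ContextualActorEncoder(nn.Module):
    """Actors attend to a prepended global scene token via a transformer encoder."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, actor_feats, scene_feat):
        # actor_feats: (B, K, D) per-actor features; scene_feat: (B, D) global scene context
        tokens = torch.cat([scene_feat.unsqueeze(1), actor_feats], dim=1)
        tokens = self.encoder(tokens)
        return tokens[:, 1:]                       # context-enriched actor representations

enc = ContextualActorEncoder()
out = enc(torch.randn(2, 12, 256), torch.randn(2, 256))
print(out.shape)                                   # torch.Size([2, 12, 256])
```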
Multi-task disagreement-reducing multimodal sentiment fusion network
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing Pub Date: 2024-07-14 DOI: 10.1016/j.imavis.2024.105158
Abstract: Existing multimodal sentiment analysis models can effectively capture sentimental commonalities between different modalities and possess a high capability for acquiring sentiment. However, their analysis and recognition abilities still fall short on samples whose sentimental polarities disagree across modalities. Additionally, the dominance of the text modality in multimodal models, particularly those pre-trained with BERT, can hinder the learning of other modalities because of its richer semantic information. This issue becomes particularly pronounced when the multimodal and textual sentimental polarities conflict, often leading to suboptimal analytical results. Moreover, the classification ability of each modality is suppressed by single-task learning. In this paper, we propose a Multi-Task Disagreement-Reducing Multimodal Sentiment Fusion Network (MtDr-MSF), designed to enhance the semantic information of non-text modalities, reduce the dominant impact of the textual modality on the model, and improve the learning capabilities of the unimodal networks. We conducted experiments on the multimodal sentiment analysis datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS. The results show that our method outperforms current state-of-the-art methods.
Citations: 0
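As a rough illustration of the multi-task idea (supervising each unimodal branch alongside the fused prediction so the text branch does not dominate), here is a sketch of a joint loss over unimodal and fused regression heads. It is not the MtDr-MSF model; the feature dimensions, loss weights, and the reuse of one label for all branches are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSentiment(nn.Module):
    """One regression head per modality plus one head on the concatenated (fused) feature."""
    def __init__(self, dims, hidden=128):
        super().__init__()
        self.heads = nn.ModuleDict({m: nn.Linear(d, 1) for m, d in dims.items()})
        self.fuse = nn.Sequential(nn.Linear(sum(dims.values()), hidden),
                                  nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feats):                      # feats: dict of (B, D_m) tensors
        uni = {m: head(feats[m]) for m, head in self.heads.items()}
        fused = self.fuse(torch.cat([feats[m] for m in self.heads], dim=-1))
        return uni, fused

def joint_loss(uni, fused, label, w_uni=0.3):
    """Fused prediction plus weighted unimodal predictions, all trained jointly."""
    return F.mse_loss(fused.squeeze(-1), label) + \
           w_uni * sum(F.mse_loss(p.squeeze(-1), label) for p in uni.values())

model = MultiTaskSentiment({'text': 768, 'audio': 74, 'vision': 35})
feats = {'text': torch.randn(4, 768), 'audio': torch.randn(4, 74), 'vision': torch.randn(4, 35)}
uni, fused = model(feats)
joint_loss(uni, fused, torch.randn(4)).backward()
```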
Event-driven weakly supervised video anomaly detection
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing Pub Date: 2024-07-14 DOI: 10.1016/j.imavis.2024.105169
Abstract: Inspired by observations of human working manners, this work proposes an event-driven method for weakly supervised video anomaly detection. Complementary to conventional snippet-level anomaly detection, this work designs an event analysis module to predict event-level anomaly scores as well. It first generates event proposals simply via a temporal sliding window and then constructs a cascaded causal transformer to capture temporal dependencies for potential events of varying durations. Moreover, a dual-memory augmented self-attention scheme is also designed to capture global semantic dependencies for event feature enhancement. The network is learned with a standard multiple instance learning (MIL) loss, together with normal-abnormal contrastive learning losses. During inference, the snippet- and event-level anomaly scores are fused for anomaly detection. Experiments show that the event-level analysis helps to detect anomalous events more continuously and precisely. Results on three public datasets demonstrate that the proposed approach is competitive with state-of-the-art methods.
Citations: 0
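The standard MIL loss mentioned above can be illustrated with a common top-k variant: each untrimmed video is a bag of snippet scores, and the bag-level score is the mean of its highest snippet scores. This is a generic sketch under assumed shapes and k, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def topk_mil_loss(snippet_scores, bag_labels, k=3):
    """snippet_scores: (B, T) per-snippet anomaly scores in [0, 1];
    bag_labels: (B,) with 1 for abnormal videos, 0 for normal ones."""
    bag_scores = snippet_scores.topk(k, dim=1).values.mean(dim=1)   # top-k mean per video
    return F.binary_cross_entropy(bag_scores, bag_labels.float())

scores = torch.sigmoid(torch.randn(4, 32, requires_grad=True))      # 32 snippets per video
labels = torch.tensor([1, 0, 1, 0])
topk_mil_loss(scores, labels).backward()
```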
Joint training strategy of unimodal and multimodal for multimodal sentiment analysis
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing Pub Date: 2024-07-14 DOI: 10.1016/j.imavis.2024.105172
Abstract: With the explosive growth of social media video content, research on multimodal sentiment analysis (MSA) has recently attracted considerable attention. Despite significant progress in MSA, challenges remain: current research mostly focuses on learning either unimodal features or aspects of multimodal interactions, neglecting the importance of simultaneously considering both unimodal features and intermodal interactions. To address these challenges, this paper proposes a fusion strategy called Joint Training of Unimodal and Multimodal (JTUM). Specifically, this strategy combines a unimodal label generation module with a cross-modal transformer. The unimodal label generation module aims to generate more distinctive labels for each unimodal input, facilitating more effective learning of unimodal representations. Meanwhile, the cross-modal transformer is designed to treat each modality as a target modality and optimize it using the other modalities as source modalities, thereby learning the interactions between each pair of modalities. By jointly training unimodal and multimodal tasks, our model can focus on individual modality features while learning the interactions between modalities. Finally, to better capture temporal information and make predictions, we also add a self-attention transformer as the sequence model. Experimental results on the CMU-MOSI and CMU-MOSEI datasets demonstrate that JTUM outperforms current mainstream methods.
Citations: 0
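The cross-modal transformer step described above (one modality as the target query, another as the source keys/values) can be sketched with PyTorch's built-in multi-head attention. Shapes, head counts, and the residual/norm layout are assumptions, and the unimodal label generation is omitted.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Target modality attends to a source modality: query = target, key/value = source."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, source):             # (B, T_t, D), (B, T_s, D)
        out, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + out)             # residual connection + layer norm

block = CrossModalBlock()
text, audio = torch.randn(2, 20, 128), torch.randn(2, 50, 128)
text_enriched = block(text, audio)                 # text sequence reinforced by audio
print(text_enriched.shape)                         # torch.Size([2, 20, 128])
```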
A novel facial expression recognition model based on harnessing complementary features in multi-scale network with attention fusion
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing Pub Date: 2024-07-14 DOI: 10.1016/j.imavis.2024.105183
Abstract: This paper presents a novel method for facial expression recognition using the proposed feature complementation and multi-scale attention model with attention fusion (FCMSA-AF). The proposed model consists of four main components: a shallow feature extractor module, a parallel-structured two-branch multi-scale attention (MSA) module, a feature complementing module (FCM), and an attention fusion and classification module. The MSA module contains multi-scale attention modules in a cascaded fashion along two paths to learn diverse features. The upper and lower paths use left and right multi-scale blocks to extract and aggregate features at different receptive fields. The attention networks in MSA focus on salient local regions to extract features at granular levels. The FCM uses the correlation between the feature maps in the two paths to make the multi-scale attention features complementary to each other. Finally, the complementary features are fused through an attention network to form an informative holistic feature that includes subtle, visually varying regions in similar classes. Hence, complementary and informative features are used in classification to minimize information loss and capture the discriminating finer aspects of facial expression recognition. Experimental evaluation of the proposed model on the AffectNet and CK+ datasets achieves accuracies of 64.59% and 98.98%, respectively, outperforming some of the state-of-the-art methods.
Citations: 0
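To make the multi-scale-with-attention-fusion idea concrete, here is a small generic sketch (not the FCMSA-AF architecture): two convolutional branches with different receptive fields are fused by learned softmax weights. Channel counts, kernel sizes, and the gating scheme are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleAttentionFusion(nn.Module):
    """Two parallel branches with different receptive fields, fused by attention weights."""
    def __init__(self, channels=64):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)   # finer receptive field
        self.branch5 = nn.Conv2d(channels, channels, 5, padding=2)   # coarser receptive field
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, 2, 1))         # one weight per branch

    def forward(self, x):                           # x: (B, C, H, W)
        f3, f5 = self.branch3(x), self.branch5(x)
        w = torch.softmax(self.gate(x), dim=1)      # (B, 2, 1, 1) branch attention
        return w[:, 0:1] * f3 + w[:, 1:2] * f5      # attention-weighted fusion

m = MultiScaleAttentionFusion()
print(m(torch.randn(2, 64, 56, 56)).shape)          # torch.Size([2, 64, 56, 56])
```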
Relational-branchformer: Novel framework for audio-visual speech recognition
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing Pub Date: 2024-07-11 DOI: 10.1016/j.imavis.2024.105182
Yewei Xiao, Xuanming Liu, Aosu Zhu, Jian Huang
Abstract: This study adopts the state-of-the-art Branchformer family of architectures from automatic speech recognition in place of the widely used Conformer, offering a new solution for audio-visual speech recognition tasks. Building upon the Branchformer architecture, we propose the Relational-Branchformer (R-Branchformer). A convolutional attention relation module is incorporated to strengthen the connectivity between the local and global branches by modeling their interrelations and interplay; this module enables the mutual embedding of local and global contextual information and substantially improves model performance. The model is trained with the connectionist temporal classification (CTC) loss, with intermediate CTC losses added between blocks. Moreover, a gated interlayer collaboration module, which supersedes the intermediate-CTC module, relaxes the conditional independence assumption intrinsic to the CTC model and markedly boosts overall performance. Furthermore, an audio-visual output enhancement module is proposed that assimilates information from both the audio and visual modalities to enrich the audio-visual representation. As a result, the R-Branchformer model achieves word error rates of 1.7% and 1.5% on the LRS2 and LRS3 test sets, respectively, demonstrating state-of-the-art performance in audio-visual speech recognition.
Citations: 0
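The intermediate-CTC training mentioned above can be illustrated with a weighted sum of CTC losses on the final and an intermediate block's outputs. This is a generic sketch with assumed shapes, weights, and vocabulary size, not the R-Branchformer training code.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def joint_ctc_loss(final_logp, inter_logp, targets, in_lens, tgt_lens, w=0.3):
    """final_logp / inter_logp: (T, B, C) log-probabilities from the last block and an
    intermediate block; the intermediate term regularizes the lower layers."""
    return (1 - w) * ctc(final_logp, targets, in_lens, tgt_lens) \
         + w * ctc(inter_logp, targets, in_lens, tgt_lens)

T, B, C, S = 50, 2, 30, 12                           # frames, batch, vocab size, target length
final_lp = torch.randn(T, B, C, requires_grad=True).log_softmax(-1)
inter_lp = torch.randn(T, B, C, requires_grad=True).log_softmax(-1)
targets = torch.randint(1, C, (B, S))                # label indices (0 is reserved for blank)
loss = joint_ctc_loss(final_lp, inter_lp, targets,
                      torch.full((B,), T, dtype=torch.long),
                      torch.full((B,), S, dtype=torch.long))
loss.backward()
```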
Transferable dual multi-granularity semantic excavating for partially relevant video retrieval
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing Pub Date: 2024-07-11 DOI: 10.1016/j.imavis.2024.105168
Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve partially relevant videos from many unlabeled and untrimmed videos according to a query, and is formulated as a multiple instance learning problem. The challenge of PRVR is that it uses untrimmed videos, which are much closer to real-world conditions. Existing methods insufficiently excavate video-text semantic consistency information and lack the capacity to highlight the semantics of key representations. To tackle these issues, we propose a transferable dual multi-granularity semantic excavating network, called T-D3N, which focuses on enhancing the learning of dual-modal representations. Specifically, we first introduce a novel transferable textual semantic learning strategy by designing an Adaptive Multi-scale Semantic Mining (AMSM) component to excavate significant textual semantics from multiple perspectives. Second, T-D3N distinguishes feature differences from the frame-wise perspective to better perform contrastive learning between positive and negative samples in the video feature domain, which further separates positive and negative samples and improves the probability of positive samples being retrieved for the query. Finally, our model constructs multi-grained video temporal dependencies and conducts cross-grained core feature perception, enabling more sufficient multimodal interactions. Extensive experiments on three benchmarks, i.e., ActivityNet Captions, Charades-STA, and TVR, show that T-D3N achieves state-of-the-art results. Furthermore, we confirm that our model transfers to a broad range of multimodal tasks such as T2VR, VMR, and MMSum.
Citations: 0
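At inference time, partially relevant retrieval ultimately scores each untrimmed video by its best-matching clip for the query. Below is a minimal sketch of that max-over-clips scoring with assumed embedding shapes; it is not the T-D3N model.

```python
import torch
import torch.nn.functional as F

def video_scores(query_emb, clip_embs):
    """query_emb: (D,) text embedding; clip_embs: (V, M, D) clip embeddings for V videos.
    A video's relevance is the cosine similarity of its best-matching clip."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(clip_embs, dim=-1)
    sim = torch.einsum('d,vmd->vm', q, c)       # query-to-clip cosine similarities
    return sim.max(dim=1).values                # max over clips = partial relevance score

scores = video_scores(torch.randn(256), torch.randn(100, 8, 256))
top5 = scores.topk(5).indices                   # indices of the 5 best-matching videos
print(top5)
```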
Hierarchical disentangled representation for image denoising and beyond
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing Pub Date: 2024-07-10 DOI: 10.1016/j.imavis.2024.105165
Abstract: Image denoising is a typical ill-posed problem due to complex degradation. Leading methods based on normalizing flows have tried to solve this problem with an invertible transformation instead of a deterministic mapping. However, it is difficult to construct a feasible bijective mapping that removes spatially variant noise while recovering fine texture and structure details, due to latent ambiguity in inverse problems. Inspired by the common observation that noise tends to appear in the high-frequency part of the image, we propose a fully invertible denoising method that injects the idea of disentangled learning into a general invertible architecture to split noise from the high-frequency part. More specifically, we decompose the noisy image into a clean low-frequency part and a hybrid high-frequency part with an invertible transformation, and then disentangle case-specific noise and high-frequency components in the latent space. In this way, denoising is made tractable by inversely merging the noiseless low- and high-frequency parts. Furthermore, we construct a flexible hierarchical disentangling framework that decomposes most of the low-frequency image information while disentangling noise from the high-frequency part in a coarse-to-fine manner. Extensive experiments on real image denoising, JPEG compression artifact removal, and medical low-dose CT image restoration demonstrate that the proposed method achieves competitive performance on both quantitative metrics and visual quality, with significantly less computational cost.
Citations: 0
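The low/high-frequency decomposition that the method builds on can be illustrated with an orthogonal Haar split, a common invertible transform; this is a generic sketch, not the paper's learned invertible architecture.

```python
import torch

def haar_split(x):
    """Invertible 2x2 Haar transform. x: (B, C, H, W) with even H and W.
    Returns the low-frequency band (LL) and the three high-frequency bands stacked."""
    a, b = x[:, :, 0::2, 0::2], x[:, :, 0::2, 1::2]
    c, d = x[:, :, 1::2, 0::2], x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2                    # low frequency: coarse image content
    lh = (a - b + c - d) / 2                    # high-frequency detail (plus most noise)
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, torch.cat([lh, hl, hh], dim=1)

low, high = haar_split(torch.randn(1, 3, 256, 256))
print(low.shape, high.shape)                    # (1, 3, 128, 128) (1, 9, 128, 128)
```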
CoNPL: Consistency training framework with noise-aware pseudo labeling for dense pose estimation
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing Pub Date: 2024-07-10 DOI: 10.1016/j.imavis.2024.105170
Abstract: Dense pose estimation faces hurdles due to the lack of costly, precise pixel-level IUV labels. Existing methods aim to overcome this by regularizing model outputs or interpolating pseudo labels. However, conventional geometric transformations often fall short, and pseudo labels may introduce unwanted noise, leaving inaccurate estimations difficult to rectify. We introduce a novel Consistency training framework with Noise-aware Pseudo Labeling (CoNPL) to tackle the problems of learning from unlabeled pixels. CoNPL employs both weak and strong augmentations in a shared model to enhance robustness against aggressive transformations. To address noisy pseudo labels, CoNPL integrates a Noise-aware Pseudo Labeling (NPL) module, which consists of a Noise-Aware Module (NAM) and a Noise-Resistant Learning (NRL) module. NAM identifies misclassifications and incorrect UV coordinates using binary classification and regression, while NRL dynamically adjusts loss weights to filter out uncertain samples, thereby stabilizing learning from pseudo labels. Our method demonstrates a +2.0% improvement in AP on the DensePose-COCO benchmark across different networks, achieving state-of-the-art performance. On the UltraPose and DensePose-Chimps benchmarks, our method also demonstrates +2.7% and +3.0% improvements in AP.
Citations: 0
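The weak/strong consistency training with filtered pseudo labels can be sketched in a FixMatch-style form for the per-pixel part classification. This is an assumption-laden illustration, not the CoNPL module: the UV regression branch and the learned noise-aware weighting are omitted, and a simple confidence threshold stands in for them.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_weak, logits_strong, threshold=0.9):
    """logits_*: (B, P, H, W) part-class logits for weakly / strongly augmented views.
    Confident weak-view predictions provide pseudo labels for the strong view."""
    with torch.no_grad():
        probs = logits_weak.softmax(dim=1)
        conf, pseudo = probs.max(dim=1)             # (B, H, W) confidence and pseudo labels
        keep = conf >= threshold                    # confidence filter (proxy for noise awareness)
    loss = F.cross_entropy(logits_strong, pseudo, reduction='none')
    return (loss * keep).sum() / keep.sum().clamp(min=1)

lw = torch.randn(2, 25, 64, 64)                      # 24 body parts + background (assumed)
ls = torch.randn(2, 25, 64, 64, requires_grad=True)
print(consistency_loss(lw, ls))
```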