Image and Vision Computing: Latest Articles

A dictionary learning based unsupervised neural network for single image compressed sensing
IF 4.2 · CAS Tier 3 · Computer Science
Image and Vision Computing · Pub Date: 2024-09-23 · DOI: 10.1016/j.imavis.2024.105281
Abstract: In the field of Compressed Sensing (CS), the sparse representation of signals and the advancement of reconstruction algorithms are two critical challenges. Conventional CS algorithms often fail to sufficiently exploit the structured sparsity present in images and suffer from poor reconstruction quality, while most deep learning-based CS methods are trained on large-scale datasets; obtaining enough training data is difficult in many practical applications, and in some cases no training set is available at all. In this paper, a novel deep Dictionary Learning (DL) based unsupervised neural network for single-image CS, dubbed DL-CSNet, is proposed. It is an effective training-free network consisting of three components with corresponding loss functions: 1) a DL layer built from multi-layer perceptrons (MLP) and convolutional neural networks (CNN) that extracts latent sparse features under an L1-norm sparsity loss; 2) an image smoothing layer with a Total Variation (TV)-like smoothing loss; and 3) a CS acquisition layer for image compression, with a Mean Square Error (MSE) loss between the compressed measurements of the original image and those of the reconstructed image. DL-CSNet is a lightweight and fast model that requires no training dataset and converges quickly, making it suitable for deployment in resource-constrained environments. Experiments demonstrate that DL-CSNet outperforms traditional CS methods and other unsupervised state-of-the-art deep learning-based CS methods.
Citations: 0
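The three losses described in the abstract (measurement consistency, L1 sparsity on the latent code, and TV smoothing) can be combined in a single-image optimization loop. Below is a minimal PyTorch sketch of that idea; the layer sizes, random sensing matrix, and loss weights are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tv_loss(x):
    # Total-variation smoothness penalty on the reconstruction
    dh = (x[..., 1:, :] - x[..., :-1, :]).abs().mean()
    dw = (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
    return dh + dw

class TinyCSNet(nn.Module):
    def __init__(self, n=32 * 32, m=256, k=512):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(m, k), nn.ReLU(), nn.Linear(k, k))  # latent sparse code
        self.decode = nn.Linear(k, n)                                             # dictionary-like synthesis
        self.register_buffer("phi", torch.randn(m, n) / m ** 0.5)                 # fixed CS sensing matrix

    def forward(self, y):
        z = self.encode(y)        # latent sparse features
        x_hat = self.decode(z)    # reconstructed image (flattened)
        return x_hat, z

# single-image, dataset-free optimization: only measurements of x are used
x = torch.rand(1, 32 * 32)
net = TinyCSNet()
y = x @ net.phi.T                 # compressed measurements of the original image
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):
    x_hat, z = net(y)
    y_hat = x_hat @ net.phi.T     # re-measure the reconstruction
    loss = F.mse_loss(y_hat, y) + 1e-3 * z.abs().mean() \
         + 1e-4 * tv_loss(x_hat.view(1, 1, 32, 32))
    opt.zero_grad(); loss.backward(); opt.step()
```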
Unbiased scene graph generation via head-tail cooperative network with self-supervised learning
IF 4.2 · CAS Tier 3 · Computer Science
Image and Vision Computing · Pub Date: 2024-09-22 · DOI: 10.1016/j.imavis.2024.105283
Abstract: Scene Graph Generation (SGG) is a critical task in image understanding that faces the challenge of head-biased prediction caused by the long-tail distribution of predicates. Current debiased SGG methods tend to prioritize improving the prediction of tail predicates while substantially sacrificing head predicates, shifting the bias from head to tail. To address this issue, we propose a Head-Tail Cooperative network with self-supervised Learning (HTCL), which achieves unbiased SGG by combining head-prefer and tail-prefer predictions through learnable weight parameters. HTCL employs a tail-prefer feature encoder that re-represents predicate features by injecting self-supervised learning, which focuses on the intrinsic structure of features, into the supervised learning of SGG, constraining the representation of predicate features to enhance the distinguishability of tail samples. We demonstrate the effectiveness of HTCL on the VG150, Open Images V6, and GQA200 datasets. HTCL achieves higher mean Recall with a minimal sacrifice in Recall and sets a new state-of-the-art overall performance. Our code is available at https://github.com/wanglei0618/HTCL.
Citations: 0
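A central ingredient is the cooperation of head-prefer and tail-prefer predictions through learnable weights. Below is a minimal PyTorch sketch of such a fusion head; the feature dimension, the predicate count (51 is the usual VG150 figure), and the per-class sigmoid weighting are assumptions for illustration, not the paper's exact layers.

```python
import torch
import torch.nn as nn

class HeadTailCooperativeHead(nn.Module):
    """Fuse a head-preferring and a tail-preferring predicate classifier
    with learnable per-class mixing weights (illustrative sketch)."""
    def __init__(self, feat_dim=512, num_predicates=51):
        super().__init__()
        self.head_branch = nn.Linear(feat_dim, num_predicates)   # favors frequent (head) predicates
        self.tail_branch = nn.Linear(feat_dim, num_predicates)   # fed by the tail-prefer encoder
        self.alpha = nn.Parameter(torch.zeros(num_predicates))   # learnable cooperation weights

    def forward(self, head_feat, tail_feat):
        w = torch.sigmoid(self.alpha)   # in (0, 1), one weight per predicate class
        return w * self.head_branch(head_feat) + (1 - w) * self.tail_branch(tail_feat)

fuser = HeadTailCooperativeHead()
head_feat = torch.randn(8, 512)   # e.g. features from the standard SGG pipeline
tail_feat = torch.randn(8, 512)   # features from the tail-prefer (self-supervised) encoder
print(fuser(head_feat, tail_feat).shape)   # torch.Size([8, 51])
```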
UIR-ES: An unsupervised underwater image restoration framework with equivariance and Stein unbiased risk estimator
IF 4.2 · CAS Tier 3 · Computer Science
Image and Vision Computing · Pub Date: 2024-09-22 · DOI: 10.1016/j.imavis.2024.105285
Abstract: Underwater imaging faces challenges in enhancing object visibility and restoring true colors due to the absorptive and scattering characteristics of water. Underwater image restoration (UIR) seeks to restore clean images from degraded ones, providing significant utility in downstream tasks. Recently, data-driven UIR has garnered much attention due to the expressive capabilities of deep neural networks (DNNs). These DNNs are supervised and rely on large amounts of labeled training samples, yet acquiring such data is expensive or even impossible in real-world underwater scenarios. While recent research suggests that unsupervised learning is effective for UIR, none of these frameworks consider physical signal priors. In this work, we present a novel physics-inspired unsupervised UIR framework empowered by equivariance and unbiased estimation techniques. Equivariance stems from the invariance inherent in natural signals and enables data-efficient learning. Given that degraded images invariably contain noise, we propose a noise-tolerant loss for unsupervised UIR based on the Stein unbiased risk estimator to obtain an accurate estimate of data consistency. Extensive experiments on benchmark UIR datasets, including UIEB and RUIE, validate the superiority of the proposed method in quantitative scores, visual quality, and generalization ability compared to state-of-the-art counterparts. Moreover, our method achieves performance comparable to a supervised model.
Citations: 0
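The noise-tolerant data-consistency term is based on the Stein unbiased risk estimator (SURE), which can be approximated without clean targets via a Monte-Carlo divergence estimate, and the equivariance idea can be expressed as a consistency penalty under transformations. The sketch below shows both terms under the simplifying assumptions of additive Gaussian noise with known standard deviation and rotation as the transformation; it is not the paper's exact formulation.

```python
import torch

def mc_sure_loss(model, y, sigma, eps=1e-3):
    """Monte-Carlo SURE: an unbiased estimate of the MSE to the unseen clean image,
    assuming i.i.d. Gaussian noise of known std `sigma` (sketch, not the paper's loss)."""
    n = y.numel()
    f_y = model(y)
    b = torch.randn_like(y)                  # random probe for the divergence term
    f_yb = model(y + eps * b)
    div = (b * (f_yb - f_y)).sum() / eps     # Monte-Carlo divergence estimate
    return ((f_y - y) ** 2).sum() / n - sigma ** 2 + (2 * sigma ** 2 / n) * div

def equivariance_loss(model, y):
    """Encourage rotation-equivariance: restoring a rotated input should equal
    rotating the restoration."""
    k = int(torch.randint(1, 4, (1,)))
    y_rot = torch.rot90(y, k, dims=(-2, -1))
    return torch.mean((model(y_rot) - torch.rot90(model(y), k, dims=(-2, -1))) ** 2)

# usage with any image-to-image network `net` and a degraded batch `y`:
# loss = mc_sure_loss(net, y, sigma=0.05) + 0.1 * equivariance_loss(net, y)
```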
A new deepfake detection model for responding to perception attacks in embodied artificial intelligence
IF 4.2 · CAS Tier 3 · Computer Science
Image and Vision Computing · Pub Date: 2024-09-19 · DOI: 10.1016/j.imavis.2024.105279
Abstract: Embodied artificial intelligence (AI) represents a new generation of robotics technology combined with artificial intelligence and is at the forefront of current research. To reduce the impact of deepfake technology on embodied perception and enhance the security and reliability of embodied AI, this paper proposes a novel deepfake detection model with a new Balanced Contrastive Learning strategy, named BCL. By integrating unsupervised and supervised contrastive learning with deepfake detection, the model extracts the underlying features of fake images at both the individual and category levels, yielding more discriminative features. In addition, a Multi-scale Attention Interaction (MAI) module is proposed to enrich the representational ability of features: by cross-fusing the semantic features of the encoder's different receptive fields, more effective deepfake traces can be mined. Extensive experiments demonstrate that our method offers good performance and generalization across intra-dataset, cross-dataset, and cross-manipulation scenarios.
Citations: 0
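The balanced contrastive strategy combines an individual-level (unsupervised, two-view) term with a category-level (supervised) term. Below is a compact PyTorch sketch of the two losses over encoder embeddings; the temperatures, weighting, and exact formulation are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    """Category-level (supervised) term: embeddings sharing a real/fake label
    are pulled together (compact SupCon-style sketch)."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau
    mask_self = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(mask_self, -1e9)                      # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~mask_self     # same-class pairs
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

def instance_loss(z1, z2, tau=0.1):
    """Individual-level (unsupervised) term: two augmented views of the same image agree."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau
    targets = torch.arange(len(z1))
    return F.cross_entropy(logits, targets)

# balanced objective over two augmented views and binary real/fake labels:
# loss = ce_loss + lambda1 * instance_loss(z1, z2) + lambda2 * supcon_loss(z1, labels)
```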
Ground4Act: Leveraging visual-language model for collaborative pushing and grasping in clutter
IF 4.2 · CAS Tier 3 · Computer Science
Image and Vision Computing · Pub Date: 2024-09-18 · DOI: 10.1016/j.imavis.2024.105280
Abstract: The challenge in robotics is to enable robots to move from visual perception and language understanding to performing tasks such as grasping and assembling objects, bridging the gap from "seeing" and "hearing" to "doing". In this work, we propose Ground4Act, a two-stage approach for collaborative pushing and grasping in clutter using a visual-language model. In the grounding stage, Ground4Act extracts target features from multi-modal data via visual grounding. In the action stage, it embeds a collaborative pushing and grasping framework to generate each action's position and direction. Specifically, we propose a DQN-based reinforcement-learning pushing policy that uses RGB-D images as the state space to determine the pixel-level coordinates and direction of the push, and a least-squares linear-fitting grasping policy that takes the target mask from the grounding stage as input to achieve efficient grasping. Simulations and real-world experiments demonstrate Ground4Act's superior performance. The simulation suite, source code, and trained models will be made publicly available.
Citations: 0
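The least-squares grasping policy mentioned above can be illustrated in a few lines of NumPy: fit a line to the grounded target mask and grasp at its centroid, perpendicular to the fitted axis. This is only one plausible reading of the abstract; the helper below is hypothetical and not the authors' code.

```python
import numpy as np

def grasp_from_mask(mask):
    """Fit a least-squares line to the target mask pixels and grasp at the centroid,
    perpendicular to the object's principal direction (illustrative sketch)."""
    ys, xs = np.nonzero(mask)                       # pixel coordinates of the grounded target
    cx, cy = xs.mean(), ys.mean()                   # grasp position = mask centroid
    A = np.stack([xs - cx, np.ones_like(xs)], axis=1).astype(float)
    slope, _ = np.linalg.lstsq(A, (ys - cy).astype(float), rcond=None)[0]
    axis_angle = np.arctan(slope)                   # object's principal direction
    grasp_angle = axis_angle + np.pi / 2            # gripper closes across the object
    return (cx, cy), grasp_angle

mask = np.zeros((64, 64), dtype=np.uint8)
mask[30:34, 10:50] = 1                              # an elongated horizontal object
center, angle = grasp_from_mask(mask)
print(center, np.degrees(angle))                    # ~ (29.5, 31.5), ~90 degrees
```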
A learnable motion preserving pooling for action recognition
IF 4.2 · CAS Tier 3 · Computer Science
Image and Vision Computing · Pub Date: 2024-09-17 · DOI: 10.1016/j.imavis.2024.105278
Abstract: Using deep neural networks (DNNs) for video understanding tasks is computationally expensive. Pooling layers, widely used in most vision tasks to resize the spatial dimensions, play a crucial role in reducing computation and memory cost. In video-related tasks, pooling is mostly applied only in the spatial dimensions, because standard average pooling in the temporal domain can significantly reduce performance: conventional temporal pooling degrades the important motion features carried across consecutive frames. This phenomenon is rarely investigated, and most state-of-the-art methods simply avoid temporal pooling, incurring enormous computation costs. In this work, we propose a learnable motion-preserving pooling (MPPool) layer that preserves the overall motion progression after pooling. The layer first locates the frames with the strongest motion features and then keeps these crucial features during pooling. Our experiments demonstrate that MPPool not only reduces the computation cost of video data modeling but also increases final prediction accuracy on various motion-centric and appearance-centric datasets.
Citations: 0
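The core idea, locating the frames with the strongest motion and keeping them during temporal pooling, can be sketched with a simple frame-difference score. The function below is a non-learnable stand-in for illustration; the paper's layer is learnable.

```python
import torch

def motion_preserving_pool(clip, out_t):
    """Temporal pooling that keeps the frames with the strongest motion
    (frame-difference magnitude) instead of averaging them away.
    `clip`: tensor of shape (C, T, H, W). Illustrative, non-learnable sketch."""
    c, t, h, w = clip.shape
    diff = (clip[:, 1:] - clip[:, :-1]).abs().mean(dim=(0, 2, 3))  # motion score per transition
    score = torch.zeros(t)
    score[1:] += diff
    score[:-1] += diff                      # a frame scores high if it differs from its neighbors
    keep = torch.topk(score, out_t).indices.sort().values          # keep temporal order
    return clip[:, keep]

clip = torch.randn(3, 16, 56, 56)
pooled = motion_preserving_pool(clip, out_t=8)
print(pooled.shape)   # torch.Size([3, 8, 56, 56])
```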
LVD-YOLO: An efficient lightweight vehicle detection model for intelligent transportation systems
IF 4.2 · CAS Tier 3 · Computer Science
Image and Vision Computing · Pub Date: 2024-09-16 · DOI: 10.1016/j.imavis.2024.105276
Abstract: Vehicle detection is a fundamental component of intelligent transportation systems. However, current algorithms often suffer from high computational complexity, long execution times, and significant resource demands, making them unsuitable for resource-limited environments. To overcome these challenges, we propose LVD-YOLO, a Lightweight Vehicle Detection model based on YOLO. The model adopts the EfficientNetv2 network as its backbone, which reduces parameters and enhances feature extraction. A bidirectional feature pyramid structure combined with a dual attention mechanism enables efficient information exchange across feature layers, improving multiscale feature fusion. Additionally, the model's loss function is refined with the SIoU loss to boost regression and prediction performance. Experimental results on the PASCAL VOC and MS COCO datasets show that LVD-YOLO outperforms YOLOv5s, achieving a 0.5% increase in accuracy while reducing FLOPs by 64.6% and parameters by 48.6%. These improvements highlight its effectiveness in resource-constrained environments.
Citations: 0
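Of the components listed above, the dual attention mechanism is the easiest to illustrate in isolation. The PyTorch sketch below applies channel attention followed by spatial attention (CBAM-style) to a neck feature map; its exact form and placement inside LVD-YOLO are assumptions.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Channel attention followed by spatial attention, the kind of dual
    mechanism the abstract describes (illustrative sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel_mlp(x)                               # re-weight channels
        pooled = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True).values], dim=1)
        return x * self.spatial(pooled)                           # re-weight spatial positions

feat = torch.randn(1, 128, 40, 40)     # a neck feature map
print(DualAttention(128)(feat).shape)  # torch.Size([1, 128, 40, 40])
```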
CRFormer: A cross-region transformer for shadow removal
IF 4.2 · CAS Tier 3 · Computer Science
Image and Vision Computing · Pub Date: 2024-09-16 · DOI: 10.1016/j.imavis.2024.105273
Abstract: Image shadow removal is a fundamental task in computer vision that aims to restore signals damaged by shadows, thereby improving image quality and scene understanding. Transformers have recently demonstrated strong capabilities in various applications by capturing global pixel interactions, a capability highly desirable for shadow removal. However, applying transformers to shadow removal is non-trivial for two reasons: 1) the patchify operation is unsuitable because shadow shapes are irregular; and 2) shadow removal only requires one-way interaction from the non-shadow region to the shadow region, rather than the usual two-way interactions among all pixels in the image. In this paper, we propose a novel Cross-Region transFormer (CRFormer) for shadow removal that differs from existing transformers by considering only pixel interactions from the non-shadow region to the shadow region, without splitting images into patches. This is achieved by a carefully designed region-aware cross-attention mechanism that aggregates the recovered shadow-region features conditioned on the non-shadow-region features. Extensive experiments on the ISTD, AISTD, SRD, and Video Shadow Removal datasets demonstrate the superiority of our method over other state-of-the-art methods.
Citations: 0
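The one-way, region-aware cross-attention (shadow pixels as queries, non-shadow pixels as keys/values, no patchify) can be sketched directly with nn.MultiheadAttention. The module below illustrates only the interaction pattern; the dimensions and the per-image loop are simplifying assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class RegionCrossAttention(nn.Module):
    """One-way attention: shadow-region pixels (queries) attend only to
    non-shadow pixels (keys/values). Sketch of the idea, not the paper's module."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat, shadow_mask):
        # feat: (B, C, H, W); shadow_mask: (B, 1, H, W) with 1 = shadow
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)                  # (B, H*W, C), no patchify
        is_shadow = shadow_mask.flatten(2).transpose(1, 2)[..., 0] > 0.5
        out = tokens.clone()
        for i in range(b):                                        # per-image variable-length regions
            q = tokens[i, is_shadow[i]].unsqueeze(0)              # shadow queries
            kv = tokens[i, ~is_shadow[i]].unsqueeze(0)            # non-shadow keys/values
            rec, _ = self.attn(q, kv, kv)
            out[i, is_shadow[i]] = rec[0]                         # write back recovered shadow features
        return out.transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(1, 64, 16, 16)
mask = (torch.rand(1, 1, 16, 16) > 0.7).float()
print(RegionCrossAttention()(feat, mask).shape)   # torch.Size([1, 64, 16, 16])
```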
DFEDC: Dual fusion with enhanced deformable convolution for medical image segmentation
IF 4.2 · CAS Tier 3 · Computer Science
Image and Vision Computing · Pub Date: 2024-09-13 · DOI: 10.1016/j.imavis.2024.105277
Abstract: Given the complexity of lesion regions in medical images, current CNN-based research typically employs large-kernel convolutions to expand the receptive field and enhance segmentation quality. However, these convolutions are hindered by substantial computational requirements and a limited capacity to extract contextual and multi-scale information, making it challenging to segment complex regions efficiently. To address this issue, we propose DFEDC, a dual fusion network with enhanced deformable convolution that dynamically adjusts the receptive field and simultaneously integrates multi-scale feature information to effectively segment complex lesion areas and process boundaries. First, we combine global channel and spatial fusion in a serial way, integrating and reusing global channel attention and fully connected layers to achieve lightweight extraction of channel and spatial information. Additionally, we design a structured deformable convolution (SDC) that structures deformable convolution with inceptions and large-kernel attention and enhances the learning of offsets through parallel fusion to efficiently extract multi-scale feature information. To compensate for the spatial information lost by SDC, we introduce a hybrid 2D and 3D feature extraction module that turns feature extraction from a single dimension into a fusion of 2D and 3D. Extensive experimental results on the Synapse, ACDC, and ISIC-2018 datasets demonstrate that the proposed DFEDC achieves superior results.
Citations: 0
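The structured deformable convolution idea, predicting offsets from branches with different receptive fields and fusing them in parallel before a deformable convolution, can be sketched with torchvision's DeformConv2d. The block below is a rough illustration under assumed channel counts, not the paper's SDC module.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SimpleDeformBlock(nn.Module):
    """Deformable convolution whose offsets are predicted by parallel branches
    with different receptive fields and then fused (illustrative sketch)."""
    def __init__(self, channels, k=3):
        super().__init__()
        off_ch = 2 * k * k                                          # (dx, dy) per kernel sample
        self.off_small = nn.Conv2d(channels, off_ch, 3, padding=1)  # local context for offsets
        self.off_large = nn.Conv2d(channels, off_ch, 7, padding=3)  # larger context for offsets
        self.fuse = nn.Conv2d(2 * off_ch, off_ch, 1)                # parallel fusion of offset branches
        self.dcn = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, x):
        offset = self.fuse(torch.cat([self.off_small(x), self.off_large(x)], dim=1))
        return self.dcn(x, offset)

x = torch.randn(1, 32, 64, 64)
print(SimpleDeformBlock(32)(x).shape)   # torch.Size([1, 32, 64, 64])
```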
VLAI: Exploration and Exploitation based on Visual-Language Aligned Information for Robotic Object Goal Navigation
IF 4.2 · CAS Tier 3 · Computer Science
Image and Vision Computing · Pub Date: 2024-09-12 · DOI: 10.1016/j.imavis.2024.105259
Abstract: Object Goal Navigation (ObjectNav) is the task in which an agent must navigate to an instance of a specific category in an unseen environment using visual observations within a limited number of time steps. It plays a significant role in locating specific items in indoor spaces, assisting people in completing various tasks, and providing support for people with disabilities. Efficient ObjectNav in unfamiliar environments requires global perception and an understanding of the spatial and semantic regularities of the environment layout. In this work, we propose an explicit-prediction method called VLAI that utilizes visual-language aligned information to guide the agent's exploration, unlike previous navigation methods based on frontier potential prediction or egocentric map completion, which leverage only visual observations to construct semantic maps and thus fail to give the agent a good global perception. Specifically, when predicting long-term goals, we retrieve previously saved visual observations to obtain visual information around the frontiers based on their positions on the incrementally built, incomplete semantic map. We then apply our Chat Describer to this visual information to obtain detailed descriptions of frontier objects. The Chat Describer, a novel automatic-questioning approach for visual-to-language conversion, combines a Large Language Model (LLM) with a visual-to-language model (VLM) that has visual question-answering functionality. In addition, we compute the semantic similarity between the target object and frontier object categories. By combining the semantic similarity with the frontier descriptions, the agent can predict long-term goals more accurately. Our experiments on the Gibson and HM3D datasets show that VLAI yields significantly better results than earlier methods. The code is released at https://github.com/31539lab/VLAI.
Citations: 0
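One ingredient of the long-term goal prediction is scoring frontiers by the semantic similarity between the target category and the object categories observed near each frontier. The sketch below shows only that scoring step with an assumed text-embedding function (faked here with random vectors just to show the call pattern); the full method additionally uses the LLM-generated frontier descriptions.

```python
import numpy as np

def choose_frontier(target_vec, frontier_infos):
    """Rank frontiers by how semantically close their observed objects are to the
    target category (illustrative sketch of one scoring ingredient)."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    best_xy, best_score = None, -np.inf
    for xy, object_vecs in frontier_infos:            # objects seen near each frontier
        score = max((cosine(target_vec, v) for v in object_vecs), default=-1.0)
        if score > best_score:
            best_xy, best_score = xy, score
    return best_xy, best_score

# `embed` stands in for a text-embedding function (hypothetical here: random vectors).
rng = np.random.default_rng(0)
embed = {name: rng.normal(size=64) for name in ["bed", "pillow", "oven", "sink"]}
frontiers = [((3.2, 1.0), [embed["oven"], embed["sink"]]),
             ((7.5, 4.1), [embed["pillow"]])]
print(choose_frontier(embed["bed"], frontiers))       # frontier coordinates and similarity score
```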