{"title":"Semantic-preserved point-based human avatar","authors":"Lixiang Lin, Jianke Zhu","doi":"10.1016/j.cviu.2025.104307","DOIUrl":"10.1016/j.cviu.2025.104307","url":null,"abstract":"<div><div>To enable a realistic experience in AR/VR and digital entertainment, we present the first point-based human avatar model that embodies the entire expressive range of digital humans. Specifically, we employ two MLPs to model pose-dependent deformation and linear blend skinning (LBS) weights. The representation of appearance relies on a decoder and the features attached to each point. In contrast to alternative implicit approaches, the oriented points representation not only provides a more intuitive way to model human avatar animation but also significantly reduces the computational time on both training and inference. Moreover, we propose a novel method to transfer semantic information from the SMPL-X model to the points, which enables a better understanding of human body movements. By leveraging the semantic information of points, we can facilitate virtual try-on and human avatar composition through exchanging the points of the same category across different subjects. Experimental results demonstrate the efficacy of our presented method. Our implementation is publicly available at <span><span>https://github.com/l1346792580123/spa</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104307"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph-based Dense Event Grounding with relative positional encoding","authors":"Jianxiang Dong, Zhaozheng Yin","doi":"10.1016/j.cviu.2024.104257","DOIUrl":"10.1016/j.cviu.2024.104257","url":null,"abstract":"<div><div>Temporal Sentence Grounding (TSG) in videos aims to localize a temporal moment from an untrimmed video that is relevant to a given query sentence. Most existing methods focus on addressing the problem of single sentence grounding. Recently, researchers proposed a new Dense Event Grounding (DEG) problem by extending the single event localization to a multi-event localization, where the temporal moments of multiple events described by multiple sentences are retrieved. In this paper, we introduce an effective proposal-based approach to solve the DEG problem. A Relative Sentence Interaction (RSI) module using graph neural network is proposed to model the event relationship by introducing a temporal relative positional encoding to learn the relative temporal order information between sentences in a dense multi-sentence query. In addition, we design an Event-contextualized Cross-modal Interaction (ECI) module to tackle the lack of global information from other related events when fusing visual and sentence features. Finally, we construct an Event Graph (EG) with intra-event edges and inter-event edges to model the relationship between proposals in the same event and proposals in different events to further refine their representations for final localizations. 
Extensive experiments on ActivityNet-Captions and TACoS datasets show the effectiveness of our solution.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104257"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pruning networks at once via nuclear norm-based regularization and bi-level optimization","authors":"Donghyeon Lee , Eunho Lee , Jaehyuk Kang, Youngbae Hwang","doi":"10.1016/j.cviu.2024.104247","DOIUrl":"10.1016/j.cviu.2024.104247","url":null,"abstract":"<div><div>Most network pruning methods focus on identifying redundant channels from pre-trained models, which is inefficient due to its three-step process: pre-training, pruning and fine-tuning, and reconfiguration. In this paper, we propose a pruning-from-scratch framework that unifies these processes into a single approach. We introduce nuclear norm-based regularization to maintain the representational capacity of large networks during pruning. Combining this with MACs-based regularization enhances the performance of the pruned network at the target compression rate. Our bi-level optimization approach simultaneously improves pruning efficiency and representation capacity. Experimental results show that our method achieves 75.4% accuracy on ImageNet without a pre-trained network, using only 41% of the original model’s computational cost. It also attains 0.5% higher performance in compressing the SSD network for object detection. Furthermore, we analyze the effects of nuclear norm-based regularization.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104247"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adversarial intensity awareness for robust object detection","authors":"Jikang Cheng, Baojin Huang, Yan Fang, Zhen Han, Zhongyuan Wang","doi":"10.1016/j.cviu.2024.104252","DOIUrl":"10.1016/j.cviu.2024.104252","url":null,"abstract":"<div><div>Like other computer vision models, object detectors are vulnerable to adversarial examples (AEs) containing imperceptible perturbations. These AEs can be generated with multiple intensities and then used to attack object detectors in real-world scenarios. One of the most effective ways to improve the robustness of object detectors is adversarial training (AT), which incorporates AEs into the training process. However, while previous AT-based models have shown certain robustness against adversarial attacks of a pre-specified intensity, they still struggle to maintain robustness when defending against adversarial attacks with multiple intensities. To address this issue, we propose a novel robust object detection method based on adversarial intensity awareness. We first explore potential schema to define the relationship between the neglected intensity information and actual evaluation metrics in AT. Then, we propose the sequential intensity loss (SI Loss) to represent and leverage the neglected intensity information in the AEs. Specifically, SI Loss deploys a sequential adaptive strategy to transform intensity into concrete learnable metrics in a discrete and cumulative manner. Additionally, a boundary smoothing algorithm is introduced to mitigate the influence of some particular AEs that are difficult to assign to a specific intensity level. 
Extensive experiments on PASCAL VOC and MS-COCO datasets substantially demonstrate the superior performance of our method over other defense methods against multi-intensity adversarial attacks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104252"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint Generating Terminal Correction Imaging method for modular LED integral imaging systems","authors":"Tianshu Li, Shigang Wang","doi":"10.1016/j.cviu.2025.104279","DOIUrl":"10.1016/j.cviu.2025.104279","url":null,"abstract":"<div><div>Integral imaging has garnered significant attention in 3D display technology due to its potential for high-quality visualization. However, elemental images in integral imaging systems usually suffer from misalignment due to the mechanical or human-induced assembly within the lens arrays, leading to undesirable display quality. This paper introduces a novel Joint-Generating Terminal Correction Imaging (JGTCI) approach tailored for large-scale, modular LED integral imaging systems to address the misalignment between the optical centers of physical lens arrays and the camera in generated elemental image arrays. Specifically, we propose: (1) a high-sensitivity calibration marker to enhance alignment precision by accurately matching lens centers to the central points of elemental images; (2) a partitioned calibration strategy that supports independent calibration of display sections, enabling seamless system expansion without recalibrating previously adjusted regions; and (3) a calibration setup where markers are strategically placed near the lens focal length, ensuring optimal pixel coverage in the camera frame for improved accuracy. 
Extensive experimental results demonstrate that our JGTCI approach significantly enhances 3D display accuracy, extends the viewing angle, and improves the scalability and practicality of modular integral imaging systems, outperforming recent state-of-the-art methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104279"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic scene understanding through advanced object context analysis in image","authors":"Luis Hernando Ríos González , Sebastián López Flórez , Alfonso González-Briones , Fernando de la Prieta","doi":"10.1016/j.cviu.2025.104299","DOIUrl":"10.1016/j.cviu.2025.104299","url":null,"abstract":"<div><div>Advancements in computer vision have primarily concentrated on interpreting visual data, often overlooking the significance of contextual differences across various regions within images. In contrast, our research introduces a model for indoor scene recognition that pivots towards the ‘attention’ paradigm. This model views attention as a response to the stimulus image properties, suggesting that focus is ‘pulled’ towards the most visually salient zones within an image, as represented in a saliency map. Attention is directed towards these zones based on uninterpreted semantic features of the image, such as luminance contrast, color, shape, and edge orientation. This neurobiologically plausible and computationally tractable approach offers a more nuanced understanding of scenes by prioritizing zones solely based on their image properties. The proposed model enhances scene understanding through an in-depth analysis of the object context in images. Scene recognition is achieved by extracting features from selected regions of interest within individual image frames using patch-based object detection techniques, thus generating distinctive feature descriptors for the identified objects of interest. The resulting feature descriptors are then subjected to semantic embedding, which uses distributed representations to transform the sparse feature vectors into dense semantic vectors within a learned latent space. This enables subsequent classification tasks by machine learning models trained on embedded semantic representations. 
This model was evaluated on three image datasets: UIUC Sports-8, PASCAL VOC - Visual Object Classes, and a proprietary image set created by the authors. Compared to state-of-the-art methods, the model offers a more robust approach to the abstraction and generalization of interior scenes and achieves higher accuracy than existing models, improving scene classification in the selected indoor environments. Our code is published here: <span><span>https://github.com/sebastianlop8/Semantic-Scene-Object-Context-Analysis.git</span><svg><path></path></svg></span></div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104299"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-supervised vision transformers for semantic segmentation","authors":"Xianfan Gu , Yingdong Hu , Chuan Wen , Yang Gao","doi":"10.1016/j.cviu.2024.104272","DOIUrl":"10.1016/j.cviu.2024.104272","url":null,"abstract":"<div><div>Semantic segmentation is a fundamental task in computer vision and it is a building block of many other vision applications. Nevertheless, semantic segmentation annotations are extremely expensive to collect, so using pre-training to alleviate the need for a large number of labeled samples is appealing. Recently, self-supervised learning (SSL) has shown effectiveness in extracting strong representations and has been widely applied to a variety of downstream tasks. However, most works perform sub-optimally in semantic segmentation because they ignore the specific properties of segmentation: (i) the need for pixel-level fine-grained understanding; (ii) the need for global context understanding; (iii) the need to achieve both of the above with a dense self-supervisory signal. Based on these key factors, we introduce a systematic self-supervised pre-training framework for semantic segmentation, which consists of a hierarchical encoder–decoder architecture MEVT for generating high-resolution features with global contextual information propagation and a self-supervised training strategy for learning fine-grained semantic features. In our study, our framework shows competitive performance compared with other main self-supervised pre-training methods for semantic segmentation on COCO-Stuff, ADE20K, PASCAL VOC, and Cityscapes datasets. 
For example, MEVT leads by +1.3 mIoU in linear probing on PASCAL VOC.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104272"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RS3Lip: Consistency for remote sensing image classification on part embeddings using self-supervised learning and CLIP","authors":"Ankit Jha , Mainak Singha , Avigyan Bhattacharya , Biplab Banerjee","doi":"10.1016/j.cviu.2024.104254","DOIUrl":"10.1016/j.cviu.2024.104254","url":null,"abstract":"<div><div>Tackling domain and class generalization challenges remains a significant hurdle in the realm of remote sensing (RS). Recently, large-scale pre-trained vision-language models (VLMs), exemplified by CLIP, have showcased impressive zero-shot and few-shot generalization capabilities through extensive contrastive training. Existing literature emphasizes prompt learning as a means of enriching prompts with both domain and content information, particularly through smaller learnable projectors, thereby addressing multi-domain data challenges perceptibly. Along with this, it is observed that CLIP’s vision encoder fails to generalize well on the puzzled or corrupted RS images. In response, we propose a novel solution utilizing self-supervised learning (SSL) to ensure consistency for puzzled RS images in domain generalization (DG). This approach strengthens visual features, facilitating the generation of domain-invariant prompts. Our proposed RS<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>Lip, trained with small projectors featuring few layers, complements the pre-trained CLIP. It incorporates SSL and inpainting losses for visual features, along with a consistency loss between the features of SSL tasks and textual features. 
Empirical findings demonstrate that RS<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>Lip consistently outperforms state-of-the-art prompt learning methods across five benchmark optical remote sensing datasets, achieving improvements of at least 1.3% in domain and class generalization tasks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104254"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive semantic guidance network for video captioning","authors":"Yuanyuan Liu , Hong Zhu , Zhong Wu , Sen Du , Shuning Wu , Jing Shi","doi":"10.1016/j.cviu.2024.104255","DOIUrl":"10.1016/j.cviu.2024.104255","url":null,"abstract":"<div><div>Video captioning aims to describe video content using natural language, and effectively integrating visual and textual information is crucial for generating accurate captions. However, we find that existing methods over-rely on language priors acquired from text during training, so the model tends to output high-frequency fixed phrases. To solve this problem, we extract high-quality semantic information from multi-modal input and then build a semantic guidance mechanism that adapts the contributions of visual and textual semantics to caption generation. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. The ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploring the rich semantic information in the visual and textual inputs. The ACD dynamically adjusts the contribution weights of visual and textual semantics for word generation, guiding the model to adaptively focus on the correct semantic information. These two modules work together to help the model overcome the problem of over-reliance on language priors, resulting in more accurate video captions. Finally, we conducted extensive experiments on commonly used video captioning datasets. Our method reaches state-of-the-art results on MSVD and MSR-VTT and also performs well on YouCookII. 
These experiments fully verified the advantages of our method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104255"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}