{"title":"Generative adversarial network for semi-supervised image captioning","authors":"","doi":"10.1016/j.cviu.2024.104199","DOIUrl":"10.1016/j.cviu.2024.104199","url":null,"abstract":"<div><div>Traditional supervised image captioning methods usually rely on a large number of images and paired captions for training. However, the creation of such datasets necessitates considerable temporal and human resources. Therefore, we propose a new semi-supervised image captioning algorithm to solve this problem. The proposed method uses a generative adversarial network to generate images that match captions, and uses these generated images and captions as new training data. This avoids the error accumulation problem when generating pseudo captions with autoregressive method and the network can directly perform backpropagation. At the same time, in order to ensure the correlation between the generated images and captions, we introduced the CLIP model for constraints. The CLIP model has been pre-trained on a large amount of image–text data, so it shows excellent performance in semantic alignment of images and text. To verify the effectiveness of our method, we validate on MSCOCO offline “Karpathy” test split. Experiment results show that our method can significantly improve the performance of the model when using 1% paired data, with the CIDEr score increasing from 69.5% to 77.7%. This shows that our method can effectively utilize unlabeled data for image caption tasks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BundleMoCap++: Efficient, robust and smooth motion capture from sparse multiview videos","authors":"","doi":"10.1016/j.cviu.2024.104190","DOIUrl":"10.1016/j.cviu.2024.104190","url":null,"abstract":"<div><div>Producing smooth and accurate motions from sparse videos without requiring specialized equipment and markers is a long-standing problem in the research community. Most approaches typically involve complex processes such as temporal constraints, multiple stages combining data-driven regression and optimization techniques, and bundle solving over temporal windows. These increase the computational burden and introduce the challenge of hyperparameter tuning for the different objective terms. In contrast, BundleMoCap++ offers a simple yet effective approach to this problem. It solves the motion in a single stage, eliminating the need for temporal smoothness objectives while still delivering smooth motions without compromising accuracy. BundleMoCap++ outperforms the state-of-the-art without increasing complexity. Our approach is based on manifold interpolation between latent keyframes. By relying on a local manifold smoothness assumption and appropriate interpolation schemes, we efficiently solve a bundle of frames using two or more latent codes. Additionally, the method is implemented as a sliding window optimization and requires only the first frame to be properly initialized, reducing the overall computational burden. BundleMoCap++’s strength lies in achieving high-quality motion capture results with fewer computational resources. To do this efficiently, we propose a novel human pose prior that focuses on the geometric aspect of the latent space, modeling it as a hypersphere, allowing for the introduction of sophisticated interpolation techniques. We also propose an algorithm for optimizing the latent variables directly on the learned manifold, improving convergence and performance. Finally, we introduce high-order interpolation techniques adapted for the hypersphere, allowing us to increase the solving temporal window, enhancing performance and efficiency.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142442023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel image inpainting method based on a modified Lengyel–Epstein model","authors":"","doi":"10.1016/j.cviu.2024.104195","DOIUrl":"10.1016/j.cviu.2024.104195","url":null,"abstract":"<div><div>With the increasing popularity of digital images, developing advanced algorithms that can accurately reconstruct damaged images while maintaining high visual quality is crucial. Traditional image restoration algorithms often struggle with complex structures and details, while recent deep learning methods, though effective, face significant challenges related to high data dependency and computational costs. To resolve these challenges, we propose a novel image inpainting model, which is based on a modified Lengyel–Epstein (LE) model. We discretize the modified LE model by using an explicit Euler algorithm. A series of restoration experiments are conducted on various image types, including binary images, grayscale images, index images, and color images. The experimental results demonstrate the effectiveness and robustness of the method, and even under complex conditions of noise interference and local damage, the proposed method can exhibit excellent repair performance. To quantify the fidelity of these restored images, we use the peak signal-to-noise ratio (PSNR), a widely accepted metric in image processing. The calculation results further demonstrate the applicability of our model across different image types. Moreover, by evaluating CPU time, our method can achieve ideal repair results within a remarkably brief duration. The proposed method validates significant potential for real-world applications in diverse domains of image restoration and enhancement.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WGS-YOLO: A real-time object detector based on YOLO framework for autonomous driving","authors":"","doi":"10.1016/j.cviu.2024.104200","DOIUrl":"10.1016/j.cviu.2024.104200","url":null,"abstract":"<div><div>The safety and reliability of autonomous driving depends on the precision and efficiency of object detection systems. In this paper, a refined adaptation of the YOLO architecture (WGS-YOLO) is developed to improve the detection of pedestrians and vehicles. Specifically, its information fusion is enhanced by incorporating the Weighted Efficient Layer Aggregation Network (W-ELAN) module, an innovative dynamic weighted feature fusion module using channel shuffling. Meanwhile, the computational demands and parameters of the proposed WGS-YOLO are significantly reduced by employing the Space-to-Depth Convolution (SPD-Conv) and the Grouped Spatial Pyramid Pooling (GSPP) modules that have been strategically designed. The performance of our model is evaluated with the BDD100k and DAIR-V2X-V datasets. In terms of mean Average Precision (<span><math><msub><mrow><mtext>mAP</mtext></mrow><mrow><mn>0</mn><mo>.</mo><mn>5</mn></mrow></msub></math></span>), the proposed model outperforms the baseline Yolov7 by 12%. Furthermore, extensive experiments are conducted to verify our analysis and the model’s robustness across diverse scenarios.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MultiSubjects: A multi-subject video dataset for single-person basketball action recognition from basketball gym","authors":"","doi":"10.1016/j.cviu.2024.104193","DOIUrl":"10.1016/j.cviu.2024.104193","url":null,"abstract":"<div><div>Computer vision technology is becoming a research focus in the field of basketball. Despite the abundance of datasets centered on basketball games, there remains a significant gap in the availability of a large-scale, multi-subject, and fine-grained dataset for the recognition of basketball actions in real-world sports scenarios, particularly for amateur players. Such datasets are crucial for advancing the application of computer vision tasks in the real world. To address this gap, we deployed multi-view cameras in a civilian basketball gym, constructed a real basketball data acquisition platform, and acquired a challenging multi-subject video dataset, named MultiSubjects. The MultiSubjects v1.0 dataset features a variety of ages, body types, attire, genders, and basketball actions, providing researchers with a high-quality and diverse resource of basketball action data. We collected a total of 1,000 distinct subjects from video data between September and December 2023, classified and labeled three basic basketball actions, and assigned a unique identity ID to each subject, provided a total of 6,144 video clips, 436,460 frames, and labeled 6,144 instances of actions with clear temporal boundaries using 436,460 human body bounding boxes. Additionally, complete frame-wise skeleton keypoint coordinates for the entire action are provided. We used some representative video action recognition algorithms as well as skeleton-based action recognition algorithms on the MultiSubjects v1.0 dataset and analyzed the results. The results confirm that the quality of our dataset surpasses that of popular video action recognition datasets, it also presents that skeleton-based action recognition remains a challenging task. The link to our dataset is: <span><span>https://huggingface.co/datasets/Henu-Software/Henu-MultiSubjects</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Found missing semantics: Supplemental prototype network for few-shot semantic segmentation","authors":"","doi":"10.1016/j.cviu.2024.104191","DOIUrl":"10.1016/j.cviu.2024.104191","url":null,"abstract":"<div><div>Few-shot semantic segmentation alleviates the problem of massive data requirements and high costs in semantic segmentation tasks. By learning from support set, few-shot semantic segmentation can segment new classes. However, existing few-shot semantic segmentation methods suffer from information loss during the process of mask average pooling. To address this problem, we propose a supplemental prototype network (SPNet). The SPNet aggregates the lost information from global prototypes to create a supplemental prototype, which enhances the segmentation performance for the current class. In addition, we utilize mutual attention to enhance the similarity between the support and the query feature maps, allowing the model to better identify the target to be segmented. Finally, we introduce a Self-correcting auxiliary, which utilizes the data more effectively to improve segmentation accuracy. We conducted extensive experiments on PASCAL-5i and COCO-20i, which demonstrated the effectiveness of SPNet. And our method achieved state-of-the-art results in the 1-shot and 5-shot semantic segmentation settings.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient degradation representation learning network for remote sensing image super-resolution","authors":"","doi":"10.1016/j.cviu.2024.104182","DOIUrl":"10.1016/j.cviu.2024.104182","url":null,"abstract":"<div><div>The advancements in convolutional neural networks have led to significant progress in image super-resolution (SR) techniques. Nevertheless, it is crucial to acknowledge that current SR methods operate under the assumption of bicubic downsampling as a degradation factor in low-resolution (LR) images and train models accordingly. However, this approach does not account for the unknown degradation patterns present in real-world scenes. To address this problem, we propose an efficient degradation representation learning network (EDRLN). Specifically, we adopt a contrast learning approach, which enables the model to distinguish and learn various degradation representations in realistic images to obtain critical degradation information. We also introduce streamlined and efficient pixel attention to strengthen the feature extraction capability of the model. In addition, we optimize our model with mutual affine convolution layers instead of ordinary convolution layers to make it more lightweight while minimizing performance loss. Experimental results on remote sensing and benchmark datasets show that our proposed EDRLN exhibits good performance for different degradation scenarios, while the lightweight version minimizes the performance loss as much as possible. The Code will be available at: <span><span>https://github.com/Leilei11111/EDRLN</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distance-based loss function for deep feature space learning of convolutional neural networks","authors":"","doi":"10.1016/j.cviu.2024.104184","DOIUrl":"10.1016/j.cviu.2024.104184","url":null,"abstract":"<div><div>Convolutional Neural Networks (CNNs) have been on the forefront of neural network research in recent years. Their breakthrough performance in fields such as image classification has gathered efforts in the development of new CNN-based architectures, but recently more attention has been directed to the study of new loss functions. Softmax loss remains the most popular loss function due mainly to its efficiency in class separation, but the function is unsatisfactory in terms of intra-class compactness. While some studies have addressed this problem, most solutions attempt to refine softmax loss or combine it with other approaches. We present a novel loss function based on distance matrices (LDMAT), softmax independent, that maximizes interclass distance and minimizes intraclass distance. The loss function operates directly on deep features, allowing their use on arbitrary classifiers. LDMAT minimizes the distance between two distance matrices, one constructed with the model’s deep features and the other calculated from the labels. The use of a distance matrix in the loss function allows a two-dimensional representation of features and imposes a fixed distance between classes, while improving intra-class compactness. A regularization method applied to the distance matrix of labels is also presented, that allows a degree of relaxation of the solution and leads to a better spreading of features in the separation space. Efficient feature extraction was observed on datasets such as MNIST, CIFAR10 and CIFAR100.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient feature reuse distillation network for lightweight image super-resolution","authors":"","doi":"10.1016/j.cviu.2024.104178","DOIUrl":"10.1016/j.cviu.2024.104178","url":null,"abstract":"<div><div>In recent research, single-image super-resolution (SISR) using deep Convolutional Neural Networks (CNN) has seen significant advancements. While previous methods excelled at learning complex mappings between low-resolution (LR) and high-resolution (HR) images, they often required substantial computational and memory resources. We propose the Efficient Feature Reuse Distillation Network (EFRDN) to alleviate these challenges. EFRDN primarily comprises Asymmetric Convolutional Distillation Modules (ACDM), incorporating the Multiple Self-Calibrating Convolution (MSCC) units for spatial and channel feature extraction. It includes an Asymmetric Convolution Residual Block (ACRB) to enhance the skeleton information of the square convolution kernel and a Feature Fusion Lattice Block (FFLB) to convert low-order input signals into higher-order representations. Introducing a Transformer module for global features, we enhance feature reuse and gradient flow, improving model performance and efficiency. Extensive experimental results demonstrate that EFRDN outperforms existing methods in performance while conserving computing and memory resources.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uncertainty guided test-time training for face forgery detection","authors":"","doi":"10.1016/j.cviu.2024.104185","DOIUrl":"10.1016/j.cviu.2024.104185","url":null,"abstract":"<div><div>The rapid development of generative image modeling poses security risks of spreading unreal visual information, even though those techniques make a lot of applications possible in positive aspects. To provide alerts and maintain a secure social environment, forgery detection has been an urgent and crucial solution to deal with this situation and try to avoid any negative effects, especially for human faces, owing to potential severe results when malicious creators spread disinformation widely. In spite of the success of recent works w.r.t. model design and feature engineering, detecting face forgery from novel image creation methods or data distributions remains unresolved, because well-trained models are typically not robust to the distribution shift during test-time. In this work, we aim to alleviate the sensitivity of an existing face forgery detector to new domains, and then boost real-world detection under unknown test situations. In specific, we leverage test examples, selected by uncertainty values, to fine-tune the model before making a final prediction. Therefore, it leads to a test-time training based approach for face forgery detection, that our framework incorporates an uncertainty-driven test sample selection with self-training to adapt a classifier onto target domains. To demonstrate the effectiveness of our framework and compare with previous methods, we conduct extensive experiments on public datasets, including FaceForensics++, Celeb-DF-v2, ForgeryNet and DFDC. Our results clearly show that the proposed framework successfully improves many state-of-the-art methods in terms of better overall performance as well as stronger robustness to novel data distributions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}