Dual cross-enhancement network for highly accurate dichotomous image segmentation
Hongbo Bi, Yuyu Tong, Pan Zhang, Jiayuan Zhang, Cong Zhang
Computer Vision and Image Understanding, vol. 248, Article 104122. DOI: 10.1016/j.cviu.2024.104122. Published 2024-09-02.
Abstract: Existing image segmentation tasks mainly focus on segmenting objects with specific characteristics, such as salient, camouflaged, and meticulous objects. However, research on highly accurate Dichotomous Image Segmentation (DIS), which combines these tasks, has only just started and still faces problems such as insufficient information interaction between layers and incomplete integration of high-level semantic information with low-level detailed features. In this paper, a new dual cross-enhancement network (DCENet) for highly accurate DIS is proposed, which mainly consists of two new modules: a cross-scaling guidance (CSG) module and a semantic cross-transplantation (SCT) module. Specifically, the CSG module adopts an adjacent-layer cross-scaling guidance strategy that efficiently fuses the multi-scale features extracted from adjacent layers, while the SCT module uses dual-branch features to complement each other: through transplantation, the high-level semantic information of the low-resolution branch guides the low-level detail features of the high-resolution branch, and the features of the two resolution branches are effectively fused. Finally, experimental results on the challenging DIS5K benchmark dataset show that the proposed network outperforms 9 state-of-the-art (SOTA) networks on 5 widely used evaluation metrics. Ablation experiments further demonstrate the effectiveness of the cross-scaling guidance and semantic cross-transplantation modules.

VADS: Visuo-Adaptive DualStrike attack on visual question answer
Boyuan Zhang, Jiaxu Li, Yucheng Shi, Yahong Han, Qinghua Hu
Computer Vision and Image Understanding, vol. 249, Article 104137. DOI: 10.1016/j.cviu.2024.104137. Published 2024-08-31.
Abstract: Visual Question Answering (VQA) is a fundamental task at the intersection of computer vision and natural language processing. The adversarial vulnerability of VQA models is crucial for their reliability in real-world applications. However, current VQA attacks mainly target the white-box and transfer-based settings, which require the attacker to have full or partial prior knowledge of the victim VQA model. In addition, query-based VQA attacks require a massive number of queries, which the victim model may detect. In this paper, we propose the Visuo-Adaptive DualStrike (VADS) attack, a novel adversarial attack method that combines transfer-based and query-based strategies to exploit vulnerabilities in VQA systems. Unlike current VQA attacks that focus on either approach, VADS leverages a momentum-like ensemble method to search for potential attack targets and compress the perturbation. Our method then employs a query-based strategy to dynamically adjust the weight of the perturbation for each surrogate model. We evaluate the effectiveness of VADS across 8 VQA models and two datasets. The results demonstrate that VADS outperforms existing adversarial techniques in both efficiency and success rate. Our code is available at https://github.com/stevenzhang9577/VADS.

Symmetrical Siamese Network for pose-guided person synthesis
Quanwei Yang, Lingyun Yu, Fengyuan Liu, Yun Song, Meng Shao, Guoqing Jin, Hongtao Xie
Computer Vision and Image Understanding, vol. 248, Article 104134. DOI: 10.1016/j.cviu.2024.104134. Published 2024-08-28.
Abstract: Pose-Guided Person Image Synthesis (PGPIS) aims to generate a realistic person image that preserves the appearance of the source person while adopting the target pose. Varied appearances and drastic pose changes make this task highly challenging. Because existing models make insufficient use of paired data, they struggle to accurately preserve the source appearance details and high-frequency textures in the generated images. Meanwhile, although popular AdaIN-based methods handle drastic pose changes well, they struggle to capture diverse clothing shapes because they rely only on global feature statistics. To address these issues, we propose a novel Symmetrical Siamese Network (SSNet) for PGPIS, which consists of two synergistic symmetrical generative branches that leverage prior knowledge of paired data to comprehensively exploit appearance details. For feature integration, we propose a Style Matching Module (SMM) that transfers multi-level region appearance styles and gradient information to the desired pose to enrich high-frequency textures. Furthermore, to overcome the limitation of global feature statistics, a Spatial Attention Module (SAM) is introduced to complement the SMM in capturing clothing shapes. Extensive experiments show the effectiveness of SSNet, which achieves state-of-the-art results on public datasets. Moreover, SSNet can also edit source appearance attributes, making it versatile in wider application scenarios.

An egocentric video and eye-tracking dataset for visual search in convenience stores
Yinan Wang, Sansitha Panchadsaram, Rezvan Sherkati, James J. Clark
Computer Vision and Image Understanding, vol. 248, Article 104129. DOI: 10.1016/j.cviu.2024.104129. Published 2024-08-28. Open access: https://www.sciencedirect.com/science/article/pii/S1077314224002108/pdfft?md5=dfee816a569ed0f626ef6b190cabb0bc&pid=1-s2.0-S1077314224002108-main.pdf
Abstract: We introduce an egocentric video and eye-tracking dataset comprising 108 first-person videos of 36 shoppers searching for three different products (orange juice, KitKat chocolate bars, and canned tuna) in a convenience store, along with frame-centered eye fixation locations for each video frame. The dataset also includes demographic information about each participant in the form of an 11-question survey. The paper describes two applications of the dataset: an analysis of eye fixations during search in the store, and the training of a clustered saliency model for predicting the saliency of viewers engaged in product search in the store. The fixation analysis shows that fixation duration statistics are very similar to those found in image and video viewing, suggesting that similar visual processing is employed during search in 3D environments and during viewing of imagery on computer screens. A clustering technique applied to the questionnaire data detected two clusters. Based on these clusters, personalized saliency prediction models were trained on the store fixation data, which improved saliency prediction on the store video data compared to state-of-the-art universal saliency prediction methods.

{"title":"URINet: Unsupervised point cloud rotation invariant representation learning via semantic and structural reasoning","authors":"Qiuxia Wu, Kunming Su","doi":"10.1016/j.cviu.2024.104136","DOIUrl":"10.1016/j.cviu.2024.104136","url":null,"abstract":"<div><p>In recent years, many rotation-invariant networks have been proposed to alleviate the interference caused by point cloud arbitrary rotations. These networks have demonstrated powerful representation learning capabilities. However, most of those methods rely on costly manually annotated supervision for model training. Moreover, they fail to reason the structural relations and lose global information. To address these issues, we present an unsupervised method for achieving comprehensive rotation invariant representations without human annotation. Specifically, we propose a novel encoder–decoder architecture named URINet, which learns a point cloud representation by combining local semantic and global structural information, and then reconstructs the input without rotation perturbation. In detail, the encoder is a two-branch network where the graph convolution based structural branch models the relationships among local regions to learn global structural knowledge and the semantic branch learns rotation invariant local semantic features. The two branches derive complementary information and explore the point clouds comprehensively. Furthermore, to avoid the self-reconstruction ambiguity brought by uncertain poses, a bidirectional alignment is proposed to measure the quality of reconstruction results without orientation knowledge. Extensive experiments on downstream tasks show that the proposed method significantly surpasses existing state-of-the-art methods on both synthetic and real-world datasets.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104136"},"PeriodicalIF":4.3,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142129055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DSU-GAN: A robust frontal face recognition approach based on generative adversarial network
Deyu Lin, Huanxin Wang, Xin Lei, Weidong Min, Chenguang Yao, Yuan Zhong, Yong Liang Guan
Computer Vision and Image Understanding, vol. 249, Article 104128. DOI: 10.1016/j.cviu.2024.104128. Published 2024-08-26.
Abstract: Face recognition technology is widely used in areas such as access control and payment. However, little attention has been paid to recognizing non-frontal faces, in particular to model training and the quality of the generated images. To this end, a novel robust frontal face recognition approach based on a generative adversarial network (DSU-GAN) is proposed in this paper. A consistency-loss mechanism is introduced into the deformable convolution of the generator-encoder to avoid additional computational overhead and overfitting. In addition, a self-attention mechanism is employed in the generator-encoder to avoid information overload and to construct long-term dependencies at the pixel level. To balance the capabilities of the generator and discriminator, a novel U-Net-based discriminator architecture is proposed. Finally, the single-way discriminator is improved through a new up-sampling module. Experimental results demonstrate that our approach achieves an average Rank-1 recognition rate of 95.14% on the Multi-PIE face dataset under multi-pose conditions. It also achieves outstanding performance on recent benchmarks conducted on both IJB-A and IJB-C.

{"title":"Improved high dynamic range imaging using multi-scale feature flows balanced between task-orientedness and accuracy","authors":"Qian Ye , Masanori Suganuma , Takayuki Okatani","doi":"10.1016/j.cviu.2024.104126","DOIUrl":"10.1016/j.cviu.2024.104126","url":null,"abstract":"<div><p>Deep learning has made it possible to accurately generate high dynamic range (HDR) images from multiple images taken at different exposure settings, largely owing to advancements in neural network design. However, generating images without artifacts remains difficult, especially in scenes with moving objects. In such cases, issues like color distortion, geometric misalignment, or ghosting can appear. Current state-of-the-art network designs address this by estimating the optical flow between input images to align them better. The parameters for the flow estimation are learned through the primary goal, producing high-quality HDR images. However, we find that this ”task-oriented flow” approach has its drawbacks, especially in minimizing artifacts. To address this, we introduce a new network design and training method that improve the accuracy of flow estimation. This aims to strike a balance between task-oriented flow and accurate flow. Additionally, the network utilizes multi-scale features extracted from the input images for both flow estimation and HDR image reconstruction. Our experiments demonstrate that these two innovations result in HDR images with fewer artifacts and enhanced quality.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104126"},"PeriodicalIF":4.3,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224002078/pdfft?md5=35e8f40c73b01f0b9afae0db47e39486&pid=1-s2.0-S1077314224002078-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142148049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The shading isophotes: Model and methods for Lambertian planes and a point light","authors":"Damien Mariyanayagam, Adrien Bartoli","doi":"10.1016/j.cviu.2024.104135","DOIUrl":"10.1016/j.cviu.2024.104135","url":null,"abstract":"<div><p>Structure-from-Motion (SfM) and Shape-from-Shading (SfS) are complementary classical approaches to 3D vision. Broadly speaking, SfM exploits geometric primitives from textured surfaces and SfS exploits pixel intensity from the shading image. We propose an approach that exploits virtual geometric primitives extracted from the shading image, namely the level-sets, which we name shading isophotes. Our approach thus combines the strength of geometric reasoning with the rich shading information. We focus on the case of untextured Lambertian planes of unknown albedo lit by an unknown Point Light Source (PLS) of unknown intensity. We derive a comprehensive geometric model showing that the unknown scene parameters are in general all recoverable from a single image of at least two planes. We propose computational methods to detect the isophotes, to reconstruct the scene parameters in closed-form and to refine the results densely using pixel intensity. Our methods thus estimate light source, plane pose and camera pose parameters for untextured planes, which cannot be achieved by the existing approaches. We evaluate our model and methods on synthetic and real images.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104135"},"PeriodicalIF":4.3,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142089180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DDGPnP: Differential degree graph based PnP solution to handle outliers
Zhichao Cui, Zeqi Chen, Chi Zhang, Gaofeng Meng, Yuehu Liu, Xiangmo Zhao
Computer Vision and Image Understanding, vol. 248, Article 104130. DOI: 10.1016/j.cviu.2024.104130. Published 2024-08-23.
Abstract: Existing external relationships used for outlier removal in the perspective-n-point (PnP) problem generally capture spatial coherence among neighboring correspondences. Under high noise or spatially incoherent distributions, pose estimation is relatively inaccurate because only a small number of inliers are detected. To address these problems, this paper exploits globally coherent external relationships for outlier removal and pose estimation. To this end, the differential degree graph (DDG) is proposed, which uses the intersection angles between the rays of correspondences to handle outliers. First, a pair of degree graphs is constructed to establish the external relationships between 3D-2D correspondences in the world and camera coordinates. Second, the DDG is obtained by subtracting the two degree graphs and binarizing the result with a degree threshold. This paper also mathematically proves that the maximum clique of the DDG represents the inliers. Third, a novel vertex-degree-based method is put forward to extract the maximum clique of the DDG for outlier removal. Finally, this paper proposes a novel pipeline, the DDG-based PnP solution (DDGPnP), to achieve accurate pose estimation. Experiments demonstrate the superiority and effectiveness of the proposed method for outlier removal and pose estimation in comparison with the state of the art. In particular, under high noise the DDGPnP method achieves not only an accurate pose but also a large number of correct correspondences.

Dual cross perception network with texture and boundary guidance for camouflaged object detection
Yaming Wang, Jiatong Chen, Xian Fang, Mingfeng Jiang, Jianhua Ma
Computer Vision and Image Understanding, vol. 248, Article 104131. DOI: 10.1016/j.cviu.2024.104131. Published 2024-08-22.
Abstract: Camouflaged object detection (COD) is a task that requires effectively segmenting objects that subtly blend into their surroundings. Edge and texture information can be utilized to reveal the edges of camouflaged objects and to detect texture differences between camouflaged objects and the surrounding environment. However, existing methods often fail to fully exploit the advantages of these two types of information. Considering this, our paper proposes an innovative Dual Cross Perception Network (DCPNet) with texture and boundary guidance for camouflaged object detection. DCPNet consists of two essential modules, the Dual Cross Fusion Module (DCFM) and the Subgroup Aggregation Module (SAM). DCFM utilizes attention techniques to emphasize the information contained in edges and textures by cross-fusing features of the edge, texture, and basic RGB image, which strengthens the ability to capture edge information and texture details in image analysis. SAM assigns varied weights to low-level and high-level features in order to enhance the comprehension of objects and scenes of various sizes. Several experiments demonstrate that DCPNet outperforms 13 state-of-the-art methods on four widely used assessment metrics.
