{"title":"DSU-GAN: A robust frontal face recognition approach based on generative adversarial network","authors":"","doi":"10.1016/j.cviu.2024.104128","DOIUrl":"10.1016/j.cviu.2024.104128","url":null,"abstract":"<div><div>Face recognition technology is widely used in different areas, such as entrance guard, payment <em>etc</em>. However, little attention has been given to non-positive faces recognition, especially model training and the quality of the generated images. To this end, a novel robust frontal face recognition approach based on generative adversarial network (DSU-GAN) is proposed in this paper. A mechanism of consistency loss is presented in deformable convolution proposed in the generator-encoder to avoid additional computational overhead and the problem of overfitting. In addition, a self-attention mechanism is presented in generator–encoder to avoid information overloading and construct the long-term dependencies at the pixel level. To balance the capability between the generator and discriminator, a novelf discriminator architecture based U-Net is proposed. Finally, the single-way discriminator is improved through a new up-sampling module. Experiment results demonstrate that our proposal achieves an average Rank-1 recognition rate of 95.14% on the Multi-PIE face dataset in dealing with the multi-pose. In addition, it is proven that our proposal has achieved outstanding performance in recent benchmarks conducted on both IJB-A and IJB-C.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved high dynamic range imaging using multi-scale feature flows balanced between task-orientedness and accuracy","authors":"","doi":"10.1016/j.cviu.2024.104126","DOIUrl":"10.1016/j.cviu.2024.104126","url":null,"abstract":"<div><p>Deep learning has made it possible to accurately generate high dynamic range (HDR) images from multiple images taken at different exposure settings, largely owing to advancements in neural network design. However, generating images without artifacts remains difficult, especially in scenes with moving objects. In such cases, issues like color distortion, geometric misalignment, or ghosting can appear. Current state-of-the-art network designs address this by estimating the optical flow between input images to align them better. The parameters for the flow estimation are learned through the primary goal, producing high-quality HDR images. However, we find that this ”task-oriented flow” approach has its drawbacks, especially in minimizing artifacts. To address this, we introduce a new network design and training method that improve the accuracy of flow estimation. This aims to strike a balance between task-oriented flow and accurate flow. Additionally, the network utilizes multi-scale features extracted from the input images for both flow estimation and HDR image reconstruction. Our experiments demonstrate that these two innovations result in HDR images with fewer artifacts and enhanced quality.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224002078/pdfft?md5=35e8f40c73b01f0b9afae0db47e39486&pid=1-s2.0-S1077314224002078-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142148049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The shading isophotes: Model and methods for Lambertian planes and a point light","authors":"","doi":"10.1016/j.cviu.2024.104135","DOIUrl":"10.1016/j.cviu.2024.104135","url":null,"abstract":"<div><p>Structure-from-Motion (SfM) and Shape-from-Shading (SfS) are complementary classical approaches to 3D vision. Broadly speaking, SfM exploits geometric primitives from textured surfaces and SfS exploits pixel intensity from the shading image. We propose an approach that exploits virtual geometric primitives extracted from the shading image, namely the level-sets, which we name shading isophotes. Our approach thus combines the strength of geometric reasoning with the rich shading information. We focus on the case of untextured Lambertian planes of unknown albedo lit by an unknown Point Light Source (PLS) of unknown intensity. We derive a comprehensive geometric model showing that the unknown scene parameters are in general all recoverable from a single image of at least two planes. We propose computational methods to detect the isophotes, to reconstruct the scene parameters in closed-form and to refine the results densely using pixel intensity. Our methods thus estimate light source, plane pose and camera pose parameters for untextured planes, which cannot be achieved by the existing approaches. We evaluate our model and methods on synthetic and real images.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142089180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DDGPnP: Differential degree graph based PnP solution to handle outliers","authors":"","doi":"10.1016/j.cviu.2024.104130","DOIUrl":"10.1016/j.cviu.2024.104130","url":null,"abstract":"<div><p>Existing external relationships for outlier removal in the perspective-n-point problem are generally spatial coherence among the neighbor correspondences. In the situation of high noise or spatially incoherent distributions, pose estimation is relatively inaccurate due to a small number of detected inliers. To address these problems, this paper explores the globally coherent external relationships for outlier removal and pose estimation. To this end, the differential degree graph (DDG) is proposed to employ the intersection angles between rays of correspondences to handle outliers. Firstly, a pair of two degree graphs are constructed to establish the external relationships between 3D-2D correspondences in the world and camera coordinates. Secondly, the DDG is estimated through subtracting the two degree graphs and operating binary operation with a degree threshold. Besides, this paper mathematically proves that the maximum clique of the DDG represents the inliers. Thirdly, a novel vertice degree based method is put forward to extract the maximum clique from DDG for outlier removal. Besides, this paper proposes a novel pipeline of DDG based PnP solution, i.e. DDGPnP, to achieve accurate pose estimation. Experiments demonstrate the superiority and effectiveness of the proposed method in the aspects of outlier removal and pose estimation by comparison with the state of the arts. Especially for the high noise situation, the DDGPnP method can achieve not only accurate pose but also a large number of correct correspondences.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142136397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual cross perception network with texture and boundary guidance for camouflaged object detection","authors":"","doi":"10.1016/j.cviu.2024.104131","DOIUrl":"10.1016/j.cviu.2024.104131","url":null,"abstract":"<div><p>Camouflaged object detection (COD) is a task needs to segment objects that subtly blend into their surroundings effectively. Edge and texture information of the objects can be utilized to reveal the edges of camouflaged objects and detect texture differences between camouflaged objects and the surrounding environment. However, existing methods often fail to fully exploit the advantages of these two types of information. Considering this, our paper proposes an innovative Dual Cross Perception Network (DCPNet) with texture and boundary guidance for camouflaged object detection. DCPNet consists of two essential modules, namely Dual Cross Fusion Module (DCFM) and the Subgroup Aggregation Module (SAM). DCFM utilizes attention techniques to emphasize the information that exists in edges and textures by cross-fusing features of the edge, texture, and basic RGB image, which strengthens the ability to capture edge information and texture details in image analysis. SAM gives varied weights to low-level and high-level features in order to enhance the comprehension of objects and scenes of various sizes. Several experiments have demonstrated that DCPNet outperforms 13 state-of-the-art methods on four widely used assessment metrics.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142048064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning model for simultaneous recognition of quantitative and qualitative emotion using visual and bio-sensing data","authors":"","doi":"10.1016/j.cviu.2024.104121","DOIUrl":"10.1016/j.cviu.2024.104121","url":null,"abstract":"<div><p>The recognition of emotions heavily relies on important factors such as human facial expressions and physiological signals, including electroencephalogram and electrocardiogram. In literature, emotion recognition is investigated quantitatively (while estimating valance, arousal, and dominance) and qualitatively (while predicting discrete emotions like happiness, sadness, anger, surprise, and so on). Current methods utilize a combination of visual data and bio-sensing information to create recognition systems that incorporate multiple modes (quantitative/qualitative). Nevertheless, these methods necessitate extensive expertise in specific domains and intricate preprocessing procedures, and consequently, they are unable to fully leverage the inherent advantages of end-to-end deep learning techniques. Moreover, methods usually aim to recognize either qualitative or quantitative emotions. Although both kinds of emotions are significantly co-related, previous methods do not simultaneously recognize qualitative and quantitative emotions. In this paper, a novel deep end-to-end framework named DeepVADNet is introduced, specifically designed for the purpose of multi-modal emotion recognition. The proposed framework leverages deep learning techniques to effectively extract crucial face appearance features as well as bio-sensing features, predicting both qualitative and quantitative emotions in a single forward pass. In this study, we employ the CRNN architecture to extract face appearance features, while the ConvLSTM model is utilized to extract spatio-temporal information from visual data (videos). Additionally, we utilize the Conv1D model for processing physiological signals (EEG, EOG, ECG, and GSR) as this approach deviates from conventional manual techniques that involve traditional manual methods for extracting features based on time and frequency domains. After enhancing the feature quality by fusing both modalities, we use a novel method employing quantitative emotion to predict qualitative emotions accurately. We perform extensive experiments on the DEAP and MAHNOB-HCI datasets, achieving state-of-the-art quantitative emotion recognition results of 98.93%/6e-4 and 89.08%/0.97 (mean classification accuracy/MSE) in both datasets, respectively. Also, for the qualitative emotion recognition task, we achieve 82.71% mean classification accuracy on the MAHNOB-HCI dataset. The code and evaluation can be accessed at: <span><span>https://github.com/I-Man-H/DeepVADNet.git</span><svg><path></path></svg></span></p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142089182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio–visual deepfake detection using articulatory representation learning","authors":"","doi":"10.1016/j.cviu.2024.104133","DOIUrl":"10.1016/j.cviu.2024.104133","url":null,"abstract":"<div><p>Advancements in generative artificial intelligence have made it easier to manipulate auditory and visual elements, highlighting the critical need for robust audio–visual deepfake detection methods. In this paper, we propose an articulatory representation-based audio–visual deepfake detection approach, <em>ART-AVDF</em>. First, we devise an audio encoder to extract articulatory features that capture the physical significance of articulation movement, integrating with a lip encoder to explore audio–visual articulatory correspondences in a self-supervised learning manner. Then, we design a multimodal joint fusion module to further explore inherent audio–visual consistency using the articulatory embeddings. Extensive experiments on the DFDC, FakeAVCeleb, and DefakeAVMiT datasets demonstrate that <em>ART-AVDF</em> obtains a significant performance improvement compared to many deepfake detection models.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RSTC: Residual Swin Transformer Cascade to approximate Taylor expansion for image denoising","authors":"","doi":"10.1016/j.cviu.2024.104132","DOIUrl":"10.1016/j.cviu.2024.104132","url":null,"abstract":"<div><p>Traditional denoising methods establish mathematical models by employing different priors, which can achieve preferable results but they are usually time-consuming and their outputs are not adaptive on regularization parameters. While the success of end-to-end deep learning denoising strategies depends on a large amount of data and lacks a theoretical interpretability. In order to address the above problems, this paper proposes a novel image denoising method, namely Residual Swin Transformer Cascade (RSTC), based on Taylor expansion. The key procedures of our RSTC are specified as follows: Firstly, we discuss the relationship between image denoising model and Taylor expansion, as well as its adjacent derivative parts. Secondly, we use a lightweight deformable convolutional neural network to estimate the basic layer of Taylor expansion and a residual network where swin transformer block is selected as a backbone for pursuing the solution of the derivative layer. Finally, the results of the two networks contribute to the approximation solution of Taylor expansion. In the experiments, we firstly test and discuss the selection of network parameters to verify its effectiveness. Then, we compare it with existing advanced methods in terms of visualization and quantification, and the results show that our method has a powerful generalization ability and performs better than state-of-the-art denoising methods on performance improvement and structure preservation.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142048067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep video compression based on Long-range Temporal Context Learning","authors":"","doi":"10.1016/j.cviu.2024.104127","DOIUrl":"10.1016/j.cviu.2024.104127","url":null,"abstract":"<div><p>Video compression allows for efficient storage and transmission of data, benefiting imaging and vision applications, e.g. computational imaging, photography, and displays by delivering high-quality videos. To exploit more informative contexts of video, we propose DVCL, a novel <strong>D</strong>eep <strong>V</strong>ideo <strong>C</strong>ompression based on <strong>L</strong>ong-range Temporal Context Learning. Aiming at high coding performance, this new compression paradigm makes full use of long-range temporal correlations derived from multiple reference frames to learn richer contexts. Motion vectors (MVs) are estimated to represent the motion relations of videos. By employing MVs, a long-range temporal context learning (LTCL) module is presented to extract context information from multiple reference frames, such that a more accurate and informative temporal contexts can be learned and constructed. The long-range temporal contexts serve as conditions and generate the predicted frames by contextual encoder and decoder. To address the challenge of imbalanced training, we develop a multi-stage training strategy to ensure the whole DVCL framework is trained progressively and stably. Extensive experiments demonstrate the proposed DVCL achieves the highest objective and subjective quality, while maintaining relatively low complexity. Specifically, 25.30% and 45.75% bitrate savings on average can be obtained than x265 codec at the same PSNR and MS-SSIM, respectively.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142129056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep unsupervised shadow detection with curriculum learning and self-training","authors":"","doi":"10.1016/j.cviu.2024.104124","DOIUrl":"10.1016/j.cviu.2024.104124","url":null,"abstract":"<div><p>Shadow detection is undergoing a rapid and remarkable development along with the wide use of deep neural networks. Benefiting from a large number of training images annotated with strong pixel-level ground-truth masks, current deep shadow detectors have achieved state-of-the-art performance. However, it is expensive and time-consuming to provide the pixel-level ground-truth mask for each training image. Considering that, this paper proposes the first unsupervised deep shadow detection framework, which consists of an initial pseudo label generation (IPG) module, a curriculum learning (CL) module and a self-training (ST) module. The supervision signals used in our learning framework are generated from several existing traditional unsupervised shadow detectors, which usually contain a lot of noisy information. Therefore, each module in our unsupervised framework is dedicated to reduce the adverse influence of noisy information on model training. Specifically, the IPG module combines different traditional unsupervised shadow maps to obtain their complementary shadow information. After obtaining the initial pseudo labels, the CL module and the ST module will be used in conjunction to gradually learn new shadow patterns and update the qualities of pseudo labels simultaneously. Extensive experimental results on various benchmark datasets demonstrate that our deep shadow detector not only outperforms the traditional unsupervised shadow detection methods by a large margin but also achieves comparable results with some recent state-of-the-art fully-supervised deep shadow detection methods.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}