Latest Articles: ACM Transactions on Multimedia Computing, Communications and Applications

Facial soft-biometrics obfuscation through adversarial attacks
IF 5.1 · CAS Tier 3 · Computer Science
Vincenzo Carletti, Pasquale Foggia, Antonio Greco, Alessia Saggese, Mario Vento
DOI: 10.1145/3656474 · Published 2024-04-06

Abstract: Sharing facial pictures through online services, especially on social networks, has become a common habit for thousands of users. This practice hides a possible threat to privacy: the owners of such services, as well as malicious users, could automatically extract information from faces using modern and effective neural networks. In this paper, we propose a harmless use of adversarial attacks, i.e. variations of images that are almost imperceptible to the human eye and that are typically generated with the malicious purpose of misleading Convolutional Neural Networks (CNNs). Here, such attacks are instead adopted to (i) obfuscate soft biometrics (gender, age, ethnicity) while (ii) not degrading the quality of the face images posted online. We achieve these two conflicting goals by modifying the implementations of four of the most popular adversarial attacks, namely FGSM, PGD, DeepFool and C&W, so as to constrain the average amount of noise they generate on the image and the maximum perturbation they add to a single pixel. We demonstrate, in an experimental framework including three popular CNNs (VGG16, SENet and MobileNetV3), that the considered obfuscation method, which requires at most four seconds per image, is effective not only when the neural network that extracts the soft biometrics is fully known (white-box attacks), but also when the adversarial attacks are generated in a more realistic black-box scenario. Finally, we show that an opponent can implement defense techniques to partially reduce the effect of the obfuscation, but only at a substantial cost in accuracy on clean images; this result, confirmed by experiments with three popular defense methods (adversarial training, denoising autoencoder and Kullback-Leibler autoencoder), shows that mounting a defense is not worthwhile for the opponent and that the proposed approach is robust to defenses.

Citations: 0
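The two constraints described in the abstract (a cap on the per-pixel perturbation and on the average noise) can be sketched for the FGSM case as below. This is an illustrative reconstruction under assumed budgets, not the authors' implementation; all names and values are hypothetical.

```python
import numpy as np

def constrained_fgsm_step(image, grad, eps_pixel=0.03, eps_avg=0.01):
    """One FGSM-style step whose perturbation is capped both per pixel
    (max |delta| <= eps_pixel) and on average (mean |delta| <= eps_avg).
    `image` and `grad` are float arrays in [0, 1]; values are illustrative."""
    delta = eps_pixel * np.sign(grad)          # classic FGSM perturbation
    mean_abs = np.abs(delta).mean()
    if mean_abs > eps_avg:                     # rescale to meet the average-noise budget
        delta *= eps_avg / mean_abs
    return np.clip(image + delta, 0.0, 1.0)   # keep a valid image
```

Rescaling (rather than clipping) preserves the direction of the attack while honoring the average-noise budget that keeps the picture visually unchanged.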
MEDUSA: A Dynamic Codec Switching Approach in HTTP Adaptive Streaming
IF 5.1 · CAS Tier 3 · Computer Science
Daniele Lorenzi, Farzad Tashtarian, Hermann Hellwagner, Christian Timmerer
DOI: 10.1145/3656175 · Published 2024-04-05

Abstract: HTTP Adaptive Streaming (HAS) solutions utilize various Adaptive BitRate (ABR) algorithms to dynamically select appropriate video representations, aiming to adapt to fluctuations in network bandwidth. However, current ABR implementations are designed to work with a single set of video representations, i.e., the bitrate ladder, whose entries differ in bitrate and resolution but are encoded with the same video codec. When multiple codecs are available, current ABR algorithms select one of them before the streaming session and stick with it for its entire duration. Although newer codecs are generally preferred over older ones, their compression efficiencies differ depending on the content's complexity, which varies over time. It is therefore necessary to select the appropriate codec for each video segment to reduce the requested data while delivering the highest possible quality. In this paper, we first provide a practical example comparing the compression efficiencies of different codecs on a set of video sequences. Based on this analysis, we formulate the optimization problem of selecting the appropriate codec for each user and video segment (down to a per-segment basis in the extreme case), refining the selection of the ABR algorithms by exploiting key metrics such as the perceived segment quality and size. Subsequently, to address the scalability issues of this centralized model, we introduce MEDUSA, a novel distributed plug-in ABR algorithm for Video on Demand (VoD) applications deployed on top of existing ABR algorithms. MEDUSA enhances the user's Quality of Experience (QoE) by utilizing a multi-objective function that considers both the quality and the size of video segments when selecting the next representation. Using quality information and segment sizes from the modified Media Presentation Description (MPD), MEDUSA leverages buffer occupancy to prioritize quality or size by assigning specific weights in the objective function. To show the impact of MEDUSA, we compare the proposed plug-in approach on top of state-of-the-art techniques with their original implementations and analyze the results for different network traces, video content, and buffer capacities. According to the experimental findings, MEDUSA improves QoE for various test videos and scenarios, with gains in the QoE score of up to 42% according to the ITU-T P.1203 model (mode 0). Additionally, MEDUSA can reduce the transmitted data volume by more than 40% while achieving a QoE similar to the compared techniques, reducing delivery costs for streaming service providers.

Citations: 0
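A minimal sketch of the buffer-weighted quality-vs-size trade-off the abstract describes: when the buffer is full, weight shifts toward quality; when it is low, toward smaller segments. Field names, normalization, and the weighting scheme are assumptions for illustration, not MEDUSA's actual objective.

```python
def select_representation(reps, buffer_level, buffer_max):
    """Pick the representation maximizing a weighted quality-vs-size score.
    `reps` is a list of dicts with 'quality' (a 0-100 quality score) and
    'size' (bytes); buffer occupancy sets the weights: a full buffer favors
    quality, a draining one favors small segments. Names are illustrative."""
    w_q = buffer_level / buffer_max            # full buffer -> favor quality
    w_s = 1.0 - w_q                            # low buffer  -> favor small segments
    max_size = max(r["size"] for r in reps)    # normalize sizes to [0, 1]

    def score(r):
        return w_q * (r["quality"] / 100.0) - w_s * (r["size"] / max_size)

    return max(reps, key=score)
```

Because every codec's representations can sit in the same candidate list, the same scoring naturally performs per-segment codec switching.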
Heterogeneous Fusion and Integrity Learning Network for RGB-D Salient Object Detection
IF 5.1 · CAS Tier 3 · Computer Science
Haoran Gao, Yiming Su, Fasheng Wang, Haojie Li
DOI: 10.1145/3656476 · Published 2024-04-05

Abstract: While significant progress has been made in recent years in the field of salient object detection (SOD), limitations remain in heterogeneous modality fusion and salient feature integrity learning. The former is primarily attributed to the scant attention researchers have paid to fusing cross-scale information between different modalities when processing multi-modal heterogeneous data, coupled with the absence of methods for adaptively controlling their respective contributions. The latter stems from the shortcomings of existing approaches in predicting the integrity of the salient region. To address these problems, we propose a Heterogeneous Fusion and Integrity Learning Network for RGB-D salient object detection, denoted HFIL-Net. For the first challenge, we design an Advanced Semantic Guidance Aggregation (ASGA) module, which utilizes three fusion blocks to aggregate three types of information: within-scale cross-modal, within-modal cross-scale, and cross-modal cross-scale. In addition, we embed local fusion factor matrices in the ASGA module and utilize global fusion factor matrices in the Multi-modal Information Adaptive Fusion (MIAF) module to adaptively control the contributions from different perspectives during fusion. For the second issue, we introduce the Feature Integrity Learning and Refinement (FILR) module. It leverages the idea of "part-whole" relationships from capsule networks to learn feature integrity and further refines the learned features through attention mechanisms. Extensive experimental results demonstrate that our proposed HFIL-Net outperforms 17 state-of-the-art (SOTA) detection methods on seven challenging standard datasets. Code and results are available at https://github.com/BojueGao/HFIL-Net.

Citations: 0
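The idea of fusion factor matrices adaptively controlling each modality's contribution can be illustrated with a toy gated blend. The sigmoid gating, shapes, and names here are assumptions for illustration, not HFIL-Net's actual design.

```python
import numpy as np

def adaptive_fusion(rgb_feat, depth_feat, factor):
    """Blend RGB and depth features with an element-wise fusion-factor
    matrix, squashed to (0, 1) by a sigmoid so the two modalities'
    contributions are weighted adaptively at every position. In a network
    `factor` would be learned; here it is just an input."""
    gate = 1.0 / (1.0 + np.exp(-factor))       # sigmoid gate in (0, 1)
    return gate * rgb_feat + (1.0 - gate) * depth_feat
```

A zero factor gives an even 50/50 blend; a strongly positive factor lets the RGB stream dominate at that position, which is the per-location control the fusion factors provide.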
Multi-Domain Image-to-Image Translation with Cross-Granularity Contrastive Learning
IF 5.1 · CAS Tier 3 · Computer Science
Huiyuan Fu, Jin Liu, Ting Yu, Xin Wang, Huadong Ma
DOI: 10.1145/3656048 · Published 2024-04-04

Abstract: The objective of multi-domain image-to-image translation is to learn the mapping from a source domain to a target domain across multiple image domains while preserving the content representation of the source domain. Despite its importance and recent efforts, most previous studies disregard the large style discrepancy between images and instances in various domains, or fail to capture instance details and boundaries properly, resulting in poor translation results for rich scenes. To address these problems, we present an effective architecture for multi-domain image-to-image translation that requires only one generator. Specifically, we provide detailed procedures for capturing instance features throughout the learning process, and for learning the relationship between the style of the global image and that of a local instance by enforcing cross-granularity consistency. To capture local details within the content space, we employ a dual contrastive learning strategy that operates at both the instance and patch levels. Extensive studies on different multi-domain image-to-image translation datasets reveal that our proposed method outperforms state-of-the-art approaches.

Citations: 0
Universal Relocalizer for Weakly Supervised Referring Expression Grounding
IF 5.1 · CAS Tier 3 · Computer Science
Panpan Zhang, Meng Liu, Xuemeng Song, Da Cao, Zan Gao, Liqiang Nie
DOI: 10.1145/3656045 · Published 2024-04-04

Abstract: This paper introduces the Universal Relocalizer, a novel approach for weakly supervised referring expression grounding. Our method strives to pinpoint a target proposal corresponding to a specific query, eliminating the need for region-level annotations during training. To bolster localization precision and enrich the semantic understanding of the target proposal, we devise three key modules: the category module, the color module, and the spatial relationship module. The category and color modules assign category and color labels to region proposals, enabling the computation of category and color scores. Simultaneously, the spatial relationship module integrates spatial cues, yielding a spatial score for each proposal to further enhance localization accuracy. By adeptly amalgamating the category, color, and spatial scores, we derive a refined grounding score for every proposal. Comprehensive evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the strong performance of the Universal Relocalizer across the board.

Citations: 0
Dual Dynamic Threshold Adjustment Strategy
IF 5.1 · CAS Tier 3 · Computer Science
Xiruo Jiang, Yazhou Yao, Sheng Liu, Fumin Shen, Liqiang Nie, Xian-Sheng Hua
DOI: 10.1145/3656047 · Published 2024-04-03

Abstract: Loss functions and sample mining strategies are essential components of deep metric learning algorithms. However, existing loss functions and mining strategies often require additional hyperparameters, notably the threshold that defines whether a sample pair is informative. The threshold provides a stable numerical standard for deciding whether to retain a pair and is a vital parameter for reducing the number of redundant sample pairs participating in training. Nonetheless, finding the optimal threshold can be time-consuming, often requiring extensive grid searches; because the threshold cannot be adjusted dynamically during training, many repeated experiments are needed to determine it. Therefore, we introduce a novel approach for adjusting the thresholds associated with both the loss function and the sample mining strategy. We design a static Asymmetric Sample Mining Strategy (ASMS) and its dynamic version, Adaptive Tolerance ASMS (AT-ASMS), tailored for sample mining methods. ASMS utilizes differentiated thresholds to address the problems caused by applying a single threshold to filter samples: too few positive pairs and too many redundant negative pairs. AT-ASMS adaptively regulates the ratio of positive and negative pairs during training according to the ratio of the currently mined positive and negative pairs. This meta-learning-based threshold generation algorithm uses a single-step gradient descent to obtain new thresholds. We combine these two threshold adjustment algorithms to form the Dual Dynamic Threshold Adjustment Strategy (DDTAS). Experimental results show that our algorithm achieves competitive performance on the CUB200, Cars196, and SOP datasets. Our code is available at https://github.com/NUST-Machine-Intelligence-Laboratory/DDTAS.

Citations: 0
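The asymmetry in ASMS (separate thresholds for positive and negative pairs) can be illustrated with a minimal mining routine: a positive pair is informative while its similarity is still below one threshold, a negative pair while its similarity is above another. The pair representation and threshold values here are hypothetical.

```python
def mine_pairs(pairs, pos_thresh=0.8, neg_thresh=0.4):
    """Keep only informative pairs using different thresholds for
    positives and negatives. `pairs` is a list of (similarity,
    is_positive) tuples; positives that are already very similar and
    negatives that are already very dissimilar are dropped as redundant."""
    positives = [s for s, is_pos in pairs if is_pos and s < pos_thresh]
    negatives = [s for s, is_pos in pairs if not is_pos and s > neg_thresh]
    return positives, negatives
```

A single shared threshold cannot express both cuts at once, which is exactly the "too few positives / too many redundant negatives" failure mode the abstract describes; in AT-ASMS the two thresholds would additionally be updated during training.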
Inter-Camera Identity Discrimination for Unsupervised Person Re-Identification
IF 5.1 · CAS Tier 3 · Computer Science
Mingfu Xiong, Kaikang Hu, Zhihan Lv, Fei Fang, Zhongyuan Wang, Ruimin Hu, Khan Muhammad
DOI: 10.1145/3652858 · Published 2024-04-03

Abstract: Unsupervised person re-identification (Re-ID) has garnered significant attention because of its data-friendly nature: it does not require labeled data. Existing approaches primarily address this challenge by employing feature-clustering techniques to generate pseudo-labels. In addition, camera-proxy-based methods have emerged because of their impressive ability to cluster sample identities. However, these methods often blur the distinctions between individuals across inter-camera views, which is crucial for effective person Re-ID. To address this issue, this study introduces an inter-camera-identity-difference-based contrastive learning framework for unsupervised person Re-ID. The proposed framework comprises two key components: (1) a different-sample cross-view close-range penalty module and (2) a same-sample cross-view long-range constraint module. The former penalizes excessive similarity among different subjects across inter-camera views, whereas the latter mitigates excessive dissimilarity of the same subject across camera views. To validate the performance of our method, we conducted extensive experiments on three person Re-ID datasets (Market-1501, MSMT17, and PersonX). The results demonstrate the effectiveness of the proposed method, which shows promising performance. The code is available at https://github.com/hooldylan/IIDCL.

Citations: 0
StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition
IF 5.1 · CAS Tier 3 · Computer Science
Xiaolong Shen, Zhedong Zheng, Yi Yang
DOI: 10.1145/3656046 · Published 2024-04-03

Abstract: The goal of sign language recognition (SLR) is to help those who are hard of hearing or deaf overcome the communication barrier. Most existing approaches fall into two lines, skeleton-based and RGB-based methods, and both have limitations: skeleton-based methods do not consider facial expressions, while RGB-based approaches usually ignore the fine-grained hand structure. To overcome both limitations, we propose a new framework based on RGB parts called the Spatial-temporal Part-aware network (StepNet). As its name suggests, it comprises two modules: Part-level Spatial Modeling and Part-level Temporal Modeling. Part-level Spatial Modeling automatically captures appearance-based properties, such as hands and faces, in the feature space without any keypoint-level annotations. Part-level Temporal Modeling, in turn, implicitly mines the long-short-term context to capture the relevant attributes over time. Extensive experiments demonstrate that, thanks to its spatial-temporal modules, StepNet achieves competitive Top-1 per-instance accuracy on three commonly used SLR benchmarks: 56.89% on WLASL, 77.2% on NMFs-CSL, and 77.1% on BOBSL. Additionally, the proposed method is compatible with optical-flow input and can produce superior performance when fused with it. For those who are hard of hearing, we hope that our work can act as a preliminary step.

Citations: 0
Multimodal Score Fusion with Sparse Low Rank Bilinear Pooling for Egocentric Hand Action Recognition
IF 5.1 · CAS Tier 3 · Computer Science
Kankana Roy
DOI: 10.1145/3656044 · Published 2024-04-02

Abstract: With the advent of egocentric cameras come new challenges that traditional computer vision techniques are not sufficient to handle. Moreover, egocentric cameras often provide multiple modalities, which need to be modeled jointly to exploit complementary information. In this paper, we propose a sparse low-rank bilinear score pooling approach for egocentric hand action recognition from RGB-D videos. It consists of five blocks: a baseline CNN that encodes RGB and depth information to produce classification probabilities; a novel bilinear score pooling block that generates a score matrix; a sparse low-rank matrix recovery block that reduces the redundant features common in bilinear pooling; a one-layer CNN for frame-level classification; and an RNN for video-level classification. We propose to fuse classification probabilities instead of traditional CNN features from the RGB and depth modalities, using an effective yet simple sparse low-rank bilinear score pooling to produce a fused RGB-D score matrix. To demonstrate the efficacy of our method, we perform extensive experiments on two large-scale hand action datasets, THU-READ and FPHA, and two smaller datasets, GUN-71 and HAD. The proposed method outperforms state-of-the-art methods and achieves accuracies of 78.55% and 96.87% on the THU-READ dataset in cross-subject and cross-group settings, respectively. Further, we achieve accuracies of 91.59% and 43.87% on the FPHA and GUN-71 datasets, respectively.

Citations: 0
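Bilinear pooling of the two streams' *scores* (rather than features) amounts to an outer product of the RGB and depth per-class probability vectors. The sketch below shows just this step and omits the sparse low-rank recovery that follows it in the paper.

```python
import numpy as np

def bilinear_score_pooling(rgb_probs, depth_probs):
    """Fuse per-class probabilities from the RGB and depth streams into a
    score matrix via their outer product: entry (i, j) couples RGB class i
    with depth class j. If both inputs are valid probability vectors, the
    resulting matrix sums to 1."""
    return np.outer(rgb_probs, depth_probs)
```

Because the inputs are C-dimensional probability vectors instead of high-dimensional CNN features, the pooled matrix stays small (C x C), which is what makes score-level bilinear pooling cheap compared with feature-level bilinear pooling.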
Double Reference Guided Interactive 2D and 3D Caricature Generation
IF 5.1 · CAS Tier 3 · Computer Science
Xin Huang, Dong Liang, Hongrui Cai, Yunfeng Bai, Juyong Zhang, Feng Tian, Jinyuan Jia
DOI: 10.1145/3655624 · Published 2024-04-01

Abstract: In this paper, we propose the first geometry- and texture- (double-) referenced interactive 2D and 3D caricature generation and editing method. The main challenge of caricature generation lies in the fact that it not only exaggerates the facial geometry but also refreshes the facial texture. We address this challenge by utilizing semantic segmentation maps as an intermediary domain, removing the influence of photo texture while preserving person-specific geometry features. Specifically, our proposed method consists of two main components: 3D-CariNet and CariMaskGAN. 3D-CariNet uses sketches or caricatures to exaggerate the input photo into several types of 3D caricatures. To generate a CariMask, we geometrically exaggerate the photos using the projection of exaggerated 3D landmarks, after which the CariMask is converted into a caricature by CariMaskGAN. In this step, users can freely edit and adjust the geometry of caricatures. Moreover, we propose a semantic detail preprocessing approach that considerably increases the detail of generated caricatures and allows modification of hair strands, wrinkles, and beards. By rendering high-quality 2D caricatures as textures, we produce 3D caricatures with a variety of texture styles. Extensive experimental results demonstrate that our method produces higher-quality caricatures and supports easy interactive modification.

Citations: 0