{"title":"MagicStyle: Portrait Stylization Based on Reference Image","authors":"Zhaoli Deng, Kaibin Zhou, Fanyi Wang, Zhenpeng Mi","doi":"arxiv-2409.08156","DOIUrl":"https://doi.org/arxiv-2409.08156","url":null,"abstract":"The development of diffusion models has significantly advanced the research\u0000on image stylization, particularly in the area of stylizing a content image\u0000based on a given style image, which has attracted many scholars. The main\u0000challenge in this reference image stylization task lies in how to maintain the\u0000details of the content image while incorporating the color and texture features\u0000of the style image. This challenge becomes even more pronounced when the\u0000content image is a portrait which has complex textural details. To address this\u0000challenge, we propose a diffusion model-based reference image stylization\u0000method specifically for portraits, called MagicStyle. MagicStyle consists of\u0000two phases: Content and Style DDIM Inversion (CSDI) and Feature Fusion Forward\u0000(FFF). The CSDI phase involves a reverse denoising process, where DDIM\u0000Inversion is performed separately on the content image and the style image,\u0000storing the self-attention query, key and value features of both images during\u0000the inversion process. The FFF phase executes forward denoising, harmoniously\u0000integrating the texture and color information from the pre-stored feature\u0000queries, keys and values into the diffusion generation process based on our\u0000Well-designed Feature Fusion Attention (FFA). We conducted comprehensive\u0000comparative and ablation experiments to validate the effectiveness of our\u0000proposed MagicStyle and FFA.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Frequency Anti-DreamBooth: Robust Defense Against Image Synthesis","authors":"Takuto Onikubo, Yusuke Matsui","doi":"arxiv-2409.08167","DOIUrl":"https://doi.org/arxiv-2409.08167","url":null,"abstract":"Recently, text-to-image generative models have been misused to create\u0000unauthorized malicious images of individuals, posing a growing social problem.\u0000Previous solutions, such as Anti-DreamBooth, add adversarial noise to images to\u0000protect them from being used as training data for malicious generation.\u0000However, we found that the adversarial noise can be removed by adversarial\u0000purification methods such as DiffPure. Therefore, we propose a new adversarial\u0000attack method that adds strong perturbation on the high-frequency areas of\u0000images to make it more robust to adversarial purification. Our experiment\u0000showed that the adversarial images retained noise even after adversarial\u0000purification, hindering malicious image generation.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation","authors":"Yinwei Wu, Xianpan Zhou, Bing Ma, Xuefeng Su, Kai Ma, Xinchao Wang","doi":"arxiv-2409.08240","DOIUrl":"https://doi.org/arxiv-2409.08240","url":null,"abstract":"While Text-to-Image (T2I) diffusion models excel at generating visually\u0000appealing images of individual instances, they struggle to accurately position\u0000and control the features generation of multiple instances. The Layout-to-Image\u0000(L2I) task was introduced to address the positioning challenges by\u0000incorporating bounding boxes as spatial control signals, but it still falls\u0000short in generating precise instance features. In response, we propose the\u0000Instance Feature Generation (IFG) task, which aims to ensure both positional\u0000accuracy and feature fidelity in generated instances. To address the IFG task,\u0000we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances\u0000feature depiction by incorporating additional appearance tokens and utilizing\u0000an Instance Semantic Map to align instance-level features with spatial\u0000locations. The IFAdapter guides the diffusion process as a plug-and-play\u0000module, making it adaptable to various community models. For evaluation, we\u0000contribute an IFG benchmark and develop a verification pipeline to objectively\u0000compare models' abilities to generate instances with accurate positioning and\u0000features. Experimental results demonstrate that IFAdapter outperforms other\u0000models in both quantitative and qualitative evaluations.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UGAD: Universal Generative AI Detector utilizing Frequency Fingerprints","authors":"Inzamamul Alam, Muhammad Shahid Muneer, Simon S. Woo","doi":"arxiv-2409.07913","DOIUrl":"https://doi.org/arxiv-2409.07913","url":null,"abstract":"In the wake of a fabricated explosion image at the Pentagon, an ability to\u0000discern real images from fake counterparts has never been more critical. Our\u0000study introduces a novel multi-modal approach to detect AI-generated images\u0000amidst the proliferation of new generation methods such as Diffusion models.\u0000Our method, UGAD, encompasses three key detection steps: First, we transform\u0000the RGB images into YCbCr channels and apply an Integral Radial Operation to\u0000emphasize salient radial features. Secondly, the Spatial Fourier Extraction\u0000operation is used for a spatial shift, utilizing a pre-trained deep learning\u0000network for optimal feature extraction. Finally, the deep neural network\u0000classification stage processes the data through dense layers using softmax for\u0000classification. Our approach significantly enhances the accuracy of\u0000differentiating between real and AI-generated images, as evidenced by a 12.64%\u0000increase in accuracy and 28.43% increase in AUC compared to existing\u0000state-of-the-art methods.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Attention Based Influence Model for Manual and Nonmanual Sign Language Analysis","authors":"Lipisha Chaudhary, Fei Xu, Ifeoma Nwogu","doi":"arxiv-2409.08162","DOIUrl":"https://doi.org/arxiv-2409.08162","url":null,"abstract":"Both manual (relating to the use of hands) and non-manual markers (NMM), such\u0000as facial expressions or mouthing cues, are important for providing the\u0000complete meaning of phrases in American Sign Language (ASL). Efforts have been\u0000made in advancing sign language to spoken/written language understanding, but\u0000most of these have primarily focused on manual features only. In this work,\u0000using advanced neural machine translation methods, we examine and report on the\u0000extent to which facial expressions contribute to understanding sign language\u0000phrases. We present a sign language translation architecture consisting of\u0000two-stream encoders, with one encoder handling the face and the other handling\u0000the upper body (with hands). We propose a new parallel cross-attention decoding\u0000mechanism that is useful for quantifying the influence of each input modality\u0000on the output. The two streams from the encoder are directed simultaneously to\u0000different attention stacks in the decoder. Examining the properties of the\u0000parallel cross-attention weights allows us to analyze the importance of facial\u0000markers compared to body and hand features during a translating task.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Style Based Clustering of Visual Artworks","authors":"Abhishek Dangeti, Pavan Gajula, Vivek Srivastava, Vikram Jamwal","doi":"arxiv-2409.08245","DOIUrl":"https://doi.org/arxiv-2409.08245","url":null,"abstract":"Clustering artworks based on style has many potential real-world applications\u0000like art recommendations, style-based search and retrieval, and the study of\u0000artistic style evolution in an artwork corpus. However, clustering artworks\u0000based on style is largely an unaddressed problem. A few present methods for\u0000clustering artworks principally rely on generic image feature representations\u0000derived from deep neural networks and do not specifically deal with the\u0000artistic style. In this paper, we introduce and deliberate over the notion of\u0000style-based clustering of visual artworks. Our main objective is to explore\u0000neural feature representations and architectures that can be used for\u0000style-based clustering and observe their impact and effectiveness. We develop\u0000different methods and assess their relative efficacy for style-based clustering\u0000through qualitative and quantitative analysis by applying them to four artwork\u0000corpora and four curated synthetically styled datasets. Our analysis provides\u0000some key novel insights on architectures, feature representations, and\u0000evaluation methods suitable for style-based clustering.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Control+Shift: Generating Controllable Distribution Shifts","authors":"Roy Friedman, Rhea Chowers","doi":"arxiv-2409.07940","DOIUrl":"https://doi.org/arxiv-2409.07940","url":null,"abstract":"We propose a new method for generating realistic datasets with distribution\u0000shifts using any decoder-based generative model. Our approach systematically\u0000creates datasets with varying intensities of distribution shifts, facilitating\u0000a comprehensive analysis of model performance degradation. We then use these\u0000generated datasets to evaluate the performance of various commonly used\u0000networks and observe a consistent decline in performance with increasing shift\u0000intensity, even when the effect is almost perceptually unnoticeable to the\u0000human eye. We see this degradation even when using data augmentations. We also\u0000find that enlarging the training dataset beyond a certain point has no effect\u0000on the robustness and that stronger inductive biases increase robustness.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scribble-Guided Diffusion for Training-free Text-to-Image Generation","authors":"Seonho Lee, Jiho Choi, Seohyun Lim, Jiwook Kim, Hyunjung Shim","doi":"arxiv-2409.08026","DOIUrl":"https://doi.org/arxiv-2409.08026","url":null,"abstract":"Recent advancements in text-to-image diffusion models have demonstrated\u0000remarkable success, yet they often struggle to fully capture the user's intent.\u0000Existing approaches using textual inputs combined with bounding boxes or region\u0000masks fall short in providing precise spatial guidance, often leading to\u0000misaligned or unintended object orientation. To address these limitations, we\u0000propose Scribble-Guided Diffusion (ScribbleDiff), a training-free approach that\u0000utilizes simple user-provided scribbles as visual prompts to guide image\u0000generation. However, incorporating scribbles into diffusion models presents\u0000challenges due to their sparse and thin nature, making it difficult to ensure\u0000accurate orientation alignment. To overcome these challenges, we introduce\u0000moment alignment and scribble propagation, which allow for more effective and\u0000flexible alignment between generated images and scribble inputs. Experimental\u0000results on the PASCAL-Scribble dataset demonstrate significant improvements in\u0000spatial control and consistency, showcasing the effectiveness of scribble-based\u0000guidance in diffusion models. Our code is available at\u0000https://github.com/kaist-cvml-lab/scribble-diffusion.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VI3DRM:Towards meticulous 3D Reconstruction from Sparse Views via Photo-Realistic Novel View Synthesis","authors":"Hao Chen, Jiafu Wu, Ying Jin, Jinlong Peng, Xiaofeng Mao, Mingmin Chi, Mufeng Yao, Bo Peng, Jian Li, Yun Cao","doi":"arxiv-2409.08207","DOIUrl":"https://doi.org/arxiv-2409.08207","url":null,"abstract":"Recently, methods like Zero-1-2-3 have focused on single-view based 3D\u0000reconstruction and have achieved remarkable success. However, their predictions\u0000for unseen areas heavily rely on the inductive bias of large-scale pretrained\u0000diffusion models. Although subsequent work, such as DreamComposer, attempts to\u0000make predictions more controllable by incorporating additional views, the\u0000results remain unrealistic due to feature entanglement in the vanilla latent\u0000space, including factors such as lighting, material, and structure. To address\u0000these issues, we introduce the Visual Isotropy 3D Reconstruction Model\u0000(VI3DRM), a diffusion-based sparse views 3D reconstruction model that operates\u0000within an ID consistent and perspective-disentangled 3D latent space. By\u0000facilitating the disentanglement of semantic information, color, material\u0000properties and lighting, VI3DRM is capable of generating highly realistic\u0000images that are indistinguishable from real photographs. By leveraging both\u0000real and synthesized images, our approach enables the accurate construction of\u0000pointmaps, ultimately producing finely textured meshes or point clouds. On the\u0000NVS task, tested on the GSO dataset, VI3DRM significantly outperforms\u0000state-of-the-art method DreamComposer, achieving a PSNR of 38.61, an SSIM of\u00000.929, and an LPIPS of 0.027. Code will be made available upon publication.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Microscopic-Mamba: Revealing the Secrets of Microscopic Images with Just 4M Parameters","authors":"Shun Zou, Zhuo Zhang, Yi Zou, Guangwei Gao","doi":"arxiv-2409.07896","DOIUrl":"https://doi.org/arxiv-2409.07896","url":null,"abstract":"In the field of medical microscopic image classification (MIC), CNN-based and\u0000Transformer-based models have been extensively studied. However, CNNs struggle\u0000with modeling long-range dependencies, limiting their ability to fully utilize\u0000semantic information in images. Conversely, Transformers are hampered by the\u0000complexity of quadratic computations. To address these challenges, we propose a\u0000model based on the Mamba architecture: Microscopic-Mamba. Specifically, we\u0000designed the Partially Selected Feed-Forward Network (PSFFN) to replace the\u0000last linear layer of the Visual State Space Module (VSSM), enhancing Mamba's\u0000local feature extraction capabilities. Additionally, we introduced the\u0000Modulation Interaction Feature Aggregation (MIFA) module to effectively\u0000modulate and dynamically aggregate global and local features. We also\u0000incorporated a parallel VSSM mechanism to improve inter-channel information\u0000interaction while reducing the number of parameters. Extensive experiments have\u0000demonstrated that our method achieves state-of-the-art performance on five\u0000public datasets. Code is available at\u0000https://github.com/zs1314/Microscopic-Mamba","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}