{"title":"MagicStyle: Portrait Stylization Based on Reference Image","authors":"Zhaoli Deng, Kaibin Zhou, Fanyi Wang, Zhenpeng Mi","doi":"arxiv-2409.08156","DOIUrl":"https://doi.org/arxiv-2409.08156","url":null,"abstract":"The development of diffusion models has significantly advanced the research\u0000on image stylization, particularly in the area of stylizing a content image\u0000based on a given style image, which has attracted many scholars. The main\u0000challenge in this reference image stylization task lies in how to maintain the\u0000details of the content image while incorporating the color and texture features\u0000of the style image. This challenge becomes even more pronounced when the\u0000content image is a portrait which has complex textural details. To address this\u0000challenge, we propose a diffusion model-based reference image stylization\u0000method specifically for portraits, called MagicStyle. MagicStyle consists of\u0000two phases: Content and Style DDIM Inversion (CSDI) and Feature Fusion Forward\u0000(FFF). The CSDI phase involves a reverse denoising process, where DDIM\u0000Inversion is performed separately on the content image and the style image,\u0000storing the self-attention query, key and value features of both images during\u0000the inversion process. The FFF phase executes forward denoising, harmoniously\u0000integrating the texture and color information from the pre-stored feature\u0000queries, keys and values into the diffusion generation process based on our\u0000Well-designed Feature Fusion Attention (FFA). We conducted comprehensive\u0000comparative and ablation experiments to validate the effectiveness of our\u0000proposed MagicStyle and FFA.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Frequency Anti-DreamBooth: Robust Defense Against Image Synthesis","authors":"Takuto Onikubo, Yusuke Matsui","doi":"arxiv-2409.08167","DOIUrl":"https://doi.org/arxiv-2409.08167","url":null,"abstract":"Recently, text-to-image generative models have been misused to create\u0000unauthorized malicious images of individuals, posing a growing social problem.\u0000Previous solutions, such as Anti-DreamBooth, add adversarial noise to images to\u0000protect them from being used as training data for malicious generation.\u0000However, we found that the adversarial noise can be removed by adversarial\u0000purification methods such as DiffPure. Therefore, we propose a new adversarial\u0000attack method that adds strong perturbation on the high-frequency areas of\u0000images to make it more robust to adversarial purification. Our experiment\u0000showed that the adversarial images retained noise even after adversarial\u0000purification, hindering malicious image generation.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation","authors":"Yinwei Wu, Xianpan Zhou, Bing Ma, Xuefeng Su, Kai Ma, Xinchao Wang","doi":"arxiv-2409.08240","DOIUrl":"https://doi.org/arxiv-2409.08240","url":null,"abstract":"While Text-to-Image (T2I) diffusion models excel at generating visually\u0000appealing images of individual instances, they struggle to accurately position\u0000and control the features generation of multiple instances. The Layout-to-Image\u0000(L2I) task was introduced to address the positioning challenges by\u0000incorporating bounding boxes as spatial control signals, but it still falls\u0000short in generating precise instance features. In response, we propose the\u0000Instance Feature Generation (IFG) task, which aims to ensure both positional\u0000accuracy and feature fidelity in generated instances. To address the IFG task,\u0000we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances\u0000feature depiction by incorporating additional appearance tokens and utilizing\u0000an Instance Semantic Map to align instance-level features with spatial\u0000locations. The IFAdapter guides the diffusion process as a plug-and-play\u0000module, making it adaptable to various community models. For evaluation, we\u0000contribute an IFG benchmark and develop a verification pipeline to objectively\u0000compare models' abilities to generate instances with accurate positioning and\u0000features. Experimental results demonstrate that IFAdapter outperforms other\u0000models in both quantitative and qualitative evaluations.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UGAD: Universal Generative AI Detector utilizing Frequency Fingerprints","authors":"Inzamamul Alam, Muhammad Shahid Muneer, Simon S. Woo","doi":"arxiv-2409.07913","DOIUrl":"https://doi.org/arxiv-2409.07913","url":null,"abstract":"In the wake of a fabricated explosion image at the Pentagon, an ability to\u0000discern real images from fake counterparts has never been more critical. Our\u0000study introduces a novel multi-modal approach to detect AI-generated images\u0000amidst the proliferation of new generation methods such as Diffusion models.\u0000Our method, UGAD, encompasses three key detection steps: First, we transform\u0000the RGB images into YCbCr channels and apply an Integral Radial Operation to\u0000emphasize salient radial features. Secondly, the Spatial Fourier Extraction\u0000operation is used for a spatial shift, utilizing a pre-trained deep learning\u0000network for optimal feature extraction. Finally, the deep neural network\u0000classification stage processes the data through dense layers using softmax for\u0000classification. Our approach significantly enhances the accuracy of\u0000differentiating between real and AI-generated images, as evidenced by a 12.64%\u0000increase in accuracy and 28.43% increase in AUC compared to existing\u0000state-of-the-art methods.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Attention Based Influence Model for Manual and Nonmanual Sign Language Analysis","authors":"Lipisha Chaudhary, Fei Xu, Ifeoma Nwogu","doi":"arxiv-2409.08162","DOIUrl":"https://doi.org/arxiv-2409.08162","url":null,"abstract":"Both manual (relating to the use of hands) and non-manual markers (NMM), such\u0000as facial expressions or mouthing cues, are important for providing the\u0000complete meaning of phrases in American Sign Language (ASL). Efforts have been\u0000made in advancing sign language to spoken/written language understanding, but\u0000most of these have primarily focused on manual features only. In this work,\u0000using advanced neural machine translation methods, we examine and report on the\u0000extent to which facial expressions contribute to understanding sign language\u0000phrases. We present a sign language translation architecture consisting of\u0000two-stream encoders, with one encoder handling the face and the other handling\u0000the upper body (with hands). We propose a new parallel cross-attention decoding\u0000mechanism that is useful for quantifying the influence of each input modality\u0000on the output. The two streams from the encoder are directed simultaneously to\u0000different attention stacks in the decoder. Examining the properties of the\u0000parallel cross-attention weights allows us to analyze the importance of facial\u0000markers compared to body and hand features during a translating task.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Style Based Clustering of Visual Artworks","authors":"Abhishek Dangeti, Pavan Gajula, Vivek Srivastava, Vikram Jamwal","doi":"arxiv-2409.08245","DOIUrl":"https://doi.org/arxiv-2409.08245","url":null,"abstract":"Clustering artworks based on style has many potential real-world applications\u0000like art recommendations, style-based search and retrieval, and the study of\u0000artistic style evolution in an artwork corpus. However, clustering artworks\u0000based on style is largely an unaddressed problem. A few present methods for\u0000clustering artworks principally rely on generic image feature representations\u0000derived from deep neural networks and do not specifically deal with the\u0000artistic style. In this paper, we introduce and deliberate over the notion of\u0000style-based clustering of visual artworks. Our main objective is to explore\u0000neural feature representations and architectures that can be used for\u0000style-based clustering and observe their impact and effectiveness. We develop\u0000different methods and assess their relative efficacy for style-based clustering\u0000through qualitative and quantitative analysis by applying them to four artwork\u0000corpora and four curated synthetically styled datasets. Our analysis provides\u0000some key novel insights on architectures, feature representations, and\u0000evaluation methods suitable for style-based clustering.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Control+Shift: Generating Controllable Distribution Shifts","authors":"Roy Friedman, Rhea Chowers","doi":"arxiv-2409.07940","DOIUrl":"https://doi.org/arxiv-2409.07940","url":null,"abstract":"We propose a new method for generating realistic datasets with distribution\u0000shifts using any decoder-based generative model. Our approach systematically\u0000creates datasets with varying intensities of distribution shifts, facilitating\u0000a comprehensive analysis of model performance degradation. We then use these\u0000generated datasets to evaluate the performance of various commonly used\u0000networks and observe a consistent decline in performance with increasing shift\u0000intensity, even when the effect is almost perceptually unnoticeable to the\u0000human eye. We see this degradation even when using data augmentations. We also\u0000find that enlarging the training dataset beyond a certain point has no effect\u0000on the robustness and that stronger inductive biases increase robustness.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scribble-Guided Diffusion for Training-free Text-to-Image Generation","authors":"Seonho Lee, Jiho Choi, Seohyun Lim, Jiwook Kim, Hyunjung Shim","doi":"arxiv-2409.08026","DOIUrl":"https://doi.org/arxiv-2409.08026","url":null,"abstract":"Recent advancements in text-to-image diffusion models have demonstrated\u0000remarkable success, yet they often struggle to fully capture the user's intent.\u0000Existing approaches using textual inputs combined with bounding boxes or region\u0000masks fall short in providing precise spatial guidance, often leading to\u0000misaligned or unintended object orientation. To address these limitations, we\u0000propose Scribble-Guided Diffusion (ScribbleDiff), a training-free approach that\u0000utilizes simple user-provided scribbles as visual prompts to guide image\u0000generation. However, incorporating scribbles into diffusion models presents\u0000challenges due to their sparse and thin nature, making it difficult to ensure\u0000accurate orientation alignment. To overcome these challenges, we introduce\u0000moment alignment and scribble propagation, which allow for more effective and\u0000flexible alignment between generated images and scribble inputs. Experimental\u0000results on the PASCAL-Scribble dataset demonstrate significant improvements in\u0000spatial control and consistency, showcasing the effectiveness of scribble-based\u0000guidance in diffusion models. Our code is available at\u0000https://github.com/kaist-cvml-lab/scribble-diffusion.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VI3DRM:Towards meticulous 3D Reconstruction from Sparse Views via Photo-Realistic Novel View Synthesis","authors":"Hao Chen, Jiafu Wu, Ying Jin, Jinlong Peng, Xiaofeng Mao, Mingmin Chi, Mufeng Yao, Bo Peng, Jian Li, Yun Cao","doi":"arxiv-2409.08207","DOIUrl":"https://doi.org/arxiv-2409.08207","url":null,"abstract":"Recently, methods like Zero-1-2-3 have focused on single-view based 3D\u0000reconstruction and have achieved remarkable success. However, their predictions\u0000for unseen areas heavily rely on the inductive bias of large-scale pretrained\u0000diffusion models. Although subsequent work, such as DreamComposer, attempts to\u0000make predictions more controllable by incorporating additional views, the\u0000results remain unrealistic due to feature entanglement in the vanilla latent\u0000space, including factors such as lighting, material, and structure. To address\u0000these issues, we introduce the Visual Isotropy 3D Reconstruction Model\u0000(VI3DRM), a diffusion-based sparse views 3D reconstruction model that operates\u0000within an ID consistent and perspective-disentangled 3D latent space. By\u0000facilitating the disentanglement of semantic information, color, material\u0000properties and lighting, VI3DRM is capable of generating highly realistic\u0000images that are indistinguishable from real photographs. By leveraging both\u0000real and synthesized images, our approach enables the accurate construction of\u0000pointmaps, ultimately producing finely textured meshes or point clouds. On the\u0000NVS task, tested on the GSO dataset, VI3DRM significantly outperforms\u0000state-of-the-art method DreamComposer, achieving a PSNR of 38.61, an SSIM of\u00000.929, and an LPIPS of 0.027. Code will be made available upon publication.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Microscopic-Mamba: Revealing the Secrets of Microscopic Images with Just 4M Parameters","authors":"Shun Zou, Zhuo Zhang, Yi Zou, Guangwei Gao","doi":"arxiv-2409.07896","DOIUrl":"https://doi.org/arxiv-2409.07896","url":null,"abstract":"In the field of medical microscopic image classification (MIC), CNN-based and\u0000Transformer-based models have been extensively studied. However, CNNs struggle\u0000with modeling long-range dependencies, limiting their ability to fully utilize\u0000semantic information in images. Conversely, Transformers are hampered by the\u0000complexity of quadratic computations. To address these challenges, we propose a\u0000model based on the Mamba architecture: Microscopic-Mamba. Specifically, we\u0000designed the Partially Selected Feed-Forward Network (PSFFN) to replace the\u0000last linear layer of the Visual State Space Module (VSSM), enhancing Mamba's\u0000local feature extraction capabilities. Additionally, we introduced the\u0000Modulation Interaction Feature Aggregation (MIFA) module to effectively\u0000modulate and dynamically aggregate global and local features. We also\u0000incorporated a parallel VSSM mechanism to improve inter-channel information\u0000interaction while reducing the number of parameters. Extensive experiments have\u0000demonstrated that our method achieves state-of-the-art performance on five\u0000public datasets. Code is available at\u0000https://github.com/zs1314/Microscopic-Mamba","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}