arXiv - CS - Multimedia: Latest Publications

Turbo your multi-modal classification with contrastive learning
arXiv - CS - Multimedia, Pub Date: 2024-09-14, arXiv:2409.09282
Zhiyu Zhang, Da Liu, Shengqiang Liu, Anna Wang, Jie Gao, Yali Li
{"title":"Turbo your multi-modal classification with contrastive learning","authors":"Zhiyu Zhang, Da Liu, Shengqiang Liu, Anna Wang, Jie Gao, Yali Li","doi":"arxiv-2409.09282","DOIUrl":"https://doi.org/arxiv-2409.09282","url":null,"abstract":"Contrastive learning has become one of the most impressive approaches for\u0000multi-modal representation learning. However, previous multi-modal works mainly\u0000focused on cross-modal understanding, ignoring in-modal contrastive learning,\u0000which limits the representation of each modality. In this paper, we propose a\u0000novel contrastive learning strategy, called $Turbo$, to promote multi-modal\u0000understanding by joint in-modal and cross-modal contrastive learning.\u0000Specifically, multi-modal data pairs are sent through the forward pass twice\u0000with different hidden dropout masks to get two different representations for\u0000each modality. With these representations, we obtain multiple in-modal and\u0000cross-modal contrastive objectives for training. Finally, we combine the\u0000self-supervised Turbo with the supervised multi-modal classification and\u0000demonstrate its effectiveness on two audio-text classification tasks, where the\u0000state-of-the-art performance is achieved on a speech emotion recognition\u0000benchmark dataset.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
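
The two-pass dropout trick described in the abstract is not spelled out in code here, so the following is a minimal sketch of how joint in-modal and cross-modal InfoNCE objectives could be combined. The encoder names, temperature, and equal loss weighting are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def turbo_style_loss(audio_encoder, text_encoder, audio, text):
    """Two forward passes with different dropout masks give two views per modality.
    Hypothetical encoders; dropout must be active, i.e. the encoders are in train mode."""
    a1, a2 = audio_encoder(audio), audio_encoder(audio)   # in-modal views of audio
    t1, t2 = text_encoder(text), text_encoder(text)       # in-modal views of text
    in_modal = info_nce(a1, a2) + info_nce(t1, t2)        # same sample, different dropout
    cross_modal = info_nce(a1, t1) + info_nce(a2, t2)     # paired audio-text
    return in_modal + cross_modal
```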
MHAD: Multimodal Home Activity Dataset with Multi-Angle Videos and Synchronized Physiological Signals
arXiv - CS - Multimedia, Pub Date: 2024-09-14, arXiv:2409.09366
Lei Yu, Jintao Fei, Xinyi Liu, Yang Yao, Jun Zhao, Guoxin Wang, Xin Li
{"title":"MHAD: Multimodal Home Activity Dataset with Multi-Angle Videos and Synchronized Physiological Signals","authors":"Lei Yu, Jintao Fei, Xinyi Liu, Yang Yao, Jun Zhao, Guoxin Wang, Xin Li","doi":"arxiv-2409.09366","DOIUrl":"https://doi.org/arxiv-2409.09366","url":null,"abstract":"Video-based physiology, exemplified by remote photoplethysmography (rPPG),\u0000extracts physiological signals such as pulse and respiration by analyzing\u0000subtle changes in video recordings. This non-contact, real-time monitoring\u0000method holds great potential for home settings. Despite the valuable\u0000contributions of public benchmark datasets to this technology, there is\u0000currently no dataset specifically designed for passive home monitoring.\u0000Existing datasets are often limited to close-up, static, frontal recordings and\u0000typically include only 1-2 physiological signals. To advance video-based\u0000physiology in real home settings, we introduce the MHAD dataset. It comprises\u00001,440 videos from 40 subjects, capturing 6 typical activities from 3 angles in\u0000a real home environment. Additionally, 5 physiological signals were recorded,\u0000making it a comprehensive video-based physiology dataset. MHAD is compatible\u0000with the rPPG-toolbox and has been validated using several unsupervised and\u0000supervised methods. Our dataset is publicly available at\u0000https://github.com/jdh-algo/MHAD-Dataset.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"100 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142254050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
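
As a rough illustration (not tied to MHAD's actual file layout, which the abstract does not specify), pairing a synchronized physiological signal with video frames typically reduces to index arithmetic over the two sampling rates. Everything below — the sampling rates, array names, and clip length — is hypothetical.

```python
import numpy as np

def align_signal_to_frames(signal, signal_hz, n_frames, video_fps):
    """Return one signal sample per video frame by matching timestamps.
    All parameters are hypothetical; real datasets ship their own sync metadata."""
    frame_times = np.arange(n_frames) / video_fps               # timestamp of each frame (s)
    sample_idx = np.round(frame_times * signal_hz).astype(int)  # nearest signal sample
    sample_idx = np.clip(sample_idx, 0, len(signal) - 1)
    return signal[sample_idx]                                   # per-frame signal value

# e.g. a 30 s PPG trace at 100 Hz paired with a 30 fps video clip
ppg = np.random.randn(3000)
per_frame_ppg = align_signal_to_frames(ppg, signal_hz=100, n_frames=900, video_fps=30)
```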
Prototypical Prompting for Text-to-image Person Re-identification
arXiv - CS - Multimedia, Pub Date: 2024-09-14, arXiv:2409.09427
Shuanglin Yan, Jun Liu, Neng Dong, Liyan Zhang, Jinhui Tang
{"title":"Prototypical Prompting for Text-to-image Person Re-identification","authors":"Shuanglin Yan, Jun Liu, Neng Dong, Liyan Zhang, Jinhui Tang","doi":"arxiv-2409.09427","DOIUrl":"https://doi.org/arxiv-2409.09427","url":null,"abstract":"In this paper, we study the problem of Text-to-Image Person Re-identification\u0000(TIReID), which aims to find images of the same identity described by a text\u0000sentence from a pool of candidate images. Benefiting from Vision-Language\u0000Pre-training, such as CLIP (Contrastive Language-Image Pretraining), the TIReID\u0000techniques have achieved remarkable progress recently. However, most existing\u0000methods only focus on instance-level matching and ignore identity-level\u0000matching, which involves associating multiple images and texts belonging to the\u0000same person. In this paper, we propose a novel prototypical prompting framework\u0000(Propot) designed to simultaneously model instance-level and identity-level\u0000matching for TIReID. Our Propot transforms the identity-level matching problem\u0000into a prototype learning problem, aiming to learn identity-enriched\u0000prototypes. Specifically, Propot works by 'initialize, adapt, enrich, then\u0000aggregate'. We first use CLIP to generate high-quality initial prototypes.\u0000Then, we propose a domain-conditional prototypical prompting (DPP) module to\u0000adapt the prototypes to the TIReID task using task-related information.\u0000Further, we propose an instance-conditional prototypical prompting (IPP) module\u0000to update prototypes conditioned on intra-modal and inter-modal instances to\u0000ensure prototype diversity. Finally, we design an adaptive prototype\u0000aggregation module to aggregate these prototypes, generating final\u0000identity-enriched prototypes. With identity-enriched prototypes, we diffuse its\u0000rich identity information to instances through prototype-to-instance\u0000contrastive loss to facilitate identity-level matching. Extensive experiments\u0000conducted on three benchmarks demonstrate the superiority of Propot compared to\u0000existing TIReID methods.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
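
The prototype-to-instance contrastive loss named in the abstract is not given explicitly; the sketch below shows one common way such a loss is written, with identity prototypes acting as classification anchors. The shapes, temperature, and the way prototypes are produced are assumptions rather than Propot's actual design.

```python
import torch
import torch.nn.functional as F

def prototype_to_instance_loss(instances, prototypes, identity_ids, temperature=0.05):
    """instances: (N, D) image/text features; prototypes: (K, D) one per identity;
    identity_ids: (N,) index of each instance's identity in [0, K)."""
    instances = F.normalize(instances, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = instances @ prototypes.t() / temperature  # similarity to every identity prototype
    return F.cross_entropy(logits, identity_ids)       # pull instances toward their own prototype

# toy usage: 8 instances, 4 identities, 256-d features
feats = torch.randn(8, 256)
protos = torch.randn(4, 256)
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = prototype_to_instance_loss(feats, protos, ids)
```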
AI-Driven Virtual Teacher for Enhanced Educational Efficiency: Leveraging Large Pretrain Models for Autonomous Error Analysis and Correction
arXiv - CS - Multimedia, Pub Date: 2024-09-14, arXiv:2409.09403
Tianlong Xu, Yi-Fan Zhang, Zhendong Chu, Shen Wang, Qingsong Wen
{"title":"AI-Driven Virtual Teacher for Enhanced Educational Efficiency: Leveraging Large Pretrain Models for Autonomous Error Analysis and Correction","authors":"Tianlong Xu, Yi-Fan Zhang, Zhendong Chu, Shen Wang, Qingsong Wen","doi":"arxiv-2409.09403","DOIUrl":"https://doi.org/arxiv-2409.09403","url":null,"abstract":"Students frequently make mistakes while solving mathematical problems, and\u0000traditional error correction methods are both time-consuming and\u0000labor-intensive. This paper introduces an innovative textbf{V}irtual\u0000textbf{A}I textbf{T}eacher system designed to autonomously analyze and\u0000correct student textbf{E}rrors (VATE). Leveraging advanced large language\u0000models (LLMs), the system uses student drafts as a primary source for error\u0000analysis, which enhances understanding of the student's learning process. It\u0000incorporates sophisticated prompt engineering and maintains an error pool to\u0000reduce computational overhead. The AI-driven system also features a real-time\u0000dialogue component for efficient student interaction. Our approach demonstrates\u0000significant advantages over traditional and machine learning-based error\u0000correction methods, including reduced educational costs, high scalability, and\u0000superior generalizability. The system has been deployed on the Squirrel AI\u0000learning platform for elementary mathematics education, where it achieves\u000078.3% accuracy in error analysis and shows a marked improvement in student\u0000learning efficiency. Satisfaction surveys indicate a strong positive reception,\u0000highlighting the system's potential to transform educational practices.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142254048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
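
The abstract mentions an error pool used to cut computational overhead; one plausible reading is a cache keyed by a normalized form of the student's mistake, consulted before calling the LLM. The sketch below implements only that reading — the keying scheme and the `call_llm` callable are hypothetical placeholders, not VATE's actual components.

```python
import hashlib

error_pool = {}  # hypothetical cache: fingerprint of a mistake -> stored analysis

def fingerprint(problem: str, student_draft: str) -> str:
    """Normalize and hash the (problem, draft) pair so equivalent mistakes collide."""
    key = " ".join(problem.split()).lower() + "||" + " ".join(student_draft.split()).lower()
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def analyze_error(problem, student_draft, call_llm):
    """Return a cached analysis when available; otherwise ask the LLM and store the result.
    `call_llm` stands in for whatever model endpoint is actually used."""
    fp = fingerprint(problem, student_draft)
    if fp not in error_pool:
        prompt = (f"Problem: {problem}\nStudent draft: {student_draft}\n"
                  f"Identify the error and explain the correction step by step.")
        error_pool[fp] = call_llm(prompt)
    return error_pool[fp]
```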
Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?
arXiv - CS - Multimedia, Pub Date: 2024-09-13, arXiv:2409.09221
Yiwen Guan, Viet Anh Trinh, Vivek Voleti, Jacob Whitehill
{"title":"Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?","authors":"Yiwen Guan, Viet Anh Trinh, Vivek Voleti, Jacob Whitehill","doi":"arxiv-2409.09221","DOIUrl":"https://doi.org/arxiv-2409.09221","url":null,"abstract":"Decoder-only discrete-token language models have recently achieved\u0000significant success in automatic speech recognition. However, systematic\u0000analyses of how different modalities impact performance in specific scenarios\u0000remain limited. In this paper, we investigate the effects of multiple\u0000modalities on recognition accuracy on both synthetic and real-world datasets.\u0000Our experiments suggest that: (1) Integrating more modalities can increase\u0000accuracy; in particular, our paper is, to our best knowledge, the first to show\u0000the benefit of combining audio, image context, and lip information; (2) Images\u0000as a supplementary modality for speech recognition provide the greatest benefit\u0000at moderate noise levels, moreover, they exhibit a different trend compared to\u0000inherently synchronized modalities like lip movements; (3) Performance improves\u0000on both synthetic and real-world datasets when the most relevant visual\u0000information is filtered as a preprocessing step.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142254051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
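
The abstract describes a decoder-only model over discrete tokens from several modalities but not the exact input layout. A common layout is simply to shift each modality's token stream into its own id range and concatenate, so one vocabulary serves all modalities; the sketch below does exactly that, with made-up vocabulary sizes and an arbitrary ordering.

```python
import torch

# hypothetical per-modality vocabulary sizes for the discretized inputs
AUDIO_VOCAB, IMAGE_VOCAB, LIP_VOCAB = 1024, 8192, 512

def build_decoder_input(audio_tokens, image_tokens, lip_tokens):
    """Shift each modality into its own id range and concatenate into one sequence,
    so a single decoder-only LM can attend across audio, image context, and lips."""
    audio = audio_tokens                           # ids in [0, AUDIO_VOCAB)
    image = image_tokens + AUDIO_VOCAB             # ids in [AUDIO_VOCAB, AUDIO_VOCAB + IMAGE_VOCAB)
    lips = lip_tokens + AUDIO_VOCAB + IMAGE_VOCAB  # ids after both previous ranges
    return torch.cat([image, lips, audio], dim=-1)  # ordering is a design choice

seq = build_decoder_input(torch.randint(0, 1024, (1, 200)),
                          torch.randint(0, 8192, (1, 64)),
                          torch.randint(0, 512, (1, 100)))
```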
Improving Virtual Try-On with Garment-focused Diffusion Models
arXiv - CS - Multimedia, Pub Date: 2024-09-12, arXiv:2409.08258
Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei
{"title":"Improving Virtual Try-On with Garment-focused Diffusion Models","authors":"Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei","doi":"arxiv-2409.08258","DOIUrl":"https://doi.org/arxiv-2409.08258","url":null,"abstract":"Diffusion models have led to the revolutionizing of generative modeling in\u0000numerous image synthesis tasks. Nevertheless, it is not trivial to directly\u0000apply diffusion models for synthesizing an image of a target person wearing a\u0000given in-shop garment, i.e., image-based virtual try-on (VTON) task. The\u0000difficulty originates from the aspect that the diffusion process should not\u0000only produce holistically high-fidelity photorealistic image of the target\u0000person, but also locally preserve every appearance and texture detail of the\u0000given garment. To address this, we shape a new Diffusion model, namely GarDiff,\u0000which triggers the garment-focused diffusion process with amplified guidance of\u0000both basic visual appearance and detailed textures (i.e., high-frequency\u0000details) derived from the given garment. GarDiff first remoulds a pre-trained\u0000latent diffusion model with additional appearance priors derived from the CLIP\u0000and VAE encodings of the reference garment. Meanwhile, a novel garment-focused\u0000adapter is integrated into the UNet of diffusion model, pursuing local\u0000fine-grained alignment with the visual appearance of reference garment and\u0000human pose. We specifically design an appearance loss over the synthesized\u0000garment to enhance the crucial, high-frequency details. Extensive experiments\u0000on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff\u0000when compared to state-of-the-art VTON approaches. Code is publicly available\u0000at:\u0000href{https://github.com/siqi0905/GarDiff/tree/master}{https://github.com/siqi0905/GarDiff/tree/master}.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
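
The appearance loss over high-frequency garment details is only named in the abstract. One standard way to isolate high frequencies is a Laplacian filter over the garment region, so the sketch below uses that as a stand-in — the filter choice, the L1 penalty, and the masking are assumptions, not GarDiff's definition.

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def high_freq(img):
    """Per-channel Laplacian response as a cheap high-frequency proxy."""
    c = img.shape[1]
    kernel = LAPLACIAN.to(img).repeat(c, 1, 1, 1)
    return F.conv2d(img, kernel, padding=1, groups=c)

def garment_appearance_loss(pred, target, garment_mask):
    """L1 on high-frequency content, restricted to the garment region (mask in {0,1})."""
    return F.l1_loss(high_freq(pred) * garment_mask, high_freq(target) * garment_mask)

# toy usage with 256x256 RGB tensors and an all-ones garment mask
loss = garment_appearance_loss(torch.rand(1, 3, 256, 256),
                               torch.rand(1, 3, 256, 256),
                               torch.ones(1, 1, 256, 256))
```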
Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations
arXiv - CS - Multimedia, Pub Date: 2024-09-12, arXiv:2409.08381
Samyak Rawlekar, Shubhang Bhatnagar, Narendra Ahuja
{"title":"Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations","authors":"Samyak Rawlekar, Shubhang Bhatnagar, Narendra Ahuja","doi":"arxiv-2409.08381","DOIUrl":"https://doi.org/arxiv-2409.08381","url":null,"abstract":"Vision-language models (VLMs) like CLIP have been adapted for Multi-Label\u0000Recognition (MLR) with partial annotations by leveraging prompt-learning, where\u0000positive and negative prompts are learned for each class to associate their\u0000embeddings with class presence or absence in the shared vision-text feature\u0000space. While this approach improves MLR performance by relying on VLM priors,\u0000we hypothesize that learning negative prompts may be suboptimal, as the\u0000datasets used to train VLMs lack image-caption pairs explicitly focusing on\u0000class absence. To analyze the impact of positive and negative prompt learning\u0000on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is\u0000learned with VLM guidance while the other is replaced by an embedding vector\u0000learned directly in the shared feature space without relying on the text\u0000encoder. Through empirical analysis, we observe that negative prompts degrade\u0000MLR performance, and learning only positive prompts, combined with learned\u0000negative embeddings (PositiveCoOp), outperforms dual prompt learning\u0000approaches. Moreover, we quantify the performance benefits that prompt-learning\u0000offers over a simple vision-features-only baseline, observing that the baseline\u0000displays strong performance comparable to dual prompt learning approach\u0000(DualCoOp), when the proportion of missing labels is low, while requiring half\u0000the training compute and 16 times fewer parameters","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"201 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142254053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
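
The core PositiveCoOp idea — keep a learned positive prompt per class but replace the negative prompt with an embedding learned directly in the shared feature space — can be sketched as a small classification head. The text-encoder interface, context length, and the difference-of-logits form below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositiveCoOpHead(nn.Module):
    """Per-class positive prompts go through a (frozen) text encoder;
    per-class negative embeddings live directly in the shared feature space."""
    def __init__(self, num_classes, ctx_len, ctx_dim, feat_dim, text_encoder):
        super().__init__()
        self.pos_ctx = nn.Parameter(torch.randn(num_classes, ctx_len, ctx_dim) * 0.02)
        self.neg_embed = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.02)
        self.text_encoder = text_encoder  # assumed: (K, ctx_len, ctx_dim) -> (K, feat_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07), CLIP-style init

    def forward(self, image_features):    # (B, feat_dim), e.g. from a frozen CLIP image encoder
        pos = F.normalize(self.text_encoder(self.pos_ctx), dim=-1)  # (K, feat_dim)
        neg = F.normalize(self.neg_embed, dim=-1)                   # (K, feat_dim)
        img = F.normalize(image_features, dim=-1)
        scale = self.logit_scale.exp()
        pos_logits = scale * img @ pos.t()  # evidence that class k is present
        neg_logits = scale * img @ neg.t()  # evidence that class k is absent
        return pos_logits - neg_logits      # per-class presence score for multi-label BCE
```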
ComAlign: Compositional Alignment in Vision-Language Models
arXiv - CS - Multimedia, Pub Date: 2024-09-12, arXiv:2409.08206
Ali Abdollah, Amirmohammad Izadi, Armin Saghafian, Reza Vahidimajd, Mohammad Mozafari, Amirreza Mirzaei, Mohammadmahdi Samiei, Mahdieh Soleymani Baghshah
{"title":"ComAlign: Compositional Alignment in Vision-Language Models","authors":"Ali Abdollah, Amirmohammad Izadi, Armin Saghafian, Reza Vahidimajd, Mohammad Mozafari, Amirreza Mirzaei, Mohammadmahdi Samiei, Mahdieh Soleymani Baghshah","doi":"arxiv-2409.08206","DOIUrl":"https://doi.org/arxiv-2409.08206","url":null,"abstract":"Vision-language models (VLMs) like CLIP have showcased a remarkable ability\u0000to extract transferable features for downstream tasks. Nonetheless, the\u0000training process of these models is usually based on a coarse-grained\u0000contrastive loss between the global embedding of images and texts which may\u0000lose the compositional structure of these modalities. Many recent studies have\u0000shown VLMs lack compositional understandings like attribute binding and\u0000identifying object relationships. Although some recent methods have tried to\u0000achieve finer-level alignments, they either are not based on extracting\u0000meaningful components of proper granularity or don't properly utilize the\u0000modalities' correspondence (especially in image-text pairs with more\u0000ingredients). Addressing these limitations, we introduce Compositional\u0000Alignment (ComAlign), a fine-grained approach to discover more exact\u0000correspondence of text and image components using only the weak supervision in\u0000the form of image-text pairs. Our methodology emphasizes that the compositional\u0000structure (including entities and relations) extracted from the text modality\u0000must also be retained in the image modality. To enforce correspondence of\u0000fine-grained concepts in image and text modalities, we train a lightweight\u0000network lying on top of existing visual and language encoders using a small\u0000dataset. The network is trained to align nodes and edges of the structure\u0000across the modalities. Experimental results on various VLMs and datasets\u0000demonstrate significant improvements in retrieval and compositional benchmarks,\u0000affirming the effectiveness of our plugin model.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
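
How ComAlign's lightweight network aligns nodes across modalities is not detailed in the abstract. A common fine-grained baseline is to score each text entity against its best-matching image region and apply a contrastive loss over those aggregate scores, which is all the sketch below does — the shapes, max-pooling, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def node_alignment_scores(text_nodes, image_regions):
    """text_nodes: (B, T, D) entity embeddings; image_regions: (B, R, D) region embeddings.
    Each entity is matched to its most similar region, then scores are averaged per pair."""
    t = F.normalize(text_nodes, dim=-1)
    v = F.normalize(image_regions, dim=-1)
    sim = torch.einsum('btd,brd->btr', t, v)    # entity-to-region similarities
    return sim.max(dim=-1).values.mean(dim=-1)  # (B,) image-text alignment score

def fine_grained_contrastive_loss(text_nodes, image_regions, temperature=0.07):
    """Scores every caption against every image in the batch; diagonal pairs are positives."""
    B = text_nodes.size(0)
    scores = torch.stack([node_alignment_scores(text_nodes,
                                                image_regions[j].expand(B, -1, -1))
                          for j in range(B)], dim=1) / temperature  # (B, B)
    targets = torch.arange(B, device=text_nodes.device)
    return F.cross_entropy(scores, targets)
```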
MSMF: Multi-Scale Multi-Modal Fusion for Enhanced Stock Market Prediction
arXiv - CS - Multimedia, Pub Date: 2024-09-12, arXiv:2409.07855
Jiahao Qin
{"title":"MSMF: Multi-Scale Multi-Modal Fusion for Enhanced Stock Market Prediction","authors":"Jiahao Qin","doi":"arxiv-2409.07855","DOIUrl":"https://doi.org/arxiv-2409.07855","url":null,"abstract":"This paper presents MSMF (Multi-Scale Multi-Modal Fusion), a novel approach\u0000for enhanced stock market prediction. MSMF addresses key challenges in\u0000multi-modal stock analysis by integrating a modality completion encoder,\u0000multi-scale feature extraction, and an innovative fusion mechanism. Our model\u0000leverages blank learning and progressive fusion to balance complementarity and\u0000redundancy across modalities, while multi-scale alignment facilitates direct\u0000correlations between heterogeneous data types. We introduce Multi-Granularity\u0000Gates and a specialized architecture to optimize the integration of local and\u0000global information for different tasks. Additionally, a Task-targeted\u0000Prediction layer is employed to preserve both coarse and fine-grained features\u0000during fusion. Experimental results demonstrate that MSMF outperforms existing\u0000methods, achieving significant improvements in accuracy and reducing prediction\u0000errors across various stock market forecasting tasks. This research contributes\u0000valuable insights to the field of multi-modal financial analysis and offers a\u0000robust framework for enhanced market prediction.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
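
The "Multi-Granularity Gates" are only named in the abstract; one plausible reading is an input-dependent gate that blends a local (fine-scale) and a global (coarse-scale) representation per feature dimension. The sketch below shows that generic form — the dimensions and the sigmoid gate are assumptions, not MSMF's actual module.

```python
import torch
import torch.nn as nn

class GatedScaleFusion(nn.Module):
    """Blend local and global features with an input-dependent gate per dimension."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, local_feat, global_feat):        # both (B, dim)
        g = self.gate(torch.cat([local_feat, global_feat], dim=-1))
        return g * local_feat + (1 - g) * global_feat  # gate decides the scale per feature

# toy usage: fuse fine-grained intraday features with coarse daily features
fusion = GatedScaleFusion(dim=128)
fused = fusion(torch.randn(4, 128), torch.randn(4, 128))
```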
Improving Text-guided Object Inpainting with Semantic Pre-inpainting
arXiv - CS - Multimedia, Pub Date: 2024-09-12, arXiv:2409.08260
Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei
{"title":"Improving Text-guided Object Inpainting with Semantic Pre-inpainting","authors":"Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei","doi":"arxiv-2409.08260","DOIUrl":"https://doi.org/arxiv-2409.08260","url":null,"abstract":"Recent years have witnessed the success of large text-to-image diffusion\u0000models and their remarkable potential to generate high-quality images. The\u0000further pursuit of enhancing the editability of images has sparked significant\u0000interest in the downstream task of inpainting a novel object described by a\u0000text prompt within a designated region in the image. Nevertheless, the problem\u0000is not trivial from two aspects: 1) Solely relying on one single U-Net to align\u0000text prompt and visual object across all the denoising timesteps is\u0000insufficient to generate desired objects; 2) The controllability of object\u0000generation is not guaranteed in the intricate sampling space of diffusion\u0000model. In this paper, we propose to decompose the typical single-stage object\u0000inpainting into two cascaded processes: 1) semantic pre-inpainting that infers\u0000the semantic features of desired objects in a multi-modal feature space; 2)\u0000high-fieldity object generation in diffusion latent space that pivots on such\u0000inpainted semantic features. To achieve this, we cascade a Transformer-based\u0000semantic inpainter and an object inpainting diffusion model, leading to a novel\u0000CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object\u0000inpainting. Technically, the semantic inpainter is trained to predict the\u0000semantic features of the target object conditioning on unmasked context and\u0000text prompt. The outputs of the semantic inpainter then act as the informative\u0000visual prompts to guide high-fieldity object generation through a reference\u0000adapter layer, leading to controllable object inpainting. Extensive evaluations\u0000on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against\u0000the state-of-the-art methods. Code is available at\u0000url{https://github.com/Nnn-s/CATdiffusion}.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
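
The cascade described above — a semantic inpainter whose predicted features condition the diffusion stage — can be written as a thin two-step pipeline. The callables below (`semantic_inpainter`, `diffusion_sampler`) are hypothetical stand-ins with the signatures used here, not the actual CAT-Diffusion components.

```python
import torch

@torch.no_grad()
def cascaded_inpaint(image, mask, text_emb, semantic_inpainter, diffusion_sampler):
    """Stage 1: predict semantic features of the missing object from unmasked context + text.
    Stage 2: run the diffusion sampler conditioned on both the text and those features."""
    context = image * (1 - mask)                                  # keep only unmasked pixels
    obj_semantics = semantic_inpainter(context, mask, text_emb)   # (B, N, D) predicted features
    return diffusion_sampler(image=image, mask=mask,
                             text_cond=text_emb, visual_cond=obj_semantics)
```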