{"title":"Exploring Rich Subjective Quality Information for Image Quality Assessment in the Wild","authors":"Xiongkuo Min, Yixuan Gao, Yuqin Cao, Guangtao Zhai, Wenjun Zhang, Huifang Sun, Chang Wen Chen","doi":"arxiv-2409.05540","DOIUrl":"https://doi.org/arxiv-2409.05540","url":null,"abstract":"Traditional in the wild image quality assessment (IQA) models are generally\u0000trained with the quality labels of mean opinion score (MOS), while missing the\u0000rich subjective quality information contained in the quality ratings, for\u0000example, the standard deviation of opinion scores (SOS) or even distribution of\u0000opinion scores (DOS). In this paper, we propose a novel IQA method named\u0000RichIQA to explore the rich subjective rating information beyond MOS to predict\u0000image quality in the wild. RichIQA is characterized by two key novel designs:\u0000(1) a three-stage image quality prediction network which exploits the powerful\u0000feature representation capability of the Convolutional vision Transformer (CvT)\u0000and mimics the short-term and long-term memory mechanisms of human brain; (2) a\u0000multi-label training strategy in which rich subjective quality information like\u0000MOS, SOS and DOS are concurrently used to train the quality prediction network.\u0000Powered by these two novel designs, RichIQA is able to predict the image\u0000quality in terms of a distribution, from which the mean image quality can be\u0000subsequently obtained. Extensive experimental results verify that the\u0000three-stage network is tailored to predict rich quality information, while the\u0000multi-label training strategy can fully exploit the potentials within\u0000subjective quality rating and enhance the prediction performance and\u0000generalizability of the network. RichIQA outperforms state-of-the-art\u0000competitors on multiple large-scale in the wild IQA databases with rich\u0000subjective rating labels. The code of RichIQA will be made publicly available\u0000on GitHub.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Look One and More: Distilling Hybrid Order Relational Knowledge for Cross-Resolution Image Recognition","authors":"Shiming Ge, Kangkai Zhang, Haolin Liu, Yingying Hua, Shengwei Zhao, Xin Jin, Hao Wen","doi":"arxiv-2409.05384","DOIUrl":"https://doi.org/arxiv-2409.05384","url":null,"abstract":"In spite of great success in many image recognition tasks achieved by recent\u0000deep models, directly applying them to recognize low-resolution images may\u0000suffer from low accuracy due to the missing of informative details during\u0000resolution degradation. However, these images are still recognizable for\u0000subjects who are familiar with the corresponding high-resolution ones. Inspired\u0000by that, we propose a teacher-student learning approach to facilitate\u0000low-resolution image recognition via hybrid order relational knowledge\u0000distillation. The approach refers to three streams: the teacher stream is\u0000pretrained to recognize high-resolution images in high accuracy, the student\u0000stream is learned to identify low-resolution images by mimicking the teacher's\u0000behaviors, and the extra assistant stream is introduced as bridge to help\u0000knowledge transfer across the teacher to the student. To extract sufficient\u0000knowledge for reducing the loss in accuracy, the learning of student is\u0000supervised with multiple losses, which preserves the similarities in various\u0000order relational structures. In this way, the capability of recovering missing\u0000details of familiar low-resolution images can be effectively enhanced, leading\u0000to a better knowledge transfer. Extensive experiments on metric learning,\u0000low-resolution image classification and low-resolution face recognition tasks\u0000show the effectiveness of our approach, while taking reduced models.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"REVISION: A Roadmap on Adaptive Video Streaming Optimization","authors":"Farzad Tashtarian, Christian Timmerer","doi":"arxiv-2409.06051","DOIUrl":"https://doi.org/arxiv-2409.06051","url":null,"abstract":"Due to the soaring popularity of video applications and the consequent rise\u0000in video traffic on the Internet, technologies like HTTP Adaptive Streaming\u0000(HAS) are crucial for delivering high Quality of Experience (QoE) to consumers.\u0000HAS technology enables video players on consumer devices to enhance viewer\u0000engagement by dynamically adapting video content quality based on network\u0000conditions. This is especially relevant for consumer electronics as it ensures\u0000an optimized viewing experience across a variety of devices, from smartphones\u0000to smart TVs. This paper introduces REVISION, an efficient roadmap designed to\u0000enhance adaptive video streaming, a core feature of modern consumer\u0000electronics. The REVISION optimization triangle highlights three essential\u0000aspects for improving streaming: Objective, Input Space, and Action Domain.\u0000Additionally, REVISION proposes a novel layer-based architecture tailored to\u0000refine video streaming systems, comprising Application, Control and Management,\u0000and Resource layers. Each layer is designed to optimize different components of\u0000the streaming process, which is directly linked to the performance and\u0000efficiency of consumer devices. By adopting the principles of the REVISION,\u0000manufacturers and developers can significantly improve the streaming\u0000capabilities of consumer electronics, thereby enriching the consumer's\u0000multimedia experience and accommodating the increasing demand for high-quality,\u0000real-time video content. This approach addresses the complexities of today's\u0000diverse video streaming ecosystem and paves the way for future advancements in\u0000consumer technology.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey of Multimodal Composite Editing and Retrieval","authors":"Suyan Li, Fuxiang Huang, Lei Zhang","doi":"arxiv-2409.05405","DOIUrl":"https://doi.org/arxiv-2409.05405","url":null,"abstract":"In the real world, where information is abundant and diverse across different\u0000modalities, understanding and utilizing various data types to improve retrieval\u0000systems is a key focus of research. Multimodal composite retrieval integrates\u0000diverse modalities such as text, image and audio, etc. to provide more\u0000accurate, personalized, and contextually relevant results. To facilitate a\u0000deeper understanding of this promising direction, this survey explores\u0000multimodal composite editing and retrieval in depth, covering image-text\u0000composite editing, image-text composite retrieval, and other multimodal\u0000composite retrieval. In this survey, we systematically organize the application\u0000scenarios, methods, benchmarks, experiments, and future directions. Multimodal\u0000learning is a hot topic in large model era, and have also witnessed some\u0000surveys in multimodal learning and vision-language models with transformers\u0000published in the PAMI journal. To the best of our knowledge, this survey is the\u0000first comprehensive review of the literature on multimodal composite retrieval,\u0000which is a timely complement of multimodal fusion to existing reviews. To help\u0000readers' quickly track this field, we build the project page for this survey,\u0000which can be found at\u0000https://github.com/fuxianghuang1/Multimodal-Composite-Editing-and-Retrieval.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"62 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization","authors":"Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao","doi":"arxiv-2409.05606","DOIUrl":"https://doi.org/arxiv-2409.05606","url":null,"abstract":"Subject-driven text-to-image (T2I) customization has drawn significant\u0000interest in academia and industry. This task enables pre-trained models to\u0000generate novel images based on unique subjects. Existing studies adopt a\u0000self-reconstructive perspective, focusing on capturing all details of a single\u0000image, which will misconstrue the specific image's irrelevant attributes (e.g.,\u0000view, pose, and background) as the subject intrinsic attributes. This\u0000misconstruction leads to both overfitting or underfitting of irrelevant and\u0000intrinsic attributes of the subject, i.e., these attributes are\u0000over-represented or under-represented simultaneously, causing a trade-off\u0000between similarity and controllability. In this study, we argue an ideal\u0000subject representation can be achieved by a cross-differential perspective,\u0000i.e., decoupling subject intrinsic attributes from irrelevant attributes via\u0000contrastive learning, which allows the model to focus more on intrinsic\u0000attributes through intra-consistency (features of the same subject are\u0000spatially closer) and inter-distinctiveness (features of different subjects\u0000have distinguished differences). Specifically, we propose CustomContrast, a\u0000novel framework, which includes a Multilevel Contrastive Learning (MCL)\u0000paradigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm is\u0000used to extract intrinsic features of subjects from high-level semantics to\u0000low-level appearance through crossmodal semantic contrastive learning and\u0000multiscale appearance contrastive learning. To facilitate contrastive learning,\u0000we introduce the MFI encoder to capture cross-modal representations. Extensive\u0000experiments show the effectiveness of CustomContrast in subject similarity and\u0000text controllability.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation","authors":"Hoang-Son Vo-Thanh, Quang-Vinh Nguyen, Soo-Hyung Kim","doi":"arxiv-2409.05330","DOIUrl":"https://doi.org/arxiv-2409.05330","url":null,"abstract":"Audio-driven talking face generation is a widely researched topic due to its\u0000high applicability. Reconstructing a talking face using audio significantly\u0000contributes to fields such as education, healthcare, online conversations,\u0000virtual assistants, and virtual reality. Early studies often focused solely on\u0000changing the mouth movements, which resulted in outcomes with limited practical\u0000applications. Recently, researchers have proposed a new approach of\u0000constructing the entire face, including face pose, neck, and shoulders. To\u0000achieve this, they need to generate through landmarks. However, creating stable\u0000landmarks that align well with the audio is a challenge. In this paper, we\u0000propose the KFusion of Dual-Domain model, a robust model that generates\u0000landmarks from audio. We separate the audio into two distinct domains to learn\u0000emotional information and facial context, then use a fusion mechanism based on\u0000the KAN model. Our model demonstrates high efficiency compared to recent\u0000models. This will lay the groundwork for the development of the audio-driven\u0000talking face generation problem in the future.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A CLIP-based siamese approach for meme classification","authors":"Javier Huertas-Tato, Christos Koutlis, Symeon Papadopoulos, David Camacho, Ioannis Kompatsiaris","doi":"arxiv-2409.05772","DOIUrl":"https://doi.org/arxiv-2409.05772","url":null,"abstract":"Memes are an increasingly prevalent element of online discourse in social\u0000networks, especially among young audiences. They carry ideas and messages that\u0000range from humorous to hateful, and are widely consumed. Their potentially high\u0000impact requires adequate means of control to moderate their use in large scale.\u0000In this work, we propose SimCLIP a deep learning-based architecture for\u0000cross-modal understanding of memes, leveraging a pre-trained CLIP encoder to\u0000produce context-aware embeddings and a Siamese fusion technique to capture the\u0000interactions between text and image. We perform an extensive experimentation on\u0000seven meme classification tasks across six datasets. We establish a new state\u0000of the art in Memotion7k with a 7.25% relative F1-score improvement, and\u0000achieve super-human performance on Harm-P with 13.73% F1-Score improvement. Our\u0000approach demonstrates the potential for compact meme classification models,\u0000enabling accurate and efficient meme monitoring. We share our code at\u0000https://github.com/jahuerta92/meme-classification-simclip","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Educational Virtual Field Trips based on Social VR and 360° Spaces","authors":"Surya Kalvakolu, Heinrich Söbke, Jannicke Baalsrud Hauge, Eckhard Kraft","doi":"arxiv-2409.05496","DOIUrl":"https://doi.org/arxiv-2409.05496","url":null,"abstract":"Virtual field trips (VFTs) have proven to be valuable learning tools. Such\u0000applications are mostly based on 360{deg} technology and are to be\u0000characterized as single-user applications in technological terms. In contrast,\u0000Social VR applications are characterized by multi-user capability and\u0000user-specific avatars. From a learning perspective, the concepts of\u0000collaborative learning and embodiment have long been proposed as conducive to\u0000learning. Both concepts might be supported using Social VR. However, little is\u0000currently known about the use of Social VR for VFTs. Accordingly, the research\u0000questions are to what extent VFTs can be implemented in Social VR environments\u0000and how these Social VR-based VFTs are perceived by learners. This article\u0000presents an evaluation study on the development and evaluation of a VFT\u0000environment using the Social VR platform Mozilla Hubs. It describes the design\u0000decisions to create the environment and evaluation results from a mixed-method\u0000study (N=16) using a questionnaire and focus group discussions. The study\u0000highlighted the opportunities offered by Social VR-based VFTs but also revealed\u0000several challenges that need to be addressed to embrace the potential of Social\u0000VR-based VFTs to be utilized regularly in education.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visual Grounding with Multi-modal Conditional Adaptation","authors":"Ruilin Yao, Shengwu Xiong, Yichen Zhao, Yi Rong","doi":"arxiv-2409.04999","DOIUrl":"https://doi.org/arxiv-2409.04999","url":null,"abstract":"Visual grounding is the task of locating objects specified by natural\u0000language expressions. Existing methods extend generic object detection\u0000frameworks to tackle this task. They typically extract visual and textual\u0000features separately using independent visual and textual encoders, then fuse\u0000these features in a multi-modal decoder for final prediction. However, visual\u0000grounding presents unique challenges. It often involves locating objects with\u0000different text descriptions within the same image. Existing methods struggle\u0000with this task because the independent visual encoder produces identical visual\u0000features for the same image, limiting detection performance. Some recently\u0000approaches propose various language-guided visual encoders to address this\u0000issue, but they mostly rely solely on textual information and require\u0000sophisticated designs. In this paper, we introduce Multi-modal Conditional\u0000Adaptation (MMCA), which enables the visual encoder to adaptively update\u0000weights, directing its focus towards text-relevant regions. Specifically, we\u0000first integrate information from different modalities to obtain multi-modal\u0000embeddings. Then we utilize a set of weighting coefficients, which generated\u0000from the multimodal embeddings, to reorganize the weight update matrices and\u0000apply them to the visual encoder of the visual grounding model. Extensive\u0000experiments on four widely used datasets demonstrate that MMCA achieves\u0000significant improvements and state-of-the-art results. Ablation experiments\u0000further demonstrate the lightweight and efficiency of our method. Our source\u0000code is available at: https://github.com/Mr-Bigworth/MMCA.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"POINTS: Improving Your Vision-language Model with Affordable Strategies","authors":"Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou","doi":"arxiv-2409.04828","DOIUrl":"https://doi.org/arxiv-2409.04828","url":null,"abstract":"In recent years, vision-language models have made significant strides,\u0000excelling in tasks like optical character recognition and geometric\u0000problem-solving. However, several critical issues remain: 1) Proprietary models\u0000often lack transparency about their architectures, while open-source models\u0000need more detailed ablations of their training strategies. 2) Pre-training data\u0000in open-source works is under-explored, with datasets added empirically, making\u0000the process cumbersome. 3) Fine-tuning often focuses on adding datasets,\u0000leading to diminishing returns. To address these issues, we propose the\u0000following contributions: 1) We trained a robust baseline model using the latest\u0000advancements in vision-language models, introducing effective improvements and\u0000conducting comprehensive ablation and validation for each technique. 2)\u0000Inspired by recent work on large language models, we filtered pre-training data\u0000using perplexity, selecting the lowest perplexity data for training. This\u0000approach allowed us to train on a curated 1M dataset, achieving competitive\u0000performance. 3) During visual instruction tuning, we used model soup on\u0000different datasets when adding more datasets yielded marginal improvements.\u0000These innovations resulted in a 9B parameter model that performs competitively\u0000with state-of-the-art models. Our strategies are efficient and lightweight,\u0000making them easily adoptable by the community.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}