Title: Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection
Authors: Shixuan Gao, Pingping Zhang, Tianyu Yan, Huchuan Lu
arXiv ID: 2408.04326 | DOI: https://doi.org/arxiv-2408.04326 | Published: 2024-08-08 (arXiv - CS - Multimedia)
Abstract: Salient Object Detection (SOD) aims to identify and segment the most prominent objects in images. Advanced SOD methods often utilize various Convolutional Neural Networks (CNN) or Transformers for deep feature extraction. However, these methods still deliver low performance and poor generalization in complex cases. Recently, the Segment Anything Model (SAM) has been proposed as a visual foundation model with strong segmentation and generalization capabilities. Nonetheless, SAM requires accurate prompts for target objects, which are unavailable in SOD. Additionally, SAM neither exploits multi-scale and multi-level information nor incorporates fine-grained details. To address these shortcomings, we propose a Multi-scale and Detail-enhanced SAM (MDSAM) for SOD. Specifically, we first introduce a Lightweight Multi-Scale Adapter (LMSA), which allows SAM to learn multi-scale information with very few trainable parameters. Then, we propose a Multi-Level Fusion Module (MLFM) to comprehensively utilize the multi-level information from SAM's encoder. Finally, we propose a Detail Enhancement Module (DEM) to equip SAM with fine-grained details. Experimental results demonstrate the superior performance of our model on multiple SOD datasets and its strong generalization to other segmentation tasks. The source code is released at https://github.com/BellyBeauty/MDSAM.

Title: MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models
Authors: Haoxuan Li, Zhengmao Yang, Yunshan Ma, Yi Bin, Yang Yang, Tat-Seng Chua
arXiv ID: 2408.04388 | DOI: https://doi.org/arxiv-2408.04388 | Published: 2024-08-08 (arXiv - CS - Multimedia)
Abstract: We study the emerging and intriguing problem of multimodal temporal event forecasting with large language models. Compared to the text and graph modalities, the use of images for temporal event forecasting has not been fully explored, especially in the era of large language models (LLMs). To bridge this gap, we are particularly interested in two key questions: 1) why images help in temporal event forecasting, and 2) how to integrate images into an LLM-based forecasting framework. To answer these questions, we identify two essential functions that images play in temporal event forecasting, namely highlighting and complementary. We then develop a novel framework, named MM-Forecast, which employs an Image Function Identification module to recognize these functions as verbal descriptions using multimodal large language models (MLLMs), and subsequently incorporates these function descriptions into LLM-based forecasting models. To evaluate our approach, we construct a new multimodal dataset, MidEast-TE-mm, by extending the existing event dataset MidEast-TE-mini with images. Empirical studies demonstrate that MM-Forecast can correctly identify the image functions and, furthermore, that incorporating these verbal function descriptions significantly improves forecasting performance. The dataset, code, and prompts are available at https://github.com/LuminosityX/MM-Forecast.

Title: The algorithmic nature of song-sequencing: statistical regularities in music albums
Authors: Pedro Neto, Martin Hartmann, Geoff Luck, Petri Toiviainen
arXiv ID: 2408.04383 | DOI: https://doi.org/arxiv-2408.04383 | Published: 2024-08-08 (arXiv - CS - Multimedia)
Abstract: Based on a review of anecdotal beliefs, we explored patterns of track sequencing within professional music albums. We found that songs with high levels of valence, energy, and loudness are more likely to be positioned at the beginning of an album. We also found that transitions between consecutive tracks tend to alternate between increases and decreases in valence and energy. These findings were used to build a system that automates the process of album sequencing. Our results and hypotheses have both practical and theoretical applications. Practically, sequencing regularities can be used to inform playlist-generation systems. Theoretically, we show weak to moderate support for the idea that music is perceived in both global and local contexts.
{"title":"MU-MAE: Multimodal Masked Autoencoders-Based One-Shot Learning","authors":"Rex Liu, Xin Liu","doi":"arxiv-2408.04243","DOIUrl":"https://doi.org/arxiv-2408.04243","url":null,"abstract":"With the exponential growth of multimedia data, leveraging multimodal sensors\u0000presents a promising approach for improving accuracy in human activity\u0000recognition. Nevertheless, accurately identifying these activities using both\u0000video data and wearable sensor data presents challenges due to the\u0000labor-intensive data annotation, and reliance on external pretrained models or\u0000additional data. To address these challenges, we introduce Multimodal Masked\u0000Autoencoders-Based One-Shot Learning (Mu-MAE). Mu-MAE integrates a multimodal\u0000masked autoencoder with a synchronized masking strategy tailored for wearable\u0000sensors. This masking strategy compels the networks to capture more meaningful\u0000spatiotemporal features, which enables effective self-supervised pretraining\u0000without the need for external data. Furthermore, Mu-MAE leverages the\u0000representation extracted from multimodal masked autoencoders as prior\u0000information input to a cross-attention multimodal fusion layer. This fusion\u0000layer emphasizes spatiotemporal features requiring attention across different\u0000modalities while highlighting differences from other classes, aiding in the\u0000classification of various classes in metric-based one-shot learning.\u0000Comprehensive evaluations on MMAct one-shot classification show that Mu-MAE\u0000outperforms all the evaluated approaches, achieving up to an 80.17% accuracy\u0000for five-way one-shot multimodal classification, without the use of additional\u0000data.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Towards Multimodal Emotional Support Conversation Systems
Authors: Yuqi Chu, Lizi Liao, Zhiyuan Zhou, Chong-Wah Ngo, Richang Hong
arXiv ID: 2408.03650 | DOI: https://doi.org/arxiv-2408.03650 | Published: 2024-08-07 (arXiv - CS - Multimedia)
Abstract: The integration of conversational artificial intelligence (AI) into mental health care promises a new horizon for therapist-client interactions, aiming to closely emulate the depth and nuance of human conversations. Despite the potential, the current landscape of conversational AI is markedly limited by its reliance on single-modal data, constraining the systems' ability to empathize and provide effective emotional support. This limitation stems from a paucity of resources that encapsulate the multimodal nature of human communication essential for therapeutic counseling. To address this gap, we introduce the Multimodal Emotional Support Conversation (MESC) dataset, a first-of-its-kind resource enriched with comprehensive annotations across text, audio, and video modalities. This dataset captures the intricate interplay of user emotions, system strategies, system emotion, and system responses, setting a new precedent in the field. Leveraging the MESC dataset, we propose a general Sequential Multimodal Emotional Support framework (SMES) grounded in Therapeutic Skills Theory. Tailored for multimodal dialogue systems, the SMES framework incorporates an LLM-based reasoning model that sequentially generates user emotion recognition, system strategy prediction, system emotion prediction, and response generation. Our rigorous evaluations demonstrate that this framework significantly enhances the capability of AI systems to mimic therapist behaviors with heightened empathy and strategic responsiveness. By integrating multimodal data in this innovative manner, we bridge the critical gap between emotion recognition and emotional support, marking a significant advancement in conversational AI for mental health support.
{"title":"TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization","authors":"Kien T. Pham, Jingye Chen, Qifeng Chen","doi":"arxiv-2408.03637","DOIUrl":"https://doi.org/arxiv-2408.03637","url":null,"abstract":"We present TALE, a novel training-free framework harnessing the generative\u0000capabilities of text-to-image diffusion models to address the cross-domain\u0000image composition task that focuses on flawlessly incorporating user-specified\u0000objects into a designated visual contexts regardless of domain disparity.\u0000Previous methods often involve either training auxiliary networks or finetuning\u0000diffusion models on customized datasets, which are expensive and may undermine\u0000the robust textual and visual priors of pre-trained diffusion models. Some\u0000recent works attempt to break the barrier by proposing training-free\u0000workarounds that rely on manipulating attention maps to tame the denoising\u0000process implicitly. However, composing via attention maps does not necessarily\u0000yield desired compositional outcomes. These approaches could only retain some\u0000semantic information and usually fall short in preserving identity\u0000characteristics of input objects or exhibit limited background-object style\u0000adaptation in generated images. In contrast, TALE is a novel method that\u0000operates directly on latent space to provide explicit and effective guidance\u0000for the composition process to resolve these problems. Specifically, we equip\u0000TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided\u0000Latent Optimization. The former formulates noisy latents conducive to\u0000initiating and steering the composition process by directly leveraging\u0000background and foreground latents at corresponding timesteps, and the latter\u0000exploits designated energy functions to further optimize intermediate latents\u0000conforming to specific conditions that complement the former to generate\u0000desired final results. Our experiments demonstrate that TALE surpasses prior\u0000baselines and attains state-of-the-art performance in image-guided composition\u0000across various photorealistic and artistic domains.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"100 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: HiQuE: Hierarchical Question Embedding Network for Multimodal Depression Detection
Authors: Juho Jung, Chaewon Kang, Jeewoo Yoon, Seungbae Kim, Jinyoung Han
arXiv ID: 2408.03648 | DOI: https://doi.org/arxiv-2408.03648 | Published: 2024-08-07 (arXiv - CS - Multimedia)
Abstract: Automated depression detection significantly enhances early intervention for individuals experiencing depression. Despite numerous proposals for automated depression detection using recorded clinical interview videos, limited attention has been paid to the hierarchical structure of the interview questions. In clinical interviews for diagnosing depression, clinicians use a structured questionnaire that includes routine baseline questions and follow-up questions to assess the interviewee's condition. This paper introduces HiQuE (Hierarchical Question Embedding network), a novel depression detection framework that leverages the hierarchical relationship between primary and follow-up questions in clinical interviews. HiQuE can effectively capture the importance of each question in diagnosing depression by learning mutual information across multiple modalities. We conduct extensive experiments on the widely used clinical interview dataset DAIC-WOZ, where our model outperforms other state-of-the-art multimodal depression detection models and emotion recognition models, showcasing its clinical utility in depression detection.

Title: MetaDragonBoat: Exploring Paddling Techniques of Virtual Dragon Boating in a Metaverse Campus
Authors: Wei He, Xiang Li, Shengtian Xu, Yuzheng Chen, Chan-In Sio, Ge Lin Kan, Lik-Hang Lee
arXiv ID: 2408.04013 | DOI: https://doi.org/arxiv-2408.04013 | Published: 2024-08-07 (arXiv - CS - Multimedia)
Abstract: The preservation of cultural heritage, as mandated by the United Nations Sustainable Development Goals (SDGs), is integral to sustainable urban development. This paper focuses on the Dragon Boat Festival, a prominent event in Chinese cultural heritage, and proposes leveraging Virtual Reality (VR) to enhance its preservation and accessibility. Traditionally, participation in the festival's dragon boat races was limited to elite athletes, excluding broader demographics. Our proposed solution, named MetaDragonBoat, enables virtual participation in dragon boat racing, offering immersive experiences that replicate physical exertion through a cultural journey. To this end, we build a digital twin of a university campus located in a region with a rich dragon boat racing tradition. Coupled with three paddling techniques enabled by either commercial controllers or physical paddle controllers with haptic feedback, a diverse range of users can engage in realistic rowing experiences. Our results demonstrate that by integrating resistance into the paddle controls, users can simulate the physical effort of dragon boat racing, promoting a deeper understanding and appreciation of this cultural heritage.

Title: Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis
Authors: Zebin Yao, Fangxiang Feng, Ruifan Li, Xiaojie Wang
arXiv ID: 2408.03632 | DOI: https://doi.org/arxiv-2408.03632 | Published: 2024-08-07 (arXiv - CS - Multimedia)
Abstract: The customization of text-to-image models has seen significant advancements, yet generating multiple personalized concepts remains a challenging task. Current methods struggle with attribute leakage and layout confusion when handling multiple concepts, leading to reduced concept fidelity and semantic consistency. In this work, we introduce a novel training-free framework, Concept Conductor, designed to ensure visual fidelity and correct layout in multi-concept customization. Concept Conductor isolates the sampling processes of multiple custom models to prevent attribute leakage between different concepts and corrects erroneous layouts through self-attention-based spatial guidance. Additionally, we present a concept injection technique that employs shape-aware masks to specify the generation area for each concept. This technique injects the structure and appearance of personalized concepts through feature fusion in the attention layers, ensuring harmony in the final image. Extensive qualitative and quantitative experiments demonstrate that Concept Conductor can consistently generate composite images with accurate layouts while preserving the visual details of each concept. Compared to existing baselines, Concept Conductor shows significant performance improvements. Our method supports the combination of any number of concepts and maintains high fidelity even when dealing with visually similar concepts. The code and models are available at https://github.com/Nihukat/Concept-Conductor.

Title: ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval
Authors: Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li
arXiv ID: 2408.02978 | DOI: https://doi.org/arxiv-2408.02978 | Published: 2024-08-06 (arXiv - CS - Multimedia)
Abstract: E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live-stream promotions. A unified and vectorized cross-domain product representation is essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from short or live-stream videos is readily accessible, how to de-noise this excessively noisy text for multimodal representation learning remains largely unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). In order to extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.