arXiv - CS - Multimedia: Latest Articles

HiSC4D: Human-centered interaction and 4D Scene Capture in Large-scale Space Using Wearable IMUs and LiDAR
arXiv - CS - Multimedia Pub Date : 2024-09-06 DOI: arxiv-2409.04398
Yudi Dai, Zhiyong Wang, Xiping Lin, Chenglu Wen, Lan Xu, Siqi Shen, Yuexin Ma, Cheng Wang
{"title":"HiSC4D: Human-centered interaction and 4D Scene Capture in Large-scale Space Using Wearable IMUs and LiDAR","authors":"Yudi Dai, Zhiyong Wang, Xiping Lin, Chenglu Wen, Lan Xu, Siqi Shen, Yuexin Ma, Cheng Wang","doi":"arxiv-2409.04398","DOIUrl":"https://doi.org/arxiv-2409.04398","url":null,"abstract":"We introduce HiSC4D, a novel Human-centered interaction and 4D Scene Capture\u0000method, aimed at accurately and efficiently creating a dynamic digital world,\u0000containing large-scale indoor-outdoor scenes, diverse human motions, rich\u0000human-human interactions, and human-environment interactions. By utilizing\u0000body-mounted IMUs and a head-mounted LiDAR, HiSC4D can capture egocentric human\u0000motions in unconstrained space without the need for external devices and\u0000pre-built maps. This affords great flexibility and accessibility for\u0000human-centered interaction and 4D scene capturing in various environments.\u0000Taking into account that IMUs can capture human spatially unrestricted poses\u0000but are prone to drifting for long-period using, and while LiDAR is stable for\u0000global localization but rough for local positions and orientations, HiSC4D\u0000employs a joint optimization method, harmonizing all sensors and utilizing\u0000environment cues, yielding promising results for long-term capture in large\u0000scenes. To promote research of egocentric human interaction in large scenes and\u0000facilitate downstream tasks, we also present a dataset, containing 8 sequences\u0000in 4 large scenes (200 to 5,000 $m^2$), providing 36k frames of accurate 4D\u0000human motions with SMPL annotations and dynamic scenes, 31k frames of cropped\u0000human point clouds, and scene mesh of the environment. A variety of scenarios,\u0000such as the basketball gym and commercial street, alongside challenging human\u0000motions, such as daily greeting, one-on-one basketball playing, and tour\u0000guiding, demonstrate the effectiveness and the generalization ability of\u0000HiSC4D. The dataset and code will be publicated on\u0000www.lidarhumanmotion.net/hisc4d available for research purposes.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Question-Answering Dense Video Events
arXiv - CS - Multimedia Pub Date : 2024-09-06 DOI: arxiv-2409.04388
Hangyu Qin, Junbin Xiao, Angela Yao
{"title":"Question-Answering Dense Video Events","authors":"Hangyu Qin, Junbin Xiao, Angela Yao","doi":"arxiv-2409.04388","DOIUrl":"https://doi.org/arxiv-2409.04388","url":null,"abstract":"Multimodal Large Language Models (MLLMs) have shown excellent performance in\u0000question-answering of single-event videos. In this paper, we present\u0000question-answering dense video events, a novel task that requires answering and\u0000grounding the dense-event questions in long videos, thus challenging MLLMs to\u0000faithfully comprehend and reason about multiple events occurring over extended\u0000time periods. To facilitate the study, we construct DeVE-QA - a dataset\u0000featuring 78K questions about 26K events on 10.6K long videos. We then\u0000benchmark and show that existing MLLMs excelling at single-event QA struggle to\u0000perform well in DeVE-QA. For improvement, we propose DeVi, a novel\u0000training-free MLLM approach that highlights a hierarchical captioning module, a\u0000temporal event memory module, and a self-consistency checking module to\u0000respectively detect, contextualize and memorize, and ground dense-events in\u0000long videos for question answering. Extensive experiments show that DeVi is\u0000superior at answering dense-event questions and grounding relevant video\u0000moments. Compared with existing MLLMs, it achieves a remarkable increase of 4.1\u0000percent and 3.7 percent for G(round)QA accuracy on DeVE-QA and NExT-GQA\u0000respectively.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"392 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
WaterMAS: Sharpness-Aware Maximization for Neural Network Watermarking
arXiv - CS - Multimedia Pub Date : 2024-09-05 DOI: arxiv-2409.03902
Carl De Sousa Trias, Mihai Mitrea, Attilio Fiandrotti, Marco Cagnazzo, Sumanta Chaudhuri, Enzo Tartaglione
{"title":"WaterMAS: Sharpness-Aware Maximization for Neural Network Watermarking","authors":"Carl De Sousa Trias, Mihai Mitrea, Attilio Fiandrotti, Marco Cagnazzo, Sumanta Chaudhuri, Enzo Tartaglione","doi":"arxiv-2409.03902","DOIUrl":"https://doi.org/arxiv-2409.03902","url":null,"abstract":"Nowadays, deep neural networks are used for solving complex tasks in several\u0000critical applications and protecting both their integrity and intellectual\u0000property rights (IPR) has become of utmost importance. To this end, we advance\u0000WaterMAS, a substitutive, white-box neural network watermarking method that\u0000improves the trade-off among robustness, imperceptibility, and computational\u0000complexity, while making provisions for increased data payload and security.\u0000WasterMAS insertion keeps unchanged the watermarked weights while sharpening\u0000their underlying gradient space. The robustness is thus ensured by limiting the\u0000attack's strength: even small alterations of the watermarked weights would\u0000impact the model's performance. The imperceptibility is ensured by inserting\u0000the watermark during the training process. The relationship among the WaterMAS\u0000data payload, imperceptibility, and robustness properties is discussed. The\u0000secret key is represented by the positions of the weights conveying the\u0000watermark, randomly chosen through multiple layers of the model. The security\u0000is evaluated by investigating the case in which an attacker would intercept the\u0000key. The experimental validations consider 5 models and 2 tasks (VGG16,\u0000ResNet18, MobileNetV3, SwinT for CIFAR10 image classification, and DeepLabV3\u0000for Cityscapes image segmentation) as well as 4 types of attacks (Gaussian\u0000noise addition, pruning, fine-tuning, and quantization). The code will be\u0000released open-source upon acceptance of the article.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing
arXiv - CS - Multimedia Pub Date : 2024-09-05 DOI: arxiv-2409.03605
Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu
{"title":"SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing","authors":"Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu","doi":"arxiv-2409.03605","DOIUrl":"https://doi.org/arxiv-2409.03605","url":null,"abstract":"Audio-driven talking face generation aims to synthesize video with lip\u0000movements synchronized to input audio. However, current generative techniques\u0000face challenges in preserving intricate regional textures (skin, teeth). To\u0000address the aforementioned challenges, we propose a novel framework called\u0000SegTalker to decouple lip movements and image textures by introducing\u0000segmentation as intermediate representation. Specifically, given the mask of\u0000image employed by a parsing network, we first leverage the speech to drive the\u0000mask and generate talking segmentation. Then we disentangle semantic regions of\u0000image into style codes using a mask-guided encoder. Ultimately, we inject the\u0000previously generated talking segmentation and style codes into a mask-guided\u0000StyleGAN to synthesize video frame. In this way, most of textures are fully\u0000preserved. Moreover, our approach can inherently achieve background separation\u0000and facilitate mask-guided facial local editing. In particular, by editing the\u0000mask and swapping the region textures from a given reference image (e.g. hair,\u0000lip, eyebrows), our approach enables facial editing seamlessly when generating\u0000talking face video. Experiments demonstrate that our proposed approach can\u0000effectively preserve texture details and generate temporally consistent video\u0000while remaining competitive in lip synchronization. Quantitative and\u0000qualitative results on the HDTF and MEAD datasets illustrate the superior\u0000performance of our method over existing methods.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression
arXiv - CS - Multimedia Pub Date : 2024-09-05 DOI: arxiv-2409.03385
Jingcheng Ke, Dele Wang, Jun-Cheng Chen, I-Hong Jhuo, Chia-Wen Lin, Yen-Yu Lin
{"title":"Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression","authors":"Jingcheng Ke, Dele Wang, Jun-Cheng Chen, I-Hong Jhuo, Chia-Wen Lin, Yen-Yu Lin","doi":"arxiv-2409.03385","DOIUrl":"https://doi.org/arxiv-2409.03385","url":null,"abstract":"One common belief is that with complex models and pre-training on large-scale\u0000datasets, transformer-based methods for referring expression comprehension\u0000(REC) perform much better than existing graph-based methods. We observe that\u0000since most graph-based methods adopt an off-the-shelf detector to locate\u0000candidate objects (i.e., regions detected by the object detector), they face\u0000two challenges that result in subpar performance: (1) the presence of\u0000significant noise caused by numerous irrelevant objects during reasoning, and\u0000(2) inaccurate localization outcomes attributed to the provided detector. To\u0000address these issues, we introduce a plug-and-adapt module guided by\u0000sub-expressions, called dynamic gate constraint (DGC), which can adaptively\u0000disable irrelevant proposals and their connections in graphs during reasoning.\u0000We further introduce an expression-guided regression strategy (EGR) to refine\u0000location prediction. Extensive experimental results on the RefCOCO, RefCOCO+,\u0000RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the\u0000effectiveness of the DGC module and the EGR strategy in consistently boosting\u0000the performances of various graph-based REC methods. Without any pretaining,\u0000the proposed graph-based method achieves better performance than the\u0000state-of-the-art (SOTA) transformer-based methods.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
arXiv - CS - Multimedia Pub Date : 2024-09-04 DOI: arxiv-2409.02889
Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang
{"title":"LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture","authors":"Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang","doi":"arxiv-2409.02889","DOIUrl":"https://doi.org/arxiv-2409.02889","url":null,"abstract":"Expanding the long-context capabilities of Multi-modal Large Language\u0000Models~(MLLMs) is crucial for video understanding, high-resolution image\u0000understanding, and multi-modal agents. This involves a series of systematic\u0000optimizations, including model architecture, data construction and training\u0000strategy, particularly addressing challenges such as textit{degraded\u0000performance with more images} and textit{high computational costs}. In this\u0000paper, we adapt the model architecture to a hybrid of Mamba and Transformer\u0000blocks, approach data construction with both temporal and spatial dependencies\u0000among multiple images and employ a progressive training strategy. The released\u0000model textbf{LongLLaVA}~(textbf{Long}-Context textbf{L}arge\u0000textbf{L}anguage textbf{a}nd textbf{V}ision textbf{A}ssistant) is the first\u0000hybrid MLLM, which achieved a better balance between efficiency and\u0000effectiveness. LongLLaVA not only achieves competitive results across various\u0000benchmarks, but also maintains high throughput and low memory consumption.\u0000Especially, it could process nearly a thousand images on a single A100 80GB\u0000GPU, showing promising application prospects for a wide range of tasks.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ExpLLM: Towards Chain of Thought for Facial Expression Recognition
arXiv - CS - Multimedia Pub Date : 2024-09-04 DOI: arxiv-2409.02828
Xing Lan, Jian Xue, Ji Qi, Dongmei Jiang, Ke Lu, Tat-Seng Chua
{"title":"ExpLLM: Towards Chain of Thought for Facial Expression Recognition","authors":"Xing Lan, Jian Xue, Ji Qi, Dongmei Jiang, Ke Lu, Tat-Seng Chua","doi":"arxiv-2409.02828","DOIUrl":"https://doi.org/arxiv-2409.02828","url":null,"abstract":"Facial expression recognition (FER) is a critical task in multimedia with\u0000significant implications across various domains. However, analyzing the causes\u0000of facial expressions is essential for accurately recognizing them. Current\u0000approaches, such as those based on facial action units (AUs), typically provide\u0000AU names and intensities but lack insight into the interactions and\u0000relationships between AUs and the overall expression. In this paper, we propose\u0000a novel method called ExpLLM, which leverages large language models to generate\u0000an accurate chain of thought (CoT) for facial expression recognition.\u0000Specifically, we have designed the CoT mechanism from three key perspectives:\u0000key observations, overall emotional interpretation, and conclusion. The key\u0000observations describe the AU's name, intensity, and associated emotions. The\u0000overall emotional interpretation provides an analysis based on multiple AUs and\u0000their interactions, identifying the dominant emotions and their relationships.\u0000Finally, the conclusion presents the final expression label derived from the\u0000preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed\u0000to construct this expression CoT and generate instruction-description data for\u0000training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets\u0000demonstrate that ExpLLM outperforms current state-of-the-art FER methods.\u0000ExpLLM also surpasses the latest GPT-4o in expression CoT generation,\u0000particularly in recognizing micro-expressions where GPT-4o frequently fails.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation
arXiv - CS - Multimedia Pub Date : 2024-09-04 DOI: arxiv-2409.02657
Jun Ling, Yiwen Wang, Han Xue, Rong Xie, Li Song
{"title":"PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation","authors":"Jun Ling, Yiwen Wang, Han Xue, Rong Xie, Li Song","doi":"arxiv-2409.02657","DOIUrl":"https://doi.org/arxiv-2409.02657","url":null,"abstract":"While previous audio-driven talking head generation (THG) methods generate\u0000head poses from driving audio, the generated poses or lips cannot match the\u0000audio well or are not editable. In this study, we propose textbf{PoseTalk}, a\u0000THG system that can freely generate lip-synchronized talking head videos with\u0000free head poses conditioned on text prompts and audio. The core insight of our\u0000method is using head pose to connect visual, linguistic, and audio signals.\u0000First, we propose to generate poses from both audio and text prompts, where the\u0000audio offers short-term variations and rhythm correspondence of the head\u0000movements and the text prompts describe the long-term semantics of head\u0000motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to\u0000generate motion latent from text prompts and audio cues in a pose latent space.\u0000Second, we observe a loss-imbalance problem: the loss for the lip region\u0000contributes less than 4% of the total reconstruction loss caused by both pose\u0000and lip, making optimization lean towards head movements rather than lip\u0000shapes. To address this issue, we propose a refinement-based learning strategy\u0000to synthesize natural talking videos using two cascaded networks, i.e.,\u0000CoarseNet, and RefineNet. The CoarseNet estimates coarse motions to produce\u0000animated images in novel poses and the RefineNet focuses on learning finer lip\u0000motions by progressively estimating lip motions from low-to-high resolutions,\u0000yielding improved lip-synchronization performance. Experiments demonstrate our\u0000pose prediction strategy achieves better pose diversity and realness compared\u0000to text-only or audio-only, and our video generator model outperforms\u0000state-of-the-art methods in synthesizing talking videos with natural head\u0000motions. Project: https://junleen.github.io/projects/posetalk.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation
arXiv - CS - Multimedia Pub Date : 2024-09-04 DOI: arxiv-2409.02555
Kangkai Zhang, Shiming Ge, Ruixin Shi, Dan Zeng
{"title":"Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation","authors":"Kangkai Zhang, Shiming Ge, Ruixin Shi, Dan Zeng","doi":"arxiv-2409.02555","DOIUrl":"https://doi.org/arxiv-2409.02555","url":null,"abstract":"Recognizing objects in low-resolution images is a challenging task due to the\u0000lack of informative details. Recent studies have shown that knowledge\u0000distillation approaches can effectively transfer knowledge from a\u0000high-resolution teacher model to a low-resolution student model by aligning\u0000cross-resolution representations. However, these approaches still face\u0000limitations in adapting to the situation where the recognized objects exhibit\u0000significant representation discrepancies between training and testing images.\u0000In this study, we propose a cross-resolution relational contrastive\u0000distillation approach to facilitate low-resolution object recognition. Our\u0000approach enables the student model to mimic the behavior of a well-trained\u0000teacher model which delivers high accuracy in identifying high-resolution\u0000objects. To extract sufficient knowledge, the student learning is supervised\u0000with contrastive relational distillation loss, which preserves the similarities\u0000in various relational structures in contrastive representation space. In this\u0000manner, the capability of recovering missing details of familiar low-resolution\u0000objects can be effectively enhanced, leading to a better knowledge transfer.\u0000Extensive experiments on low-resolution object classification and\u0000low-resolution face recognition clearly demonstrate the effectiveness and\u0000adaptability of our approach.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Coral Model Generation from Single Images for Virtual Reality Applications
arXiv - CS - Multimedia Pub Date : 2024-09-04 DOI: arxiv-2409.02376
Jie Fu (University of the Arts London, Creative Computing Institute, London, United Kingdom), Shun Fu (Bloks Technology Company, Shanghai, China), Mick Grierson (University of the Arts London, Creative Computing Institute, London, United Kingdom)
{"title":"Coral Model Generation from Single Images for Virtual Reality Applications","authors":"Jie FuUniversity of the Arts London, Creative Computing Institute, London, United Kingdom, Shun FuBloks Technology Company, Shanghai, China, Mick GriersonUniversity of the Arts London, Creative Computing Institute, London, United Kingdom","doi":"arxiv-2409.02376","DOIUrl":"https://doi.org/arxiv-2409.02376","url":null,"abstract":"With the rapid development of VR technology, the demand for high-quality 3D\u0000models is increasing. Traditional methods struggle with efficiency and quality\u0000in large-scale customization. This paper introduces a deep-learning framework\u0000that generates high-precision 3D coral models from a single image. Using the\u0000Coral dataset, the framework extracts geometric and texture features, performs\u00003D reconstruction, and optimizes design and material blending. Advanced\u0000optimization and polygon count control ensure shape accuracy, detail retention,\u0000and flexible output for various complexities, catering to high-quality\u0000rendering and real-time interaction needs.The project incorporates Explainable\u0000AI (XAI) to transform AI-generated models into interactive \"artworks,\" best\u0000viewed in VR and XR. This enhances model interpretability and human-machine\u0000collaboration. Real-time feedback in VR interactions displays information like\u0000coral species and habitat, enriching user experience. The generated models\u0000surpass traditional methods in detail, visual quality, and efficiency. This\u0000research offers an intelligent approach to 3D content creation for VR, lowering\u0000production barriers, and promoting widespread VR applications. Additionally,\u0000integrating XAI provides new insights into AI-generated visual content and\u0000advances research in 3D vision interpretability.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0