Bo Han, Yuheng Li, Yixuan Shen, Yi Ren, Feilin Han
{"title":"Dance2MIDI: Dance-driven multi-instrument music generation","authors":"Bo Han, Yuheng Li, Yixuan Shen, Yi Ren, Feilin Han","doi":"10.1007/s41095-024-0417-1","DOIUrl":"https://doi.org/10.1007/s41095-024-0417-1","url":null,"abstract":"<p>Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous works focus on monophonic or raw audio generation, while the multi-instrument scenario is under-explored. The challenges associated with dance-driven multi-instrument music (MIDI) generation are twofold: (i) lack of a publicly available multi-instrument MIDI and video paired dataset and (ii) the weak correlation between music and video. To tackle these challenges, we have built the first multi-instrument MIDI and dance paired dataset (D2MIDI). Based on this dataset, we introduce a multi-instrument MIDI generation framework (Dance2MIDI) conditioned on dance video. Firstly, to capture the relationship between dance and music, we employ a graph convolutional network to encode the dance motion. This allows us to extract features related to dance movement and dance style. Secondly, to generate a harmonious rhythm, we utilize a transformer model to decode the drum track sequence, leveraging a cross-attention mechanism. Thirdly, we model the task of generating the remaining tracks based on the drum track as a sequence understanding and completion task. A BERT-like model is employed to comprehend the context of the entire music piece through self-supervised learning. We evaluate the music generated by our framework trained on the D2MIDI dataset and demonstrate that our method achieves state-of-the-art performance.</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"38 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continual few-shot patch-based learning for anime-style colorization","authors":"Akinobu Maejima, Seitaro Shinagawa, Hiroyuki Kubo, Takuya Funatomi, Tatsuo Yotsukura, Satoshi Nakamura, Yasuhiro Mukaigawa","doi":"10.1007/s41095-024-0414-4","DOIUrl":"https://doi.org/10.1007/s41095-024-0414-4","url":null,"abstract":"<p>The automatic colorization of anime line drawings is a challenging problem in production pipelines. Recent advances in deep neural networks have addressed this problem; however, collectingmany images of colorization targets in novel anime work before the colorization process starts leads to chicken-and-egg problems and has become an obstacle to using them in production pipelines. To overcome this obstacle, we propose a new patch-based learning method for few-shot anime-style colorization. The learning method adopts an efficient patch sampling technique with position embedding according to the characteristics of anime line drawings. We also present a continuous learning strategy that continuously updates our colorization model using new samples colorized by human artists. The advantage of our method is that it can learn our colorization model from scratch or pre-trained weights using only a few pre- and post-colorized line drawings that are created by artists in their usual colorization work. Therefore, our method can be easily incorporated within existing production pipelines. We quantitatively demonstrate that our colorizationmethod outperforms state-of-the-art methods.</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"46 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141576145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tong Wu, Yu-Jie Yuan, Ling-Xiao Zhang, Jie Yang, Yan-Pei Cao, Ling-Qi Yan, Lin Gao
{"title":"Recent advances in 3D Gaussian splatting","authors":"Tong Wu, Yu-Jie Yuan, Ling-Xiao Zhang, Jie Yang, Yan-Pei Cao, Ling-Qi Yan, Lin Gao","doi":"10.1007/s41095-024-0436-y","DOIUrl":"https://doi.org/10.1007/s41095-024-0436-y","url":null,"abstract":"<p>The emergence of 3D Gaussian splatting (3DGS) has greatly accelerated rendering in novel view synthesis. Unlike neural implicit representations like neural radiance fields (NeRFs) that represent a 3D scene with position and viewpoint-conditioned neural networks, 3D Gaussian splatting utilizes a set of Gaussian ellipsoids to model the scene so that efficient rendering can be accomplished by rasterizing Gaussian ellipsoids into images. Apart from fast rendering, the explicit representation of 3D Gaussian splatting also facilitates downstream tasks like dynamic reconstruction, geometry editing, and physical simulation. Considering the rapid changes and growing number of works in this field, we present a literature review of recent 3D Gaussian splatting methods, which can be roughly classified by functionality into 3D reconstruction, 3D editing, and other downstream applications. Traditional point-based rendering methods and the rendering formulation of 3D Gaussian splatting are also covered to aid understanding of this technique. This survey aims to help beginners to quickly get started in this field and to provide experienced researchers with a comprehensive overview, aiming to stimulate future development of the 3D Gaussian splatting representation.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"52 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141576141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Illuminator: Image-based illumination editing for indoor scene harmonization","authors":"Zhongyun Bao, Gang Fu, Zipei Chen, Chunxia Xiao","doi":"10.1007/s41095-023-0397-6","DOIUrl":"https://doi.org/10.1007/s41095-023-0397-6","url":null,"abstract":"<p>Illumination harmonization is an important but challenging task that aims to achieve illumination compatibility between the foreground and background under different illumination conditions. Most current studies mainly focus on achieving seamless integration between the appearance (illumination or visual style) of the foreground object itself and the background scene or producing the foreground shadow. They rarely considered global illumination consistency (i.e., the illumination and shadow of the foreground object). In our work, we introduce “Illuminator”, an image-based illumination editing technique. This method aims to achieve more realistic global illumination harmonization, ensuring consistent illumination and plausible shadows in complex indoor environments. The Illuminator contains a shadow residual generation branch and an object illumination transfer branch. The shadow residual generation branch introduces a novel attention-aware graph convolutional mechanism to achieve reasonable foreground shadow generation. The object illumination transfer branch primarily transfers background illumination to the foreground region. In addition, we construct a real-world indoor illumination harmonization dataset called RIH, which consists of various foreground objects and background scenes captured under diverse illumination conditions for training and evaluating our Illuminator. Our comprehensive experiments, conducted on the RIH dataset and a collection of real-world everyday life photos, validate the effectiveness of our method.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"12 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141549741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yu Xing, Xiaoxuan Wang, Lin Lu, Andrei Sharf, Daniel Cohen-Or, Changhe Tu
{"title":"Shell stand: Stable thin shell models for 3D fabrication","authors":"Yu Xing, Xiaoxuan Wang, Lin Lu, Andrei Sharf, Daniel Cohen-Or, Changhe Tu","doi":"10.1007/s41095-024-0402-8","DOIUrl":"https://doi.org/10.1007/s41095-024-0402-8","url":null,"abstract":"<p>A thin shell model refers to a surface or structure, where the object’s thickness is considered negligible. In the context of 3D printing, thin shell models are characterized by having lightweight, hollow structures, and reduced material usage. Their versatility and visual appeal make them popular in various fields, such as cloth simulation, character skinning, and for thin-walled structures like leaves, paper, or metal sheets. Nevertheless, optimization of thin shell models without external support remains a challenge due to their minimal interior operational space. For the same reasons, hollowing methods are also unsuitable for this task. In fact, thin shell modulation methods are required to preserve the visual appearance of a two-sided surface which further constrain the problem space. In this paper, we introduce a new visual disparity metric tailored for shell models, integrating local details and global shape attributes in terms of visual perception. Our method modulates thin shell models using global deformations and local thickening while accounting for visual saliency, stability, and structural integrity. Thereby, thin shell models such as bas-reliefs, hollow shapes, and cloth can be stabilized to stand in arbitrary orientations, making them ideal for 3D printing.</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"18 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141508780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal vectorized visibility for direct illumination of animated models","authors":"Zhenni Wang, Tze Yui Ho, Yi Xiao, Chi Sing Leung","doi":"10.1007/s41095-023-0339-3","DOIUrl":"https://doi.org/10.1007/s41095-023-0339-3","url":null,"abstract":"<p>Direct illumination rendering is an important technique in computer graphics. Precomputed radiance transfer algorithms can provide high quality rendering results in real time, but they can only support rigid models. On the other hand, ray tracing algorithms are flexible and can gracefully handle animated models. With NVIDIA RTX and the AI denoiser, we can use ray tracing algorithms to render visually appealing results in real time. Visually appealing though, they can deviate from the actual one considerably. We propose a visibility-boundary edge oriented infinite triangle bounding volume hierarchy (BVH) traversal algorithm to dynamically generate visibility in vector form. Our algorithm utilizes the properties of visibility-boundary edges and infinite triangle BVH traversal to maximize the efficiency of the vector form visibility generation. A novel data structure, temporal vectorized visibility, is proposed, which allows visibility in vector form to be shared across time and further increases the generation efficiency. Our algorithm can efficiently render close-to-reference direct illumination results. With the similar processing time, it provides a visual quality improvement around 10 dB in terms of peak signal-to-noise ratio (PSNR) w.r.t. the ray tracing algorithm reservoir-based spatiotemporal importance resampling (ReSTIR).\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"6 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141167717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Super-resolution reconstruction of single image for latent features","authors":"Xin Wang, Jing-Ke Yan, Jing-Ye Cai, Jian-Hua Deng, Qin Qin, Yao Cheng","doi":"10.1007/s41095-023-0387-8","DOIUrl":"https://doi.org/10.1007/s41095-023-0387-8","url":null,"abstract":"<p>Single-image super-resolution (SISR) typically focuses on restoring various degraded low-resolution (LR) images to a single high-resolution (HR) image. However, during SISR tasks, it is often challenging for models to simultaneously maintain high quality and rapid sampling while preserving diversity in details and texture features. This challenge can lead to issues such as model collapse, lack of rich details and texture features in the reconstructed HR images, and excessive time consumption for model sampling. To address these problems, this paper proposes a Latent Feature-oriented Diffusion Probability Model (LDDPM). First, we designed a conditional encoder capable of effectively encoding LR images, reducing the solution space for model image reconstruction and thereby improving the quality of the reconstructed images. We then employed a normalized flow and multimodal adversarial training, learning from complex multimodal distributions, to model the denoising distribution. Doing so boosts the generative modeling capabilities within a minimal number of sampling steps. Experimental comparisons of our proposed model with existing SISR methods on mainstream datasets demonstrate that our model reconstructs more realistic HR images and achieves better performance on multiple evaluation metrics, providing a fresh perspective for tackling SISR tasks.</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"39 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Foundation models meet visualizations: Challenges and opportunities","authors":"Weikai Yang, Mengchen Liu, Zheng Wang, Shixia Liu","doi":"10.1007/s41095-023-0393-x","DOIUrl":"https://doi.org/10.1007/s41095-023-0393-x","url":null,"abstract":"<p>Recent studies have indicated that foundation models, such as BERT and GPT, excel at adapting to various downstream tasks. This adaptability has made them a dominant force in building artificial intelligence (AI) systems. Moreover, a new research paradigm has emerged as visualization techniques are incorporated into these models. This study divides these intersections into two research areas: visualization for foundation model (VIS4FM) and foundation model for visualization (FM4VIS). In terms of VIS4FM, we explore the primary role of visualizations in understanding, refining, and evaluating these intricate foundation models. VIS4FM addresses the pressing need for transparency, explainability, fairness, and robustness. Conversely, in terms of FM4VIS, we highlight how foundation models can be used to advance the visualization field itself. The intersection of foundation models with visualizations is promising but also introduces a set of challenges. By highlighting these challenges and promising opportunities, this study aims to provide a starting point for the continued exploration of this research avenue.</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"113 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140883590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning layout generation for virtual worlds","authors":"Weihao Cheng, Ying Shan","doi":"10.1007/s41095-023-0365-1","DOIUrl":"https://doi.org/10.1007/s41095-023-0365-1","url":null,"abstract":"<p>The emergence of the metaverse has led to the rapidly increasing demand for the generation of extensive 3D worlds. We consider that an engaging world is built upon a rational layout of multiple land-use areas (e.g., forest, meadow, and farmland). To this end, we propose a generative model of land-use distribution that learns from geographic data. The model is based on a transformer architecture that generates a 2D map of the land-use layout, which can be conditioned on spatial and semantic controls, depending on whether either one or both are provided. This model enables diverse layout generation with user control and layout expansion by extending borders with partial inputs. To generate high-quality and satisfactory layouts, we devise a geometric objective function that supervises the model to perceive layout shapes and regularize generations using geometric priors. Additionally, we devise a planning objective function that supervises the model to perceive progressive composition demands and suppress generations deviating from controls. To evaluate the spatial distribution of the generations, we train an autoencoder to embed land-use layouts into vectors to enable comparison between the real and generated data using the Wasserstein metric, which is inspired by the Fréchet inception distance.</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"13 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140883711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AdaPIP: Adaptive picture-in-picture guidance for 360° film watching","authors":"Yi-Xiao Li, Guan Luo, Yi-Ke Xu, Yu He, Fang-Lue Zhang, Song-Hai Zhang","doi":"10.1007/s41095-023-0347-3","DOIUrl":"https://doi.org/10.1007/s41095-023-0347-3","url":null,"abstract":"<p>360° videos enable viewers to watch freely from different directions but inevitably prevent them from perceiving all the helpful information. To mitigate this problem, picture-in-picture (PIP) guidance was proposed using preview windows to show regions of interest (ROIs) outside the current view range. We identify several drawbacks of this representation and propose a new method for 360° film watching called AdaPIP. AdaPIP enhances traditional PIP by adaptively arranging preview windows with changeable view ranges and sizes. In addition, AdaPIP incorporates the advantage of arrow-based guidance by presenting circular windows with arrows attached to them to help users locate the corresponding ROIs more efficiently. We also adapted AdaPIP and Outside-In to HMD-based immersive virtual reality environments to demonstrate the usability of PIP-guided approaches beyond 2D screens. Comprehensive user experiments on 2D screens, as well as in VR environments, indicate that AdaPIP is superior to alternative methods in terms of visual experiences while maintaining a comparable degree of immersion.</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"17 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140883707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}