Yanran Dai, Jing Li, Yuqi Jiang, Haidong Qin, Bang Liang, Shikuan Hong, Haozhe Pan, Tao Yang
{"title":"Real-time distance field acceleration based free-viewpoint video synthesis for large sports fields","authors":"Yanran Dai, Jing Li, Yuqi Jiang, Haidong Qin, Bang Liang, Shikuan Hong, Haozhe Pan, Tao Yang","doi":"10.1007/s41095-022-0323-3","DOIUrl":"https://doi.org/10.1007/s41095-022-0323-3","url":null,"abstract":"<p>Free-viewpoint video allows the user to view objects from any virtual perspective, creating an immersive visual experience. This technology enhances the interactivity and freedom of multimedia performances. However, many free-viewpoint video synthesis methods hardly satisfy the requirement to work in real time with high precision, particularly for sports fields having large areas and numerous moving objects. To address these issues, we propose a free-viewpoint video synthesis method based on distance field acceleration. The central idea is to fuse multi-view distance field information and use it to adjust the search step size adaptively. Adaptive step size search is used in two ways: for fast estimation of multi-object three-dimensional surfaces, and synthetic view rendering based on global occlusion judgement. We have implemented our ideas using parallel computing for interactive display, using CUDA and OpenGL frameworks, and have used real-world and simulated experimental datasets for evaluation. The results show that the proposed method can render free-viewpoint videos with multiple objects on large sports fields at 25 fps. Furthermore, the visual quality of our synthetic novel viewpoint images exceeds that of state-of-the-art neural-rendering-based methods.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"12 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139084308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-modal visual tracking: Review and experimental comparison","authors":"Pengyu Zhang, Dong Wang, Huchuan Lu","doi":"10.1007/s41095-023-0345-5","DOIUrl":"https://doi.org/10.1007/s41095-023-0345-5","url":null,"abstract":"<p>Visual object tracking has been drawing increasing attention in recent years, as a fundamental task in computer vision. To extend the range of tracking applications, researchers have been introducing information from multiple modalities to handle specific scenes, with promising research prospects for emerging methods and benchmarks. To provide a thorough review of multi-modal tracking, different aspects of multi-modal tracking algorithms are summarized under a unified taxonomy, with specific focus on visible-depth (RGB-D) and visible-thermal (RGB-T) tracking. Subsequently, a detailed description of the related benchmarks and challenges is provided. Extensive experiments were conducted to analyze the effectiveness of trackers on five datasets: PTB, VOT19-RGBD, GTOT, RGBT234, and VOT19-RGBT. Finally, various future directions, including model design and dataset construction, are discussed from different perspectives for further research.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"15 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139084506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuantian Huang, Satoshi Iizuka, Edgar Simo-Serra, Kazuhiro Fukui
{"title":"Controllable multi-domain semantic artwork synthesis","authors":"Yuantian Huang, Satoshi Iizuka, Edgar Simo-Serra, Kazuhiro Fukui","doi":"10.1007/s41095-023-0356-2","DOIUrl":"https://doi.org/10.1007/s41095-023-0356-2","url":null,"abstract":"<p>We present a novel framework for the multi-domain synthesis of artworks from semantic layouts. One of the main limitations of this challenging task is the lack of publicly available segmentation datasets for art synthesis. To address this problem, we propose a dataset called <i>ArtSem</i> that contains 40,000 images of artwork from four different domains, with their corresponding semantic label maps. We first extracted semantic maps from landscape photography and used a conditional generative adversarial network (GAN)-based approach for generating high-quality artwork from semantic maps without requiring paired training data. Furthermore, we propose an artwork-synthesis model using domain-dependent variational encoders for high-quality multi-domain synthesis. Subsequently, the model was improved and complemented with a simple but effective normalization method based on jointly normalizing semantics and style, which we call spatially style-adaptive normalization (SSTAN). Compared to the previous methods, which only take semantic layout as the input, our model jointly learns style and semantic information representation, improving the generation quality of artistic images. These results indicate that our model learned to separate the domains in the latent space. Thus, we can perform fine-grained control of the synthesized artwork by identifying hyperplanes that separate the different domains. Moreover, by combining the proposed dataset and approach, we generated user-controllable artworks of higher quality than that of existing approaches, as corroborated by quantitative metrics and a user study.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"21 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139084323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yihao Liu, Hengyuan Zhao, Kelvin C. K. Chan, Xintao Wang, Chen Change Loy, Yu Qiao, Chao Dong
{"title":"Temporally consistent video colorization with deep feature propagation and self-regularization learning","authors":"Yihao Liu, Hengyuan Zhao, Kelvin C. K. Chan, Xintao Wang, Chen Change Loy, Yu Qiao, Chao Dong","doi":"10.1007/s41095-023-0342-8","DOIUrl":"https://doi.org/10.1007/s41095-023-0342-8","url":null,"abstract":"<p>Video colorization is a challenging and highly ill-posed problem. Although recent years have witnessed remarkable progress in single image colorization, there is relatively less research effort on video colorization, and existing methods always suffer from severe flickering artifacts (temporal inconsistency) or unsatisfactory colorization. We address this problem from a new perspective, by jointly considering colorization and temporal consistency in a unified framework. Specifically, we propose a novel temporally consistent video colorization (TCVC) framework. TCVC effectively propagates frame-level deep features in a bidirectional way to enhance the temporal consistency of colorization. Furthermore, TCVC introduces a self-regularization learning (SRL) scheme to minimize the differences in predictions obtained using different time steps. SRL does not require any ground-truth color videos for training and can further improve temporal consistency. Experiments demonstrate that our method can not only provide visually pleasing colorized video, but also with clearly better temporal consistency than state-of-the-art methods. A video demo is provided at https://www.youtube.com/watch?v=c7dczMs-olE, while code is available at https://github.com/lyh-18/TCVC-Temporally-Consistent-Video-Colorization.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"55 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139084508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-granularity sequence generation for hierarchical image classification","authors":"Xinda Liu, Lili Wang","doi":"10.1007/s41095-022-0332-2","DOIUrl":"https://doi.org/10.1007/s41095-022-0332-2","url":null,"abstract":"<p>Hierarchical multi-granularity image classification is a challenging task that aims to tag each given image with multiple granularity labels simultaneously. Existing methods tend to overlook that different image regions contribute differently to label prediction at different granularities, and also insufficiently consider relationships between the hierarchical multi-granularity labels. We introduce a sequence-to-sequence mechanism to overcome these two problems and propose a multi-granularity sequence generation (MGSG) approach for the hierarchical multi-granularity image classification task. Specifically, we introduce a transformer architecture to encode the image into visual representation sequences. Next, we traverse the taxonomic tree and organize the multi-granularity labels into sequences, and vectorize them and add positional information. The proposed multi-granularity sequence generation method builds a decoder that takes visual representation sequences and semantic label embedding as inputs, and outputs the predicted multi-granularity label sequence. The decoder models dependencies and correlations between multi-granularity labels through a masked multi-head self-attention mechanism, and relates visual information to the semantic label information through a cross-modality attention mechanism. In this way, the proposed method preserves the relationships between labels at different granularity levels and takes into account the influence of different image regions on labels with different granularities. Evaluations on six public benchmarks qualitatively and quantitatively demonstrate the advantages of the proposed method. Our project is available at https://github.com/liuxindazz/mgsg.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"19 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139084507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating diverse clothed 3D human animations via a generative model","authors":"Min Shi, Wenke Feng, Lin Gao, Dengming Zhu","doi":"10.1007/s41095-022-0324-2","DOIUrl":"https://doi.org/10.1007/s41095-022-0324-2","url":null,"abstract":"<p>Data-driven garment animation is a current topic of interest in the computer graphics industry. Existing approaches generally establish the mapping between a single human pose or a temporal pose sequence, and garment deformation, but it is difficult to quickly generate diverse clothed human animations. We address this problem with a method to automatically synthesize dressed human animations with temporal consistency from a specified human motion label. At the heart of our method is a two-stage strategy. Specifically, we first learn a latent space encoding the sequence-level distribution of human motions utilizing a transformer-based conditional variational autoencoder (Transformer-CVAE). Then a garment simulator synthesizes dynamic garment shapes using a transformer encoder–decoder architecture. Since the learned latent space comes from varied human motions, our method can generate a variety of styles of motions given a specific motion label. By means of a novel beginning of sequence (BOS) learning strategy and a self-supervised refinement procedure, our garment simulator is capable of efficiently synthesizing garment deformation sequences corresponding to the generated human motions while maintaining temporal and spatial consistency. We verify our ideas experimentally. This is the first generative model that directly dresses human animation.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"24 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139084510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning physically based material and lighting decompositions for face editing","authors":"Qian Zhang, Vikas Thamizharasan, James Tompkin","doi":"10.1007/s41095-022-0309-1","DOIUrl":"https://doi.org/10.1007/s41095-022-0309-1","url":null,"abstract":"<p>Lighting is crucial for portrait photography, yet the complex interactions between the skin and incident light are expensive to model computationally in graphics and difficult to reconstruct analytically via computer vision. Alternatively, to allow fast and controllable reflectance and lighting editing, we developed a physically based decomposition through deep learned priors from path-traced portrait images. Previous approaches that used simplified material models or low-frequency or low-dynamic-range lighting struggled to model specular reflections or relight directly without intermediate decomposition. However, we estimate the surface normal, skin albedo and roughness, and high-frequency HDRI maps, and propose an architecture to estimate both diffuse and specular reflectance components. In our experiments, we show that this approach can represent the true appearance function more effectively than simpler baseline methods, leading to better generalization and higher-quality editing.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"1 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139084560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical vectorization for facial images","authors":"Qian Fu, Linlin Liu, Fei Hou, Ying He","doi":"10.1007/s41095-022-0314-4","DOIUrl":"https://doi.org/10.1007/s41095-022-0314-4","url":null,"abstract":"<p>The explosive growth of social media means portrait editing and retouching are in high demand. While portraits are commonly captured and stored as raster images, editing raster images is non-trivial and requires the user to be highly skilled. Aiming at developing intuitive and easy-to-use portrait editing tools, we propose a novel vectorization method that can automatically convert raster images into a 3-tier hierarchical representation. The base layer consists of a set of sparse diffusion curves (DCs) which characterize salient geometric features and low-frequency colors, providing a means for semantic color transfer and facial expression editing. The middle level encodes specular highlights and shadows as large, editable Poisson regions (PRs) and allows the user to directly adjust illumination by tuning the strength and changing the shapes of PRs. The top level contains two types of pixel-sized PRs for high-frequency residuals and fine details such as pimples and pigmentation. We train a deep generative model that can produce high-frequency residuals automatically. Thanks to the inherent meaning in vector primitives, editing portraits becomes easy and intuitive. In particular, our method supports color transfer, facial expression editing, highlight and shadow editing, and automatic retouching. To quantitatively evaluate the results, we extend the commonly used FLIP metric (which measures color and feature differences between two images) to consider illumination. The new metric, illumination-sensitive FLIP, can effectively capture salient changes in color transfer results, and is more consistent with human perception than FLIP and other quality measures for portrait images. We evaluate our method on the FFHQR dataset and show it to be effective for common portrait editing tasks, such as retouching, light editing, color transfer, and expression editing.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"7 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139078756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A unified multi-view multi-person tracking framework","authors":"Fan Yang, Shigeyuki Odashima, Sosuke Yamao, Hiroaki Fujimoto, Shoichi Masui, Shan Jiang","doi":"10.1007/s41095-023-0334-8","DOIUrl":"https://doi.org/10.1007/s41095-023-0334-8","url":null,"abstract":"<p>Despite significant developments in 3D multi-view multi-person (3D MM) tracking, current frameworks separately target footprint tracking, or pose tracking. Frameworks designed for the former cannot be used for the latter, because they directly obtain 3D positions on the ground plane via a homography projection, which is inapplicable to 3D poses above the ground. In contrast, frameworks designed for pose tracking generally isolate multi-view and multi-frame associations and may not be sufficiently robust for footprint tracking, which utilizes fewer key points than pose tracking, weakening multi-view association cues in a single frame. This study presents a unified multi-view multi-person tracking framework to bridge the gap between footprint tracking and pose tracking. Without additional modifications, the framework can adopt monocular 2D bounding boxes and 2D poses as its input to produce robust 3D trajectories for multiple persons. Importantly, multi-frame and multi-view information are jointly employed to improve association and triangulation. Our framework is shown to provide state-of-the-art performance on the <i>Campus</i> and <i>Shelf</i> datasets for 3D pose tracking, with comparable results on the <i>WILDTRACK</i> and <i>MMPTRACK</i> datasets for 3D footprint tracking.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"4 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139078705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}