TrafPS: A Shapley-based visual analytics approach to interpret traffic
Zezheng Feng, Yifan Jiang, Hongjun Wang, Zipei Fan, Yuxin Ma, Shuang-Hua Yang, Huamin Qu, Xuan Song
Computational Visual Media, 2024-08-31. DOI: 10.1007/s41095-023-0351-7

Recent achievements in deep learning (DL) have demonstrated its potential in predicting traffic flows. Such predictions are beneficial for understanding the traffic situation and making traffic control decisions. However, most state-of-the-art DL models are considered "black boxes" with little to no transparency of the underlying mechanisms for end users. Some previous studies attempted to "open the black box" and increase the interpretability of generated predictions. However, handling complex models on large-scale spatiotemporal data and discovering salient spatial and temporal patterns that significantly influence traffic flow remain challenging. To overcome these challenges, we present TrafPS, a visual analytics approach for interpreting traffic prediction outcomes to support decision-making in traffic management and urban planning. The measurements region SHAP and trajectory SHAP are proposed to quantify the impact of flow patterns on urban traffic at different levels. Based on task requirements from domain experts, we employ an interactive visual interface for multi-aspect exploration and analysis of significant flow patterns. Two real-world case studies demonstrate the effectiveness of TrafPS in identifying key routes and providing decision-making support for urban planning.
CLIP-Flow: Decoding images encoded in CLIP space
Hao Ma, Ming Li, Jingyuan Yang, Or Patashnik, Dani Lischinski, Daniel Cohen-Or, Hui Huang
Computational Visual Media, 2024-08-28. DOI: 10.1007/s41095-023-0375-z

This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively utilize the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from such information. Moreover, to bridge the embedding space of CLIP and the latent space of StyleGAN, Real NVP is employed and modified with activation normalization and invertible convolution. As images and text in CLIP share the same representation space, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we tested on the public dataset Multi-Modal CelebA-HQ for text-to-image synthesis. Experiments validate that our approach can generate high-quality text-matching images and is comparable with state-of-the-art methods, both qualitatively and quantitatively.
CLIP-SP: Vision-language model with adaptive prompting for scene parsing
Jiaao Li, Yixiang Huang, Ming Wu, Bin Zhang, Xu Ji, Chuang Zhang
Computational Visual Media, 2024-08-27. DOI: 10.1007/s41095-024-0430-4

We present a novel framework, CLIP-SP, and a novel adaptive prompt method to leverage pre-trained knowledge from CLIP for scene parsing. Our approach addresses the limitations of DenseCLIP, which demonstrates the superior image segmentation provided by CLIP pre-trained models over ImageNet pre-trained models, but struggles with rough pixel-text score maps for complex scene parsing. We argue that, as they contain all textual information in a dataset, the pixel-text score maps, i.e., dense prompts, are inevitably mixed with noise. To overcome this challenge, we propose a two-step method. First, we extract visual and language features and perform multi-label classification to identify the most likely categories in the input images. Second, based on the top-k categories and confidence scores, our method generates scene tokens which can be treated as adaptive prompts for implicit modeling of scenes, and incorporates them into the visual features fed into the decoder for segmentation. Our method imposes a constraint on prompts and suppresses the probability of irrelevant categories appearing in the scene parsing results. Our method achieves competitive performance, limited by the available visual-language pre-trained models. Our CLIP-SP performs 1.14% better (in terms of mIoU) than DenseCLIP on ADE20K, using a ResNet-50 backbone.
{"title":"SGformer: Boosting transformers for indoor lighting estimation from a single image","authors":"Junhong Zhao, Bing Xue, Mengjie Zhang","doi":"10.1007/s41095-024-0447-8","DOIUrl":"https://doi.org/10.1007/s41095-024-0447-8","url":null,"abstract":"<p>Estimating lighting from standard images can effectively circumvent the need for resource-intensive high-dynamic-range (HDR) lighting acquisition. However, this task is often ill-posed and challenging, particularly for indoor scenes, due to the intricacy and ambiguity inherent in various indoor illumination sources. We propose an innovative transformer-based method called SGformer for lighting estimation through modeling spherical Gaussian (SG) distributions—a compact yet expressive lighting representation. Diverging from previous approaches, we explore underlying local and global dependencies in lighting features, which are crucial for reliable lighting estimation. Additionally, we investigate the structural relationships spanning various resolutions of SG distributions, ranging from sparse to dense, aiming to enhance structural consistency and curtail potential stochastic noise stemming from independent SG component regressions. By harnessing the synergy of local-global lighting representation learning and incorporating consistency constraints from various SG resolutions, the proposed method yields more accurate lighting estimates, allowing for more realistic lighting effects in object relighting and composition. Our code and model implementing our work can be found at https://github.com/junhong-jennifer-zhao/SGformer.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"10 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Central similarity consistency hashing for asymmetric image retrieval
Zhaofeng Xuan, Dayan Wu, Wanqian Zhang, Qinghang Su, Bo Li, Weiping Wang
Computational Visual Media, 2024-08-17. DOI: 10.1007/s41095-024-0428-y

Asymmetric image retrieval methods have drawn much attention due to their effectiveness in resource-constrained scenarios. They try to learn two models in an asymmetric paradigm, i.e., a small model for the query side and a large model for the gallery. However, we empirically find that the mutual training scheme (learning with each other) inevitably degrades the performance of the large gallery model, due to the negative effects exerted by the small query model. In this paper, we propose Central Similarity Consistency Hashing (CSCH), which simultaneously learns a small query model and a large gallery model in a mutually promoted manner, ensuring both high retrieval accuracy and efficiency on the query side. To achieve this, we first introduce heuristically generated hash centers as the common learning target for both models. Instead of randomly assigning each hash center to its corresponding category, we introduce the Hungarian algorithm to optimally match them by aligning the Hamming similarity of hash centers to the semantic similarity of their classes. Furthermore, we introduce an instance-level consistency loss, which enables explicit knowledge transfer from the gallery model to the query one without sacrificing gallery performance. Guided by the unified learning of hash centers and the knowledge distilled from the gallery model, the query model can be gradually aligned to the Hamming space of the gallery model in a decoupled manner. Extensive experiments demonstrate the superiority of our CSCH method compared with current state-of-the-art deep hashing methods. The open-source code is available at https://github.com/dubanx/CSCH.
{"title":"SAM-driven MAE pre-training and background-aware meta-learning for unsupervised vehicle re-identification","authors":"Dong Wang, Qi Wang, Weidong Min, Di Gai, Qing Han, Longfei Li, Yuhan Geng","doi":"10.1007/s41095-024-0424-2","DOIUrl":"https://doi.org/10.1007/s41095-024-0424-2","url":null,"abstract":"<p>Distinguishing identity-unrelated background information from discriminative identity information poses a challenge in unsupervised vehicle re-identification (Re-ID). Re-ID models suffer from varying degrees of background interference caused by continuous scene variations. The recently proposed segment anything model (SAM) has demonstrated exceptional performance in zero-shot segmentation tasks. The combination of SAM and vehicle Re-ID models can achieve efficient separation of vehicle identity and background information. This paper proposes a method that combines SAM-driven mask autoencoder (MAE) pre-training and background-aware meta-learning for unsupervised vehicle Re-ID. The method consists of three sub-modules. First, the segmentation capacity of SAM is utilized to separate the vehicle identity region from the background. SAM cannot be robustly employed in exceptional situations, such as those with ambiguity or occlusion. Thus, in vehicle Re-ID downstream tasks, a spatially-constrained vehicle background segmentation method is presented to obtain accurate background segmentation results. Second, SAM-driven MAE pre-training utilizes the aforementioned segmentation results to select patches belonging to the vehicle and to mask other patches, allowing MAE to learn identity-sensitive features in a self-supervised manner. Finally, we present a background-aware meta-learning method to fit varying degrees of background interference in different scenarios by combining different background region ratios. Our experiments demonstrate that the proposed method has state-of-the-art performance in reducing background interference variations.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"61 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Audio-guided implicit neural representation for local image stylization
Seung Hyun Lee, Sieun Kim, Wonmin Byeon, Gyeongrok Oh, Sumin In, Hyeongcheol Park, Sang Ho Yoon, Sung-Hee Hong, Jinkyu Kim, Sangpil Kim
Computational Visual Media, 2024-08-14. DOI: 10.1007/s41095-024-0413-5

We present a novel framework for audio-guided localized image stylization. Sound often provides information about the specific context of a scene and is closely related to a certain part of the scene or object. However, existing image stylization works have focused on stylizing the entire image using an image or text input. Stylizing a particular part of the image based on audio input is natural but challenging. This work proposes a framework in which a user provides one audio input to localize the target in the input image and another to locally stylize the target object or scene. We first produce a fine localization map using an audio-visual localization network that leverages the CLIP embedding space. We then utilize an implicit neural representation (INR) along with the predicted localization map to stylize the target based on sound information. The INR manipulates local pixel values to be semantically consistent with the provided audio input. Our experiments show that the proposed framework outperforms other audio-guided stylization methods. Moreover, we observe that our method constructs concise localization maps and naturally manipulates the target object or scene in accordance with the given audio input.
RecStitchNet: Learning to stitch images with rectangular boundaries
Yun Zhang, Yu-Kun Lai, Lang Nie, Fang-Lue Zhang, Lin Xu
Computational Visual Media, 2024-08-06. DOI: 10.1007/s41095-024-0420-6

Irregular boundaries in image stitching naturally occur due to freely moving cameras. To deal with this problem, existing methods focus on optimizing mesh warping to make boundaries regular using traditional explicit solutions. However, previous methods always depend on hand-crafted features (e.g., keypoints and line segments), so failures often happen in overlapping regions without distinctive features. In this paper, we address this problem by proposing RecStitchNet, a reasonable and effective network for image stitching with rectangular boundaries. Considering that both stitching and imposing rectangularity are non-trivial tasks in a learning-based framework, we propose a three-step progressive learning strategy, which not only simplifies the task but also gradually achieves a good balance between stitching and imposing rectangularity. In the first step, we perform initial stitching with a pre-trained state-of-the-art image stitching model to produce initially warped stitching results without considering the boundary constraint. Then, we use a regression network with a comprehensive objective regarding mesh, perception, and shape to further encourage the stitched meshes to have rectangular boundaries with high content fidelity. Finally, we propose an unsupervised instance-wise optimization strategy to refine the stitched meshes iteratively, which effectively improves the stitching results in terms of feature alignment as well as boundary and structure preservation. Due to the lack of stitching datasets and the difficulty of label generation, we propose to generate a stitching dataset with rectangular stitched images as pseudo-ground-truth labels, and the performance upper bound induced by it can be surpassed by our unsupervised refinement. Qualitative and quantitative results and evaluations demonstrate the advantages of our method over the state-of-the-art.
{"title":"Taming diffusion model for exemplar-based image translation","authors":"Hao Ma, Jingyuan Yang, Hui Huang","doi":"10.1007/s41095-023-0371-3","DOIUrl":"https://doi.org/10.1007/s41095-023-0371-3","url":null,"abstract":"<p>Exemplar-based image translation involves converting semantic masks into photorealistic images that adopt the style of a given exemplar. However, most existing GAN-based translation methods fail to produce photorealistic results. In this study, we propose a new diffusion model-based approach for generating high-quality images that are semantically aligned with the input mask and resemble an exemplar in style. The proposed method trains a conditional denoising diffusion probabilistic model (DDPM) with a SPADE module to integrate the semantic map. We then used a novel contextual loss and auxiliary color loss to guide the optimization process, resulting in images that were visually pleasing and semantically accurate. Experiments demonstrate that our method outperforms state-of-the-art approaches in terms of both visual quality and quantitative metrics.</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"35 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LDTR: Transformer-based lane detection with anchor-chain representation","authors":"Zhongyu Yang, Chen Shen, Wei Shao, Tengfei Xing, Runbo Hu, Pengfei Xu, Hua Chai, Ruini Xue","doi":"10.1007/s41095-024-0421-5","DOIUrl":"https://doi.org/10.1007/s41095-024-0421-5","url":null,"abstract":"<p>Despite recent advances in lane detection methods, scenarios with limited- or no-visual-clue of lanes due to factors such as lighting conditions and occlusion remain challenging and crucial for automated driving. Moreover, current lane representations require complex post-processing and struggle with specific instances. Inspired by the DETR architecture, we propose LDTR, a transformer-based model to address these issues. Lanes are modeled with a novel anchor-chain, regarding a lane as a whole from the beginning, which enables LDTR to handle special lanes inherently. To enhance lane instance perception, LDTR incorporates a novel multi-referenced deformable attention module to distribute attention around the object. Additionally, LDTR incorporates two line IoU algorithms to improve convergence efficiency and employs a Gaussian heatmap auxiliary branch to enhance model representation capability during training. To evaluate lane detection models, we rely on Fréchet distance, parameterized Fl-score, and additional synthetic metrics. Experimental results demonstrate that LDTR achieves state-of-the-art performance on well-known datasets.</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":"32 1","pages":""},"PeriodicalIF":6.9,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}