{"title":"LPT++: Efficient Training on Mixture of Long-tailed Experts","authors":"Bowen Dong, Pan Zhou, Wangmeng Zuo","doi":"arxiv-2409.11323","DOIUrl":"https://doi.org/arxiv-2409.11323","url":null,"abstract":"We introduce LPT++, a comprehensive framework for long-tailed classification\u0000that combines parameter-efficient fine-tuning (PEFT) with a learnable model\u0000ensemble. LPT++ enhances frozen Vision Transformers (ViTs) through the\u0000integration of three core components. The first is a universal long-tailed\u0000adaptation module, which aggregates long-tailed prompts and visual adapters to\u0000adapt the pretrained model to the target domain, meanwhile improving its\u0000discriminative ability. The second is the mixture of long-tailed experts\u0000framework with a mixture-of-experts (MoE) scorer, which adaptively calculates\u0000reweighting coefficients for confidence scores from both visual-only and\u0000visual-language (VL) model experts to generate more accurate predictions.\u0000Finally, LPT++ employs a three-phase training framework, wherein each critical\u0000module is learned separately, resulting in a stable and effective long-tailed\u0000classification training paradigm. Besides, we also propose the simple version\u0000of LPT++ namely LPT, which only integrates visual-only pretrained ViT and\u0000long-tailed prompts to formulate a single model method. LPT can clearly\u0000illustrate how long-tailed prompts works meanwhile achieving comparable\u0000performance without VL pretrained models. Experiments show that, with only ~1%\u0000extra trainable parameters, LPT++ achieves comparable accuracy against all the\u0000counterparts.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TopoMaskV2: Enhanced Instance-Mask-Based Formulation for the Road Topology Problem","authors":"M. Esat Kalfaoglu, Halil Ibrahim Ozturk, Ozsel Kilinc, Alptekin Temizel","doi":"arxiv-2409.11325","DOIUrl":"https://doi.org/arxiv-2409.11325","url":null,"abstract":"Recently, the centerline has become a popular representation of lanes due to\u0000its advantages in solving the road topology problem. To enhance centerline\u0000prediction, we have developed a new approach called TopoMask. Unlike previous\u0000methods that rely on keypoints or parametric methods, TopoMask utilizes an\u0000instance-mask-based formulation coupled with a masked-attention-based\u0000transformer architecture. We introduce a quad-direction label representation to\u0000enrich the mask instances with flow information and design a corresponding\u0000post-processing technique for mask-to-centerline conversion. Additionally, we\u0000demonstrate that the instance-mask formulation provides complementary\u0000information to parametric Bezier regressions, and fusing both outputs leads to\u0000improved detection and topology performance. Moreover, we analyze the\u0000shortcomings of the pillar assumption in the Lift Splat technique and adapt a\u0000multi-height bin configuration. Experimental results show that TopoMask\u0000achieves state-of-the-art performance in the OpenLane-V2 dataset, increasing\u0000from 44.1 to 49.4 for Subset-A and 44.7 to 51.8 for Subset-B in the V1.1 OLS\u0000baseline.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"fMRI-3D: A Comprehensive Dataset for Enhancing fMRI-based 3D Reconstruction","authors":"Jianxiong Gao, Yuqian Fu, Yun Wang, Xuelin Qian, Jianfeng Feng, Yanwei Fu","doi":"arxiv-2409.11315","DOIUrl":"https://doi.org/arxiv-2409.11315","url":null,"abstract":"Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI)\u0000data, introduced as Recon3DMind in our conference work, is of significant\u0000interest to both cognitive neuroscience and computer vision. To advance this\u0000task, we present the fMRI-3D dataset, which includes data from 15 participants\u0000and showcases a total of 4768 3D objects. The dataset comprises two components:\u0000fMRI-Shape, previously introduced and accessible at\u0000https://huggingface.co/datasets/Fudan-fMRI/fMRI-Shape, and fMRI-Objaverse,\u0000proposed in this paper and available at\u0000https://huggingface.co/datasets/Fudan-fMRI/fMRI-Objaverse. fMRI-Objaverse\u0000includes data from 5 subjects, 4 of whom are also part of the Core set in\u0000fMRI-Shape, with each subject viewing 3142 3D objects across 117 categories,\u0000all accompanied by text captions. This significantly enhances the diversity and\u0000potential applications of the dataset. Additionally, we propose MinD-3D, a\u0000novel framework designed to decode 3D visual information from fMRI signals. The\u0000framework first extracts and aggregates features from fMRI data using a\u0000neuro-fusion encoder, then employs a feature-bridge diffusion model to generate\u0000visual features, and finally reconstructs the 3D object using a generative\u0000transformer decoder. We establish new benchmarks by designing metrics at both\u0000semantic and structural levels to evaluate model performance. Furthermore, we\u0000assess our model's effectiveness in an Out-of-Distribution setting and analyze\u0000the attribution of the extracted features and the visual ROIs in fMRI signals.\u0000Our experiments demonstrate that MinD-3D not only reconstructs 3D objects with\u0000high semantic and spatial accuracy but also deepens our understanding of how\u0000human brain processes 3D visual information. Project page at:\u0000https://jianxgao.github.io/MinD-3D.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion","authors":"Zhenwei Wang, Tengfei Wang, Zexin He, Gerhard Hancke, Ziwei Liu, Rynson W. H. Lau","doi":"arxiv-2409.11406","DOIUrl":"https://doi.org/arxiv-2409.11406","url":null,"abstract":"In 3D modeling, designers often use an existing 3D model as a reference to\u0000create new ones. This practice has inspired the development of Phidias, a novel\u0000generative model that uses diffusion for reference-augmented 3D generation.\u0000Given an image, our method leverages a retrieved or user-provided 3D reference\u0000model to guide the generation process, thereby enhancing the generation\u0000quality, generalization ability, and controllability. Our model integrates\u0000three key components: 1) meta-ControlNet that dynamically modulates the\u0000conditioning strength, 2) dynamic reference routing that mitigates misalignment\u0000between the input image and 3D reference, and 3) self-reference augmentations\u0000that enable self-supervised training with a progressive curriculum.\u0000Collectively, these designs result in a clear improvement over existing\u0000methods. Phidias establishes a unified framework for 3D generation using text,\u0000image, and 3D conditions with versatile applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RenderWorld: World Model with Self-Supervised 3D Label","authors":"Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, Yuexin Ma","doi":"arxiv-2409.11356","DOIUrl":"https://doi.org/arxiv-2409.11356","url":null,"abstract":"End-to-end autonomous driving with vision-only is not only more\u0000cost-effective compared to LiDAR-vision fusion but also more reliable than\u0000traditional methods. To achieve a economical and robust purely visual\u0000autonomous driving system, we propose RenderWorld, a vision-only end-to-end\u0000autonomous driving framework, which generates 3D occupancy labels using a\u0000self-supervised gaussian-based Img2Occ Module, then encodes the labels by\u0000AM-VAE, and uses world model for forecasting and planning. RenderWorld employs\u0000Gaussian Splatting to represent 3D scenes and render 2D images greatly improves\u0000segmentation accuracy and reduces GPU memory consumption compared with\u0000NeRF-based methods. By applying AM-VAE to encode air and non-air separately,\u0000RenderWorld achieves more fine-grained scene element representation, leading to\u0000state-of-the-art performance in both 4D occupancy forecasting and motion\u0000planning from autoregressive world model.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GS-Net: Generalizable Plug-and-Play 3D Gaussian Splatting Module","authors":"Yichen Zhang, Zihan Wang, Jiali Han, Peilin Li, Jiaxun Zhang, Jianqiang Wang, Lei He, Keqiang Li","doi":"arxiv-2409.11307","DOIUrl":"https://doi.org/arxiv-2409.11307","url":null,"abstract":"3D Gaussian Splatting (3DGS) integrates the strengths of primitive-based\u0000representations and volumetric rendering techniques, enabling real-time,\u0000high-quality rendering. However, 3DGS models typically overfit to single-scene\u0000training and are highly sensitive to the initialization of Gaussian ellipsoids,\u0000heuristically derived from Structure from Motion (SfM) point clouds, which\u0000limits both generalization and practicality. To address these limitations, we\u0000propose GS-Net, a generalizable, plug-and-play 3DGS module that densifies\u0000Gaussian ellipsoids from sparse SfM point clouds, enhancing geometric structure\u0000representation. To the best of our knowledge, GS-Net is the first plug-and-play\u00003DGS module with cross-scene generalization capabilities. Additionally, we\u0000introduce the CARLA-NVS dataset, which incorporates additional camera\u0000viewpoints to thoroughly evaluate reconstruction and rendering quality.\u0000Extensive experiments demonstrate that applying GS-Net to 3DGS yields a PSNR\u0000improvement of 2.08 dB for conventional viewpoints and 1.86 dB for novel\u0000viewpoints, confirming the method's effectiveness and robustness.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping","authors":"Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh","doi":"arxiv-2409.11316","DOIUrl":"https://doi.org/arxiv-2409.11316","url":null,"abstract":"Few-shot Semantic Segmentation addresses the challenge of segmenting objects\u0000in query images with only a handful of annotated examples. However, many\u0000previous state-of-the-art methods either have to discard intricate local\u0000semantic features or suffer from high computational complexity. To address\u0000these challenges, we propose a new Few-shot Semantic Segmentation framework\u0000based on the transformer architecture. Our approach introduces the spatial\u0000transformer decoder and the contextual mask generation module to improve the\u0000relational understanding between support and query images. Moreover, we\u0000introduce a multi-scale decoder to refine the segmentation mask by\u0000incorporating features from different resolutions in a hierarchical manner.\u0000Additionally, our approach integrates global features from intermediate encoder\u0000stages to improve contextual understanding, while maintaining a lightweight\u0000structure to reduce complexity. This balance between performance and efficiency\u0000enables our method to achieve state-of-the-art results on benchmark datasets\u0000such as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings.\u0000Notably, our model with only 1.5 million parameters demonstrates competitive\u0000performance while overcoming limitations of existing methodologies.\u0000https://github.com/amirrezafateh/MSDNet","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uncertainty and Prediction Quality Estimation for Semantic Segmentation via Graph Neural Networks","authors":"Edgar Heinert, Stephan Tilgner, Timo Palm, Matthias Rottmann","doi":"arxiv-2409.11373","DOIUrl":"https://doi.org/arxiv-2409.11373","url":null,"abstract":"When employing deep neural networks (DNNs) for semantic segmentation in\u0000safety-critical applications like automotive perception or medical imaging, it\u0000is important to estimate their performance at runtime, e.g. via uncertainty\u0000estimates or prediction quality estimates. Previous works mostly performed\u0000uncertainty estimation on pixel-level. In a line of research, a\u0000connected-component-wise (segment-wise) perspective was taken, approaching\u0000uncertainty estimation on an object-level by performing so-called meta\u0000classification and regression to estimate uncertainty and prediction quality,\u0000respectively. In those works, each predicted segment is considered individually\u0000to estimate its uncertainty or prediction quality. However, the neighboring\u0000segments may provide additional hints on whether a given predicted segment is\u0000of high quality, which we study in the present work. On the basis of\u0000uncertainty indicating metrics on segment-level, we use graph neural networks\u0000(GNNs) to model the relationship of a given segment's quality as a function of\u0000the given segment's metrics as well as those of its neighboring segments. We\u0000compare different GNN architectures and achieve a notable performance\u0000improvement.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think","authors":"Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe","doi":"arxiv-2409.11355","DOIUrl":"https://doi.org/arxiv-2409.11355","url":null,"abstract":"Recent work showed that large diffusion models can be reused as highly\u0000precise monocular depth estimators by casting depth estimation as an\u0000image-conditional image generation task. While the proposed model achieved\u0000state-of-the-art results, high computational demands due to multi-step\u0000inference limited its use in many scenarios. In this paper, we show that the\u0000perceived inefficiency was caused by a flaw in the inference pipeline that has\u0000so far gone unnoticed. The fixed model performs comparably to the best\u0000previously reported configuration while being more than 200$times$ faster. To\u0000optimize for downstream task performance, we perform end-to-end fine-tuning on\u0000top of the single-step model with task-specific losses and get a deterministic\u0000model that outperforms all other diffusion-based depth and normal estimation\u0000models on common zero-shot benchmarks. We surprisingly find that this\u0000fine-tuning protocol also works directly on Stable Diffusion and achieves\u0000comparable performance to current state-of-the-art diffusion-based depth and\u0000normal estimation models, calling into question some of the conclusions drawn\u0000from prior works.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OmniGen: Unified Image Generation","authors":"Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu","doi":"arxiv-2409.11340","DOIUrl":"https://doi.org/arxiv-2409.11340","url":null,"abstract":"In this work, we introduce OmniGen, a new diffusion model for unified image\u0000generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen\u0000no longer requires additional modules such as ControlNet or IP-Adapter to\u0000process diverse control conditions. OmniGenis characterized by the following\u0000features: 1) Unification: OmniGen not only demonstrates text-to-image\u0000generation capabilities but also inherently supports other downstream tasks,\u0000such as image editing, subject-driven generation, and visual-conditional\u0000generation. Additionally, OmniGen can handle classical computer vision tasks by\u0000transforming them into image generation tasks, such as edge detection and human\u0000pose recognition. 2) Simplicity: The architecture of OmniGen is highly\u0000simplified, eliminating the need for additional text encoders. Moreover, it is\u0000more user-friendly compared to existing diffusion models, enabling complex\u0000tasks to be accomplished through instructions without the need for extra\u0000preprocessing steps (e.g., human pose estimation), thereby significantly\u0000simplifying the workflow of image generation. 3) Knowledge Transfer: Through\u0000learning in a unified format, OmniGen effectively transfers knowledge across\u0000different tasks, manages unseen tasks and domains, and exhibits novel\u0000capabilities. We also explore the model's reasoning capabilities and potential\u0000applications of chain-of-thought mechanism. This work represents the first\u0000attempt at a general-purpose image generation model, and there remain several\u0000unresolved issues. We will open-source the related resources at\u0000https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}