Visual intelligence | Pub Date: 2025-01-01 | Epub Date: 2025-09-24 | DOI: 10.1007/s44267-025-00087-w
Chengcheng Song, Hui Li, Tianyang Xu, Xiao-Jun Wu, Josef Kittler
{"title":"RefineFuse: an end-to-end network for multi-scale refinement fusion of multi-modality images.","authors":"Chengcheng Song, Hui Li, Tianyang Xu, Xiao-Jun Wu, Josef Kittler","doi":"10.1007/s44267-025-00087-w","DOIUrl":"10.1007/s44267-025-00087-w","url":null,"abstract":"<p><p>The goal of multi-modality image fusion is to integrate complementary information from different modal images to create high-quality, informative fused images. In recent years, significant advances have been made in deep learning for image fusion tasks. Nevertheless, current fusion techniques are still unable to capture more intricate details from the source images. For instance, many existing methods used for tasks such as infrared and visible image fusion are susceptible to adverse lighting conditions. To enhance the ability of fusion networks to preserve detailed information in complex scenes, we propose RefineFuse, a multi-scale interaction network for multi-modal image fusion tasks. To balance and exploit local detailed features and global semantic information during the fusion process, we utilize specific modules to model cross-modal feature coupling in both the pixel and semantic domains. Specifically, a dual attention-based feature interaction module is introduced to integrate detailed information from both modalities for extracting shallow features. To obtain deep semantic information, we adopt a global attention mechanism for cross-modal feature interaction. Additionally, to bridge the gap between deep semantic information and shallow detailed information, we gradually incorporate deep semantic information to shallow detailed information via specific feature interaction modules. Extensive comparative and generalization experiments demonstrate that RefineFuse achieves high-quality fusions of infrared, visible, and medical images, while also facilitating advanced visual tasks, such as object detection.</p>","PeriodicalId":520376,"journal":{"name":"Visual intelligence","volume":"3 1","pages":"16"},"PeriodicalIF":0.0,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12460437/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145188214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual intelligence | Pub Date: 2025-01-01 | Epub Date: 2025-07-15 | DOI: 10.1007/s44267-025-00085-y
Qianggang Ding, Zhichao Shen, Weiqiang Zhu, Bang Liu
{"title":"DASFormer: self-supervised pretraining for earthquake monitoring.","authors":"Qianggang Ding, Zhichao Shen, Weiqiang Zhu, Bang Liu","doi":"10.1007/s44267-025-00085-y","DOIUrl":"10.1007/s44267-025-00085-y","url":null,"abstract":"<p><p>Earthquake monitoring is a fundamental task to unravel the underlying physics of earthquakes and mitigate associated hazards for public safety. Distributed acoustic sensing, or DAS, which transforms pre-existing telecommunication cables into ultra-dense seismic networks, offers a cost-effective and scalable solution for next-generation earthquake monitoring. However, current approaches for earthquake monitoring like PhaseNet and PhaseNet-2 primarily rely on supervised learning, while manually labeled DAS data is quite limited and it is difficult to obtain more annotated datasets. In this paper, we present DASFormer, a novel self-supervised pretraining technique on DAS data with a coarse-to-fine framework that models spatial-temporal signal correlation. We treat earthquake monitoring as an anomaly detection task and demonstrate DASFormer can be directly utilized as a seismic phase detector. Experimental results demonstrate that DASFormer is effective in terms of several evaluation metrics and outperforms state-of-the-art time-series forecasting, anomaly detection, and foundation models on the unsupervised seismic detection task. We also demonstrate the potential of fine-tuning DASFormer to downstream tasks through case studies.</p>","PeriodicalId":520376,"journal":{"name":"Visual intelligence","volume":"3 1","pages":"14"},"PeriodicalIF":0.0,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12259731/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144651910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual intelligence | Pub Date: 2024-01-01 | Epub Date: 2024-12-30 | DOI: 10.1007/s44267-024-00070-x
Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno
{"title":"An empirical study of LLaMA3 quantization: from LLMs to MLLMs.","authors":"Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno","doi":"10.1007/s44267-024-00070-x","DOIUrl":"https://doi.org/10.1007/s44267-024-00070-x","url":null,"abstract":"<p><p>The LLaMA family, a collection of foundation language models ranging from 7B to 65B parameters, has become one of the most powerful open-source large language models (LLMs) and the popular LLM backbone of multi-modal large language models (MLLMs), widely used in computer vision and natural language understanding tasks. In particular, LLaMA3 models have recently been released and have achieved impressive performance in various domains with super-large scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-constrained scenarios, we explore LLaMA3's capabilities when quantized to low bit-width. This exploration can potentially provide new insights and challenges for the low-bit quantization of LLaMA3 and other future LLMs, especially in addressing performance degradation issues that suffer in LLM compression. Specifically, we comprehensively evaluate the 10 existing post-training quantization and LoRA fine-tuning (LoRA-FT) methods of LLaMA3 on 1-8 bits and various datasets to reveal the low-bit quantization performance of LLaMA3. To uncover the capabilities of low-bit quantized MLLM, we assessed the performance of the LLaMA3-based LLaVA-Next-8B model under 2-4 ultra-low bits with post-training quantization methods. Our experimental results indicate that LLaMA3 still suffers from non-negligible degradation in linguistic and visual contexts, particularly under ultra-low bit widths. This highlights the significant performance gap at low bit-width that needs to be addressed in future developments. We expect that this empirical study will prove valuable in advancing future models, driving LLMs and MLLMs to achieve higher accuracy at lower bit to enhance practicality.</p>","PeriodicalId":520376,"journal":{"name":"Visual intelligence","volume":"2 1","pages":"36"},"PeriodicalIF":0.0,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11728678/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142981122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}