IEEE Transactions on Multimedia: Latest Articles

Cross-Projection Distilling Knowledge for Omnidirectional Image Quality Assessment
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-28 DOI: 10.1109/TMM.2025.3590920
Huixin Hu;Feng Shao;Hangwei Chen;Xiongli Chai;Qiuping Jiang
{"title":"Cross-Projection Distilling Knowledge for Omnidirectional Image Quality Assessment","authors":"Huixin Hu;Feng Shao;Hangwei Chen;Xiongli Chai;Qiuping Jiang","doi":"10.1109/TMM.2025.3590920","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590920","url":null,"abstract":"Nowadays, virtual reality technology is advancing rapidly and becoming increasingly matured. Omnidirectional images have integrated into the daily lives of many individuals. However, these images are susceptible to irreversible distortion during the encoding and transmission processes. Given the unique characteristics of deformation and distortion in omnidirectional images, the development of a quality assessment method is crucial. To ensure that our network not only delivers efficient and stable performance but also maintains a minimal parameter count, we have integrated the concept of knowledge distillation into our network. This involves utilizing a full-reference (FR) teacher network to guide the training of a no-reference (NR) student network by cross-projection distilling knowledge. To specifically implement this method, a Dual Projection Format Fusion (DPFF) module is specifically designed to complement and integrate the mutual fusion of the two projection formats of omnidirectional images. In the design of our knowledge distillation process and loss function, we have introduced a review mechanism to enhance the performance and efficiency of response-based knowledge, as well as utilized intermediate fusion features to improve the effectiveness of feature-based knowledge. These components are combined to formulate the final loss function. Experimental results validate the superiority of our proposed model over existing FR and NR methods when evaluated on four omnidirectional image databases. This highlights the effectiveness of our proposed model in elevating the quality assessment of omnidirectional images.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6752-6765"},"PeriodicalIF":9.7,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
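The combined objective described above, a supervised regression term plus response-based and feature-based distillation from the FR teacher, can be illustrated with a minimal PyTorch sketch. The loss weights, tensor shapes, and the omission of the paper's review mechanism are assumptions made purely for illustration, not the authors' implementation.

```python
import torch.nn.functional as F

def distillation_loss(student_score, teacher_score,
                      student_feat, teacher_feat,
                      mos, alpha=1.0, beta=0.1):
    """Hypothetical combination of task, response, and feature terms.

    student_score / teacher_score: (B, 1) predicted quality scores.
    student_feat / teacher_feat:   (B, C, H, W) intermediate fusion features.
    mos: (B, 1) ground-truth mean opinion scores.
    """
    # Supervised regression loss for the NR student.
    task = F.l1_loss(student_score, mos)
    # Response-based knowledge: match the FR teacher's predicted score.
    response = F.mse_loss(student_score, teacher_score.detach())
    # Feature-based knowledge: match intermediate fusion features.
    feature = F.mse_loss(student_feat, teacher_feat.detach())
    return task + alpha * response + beta * feature
```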
Multi-Grained Vision-and-Language Model for Medical Image and Text Alignment
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-23 DOI: 10.1109/TMM.2025.3590930
Huimin Yan;Xian Yang;Liang Bai;Jiamin Li;Jiye Liang
{"title":"Multi-Grained Vision-and-Language Model for Medical Image and Text Alignment","authors":"Huimin Yan;Xian Yang;Liang Bai;Jiamin Li;Jiye Liang","doi":"10.1109/TMM.2025.3590930","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590930","url":null,"abstract":"The increasing interest in learning from paired medical images and textual reports highlights the need for methods that can achieve multi-grained alignment between these two modalities. However, most existing approaches overlook fine-grained semantic alignment, which can constrain the quality of the generated representations. To tackle this problem, we propose the Multi-Grained Vision-and-Language Alignment (MGVLA) model, which effectively leverages multi-grained correspondences between medical images and texts at different levels, including disease, instance, and token levels. For disease-level alignment, our approach adopts the concept of contrastive learning and uses medical terminologies detected from textual reports as soft labels to guide the alignment process. At the instance level, we propose a strategy for sampling hard negatives, where images and texts with the same disease type but differing in details such as disease locations and severity are considered as hard negatives. This strategy helps our approach to better distinguish between positive and negative image-text pairs, ultimately enhancing the quality of our learned representations. For token-level alignment, we employ a masking and recovery technique to achieve fine-grained semantic alignment between patches and sub-words. This approach effectively aligns the different levels of granularity between the image and language modalities. To assess the efficacy of our MGVLA model, we conduct comprehensive experiments on the image-text retrieval and phrase grounding tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6780-6792"},"PeriodicalIF":9.7,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
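The disease-level alignment step, contrastive learning guided by soft labels derived from detected medical terminology, might look roughly like the sketch below. The multi-hot label encoding, temperature value, and target normalisation are assumptions; the paper's instance- and token-level objectives are not shown.

```python
import torch
import torch.nn.functional as F

def soft_label_contrastive(img_emb, txt_emb, disease_labels, tau=0.07):
    """Disease-level alignment with soft targets from label overlap (a sketch).

    img_emb, txt_emb: (B, D) image / report embeddings.
    disease_labels:   (B, K) multi-hot medical-terminology labels.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                       # (B, B) similarities
    # Soft targets: pairs sharing more disease terms get higher weight.
    overlap = disease_labels.float() @ disease_labels.float().t()
    targets = overlap / overlap.sum(dim=1, keepdim=True).clamp(min=1e-8)
    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()
```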
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-23 DOI: 10.1109/TMM.2025.3590912
Sida Tian;Can Zhang;Wei Yuan;Wei Tan;Wenjie Zhu
{"title":"XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework","authors":"Sida Tian;Can Zhang;Wei Yuan;Wei Tan;Wenjie Zhu","doi":"10.1109/TMM.2025.3590912","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590912","url":null,"abstract":"In recent years, remarkable advancements in artificial intelligence-generated content (AIGC) have been achieved in the fields of image synthesis and text generation, generating content comparable to that produced by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quality outputs. This paper presents a generalized symbolic music generation framework, XMusic, which supports flexible prompts (i.e., images, videos, texts, tags, and humming) to generate emotionally controllable and high-quality symbolic music. XMusic consists of two core components, XProjector and XComposer. XProjector parses the prompts of various modalities into symbolic music elements (i.e., emotions, genres, rhythms and notes) within the projection space to generate matching music. XComposer contains a Generator and a Selector. The Generator generates emotionally controllable and melodious music based on our innovative symbolic music representation, whereas the Selector identifies high-quality symbolic music by constructing a multi-task learning scheme involving quality assessment, emotion recognition, and genre recognition tasks. In addition, we build XMIDI, a large-scale symbolic music dataset that contains 108,023 MIDI files annotated with precise emotion and genre labels. Objective and subjective evaluations show that XMusic significantly outperforms the current state-of-the-art methods with impressive music quality. Our XMusic has been awarded as one of the nine <italic>Highlights of Collectibles at WAIC 2023</i>. The project homepage of XMusic is: <uri>https://xmusic-project.github.io</uri>.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6857-6871"},"PeriodicalIF":9.7,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
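The Selector's multi-task design, scoring generated candidates for quality, emotion, and genre and then keeping the best match, could be organised roughly as below. The backbone, head sizes, and the scoring rule used to pick a candidate are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Selector(nn.Module):
    """Hypothetical multi-task scorer over symbolic-music candidates."""
    def __init__(self, dim=512, n_emotions=8, n_genres=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.quality_head = nn.Linear(dim, 1)            # quality assessment
        self.emotion_head = nn.Linear(dim, n_emotions)   # emotion recognition
        self.genre_head = nn.Linear(dim, n_genres)       # genre recognition

    def forward(self, candidate_feats):                  # (N, dim) encodings
        h = self.backbone(candidate_feats)
        return (self.quality_head(h).squeeze(-1),
                self.emotion_head(h), self.genre_head(h))

def pick_best(selector, feats, target_emotion, target_genre):
    quality, emo_logits, genre_logits = selector(feats)
    # Prefer candidates that are high quality and match the requested tags.
    score = (quality
             + emo_logits.log_softmax(-1)[:, target_emotion]
             + genre_logits.log_softmax(-1)[:, target_genre])
    return int(score.argmax())
```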
AFAN: An Attention-Driven Forgery Adversarial Network for Blind Image Inpainting
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-22 DOI: 10.1109/TMM.2025.3590914
Jiahao Wang;Gang Pan;Di Sun;Jinyuan Li;Jiawan Zhang
{"title":"AFAN: An Attention-Driven Forgery Adversarial Network for Blind Image Inpainting","authors":"Jiahao Wang;Gang Pan;Di Sun;Jinyuan Li;Jiawan Zhang","doi":"10.1109/TMM.2025.3590914","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590914","url":null,"abstract":"Blind image inpainting is a challenging task aimed at reconstructing corrupted regions without relying on mask information. Due to the lack of mask priors, previous methods usually integrate a mask prediction network in the initial phase, followed by an inpainting backbone. However, this multi-stage generation process may result in feature misalignment. While recent end-to-end generative methods bypass the mask prediction step, they typically struggle with weak perception of contaminated regions and introduce structural distortions. This study presents a novel mask region perception strategy for blind image inpainting by combining adversarial training with forgery detection. To implement this strategy, we propose an attention-driven forgery adversarial network (AFAN), which leverages adaptive contextual attention (ACA) blocks for effective feature modulation. Specifically, within the generator, ACA employs self-attention to enhance content reconstruction by utilizing the rich contextual information of adjacent tokens. In the discriminator, ACA utilizes cross-attention with noise priors to guide adversarial learning for forgery detection. Moreover, we design a high-frequency omni-dimensional dynamic convolution (HODC) based on edge feature enhancement to improve detail representation. Extensive evaluations across multiple datasets demonstrate that the proposed AFAN model outperforms existing generative methods in blind image inpainting, particularly in terms of quality and texture fidelity.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6845-6856"},"PeriodicalIF":9.7,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
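A rough sketch of an adaptive contextual attention block that acts as self-attention over image tokens in the generator and as cross-attention against noise-prior tokens in the discriminator. The wiring, normalisation, and feed-forward details are assumptions rather than the paper's exact ACA design.

```python
import torch
import torch.nn as nn

class ACABlock(nn.Module):
    """Hypothetical adaptive contextual attention block (a sketch)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens, context=None):
        # Generator use: context is None -> self-attention over image tokens.
        # Discriminator use: context holds noise-prior tokens -> cross-attention.
        kv = tokens if context is None else context
        attended, _ = self.attn(tokens, kv, kv)
        tokens = self.norm(tokens + attended)
        return tokens + self.ffn(tokens)
```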
ROSA: A Robust Self-Adaptive Model for Multimodal Emotion Recognition With Uncertain Missing Modalities
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-22 DOI: 10.1109/TMM.2025.3590929
Ziming Li;Yaxin Liu;Chuanpeng Yang;Yan Zhou;Songlin Hu
{"title":"ROSA: A Robust Self-Adaptive Model for Multimodal Emotion Recognition With Uncertain Missing Modalities","authors":"Ziming Li;Yaxin Liu;Chuanpeng Yang;Yan Zhou;Songlin Hu","doi":"10.1109/TMM.2025.3590929","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590929","url":null,"abstract":"The rapid development of online media has heightened the importance of multimodal emotion recognition (MER) in video analysis. However, practical applications often encounter challenges due to missing modalities caused by various interferences. It is difficult to predict the specific missing situations, such as the number and types of missing modalities. Current approaches to modality missing typically apply a uniform method to address various missing cases, which are insufficiently adaptive to dynamic conditions. For example, translation-based methods can efficiently complete missing text from audio, but generating audio or video features that retain the original emotional information from other modalities is challenging and may introduce additional noise. In this paper, we introduce ROSA, a novel <bold>ro</b>bust <bold>s</b>elf-<bold>a</b>daptive model designed to address various missing cases with tailored approaches, leveraging available modalities effectively and reducing the introduction of additional noise. Specifically, the A-T Completion module based on the encoder-decoder architecture enables ROSA to generate missing raw text from audio rather than mere embedding representations, capturing more nuanced modal features. Additionally, we design the T-V Fusion module based on a vision-language large model for deep extraction and fusion of textual and visual features. Comprehensive experiments conducted on three widely used public datasets demonstrate the superiority and effectiveness of our model. ROSA outperforms other models in both fixed missing rate and fixed missing modality cases. The ablation studies further highlights the contribution of each designed module.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6766-6779"},"PeriodicalIF":9.7,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
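The self-adaptive idea, handling each missing-modality case with a tailored path (for example, regenerating raw text from audio via A-T Completion before text-visual fusion), can be summarised in a small dispatch sketch. The function names and interfaces here are hypothetical placeholders, not the paper's API.

```python
from typing import Callable, Optional

def rosa_forward(audio: Optional[object],
                 text: Optional[str],
                 video: Optional[object],
                 at_completion: Callable,   # audio -> raw text
                 tv_fusion: Callable,       # (text, video) -> fused features
                 classify: Callable):       # (fused, audio) -> emotion logits
    """Hypothetical dispatch: handle each missing-modality case differently."""
    if text is None and audio is not None:
        # A-T Completion: regenerate raw text from audio instead of
        # synthesising audio/video features, which tends to add noise.
        text = at_completion(audio)
    fused = tv_fusion(text, video)          # deep text-visual fusion (T-V Fusion)
    return classify(fused, audio)
```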
Image Super-Resolution With Taylor Expansion Approximation and Large Field Reception
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-21 DOI: 10.1109/TMM.2025.3590917
Jiancong Feng;Yuan-Gen Wang;Mingjie Li;Fengchuang Xing
{"title":"Image Super-Resolution With Taylor Expansion Approximation and Large Field Reception","authors":"Jiancong Feng;Yuan-Gen Wang;Mingjie Li;Fengchuang Xing","doi":"10.1109/TMM.2025.3590917","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590917","url":null,"abstract":"Self-similarity techniques are booming in no-reference super-resolution (SR) due to accurate estimation of the degradation types involved in low-resolution images. However, high-dimensional matrix multiplication within self-similarity computation prohibitively consumes massive computational costs. We find that the high-dimensional attention map is derived from the matrix multiplication between query and key, followed by a softmax function. This softmax makes the matrix multiplication inseparable, posing a great challenge in simplifying computational complexity. To address this issue, we first propose a second-order Taylor expansion approximation (STEA) to separate the matrix multiplication of query and key, resulting in the complexity reduction from <inline-formula><tex-math>$mathcal {O}(N^{2})$</tex-math></inline-formula> to <inline-formula><tex-math>$mathcal {O}(N)$</tex-math></inline-formula>. Then, we design a multi-scale large field reception (MLFR) to compensate for the performance degradation caused by STEA. Finally, we apply these two core designs to laboratory and real-world scenarios by constructing LabNet and RealNet, respectively. Extensive experimental results tested on five synthetic datasets demonstrate that our LabNet sets a new benchmark in qualitative and quantitative evaluations. Tested on the real-world dataset, our RealNet achieves superior visual quality over existing methods. Ablation studies further verify the contributions of STEA and MLFR towards both LabNet and RealNet frameworks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6819-6830"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
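The complexity claim rests on replacing the softmax's exponential with its second-order Taylor expansion, exp(q·k) ≈ 1 + q·k + (q·k)²/2, which factorises through the feature map φ(x) = [1, x, vec(xxᵀ)/√2] so that keys and values can be aggregated once before being combined with the queries. The NumPy sketch below shows this generic linearisation in O(N·d²); it omits scaling and the paper's MLFR compensation, so treat it as an approximation of the idea rather than the authors' STEA.

```python
import numpy as np

def taylor_feature_map(x):
    """phi(x) = [1, x, vec(x x^T)/sqrt(2)], so that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2  (2nd-order Taylor of exp)."""
    n, d = x.shape
    ones = np.ones((n, 1))
    outer = np.einsum('ni,nj->nij', x, x).reshape(n, d * d) / np.sqrt(2.0)
    return np.concatenate([ones, x, outer], axis=1)

def taylor_linear_attention(q, k, v):
    """Approximate softmax attention in O(N d^2) instead of O(N^2 d)."""
    fq, fk = taylor_feature_map(q), taylor_feature_map(k)   # (N, 1+d+d^2)
    kv = fk.T @ v                      # aggregate keys/values once
    z = fq @ fk.sum(axis=0)            # per-query normalisation term
    return (fq @ kv) / z[:, None]
```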
Towards Student Actions in Classroom Scenes: New Dataset and Baseline
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-21 DOI: 10.1109/TMM.2025.3590899
Zhuolin Tan;Chenqiang Gao;Anyong Qin;Ruixin Chen;Tiecheng Song;Feng Yang;Deyu Meng
{"title":"Towards Student Actions in Classroom Scenes: New Dataset and Baseline","authors":"Zhuolin Tan;Chenqiang Gao;Anyong Qin;Ruixin Chen;Tiecheng Song;Feng Yang;Deyu Meng","doi":"10.1109/TMM.2025.3590899","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590899","url":null,"abstract":"Analyzing student actions is an important and challenging task in educational research. Existing efforts have been hampered by the lack of accessible datasets to capture the nuanced action dynamics in classrooms. In this paper, we present a new multi-label <italic>Student Action Video</i> (SAV) dataset, specifically designed for action detection in classroom settings. The SAV dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, annotated with 15 distinct student actions. Compared to existing action detection datasets, the SAV dataset stands out by providing a wide range of real classroom scenarios, high-quality video data, and unique challenges, including subtle movement differences, dense object engagement, significant scale differences, varied shooting angles, and visual occlusion. These complexities introduce new opportunities and challenges to advance action detection methods. To benchmark this, we propose a novel baseline method based on a visual transformer, designed to enhance attention to key local details within small and dense object regions. Our method demonstrates excellent performance with a mean Average Precision (mAP) of 67.9% and 27.4% on the SAV and AVA datasets, respectively. This paper not only provides the dataset but also calls for further research into AI-driven educational tools that may transform teaching methodologies and learning outcomes.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6831-6844"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Single-Domain Generalized Object Detection With Frequency Whitening and Contrastive Learning
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-21 DOI: 10.1109/TMM.2025.3590915
Xiaolong Guo;Chengxu Liu;Xueming Qian;Zhixiao Wang;Xubin Feng;Yao Xue
{"title":"Single-Domain Generalized Object Detection With Frequency Whitening and Contrastive Learning","authors":"Xiaolong Guo;Chengxu Liu;Xueming Qian;Zhixiao Wang;Xubin Feng;Yao Xue","doi":"10.1109/TMM.2025.3590915","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590915","url":null,"abstract":"Single-Domain Generalization Object Detection (Single-DGOD) refers to training a model with only one source domain, enabling the model to generalize to any unseen domain. For instance, a detector trained on a sunny daytime dataset should also perform well in scenarios such as rainy nighttime. The main challenge is to improve the detector’s ability to learn the domain-invariant representation (DIR) while removing domain-specific information. Recent progress in Single-DGOD has demonstrated the efficacy of removing domain-specific information by adjusting feature distributions. Nonetheless, simply adjusting the global feature distribution in Single-DGOD task is insufficient to learn the potential relationship from sunny to adverse weather, as these ignore the significant domain gaps between instances across different weathers. In this paper, we propose a novel object detection method for more robust single-domain generalization. In particular, it mainly consists of a frequency-aware selective whitening module (FSW) for removing redundant domain-specific information and a contrastive feature alignment module (CFA) for enhancing domain-invariant information among instances. Specially, FSW extracts the magnitude spectrum of the feature and uses a group whitening loss to selectively eliminate redundant domain-specific information in the magnitude. To further eliminate domain differences among instances, we apply the style transfer method for data augmentation and use the augmented data in the CFA module. CFA formulates both the original and the augmentd RoI features into a series of groups with different categories, and utilizes contrastive learning across them to facilitate the learning of DIR in various categories. Experiments show that our method achieves favorable performance on existing standard benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6805-6818"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
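A sketch of the frequency-side step: take the magnitude spectrum of a feature map and penalise covariance among channel pairs flagged as domain-specific. How those pairs are selected and grouped is the paper's contribution and is left as an input mask here; shapes and normalisation are assumptions.

```python
import torch

def magnitude_spectrum(feat):
    """Amplitude of the 2-D FFT of a feature map (B, C, H, W)."""
    return torch.fft.fft2(feat, norm='ortho').abs()

def selective_whitening_loss(feat, pair_mask):
    """Suppress correlations among selected channel pairs (a sketch).

    feat:      (B, C, H, W) backbone features.
    pair_mask: (C, C) 0/1 selection of channel pairs to whiten
               (the selective/group criterion is not reproduced here).
    """
    amp = magnitude_spectrum(feat)
    b, c, h, w = amp.shape
    x = amp.reshape(b, c, h * w)
    x = x - x.mean(dim=-1, keepdim=True)
    cov = x @ x.transpose(1, 2) / (h * w - 1)          # (B, C, C)
    off_diag = cov * (1 - torch.eye(c, device=feat.device))
    return (off_diag * pair_mask).abs().mean()
```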
Boosting Modal-Specific Representations for Sentiment Analysis With Incomplete Modalities
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-21 DOI: 10.1109/TMM.2025.3590909
Xin Jiang;Lihuo He;Fei Gao;Kaifan Zhang;Jie Li;Xinbo Gao
{"title":"Boosting Modal-Specific Representations for Sentiment Analysis With Incomplete Modalities","authors":"Xin Jiang;Lihuo He;Fei Gao;Kaifan Zhang;Jie Li;Xinbo Gao","doi":"10.1109/TMM.2025.3590909","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590909","url":null,"abstract":"Multimodal sentiment analysis aims at exploiting complementary information from multiple modalities or data sources to enhance the understanding and interpretation of sentiment. While existing multi-modal fusion techniques offer significant improvements in sentiment analysis, real-world scenarios often involve missing modalities, introducing complexity due to uncertainty of which modalities may be absent. To tackle the challenge of incomplete modality-specific feature extraction caused by missing modalities, this paper proposes a Cosine Margin-Aware Network (CMANet) which centers on the Cosine Margin-Aware Distillation (CMAD) module. The core module measures distance between samples and the classification boundary, enabling CMANet to focus on samples near the boundary. So, it effectively captures the unique features of different modal combinations. To address the issue of modality imbalance during modality-specific feature extraction, this paper proposes a Weak Modality Regularization (WMR) strategy, which aligns the feature distributions between strong and weak modalities at the dataset-level, while also enhancing the prediction loss of samples at the sample-level. This dual mechanism improves the recognition robustness of weak modality combination. Extensive experiments demonstrate that the proposed method outperforms the previous best model, MMIN, with a 3.82% improvement in unweighted accuracy. These results underscore the robustness of the approach under conditions of uncertain and missing modalities.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6793-6804"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
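To illustrate the "distance to the classification boundary" idea behind CMAD, the sketch below uses a standard additive cosine-margin (CosFace-style) logit formulation, which makes samples near the boundary incur larger loss. It is a stand-in for the general mechanism, not the paper's CMAD module or its distillation weighting; the margin and scale values are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_margin_logits(features, class_weights, labels, m=0.35, s=30.0):
    """Cosine-similarity logits with an additive margin on the target class.

    features:      (B, D) sample embeddings.
    class_weights: (K, D) class prototype vectors.
    labels:        (B,)   integer class labels.
    """
    cos = F.normalize(features) @ F.normalize(class_weights).t()   # (B, K)
    margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), m)
    return s * (cos - margin)

# Usage: loss = F.cross_entropy(cosine_margin_logits(feat, W, y), y)
```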
Towards Efficient Partially Relevant Video Retrieval With Active Moment Discovering
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-21 DOI: 10.1109/TMM.2025.3590937
Peipei Song;Long Zhang;Long Lan;Weidong Chen;Dan Guo;Xun Yang;Meng Wang
{"title":"Towards Efficient Partially Relevant Video Retrieval With Active Moment Discovering","authors":"Peipei Song;Long Zhang;Long Lan;Weidong Chen;Dan Guo;Xun Yang;Meng Wang","doi":"10.1109/TMM.2025.3590937","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590937","url":null,"abstract":"Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The pursuit here is of both effective and efficient solutions to capture the partial correspondence between text queries and untrimmed videos. Existing PRVR methods, which typically focus on modeling multi-scale clip representations, however, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments of distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (<italic>i.e</i>., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (#parameters) while 6.0 points higher (SumR) than the up-to-date method GMMFormer on TVR.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6740-6751"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
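A sketch of the moment-discovery idea: learnable span anchors (center, width) become soft temporal masks that pool frame features into a few moment representations, on which masked attention and the diversity/relevance losses would then operate. The Gaussian mask parameterisation and module shapes are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class MomentPooling(nn.Module):
    """Hypothetical span-anchor pooling of frame features into moments."""
    def __init__(self, n_moments=4, dim=512):
        super().__init__()
        # Each anchor = (center, width), both mapped to (0, 1) by a sigmoid.
        self.anchors = nn.Parameter(torch.rand(n_moments, 2))
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats):                   # (B, T, dim)
        b, t, _ = frame_feats.shape
        pos = torch.linspace(0, 1, t, device=frame_feats.device)      # (T,)
        center, width = torch.sigmoid(self.anchors).unbind(dim=-1)    # (M,), (M,)
        # Gaussian-shaped soft mask per moment: emphasise frames inside the span.
        mask = torch.exp(-((pos[None, :] - center[:, None]) ** 2)
                         / (2 * (width[:, None] / 2) ** 2 + 1e-6))    # (M, T)
        mask = mask / mask.sum(dim=-1, keepdim=True)
        moments = torch.einsum('mt,btd->bmd', mask, frame_feats)      # (B, M, dim)
        return self.proj(moments)
```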