SportSummarizer: A unified multimodal fusion transformer for context-aware sports video summarization
D. Minola Davids, A. Arul Edwin Raj, C. Seldev Christopher
Neurocomputing, Volume 652, Article 131011 (2025). DOI: 10.1016/j.neucom.2025.131011
Abstract
Automated sports video summarization faces critical challenges due to the complexity of dynamic gameplay, event variability, and the intricate rules governing sports like cricket and soccer. Existing methods often struggle to capture key moments accurately, resulting in false positives, redundant content such as replays, ineffective multimodal data integration, and difficulties in spatio-temporal modeling and semantic event understanding. To overcome these limitations, a novel Unified Multimodal Fusion Transformer is proposed for the summarization of cricket and soccer videos. This approach employs advanced feature encoding across multiple modalities: ViViT for video, OpenL3 for audio, and DistilBERT for text, ensuring robust multimodal representations. A multimodal fusion transformer with contextual cross-quadrimodal attention is introduced to address weak multimodal integration, enabling the model to capture complex interactions across visual, audio, and textual data for precise event detection. Further, a Hierarchical Temporal Convolutional Network (Hierarchical TCN) module integrates hierarchical temporal modeling with metadata-enhanced positional encoding to model both short- and long-term game sequences effectively. Additionally, replay and redundancy elimination mechanisms remove repetitive content, generating concise, high-quality video summaries that reflect the game's critical moments. The proposed method achieves state-of-the-art results, with the highest precision (99.2%) and recall (98.9%) and a low error rate (4%). It also demonstrates superior ROC-AUC performance (0.88) and maintains peak accuracy (89.5%), with strong performance in mIoU (0.82) and highlight diversity (0.93), highlighting its robustness across various event detection metrics.
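The contextual cross-quadrimodal attention described above can be pictured as each modality stream attending over the tokens of the remaining streams. The sketch below is only an illustration of that idea in PyTorch: the module name CrossQuadrimodalFusion, the choice of a fourth metadata stream, the embedding width, and the per-stream residual layout are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of cross-modal fusion, assuming each modality (video, audio,
# text, metadata) has already been encoded into a sequence of d_model-dimensional
# tokens (e.g., by ViViT, OpenL3, DistilBERT). The paper's exact attention
# pattern is not reproduced here; each stream simply queries the concatenation
# of all other streams.
import torch
import torch.nn as nn


class CrossQuadrimodalFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # One cross-attention layer per modality stream (hypothetical layout).
        self.cross_attn = nn.ModuleDict({
            m: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for m in ("video", "audio", "text", "meta")
        })
        self.norm = nn.LayerNorm(d_model)

    def forward(self, streams: dict) -> torch.Tensor:
        fused = []
        for name, tokens in streams.items():
            # Context = tokens from all *other* modalities, concatenated in time.
            context = torch.cat(
                [v for k, v in streams.items() if k != name], dim=1)
            attended, _ = self.cross_attn[name](tokens, context, context)
            fused.append(self.norm(tokens + attended))  # residual per stream
        # Concatenate the enriched streams for downstream event detection.
        return torch.cat(fused, dim=1)


if __name__ == "__main__":
    B, d = 2, 256
    streams = {
        "video": torch.randn(B, 64, d),  # e.g., ViViT clip tokens
        "audio": torch.randn(B, 48, d),  # e.g., OpenL3 frame embeddings
        "text": torch.randn(B, 32, d),   # e.g., DistilBERT commentary tokens
        "meta": torch.randn(B, 8, d),    # assumed fourth (metadata) stream
    }
    out = CrossQuadrimodalFusion(d)(streams)
    print(out.shape)  # torch.Size([2, 152, 256])
```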
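Likewise, the Hierarchical TCN with metadata-enhanced positional encoding can be approximated by stacked dilated 1-D convolutions whose receptive field widens level by level, with a learned positional embedding summed with a projected metadata vector before the convolutions. The class name HierarchicalTCN, the dilation schedule (1, 2, 4, 8), and the additive metadata injection are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of hierarchical temporal convolution over the fused sequence,
# assuming dilated 1-D convolutions capture short- and long-term game context.
import torch
import torch.nn as nn


class HierarchicalTCN(nn.Module):
    def __init__(self, d_model: int = 256, max_len: int = 512, meta_dim: int = 16):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned positions
        self.meta_proj = nn.Linear(meta_dim, d_model)   # metadata -> d_model
        # Dilations 1, 2, 4, 8: each level widens the temporal receptive field.
        self.levels = nn.ModuleList([
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4, 8)
        ])
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model) fused tokens; meta: (B, meta_dim) per-match metadata.
        pos = self.pos_emb(torch.arange(x.size(1), device=x.device))
        x = x + pos + self.meta_proj(meta).unsqueeze(1)  # broadcast over time
        h = x.transpose(1, 2)                            # Conv1d expects (B, C, T)
        for conv in self.levels:
            h = h + self.act(conv(h))                    # residual dilated block
        return h.transpose(1, 2)                         # back to (B, T, d_model)
```

The design choice of residual dilated blocks keeps the sequence length unchanged at every level, so short events (single-level context) and long build-ups (deep-level context) are modeled by the same stack.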
Journal Introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice, and applications are the essential topics covered.