SportSummarizer: A unified multimodal fusion transformer for context-aware sports video summarization
D. Minola Davids, A. Arul Edwin Raj, C. Seldev Christopher
Neurocomputing, Volume 652, Article 131011 (2025). DOI: 10.1016/j.neucom.2025.131011
Abstract
Automated sports video summarization faces critical challenges due to the complexity of dynamic gameplay, event variability, and the intricate rules governing sports like cricket and soccer. Existing methods often struggle to capture key moments accurately, resulting in false positives, redundant content such as replays, ineffective multimodal data integration, and difficulties in spatio-temporal modeling and semantic event understanding. To overcome these limitations, a novel Unified Multimodal Fusion Transformer is proposed for the summarization of cricket and soccer videos. This approach employs advanced feature encoding across multiple modalities: ViViT for video, OpenL3 for audio, and DistilBERT for text, ensuring robust multimodal representations. A multimodal fusion transformer with contextual cross-quadrimodal attention is introduced to address weak multimodal integration, enabling the model to capture complex interactions across visual, audio, and textual data for precise event detection. Further, a Hierarchical Temporal Convolutional Network (Hierarchical TCN) module integrates hierarchical temporal modeling with metadata-enhanced positional encoding to model both short- and long-term game sequences effectively. Additionally, replay and redundancy elimination mechanisms remove repetitive content, generating concise, high-quality video summaries that reflect the game's critical moments. The proposed method achieves state-of-the-art results, with the highest precision (99.2%) and recall (98.9%) and a low error rate (4%). It also demonstrates superior ROC-AUC performance (0.88) and maintains peak accuracy (89.5%), with strong performance in mIoU (0.82) and highlight diversity (0.93), highlighting its robustness across various event detection metrics.
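The contextual cross-quadrimodal attention described above can be pictured as each modality stream attending over the tokens of the remaining streams. The sketch below is only an illustration of that idea in PyTorch: the module name CrossQuadrimodalFusion, the choice of a fourth metadata stream, the embedding width, and the per-stream residual layout are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of cross-modal fusion, assuming each modality (video, audio,
# text, metadata) has already been encoded into a sequence of d_model-dimensional
# tokens (e.g., by ViViT, OpenL3, DistilBERT). The paper's exact attention
# pattern is not reproduced here; each stream simply queries the concatenation
# of all other streams.
import torch
import torch.nn as nn


class CrossQuadrimodalFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # One cross-attention layer per modality stream (hypothetical layout).
        self.cross_attn = nn.ModuleDict({
            m: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for m in ("video", "audio", "text", "meta")
        })
        self.norm = nn.LayerNorm(d_model)

    def forward(self, streams: dict) -> torch.Tensor:
        fused = []
        for name, tokens in streams.items():
            # Context = tokens from all *other* modalities, concatenated in time.
            context = torch.cat(
                [v for k, v in streams.items() if k != name], dim=1)
            attended, _ = self.cross_attn[name](tokens, context, context)
            fused.append(self.norm(tokens + attended))  # residual per stream
        # Concatenate the enriched streams for downstream event detection.
        return torch.cat(fused, dim=1)


if __name__ == "__main__":
    B, d = 2, 256
    streams = {
        "video": torch.randn(B, 64, d),  # e.g., ViViT clip tokens
        "audio": torch.randn(B, 48, d),  # e.g., OpenL3 frame embeddings
        "text": torch.randn(B, 32, d),   # e.g., DistilBERT commentary tokens
        "meta": torch.randn(B, 8, d),    # assumed fourth (metadata) stream
    }
    out = CrossQuadrimodalFusion(d)(streams)
    print(out.shape)  # torch.Size([2, 152, 256])
```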
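Likewise, the Hierarchical TCN with metadata-enhanced positional encoding can be approximated by stacked dilated 1-D convolutions whose receptive field widens level by level, with a learned positional embedding summed with a projected metadata vector before the convolutions. The class name HierarchicalTCN, the dilation schedule (1, 2, 4, 8), and the additive metadata injection are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of hierarchical temporal convolution over the fused sequence,
# assuming dilated 1-D convolutions capture short- and long-term game context.
import torch
import torch.nn as nn


class HierarchicalTCN(nn.Module):
    def __init__(self, d_model: int = 256, max_len: int = 512, meta_dim: int = 16):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned positions
        self.meta_proj = nn.Linear(meta_dim, d_model)   # metadata -> d_model
        # Dilations 1, 2, 4, 8: each level widens the temporal receptive field.
        self.levels = nn.ModuleList([
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4, 8)
        ])
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model) fused tokens; meta: (B, meta_dim) per-match metadata.
        pos = self.pos_emb(torch.arange(x.size(1), device=x.device))
        x = x + pos + self.meta_proj(meta).unsqueeze(1)  # broadcast over time
        h = x.transpose(1, 2)                            # Conv1d expects (B, C, T)
        for conv in self.levels:
            h = h + self.act(conv(h))                    # residual dilated block
        return h.transpose(1, 2)                         # back to (B, T, d_model)
```

The design choice of residual dilated blocks keeps the sequence length unchanged at every level, so short events (single-level context) and long build-ups (deep-level context) are modeled by the same stack.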
Journal Introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice, and applications are the essential topics covered.