From words to visuals: Bridging text and visual insights using MetA-MARC framework for enhanced scholarly article categorization
Abhijit Mitra, Jayanta Paul, Tanis Ahamed, Sagar Basak, Jaya Sil
Knowledge-Based Systems, Volume 324, Article 113896 (published 2025-06-14). DOI: 10.1016/j.knosys.2025.113896
Citations: 0
Abstract
The rapid growth of technology has led to approximately 28,100 journals disseminating 2.5 million research articles annually, posing significant challenges in locating and categorizing articles of interest. Search engines, citation indexes, and digital libraries often return predominantly irrelevant papers due to limited indexing. Existing classification techniques leveraging content and metadata face challenges such as incomplete data and lack of semantic context. Metadata-based methods frequently rely on statistical metrics that neglect semantic meanings and require subject expertise for threshold setting. To address these issues, we propose Metadata-Driven Attention-Based Multimodal Academic Research Classifier (MetA-MARC), a framework leveraging the pretrained CLIP model to integrate text and image modalities for enhanced scholarly article classification. MetA-MARC captures semantic and contextual meaning by integrating metadata, OCR-extracted features, and images through CLIP (Contrastive Language-Image Pre-training). It introduces a novel textual inversion approach to map images to pseudo-word tokens in the CLIP embedding space for robust multimodal representations. The framework employs FusionWeave, a multimodal fusion network combining features using concatenation, cross fusion, and attention-based techniques, alongside Modality-Driven Adaptive Re-weighting (MoDAR) to dynamically prioritize relevant features. Experiments on JUCS, ACM, and proprietary CompScholar datasets demonstrate average accuracies of 0.86, 0.84, and 0.8848, respectively, surpassing state-of-the-art methods by up to 4.05%. These results highlight MetA-MARC’s potential as a robust, adaptive tool for automated scholarly article classification, effectively bridging text and visual modalities.
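The abstract outlines an architecture of frozen CLIP encoders, a fusion network (FusionWeave) combining concatenation, cross fusion, and attention, and adaptive modality re-weighting (MoDAR), but the paper's code is not reproduced here. The snippet below is a minimal, hypothetical sketch of such a pipeline in PyTorch, not the authors' implementation: the checkpoint name `openai/clip-vit-base-patch32`, the `FusionClassifier` class, and the softmax gate standing in for MoDAR are all assumptions, and the textual-inversion step (mapping images to pseudo-word tokens) is omitted.

```python
# Hypothetical reconstruction of a CLIP-based multimodal classifier in the
# spirit of MetA-MARC. Not the authors' code: the fusion head below is a
# stand-in for FusionWeave/MoDAR inferred from the abstract alone.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class FusionClassifier(nn.Module):
    """Assumed fusion head: cross-attention plus a learned softmax gate
    that re-weights the text and image branches per example."""
    def __init__(self, dim: int = 512, num_classes: int = 10):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Linear(2 * dim, 2)   # one weight per modality
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor):
        # Cross fusion: the text feature attends to the image feature.
        attended, _ = self.cross_attn(text_feat.unsqueeze(1),
                                      image_feat.unsqueeze(1),
                                      image_feat.unsqueeze(1))
        attended = attended.squeeze(1)
        # Adaptive re-weighting: softmax gate over the two modalities.
        concat = torch.cat([text_feat, attended], dim=-1)
        w = torch.softmax(self.gate(concat), dim=-1)
        fused = torch.cat([w[:, :1] * text_feat, w[:, 1:] * attended], dim=-1)
        return self.head(fused)

# Encode metadata/OCR text and a page image with frozen CLIP.
# (The blank image is a placeholder; a real pipeline would pass page scans.)
inputs = processor(text=["title + abstract + OCR-extracted text"],
                   images=Image.new("RGB", (224, 224)),
                   return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    text_feat = clip.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])

logits = FusionClassifier()(text_feat, image_feat)   # shape [1, num_classes]
```

Here the softmax gate plays the role the abstract assigns to MoDAR, dynamically prioritizing one modality over the other per example; the authors' actual re-weighting scheme and fusion operators may differ.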
About the journal:
Knowledge-Based Systems is an international, interdisciplinary journal in artificial intelligence that publishes original, innovative, and creative research. It focuses on systems built with knowledge-based and other artificial intelligence techniques, aiming to support human prediction and decision-making through data science and computation, to balance theory with practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.