Enhancing multi-modal document distillation with energy-weighted supervision
Jen-Chun Chang, Chia-Cheng Lee, Chung-Fu Lu, Victor R.L. Shen
Knowledge-Based Systems, Volume 330, Article 114542 (published 2025-09-24)
DOI: 10.1016/j.knosys.2025.114542
Citations: 0
Abstract
As large multi-modal document models (e.g., LayoutLMv3) grow increasingly complex, knowledge distillation (KD) has become essential for practical deployment. EnergyKD enhances conventional logit-based KD by adjusting the temperature per sample using energy scores. However, it can still mislead the student when teacher predictions are incorrect on high-energy (i.e., low-confidence) inputs. Although High-Energy Data Augmentation (HE-DA) has been introduced to address this issue, it adds significant training overhead. In this work, we propose Energy-Weighted Supervision (EWS), a general-purpose supervision augmentation framework built on an energy-based sample stratification mechanism. EWS dynamically adjusts the balance between hard-label and soft-label losses according to each sample’s energy score, thereby increasing the likelihood that the student model receives accurate, corrective supervision without requiring additional data augmentation or training overhead. Our experiments demonstrate that EWS effectively improves the performance of various KD methods. On the harder FUNSD benchmark, EWS yields the largest gains (+2.35 F1), while on CORD and SROIE the improvements are smaller but consistently positive (up to +0.84 and +0.11 F1, respectively), confirming broad applicability across KD paradigms. In particular, when applied to EnergyKD, EWS addresses its core limitation, namely the misleading influence of sharpened teacher outputs on high-energy samples, by allocating greater weight to hard-label signals. Conversely, for low-energy samples, EWS preserves the soft-label emphasis to fully exploit the teacher’s informative predictions. Compared with conventional logit-based KD, EnergyKD, and even HE-DA, our energy-guided loss modulation approach consistently improves student performance across multiple document understanding benchmarks without additional training cost. To the best of our knowledge, this is the first framework in multi-modal document distillation to simultaneously integrate energy-aware temperature scaling and dynamic supervision weighting, offering a promising direction for future research and deployment on resource-limited devices.
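To make the supervision-weighting idea concrete, below is a minimal PyTorch sketch of how an energy-guided blend of hard-label and soft-label losses could look. The energy definition follows the standard free-energy score over logits; the `ews_loss` function name, its parameters, the min-max normalisation, and the linear interpolation over batch energies are illustrative assumptions, not the authors' published formulation.

```python
# Illustrative sketch only: the weighting rule below is an assumption based on
# the abstract, not the paper's exact EWS formulation.
import torch
import torch.nn.functional as F


def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Free-energy score E(x) = -T * logsumexp(logits / T).

    Higher energy corresponds to lower teacher confidence on the sample.
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)


def ews_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor,
             labels: torch.Tensor,
             tau: float = 4.0,
             alpha_min: float = 0.2,
             alpha_max: float = 0.8) -> torch.Tensor:
    """Per-sample blend of hard-label (CE) and soft-label (KD) losses.

    High-energy samples lean on the hard-label signal (alpha -> alpha_max);
    low-energy samples keep the soft-label emphasis (alpha -> alpha_min).
    The normalisation and interpolation are hypothetical choices.
    """
    with torch.no_grad():
        energy = energy_score(teacher_logits)
        # Normalise energies within the batch to [0, 1]; 1 = least confident.
        e_norm = (energy - energy.min()) / (energy.max() - energy.min() + 1e-8)
        alpha = alpha_min + (alpha_max - alpha_min) * e_norm  # hard-label weight

    hard = F.cross_entropy(student_logits, labels, reduction="none")
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="none",
    ).sum(dim=-1) * tau ** 2

    return (alpha * hard + (1.0 - alpha) * soft).mean()
```

In this reading, a token-classification student (e.g., a distilled LayoutLMv3) would flatten its token logits to shape (N, C) before calling `ews_loss`; an EnergyKD-style per-sample temperature could be layered on top by making `tau` a function of the same energy score, though how the paper combines the two is not specified in the abstract.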
About the Journal
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built with knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.