Enhancing multi-modal document distillation with energy-weighted supervision
Jen-Chun Chang, Chia-Cheng Lee, Chung-Fu Lu, Victor R.L. Shen
Knowledge-Based Systems, Volume 330, Article 114542 (published 2025-09-24)
DOI: 10.1016/j.knosys.2025.114542
Citations: 0
Abstract
As large multi-modal document models (e.g., LayoutLMv3) grow increasingly complex, knowledge distillation (KD) has become essential for practical deployment. EnergyKD enhances conventional logit-based KD by adjusting the temperature per sample using energy scores. However, it can still mislead the student when teacher predictions are incorrect on high-energy (i.e., low-confidence) inputs. Although High-Energy Data Augmentation (HE-DA) has been introduced to address this issue, it adds significant training overhead. In this work, we propose Energy-Weighted Supervision (EWS), a general-purpose supervision augmentation framework built on an energy-based sample stratification mechanism. EWS dynamically adjusts the balance between hard-label and soft-label losses according to each sample’s energy score, thereby increasing the likelihood that the student model receives accurate, corrective supervision without requiring additional data augmentation or training overhead. Our experiments demonstrate that EWS effectively improves the performance of various KD methods. On the harder FUNSD benchmark, EWS yields the largest gains (+2.35 F1), while on CORD and SROIE the improvements are smaller but consistently positive (up to +0.84 and +0.11 F1, respectively), confirming broad applicability across KD paradigms. In particular, when applied to EnergyKD, EWS addresses its core limitation, namely the misleading influence of sharpened teacher outputs on high-energy samples, by allocating greater weight to hard-label signals. Conversely, for low-energy samples, EWS preserves the soft-label emphasis to fully exploit the teacher’s informative predictions. Compared with conventional logit-based KD, EnergyKD, and even HE-DA, our energy-guided loss modulation approach consistently improves student performance across multiple document understanding benchmarks without additional training cost. To the best of our knowledge, this is the first framework in multi-modal document distillation to simultaneously integrate energy-aware temperature scaling and dynamic supervision weighting, offering a promising direction for future research and deployment on resource-limited devices.
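To make the supervision-weighting idea concrete, below is a minimal PyTorch sketch of how an energy-guided blend of hard-label and soft-label losses could look. The energy definition follows the standard free-energy score over logits; the `ews_loss` function name, its parameters, the min-max normalisation, and the linear interpolation over batch energies are illustrative assumptions, not the authors' published formulation.

```python
# Illustrative sketch only: the weighting rule below is an assumption based on
# the abstract, not the paper's exact EWS formulation.
import torch
import torch.nn.functional as F


def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Free-energy score E(x) = -T * logsumexp(logits / T).

    Higher energy corresponds to lower teacher confidence on the sample.
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)


def ews_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor,
             labels: torch.Tensor,
             tau: float = 4.0,
             alpha_min: float = 0.2,
             alpha_max: float = 0.8) -> torch.Tensor:
    """Per-sample blend of hard-label (CE) and soft-label (KD) losses.

    High-energy samples lean on the hard-label signal (alpha -> alpha_max);
    low-energy samples keep the soft-label emphasis (alpha -> alpha_min).
    The normalisation and interpolation are hypothetical choices.
    """
    with torch.no_grad():
        energy = energy_score(teacher_logits)
        # Normalise energies within the batch to [0, 1]; 1 = least confident.
        e_norm = (energy - energy.min()) / (energy.max() - energy.min() + 1e-8)
        alpha = alpha_min + (alpha_max - alpha_min) * e_norm  # hard-label weight

    hard = F.cross_entropy(student_logits, labels, reduction="none")
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="none",
    ).sum(dim=-1) * tau ** 2

    return (alpha * hard + (1.0 - alpha) * soft).mean()
```

In this reading, a token-classification student (e.g., a distilled LayoutLMv3) would flatten its token logits to shape (N, C) before calling `ews_loss`; an EnergyKD-style per-sample temperature could be layered on top by making `tau` a function of the same energy score, though how the paper combines the two is not specified in the abstract.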
About the Journal
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built with knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.