Class-Missing Semi-supervised document key information extraction via synergistic refinement estimation
Pengcheng Guo, Yonghong Song, Boyu Wang, Yankai Cao, Jiayang Ren, Chaojie Ji, Jiahao Liu, Qi Zhang, Qiangqiang Mao
Information Processing & Management, volume 63 1, Article 104335. Published 2025-08-26. DOI: 10.1016/j.ipm.2025.104335. Impact factor: 6.9; JCR: Q1 (Computer Science, Information Systems).
https://www.sciencedirect.com/science/article/pii/S0306457325002766
Current methods for document key information extraction (DKIE) rely heavily on labeled data, which is costly to annotate. To mitigate this issue, the semi-supervised learning (SSL) paradigm, which utilizes unlabeled document samples, has gained broad attention in DKIE. However, existing SSL methods require labeled and unlabeled data to share an identical label space, which is impractical in many DKIE tasks: some unlabeled samples do not belong to any known class in the labeled set. In this paper, we formulate this problem as Class-Missing Semi-supervised (CMSS) DKIE. In DKIE, unknown classes usually belong to minority and fine-grained categories, which intensifies the misconnections between known and unknown classes and makes CMSS more challenging. To address this issue, we propose Synergistic Refinement Estimation (SRE), a progressive prototype estimation scheme that alleviates the bias of unknown classes toward the majority known classes on long-tailed unlabeled data. Furthermore, dynamic threshold hash rectification and structural calibration mechanisms are proposed to correct connections between fine-grained classes. Extensive experimental results demonstrate that SRE surpasses existing state-of-the-art methods on several DKIE benchmarks. Code is available at https://github.com/anonymoulink/SRE_DKIE.
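To make the class-missing setting concrete, the following is a minimal sketch of the problem only, not of SRE itself. It assumes a plain nearest-prototype baseline: each known class gets a mean feature vector, and an unlabeled sample whose nearest prototype is farther than a fixed cutoff is flagged as belonging to no known class. The function names, the toy 2-D features, the class names "date" and "total", and the cutoff `max_dist` are all hypothetical and chosen for illustration.

```python
import math


def class_prototypes(features, labels):
    """Return the mean feature vector (prototype) of each known class."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def assign_or_flag(vec, prototypes, max_dist):
    """Label vec with its nearest known class, or "unknown" if too far away.

    max_dist is a hypothetical fixed cutoff; the abstract describes a
    dynamic threshold, which this sketch does not attempt to reproduce.
    """
    best_label, best_dist = None, math.inf
    for lab, proto in prototypes.items():
        dist = math.dist(vec, proto)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = lab, dist
    return best_label if best_dist <= max_dist else "unknown"


# Toy labeled set with two known classes (hypothetical field types).
labeled_x = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
labeled_y = ["date", "date", "total", "total"]
protos = class_prototypes(labeled_x, labeled_y)

# Unlabeled set: the middle sample lies far from both known prototypes.
unlabeled_x = [(0.5, 1.0), (5.0, 30.0), (9.5, 11.0)]
pseudo_labels = [assign_or_flag(v, protos, max_dist=3.0) for v in unlabeled_x]
print(pseudo_labels)
```

A standard SSL pipeline without the cutoff would force the middle sample into one of the known classes, which is the kind of misconnection the abstract refers to.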
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.