Class-Missing Semi-supervised document key information extraction via synergistic refinement estimation

IF 6.9 · Zone 1 (Management Science) · JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Processing & Management · Pub Date: 2025-08-26 · DOI: 10.1016/j.ipm.2025.104335

Pengcheng Guo, Yonghong Song, Boyu Wang, Yankai Cao, Jiayang Ren, Chaojie Ji, Jiahao Liu, Qi Zhang, Qiangqiang Mao

Volume 63, Issue 1, Article 104335. Full text: https://www.sciencedirect.com/science/article/pii/S0306457325002766

Citations: 0

Abstract


Class-Missing Semi-supervised document key information extraction via synergistic refinement estimation

Current methods for document key information extraction (DKIE) rely heavily on labeled data with high annotation costs. To mitigate this issue, the semi-supervised learning (SSL) paradigm, which utilizes unlabeled document samples, has gained broad attention in DKIE. However, existing SSL methods require labeled and unlabeled data to share an identical label space, which is impractical in many DKIE tasks (i.e., some unlabeled samples do not belong to any known class in the labeled set). In this paper, we formulate this problem as Class-Missing Semi-supervised (CMSS) DKIE. In DKIE, unknown classes usually belong to minority and fine-grained categories, intensifying the misconnections between known and unknown classes and making CMSS more challenging. To address this issue, we propose Synergistic Refinement Estimation (SRE), a progressive prototype estimation scheme that alleviates the bias of unknown classes toward the majority known classes on long-tailed unlabeled data. Furthermore, dynamic threshold hash rectification and structural calibration mechanisms are proposed to correct connections between fine-grained classes. Extensive experimental results demonstrate that SRE surpasses existing state-of-the-art methods on several DKIE benchmarks. Code is available at https://github.com/anonymoulink/SRE_DKIE.
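To make the class-missing setting concrete, the sketch below shows a generic nearest-prototype pseudo-labeling step that sets aside unlabeled samples lying far from every known class. It is an illustration of the problem setup only, not the SRE method: the function names, the toy 2-D features, the field names `date` and `total`, and the fixed `max_dist` threshold are all assumptions introduced here and do not come from the paper or its repository.

```python
import math
from collections import defaultdict


def class_prototypes(features, labels):
    """Return the mean feature vector (prototype) of each known class."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def pseudo_label(vec, prototypes, max_dist):
    """Assign the nearest known class, or 'unknown' if every prototype is too far."""
    best_label, best_dist = None, math.inf
    for lab, proto in prototypes.items():
        d = math.dist(vec, proto)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_label, best_dist = lab, d
    return best_label if best_dist <= max_dist else "unknown"


# Hypothetical toy features for two known field types in the labeled set.
labeled_x = [[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.0, 4.2]]
labeled_y = ["date", "date", "total", "total"]
protos = class_prototypes(labeled_x, labeled_y)

# The last unlabeled sample belongs to no known class (the class-missing case).
unlabeled = [[0.1, 0.1], [3.9, 4.0], [10.0, -3.0]]
assigned = [pseudo_label(v, protos, max_dist=1.0) for v in unlabeled]
print(assigned)
```

A fixed distance threshold like this is the simplest possible choice; the abstract indicates that the paper instead estimates prototypes progressively and uses a dynamic threshold, so this sketch should be read only as a baseline for intuition.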


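Source journal

Information Processing & Management (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 17.00
Self-citation rate: 11.60%
Articles published: 276
Review time: 39 days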

About the journal: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.