Class-Missing Semi-supervised document key information extraction via synergistic refinement estimation

Pengcheng Guo, Yonghong Song, Boyu Wang, Yankai Cao, Jiayang Ren, Chaojie Ji, Jiahao Liu, Qi Zhang, Qiangqiang Mao
Information Processing & Management, Volume 63, Issue 1, Article 104335. Published 2025-08-26.
DOI: 10.1016/j.ipm.2025.104335
https://www.sciencedirect.com/science/article/pii/S0306457325002766
IF 6.9 · Zone 1 (Management) · JCR Q1 · COMPUTER SCIENCE, INFORMATION SYSTEMS
Citations: 0

Abstract

Current methods for document key information extraction (DKIE) rely heavily on labeled data, which is costly to annotate. To mitigate this issue, the semi-supervised learning (SSL) paradigm, which utilizes unlabeled document samples, has gained broad attention in DKIE. However, existing SSL methods require labeled and unlabeled data to share an identical label space, which is impractical in many DKIE tasks (i.e., some unlabeled samples do not belong to any known class in the labeled set). In this paper, we formulate this problem as Class-Missing Semi-supervised (CMSS) DKIE. In DKIE, unknown classes usually belong to minority and fine-grained categories, intensifying the misconnections between known and unknown classes and making CMSS more challenging. To address this issue, we propose Synergistic Refinement Estimation (SRE), a progressive prototype estimation scheme that alleviates the bias of unknown classes toward the majority known classes on long-tailed unlabeled data. Furthermore, dynamic threshold hash rectification and structural calibration mechanisms are proposed to correct connections between fine-grained classes. Extensive experimental results demonstrate that SRE surpasses existing state-of-the-art methods on several DKIE benchmarks. Code is available at https://github.com/anonymoulink/SRE_DKIE.
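
The class-missing setting described in the abstract can be illustrated with a toy example. The sketch below is not the SRE method and is not taken from the authors' repository. It only shows generic nearest-prototype pseudo-labeling with a fixed rejection threshold, so that an unlabeled sample far from every known class is flagged as "unknown" instead of being forced into a known class. The function names, the class names `total` and `date`, the 2-D features, and the threshold of 0.9 are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def class_prototypes(labeled):
    """Average the feature vectors of each known class into one prototype."""
    groups = {}
    for features, label in labeled:
        groups.setdefault(label, []).append(features)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }

def pseudo_label(features, prototypes, threshold=0.9):
    """Return the nearest known class, or "unknown" if no prototype is close enough."""
    label, score = max(
        ((name, cosine(features, proto)) for name, proto in prototypes.items()),
        key=lambda pair: pair[1],
    )
    return label if score >= threshold else "unknown"

# Hypothetical toy data: the labeled set covers only two known classes.
labeled = [([1.0, 0.0], "total"), ([0.9, 0.1], "total"),
           ([0.0, 1.0], "date"), ([0.1, 0.9], "date")]
prototypes = class_prototypes(labeled)

# The third unlabeled sample sits between both prototypes and matches neither well.
unlabeled = [[0.95, 0.05], [0.05, 0.95], [0.7, 0.7]]
labels = [pseudo_label(x, prototypes) for x in unlabeled]
print(labels)
```

A closed-set SSL method would assign the third sample to one of the two known classes. The fixed threshold used here is a deliberate simplification; the abstract describes a dynamic threshold combined with hash rectification and structural calibration, none of which this sketch attempts to reproduce.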
Source journal
Information Processing & Management
Category: Engineering & Technology, Computer Science: Information Systems
CiteScore: 17.00
Self-citation rate: 11.60%
Articles published: 276
Review time: 39 days
Journal description: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.