基于类再平衡和合并语义伪标记的半监督关键信息提取方法

IF 5.5 2区 计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Qi Zhang, Yonghong Song, Pengcheng Guo, Yangyang Hui
{"title":"基于类再平衡和合并语义伪标记的半监督关键信息提取方法","authors":"Qi Zhang,&nbsp;Yonghong Song,&nbsp;Pengcheng Guo,&nbsp;Yangyang Hui","doi":"10.1016/j.neucom.2024.128907","DOIUrl":null,"url":null,"abstract":"<div><div>There is a growing demand in the field of Key Information Extraction (KIE) to apply semi-supervised learning (SSL) to save manpower and costs, as training document data using fully-supervised methods requires labor-intensive manual annotation. The main challenges of applying SSL in the KIE are (1) underestimation of the confidence of tail classes in the long-tailed distribution and (2) difficulty in achieving intra-class compactness and inter-class separability of tail features. To address these challenges, we propose a novel semi-supervised approach for KIE with Class-Rebalancing and Merged Semantic Pseudo-Labeling (CRMSP). Firstly, the Class-Rebalancing Pseudo-Labeling (CRP) module introduces a reweighting factor to rebalance pseudo-labels, increasing attention to tail classes. Secondly, we propose the Merged Semantic Pseudo-Labeling (MSP) module to cluster tail features of unlabeled data by assigning samples to Merged Prototypes (MP). Additionally, we designed a new contrastive loss specifically for MSP. Extensive experimental results on three well-known benchmarks demonstrate that CRMSP achieves state-of-the-art performance. Remarkably, CRMSP achieves 3.24% f1-score improvement over state-of-the-art on the CORD.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"616 ","pages":"Article 128907"},"PeriodicalIF":5.5000,"publicationDate":"2024-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CRMSP: A semi-supervised approach for key information extraction with Class-Rebalancing and Merged Semantic Pseudo-Labeling\",\"authors\":\"Qi Zhang,&nbsp;Yonghong Song,&nbsp;Pengcheng Guo,&nbsp;Yangyang Hui\",\"doi\":\"10.1016/j.neucom.2024.128907\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>There is a growing demand in the field of Key Information Extraction (KIE) to apply semi-supervised learning (SSL) to save manpower and costs, as training document data using fully-supervised methods requires labor-intensive manual annotation. The main challenges of applying SSL in the KIE are (1) underestimation of the confidence of tail classes in the long-tailed distribution and (2) difficulty in achieving intra-class compactness and inter-class separability of tail features. To address these challenges, we propose a novel semi-supervised approach for KIE with Class-Rebalancing and Merged Semantic Pseudo-Labeling (CRMSP). Firstly, the Class-Rebalancing Pseudo-Labeling (CRP) module introduces a reweighting factor to rebalance pseudo-labels, increasing attention to tail classes. Secondly, we propose the Merged Semantic Pseudo-Labeling (MSP) module to cluster tail features of unlabeled data by assigning samples to Merged Prototypes (MP). Additionally, we designed a new contrastive loss specifically for MSP. Extensive experimental results on three well-known benchmarks demonstrate that CRMSP achieves state-of-the-art performance. Remarkably, CRMSP achieves 3.24% f1-score improvement over state-of-the-art on the CORD.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"616 \",\"pages\":\"Article 128907\"},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2024-11-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231224016783\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224016783","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

摘要

关键信息提取(Key Information Extraction, KIE)领域对半监督学习(semi-supervised learning, SSL)的应用需求日益增长,以节省人力和成本,因为使用全监督方法训练文档数据需要耗费大量劳动的人工标注。在KIE中应用SSL的主要挑战是:(1)低估尾类在长尾分布中的置信度;(2)难以实现尾特征的类内紧密性和类间可分性。为了解决这些挑战,我们提出了一种基于类再平衡和合并语义伪标记(CRMSP)的半监督KIE方法。首先,类再平衡伪标签(CRP)模块引入了一个重新加权因子来重新平衡伪标签,增加了对尾部类的关注。其次,我们提出了合并语义伪标记(MSP)模块,通过将样本分配给合并原型(MP)来对未标记数据的尾部特征进行聚类。此外,我们还专门为MSP设计了一种新的对比损耗。在三个知名基准测试上的大量实验结果表明,CRMSP达到了最先进的性能。值得注意的是,CRMSP在CORD上的得分比最先进的水平提高了3.24%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
CRMSP: A semi-supervised approach for key information extraction with Class-Rebalancing and Merged Semantic Pseudo-Labeling
There is a growing demand in the field of Key Information Extraction (KIE) to apply semi-supervised learning (SSL) to save manpower and costs, as training document data using fully-supervised methods requires labor-intensive manual annotation. The main challenges of applying SSL in the KIE are (1) underestimation of the confidence of tail classes in the long-tailed distribution and (2) difficulty in achieving intra-class compactness and inter-class separability of tail features. To address these challenges, we propose a novel semi-supervised approach for KIE with Class-Rebalancing and Merged Semantic Pseudo-Labeling (CRMSP). Firstly, the Class-Rebalancing Pseudo-Labeling (CRP) module introduces a reweighting factor to rebalance pseudo-labels, increasing attention to tail classes. Secondly, we propose the Merged Semantic Pseudo-Labeling (MSP) module to cluster tail features of unlabeled data by assigning samples to Merged Prototypes (MP). Additionally, we designed a new contrastive loss specifically for MSP. Extensive experimental results on three well-known benchmarks demonstrate that CRMSP achieves state-of-the-art performance. Remarkably, CRMSP achieves 3.24% f1-score improvement over state-of-the-art on the CORD.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Neurocomputing
Neurocomputing 工程技术-计算机:人工智能
CiteScore
13.10
自引率
10.00%
发文量
1382
审稿时长
70 days
期刊介绍: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信