PLBR: A Semi-Supervised Document Key Information Extraction via Pseudo-Labeling Bias Rectification

IF 8.9 | Zone 2 (Computer Science) | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering Pub Date : 2024-08-28 DOI:10.1109/TKDE.2024.3443928

Pengcheng Guo;Yonghong Song;Boyu Wang;Jiaohao Liu;Qi Zhang

IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 12, pp. 9025-9036 (Journal Article). IEEE Xplore: https://ieeexplore.ieee.org/document/10654529/

Citations: 0

Abstract

Document key information extraction (DKIE) methods often require a large number of labeled samples, which imposes substantial annotation costs in practical scenarios. Pseudo-labeling based semi-supervised learning (PSSL) algorithms provide an effective paradigm for reducing the reliance on labeled data by leveraging unlabeled data. However, PSSL faces two main challenges in DKIE tasks: 1) the context dependency of DKIE results in incorrect pseudo-labels; 2) DKIE data exhibit high intra-class variance and low inter-class variation. To this end, this paper proposes Pseudo-Label Bias Rectification (PLBR), a similarity-matrix-based semi-supervised method for DKIE tasks that improves the quality of pseudo-labels on DKIE benchmarks with scarce labels. More specifically, the Similarity Matrix Bias Rectification (SMBR) module improves the quality of pseudo-labels by exploiting the contextual information of DKIE data through an analysis of the similarity between labeled and unlabeled data. Moreover, a dual branch adaptive alignment (DBAA) mechanism, composed of two branches, is designed to adaptively align intra-class variance and alleviate inter-class variation on DKIE benchmarks. The intra-class alignment branch adaptively aligns intra-class variance, while the inter-class alignment branch adaptively alleviates inter-class variance changes at the representation level. Extensive experiments on two benchmarks demonstrate that PLBR achieves state-of-the-art performance, surpassing the previous SOTA by $2.11\% \sim 2.53\%$ and $2.09\% \sim 2.49\%$ F1-score on FUNSD and CORD, respectively, with scarce labeled samples. Code will be open to the public.
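The abstract does not give the SMBR formulation, so the following is only a minimal sketch of the general idea it describes: using the similarity between labeled and unlabeled samples to correct a classifier's pseudo-labels. Everything here is an assumption made for illustration, not the authors' method: the function name `rectify_pseudo_labels`, the use of cosine similarity, the softmax `temperature`, and the blending weight `alpha` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rectify_pseudo_labels(unlabeled_feats, model_probs, labeled_feats,
                          labeled_ys, num_classes, temperature=0.1, alpha=0.5):
    """Blend classifier probabilities with similarity-weighted class votes.

    Hypothetical illustration only: `temperature` and `alpha` are assumed
    hyperparameters, not values taken from the paper.
    Returns (pseudo_labels, rectified_probs).
    """
    pseudo_labels, rectified = [], []
    for feat, probs in zip(unlabeled_feats, model_probs):
        # One row of the unlabeled-vs-labeled similarity matrix.
        sims = [cosine(feat, lf) for lf in labeled_feats]
        weights = softmax([s / temperature for s in sims])
        # Accumulate the similarity weight carried by each class.
        votes = [0.0] * num_classes
        for w, y in zip(weights, labeled_ys):
            votes[y] += w
        # Convex combination of model belief and similarity evidence.
        mixed = [alpha * p + (1 - alpha) * v for p, v in zip(probs, votes)]
        rectified.append(mixed)
        pseudo_labels.append(max(range(num_classes), key=mixed.__getitem__))
    return pseudo_labels, rectified


# Toy data: two labeled anchors, one unlabeled sample close to class 0.
labeled_feats = [[1.0, 0.0], [0.0, 1.0]]
labeled_ys = [0, 1]
unlabeled_feats = [[0.9, 0.1]]
model_probs = [[0.45, 0.55]]  # the classifier alone slightly prefers class 1

labels, probs = rectify_pseudo_labels(
    unlabeled_feats, model_probs, labeled_feats, labeled_ys, num_classes=2
)
print(labels, probs)
```

In this toy case the classifier alone would assign class 1, but the unlabeled sample lies much closer to the class 0 anchor, so the blended distribution selects class 0. A real DKIE system would presumably compute similarities over contextual token or entity representations rather than hand-written 2-D vectors.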




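Source journal

IEEE Transactions on Knowledge and Data Engineering (Engineering Technology - Engineering: Electrical & Electronic)

CiteScore: 11.70

Self-citation rate: 3.40%

Articles published: 515

Review time: 6 months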

Journal description: The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools, and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.