End-to-end entity extraction from OCRed texts using summarization models

Pedro A. Villa-García, Raúl Alonso-Calvo, Miguel García-Remesal
Neural Computing and Applications, published 2024-09-19
DOI: 10.1007/s00521-024-10422-9 (https://doi.org/10.1007/s00521-024-10422-9)

Abstract

A novel methodology is introduced for extracting entities from noisy scanned documents by using end-to-end data and reformulating the entity extraction task as a text summarization problem. This approach offers two significant advantages over traditional entity extraction methods while maintaining comparable performance. First, it utilizes preexisting data to construct datasets, thereby eliminating the need for labor-intensive annotation procedures. Second, it employs multitask learning, enabling the training of a model via a single dataset. To evaluate our approach against state-of-the-art methods, we adapted three commonly used datasets, namely, Conference on Natural Language Learning (CoNLL++), few-shot named entity recognition (Few-NERD), and WikiNEuRal domain adaptation (WikiNEuRal + DA), to the format required by our methodology. We subsequently fine-tuned four sequence-to-sequence models: text-to-text transfer transformer (T5), fine-tuned language net T5 (FLAN-T5), bidirectional autoregressive transformer (BART), and pretraining with extracted gap sentences for abstractive summarization sequence-to-sequence models (PEGASUS). The results indicate that, in the absence of optical character recognition (OCR) noise, the BART model performs comparably to state-of-the-art methods. Furthermore, the performance degradation was limited to 3.49–5.23% when 39–62% of the sentences contained OCR noise. This performance is significantly superior to that of previous studies, which reported a 10–20% decrease in the F1 score with texts that had a 20% OCR error rate. Our experimental results demonstrate that a single model trained via our methodology can reliably extract entities from noisy OCRed texts, unlike existing state-of-the-art approaches, which require separate models for correcting OCR errors and extracting entities.
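To make the reformulation concrete, the following is a minimal sketch of how a token-level NER example might be converted into a (source, target) pair for a summarization model. The abstract does not specify the target format the authors use, so the "TYPE: text; TYPE: text" layout, the "none" placeholder, and the function names `bio_to_entities` and `to_summarization_pair` are all assumptions made for illustration only.

```python
from typing import List, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) spans from BIO-tagged tokens."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    def close_span() -> None:
        # Store the open span, if any, and reset the state.
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_span()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            # "O" tags and orphan or mismatched "I-" tags end the open span.
            close_span()
    close_span()
    return entities


def to_summarization_pair(tokens: List[str], tags: List[str]) -> Tuple[str, str]:
    """Build a (source, target) pair; the target layout is hypothetical."""
    source = " ".join(tokens)
    entities = bio_to_entities(tokens, tags)
    # Assumed target format: "TYPE: text" items joined by "; ", or "none".
    target = "; ".join(f"{etype}: {text}" for text, etype in entities) or "none"
    return source, target


if __name__ == "__main__":
    src, tgt = to_summarization_pair(
        ["Barack", "Obama", "visited", "Paris", "."],
        ["B-PER", "I-PER", "O", "B-LOC", "O"],
    )
    print(src)  # Barack Obama visited Paris .
    print(tgt)  # PER: Barack Obama; LOC: Paris
```

Pairs of this kind could then be used to fine-tune a sequence-to-sequence model such as T5 or BART in the same way as an ordinary summarization dataset. For the noisy setting, the source side would be the OCRed text while the target keeps the clean entity strings, which is one plausible reading of how a single model could handle both OCR errors and extraction.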

