PDF Entity Annotation Tool (PEAT).

Journal of Open Source Software. Pub Date: 2025-04-08. DOI: 10.21105/joss.05336

Christopher G Stahl, Kristan J Markey, Brian C Jewell, Dahnish Shams, Michele M Taylor, A Amina Wilkins, Sean Watford, Andy Shapiro, Michelle Angrish

Volume 10, issue 108, article 5336. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12180754/pdf/

Citation count: 0

Abstract

Text mining approaches, including Artificial Intelligence (AI) and other machine-based methods, continue to expand at a rapid pace, yet the tools researchers use to create the labeled datasets required for training, modeling, and evaluation remain rudimentary. Labeled datasets contain the target attributes the machine is going to learn; for example, training an algorithm to distinguish between images of cars and trucks would generally require a set of images with a quantitative description of the underlying features of each vehicle type. The development of labeled textual data for building natural language machine learning (ML) models of scientific literature is not currently integrated into the manual workflows used by domain experts. Published literature is rich with important information, such as different types of embedded text, plots, and tables, all of which can serve as inputs for training ML/natural language processing (NLP) models once extracted and prepared in machine-readable formats. Currently, both normalized data extraction of use to domain experts and extraction to support development of ML/NLP models are labor-intensive and cumbersome manual processes. Automatic extraction of data and information is difficult for formats such as PDF, which are optimized for layout and human readability rather than machine readability. The PDF (Portable Document Format) Entity Annotation Tool (PEAT) was developed with the goal of allowing users to annotate publications within their current print format, while also allowing those annotations to be captured in a machine-readable format.

One of the main issues with traditional annotation tools is that they require transforming the PDF into plain text to facilitate the annotation process. While doing so lessens the technical challenges of annotating data, the user loses all structure and provenance inherent in the underlying PDF. Textual data extraction from PDFs can also be an error-prone process; challenges include identifying sequential blocks of text and handling a multitude of document formats (multiple columns, font encodings, etc.). As a result, using existing tools to develop NLP/ML models directly from PDFs is difficult because the generated outputs are not interoperable. We created a system that allows annotations to be completed on the original PDF document structure, with no plain text extraction. The result is an application that allows for easier and more accurate annotations. In addition, by including a feature that lets users easily create a schema, we have developed a system that can be used to annotate text for different domain-centric schemas of relevance to subject matter experts. Different knowledge domains require distinct schemas and annotation tags to support machine learning.
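To make the idea of schema-driven annotations anchored to the PDF layout more concrete, the following is a minimal Python sketch. It is not PEAT's actual data model or export format, which the abstract does not describe; the `Annotation` fields, the `SCHEMA` contents, and the function names `validate` and `export_annotations` are all hypothetical and chosen only for illustration. The sketch assumes an annotation keeps its page number and bounding box (so provenance in the original document is preserved) and is checked against a user-defined set of tags before being written out as JSON.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical user-defined schema: a name plus the tags a domain expert allows.
SCHEMA = {"name": "toxicology-demo", "tags": ["chemical", "species", "dose"]}


@dataclass
class Annotation:
    """One labeled span, located by its position in the PDF (assumed layout)."""
    page: int     # 1-based page number in the source PDF
    bbox: tuple   # (x0, y0, x1, y1) in PDF points
    text: str     # the text covered by the bounding box
    tag: str      # must be one of the schema's tags


def validate(ann, schema):
    """Return the annotation unchanged, or raise ValueError if it is invalid."""
    if ann.tag not in schema["tags"]:
        raise ValueError(f"tag {ann.tag!r} is not in schema {schema['name']!r}")
    x0, y0, x1, y1 = ann.bbox
    if not (x0 < x1 and y0 < y1):
        raise ValueError(f"degenerate bounding box: {ann.bbox}")
    return ann


def export_annotations(annotations, schema, source_pdf):
    """Serialize validated annotations to a machine-readable JSON string."""
    records = [asdict(validate(a, schema)) for a in annotations]
    payload = {"source": source_pdf, "schema": schema["name"], "annotations": records}
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    demo = [Annotation(page=3, bbox=(72.0, 540.0, 130.5, 552.0),
                       text="bisphenol A", tag="chemical")]
    print(export_annotations(demo, SCHEMA, "paper.pdf"))
```

Because each record carries its page and coordinates, a downstream consumer could in principle map a label back to its location in the original PDF, which is the provenance that plain-text conversion discards.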
