Extracting structure from text documents based on machine learning

PROBLEMS IN PROGRAMMING, Pub Date: 2022-12-01, DOI: 10.15407/pp2022.03-04.154

K. Kudim, G. Proskudina

Abstract

This study presents a method for extracting structure from text documents using an artificial neural network. The method consists of data preparation, building and training the model, and evaluating the results. Data preparation includes collecting corpora of documents, converting a variety of file formats into plain text, and manually labeling the structure of each document. The documents are then split into tokens and paragraphs. Each paragraph is represented as a feature vector that serves as input to the neural network. The model is trained and validated on selected subsets of the data, and the evaluation results for the trained model are presented. Final performance is reported per label using precision, recall, and F1 measures, together with the overall average. The trained model can be used to extract sections from documents that share a similar structure.
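The paragraph-level pipeline and the per-label evaluation described above can be illustrated with a minimal Python sketch. The abstract does not specify the feature scheme, the label set, or the network architecture, so everything here is an assumption made for illustration: the blank-line paragraph splitter, the bag-of-words features, and the function names (split_paragraphs, tokenize, featurize, per_label_scores) are hypothetical and are not the authors' implementation. Model training is omitted.

```python
import re
from collections import Counter


def split_paragraphs(text):
    """Split plain text into paragraphs on blank lines (assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokenize(paragraph):
    """Lowercase word tokens; a deliberately simple stand-in tokenizer."""
    return re.findall(r"\w+", paragraph.lower())


def featurize(paragraph, vocabulary):
    """Bag-of-words count vector over a fixed vocabulary (assumed feature scheme)."""
    counts = Counter(tokenize(paragraph))
    return [counts[word] for word in vocabulary]


def per_label_scores(gold, predicted):
    """Return per-label (precision, recall, F1) and their unweighted average."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    scores = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    # Macro average: every label counts equally, regardless of its frequency.
    macro = tuple(sum(s[i] for s in scores.values()) / len(scores) for i in range(3))
    return scores, macro
```

In this sketch, per_label_scores would be called with the manually assigned paragraph labels and the labels predicted by the trained model. The unweighted (macro) average is one reading of "overall average"; the abstract does not say which averaging the authors used.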
