Implementation of an Algorithm for Extracting Information about Structural Elements of Text Documents in ODT Format

Q4 Materials Science

Radioelektronika, Nanosistemy, Informacionnye Tehnologii. Pub Date: 2023-06-16. DOI: 10.17587/it.29.307-315

A. Berezhkov, G. S. Larionova, V. Martsinkevich, V. Tereshchenko

Citations: 0

Abstract

This article examines how the XML markup of a digital document in ODT format depends on the tool used to create it. The comparison covers not only specialized tools but also tools that do not work with the ODT format directly, in order to identify the most vulnerable spots. It also describes the particulars of extracting data from structural elements of the document, such as tables, lists, and images. An implementation of an algorithm for obtaining style attributes is proposed and described; the algorithm is used to build a system for automated normative control of digital documents. The analysis shows that the non-strict ODT standard makes the XML markup dependent on the text editor used to create the document and, as a result, limits the number of tags that can be relied upon when developing document parsing algorithms. Nevertheless, the task is feasible, as the article demonstrates. The default values and the description of the algorithm for traversing the document by blocks and structural elements form the basis for preparing data for the subsequent creation of a classifier and for automating normative control. Thus, the proposed algorithm and the XML markup analysis are effective tools for building an automated document standards control system, and the algorithm has potential for further improvement.
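The kind of traversal the abstract describes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `paragraph_styles` and `walk_body` and the in-memory sample document are hypothetical, and the sketch assumes that only automatic styles in `content.xml` matter. It ignores `styles.xml`, parent-style inheritance, and the default values the article mentions, all of which a real checker would probably need.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Standard OpenDocument namespace URIs.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
}


def q(prefix, local):
    """Build the '{uri}local' name that ElementTree uses for tags and attributes."""
    return "{%s}%s" % (NS[prefix], local)


def paragraph_styles(root):
    """Map each automatic style name to the fo:* attributes of its paragraph properties."""
    fo_prefix = "{%s}" % NS["fo"]
    styles = {}
    for style in root.iterfind("office:automatic-styles/style:style", NS):
        props = style.find("style:paragraph-properties", NS)
        attrs = {}
        if props is not None:
            attrs = {k[len(fo_prefix):]: v
                     for k, v in props.attrib.items() if k.startswith(fo_prefix)}
        styles[style.get(q("style", "name"))] = attrs
    return styles


def walk_body(odt_bytes):
    """Return (kind, info) pairs for the top-level blocks of the document body."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    styles = paragraph_styles(root)
    blocks = []
    for node in root.find("office:body/office:text", NS):
        if node.tag in (q("text", "p"), q("text", "h")):
            # A paragraph that wraps a draw:image is reported as an image block.
            has_image = node.find(".//draw:image", NS) is not None
            style_name = node.get(q("text", "style-name"))
            blocks.append(("image" if has_image else "paragraph",
                           styles.get(style_name, {})))
        elif node.tag == q("table", "table"):
            blocks.append(("table", {"rows": len(node.findall("table:table-row", NS))}))
        elif node.tag == q("text", "list"):
            blocks.append(("list", {"items": len(node.findall("text:list-item", NS))}))
    return blocks


# Hypothetical sample document, built in memory so the sketch runs without a real file.
XMLNS = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())
SAMPLE = f"""<office:document-content {XMLNS}>
  <office:automatic-styles>
    <style:style style:name="P1" style:family="paragraph">
      <style:paragraph-properties fo:text-align="center" fo:margin-top="0.5cm"/>
    </style:style>
  </office:automatic-styles>
  <office:body><office:text>
    <text:p text:style-name="P1">Title</text:p>
    <table:table table:name="T1">
      <table:table-row><table:table-cell><text:p>a</text:p></table:table-cell></table:table-row>
      <table:table-row><table:table-cell><text:p>b</text:p></table:table-cell></table:table-row>
    </table:table>
    <text:list><text:list-item><text:p>one</text:p></text:list-item></text:list>
    <text:p><draw:frame><draw:image/></draw:frame></text:p>
  </office:text></office:body>
</office:document-content>"""


def make_sample_odt():
    """Pack SAMPLE into a ZIP archive, mimicking the layout of an .odt file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("content.xml", SAMPLE)
    return buf.getvalue()


if __name__ == "__main__":
    for kind, info in walk_body(make_sample_odt()):
        print(kind, info)
```

The sketch dispatches only on `text:p`, `text:h`, `table:table`, `text:list`, and `draw:image`. That choice is an assumption motivated by the abstract's observation that few tags can be relied upon across editors; the article itself may rely on a different set.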

Source journal

Radioelektronika, Nanosistemy, Informacionnye Tehnologii (Materials Science: Materials Science (miscellaneous))

CiteScore: 0.60

Self-citation rate: 0.00%

About the journal: The journal “Radioelectronics. Nanosystems. Information Technologies” (abbreviated RENSIT) publishes previously unpublished original articles, reviews, and brief reports on topical problems in radioelectronics (including biomedical radioelectronics), the fundamentals of information, nano- and biotechnologies, and adjacent areas of physics and mathematics. Its authors are academicians, corresponding members, and foreign members of the Russian Academy of Natural Sciences (RANS) and their colleagues, as well as other Russian and foreign authors whose submissions are proposed by a member of RANS. An author can obtain such a proposal before sending the article to the editor, or after the article arrives, on the recommendation of a member of the editorial board or another RANS member who gave an opinion on the article at the editor's request. The editors accept articles in both Russian and English. Articles are peer reviewed internally (double-blind) by members of the Editorial Board; some articles also undergo external review if necessary. The journal is intended for researchers, graduate students, senior-year physics students, and teachers. It is published twice a year (two issues per year).