A. Berezhkov, G. S. Larionova, V. Martsinkevich, V. Tereshchenko
DOI: 10.17587/it.29.307-315 (https://doi.org/10.17587/it.29.307-315)
Published: 2023-06-16
Journal: Radioelektronika, Nanosistemy, Informacionnye Tehnologii (volume/issue: 15 1)
Implementation of an Algorithm for Extracting Information about Structural Elements of Text Documents in ODT Format
This article examines how the XML markup of a digital document in ODT format depends on the tool used to create it. The comparison covers not only specialized tools but also tools that do not work with the ODT format directly, in order to identify the most vulnerable spots. The specifics of extracting data from structural elements of the document, such as tables, lists, and images, are also described. An algorithm for obtaining style attributes is proposed and its implementation is described; it is used to build a system for automated normative control of digital documents. The analysis shows that the non-strict ODT standard makes the XML markup dependent on the text editor used to create the document and, as a result, limits the number of tags that can be relied upon when developing document parsing algorithms. The task is nevertheless feasible, as the article demonstrates. The default values and the described algorithm for traversing the document by blocks and structural elements form the basis for preparing data for the subsequent creation of a classifier and for automating normative control. Thus, the proposed algorithm and the XML markup analysis are effective tools for building an automated system for checking documents against standards, and the algorithm has potential for further improvement.
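The block-by-block traversal and style-attribute lookup described above could look roughly like the following Python sketch. It is not the authors' implementation: the function names `automatic_styles` and `walk_blocks` and the shape of the returned dictionaries are hypothetical, and the sketch assumes that only the automatic styles declared in `content.xml` are consulted. Named styles and default values stored in `styles.xml`, which the article also relies on, are deliberately left out.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument specification.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
}


def q(prefix, name):
    """Build the '{namespace}name' form that ElementTree uses for attributes."""
    return f"{{{NS[prefix]}}}{name}"


def automatic_styles(root):
    """Map each automatic style name to its flattened formatting attributes."""
    styles = {}
    for style in root.iterfind("office:automatic-styles/style:style", NS):
        props = {}
        # Children are property groups such as style:paragraph-properties.
        for group in style:
            for key, value in group.attrib.items():
                props[key.split("}")[-1]] = value  # drop the namespace part
        styles[style.get(q("style", "name"))] = props
    return styles


def walk_blocks(odt_file):
    """Return one record per top-level block (p, h, list, table, ...)."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    styles = automatic_styles(root)
    body = root.find("office:body/office:text", NS)
    if body is None:  # not a text document
        return []
    blocks = []
    for element in body:
        style_name = (element.get(q("text", "style-name"))
                      or element.get(q("table", "style-name")))
        blocks.append({
            "kind": element.tag.split("}")[-1],
            "style": style_name,
            "attrs": styles.get(style_name, {}),
            "text": "".join(element.itertext()).strip(),
            "images": len(element.findall(".//draw:image", NS)),
        })
    return blocks
```

Calling `walk_blocks("report.odt")` (the file name is a placeholder) yields a list of records that a later classifier could consume. Because editors differ in which tags and styles they emit, a lookup that misses here returns an empty `attrs` dictionary rather than failing, which marks the point where default values would have to be applied.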
About the journal:
The journal “Radioelectronics. Nanosystems. Information Technologies” (abbreviated RENSIT) publishes original, previously unpublished articles, reviews, and brief reports on topical problems in radioelectronics (including biomedical radioelectronics), the fundamentals of information technologies, nanotechnologies, and biotechnologies, and adjacent areas of physics and mathematics. The authors of the journal are academicians, corresponding members, and foreign members of the Russian Academy of Natural Sciences (RANS) and their colleagues, as well as other Russian and foreign authors recommended by members of RANS. An author can obtain this recommendation before submitting an article, or after submission from a member of the editorial board or another member of RANS who has given an opinion on the article at the editor's request. The editors accept articles in both Russian and English. Articles undergo internal double-blind peer review by members of the Editorial Board, and some articles also undergo external review if necessary. The journal is intended for researchers, graduate students, senior physics students, and teachers. It is published twice a year (two issues).