The design of the structure of the software system for processing text document corpus

V. Barakhnin, O. Kozhemyakina, R. Mukhamedyev, Yu. S. Borzilova, K. Yakunin

Biznes Informatika-Business Informatics, 2019-12-31. DOI: 10.17323/1998-0663.2019.4.60.72
Abstract
One of the most difficult tasks in the field of data mining is the development of universal tools for the analysis of texts written in literary and business styles. A popular approach to developing algorithms for processing text document corpora is the use of machine learning methods for solving NLP (natural language processing) tasks. Research in this field is motivated by the following factors: the structural specificity of literary and business style texts (which requires the formation of separate datasets and, when machine learning methods are used, additional feature selection) and the lack of complete systems for mass processing of Russian-language text documents available to the scientific community (in the commercial environment there are smaller-scale systems that solve highly specialized tasks, for example, determining the sentiment of a text). The aim of the current study is to design and further develop the structure of a system for processing a text document corpus. The design takes into account the requirements for large-scale systems: modularity, the ability to scale components, and the conditional independence of components. The system we designed is a set of components, each of which is built and deployed as a Docker container. The system consists of three levels: the data processing level, the data storage level, and the level for visualizing and managing the results of data processing (the visualization and management level). At the data processing level, text documents (for example, news items) are collected (scraped) and then processed by an ensemble of machine learning methods, each of which is implemented in the system as a separate Airflow task. The results are stored in a relational database; ElasticSearch is used to speed up search over the data (more than 1 million units). The statistics produced by the algorithms are visualized using Plotly. Administration and viewing of the processed texts are available through a web interface built with the Django framework. The overall interaction of the components is organized on the ETL (extract, transform, load) principle. Currently the system is used to analyze a corpus of news texts in order to identify information of a destructive nature. In the future, we expect to improve the system and publish its components in an open GitHub repository for access by the scientific community.
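To make the described architecture concrete, the sketch below shows what a pipeline of this shape could look like as an Airflow DAG: a scraping step (extract), an ensemble of ML methods where each method is a separate task (transform), and a step that writes results to the relational database and ElasticSearch index (load). This is a minimal illustration, not the paper's actual code: the DAG id, task names, callables, and the choice of two ensemble members (sentiment and topics) are assumptions, and the Airflow 2.x import style is assumed.

```python
# Hypothetical sketch of the ETL pipeline described in the abstract,
# written against the Airflow 2.x API. All names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape_news(**context):
    """Extract: collect (scrape) raw news texts. Placeholder body."""
    ...


def run_sentiment_model(**context):
    """Transform: one member of the ML ensemble. Placeholder body."""
    ...


def run_topic_model(**context):
    """Transform: another ensemble member. Placeholder body."""
    ...


def load_results(**context):
    """Load: store results in the relational DB and index them in
    ElasticSearch for fast search. Placeholder body."""
    ...


with DAG(
    dag_id="corpus_processing",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="scrape_news", python_callable=scrape_news)
    sentiment = PythonOperator(task_id="sentiment", python_callable=run_sentiment_model)
    topics = PythonOperator(task_id="topics", python_callable=run_topic_model)
    load = PythonOperator(task_id="load_results", python_callable=load_results)

    # Each ML method runs as its own Airflow task, in line with the paper's
    # design; loading happens only after every ensemble member finishes.
    extract >> [sentiment, topics]
    [sentiment, topics] >> load
```

Structuring each ensemble member as a separate task, rather than one monolithic processing step, is what gives the design its stated modularity: individual methods can be scaled, replaced, or rerun independently within the containerized deployment.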