Method of Automated Detection of Article Terms Using a Decision Tree

Herald of Khmelnytskyi National University. Technical sciences. Pub Date: 2023-04-27. DOI: 10.31891/2307-5732-2023-319-1-338-343

A. Synko, P. Zhezhnych


Abstract

The number of users of virtual communities grows every day, and with it the volume of data generated as they communicate. Posted data can be valuable because it reflects not only the manufacturer's opinion of a product but also consumers' experience with it. However, virtual communities are weakly structured as sources of information and lean toward entertaining content: they may contain data that carries no meaningful information, and not all users apply techniques that would make their posts easier to find. As a result, searching for target data takes considerable time. To improve data search, the article proposes a method that analyzes the content of posts and identifies keywords from a given subject area. The method is automated and relies on a previously developed dictionary of key phrases or regular expressions, each with a weighting coefficient that indicates how strongly it belongs to a particular term. A decision tree is built for each term and determines the weight of that term with respect to the content of a post or article. The level at which a post sits in the discussion is also taken into account, because a discussion is a set of chronologically ordered posts: posts at higher levels receive higher coefficients in the calculation, and posts at lower levels receive lower ones. The key phrases identified for the specified term are sorted in descending order of weight. At each level of the tree, the weights of the key phrases must sum to one. The data from the virtual communities was downloaded using a data consolidation technique, which led to the concept of a consolidated data store that collects data from disparate sources. The paper presents the weight calculation for one term on part of a post from the CodeProject community.
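
The weighting scheme described in the abstract can be illustrated with a rough Python sketch. The abstract does not give the actual formulas, so everything specific here is an assumption made for illustration: the dictionary entries in `PHRASES`, the level coefficient `1 / level`, and the aggregation (a coefficient-weighted average of per-post scores). The sketch also does not build the per-term decision tree itself; it only shows a flat, single-level version of the idea, with weights that sum to one and matches sorted by descending weight.

```python
import re
from dataclasses import dataclass


@dataclass
class Post:
    level: int  # 1 = top-level post, larger = deeper in the discussion (assumed convention)
    text: str


# Hypothetical dictionary for one term: regex pattern -> weighting coefficient.
# The weights sum to one, mirroring the constraint stated for each tree level.
PHRASES = {
    r"\bdecision\s+trees?\b": 0.5,
    r"\bclassif(?:y|ier|ication)\b": 0.3,
    r"\bentropy\b": 0.2,
}


def check_weights(phrases, tol=1e-9):
    """Raise if the phrase weights do not sum to one."""
    total = sum(phrases.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"phrase weights sum to {total}, expected 1.0")


def level_coefficient(level):
    """Assumed coefficient: posts at higher levels (smaller numbers) count more."""
    return 1.0 / level


def matched_phrases(text, phrases):
    """Return the (pattern, weight) pairs found in the text, heaviest first."""
    hits = [(p, w) for p, w in phrases.items() if re.search(p, text, re.IGNORECASE)]
    return sorted(hits, key=lambda pw: pw[1], reverse=True)


def term_weight(posts, phrases):
    """Coefficient-weighted average of per-post phrase scores (assumed aggregation)."""
    check_weights(phrases)
    numerator = 0.0
    denominator = 0.0
    for post in posts:
        coeff = level_coefficient(post.level)
        score = sum(w for _, w in matched_phrases(post.text, phrases))
        numerator += coeff * score
        denominator += coeff
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    discussion = [
        Post(level=1, text="A decision tree classifier for forum posts"),
        Post(level=2, text="Thanks, that helped a lot!"),
    ]
    print(term_weight(discussion, PHRASES))
```

Under these assumptions, the top-level post matches two phrases and the reply matches none, so the reply lowers the term's overall weight only in proportion to its smaller level coefficient.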
