Application of Thematic Modeling Methods in Text Topic Recognition Tasks to Detect Telephone Fraud

Программные системы и вычислительные методы Pub Date : 2022-03-01 DOI:10.7256/2454-0714.2022.3.38770

E. Pleshakova, S. T. Gataullin, A. V. Osipov, E. V. Romanova, Anna Sergeevna Marun'ko



Abstract

The Internet has emerged as a powerful infrastructure for worldwide communication and human interaction. Unethical uses of this technology (spam, phishing, trolling, cyberbullying, viruses) have complicated the development of mechanisms that guarantee affordable and safe access to it. Many studies currently address the detection of spam and phishing. Detecting telephone fraud has become critically important because such fraud causes huge losses. Machine learning and natural language processing algorithms are used to analyze large volumes of text data. Fraudsters can be identified through text mining, which can be implemented by analyzing the terms in a word or phrase. One of the difficult tasks is dividing this large body of unstructured data into clusters. Several topic modeling methods serve this purpose. This article presents the application of these methods, in particular LDA, LSI and NMF. A data set was assembled. A preliminary analysis of the data was carried out, and features were constructed for the models in the task of recognizing the topic of a text. Approaches to keyword extraction in text topic recognition tasks are considered, and their key concepts are given. The disadvantages of these models are shown, and directions for improving text processing algorithms are proposed. The quality of the models was evaluated. The models were improved by selecting hyperparameters and changing the data preprocessing function.
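
As an illustration of one of the methods named in the abstract, the following is a minimal sketch of NMF topic extraction. It is not the authors' implementation: the abstract does not state which library, data set, preprocessing, or hyperparameters were used, so the toy transcripts, the stop-word list, the number of topics (`k=2`), and all function names here are assumptions made for the example. The factorization uses the standard multiplicative update rules for the squared-error objective, written in plain Python so the example has no dependencies.

```python
import random
from collections import Counter

# Hypothetical toy call transcripts; NOT the data set described in the abstract.
DOCS = [
    "your bank card is blocked confirm the card code",
    "bank security service confirm your card code now",
    "delivery of your order arrives tomorrow by courier",
    "courier will call before delivery of the order",
]
# Assumed stop-word list standing in for the preprocessing step.
STOPWORDS = {"your", "is", "the", "of", "by", "will", "now", "before"}


def tokenize(text):
    """Lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def build_matrix(docs):
    """Return (vocabulary, document-term count matrix as a list of rows)."""
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    rows = []
    for doc in tokens:
        counts = Counter(doc)
        rows.append([float(counts[w]) for w in vocab])
    return vocab, rows


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) as W (docs x k) times H (k x terms), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    initial = frob_error(v, w, h)
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h, initial, frob_error(v, w, h)


def top_terms(h, vocab, n=3):
    """For each topic (row of H), list the n highest-weighted terms."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]


vocab, V = build_matrix(DOCS)
W, H, err_before, err_after = nmf(V, k=2)
topics = top_terms(H, vocab)
for i, terms in enumerate(topics):
    print(f"topic {i}: {', '.join(terms)}")
```

Each row of `H` weights the vocabulary for one topic, and each row of `W` gives a document's mix of topics. The number of topics and the stop-word list are the kind of hyperparameter and preprocessing choices the abstract says were tuned, although the values used here are arbitrary.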

