A new anomalous text detection approach using unsupervised methods

Impact factor: 0.6 | JCR quartile: Q4 | Category: Engineering, Electrical & Electronic
Elham Amouee, Morteza Zanjireh Mohammadi, Mahdi Bahaghighat, Mohsen Ghorbani
Journal: Facta Universitatis-Series Electronics and Energetics, volume 54 1 (as listed), pages 631-653
Published: 2020-10-08
Publication type: Journal Article
DOI: 10.2298/fuee2004631a (https://doi.org/10.2298/fuee2004631a)
Source platform: Semantic Scholar
Citations: 5

Abstract

A new anomalous text detection approach using unsupervised methods
The growing volume of text data in databases requires appropriate classification and analysis in order to acquire knowledge and improve the quality of decision-making in organizations. Data mining, the process of discovering hidden patterns in a data set, requires access to quality data in order to produce valid results. Detecting and removing anomalous data is one of the pre-processing and data-cleaning steps in this process. Methods for anomalous data detection are generally classified into three groups: supervised, semi-supervised, and unsupervised. This research offers an unsupervised approach for spotting anomalous data in text collections. The proposed method combines two approaches, clustering-based and distance-based, to detect anomalies in text data. To evaluate its efficiency, the method is applied to four labeled data sets. The accuracy of the Naïve Bayes and decision tree classification algorithms is compared before and after removal of anomalous data with the proposed method and with some other methods, such as density-based spatial clustering of applications with noise (DBSCAN). With the proposed method, an accuracy of more than 92.39% can be achieved. In general, the results show that the proposed method performs well in most cases.
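The abstract does not say which clustering algorithm or distance criterion the authors use, so the following is only a minimal sketch of the general idea of combining a clustering step with a distance step. It assumes TF-IDF vectors, k-means under cosine distance, and a global cutoff of mean plus `z` standard deviations on each document's distance to its cluster centroid. All function names, the toy corpus, and the parameters `k` and `z` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Build L2-normalised TF-IDF vectors (as dicts) from whitespace-tokenised text."""
    tokenised = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenised:
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in Counter(tokens).items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse vectors."""
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    return 1.0 - dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Average a non-empty list of sparse vectors term by term."""
    total = defaultdict(float)
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    return {term: w / len(vectors) for term, w in total.items()}


def kmeans(vectors, k, max_iter=20):
    """Clustering step: k-means with deterministic farthest-point seeding (an assumption)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        far = max(range(len(vectors)),
                  key=lambda i: min(cosine_distance(vectors[i], c) for c in centroids))
        centroids.append(vectors[far])
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: cosine_distance(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = mean_vector(members)
    return labels, centroids


def flag_anomalies(docs, k=2, z=2.0):
    """Distance step: flag documents unusually far from their own cluster centroid."""
    vectors = tfidf_vectors(docs)
    labels, centroids = kmeans(vectors, k)
    scores = [cosine_distance(v, centroids[lab]) for v, lab in zip(vectors, labels)]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    threshold = mean + z * std
    return [i for i, s in enumerate(scores) if s > threshold], scores


# Hypothetical toy corpus: two topics plus one unrelated document at index 8.
docs = [
    "stock market prices rise", "stock market prices fall",
    "stock market shares rise", "stock market shares fall",
    "football team wins match", "football team loses match",
    "football team wins cup", "football team loses cup",
    "banana bread recipe oven",
]
flagged, scores = flag_anomalies(docs, k=2, z=2.0)
print(flagged)
```

In an evaluation like the one the abstract describes, the flagged documents would be removed from a labeled data set and a classifier such as Naïve Bayes or a decision tree would be trained with and without them to compare accuracy. The choice of `k` and `z` here is arbitrary and would need tuning on real data.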
Source journal: Facta Universitatis-Series Electronics and Energetics (Engineering, Electrical & Electronic)
Self-citation rate: 16.70%
Articles published: 10
Review time: 20 weeks