Wadi' Hijawi, Hossam Faris, Ja'far Alqatawna, Ala’ M. Al-Zoubi, Ibrahim Aljarah
Improving email spam detection using content based feature engineering approach
2017 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT)
Published: 2017-10-01
DOI: 10.1109/AEECT.2017.8257764 (https://doi.org/10.1109/AEECT.2017.8257764)
Citations: 27
Abstract
Recently, a wide range of Machine Learning (ML) algorithms have been proposed for building email spam detection models. However, the performance of ML methods depends heavily on the extracted features. In this paper, we discuss the most influential spam features reported in the literature. We also describe the development and implementation of an open source tool that provides a flexible way to extract a large number of features from any email corpus and produce a cleansed dataset that can be used to train and test various classification algorithms. A total of 140 features are extracted from the SpamAssassin email corpus using the developed tool. The extracted features are used to evaluate four popular ML classifiers, and better results are achieved than those reported in a similar previous study.
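
To make the idea of content-based feature extraction concrete, the following is a minimal sketch in Python using only the standard library. It is not the tool described in the paper: the function name `extract_features` and the six features it computes are hypothetical illustrations, and the paper's actual set of 140 features is not listed in this abstract.

```python
import email
import re
from email import policy

# Simple pattern for http/https links; an assumption for this sketch only.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_features(raw_message: str) -> dict:
    """Map one raw email (headers + body) to a small numeric feature dict.

    The feature names are hypothetical; they only illustrate the kind of
    header- and body-based features a spam classifier could consume.
    """
    msg = email.message_from_string(raw_message, policy=policy.default)
    subject = msg["Subject"] or ""

    # Prefer the plain-text body and fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""

    # Count URLs, then remove them so their fragments are not counted as words.
    url_count = len(URL_PATTERN.findall(body))
    text = URL_PATTERN.sub(" ", body)
    words = re.findall(r"[A-Za-z']+", text)
    uppercase_words = sum(1 for w in words if w.isupper())

    return {
        "subject_length": len(subject),
        "subject_has_exclamation": int("!" in subject),
        "body_word_count": len(words),
        "body_url_count": url_count,
        "body_uppercase_word_ratio": uppercase_words / len(words) if words else 0.0,
        "body_is_html": int(
            body_part is not None and body_part.get_content_type() == "text/html"
        ),
    }
```

Applying such a function to every message in a corpus yields one row per email, which together with the spam or ham label forms a dataset suitable for training and testing classifiers.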