Exploring the effectiveness of word embedding based deep learning model for improving email classification

IF 1.7 · CAS Tier 4 (Computer Science) · JCR Q3 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
D. Asudani, N. K. Nagwani, Pradeep Singh
{"title":"Exploring the effectiveness of word embedding based deep learning model for improving email classification","authors":"D. Asudani, N. K. Nagwani, Pradeep Singh","doi":"10.1108/dta-07-2021-0191","DOIUrl":null,"url":null,"abstract":"PurposeClassifying emails as ham or spam based on their content is essential. Determining the semantic and syntactic meaning of words and putting them in a high-dimensional feature vector form for processing is the most difficult challenge in email categorization. The purpose of this paper is to examine the effectiveness of the pre-trained embedding model for the classification of emails using deep learning classifiers such as the long short-term memory (LSTM) model and convolutional neural network (CNN) model.Design/methodology/approachIn this paper, global vectors (GloVe) and Bidirectional Encoder Representations Transformers (BERT) pre-trained word embedding are used to identify relationships between words, which helps to classify emails into their relevant categories using machine learning and deep learning models. Two benchmark datasets, SpamAssassin and Enron, are used in the experimentation.FindingsIn the first set of experiments, machine learning classifiers, the support vector machine (SVM) model, perform better than other machine learning methodologies. The second set of experiments compares the deep learning model performance without embedding, GloVe and BERT embedding. The experiments show that GloVe embedding can be helpful for faster execution with better performance on large-sized datasets.Originality/valueThe experiment reveals that the CNN model with GloVe embedding gives slightly better accuracy than the model with BERT embedding and traditional machine learning algorithms to classify an email as ham or spam. 
It is concluded that the word embedding models improve email classifiers accuracy.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":null,"pages":null},"PeriodicalIF":1.7000,"publicationDate":"2022-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Technologies and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1108/dta-07-2021-0191","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 3

Abstract

Purpose: Classifying emails as ham or spam based on their content is essential. The most difficult challenge in email categorization is determining the semantic and syntactic meaning of words and representing them as high-dimensional feature vectors for processing. The purpose of this paper is to examine the effectiveness of pre-trained embedding models for the classification of emails using deep learning classifiers such as the long short-term memory (LSTM) model and the convolutional neural network (CNN) model.

Design/methodology/approach: In this paper, global vectors (GloVe) and Bidirectional Encoder Representations from Transformers (BERT) pre-trained word embeddings are used to identify relationships between words, which helps classify emails into their relevant categories using machine learning and deep learning models. Two benchmark datasets, SpamAssassin and Enron, are used in the experiments.

Findings: In the first set of experiments, among the machine learning classifiers, the support vector machine (SVM) model performs better than the other machine learning methods. The second set of experiments compares deep learning model performance with no embedding, with GloVe embedding and with BERT embedding. The experiments show that GloVe embedding can be helpful for faster execution with better performance on large datasets.

Originality/value: The experiments reveal that the CNN model with GloVe embedding gives slightly better accuracy than the model with BERT embedding and than traditional machine learning algorithms for classifying an email as ham or spam. It is concluded that word embedding models improve the accuracy of email classifiers.
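The first set of experiments compares classical machine learning classifiers on text features, with SVM performing best. A minimal sketch of that kind of pipeline, using scikit-learn with a tiny made-up corpus (the paper itself uses the SpamAssassin and Enron datasets, and its exact feature setup is not specified here):

```python
# Hedged sketch of a classical ML spam/ham baseline: TF-IDF features + linear SVM.
# The toy emails and labels below are illustrative only, not from the paper's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now click here",
    "limited offer cheap meds buy now",
    "meeting agenda for monday attached",
    "please review the quarterly report",
]
labels = ["spam", "spam", "ham", "ham"]

# TF-IDF turns each email into a high-dimensional sparse feature vector;
# LinearSVC then learns a separating hyperplane between the two classes.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(emails, labels)

print(clf.predict(["free prize offer click now"]))
```

The deep learning variants studied in the paper replace the TF-IDF step with pre-trained GloVe or BERT embeddings feeding a CNN or LSTM, so that the vector for each word carries semantic information learned from large external corpora rather than corpus-local term statistics.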
Source journal: Data Technologies and Applications (Social Sciences: Library and Information Sciences)
CiteScore: 3.80
Self-citation rate: 6.20%
Articles per year: 29
About the journal: Previously published as: Program. Online from: 2018. Subject Area: Information & Knowledge Management, Library Studies.