An Evaluation of SVM and Naive Bayes with SMOTE on Sentiment Analysis Data Set

2018 International Conference on Engineering, Applied Sciences, and Technology (ICEAST), Pub Date: 2018-07-01, DOI: 10.1109/ICEAST.2018.8434401

Andrew Christian Flores, Rogelyn I. Icoy, Christine F. Peña, Ken Gorro


Citations: 26

Abstract

Data classification is central to data mining and has motivated many machine learning studies on preprocessing and algorithmic techniques. Class imbalance is a problem in data classification in which one class of data outnumbers another. Sentiment analysis is the evaluation of written and spoken language to determine a person's expressions, sentiments, emotions, and attitudes, and its data are commonly used as datasets in machine learning. This study is a comparative analysis of two classifiers, each combined with the Synthetic Minority Over-sampling Technique (SMOTE): a Support Vector Machine (SVM) trained with Sequential Minimal Optimization (SMO), and Naive Bayes Multinomial (NBM). Both are applied to the same sentiment analysis datasets gathered by students of the University of San Carlos. Weka, a collection of machine learning algorithms for data mining with a graphical user interface (GUI), is used to preprocess and classify the datasets. The results show that 10-fold cross-validation provides better findings than a 70:30 split when testing SVM and NBM with SMOTE. However, the outcome also depends on how the datasets are preprocessed, especially when they contain noisy data.
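To make the oversampling step concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. It is not Weka's SMOTE filter and not the authors' pipeline; the function name `smote`, its parameters, and the two-dimensional toy points are assumptions made for illustration only (real sentiment data would be high-dimensional text feature vectors).

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples.

    Hypothetical helper for illustration; not Weka's implementation.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy data (assumed): 4 minority samples against a majority class of 10.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
majority_size = 10
new_points = smote(minority, majority_size - len(minority), k=3)
balanced_minority = minority + new_points
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating existing samples exactly.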


Source journal

2018 International Conference on Engineering, Applied Sciences, and Technology (ICEAST)

Self-citation rate

0.00%

Number of publications