Performance Comparison of Naïve Bayes and Complement Naïve Bayes Algorithms

2019 6th International Conference on Electrical and Electronics Engineering (ICEEE), Pub Date: 2019-04-01, DOI: 10.1109/ICEEE2019.2019.00033

Berna Şeref, E. Bostanci


Citations: 5

Abstract



Big data is commonly characterized by the three Vs: volume, velocity, and variety. Its size and complexity make it hard to analyze, store, and process, and traditional tools take too long to execute on it. Tools and libraries built for big data reduce this analysis and processing time. For example, Hadoop is an open-source library that enables distributed computing over large datasets, Mahout is a library for machine learning, Hive supports querying, and Kafka supports messaging. In this paper, Hadoop and Mahout are used to compare the performance of the Naïve Bayes and Complement Naïve Bayes algorithms in terms of the average percentage of correctly classified instances, average training time, and average testing time across different dataset sizes. The "20 Newsgroups" dataset is used, and its size is increased by scaling it by factors of 2, 4, and 8, which yields datasets of size 37692, 75384, and 150768. All experiments are run 10 times on every dataset with different smoothing, weight, and normalization parameters, and the results are then averaged. The experiments show that Naïve Bayes outperforms Complement Naïve Bayes on average training time, whereas Complement Naïve Bayes outperforms Naïve Bayes on the average percentage of correctly classified instances.
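To make the difference between the two classifiers concrete, here is a minimal pure-Python sketch. It is not the authors' Hadoop/Mahout setup and does not reproduce their results. The toy documents, the function names, and the additive smoothing value `alpha=1.0` are all assumptions made for illustration. Naïve Bayes scores a class from that class's own term counts plus a class prior. The complement variant, in its commonly described form, estimates term weights from every other class and picks the class whose complement fits the document worst. The last lines only check the dataset-size arithmetic stated in the abstract.

```python
import math
from collections import Counter, defaultdict


def train_counts(docs, labels):
    """Collect per-class term counts, per-class document counts, and the vocabulary."""
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    vocab = {term for term_counts in counts.values() for term in term_counts}
    return counts, Counter(labels), vocab


def predict_nb(tokens, counts, class_docs, vocab, alpha=1.0):
    """Multinomial Naive Bayes: choose the class with the highest log posterior."""
    total_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for label, term_counts in counts.items():
        denom = sum(term_counts.values()) + alpha * len(vocab)
        score = math.log(class_docs[label] / total_docs)
        for term in tokens:
            if term in vocab:  # ignore terms never seen in training
                score += math.log((term_counts[term] + alpha) / denom)
        if score > best_score:
            best, best_score = label, score
    return best


def predict_cnb(tokens, counts, vocab, alpha=1.0):
    """Complement Naive Bayes: weights come from all *other* classes;
    choose the class whose complement explains the document worst."""
    best, best_score = None, math.inf
    for label in counts:
        complement = Counter()
        for other, term_counts in counts.items():
            if other != label:
                complement.update(term_counts)
        denom = sum(complement.values()) + alpha * len(vocab)
        score = sum(
            math.log((complement[term] + alpha) / denom)
            for term in tokens
            if term in vocab
        )
        if score < best_score:
            best, best_score = label, score
    return best


# Hypothetical toy corpus standing in for newsgroup posts.
docs = [
    ["gpu", "driver", "card"],
    ["gpu", "card", "monitor"],
    ["hockey", "goal", "team"],
    ["team", "season", "goal"],
    ["orbit", "launch", "nasa"],
    ["nasa", "orbit", "shuttle"],
]
labels = ["graphics", "graphics", "hockey", "hockey", "space", "space"]

counts, class_docs, vocab = train_counts(docs, labels)
query = ["goal", "team", "hockey"]
nb_label = predict_nb(query, counts, class_docs, vocab)
cnb_label = predict_cnb(query, counts, vocab)

# Sizes quoted in the abstract: the implied base size times 2, 4 and 8.
base_size = 37692 // 2
scaled_sizes = [base_size * factor for factor in (2, 4, 8)]
```

On this tiny, balanced corpus both rules agree. The complement estimate is usually motivated by skewed class sizes, where a class with few training documents has poorly estimated term weights of its own.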

