Data Mining Library for Big Data Processing Platforms: A Case Study-Sparkling Water Platform

2018 3rd International Conference on Computer Science and Engineering (UBMK) Pub Date : 2018-09-01 DOI:10.1109/UBMK.2018.8566278

Elif Cansu Yıldız, M. Aktaş, O. Kalipsiz, Alper Nebi Kanlı, Umut Orçun Turgut


Citations: 2

Abstract

Nowadays, very large amounts of data are obtained from millions of websites, applications, social media resources, surveys, video surveillance platforms, and many other sources. By processing the large datasets generated every day, useful information can be derived. Distributed data processing platforms are needed to handle such large amounts of data. For big data processing and analytics platforms such as Hadoop and Spark, there are machine learning libraries that operate in a distributed manner and exploit the advantages of distributed computing. For example, the Mahout library uses the Hadoop platform, while the Spark MLlib library uses the Spark platform. However, for these platforms, it seems that the algorithms belonging to the data mining steps are either not implemented at all or implemented for only some of the steps. Within the scope of this research, algorithms from different data mining steps are implemented on a big data platform and their performance is evaluated. As a case study, the Sparkling Water platform was chosen as the big data processing platform. A banking data set was used to test the implemented data mining algorithms. A software layer containing all data mining steps was developed on the Sparkling Water platform, and a performance evaluation was conducted. The evaluation showed that distributed data processing delivered the expected performance improvement.
