Density based clustering and fuzzy clustering for efficient clustering of big data in Hadoop ecosystem

International Journal of Engineering in Computer Science, Pub Date: 2020-01-01, DOI: 10.33545/26633582.2020.v2.i1a.29

M. Prasanna

Citations: 0

Abstract

This paper proposes using parallel, distributed Hadoop MapReduce technology to cluster big data. MapReduce processes large datasets in parallel by running many task instances at once. The input is split into multiple chunks, and each chunk is handled by a mapper that emits intermediate key-value pairs. These pairs are distributed to reducers, which process them and produce the final output. This parallelism allows very large volumes of data to be processed. The paper describes two clustering algorithms, density-based clustering and fuzzy clustering. Neither algorithm on its own is efficient at grouping all similar data into a single cluster; each compromises by placing some slightly dissimilar data in different clusters. To implement the project, the author uses the National Climatic Data Center (NCDC) dataset, which contains climate information, and applies a hybrid clustering algorithm to find similar temperatures on different dates. From this dataset, the author takes the date and temperature values and passes the date to the mapper as the key and the temperature as the value, since a mapper always reads data as key-value pairs. In each record, the date is at positions 6 to 14 and the temperature at positions 39 to 45. The paper gives some example dataset values.
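The key-value flow described in the abstract can be illustrated with a minimal Python sketch. It is not the author's Hadoop implementation. It assumes that "6 to 14" and "39 to 45" are zero-based, half-open character ranges, which the abstract does not state. The helper `make_record` is hypothetical and builds synthetic lines, not real NCDC records. The reduce step simply averages the temperatures for each date as a placeholder for the hybrid clustering step, whose details the abstract does not give.

```python
from collections import defaultdict

# Assumption: the positions quoted in the abstract are treated as
# zero-based, half-open slices; the paper's indexing convention is not stated.
DATE_SLICE = slice(6, 14)
TEMP_SLICE = slice(39, 45)


def map_record(line):
    """Map step: emit one (date, temperature) pair from a fixed-width record."""
    date = line[DATE_SLICE].strip()
    temperature = float(line[TEMP_SLICE])
    return date, temperature


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_mean(key, values):
    """Reduce step (placeholder for clustering): mean temperature per date."""
    return key, sum(values) / len(values)


def make_record(date, temperature):
    """Hypothetical helper: build a synthetic fixed-width line (not NCDC data)."""
    return "x" * 6 + date + "x" * 25 + f"{temperature:6.1f}" + "x" * 5


# Synthetic input: two readings on one date, one reading on another.
records = [
    make_record("20200101", 21.5),
    make_record("20200101", 22.5),
    make_record("20200102", 18.0),
]
pairs = [map_record(r) for r in records]
grouped = shuffle(pairs)
results = dict(reduce_mean(k, v) for k, v in grouped.items())
print(results)
```

In a real Hadoop job the shuffle is performed by the framework, and the mapper and reducer run as separate tasks on different nodes; here all three stages run in one process only to show how the date becomes the key and the temperature the value.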
