Significance of hierarchical and partitioning based clustering in grouping aware data placement for data intensive applications

2017 National Conference on Parallel Computing Technologies (PARCOMPTECH). Pub Date: 2017-02-01. DOI: 10.1109/PARCOMPTECH.2017.8068331

S. Vengadeswaran, S. Balasundaram


Citations: 0

Abstract



Rapid growth in IT generates large volumes of data every day in domains such as social networks, health care, and government. These data are voluminous, varied, and growing at an unprecedented pace, which makes storage and computation a mammoth task. In general, the time taken to execute a query and return its results increases exponentially with the amount of data, leaving users with longer waits. This limitation has led to the use of Hadoop to analyze the data and gain insights from it. With its distributed processing capability, Hadoop is considered an efficient solution for query processing, but it has limitations when the data to be processed exhibit interest locality. It is generally observed that the data required for query execution follow a grouping behavior, in which only a part of the Big-Data set is used more often. Because the Hadoop default data placement strategy (HDDPS) does not consider such grouping behavior within the dataset, it performs inefficiently, with shortcomings such as fewer local map executions and longer query execution times. In this paper we therefore examine the significance of two of the most promising matrix clustering techniques, partitioning and hierarchical, in grouping-aware data placement for improved performance. Both clustering techniques are applied separately to the user history log to obtain independent data groupings. These groupings are interpreted and validated to extract the optimal data grouping for improved parallel execution. The proposed strategy was tested on a 15-node Hadoop cluster. The results show improved performance for Big-Data sets in a heterogeneous distributed environment: data locality improves by 25.75% and query execution time falls by 28% compared to HDDPS. Hierarchical matrix clustering also shows a marginal performance improvement over partitioning-based methods for queries exhibiting interest locality.
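
The abstract does not give the clustering or placement procedure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the user history log can be reduced to a list of block IDs read by each query, builds a block co-access matrix from that log, groups blocks with a plain average-linkage agglomerative (hierarchical) clustering, and then spreads each group's blocks over different nodes. The block names, node names, the `num_groups` parameter, and the round-robin placement policy are all hypothetical and chosen purely for illustration; a partitioning-based variant would replace `hierarchical_groups` with a different grouping function.

```python
from itertools import combinations


def co_access_matrix(query_log, blocks):
    """Count how often each pair of blocks is read by the same query."""
    index = {b: i for i, b in enumerate(blocks)}
    n = len(blocks)
    matrix = [[0] * n for _ in range(n)]
    for accessed in query_log:
        for a, b in combinations(sorted(set(accessed)), 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1
            matrix[j][i] += 1
    return matrix


def average_link(matrix, g1, g2):
    """Mean co-access count between two groups of block indices."""
    total = sum(matrix[i][j] for i in g1 for j in g2)
    return total / (len(g1) * len(g2))


def hierarchical_groups(matrix, num_groups):
    """Agglomerative clustering: repeatedly merge the two groups
    with the highest average co-access until num_groups remain."""
    groups = [[i] for i in range(len(matrix))]
    while len(groups) > num_groups:
        x, y = max(
            combinations(range(len(groups)), 2),
            key=lambda p: average_link(matrix, groups[p[0]], groups[p[1]]),
        )
        groups[x] = groups[x] + groups[y]  # x < y, so deleting y is safe
        del groups[y]
    return groups


def place_groups(groups, blocks, nodes):
    """Illustrative policy: spread each group's blocks round-robin over
    the nodes so a query touching one group can run in parallel."""
    placement = {}
    for group in groups:
        for offset, block_idx in enumerate(sorted(group)):
            placement[blocks[block_idx]] = nodes[offset % len(nodes)]
    return placement


# Hypothetical history log: each entry lists the blocks one query read.
blocks = ["b1", "b2", "b3", "b4", "b5", "b6"]
query_log = [
    ["b1", "b2", "b3"], ["b1", "b2"], ["b2", "b3"],
    ["b4", "b5"], ["b4", "b5", "b6"], ["b5", "b6"],
    ["b3", "b4"],
]
nodes = ["n1", "n2", "n3"]

matrix = co_access_matrix(query_log, blocks)
groups = hierarchical_groups(matrix, num_groups=2)
placement = place_groups(groups, blocks, nodes)

for group in groups:
    print([(blocks[i], placement[blocks[i]]) for i in sorted(group)])
```

In this toy log the two frequently co-accessed sets of blocks end up in separate groups, and no two blocks of the same group share a node. A real deployment would additionally have to respect HDFS replication and node capacity, which this sketch ignores.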

