Significance of hierarchical and partitioning based clustering in grouping aware data placement for data intensive applications
Authors: S. Vengadeswaran, S. Balasundaram
Published in: 2017 National Conference on Parallel Computing Technologies (PARCOMPTECH), 2017-02-01
DOI: 10.1109/PARCOMPTECH.2017.8068331 (https://doi.org/10.1109/PARCOMPTECH.2017.8068331)
Recent developments and exponential growth in IT generate large volumes of data every day in domains such as social networks, health care, and government. These data are voluminous, varied, and growing at an unprecedented pace, which makes storage and computation a mammoth task. In general, the time taken to execute a query and return its results increases exponentially with the amount of data, leading to longer waiting times for users. This processing limitation has led to the use of Hadoop to analyze the data and gain insights from it. With its distributed processing capability, Hadoop is considered an efficient solution for query processing, but it has limitations when the data to be processed exhibit interest locality. It is generally observed that the data required for query execution follow a grouping behavior, wherein only part of the Big-Data set is used frequently. Because the Hadoop default data placement strategy (HDDPS) does not consider such grouping behavior within the dataset, it performs inefficiently, with shortcomings such as fewer local map executions and longer query execution times. In this paper we therefore examine the significance of two promising matrix clustering techniques, partitioning and hierarchical, in grouping-aware data placement for improved performance. Both clustering techniques are applied separately to the user history log to obtain independent data groupings. These groupings are interpreted and validated to extract the optimal data grouping for improved parallel execution. The proposed strategy was tested on a 15-node Hadoop cluster. The results show improved performance for Big-Data sets in a heterogeneous distributed environment: data locality improves by 25.75% and query execution time drops by 28% compared to HDDPS. In addition, hierarchical matrix clustering shows marginally better performance than partitioning-based methods for queries exhibiting interest locality.
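
The idea of deriving data groupings from a user history log can be illustrated with a minimal Python sketch. It is not the authors' implementation: the log contents, the block names (`b1` to `b6`), the Jaccard distance, the average-linkage rule, and the k-medoids style partitioning are all assumptions chosen to keep the example small. Each log entry is taken to be the set of data blocks one query read; blocks that are read by the same queries end up in the same group.

```python
from itertools import combinations

# Hypothetical user history log: each entry is the set of blocks one query read.
HISTORY_LOG = [
    {"b1", "b2", "b3"},
    {"b1", "b2"},
    {"b2", "b3"},
    {"b4", "b5"},
    {"b4", "b5", "b6"},
    {"b5", "b6"},
]


def access_sets(log):
    """Map each block to the set of query indices that read it."""
    usage = {}
    for qid, blocks in enumerate(log):
        for block in blocks:
            usage.setdefault(block, set()).add(qid)
    return usage


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0 means two blocks are always read together."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def hierarchical_groups(usage, k):
    """Agglomerative clustering (average linkage, assumed) down to k groups."""
    clusters = [[b] for b in sorted(usage)]

    def linkage(c1, c2):
        total = sum(jaccard_distance(usage[x], usage[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(sorted(c) for c in clusters)


def partitioning_groups(usage, k, max_rounds=20):
    """k-medoids style partitioning (assumed) with farthest-first seeding."""
    blocks = sorted(usage)

    def dist(x, y):
        return jaccard_distance(usage[x], usage[y])

    medoids = [blocks[0]]
    while len(medoids) < k:
        medoids.append(max((b for b in blocks if b not in medoids),
                           key=lambda b: min(dist(b, m) for m in medoids)))
    groups = {}
    for _ in range(max_rounds):
        # Assign every block to its nearest medoid.
        groups = {m: [] for m in medoids}
        for b in blocks:
            groups[min(medoids, key=lambda m: dist(b, m))].append(b)
        # Re-pick each medoid as the member closest to the rest of its group.
        new_medoids = [min(members, key=lambda c: sum(dist(c, o) for o in members))
                       for members in groups.values() if members]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return sorted(sorted(g) for g in groups.values() if g)


usage = access_sets(HISTORY_LOG)
print(hierarchical_groups(usage, 2))
print(partitioning_groups(usage, 2))
```

On this toy log both approaches return the same two groups, `b1`-`b3` and `b4`-`b6`. In a grouping-aware placement scheme, the blocks of each group would then be spread across nodes so that a query touching one group can run its map tasks in parallel on local data. How the groupings are validated and mapped to nodes is not specified in the abstract and is left out of the sketch.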