Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, Walid G. Aref
{"title":"袋鼠:Hadoop中范围数据和范围查询的工作负载感知处理","authors":"Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, Walid G. Aref","doi":"10.1145/2835776.2835841","DOIUrl":null,"url":null,"abstract":"Despite the importance and widespread use of range data, e.g., time intervals, spatial ranges, etc., little attention has been devoted to study the processing and querying of range data in the context of big data. The main challenge relies in the nature of the traditional index structures e.g., B-Tree and R-Tree, being centralized by nature, and hence are almost crippled when deployed in a distributed environment. To address this challenge, this paper presents Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes the query execution time. Kangaroo is query workload-aware, i.e., results in partitioning layouts that minimize the query processing time of given query patterns. In this paper, we study the design challenges Kangaroo addresses in order to be deployed on top of a distributed file system, i.e., HDFS. We also study four different partitioning schemes that Kangaroo can support. With extensive experiments using real range data of more than one billion records and real query workload of more than 30,000 queries, we show that the partitioning schemes of Kangaroo can significantly reduce the I/O of range queries on range data.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"2 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop\",\"authors\":\"Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, Walid G. Aref\",\"doi\":\"10.1145/2835776.2835841\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Despite the importance and widespread use of range data, e.g., time intervals, spatial ranges, etc., little attention has been devoted to study the processing and querying of range data in the context of big data. The main challenge relies in the nature of the traditional index structures e.g., B-Tree and R-Tree, being centralized by nature, and hence are almost crippled when deployed in a distributed environment. To address this challenge, this paper presents Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes the query execution time. Kangaroo is query workload-aware, i.e., results in partitioning layouts that minimize the query processing time of given query patterns. In this paper, we study the design challenges Kangaroo addresses in order to be deployed on top of a distributed file system, i.e., HDFS. We also study four different partitioning schemes that Kangaroo can support. 
With extensive experiments using real range data of more than one billion records and real query workload of more than 30,000 queries, we show that the partitioning schemes of Kangaroo can significantly reduce the I/O of range queries on range data.\",\"PeriodicalId\":20567,\"journal\":{\"name\":\"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining\",\"volume\":\"2 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-02-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2835776.2835841\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2835776.2835841","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop
Despite the importance and widespread use of range data, e.g., time intervals, spatial ranges, etc., little attention has been devoted to studying the processing and querying of range data in the context of big data. The main challenge lies in the fact that traditional index structures, e.g., the B-Tree and the R-Tree, are centralized by nature and hence are almost crippled when deployed in a distributed environment. To address this challenge, this paper presents Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes the query execution time. Kangaroo is query-workload-aware, i.e., it produces partitioning layouts that minimize the query processing time for given query patterns. In this paper, we study the design challenges that Kangaroo addresses in order to be deployed on top of a distributed file system, namely HDFS. We also study four different partitioning schemes that Kangaroo can support. With extensive experiments using real range data of more than one billion records and a real query workload of more than 30,000 queries, we show that Kangaroo's partitioning schemes can significantly reduce the I/O of range queries over range data.
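To illustrate the core idea of workload-aware, non-overlapping partitioning described in the abstract, the following is a minimal Java sketch. It is not the actual Kangaroo algorithm or its HDFS integration; all class and method names (WorkloadAwarePartitionSketch, workloadAwareSplits, partitionsTouched) are hypothetical. It simply places partition boundaries at quantiles of the query-workload endpoints and then uses "number of partitions a query overlaps" as a crude proxy for the I/O that a range query would incur.

    import java.util.*;

    // Minimal illustrative sketch (not the actual Kangaroo algorithm):
    // choose non-overlapping partition boundaries for interval data based on
    // the query workload, then estimate I/O as partitions touched per query.
    public class WorkloadAwarePartitionSketch {

        // A half-open interval [start, end); used for both records and queries.
        record Interval(long start, long end) {
            boolean overlaps(long lo, long hi) { return start < hi && end > lo; }
        }

        // Pick k-1 split points from the quantiles of query boundaries so that
        // partition edges tend to align with where queries start and end.
        static long[] workloadAwareSplits(List<Interval> queries, int k) {
            List<Long> bounds = new ArrayList<>();
            for (Interval q : queries) { bounds.add(q.start()); bounds.add(q.end()); }
            Collections.sort(bounds);
            long[] splits = new long[k - 1];
            for (int i = 1; i < k; i++) {
                splits[i - 1] = bounds.get(i * bounds.size() / k);
            }
            return splits;
        }

        // I/O proxy: count how many partitions each query has to read.
        static long partitionsTouched(List<Interval> queries, long[] splits,
                                      long domainMin, long domainMax) {
            long touched = 0;
            for (Interval q : queries) {
                long lo = domainMin;
                for (int i = 0; i <= splits.length; i++) {
                    long hi = (i < splits.length) ? splits[i] : domainMax;
                    if (q.overlaps(lo, hi)) touched++;
                    lo = hi;
                }
            }
            return touched;
        }

        public static void main(String[] args) {
            // Toy workload over the domain [0, 100).
            List<Interval> queries = List.of(
                new Interval(10, 20), new Interval(15, 25), new Interval(90, 95));
            long[] splits = workloadAwareSplits(queries, 4);
            System.out.println("splits = " + Arrays.toString(splits));
            System.out.println("partitions touched = "
                + partitionsTouched(queries, splits, 0, 100));
        }
    }

In this toy example the split points land at 15, 25, and 90, so the three queries together touch only four partitions; a workload-oblivious layout (e.g., equal-width partitions) would typically force queries to read partitions containing no relevant data. The paper's four partitioning schemes pursue this same goal at HDFS block granularity and at billion-record scale.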