Francisco García-García, A. Corral, L. Iribarne, M. Vassilakopoulos
Journal: International Journal of General Systems, 52(1), pp. 206-250
Published: 2023-02-19
DOI: 10.1080/03081079.2023.2173750 (https://doi.org/10.1080/03081079.2023.2173750)
Category: Computer Science, Theory & Methods (JCR Q2); impact factor 2.4
Efficient distributed algorithms for distance join queries in Spark-based spatial analytics systems
Abstract:
Apache Sedona (formerly GeoSpark) is a new in-memory cluster computing system for processing large-scale spatial data, which extends the core of Apache Spark to support spatial datatypes, partitioning techniques, spatial indexes, and spatial operations (e.g. spatial range, nearest neighbor, and spatial join queries). However, it does not support distance-based join queries (DJQs), such as the k-nearest-neighbor join query (kNNJQ) or the k-closest-pairs query (kCPQ). Therefore, in this paper, we investigate how to design and implement efficient distributed DJQ algorithms in Apache Sedona, using the most appropriate spatial partitioning and other optimization techniques. The results of an extensive set of experiments with real-world datasets are presented, demonstrating that the proposed kNNJQ and kCPQ distributed algorithms are efficient, scalable, and robust in Apache Sedona. Finally, Sedona is also compared to other similar cluster computing systems, showing the best performance for kCPQ and competitive results for kNNJQ.
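To make the two query types named in the abstract concrete, here is a minimal single-machine sketch of their semantics. It is a brute-force illustration only, not the distributed algorithms proposed in the paper and not part of the Apache Sedona API; the function names `k_closest_pairs` and `knn_join`, the 2D point representation, and the use of Euclidean distance are all assumptions made for this example.

```python
import heapq
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # assumed 2D point representation


def k_closest_pairs(P: List[Point], Q: List[Point], k: int) -> List[Tuple[float, Point, Point]]:
    """kCPQ semantics: the k pairs (p, q), p in P and q in Q, with the smallest distance."""
    # Enumerate every cross pair lazily and keep only the k smallest by distance.
    pairs = ((math.dist(p, q), p, q) for p in P for q in Q)
    return heapq.nsmallest(k, pairs)


def knn_join(P: List[Point], Q: List[Point], k: int) -> Dict[Point, List[Point]]:
    """kNNJQ semantics: for each p in P, its k nearest neighbors in Q."""
    return {p: heapq.nsmallest(k, Q, key=lambda q: math.dist(p, q)) for p in P}


if __name__ == "__main__":
    P = [(0.0, 0.0), (5.0, 5.0)]
    Q = [(1.0, 0.0), (5.0, 6.0), (10.0, 10.0)]
    print(k_closest_pairs(P, Q, 2))
    print(knn_join(P, Q, 1))
```

Both functions compare every point of P against every point of Q, which costs O(|P| * |Q|) distance computations. That quadratic cost is what makes spatial partitioning and pruning attractive for large datasets, which is the problem setting the abstract describes.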
About the journal:
International Journal of General Systems is a periodical devoted primarily to the publication of original research contributions to system science, basic as well as applied. However, relevant survey articles, invited book reviews, bibliographies, and letters to the editor are also published.
The principal aim of the journal is to promote original systems ideas (concepts, principles, methods, theoretical or experimental results, etc.) that are broadly applicable to various kinds of systems. The term “general system” in the name of the journal is intended to indicate this aim: the orientation to systems ideas that have general applicability. Typical subject areas covered by the journal include: uncertainty and randomness; fuzziness and imprecision; information; complexity; inductive and deductive reasoning about systems; learning; systems analysis and design; and theoretical as well as experimental knowledge regarding various categories of systems. Submitted research must be well presented and must clearly state the contribution and novelty. Manuscripts dealing with particular kinds of systems which lack general applicability across a broad range of systems should be sent to journals specializing in the respective topics.