Mohamed Ragab, Riccardo Tommasini, Sadiq Eyvazov, S. Sakr
Title: Towards making sense of Spark-SQL performance for processing vast distributed RDF datasets
Venue: Proceedings of the International Workshop on Semantic Big Data
Published: 2020-06-14
DOI: 10.1145/3391274.3393632 (https://doi.org/10.1145/3391274.3393632)
Citations: 5
Abstract
A wide range of Web applications (e.g., DBPedia, Uniprot, and Probase) are built on top of vast RDF knowledge bases and queried using the SPARQL query language. The continuous growth of these knowledge bases has led to the investigation of new paradigms and technologies for storing, accessing, and querying RDF data. In practice, modern big data systems (e.g., Hadoop, Spark) can handle vast relational repositories; however, their application in the Semantic Web context is still limited. One possible reason is that such frameworks rely on distributed systems, which are good for relational data, whereas their performance on graph data models like RDF has not yet been well studied. In this paper, we present a systematic evaluation of the performance of the SparkSQL engine for processing SPARQL queries. We conduct it using three relevant RDF relational schemas and two different storage backends, namely Hive and HDFS. In addition, we show the impact of using three different RDF-based partitioning techniques with our relational scenario. Finally, we discuss the results of our experiments: (i) we present insights about the trade-offs that characterize different experimental configurations, and (ii) we identify the best and the worst configurations for the SP2Bench benchmark scenario.
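To make "processing SPARQL queries over a relational schema" concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not name its three schemas; the sketch assumes the simplest commonly used layout, a single table of (subject, predicate, object) triples, in which each triple pattern of a SPARQL query becomes one self-join. Python's built-in sqlite3 module is used as a stand-in for Spark-SQL so that the example runs without a cluster. The table name, column names, prefixes, and data are all hypothetical.

```python
import sqlite3

# Hypothetical toy data: (subject, predicate, object) triples.
TRIPLES = [
    ("article1", "rdf:type", "bench:Article"),
    ("article1", "dc:title", "Querying RDF with SQL"),
    ("article1", "dc:creator", "person1"),
    ("person1", "foaf:name", "Alice"),
    ("article2", "rdf:type", "bench:Article"),
    ("article2", "dc:title", "Partitioning graphs"),
]

# Illustrative SPARQL query being translated:
#   SELECT ?title ?name WHERE {
#     ?a rdf:type   bench:Article .
#     ?a dc:title   ?title .
#     ?a dc:creator ?p .
#     ?p foaf:name  ?name }
# Each triple pattern maps to one alias of the triples table;
# shared variables (?a, ?p) become join conditions.
SQL = """
SELECT t2.o AS title, t4.o AS name
FROM triples t1
JOIN triples t2 ON t2.s = t1.s AND t2.p = 'dc:title'
JOIN triples t3 ON t3.s = t1.s AND t3.p = 'dc:creator'
JOIN triples t4 ON t4.s = t3.o AND t4.p = 'foaf:name'
WHERE t1.p = 'rdf:type' AND t1.o = 'bench:Article'
"""


def run_query(triples):
    """Load triples into an in-memory table and run the translated query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    rows = conn.execute(SQL).fetchall()
    conn.close()
    return rows


RESULT = run_query(TRIPLES)
print(RESULT)
```

Only article1 is returned, because article2 has no dc:creator triple and the inner joins drop it. The number of self-joins grows with the number of triple patterns, which is one reason the choice of relational schema and partitioning can matter for performance.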