Linked data partitioning for RDF processing on Apache Spark
Amir Hossein Atashkar, Nasser Ghadiri, Mehdi Joodaki
2017 3th International Conference on Web Research (ICWR), 2017-04-01
DOI: 10.1109/ICWR.2017.7959308 (https://doi.org/10.1109/ICWR.2017.7959308)
Citations: 6
Abstract
RDF models are widely used in the web of data because of their flexibility and their similarity to graph patterns. As the use of RDF grows, so do the volume and content of RDF datasets. Processing such massive amounts of data on a single machine is not efficient enough, because of long response times and limited hardware resources. A common approach to overcoming this limitation is cluster processing, and huge datasets can benefit from distributed cluster processing on Apache Hadoop. However, because Hadoop relies heavily on hard disks, its processing time is usually inadequate. In this paper, we propose a partitioning approach based on Apache Spark for rapid processing of RDF data models. A key feature of Apache Spark is that it uses main memory instead of hard disk, which improves the data processing speed of our method. We evaluated the proposed method by running SQL queries on RDF data partitioned across the cluster, and the results demonstrate improved performance.
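The abstract does not say which partitioning scheme the paper uses, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a simple hash-by-subject scheme, uses plain Python lists to stand in for in-memory cluster partitions, and works on a made-up toy dataset. The names `TRIPLES`, `partition_by_subject`, and `select_by_predicate` are hypothetical.

```python
import zlib
from collections import defaultdict

# Hypothetical toy data: RDF triples as (subject, predicate, object).
TRIPLES = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:carol", "foaf:name", "Carol"),
]


def partition_by_subject(triples, num_partitions):
    """Assign each triple to a partition using a stable hash of its subject.

    Assumption: hash-by-subject is one common scheme; the paper may use
    a different one. crc32 is used because Python's built-in hash() of a
    string changes between interpreter runs.
    """
    partitions = defaultdict(list)
    for s, p, o in triples:
        idx = zlib.crc32(s.encode("utf-8")) % num_partitions
        partitions[idx].append((s, p, o))
    return dict(partitions)


def select_by_predicate(partitions, predicate):
    """Scan each partition independently, then merge the matching triples.

    Roughly analogous to: SELECT * FROM triples WHERE predicate = ...
    """
    matches = []
    for idx in sorted(partitions):
        matches.extend(t for t in partitions[idx] if t[1] == predicate)
    return matches


partitions = partition_by_subject(TRIPLES, num_partitions=3)
knows = select_by_predicate(partitions, "foaf:knows")
```

Because all triples with the same subject land in the same partition, a query that touches a single subject can be answered from one partition, while a predicate filter like the one above scans the partitions independently and merges the results.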