Fast SPARQL join processing between distributed streams and stored RDF graphs using Bloom filters

2018 12th International Conference on Research Challenges in Information Science (RCIS). Pub Date: 2018-05-29. DOI: 10.1109/RCIS.2018.8406674

Amadou Fall Dia, Zakia Kazi-Aoul, Aliou Boly, Elisabeth Métais


Citations: 1

Abstract




The growth of real-time data generation and of stored data keeps the three V's of big data (volume, velocity and variety) in constant focus. Existing RDF Stream Processing (RSP) systems have addressed the variety challenge by defining a common model for producing, transmitting and continuously querying data in RDF. On the volume and velocity side, the performance of RSP systems still needs to improve, particularly for joins between stored and streaming RDF graphs. Stored RDF data are very important in a streaming context (related ontologies, summarized RDF data, RDF data that do not evolve or evolve only very slowly over time, etc.), but existing RSP systems such as C-SPARQL, CQELS, SPARQLstream, EP-SPARQL and Sparkwave use non-optimized, non-scalable approaches to join stored and dynamic RDF data. Indeed, these systems need to read the entire local or remote stored RDF data set while RDF data streams continuously arrive and must be processed in near real time. The resulting latency may degrade continuous processing and often causes multiple network bottlenecks in a distributed environment. It also makes it impractical to refresh or update the stored contents. This paper proposes an approach for distributed real-time joins between stored and streaming RDF graphs using Bloom filters. The join procedure speeds up processing by greatly reducing intermediate results, storing indices in memory, and precomputing query partitions according to the SPARQL query variable(s) picked between the two kinds of RDF data (stored and streaming). Experimental evaluation results confirm the performance gains of our approach, which significantly speeds up query processing compared to the techniques of current RSP systems.
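The abstract does not give implementation details, so the following is only a minimal single-process sketch of the general idea of a Bloom-filter-assisted join, not the authors' method. It assumes triples are plain `(subject, predicate, object)` tuples and that the join key is the subject. The names `BloomFilter`, `build_index` and `join_stream` are hypothetical, and the distribution and query-partitioning aspects of the paper are not modelled. The filter summarises the join keys of the stored graph, so a streaming triple whose key is definitely absent can be dropped before any lookup, which is one way intermediate results can be reduced.

```python
import hashlib
import math


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not the paper's code)."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing formulas: m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.
        n = max(1, expected_items)
        self.num_bits = max(8, math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def build_index(stored_triples, key_position=0):
    """Index stored triples by join key and summarise the keys in a Bloom filter."""
    index = {}
    for triple in stored_triples:
        index.setdefault(triple[key_position], []).append(triple)
    bloom = BloomFilter(expected_items=len(index))
    for key in index:
        bloom.add(key)
    return bloom, index


def join_stream(stream_triples, bloom, index, key_position=0):
    """Yield (stream_triple, stored_triple) pairs that share the join key."""
    for triple in stream_triples:
        key = triple[key_position]
        if key not in bloom:
            continue  # definite miss: discard without touching the index
        for stored in index.get(key, []):  # exact lookup removes false positives
            yield triple, stored


# Made-up example data: stored sensor locations joined with streamed readings.
stored = [
    ("ex:sensor1", "ex:locatedIn", "ex:roomA"),
    ("ex:sensor2", "ex:locatedIn", "ex:roomB"),
]
stream = [
    ("ex:sensor1", "ex:temperature", "21.5"),
    ("ex:sensor9", "ex:temperature", "19.0"),
]
bloom, index = build_index(stored)
results = list(join_stream(stream, bloom, index))
```

In this sketch the filter can report false positives but never false negatives, so the exact index lookup after the filter check keeps the join result correct. In a distributed setting, a compact filter of this kind could be shipped to the nodes that receive the stream instead of the full stored data set, though how the paper actually does this is not stated in the abstract.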
