Analyzing workload trends for boosting triple stores performance

IF 3 2区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems Pub Date : 2024-06-10 DOI:10.1016/j.is.2024.102420

Ahmed Al-Ghezi, Lena Wiese

{"title":"Analyzing workload trends for boosting triple stores performance","authors":"Ahmed Al-Ghezi, Lena Wiese","doi":"10.1016/j.is.2024.102420","DOIUrl":null,"url":null,"abstract":"<div><p>The Resource Description Framework (RDF) is widely used to model web data. The scale and complexity of the modeled data emphasized performance challenges on the RDF-triple stores. Workload adaption is one important strategy to deal with those challenges on the storage level. Current workload-adaption approaches lack the necessary generalization of the problem and only optimize part of the storage layer with the workload (mostly the replication). This creates a big performance gap within other data structures (e.g. indexes and cache) that could heavily benefit from the same workload adaption strategy. Moreover, the workload statistics are built collectively in most of the current approaches. Thus, the analysis process is unaware of whether workloads’ items are old or recent. However, that does not simulate the temporal trends that exist naturally in user queries which causes the analysis process to lag behind the rapid workload development. We present a novel universal adaption approach to the storage management of a distributed RDF store. The system aims to find optimal data assignments to the different indexes, replications, and join cache within the limited storage space. We present a cost model based on the workload that often contains frequent patterns. The workload is dynamically and continuously analyzed to evaluate predefined rules considering the benefits and costs of all options of assigning data to the storage structures. The objective is to reduce query execution time by letting different data containers compete on the limited storage space. By modeling the workload statistics as time series, we can apply well-known smoothing techniques allowing the importance of the workload to decay over time. That allows the universal adaption to stay tuned with potential changes in the workload trends.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"125 ","pages":"Article 102420"},"PeriodicalIF":3.0000,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0306437924000784/pdfft?md5=4a9d8f0acac2d10b05565ee129773c94&pid=1-s2.0-S0306437924000784-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306437924000784","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

The Resource Description Framework (RDF) is widely used to model web data. The scale and complexity of the modeled data emphasized performance challenges on the RDF-triple stores. Workload adaption is one important strategy to deal with those challenges on the storage level. Current workload-adaption approaches lack the necessary generalization of the problem and only optimize part of the storage layer with the workload (mostly the replication). This creates a big performance gap within other data structures (e.g. indexes and cache) that could heavily benefit from the same workload adaption strategy. Moreover, the workload statistics are built collectively in most of the current approaches. Thus, the analysis process is unaware of whether workloads’ items are old or recent. However, that does not simulate the temporal trends that exist naturally in user queries which causes the analysis process to lag behind the rapid workload development. We present a novel universal adaption approach to the storage management of a distributed RDF store. The system aims to find optimal data assignments to the different indexes, replications, and join cache within the limited storage space. We present a cost model based on the workload that often contains frequent patterns. The workload is dynamically and continuously analyzed to evaluate predefined rules considering the benefits and costs of all options of assigning data to the storage structures. The objective is to reduce query execution time by letting different data containers compete on the limited storage space. By modeling the workload statistics as time series, we can apply well-known smoothing techniques allowing the importance of the workload to decay over time. That allows the universal adaption to stay tuned with potential changes in the workload trends.

查看原文本刊更多论文

分析工作负载趋势，提高三重存储性能

资源描述框架（RDF）被广泛用于网络数据建模。建模数据的规模和复杂性给 RDF 三重存储带来了性能挑战。工作负载自适应是在存储层面应对这些挑战的重要策略之一。目前的工作负载适应方法缺乏对问题的必要概括，只能根据工作负载优化存储层的一部分（主要是复制）。这就在其他数据结构（如索引和高速缓存）中造成了巨大的性能差距，而这些数据结构可以从相同的工作负载适应策略中获益良多。此外，在当前的大多数方法中，工作负载统计数据都是集体建立的。因此，分析过程不知道工作负载的项目是新的还是旧的。然而，这并不能模拟用户查询中自然存在的时间趋势，从而导致分析流程落后于工作负载的快速发展。我们为分布式 RDF 存储的存储管理提出了一种新颖的通用适应方法。该系统的目标是在有限的存储空间内为不同的索引、复制和连接缓存找到最佳的数据分配。我们根据经常包含频繁模式的工作负载提出了一种成本模型。对工作负载进行动态和持续的分析，以评估预先定义的规则，同时考虑将数据分配到存储结构的所有选项的优势和成本。其目的是通过让不同的数据容器在有限的存储空间上竞争来缩短查询执行时间。通过将工作负载统计数据建模为时间序列，我们可以应用著名的平滑技术，使工作负载的重要性随时间衰减。这样，通用自适应就能根据工作负载趋势的潜在变化进行调整。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Information Systems 工程技术-计算机：信息系统

CiteScore

9.40

自引率

2.70%

发文量

112

审稿时长

53 days

期刊介绍： Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems such as new data models, performance enhancements, and show how those innovations contribute to the goals of the application.