Hybrid Cost Modeling for Reducing Query Performance Regression in Index Tuning
Author: Wentao Wu
Journal: IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 1, pp. 379-391 (Impact Factor 8.9; JCR Q1, Computer Science, Artificial Intelligence)
Published: 2024-10-22
DOI: 10.1109/TKDE.2024.3484954
URL: https://ieeexplore.ieee.org/document/10726868/
Citation count: 0
Abstract
Autonomous index tuning (“auto-indexing” for short) has recently started being supported by cloud database service providers. Index tuners rely on the query optimizer's cost estimates to recommend indexes that minimize the execution cost of an input workload. Such cost estimates are often erroneous, which can lead to significant query performance regression. To reduce the chance of regression, existing work primarily uses machine learning (ML) to build prediction models that improve query execution cost estimation, using actual query execution telemetry as training data. However, collecting training data is typically expensive, especially for index tuning, because of the significant overhead of creating and dropping indexes. As a result, the amount of training data available for auto-indexing in cloud databases can be limited. In this paper, we propose a new approach named “hybrid cost modeling” to address this challenge. The key idea is to limit the ML-based modeling effort to the leaf operators, such as table scans, index scans, and index seeks, and then combine the ML-model-predicted costs of the leaf operators with the optimizer's estimated costs of the other operators in the query plan. We conduct a theoretical study as well as an empirical evaluation to demonstrate the efficacy of applying hybrid cost modeling to index tuning, using both industrial benchmarks and real workloads.
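The key idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say how the two kinds of cost are combined, so the sketch assumes that each operator carries its own (non-cumulative) optimizer cost and that the plan cost is a plain sum over operators. The names `PlanNode`, `hybrid_cost`, `LEAF_OPERATORS`, and the stand-in leaf model are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Leaf operator types named in the abstract; the string labels are assumptions.
LEAF_OPERATORS = {"TableScan", "IndexScan", "IndexSeek"}


@dataclass
class PlanNode:
    op: str                 # operator type, e.g. "HashJoin" or "IndexSeek"
    optimizer_cost: float   # optimizer's estimate for this operator alone (assumed)
    children: List["PlanNode"] = field(default_factory=list)


def hybrid_cost(node: PlanNode,
                predict_leaf_cost: Callable[[PlanNode], float]) -> float:
    """Sum ML-predicted costs for leaf operators and optimizer costs elsewhere."""
    if node.op in LEAF_OPERATORS and not node.children:
        # Leaf operator: use the learned model's prediction.
        return predict_leaf_cost(node)
    # Non-leaf operator: keep the optimizer's estimate and recurse into inputs.
    return node.optimizer_cost + sum(
        hybrid_cost(child, predict_leaf_cost) for child in node.children
    )


# Stand-in for a trained model: here it simply scales the optimizer's estimate.
def stand_in_leaf_model(node: PlanNode) -> float:
    return node.optimizer_cost * 1.5


plan = PlanNode("HashJoin", 5.0, [
    PlanNode("IndexSeek", 2.0),
    PlanNode("TableScan", 10.0),
])

total = hybrid_cost(plan, stand_in_leaf_model)
print(total)  # 5.0 + 3.0 + 15.0 = 23.0
```

In this sketch the learned model only ever sees leaf operators, which reflects the abstract's motivation: with limited training data, the ML effort is confined to a small set of operator types while the optimizer's estimates are reused for the rest of the plan.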
About the journal:
The IEEE Transactions on Knowledge and Data Engineering covers knowledge and data engineering aspects of computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.