AutoPart: automating schema design for large scientific databases using data partitioning
Authors: Stratos Papadomanolakis, A. Ailamaki
Venue: Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.
Published: 2004-06-21
DOI: 10.1109/SSDBM.2004.19 (https://doi.org/10.1109/SSDBM.2004.19)
Citations: 154
Abstract
Database applications that use multi-terabyte datasets are becoming increasingly important for scientific fields such as astronomy and biology. Scientific databases are particularly suited for the application of automated physical design techniques, because of their data volume and the complexity of the scientific workloads. Current automated physical design tools focus on the selection of indexes and materialized views. In large-scale scientific databases, however, the data volume and the continuous insertion of new data allow for only limited indexes and materialized views. By contrast, data partitioning does not replicate data, thereby reducing space requirements and minimizing update overhead. In this paper we present AutoPart, an algorithm that automatically partitions database tables to optimize sequential access, assuming prior knowledge of a representative workload. The resulting schema is indexed using a fraction of the space required for indexing the original schema. To evaluate AutoPart we built an automated schema design tool that interfaces with commercial database systems. We experiment with AutoPart in the context of the Sloan Digital Sky Survey database, a real-world astronomical database, running on SQL Server 2000. Our experiments demonstrate the benefits of partitioning for large-scale systems: partitioning alone improves query execution performance by a factor of two on average. Combined with indexes, the new schema also outperforms the indexed original schema by 20% (for queries) and a factor of five (for updates), while using only half the original index space.
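To make the idea of workload-driven partitioning for sequential access concrete, the following is a minimal sketch, not the AutoPart algorithm itself. It assumes a simple column-wise (vertical) split in which columns accessed by exactly the same set of queries are stored together, and it compares the bytes a sequential scan would read per row with and without that split. The table, column names, column widths, and queries are all hypothetical, and the cost of joining fragments back into full rows is ignored.

```python
from collections import defaultdict


def group_columns_by_access(columns, workload):
    """Group columns that are accessed by exactly the same set of queries.

    `columns` is a list of column names; `workload` maps a query id to the
    columns that query reads. Both are hypothetical inputs for this sketch.
    """
    access = {col: set() for col in columns}
    for query_id, query_cols in workload.items():
        for col in query_cols:
            access[col].add(query_id)

    groups = defaultdict(list)
    for col, query_ids in access.items():
        groups[frozenset(query_ids)].append(col)

    # Sort for a deterministic result.
    return sorted(sorted(cols) for cols in groups.values())


def bytes_scanned_per_row(query_cols, fragments, widths):
    """Bytes read per row when a query scans every fragment it touches."""
    needed = set(query_cols)
    return sum(
        sum(widths[col] for col in fragment)
        for fragment in fragments
        if needed & set(fragment)
    )


# Hypothetical column widths in bytes.
widths = {"obj_id": 8, "ra": 8, "dec": 8, "mag_r": 4, "mag_g": 4, "flags": 8}

# Hypothetical representative workload.
workload = {
    "q1": ["obj_id", "ra", "dec"],
    "q2": ["obj_id", "mag_r", "mag_g"],
    "q3": ["ra", "dec"],
}

fragments = group_columns_by_access(list(widths), workload)
unpartitioned = [sorted(widths)]  # the whole table as a single fragment

for query_id, query_cols in workload.items():
    before = bytes_scanned_per_row(query_cols, unpartitioned, widths)
    after = bytes_scanned_per_row(query_cols, fragments, widths)
    print(f"{query_id}: {before} bytes/row unpartitioned, {after} bytes/row partitioned")
```

In this toy setting each query reads only the fragments containing columns it needs, which is the intuition behind optimizing sequential access without replicating data. A real design tool would also have to weigh the cost of reassembling rows from several fragments, which this sketch leaves out.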