Learned Indexes with Distribution Smoothing via Virtual Points
Kasun Amarasinghe, Farhana Choudhury, Jianzhong Qi, James Bailey
arXiv - CS - Databases, published 2024-08-12
https://doi.org/arxiv-2408.06134
Abstract
Recent research on learned indexes recasts indexes
as models that map keys to their storage locations. These learned
indexes approximate the cumulative distribution function (CDF) of the
key set, and a single model may have limited accuracy. To overcome
this limitation, a typical method is to use multiple models, arranged in a
hierarchical manner, where the query performance depends on two aspects: (i)
traversal time to find the correct model and (ii) search time to find the key
in the selected model. With such a method, key space regions that are
difficult to model may end up at deeper levels of the hierarchy. To address
this issue, we propose an alternative method that modifies the key space
rather than the index structure or the models. The method makes
the key set more learnable (i.e., it smooths the distribution) by
inserting virtual points. Further, we develop an algorithm named CSV to
integrate our virtual point insertion method into existing learned indexes,
reducing both their traversal and search time. We implement CSV on
state-of-the-art learned indexes and evaluate them on real-world datasets. The
experimental results show significant query performance improvements
for keys in the deeper levels of the index structures, at a low storage cost.
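
As a rough illustration of the smoothing idea only (this is not the CSV algorithm, whose details the abstract does not give), the Python sketch below fits a single linear model to the key-to-position mapping, then fills large gaps in the key set with virtual points and refits. The function names, the `max_gap` filling rule, and the toy key set are all assumptions made for this example.

```python
from bisect import bisect_left


def fit_linear(keys):
    """Least-squares line mapping each key to its position in the sorted list."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2
    var_x = sum((x - mean_x) ** 2 for x in keys)
    if var_x == 0:
        return 0.0, mean_y
    cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(keys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def max_rounded_error(keys, model):
    """Largest distance between the rounded prediction and the true position."""
    slope, intercept = model
    return max(abs(round(slope * x + intercept) - i) for i, x in enumerate(keys))


def add_virtual_points(keys, max_gap):
    """Naive smoothing (an assumption, not CSV): fill every gap wider than
    max_gap with virtual keys spaced max_gap apart.
    Returns a sorted list of (key, is_virtual) slots."""
    slots = []
    for prev, cur in zip(keys, keys[1:]):
        slots.append((prev, False))
        if cur - prev > max_gap:
            k = prev + max_gap
            while k < cur:
                slots.append((k, True))
                k += max_gap
    slots.append((keys[-1], False))
    return slots


def lookup(slots, model, err, key):
    """Predict a slot, then binary-search only the window guess +/- err.
    Returns the slot index of a real key, or None (also for virtual keys)."""
    slope, intercept = model
    guess = round(slope * key + intercept)
    guess = min(max(guess, 0), len(slots) - 1)  # clamp into the array
    lo = max(0, guess - err)
    hi = min(len(slots), guess + err + 1)
    window = [k for k, _ in slots[lo:hi]]
    i = bisect_left(window, key)
    if i < len(window) and window[i] == key and not slots[lo + i][1]:
        return lo + i
    return None


# Toy key set with one large gap, which a single line fits poorly.
keys = list(range(8)) + list(range(20, 24))
model_before = fit_linear(keys)
err_before = max_rounded_error(keys, model_before)

# Smooth the distribution, then refit on real and virtual keys together.
slots = add_virtual_points(keys, max_gap=1)
smoothed_keys = [k for k, _ in slots]
model_after = fit_linear(smoothed_keys)
err_after = max_rounded_error(smoothed_keys, model_after)

print("max error before:", err_before, "after:", err_after)
print("virtual points added:", len(slots) - len(keys))
```

The smaller error bound narrows the search window in `lookup`, and the price is the extra slots occupied by virtual points. This sketch fills gaps exhaustively, so it makes no attempt to keep that storage cost low.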