RUSLI: Real-time Updatable Spline Learned Index

Fourth Workshop in Exploiting AI Techniques for Data Management. Pub Date: 2021-06-20. DOI: 10.1145/3464509.3464886

Mayank Mishra, Rekha Singhal


Citations: 4

Abstract



Machine learning algorithms have accelerated data access through "learned indexes", where a set of data items is indexed by a model learned on pairs of a data key and the corresponding record's position in memory. Most learned indexes require retraining the model when new data is inserted into the data set. Retraining is expensive and takes as much time as the initial model training, so today learned indexes are updated by retraining on batch inserts to amortize the cost. However, real-time applications, such as data-driven recommendation applications, need to access the users' feature store in real time, both to read data of existing users and to add new users. This motivates us to present a real-time updatable spline learned index, RUSLI, which learns the distribution of data keys and their positions in memory through splines. We have extended RadixSpline [8] to build the updatable learned index, supporting real-time inserts into a data set without affecting the lookup time on the updated data set. We have shown that RUSLI can update the index in constant time with additional temporary memory of size proportional to the number of splines. We have discussed how to reduce the size of the presented index by using the distribution of spline keys while building the radix table. RUSLI is shown to incur 270 ns for lookup and 50 ns for insert operations. Further, we have shown that RUSLI supports concurrent lookup and insert operations with a throughput of 40 million ops/sec. We have presented and discussed performance numbers of RUSLI for single and concurrent inserts, lookups, and range queries on the SOSD [9] benchmark.
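To make the idea of a spline learned index concrete, the following is a minimal Python sketch of the read path only. It is not the RUSLI or RadixSpline implementation: the class name `SplineIndex`, the `step` parameter, and the choice of taking every `step`-th key as a spline knot are assumptions made for illustration, and a plain binary search over the knots stands in for the radix table. Inserts and concurrency, which are the paper's contribution, are not modeled. The sketch assumes a sorted list of at least two unique numeric keys.

```python
import bisect


class SplineIndex:
    """Toy spline learned index over a sorted list of unique numeric keys."""

    def __init__(self, keys, step=4):
        # Assumption: keys are sorted, unique, and there are at least two.
        self.keys = keys
        # Knots: every `step`-th (key, position) pair, plus the last pair.
        idx = list(range(0, len(keys), step))
        if idx[-1] != len(keys) - 1:
            idx.append(len(keys) - 1)
        self.knot_keys = [keys[i] for i in idx]
        self.knot_pos = idx
        # Largest gap between predicted and true position over all keys.
        self.max_err = max(abs(self._predict(k) - p) for p, k in enumerate(keys))

    def _predict(self, key):
        # Locate the spline segment around the key (stand-in for a radix table).
        r = bisect.bisect_right(self.knot_keys, key)
        r = min(max(r, 1), len(self.knot_keys) - 1)
        k0, k1 = self.knot_keys[r - 1], self.knot_keys[r]
        p0, p1 = self.knot_pos[r - 1], self.knot_pos[r]
        # Linear interpolation between the two knots.
        return p0 + (key - k0) * (p1 - p0) / (k1 - k0)

    def lookup(self, key):
        """Return the position of `key`, or None if it is absent."""
        n = len(self.keys)
        est = self._predict(key)
        # Search only the window allowed by the measured error bound.
        lo = min(max(0, int(est - self.max_err)), n)
        hi = max(lo, min(n, int(est + self.max_err) + 2))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else None
```

A lookup first predicts a position from the spline and then corrects it with a binary search restricted to the error window, so the cost of the final search depends on the model's error rather than on the size of the data set. For example, `SplineIndex([2, 3, 5, 8, 13, 21, 34, 55, 89, 144]).lookup(13)` predicts a position from the segment starting at the knot for key 13 and confirms it in the data array.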
