ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality

Proceedings. ACM-SIGMOD International Conference on Management of Data, pages 313-324. Pub Date: 2013-06-22. DOI: 10.1145/2463676.2465316

K. Whang, Tae-Seob Yun, Yeon-Mi Yeo, I. Song, Hyuk-Yoon Kwon, In-Joong Kim


Citations: 6

Abstract

Recently, parallel search engines have been implemented based on scalable distributed file systems such as Google File System. However, we claim that building a massively-parallel search engine using a parallel DBMS can be an attractive alternative since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system for easy and less error-prone application development while providing scalability. Regarding higher-level functionality, we can draw a parallel with the traditional O/S file system vs. DBMS. In this paper, we propose a new approach of building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS. To estimate the performance, we propose a hybrid (i.e., analytic and experimental) performance model for the parallel search engine. We argue that the model can accurately estimate the performance of a massively-parallel (e.g., 300-node) search engine using the experimental results obtained from a small-scale (e.g., 5-node) one. We show that the estimation error between the model and the actual experiment is less than 2.13% by observing that the bulk of the query processing time is spent at the slave (vs. at the master and network) and by estimating the time spent at the slave based on actual measurement. Using our model, we demonstrate a commercial-level scalability and performance of our architecture. Our proposed system ODYS is capable of handling 1 billion queries per day (81 queries/sec) for 30 billion Web pages by using only 43,472 nodes with an average query response time of 194 ms. By using twice as many (86,944) nodes, ODYS can provide an average query response time of 148 ms. These results show that building a massively-parallel search engine using a parallel DBMS is a viable approach with advantages of supporting the high-level (i.e., DBMS-level), SQL-like programming interface.
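The abstract does not spell out how 81 queries/sec relates to 1 billion queries per day, since a single stream at 81 queries/sec would serve only about 7 million queries per day. The sketch below checks the arithmetic under one reading, which is an assumption and not stated in the abstract: that 81 queries/sec is the throughput of one replicated set of nodes, and that the 43,472 nodes are divided evenly among such sets. All variable names are illustrative.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# ASSUMPTION (not stated in the abstract): 81 queries/sec is the rate
# handled by one replicated set of nodes, not by the whole system.

SECONDS_PER_DAY = 24 * 60 * 60

queries_per_day = 1_000_000_000   # from the abstract
per_set_rate = 81                 # queries/sec, from the abstract
total_nodes = 43_472              # from the abstract
doubled_nodes = 86_944            # from the abstract
latency_ms = 194                  # average response time at 43,472 nodes
latency_doubled_ms = 148          # average response time at 86,944 nodes

# Aggregate rate needed to serve 1 billion queries in a day.
aggregate_rate = queries_per_day / SECONDS_PER_DAY

# Number of sets implied if each set serves 81 queries/sec.
implied_sets = aggregate_rate / per_set_rate
sets = round(implied_sets)

# Nodes per set if the total is divided evenly among the sets.
nodes_per_set, remainder = divmod(total_nodes, sets)

# Relative latency improvement from doubling the node count.
latency_reduction = 1 - latency_doubled_ms / latency_ms

print(f"aggregate rate: {aggregate_rate:.1f} queries/sec")
print(f"implied sets:   {implied_sets:.2f} (rounded to {sets})")
print(f"nodes per set:  {nodes_per_set} (remainder {remainder})")
print(f"node ratio:     {doubled_nodes / total_nodes:.1f}x")
print(f"latency cut:    {latency_reduction:.1%}")
```

Under this reading the numbers are mutually consistent: roughly 11,574 queries/sec overall, about 143 sets, and 304 nodes per set with no remainder. Doubling the node count lowers the average response time by a little under a quarter, so the latency gain is clearly sublinear in the number of nodes.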

