Characterization of real workloads of web search engines

2011 IEEE International Symposium on Workload Characterization (IISWC). Pub Date: 2011-11-01. DOI: 10.1109/IISWC.2011.6114193

Huafeng Xi, Jianfeng Zhan, Zhen Jia, Xuehai Hong, Lei Wang, Lixin Zhang, Ninghui Sun, Gang Lu

Citations: 28

Abstract

Search is the most heavily used web application in the world and is still growing at an extraordinary rate. Understanding the behavior of web search engines is therefore increasingly important to the design and deployment of the data center systems that host them. In this paper, we study three search query traces collected from real-world web search engines operated by three different search service providers. The first part of our study uncovers the patterns hidden in the query traces by analyzing the variations, frequencies, and locality of query requests. Our analysis reveals that, contrary to some previous studies, real-world query traces do not follow well-defined probability models such as the Poisson and log-normal distributions. The second part of our study deploys the real query traces, along with three synthetic traces generated using probability models proposed by other researchers, on a Nutch-based search engine. The performance data measured from these deployments further confirm that synthetic traces do not accurately reflect the real traces. We develop an evaluation tool that collects performance metrics online with negligible overhead. The metrics include average response time, CPU utilization, disk accesses, and cycles per instruction. The third part of our study compares the search engine with representative benchmarks, namely Gridmix, SPECweb2005, TPC-C, SPECCPU2006, and HPCC, with respect to basic architecture-level characteristics and performance metrics such as instruction mix, processor pipeline stall breakdown, memory access latency, and disk accesses. The experimental results show that web search engines have a high percentage of load/store instructions but good cache/memory performance. We hope the results presented in this paper will give system designers insight into optimizing systems that host search engines.
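
The claim that real query traces do not follow a Poisson model can be probed with a simple dispersion check. The sketch below is an illustration only and is not the authors' analysis method: the function name `index_of_dispersion`, the `bin_width` parameter, and the sample timestamps are all hypothetical. It bins request arrival times and computes the variance-to-mean ratio of the per-bin counts. For a homogeneous Poisson arrival process this ratio is close to 1, so a ratio well above 1 suggests bursty traffic that a Poisson model would understate.

```python
from collections import Counter
from statistics import mean, pvariance


def index_of_dispersion(arrival_times, bin_width=1.0):
    """Variance-to-mean ratio of per-bin request counts.

    arrival_times: request timestamps in seconds (hypothetical input format).
    bin_width: width of each counting window in seconds.
    """
    if not arrival_times:
        raise ValueError("need at least one arrival time")
    start = min(arrival_times)
    # Count how many requests fall into each fixed-width window.
    counts = Counter(int((t - start) // bin_width) for t in arrival_times)
    n_bins = max(counts) + 1
    # Include empty windows, which matter for the variance.
    per_bin = [counts.get(i, 0) for i in range(n_bins)]
    return pvariance(per_bin) / mean(per_bin)


# Made-up timestamps: evenly spaced requests versus one burst and a straggler.
regular = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
bursty = [0.0, 0.1, 0.2, 0.3, 5.0]
print(index_of_dispersion(regular))
print(index_of_dispersion(bursty))
```

A ratio near 1 does not by itself establish that a trace is Poisson; a fuller analysis would also apply a goodness-of-fit test to the inter-arrival times.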
