Scrub:用于大型关键任务应用程序的在线故障排除

Proceedings of the Thirteenth EuroSys Conference Pub Date : 2018-04-23 DOI:10.1145/3190508.3190513

A. Satish, Thomas Shiou, Chuck Zhang, Khaled Elmeleegy, W. Zwaenepoel

{"title":"Scrub:用于大型关键任务应用程序的在线故障排除","authors":"A. Satish, Thomas Shiou, Chuck Zhang, Khaled Elmeleegy, W. Zwaenepoel","doi":"10.1145/3190508.3190513","DOIUrl":null,"url":null,"abstract":"Scrub is a troubleshooting tool for distributed applications that operate under strict SLOs common in production environments. It allows users to formulate queries on events occurring during execution in order to assess the correctness of the application's operation. Scrub has been in use for two years at Turn, where developers and users have relied on it to resolve numerous issues in its online advertisement bidding platform. This platform spans thousands of machines across the globe, serving several million bid requests per second, and dispensing many millions of dollars in advertising budgets. Troubleshooting distributed applications is notoriously hard, and its difficulty is exacerbated by the presence of strict SLOs, which requires the troubleshooting tool to have only minimal impact on the hosts running the application. Furthermore, with large amounts of money at stake, users expect to be able to run frequent diagnostics and demand quick evaluation and remediation of any problems. These constraints have led to a number of design and implementation decisions, that go counter to conventional wisdom. In particular, Scrub supports only a restricted form of joins. Its query execution strategy eschews imposing any overhead on the application hosts. In particular, joins, group-by operations and aggregations are sent to a dedicated centralized facility. In terms of implementation, Scrub avoids the overhead and security concerns of dynamic instrumentation. Finally, at all levels of the system, accuracy is traded for minimal impact on the hosts. We present the design and implementation of Scrub and contrast its choices to those made in earlier systems. We illustrate its power by describing a number of use cases, and we demonstrate its negligible overhead on the underlying application. On average, we observe a maximum CPU overhead of up to 2.5% on application hosts and a 1% increase in request latency. These overheads allow the advertisement bidding platform to operate well within its SLOs.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Scrub: online troubleshooting for large mission-critical applications\",\"authors\":\"A. Satish, Thomas Shiou, Chuck Zhang, Khaled Elmeleegy, W. Zwaenepoel\",\"doi\":\"10.1145/3190508.3190513\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Scrub is a troubleshooting tool for distributed applications that operate under strict SLOs common in production environments. It allows users to formulate queries on events occurring during execution in order to assess the correctness of the application's operation. Scrub has been in use for two years at Turn, where developers and users have relied on it to resolve numerous issues in its online advertisement bidding platform. This platform spans thousands of machines across the globe, serving several million bid requests per second, and dispensing many millions of dollars in advertising budgets. Troubleshooting distributed applications is notoriously hard, and its difficulty is exacerbated by the presence of strict SLOs, which requires the troubleshooting tool to have only minimal impact on the hosts running the application. Furthermore, with large amounts of money at stake, users expect to be able to run frequent diagnostics and demand quick evaluation and remediation of any problems. These constraints have led to a number of design and implementation decisions, that go counter to conventional wisdom. In particular, Scrub supports only a restricted form of joins. Its query execution strategy eschews imposing any overhead on the application hosts. In particular, joins, group-by operations and aggregations are sent to a dedicated centralized facility. In terms of implementation, Scrub avoids the overhead and security concerns of dynamic instrumentation. Finally, at all levels of the system, accuracy is traded for minimal impact on the hosts. We present the design and implementation of Scrub and contrast its choices to those made in earlier systems. We illustrate its power by describing a number of use cases, and we demonstrate its negligible overhead on the underlying application. On average, we observe a maximum CPU overhead of up to 2.5% on application hosts and a 1% increase in request latency. These overheads allow the advertisement bidding platform to operate well within its SLOs.\",\"PeriodicalId\":334267,\"journal\":{\"name\":\"Proceedings of the Thirteenth EuroSys Conference\",\"volume\":\"18 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-04-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Thirteenth EuroSys Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3190508.3190513\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Thirteenth EuroSys Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3190508.3190513","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

Scrub是一种故障排除工具，适用于在生产环境中常见的严格slo下运行的分布式应用程序。它允许用户对执行期间发生的事件制定查询，以便评估应用程序操作的正确性。Scrub已经在Turn公司使用了两年，开发商和用户都依靠它来解决在线广告竞价平台上的许多问题。这个平台横跨全球数千台机器，每秒处理数百万个投标请求，并分配数百万美元的广告预算。对分布式应用程序进行故障排除是非常困难的，严格的slo的存在加剧了其难度，这要求故障排除工具对运行应用程序的主机只产生最小的影响。此外，由于涉及大量资金，用户希望能够频繁地进行诊断，并要求对任何问题进行快速评估和补救。这些限制导致了许多与传统智慧相反的设计和实现决策。特别是，Scrub只支持受限制形式的连接。它的查询执行策略避免对应用程序主机施加任何开销。特别是，连接、分组操作和聚合被发送到专用的集中式设施。在实现方面，Scrub避免了动态检测的开销和安全问题。最后，在系统的所有级别上，准确性是为了对主机的影响最小。我们介绍了Scrub的设计和实现，并将其选择与早期系统中的选择进行了对比。我们通过描述一些用例来说明它的强大功能，并说明它对底层应用程序的开销可以忽略不计。平均而言，我们观察到应用程序主机上的最大CPU开销高达2.5%，请求延迟增加1%。这些管理费用使广告投标平台能够在其slo范围内良好运作。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Scrub: online troubleshooting for large mission-critical applications

Scrub is a troubleshooting tool for distributed applications that operate under strict SLOs common in production environments. It allows users to formulate queries on events occurring during execution in order to assess the correctness of the application's operation. Scrub has been in use for two years at Turn, where developers and users have relied on it to resolve numerous issues in its online advertisement bidding platform. This platform spans thousands of machines across the globe, serving several million bid requests per second, and dispensing many millions of dollars in advertising budgets. Troubleshooting distributed applications is notoriously hard, and its difficulty is exacerbated by the presence of strict SLOs, which requires the troubleshooting tool to have only minimal impact on the hosts running the application. Furthermore, with large amounts of money at stake, users expect to be able to run frequent diagnostics and demand quick evaluation and remediation of any problems. These constraints have led to a number of design and implementation decisions, that go counter to conventional wisdom. In particular, Scrub supports only a restricted form of joins. Its query execution strategy eschews imposing any overhead on the application hosts. In particular, joins, group-by operations and aggregations are sent to a dedicated centralized facility. In terms of implementation, Scrub avoids the overhead and security concerns of dynamic instrumentation. Finally, at all levels of the system, accuracy is traded for minimal impact on the hosts. We present the design and implementation of Scrub and contrast its choices to those made in earlier systems. We illustrate its power by describing a number of use cases, and we demonstrate its negligible overhead on the underlying application. On average, we observe a maximum CPU overhead of up to 2.5% on application hosts and a 1% increase in request latency. These overheads allow the advertisement bidding platform to operate well within its SLOs.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the Thirteenth EuroSys Conference

自引率

0.00%

发文量