Efficient tracing and performance analysis for large distributed systems

2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems. Pub Date: 2009-12-28. DOI: 10.1109/MASCOT.2009.5366158

Eric Anderson, Christopher Hoover, Xiaozhou Li, Joseph A. Tucek

Citations: 18

Abstract

Distributed systems are notoriously difficult to implement and debug. One important tool for understanding the behavior of distributed systems is tracing. Unfortunately, effective tracing for modern distributed systems faces several challenges. First, many interesting behaviors in distributed systems occur only rarely, or only at full production scale. Hence we need tracing mechanisms which impose minimal overhead, in order to allow always-on tracing of production instances. Second, for high-speed systems, messages can be delivered in significantly less time than the error of traditional time synchronization techniques such as the Network Time Protocol (NTP), necessitating time adjustment techniques with much higher precision. Third, distributed systems today may generate millions of events per second systemwide, resulting in traces consisting of billions of events. Such large traces can overwhelm existing trace analysis tools. These challenges make effective tracing difficult. We present techniques that address these three challenges. Our contributions include 1) a low-overhead tracing mechanism, which allows tracing of large systems without impacting their behavior or performance (0.14 μs/event), 2) a post hoc technique for producing highly accurate time synchronization across hosts (within 10 μs, compared to between 100 μs and 2 ms for NTP), and 3) incremental data processing techniques which facilitate analyzing traces containing billions of trace points on desktop systems. We have successfully applied these techniques to two distributed systems, a cooperative caching system and a distributed storage system, and from our experience, we believe our techniques are applicable to other distributed systems.
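The abstract does not say how the post hoc time synchronization works, so the following is only an illustrative sketch of one generic way to align clocks after the fact, not the authors' method. It assumes that traces record a send timestamp on the sender's clock and a receive timestamp on the receiver's clock for messages in both directions between two hosts, and that network delay is never negative. Under those assumptions, the smallest observed timestamp difference in each direction bounds the clock offset. The function name `estimate_offset` and the input layout are hypothetical.

```python
def estimate_offset(a_to_b, b_to_a):
    """Estimate how far host B's clock is ahead of host A's clock.

    a_to_b: list of (send_time_on_A, recv_time_on_B) pairs.
    b_to_a: list of (send_time_on_B, recv_time_on_A) pairs.
    All timestamps share one unit (for example microseconds).

    Assumption (not from the paper): delay >= 0, so for A->B messages
    recv - send = delay + offset, and for B->A messages
    recv - send = delay - offset.
    """
    if not a_to_b or not b_to_a:
        raise ValueError("need at least one message in each direction")

    # A->B: offset <= recv - send for every message, so the minimum is
    # the tightest upper bound.
    upper = min(recv - send for send, recv in a_to_b)

    # B->A: -offset <= recv - send, so the negated minimum is the
    # tightest lower bound.
    lower = -min(recv - send for send, recv in b_to_a)

    offset = (lower + upper) / 2       # midpoint of the feasible interval
    uncertainty = (upper - lower) / 2  # half-width of that interval
    return offset, uncertainty


# Synthetic example: B's clock is 50 us ahead of A's; delays are 5 to 8 us.
a_to_b = [(100, 155), (200, 258), (300, 356)]
b_to_a = [(400, 355), (500, 457)]
offset, uncertainty = estimate_offset(a_to_b, b_to_a)
print(offset, uncertainty)  # 50.0 5.0
```

In this sketch the uncertainty shrinks as the fastest observed message in each direction approaches zero delay, which is why a scheme of this kind can in principle do better than the round-trip-based error of an online protocol when many messages are available after the fact.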
