tprof:通过结构化聚合和分布式系统跟踪的自动分析来进行性能分析

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) Pub Date : 2021-11-01 DOI:10.1145/3472883.3486994

Lexiang Huang, T. Zhu

{"title":"tprof:通过结构化聚合和分布式系统跟踪的自动分析来进行性能分析","authors":"Lexiang Huang, T. Zhu","doi":"10.1145/3472883.3486994","DOIUrl":null,"url":null,"abstract":"The traditional approach for performance debugging relies upon performance profilers (e.g., gprof, VTune) that provide average function runtime information. These aggregate statistics help identify slow regions affecting the entire workload, but they are ill-suited for identifying slow regions that only impact a fraction of the workload, such as tail latency effects. This paper takes a new approach to performance profiling by utilizing distributed tracing systems (e.g., Dapper, Zipkin, Jaeger). Since traces provide detailed timing information on a per-request basis, it is possible to group and aggregate tracing data in many different ways to identify the slow parts of the system. Our new approach to trace aggregation uses the structure embedded within traces to hierarchically group similar traces and calculate increasingly detailed aggregate statistics based on how the traces are grouped. We also develop an automated tool for analyzing the hierarchy of statistics to identify the most likely performance issues. Our case study across two complex distributed systems illustrates how our tool is able to find multiple performance issues that lead to 10x and 28x performance improvements in terms of average and tail latency, respectively. Our comparison with a state-of-the-art industry tool shows that our tool can pinpoint performance slowdowns more accurately than current approaches.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"88 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":"{\"title\":\"tprof: Performance profiling via structural aggregation and automated analysis of distributed systems traces\",\"authors\":\"Lexiang Huang, T. Zhu\",\"doi\":\"10.1145/3472883.3486994\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The traditional approach for performance debugging relies upon performance profilers (e.g., gprof, VTune) that provide average function runtime information. These aggregate statistics help identify slow regions affecting the entire workload, but they are ill-suited for identifying slow regions that only impact a fraction of the workload, such as tail latency effects. This paper takes a new approach to performance profiling by utilizing distributed tracing systems (e.g., Dapper, Zipkin, Jaeger). Since traces provide detailed timing information on a per-request basis, it is possible to group and aggregate tracing data in many different ways to identify the slow parts of the system. Our new approach to trace aggregation uses the structure embedded within traces to hierarchically group similar traces and calculate increasingly detailed aggregate statistics based on how the traces are grouped. We also develop an automated tool for analyzing the hierarchy of statistics to identify the most likely performance issues. Our case study across two complex distributed systems illustrates how our tool is able to find multiple performance issues that lead to 10x and 28x performance improvements in terms of average and tail latency, respectively. Our comparison with a state-of-the-art industry tool shows that our tool can pinpoint performance slowdowns more accurately than current approaches.\",\"PeriodicalId\":91949,\"journal\":{\"name\":\"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)\",\"volume\":\"88 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"16\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3472883.3486994\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3472883.3486994","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 16

摘要

传统的性能调试方法依赖于提供平均函数运行时信息的性能分析器(例如，gprof、VTune)。这些汇总统计信息有助于识别影响整个工作负载的缓慢区域，但它们不适合识别仅影响一小部分工作负载的缓慢区域，例如尾部延迟效应。本文采用了一种利用分布式跟踪系统(例如，Dapper, Zipkin, Jaeger)进行性能分析的新方法。由于跟踪提供了基于每个请求的详细计时信息，因此可以用许多不同的方式对跟踪数据进行分组和聚合，以识别系统中速度较慢的部分。我们的跟踪聚合新方法使用嵌入在跟踪中的结构分层地对相似的跟踪进行分组，并根据跟踪分组的方式计算越来越详细的聚合统计信息。我们还开发了一个自动化工具，用于分析统计数据的层次结构，以识别最可能出现的性能问题。我们跨两个复杂分布式系统的案例研究说明了我们的工具如何能够发现多个性能问题，从而分别在平均和尾延迟方面实现10倍和28倍的性能改进。我们与最先进的行业工具的比较表明，我们的工具可以比目前的方法更准确地查明性能下降。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

tprof: Performance profiling via structural aggregation and automated analysis of distributed systems traces

The traditional approach for performance debugging relies upon performance profilers (e.g., gprof, VTune) that provide average function runtime information. These aggregate statistics help identify slow regions affecting the entire workload, but they are ill-suited for identifying slow regions that only impact a fraction of the workload, such as tail latency effects. This paper takes a new approach to performance profiling by utilizing distributed tracing systems (e.g., Dapper, Zipkin, Jaeger). Since traces provide detailed timing information on a per-request basis, it is possible to group and aggregate tracing data in many different ways to identify the slow parts of the system. Our new approach to trace aggregation uses the structure embedded within traces to hierarchically group similar traces and calculate increasingly detailed aggregate statistics based on how the traces are grouped. We also develop an automated tool for analyzing the hierarchy of statistics to identify the most likely performance issues. Our case study across two complex distributed systems illustrates how our tool is able to find multiple performance issues that lead to 10x and 28x performance improvements in terms of average and tail latency, respectively. Our comparison with a state-of-the-art industry tool shows that our tool can pinpoint performance slowdowns more accurately than current approaches.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)

自引率

0.00%

发文量