tprof: Performance profiling via structural aggregation and automated analysis of distributed systems traces
Authors: Lexiang Huang, T. Zhu
Venue: Proceedings of the ACM Symposium on Cloud Computing (SoCC)
Published: 2021-11-01
DOI: https://doi.org/10.1145/3472883.3486994
The traditional approach for performance debugging relies upon performance profilers (e.g., gprof, VTune) that provide average function runtime information. These aggregate statistics help identify slow regions affecting the entire workload, but they are ill-suited for identifying slow regions that only impact a fraction of the workload, such as tail latency effects. This paper takes a new approach to performance profiling by utilizing distributed tracing systems (e.g., Dapper, Zipkin, Jaeger). Since traces provide detailed timing information on a per-request basis, it is possible to group and aggregate tracing data in many different ways to identify the slow parts of the system. Our new approach to trace aggregation uses the structure embedded within traces to hierarchically group similar traces and calculate increasingly detailed aggregate statistics based on how the traces are grouped. We also develop an automated tool for analyzing the hierarchy of statistics to identify the most likely performance issues. Our case study across two complex distributed systems illustrates how our tool is able to find multiple performance issues that lead to 10x and 28x performance improvements in terms of average and tail latency, respectively. Our comparison with a state-of-the-art industry tool shows that our tool can pinpoint performance slowdowns more accurately than current approaches.
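The hierarchical grouping idea in the abstract can be illustrated with a minimal sketch. This is not the tprof implementation: the `Span` class, the two grouping keys (root span name, then call-tree shape), the nearest-rank p99, and the sample traces are all assumptions made for illustration. The point it shows is that a coarse group's average can hide a slow subgroup that only appears once traces are split by structure.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class Span:
    """Hypothetical trace span: a name, a duration, and child spans."""
    name: str
    duration_ms: float
    children: List["Span"] = field(default_factory=list)


def shape(span):
    """Fine-grained key: the nested tree of span names, order preserved."""
    return (span.name, tuple(shape(child) for child in span.children))


def summarize(durations):
    """Count, mean, and nearest-rank 99th percentile of a list of durations."""
    ordered = sorted(durations)
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {"count": len(ordered), "mean": mean(ordered), "p99": ordered[rank - 1]}


def aggregate(traces, key_fns):
    """Group traces by the first key function, then recurse with the rest.

    Each level of the result holds statistics for a group plus the more
    detailed subgroups beneath it.
    """
    if not key_fns:
        return {}
    groups = defaultdict(list)
    for trace in traces:
        groups[key_fns[0](trace)].append(trace)
    return {
        key: {
            "stats": summarize([t.duration_ms for t in members]),
            "subgroups": aggregate(members, key_fns[1:]),
        }
        for key, members in groups.items()
    }


# Made-up traces: two hit only the cache, one also goes to the database.
traces = [
    Span("GET /item", 12.0, [Span("cache", 2.0)]),
    Span("GET /item", 14.0, [Span("cache", 3.0)]),
    Span("GET /item", 95.0, [Span("cache", 1.0), Span("db", 90.0)]),
]

# Level 1 groups by root span name, level 2 by full call-tree shape.
report = aggregate(traces, [lambda t: t.name, shape])

for key, group in report["GET /item"]["subgroups"].items():
    print(key, group["stats"])
```

At the first level all three requests fall into one group, and the slow path is only visible as a high p99. At the second level the trace that includes the `db` span forms its own group, which is the kind of increasingly detailed statistic the abstract describes.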