Chang Liu, Zhengong Cai, Bingshen Wang, Zhimin Tang, Jiaxu Liu
Title: A protocol-independent container network observability analysis system based on eBPF
Venue: 2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS)
Published: 2020-12-01
DOI: https://doi.org/10.1109/ICPADS51040.2020.00099
Citations: 17
Abstract
Technologies such as microservices, containerization, and Kubernetes make large-scale application delivery in cloud-native environments increasingly easy, but troubleshooting and fault localization across a massive number of applications are becoming increasingly complex. Data collected by mainstream sampling-based monitoring technologies cannot cover all anomalies, and the kernel's lack of observability makes it difficult to monitor finer-grained data in container environments such as the Kubernetes platform. In addition, most current solutions rely on tracing and application performance monitoring (APM) tools, but these restrict the languages an application can use and require intrusive changes to application code. Many scenarios call for more general network performance detection and diagnosis methods that do not intrude on the user application. In this paper, we propose introducing network monitoring at the kernel level, below the application, for Kubernetes clusters in the Alibaba container service. By using eBPF to nonintrusively collect the L7/L4 network protocol interaction information of user applications, the system achieves data collection at a throughput of more than 10M per second without modifying any kernel or application code, while the impact on applications is less than 1%. It also uses machine learning methods to analyze and diagnose application network performance problems, identify network performance bottlenecks, and locate the specific instances involved for different applications, thereby realizing protocol-independent localization and analysis of network performance problems.
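
The abstract does not describe how the collected data is processed, so the following is only an illustrative sketch of one way protocol-independent latency could be derived from kernel-level socket events. It assumes each event carries a timestamp, a connection identifier, and a direction. The `SocketEvent` type, its fields, and the `request_latencies` function are hypothetical names introduced here, not part of the paper's system. The idea is to treat the gap between the last outgoing message and the first reply on a connection as one request latency, without parsing the payload.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SocketEvent:
    """Hypothetical record such as an eBPF probe might emit per socket call."""
    ts_ns: int       # event timestamp in nanoseconds
    conn_id: str     # assumed connection key, e.g. a 5-tuple rendered as text
    direction: str   # "send" or "recv", from the client's point of view


def request_latencies(events):
    """Pair each run of sends with the first following recv on the same connection.

    Only the direction change is used, so no L7 protocol parsing is needed.
    Returns a dict mapping conn_id to a list of latencies in nanoseconds.
    """
    pending_send = {}              # conn_id -> timestamp of the latest unanswered send
    latencies = defaultdict(list)
    for ev in sorted(events, key=lambda e: e.ts_ns):
        if ev.direction == "send":
            # Keep the latest send of a multi-segment request.
            pending_send[ev.conn_id] = ev.ts_ns
        elif ev.direction == "recv" and ev.conn_id in pending_send:
            # First reply after a request closes the measurement.
            latencies[ev.conn_id].append(ev.ts_ns - pending_send.pop(ev.conn_id))
    return dict(latencies)
```

A recv with no preceding send on the same connection is ignored, and later segments of the same response do not produce extra samples. A real system would also need to handle pipelined requests and server-side sockets, which this sketch leaves out.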