Kangjin Wang, Chuanjia Hou, Ying Li, Yaoyong Dou, Cheng Wang, Yang Wen, Jie Yao, Liping Zhang
{"title":"Cache Antagonists Identification: A Practice from Alibaba Colocation Datacenter","authors":"Kangjin Wang, Chuanjia Hou, Ying Li, Yaoyong Dou, Cheng Wang, Yang Wen, Jie Yao, Liping Zhang","doi":"10.1109/ISSREW55968.2022.00031","DOIUrl":null,"url":null,"abstract":"Colocating latency-critical (LC) jobs and best-effort (BE) jobs on a host effectively improve resource efficiency in modern datacenters. But it increases resource contention between jobs, which seriously affects job performance. In Alibaba's real- world LC- BE colocation datacenters, we observed that cache is one of the most contended resources in the CPU. When cache contention occurs, identifying the antagonists that caused cache resource contention is the first step to mitigate cache contention, called cache antagonists identification (CAl). However, it is chal-lenging to identify cache antagonists because cache contention is difficult to observe and quantify. In this paper, we first propose cache usage graph (CUG) to finely characterize cache usage of jobs in the multiple CPU microarchitectural hierarchies and locations, and we provide a monitoring tool to collect “per-container-per-logic CPU” Ll/2/3 cache misses and build CUG. Then we propose a CUG-based CAl approach, $\\mu$ Tactic. $\\mu$ Tactic leverages machine learning models to quantify the cache contention on every cache hierarchy, then reasons out the cache antagonists with CUG. Experiments in production datacenters show that $\\mu$ Tactic has a high precision (85+%) and low cost (32 ms), which are better than state-of-the-art approaches.","PeriodicalId":178302,"journal":{"name":"2022 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSREW55968.2022.00031","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Colocating latency-critical (LC) jobs and best-effort (BE) jobs on a host effectively improve resource efficiency in modern datacenters. But it increases resource contention between jobs, which seriously affects job performance. In Alibaba's real- world LC- BE colocation datacenters, we observed that cache is one of the most contended resources in the CPU. When cache contention occurs, identifying the antagonists that caused cache resource contention is the first step to mitigate cache contention, called cache antagonists identification (CAl). However, it is chal-lenging to identify cache antagonists because cache contention is difficult to observe and quantify. In this paper, we first propose cache usage graph (CUG) to finely characterize cache usage of jobs in the multiple CPU microarchitectural hierarchies and locations, and we provide a monitoring tool to collect “per-container-per-logic CPU” Ll/2/3 cache misses and build CUG. Then we propose a CUG-based CAl approach, $\mu$ Tactic. $\mu$ Tactic leverages machine learning models to quantify the cache contention on every cache hierarchy, then reasons out the cache antagonists with CUG. Experiments in production datacenters show that $\mu$ Tactic has a high precision (85+%) and low cost (32 ms), which are better than state-of-the-art approaches.