Missing Data Recovery in Large-Scale, Sparse Datacenter Traces: An Alibaba Case Study
Yi Liang, Linfeng Bi, Xing Su
2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2019-05-14
DOI: 10.1109/CCGRID.2019.00039 (https://doi.org/10.1109/CCGRID.2019.00039)
Citations: 0
Abstract
Trace analysis plays an important role in datacenter performance optimization. However, because trace collection tasks are error-prone and run at low execution priority, modern datacenter traces suffer from serious missing-data problems. Previous work recovers trace data with statistical imputation methods. Such methods either fill the missing data with fixed values or require users to specify the relationship model among trace attributes, so they are neither feasible nor accurate in the face of two characteristics of datacenter traces with missing data: data sparsity and complex correlations among trace attributes. To this end, we focus on a trace released by Alibaba and propose a tensor-based trace data recovery model that enables efficient and accurate data recovery for large-scale, sparse datacenter traces. The proposed model consists of two main phases. First, data discretization and attribute selection methods work together to select the trace attributes that correlate strongly with the value-missing attribute. Then, a tensor is constructed and the missing values are recovered with a tensor completion method based on CANDECOMP/PARAFAC decomposition. The experimental results demonstrate that our model achieves higher accuracy than six statistical or machine learning-based methods.
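To make the second phase concrete, the following is a minimal sketch of CANDECOMP/PARAFAC (CP) based tensor completion on a small synthetic 3-way tensor. It is not the authors' implementation: the abstract does not specify the optimization procedure, so the sketch assumes a simple scheme that alternates least-squares factor updates with re-imputation of the missing cells. The function names, tensor dimensions, rank, and missing rate are all hypothetical choices for illustration.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product of a (I x R) and b (J x R), giving (I*J x R)."""
    rank = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, rank)


def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front, flatten the rest in C order."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def cp_reconstruct(factors):
    """Rebuild a 3-way tensor from its three CP factor matrices."""
    a, b, c = factors
    return np.einsum("ir,jr,kr->ijk", a, b, c)


def cp_complete(tensor, mask, rank=2, n_iter=200, seed=0):
    """Estimate missing cells of a 3-way tensor (assumed scheme, see lead-in).

    mask is True where a value was observed; unobserved cells may hold NaN.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, rank)) for n in tensor.shape]
    # Start by filling the missing cells with the mean of the observed ones.
    filled = np.where(mask, tensor, tensor[mask].mean())
    estimate = filled
    for _ in range(n_iter):
        for mode in range(3):
            first, second = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(first, second)
            gram = (first.T @ first) * (second.T @ second)
            # Least-squares update of one factor with the other two held fixed.
            factors[mode] = unfold(filled, mode) @ kr @ np.linalg.pinv(gram)
        estimate = cp_reconstruct(factors)
        # Keep observed values, replace missing ones with the current estimate.
        filled = np.where(mask, tensor, estimate)
    return estimate


# Synthetic rank-2 tensor standing in for a (hypothetical) trace tensor.
rng = np.random.default_rng(42)
truth = cp_reconstruct([rng.standard_normal((n, 2)) for n in (10, 9, 8)])
mask = rng.random(truth.shape) > 0.2  # roughly 20% of cells are missing
observed = np.where(mask, truth, np.nan)

recovered = cp_complete(observed, mask, rank=2)

missing = ~mask
cp_err = np.abs(recovered - truth)[missing].mean()
mean_err = np.abs(observed[mask].mean() - truth)[missing].mean()
print(f"mean abs error on missing cells: CP={cp_err:.4f}, mean-fill={mean_err:.4f}")
```

The comparison against mean filling mirrors the abstract's point that fixed-value imputation ignores correlations among attributes, whereas a low-rank CP model can exploit them. On real traces the tensor axes would come from the discretized attributes selected in the first phase, and the rank would need tuning.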