Missing Data Recovery in Large-Scale, Sparse Datacenter Traces: An Alibaba Case Study

Yi Liang, Linfeng Bi, Xing Su
{"title":"Missing Data Recovery in Large-Scale, Sparse Datacenter Traces: An Alibaba Case Study","authors":"Yi Liang, Linfeng Bi, Xing Su","doi":"10.1109/CCGRID.2019.00039","DOIUrl":null,"url":null,"abstract":"The trace analysis for datacenter holds a prominent importance for the datacenter performance optimization. However, due to the error and low execution priority of trace collection tasks, modern datacenter traces suffer from the serious data missing problem. Previous works handle the trace data recovery via the statistical imputation methods. However, such methods either recover the missing data with fixed values or require users to decide the relationship model among trace attributes, which are not feasible or accurate when dealing with the two missing data trends in datacenter traces: the data sparsity and the complex correlations among trace attributes. To this end, we focus on a trace released by Alibaba and propose a tensor-based trace data recovery model to facilitate the efficient and accurate data recovery for large-scale, sparse datacenter traces. The proposed model consists of two main phases. First, the data discretization and attribute selection methods work together to select the trace attributes with strong correlations with the value-missing attribute. Then, a tensor is constructed and the missing values are recovered by employing the CANDECOMP/PARAFAC decomposition-based tensor completion method. The experimental results demonstrate that our model achieves higher accuracy than six statistical or machine learning-based methods.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"122 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2019.00039","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Trace analysis plays a prominent role in datacenter performance optimization. However, due to errors in and the low execution priority of trace collection tasks, modern datacenter traces suffer from serious missing-data problems. Previous works handle trace data recovery with statistical imputation methods. However, such methods either fill the missing data with fixed values or require users to specify the relationship model among trace attributes, which is neither feasible nor accurate when dealing with the two characteristics of missing data in datacenter traces: data sparsity and complex correlations among trace attributes. To this end, we focus on a trace released by Alibaba and propose a tensor-based trace data recovery model that enables efficient and accurate data recovery for large-scale, sparse datacenter traces. The proposed model consists of two main phases. First, data discretization and attribute selection work together to select the trace attributes that are strongly correlated with the value-missing attribute. Then, a tensor is constructed and the missing values are recovered with a CANDECOMP/PARAFAC decomposition-based tensor completion method. Experimental results demonstrate that our model achieves higher accuracy than six statistical or machine learning-based methods.
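
The paper does not publish its implementation; the sketch below only illustrates the general idea behind the second phase, CANDECOMP/PARAFAC (CP) decomposition-based tensor completion, using a plain-NumPy, EM-style alternating least squares loop. The tensor layout (machine × time-slot × attribute), rank, iteration counts, and all function names are assumptions for illustration, not details from the paper.

```python
# Minimal, illustrative sketch of CP-based tensor completion (not the
# authors' code): fit a rank-R CP model with ALS, refill the missing
# cells from the reconstruction, and repeat. All parameters are assumed.
import numpy as np

def unfold(T, mode):
    """Matricize tensor T along `mode` (that mode indexes the rows)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(mats):
    """Column-wise Khatri-Rao product; earlier matrices vary slowest."""
    out = mats[0]
    for m in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, m).reshape(-1, out.shape[1])
    return out

def reconstruct(factors, shape):
    """Rebuild the full tensor from its CP factor matrices."""
    return khatri_rao(factors).sum(axis=1).reshape(shape)

def cp_als(T, rank, n_iter=30, seed=0):
    """Fit a rank-`rank` CP model to a fully observed tensor with ALS."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in T.shape]
    for _ in range(n_iter):
        for mode in range(T.ndim):
            others = [factors[m] for m in range(T.ndim) if m != mode]
            kr = khatri_rao(others)              # ordering matches unfold()
            sol, *_ = np.linalg.lstsq(kr, unfold(T, mode).T, rcond=None)
            factors[mode] = sol.T
    return factors

def cp_complete(T_obs, mask, rank=3, n_outer=15, n_als=10):
    """EM-style completion: fit CP on the filled tensor, refill missing cells."""
    T = T_obs.copy()
    T[~mask] = T_obs[mask].mean()                # initialize with observed mean
    for _ in range(n_outer):
        factors = cp_als(T, rank, n_iter=n_als)
        T[~mask] = reconstruct(factors, T.shape)[~mask]
    return T

# Toy usage: a (machine x time-slot x attribute) tensor with ~60% missing cells.
rng = np.random.default_rng(1)
A, B, C = rng.random((50, 3)), rng.random((40, 3)), rng.random((6, 3))
truth = reconstruct([A, B, C], (50, 40, 6))
mask = rng.random(truth.shape) > 0.6             # True = observed
recovered = cp_complete(np.where(mask, truth, 0.0), mask, rank=3)
err = np.linalg.norm((recovered - truth)[~mask]) / np.linalg.norm(truth[~mask])
print(f"relative error on missing cells: {err:.3f}")
```

A real deployment would choose the rank and stopping criteria by validating on held-out observed cells rather than using the fixed values assumed above.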