Missing Data Recovery in Large-Scale, Sparse Datacenter Traces: An Alibaba Case Study

Yi Liang, Linfeng Bi, Xing Su
{"title":"Missing Data Recovery in Large-Scale, Sparse Datacenter Traces: An Alibaba Case Study","authors":"Yi Liang, Linfeng Bi, Xing Su","doi":"10.1109/CCGRID.2019.00039","DOIUrl":null,"url":null,"abstract":"The trace analysis for datacenter holds a prominent importance for the datacenter performance optimization. However, due to the error and low execution priority of trace collection tasks, modern datacenter traces suffer from the serious data missing problem. Previous works handle the trace data recovery via the statistical imputation methods. However, such methods either recover the missing data with fixed values or require users to decide the relationship model among trace attributes, which are not feasible or accurate when dealing with the two missing data trends in datacenter traces: the data sparsity and the complex correlations among trace attributes. To this end, we focus on a trace released by Alibaba and propose a tensor-based trace data recovery model to facilitate the efficient and accurate data recovery for large-scale, sparse datacenter traces. The proposed model consists of two main phases. First, the data discretization and attribute selection methods work together to select the trace attributes with strong correlations with the value-missing attribute. Then, a tensor is constructed and the missing values are recovered by employing the CANDECOMP/PARAFAC decomposition-based tensor completion method. The experimental results demonstrate that our model achieves higher accuracy than six statistical or machine learning-based methods.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"122 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2019.00039","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Trace analysis plays a prominent role in datacenter performance optimization. However, due to errors in and the low execution priority of trace collection tasks, modern datacenter traces suffer from serious missing-data problems. Previous works handle trace data recovery with statistical imputation methods. However, such methods either fill the missing data with fixed values or require users to specify the relationship model among trace attributes, which is neither feasible nor accurate when dealing with the two characteristics of missing data in datacenter traces: data sparsity and complex correlations among trace attributes. To this end, we focus on a trace released by Alibaba and propose a tensor-based trace data recovery model that enables efficient and accurate data recovery for large-scale, sparse datacenter traces. The proposed model consists of two main phases. First, data discretization and attribute selection work together to select the trace attributes that are strongly correlated with the value-missing attribute. Then, a tensor is constructed and the missing values are recovered with a CANDECOMP/PARAFAC decomposition-based tensor completion method. Experimental results demonstrate that our model achieves higher accuracy than six statistical or machine learning-based methods.
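
The paper does not publish its implementation; the sketch below only illustrates the general idea behind the second phase, CANDECOMP/PARAFAC (CP) decomposition-based tensor completion, using a plain-NumPy, EM-style alternating least squares loop. The tensor layout (machine × time-slot × attribute), rank, iteration counts, and all function names are assumptions for illustration, not details from the paper.

```python
# Minimal, illustrative sketch of CP-based tensor completion (not the
# authors' code): fit a rank-R CP model with ALS, refill the missing
# cells from the reconstruction, and repeat. All parameters are assumed.
import numpy as np

def unfold(T, mode):
    """Matricize tensor T along `mode` (that mode indexes the rows)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(mats):
    """Column-wise Khatri-Rao product; earlier matrices vary slowest."""
    out = mats[0]
    for m in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, m).reshape(-1, out.shape[1])
    return out

def reconstruct(factors, shape):
    """Rebuild the full tensor from its CP factor matrices."""
    return khatri_rao(factors).sum(axis=1).reshape(shape)

def cp_als(T, rank, n_iter=30, seed=0):
    """Fit a rank-`rank` CP model to a fully observed tensor with ALS."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in T.shape]
    for _ in range(n_iter):
        for mode in range(T.ndim):
            others = [factors[m] for m in range(T.ndim) if m != mode]
            kr = khatri_rao(others)              # ordering matches unfold()
            sol, *_ = np.linalg.lstsq(kr, unfold(T, mode).T, rcond=None)
            factors[mode] = sol.T
    return factors

def cp_complete(T_obs, mask, rank=3, n_outer=15, n_als=10):
    """EM-style completion: fit CP on the filled tensor, refill missing cells."""
    T = T_obs.copy()
    T[~mask] = T_obs[mask].mean()                # initialize with observed mean
    for _ in range(n_outer):
        factors = cp_als(T, rank, n_iter=n_als)
        T[~mask] = reconstruct(factors, T.shape)[~mask]
    return T

# Toy usage: a (machine x time-slot x attribute) tensor with ~60% missing cells.
rng = np.random.default_rng(1)
A, B, C = rng.random((50, 3)), rng.random((40, 3)), rng.random((6, 3))
truth = reconstruct([A, B, C], (50, 40, 6))
mask = rng.random(truth.shape) > 0.6             # True = observed
recovered = cp_complete(np.where(mask, truth, 0.0), mask, rank=3)
err = np.linalg.norm((recovered - truth)[~mask]) / np.linalg.norm(truth[~mask])
print(f"relative error on missing cells: {err:.3f}")
```

A real deployment would choose the rank and stopping criteria by validating on held-out observed cells rather than using the fixed values assumed above.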