Efficient Re-Computation of Big Data Analytics Processes in the Presence of Changes: Computational Framework, Reference Architecture, and Applications

P. Missier, J. Cala
DOI: 10.1109/BigDataCongress.2019.00017
Published in: 2019 IEEE International Congress on Big Data (BigDataCongress)
Publication date: 2019-07-08
Citations: 1

Abstract

Insights generated from Big Data through analytics processes are often unstable over time and thus lose their value, as the analysis typically depends on elements that change and evolve dynamically. However, the cost of having to periodically "redo" computationally expensive data analytics is not normally taken into account when assessing the benefits of the outcomes. The ReComp project addresses the problem of efficiently re-computing, all or in part, outcomes from complex analytical processes in response to some of the changes that occur to process dependencies. While such dependencies may include application and system libraries, as well as the deployment environment, ReComp is focused exclusively on changes to reference datasets as well as to the original inputs. Our hypothesis is that an efficient re-computation strategy requires the ability to (i) observe and quantify data changes, (ii) estimate the impact of those changes on a population of prior outcomes, (iii) identify the minimal process fragments that can restore the currency of the impacted outcomes, and (iv) selectively drive their refresh. In this paper we present a generic framework that addresses these requirements, and show how it can be customised to operate on two case studies of very diverse domains, namely genomics and geosciences. We discuss lessons learnt and outline the next steps towards the ReComp vision.
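The four-step re-computation strategy the abstract outlines (observe and quantify changes, estimate their impact on prior outcomes, identify the minimal fragments to re-run, and selectively drive the refresh) can be illustrated with a minimal sketch. None of the names or data below come from the ReComp implementation; they are hypothetical, chosen only to make the control flow concrete.

```python
def diff(old, new):
    """(i) Observe and quantify a data change as the set of changed keys."""
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}

def impacted(outcomes, changed_keys):
    """(ii) Estimate impact: an outcome is stale if it depends on a changed key."""
    return [o for o in outcomes if o["depends_on"] & changed_keys]

def refresh(outcome, new_reference):
    """(iii)+(iv) Re-run only the minimal fragment for an impacted outcome.
    Here the 'process fragment' is a toy aggregation over the reference data."""
    outcome = dict(outcome)
    outcome["value"] = sum(new_reference[k] for k in outcome["depends_on"])
    return outcome

# Hypothetical reference dataset before and after a change (only geneB differs).
old_ref = {"geneA": 1, "geneB": 2, "geneC": 3}
new_ref = {"geneA": 1, "geneB": 5, "geneC": 3}

# Prior outcomes, each recording which reference entries it depends on.
outcomes = [
    {"id": "p1", "depends_on": {"geneA"}, "value": 1},
    {"id": "p2", "depends_on": {"geneB", "geneC"}, "value": 5},
]

changed = diff(old_ref, new_ref)                 # {"geneB"}
stale = impacted(outcomes, changed)              # only p2 depends on geneB
refreshed = [refresh(o, new_ref) for o in stale]
print(changed, [o["id"] for o in stale], refreshed[0]["value"])
```

Only `p2` is refreshed; `p1` is untouched because its dependencies did not change. The point of the sketch is the selectivity: the cost of the refresh scales with the impacted subset of outcomes, not with the full population.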