Building A Robust, Company-Wide Data Science Pipeline Using Programming Abstraction And Virtualization

N. Jones, K. Torbert
First EAGE/PESGB Workshop Machine Learning. Published 2018-11-30. DOI: https://doi.org/10.3997/2214-4609.201803030

Abstract

The oil and gas industry presents a challenging and exciting environment for data projects because the data collected vary so widely in size, complexity, format, type, and quality. This environment makes delivering and maintaining a data science pipeline, from source systems through to the end user, an enormous challenge in many companies (Scully et al., 2014). Many projects fail before any analytics can even be applied to the data because of difficulties handling legacy systems, data silos, complex dependencies between data sources, and more. In other cases, a data science project can advance in only one area or division of a company because of differences in data handling, even though it is broadly applicable across the company's assets. This presentation will discuss California Resources Corporation's new company-wide data analytics effort as a case study of how we have used technologies such as data virtualization (Van Der Lans, 2018) and programming architectural principles such as abstraction to tackle difficult data integration and data quality problems, and so construct a data science pipeline capable of delivering results company-wide. Many of these problems have frustrated multimillion-dollar attempts to address them in the recent past.
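
The abstract does not describe the implementation, so the following is only a minimal sketch of the general idea of combining abstraction with a virtualized view, not the authors' system. It assumes two hypothetical sources with different column names and units; every class name, field name, and the sample data are invented for illustration. Each adapter hides its source's quirks behind one interface, and the view reads from the sources on demand rather than copying them into a new store.

```python
from abc import ABC, abstractmethod
import csv
import io
from typing import Dict, Iterator, List

Record = Dict[str, object]


class DataSource(ABC):
    """Common interface every source adapter implements (hypothetical)."""

    @abstractmethod
    def records(self) -> Iterator[Record]:
        """Yield rows in the shared schema: well_id (str), oil_bbl (float)."""


class LegacyCsvSource(DataSource):
    """Adapter for a legacy CSV export with its own column names (assumed)."""

    def __init__(self, csv_text: str) -> None:
        self._csv_text = csv_text

    def records(self) -> Iterator[Record]:
        for row in csv.DictReader(io.StringIO(self._csv_text)):
            yield {"well_id": row["WELL"].strip(), "oil_bbl": float(row["OIL_BBL"])}


class DivisionTableSource(DataSource):
    """Adapter for a division that records volumes in cubic metres (assumed)."""

    BBL_PER_M3 = 6.2898  # approximate barrels per cubic metre

    def __init__(self, rows: List[Record]) -> None:
        self._rows = rows

    def records(self) -> Iterator[Record]:
        for row in self._rows:
            yield {
                "well_id": str(row["id"]),
                "oil_bbl": float(row["oil_m3"]) * self.BBL_PER_M3,
            }


class VirtualView:
    """Presents several sources as one logical table, read on demand."""

    def __init__(self, sources: List[DataSource]) -> None:
        self._sources = sources

    def records(self) -> Iterator[Record]:
        for source in self._sources:
            yield from source.records()


def total_oil_by_well(view: VirtualView) -> Dict[str, float]:
    """Downstream analytics see only the shared schema, never the sources."""
    totals: Dict[str, float] = {}
    for rec in view.records():
        well = str(rec["well_id"])
        totals[well] = totals.get(well, 0.0) + float(rec["oil_bbl"])
    return totals


if __name__ == "__main__":
    legacy = LegacyCsvSource("WELL,OIL_BBL\nA-1,100.0\nB-2,50.5\n")
    division = DivisionTableSource([{"id": "A-1", "oil_m3": 10.0}])
    print(total_oil_by_well(VirtualView([legacy, division])))
```

Under these assumptions, adding another division's data means writing one more adapter; the analytics function and the view stay unchanged, which is the kind of company-wide reuse the abstract argues for.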