{"title":"Building A Robust, Company-Wide Data Science Pipeline Using Programming Abstraction And Virtualization","authors":"N. Jones, K. Torbert","doi":"10.3997/2214-4609.201803030","DOIUrl":null,"url":null,"abstract":"The oil and gas industry presents a challenging and exciting environment for data projects due to the size, complexity, and variability in formatting, type, and quality of the data collected. This environment makes delivering and maintaining a data science pipeline from source systems through to the end user an enormous challenge in many companies (Scully et al. 2014). Many projects fail before any analytics can even be applied to the data due to difficulties handling legacy systems, data silos, complex dependencies between data sources, and more. In other cases, data science projects can only advance in one area or division of a company because of differences in data handling despite having broad applicability across the company’s assets. This presentation will discuss California Resources Corporation’s new company-wide data analytics effort as a case study of how we have used technologies like data virtualization (Van Der Lans, 2018) and programming architectural principles such as abstraction to tackle difficult data integration and data quality problems to construct a data science pipeline capable of delivering results company-wide. 
Many of these problems have frustrated multimillion-dollar attempts to address them in the recent past.","PeriodicalId":231338,"journal":{"name":"First EAGE/PESGB Workshop Machine Learning","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"First EAGE/PESGB Workshop Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3997/2214-4609.201803030","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The oil and gas industry is a challenging and exciting environment for data projects because the data collected vary widely in size, complexity, format, type, and quality. This makes delivering and maintaining a data science pipeline, from source systems through to the end user, an enormous challenge at many companies (Scully et al. 2014). Many projects fail before any analytics can even be applied to the data because of difficulties with legacy systems, data silos, complex dependencies between data sources, and more. In other cases, a data science project can advance in only one area or division of a company because of differences in data handling, even though it is broadly applicable across the company’s assets. This presentation discusses California Resources Corporation’s new company-wide data analytics effort as a case study. We show how we have used technologies such as data virtualization (Van Der Lans, 2018) and programming architectural principles such as abstraction to tackle difficult data integration and data quality problems and to construct a data science pipeline capable of delivering results company-wide. Many of these problems have frustrated multimillion-dollar attempts to address them in the recent past.
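The abstract does not describe the actual implementation, so the following is only a minimal sketch of what "abstraction" over differently formatted sources might look like. Every name here (`ProductionSource`, `LegacyCsvSource`, `DivisionDictSource`, `total_by_well`, and the column names `WELL_NO`, `OIL_BBL`, `id`, `volume`) is a hypothetical stand-in, not something taken from the presentation. The idea illustrated is that each source hides its own format behind one common interface, so downstream analytics code is written once.

```python
from abc import ABC, abstractmethod
import csv
import io
from typing import Dict, Iterable, Iterator, List


class ProductionSource(ABC):
    """Hypothetical common interface: each record has 'well_id' and 'barrels'."""

    @abstractmethod
    def records(self) -> Iterator[Dict[str, object]]:
        ...


class LegacyCsvSource(ProductionSource):
    """Assumed legacy export: CSV text with its own column names."""

    def __init__(self, csv_text: str) -> None:
        self._csv_text = csv_text

    def records(self) -> Iterator[Dict[str, object]]:
        for row in csv.DictReader(io.StringIO(self._csv_text)):
            # Normalize names and types to the shared record shape.
            yield {"well_id": row["WELL_NO"].strip(),
                   "barrels": float(row["OIL_BBL"])}


class DivisionDictSource(ProductionSource):
    """Assumed second source: rows already in memory, different field names."""

    def __init__(self, rows: List[dict]) -> None:
        self._rows = rows

    def records(self) -> Iterator[Dict[str, object]]:
        for row in self._rows:
            yield {"well_id": row["id"], "barrels": float(row["volume"])}


def total_by_well(sources: Iterable[ProductionSource]) -> Dict[str, float]:
    """Downstream step that depends only on the shared interface."""
    totals: Dict[str, float] = {}
    for source in sources:
        for rec in source.records():
            well = rec["well_id"]
            totals[well] = totals.get(well, 0.0) + rec["barrels"]
    return totals
```

Under these assumptions, adding a new division or legacy system means writing one more `ProductionSource` subclass; `total_by_well` and any other analytics built on the interface stay unchanged.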