Building A Robust, Company-Wide Data Science Pipeline Using Programming Abstraction And Virtualization
- Publisher: European Association of Geoscientists & Engineers
- Source: Conference Proceedings, First EAGE/PESGB Workshop Machine Learning, Nov 2018, Volume 2018, p.1 - 4
Abstract
The oil and gas industry presents a challenging and exciting environment for data projects due to the size, complexity, and variability in formatting, type, and quality of the data collected. This environment makes delivering and maintaining a data science pipeline from source systems through to the end user an enormous challenge in many companies (Scully et al., 2014). Many projects fail before any analytics can even be applied to the data, owing to difficulties handling legacy systems, data silos, complex dependencies between data sources, and more. In other cases, data science projects advance in only one area or division of a company because of differences in data handling, despite having broad applicability across the company’s assets. This presentation will discuss California Resources Corporation’s new company-wide data analytics effort as a case study of how we have used technologies such as data virtualization (Van Der Lans, 2018) and programming architectural principles such as abstraction to tackle difficult data integration and data quality problems, constructing a data science pipeline capable of delivering results company-wide. Many of these problems have frustrated multimillion-dollar attempts to address them in the recent past.
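The abstract does not describe the authors' implementation, but the core idea of using programming abstraction to hide source-format differences can be illustrated with a minimal sketch. All class and field names below (`DataSource`, `CsvSource`, `LegacyFixedWidthSource`, the well/oil-rate fields) are hypothetical, not taken from the presentation: each source system is wrapped in an adapter exposing one uniform interface, so downstream analytics never touches legacy formats directly.

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Uniform interface that hides each system's storage details."""
    @abstractmethod
    def records(self) -> list[dict]:
        ...

class CsvSource(DataSource):
    """Adapter for a simple CSV export (hypothetical format)."""
    def __init__(self, text: str):
        self._text = text

    def records(self) -> list[dict]:
        lines = self._text.strip().splitlines()
        header = lines[0].split(",")
        return [dict(zip(header, row.split(","))) for row in lines[1:]]

class LegacyFixedWidthSource(DataSource):
    """Adapter for a hypothetical legacy fixed-width export:
    well id in columns 0-7, oil rate in columns 8-13."""
    def __init__(self, text: str):
        self._text = text

    def records(self) -> list[dict]:
        return [{"well": line[:8].strip(), "oil_rate": line[8:14].strip()}
                for line in self._text.splitlines() if line.strip()]

def daily_rates(sources: list[DataSource]) -> dict[str, float]:
    """Downstream analytics sees only dicts, never the source formats."""
    out: dict[str, float] = {}
    for src in sources:
        for rec in src.records():
            out[rec["well"]] = float(rec["oil_rate"])
    return out
```

With this shape, adding a new division's source system means writing one adapter rather than modifying the pipeline, which is one way the "broad applicability across assets" goal can be met:

```python
rates = daily_rates([
    CsvSource("well,oil_rate\nW-001,12.5"),
    LegacyFixedWidthSource("W-002   8.25"),
])
# both sources arrive in one uniform result, e.g. {"W-001": 12.5, "W-002": 8.25}
```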