Smart Archive Generation Using Computer Vision, NLP and Big Data

Day 3 Wed, November 17, 2021 | Pub Date: 2021-12-09 | DOI: 10.2118/207365-ms

M. Marzouk, M. Elzahed

Abstract

The high complexity of E&P projects results in a dense network of interrelated documents that are produced to cover the various aspects and details of each project. Gaining insights from this network, and from old data in particular, requires experience, knowledge, and awareness that the required data exists. Accordingly, the knowledge accumulated over time from various projects can be considered a key asset, since it can be leveraged to make more informed decisions. This paper presents a framework that aims to capture organizational knowledge locked in paper-based datasets and store it in a structured digital format that facilitates its retrieval and enables analyses that help uncover valuable insights. The framework aims to facilitate the decision-making process while reducing time and cost, without sacrificing the accuracy of the data, and while decreasing the probability of human error. It also aims to generate valuable data from existing archives while causing minimal disturbance to existing business processes and workflows. The framework performs four main functions: image processing, text recognition, data analytics, and data storage. First, the text recognition module performs image processing to enhance the quality of the scanned files, then applies LSTM-based optical character recognition to extract the text contained in the images. The data analytics module then cleanses and mines the extracted text using Big Data analytics tools. Text matching and searching are performed on a Spark DataFrame using regular expressions to identify the different attributes and their types. Finally, the data is stored in a SQL database. To measure the workflow's accuracy, a manual baseline was generated for a sample project. Accuracy is measured using field-level verification, which was found to be the most fit-for-purpose approach because it allows the workflow's accuracy to be measured at the level of each field.
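The last three steps described above (regular-expression matching, SQL storage, and field-level verification against a manual baseline) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names a Spark DataFrame and LSTM-based OCR, whereas this sketch substitutes plain Python with the standard `re` and `sqlite3` modules and starts from text that has already been recognized. The field names, patterns, table layout, and sample documents are all hypothetical, and field-level accuracy is assumed here to mean the share of documents whose extracted value exactly equals the baseline value for that field.

```python
import re
import sqlite3

# Hypothetical attributes and patterns; the abstract does not list the real ones.
FIELD_PATTERNS = {
    "well_name": re.compile(r"Well\s*Name\s*[:\-]\s*([A-Za-z0-9\-]+)", re.IGNORECASE),
    "report_date": re.compile(r"Date\s*[:\-]\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "depth_m": re.compile(r"Depth\s*[:\-]\s*(\d+(?:\.\d+)?)\s*m\b", re.IGNORECASE),
}


def extract_fields(text):
    """Return one value per field from recognized text, or None if no match."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


def store_record(conn, doc_id, record):
    """Insert one extracted record into a (hypothetical) archive table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive "
        "(doc_id TEXT PRIMARY KEY, well_name TEXT, report_date TEXT, depth_m TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO archive VALUES (?, ?, ?, ?)",
        (doc_id, record["well_name"], record["report_date"], record["depth_m"]),
    )
    conn.commit()


def field_level_accuracy(extracted, baseline):
    """Per field: fraction of baseline documents whose extracted value matches."""
    accuracy = {}
    for name in FIELD_PATTERNS:
        correct = sum(
            1
            for doc_id, truth in baseline.items()
            if extracted.get(doc_id, {}).get(name) == truth.get(name)
        )
        accuracy[name] = correct / len(baseline)
    return accuracy


# Made-up recognized text; doc-2 contains a simulated OCR error (letter O for zero).
ocr_text = {
    "doc-1": "Well Name: A-12\nDate: 2019-03-04\nDepth: 2150.5 m",
    "doc-2": "Well Name: B-07\nDate: 2019-O3-05\nDepth: 1980 m",
}
# Made-up manual baseline for the same two documents.
baseline = {
    "doc-1": {"well_name": "A-12", "report_date": "2019-03-04", "depth_m": "2150.5"},
    "doc-2": {"well_name": "B-07", "report_date": "2019-03-05", "depth_m": "1980"},
}

conn = sqlite3.connect(":memory:")
extracted = {}
for doc_id, text in ocr_text.items():
    extracted[doc_id] = extract_fields(text)
    store_record(conn, doc_id, extracted[doc_id])

accuracy = field_level_accuracy(extracted, baseline)
print(accuracy)
```

In this toy run the simulated OCR error prevents the date pattern from matching in one of the two documents, so `report_date` scores 0.5 while the other two fields score 1.0. Reporting accuracy per field in this way shows which attribute is failing, which a single document-level score would hide.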
