Smart Archive Generation Using Computer Vision, NLP and Big Data

M. Marzouk, M. Elzahed
DOI: 10.2118/207365-ms (https://doi.org/10.2118/207365-ms)
Published: 2021-12-09
Source: Day 3 Wed, November 17, 2021

Abstract

Gaining insights from the dense network of interrelated documents involved in exploration and production (E&P) projects requires experience, knowledge, and awareness that the required data exists. The framework presented here aims to support decision-making in less time and at lower cost, without sacrificing data accuracy and while reducing the probability of human error. The high complexity of E&P projects produces a dense network of interrelated documents covering the various aspects and details of each project, and extracting insights from older data places the same demands on experience and awareness. Accordingly, the knowledge accumulated over time from various projects can be considered a key asset, since it can be leveraged to make more informed decisions. This paper presents a framework that aims to capture organizational knowledge locked in paper-based datasets and store it in a structured digital format that facilitates retrieval and enables analyses that help uncover valuable insights. The research aims to generate valuable data from existing archives while causing minimal disturbance to existing business processes and workflows. The framework performs four main functions: image processing, text recognition, data analytics, and data storage. First, the text recognition module performs image processing to enhance the quality of the scanned files, then applies optical character recognition (OCR) based on a long short-term memory (LSTM) network to extract the text contained in the images. The data analytics module then cleanses and mines the extracted text using big data analytics tools. Text matching and searching are performed on a Spark DataFrame using regular expressions to identify the different attributes and their types. Finally, the data is stored in a SQL database. To measure the workflow's accuracy, a manual baseline was generated for a sample project. Accuracy is measured using field-level verification, which was found to be the most fit-for-purpose approach because it measures the accuracy of the workflow at the level of each individual field.
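
The later stages of the workflow described above (regular-expression matching on recognized text, SQL storage, and field-level verification against a manual baseline) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The field names, patterns, sample texts, and table layout are all hypothetical, the standard-library `re` module stands in for regular-expression matching on a Spark DataFrame, and an in-memory SQLite database stands in for the SQL database.

```python
import re
import sqlite3

# Hypothetical field patterns; the abstract does not list the actual attributes.
FIELD_PATTERNS = {
    "well_name": re.compile(r"Well\s*Name\s*:\s*(\S+)", re.IGNORECASE),
    "report_date": re.compile(r"Date\s*:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "depth_m": re.compile(r"Depth\s*:\s*(\d+(?:\.\d+)?)\s*m\b", re.IGNORECASE),
}


def extract_fields(text):
    """Return {field: value or None} for one document's recognized text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def store(conn, doc_id, fields):
    """Insert one extracted record into a SQL table (SQLite is a stand-in)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive "
        "(doc_id TEXT PRIMARY KEY, well_name TEXT, report_date TEXT, depth_m TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO archive VALUES (?, ?, ?, ?)",
        (doc_id, fields["well_name"], fields["report_date"], fields["depth_m"]),
    )
    conn.commit()


def field_level_accuracy(extracted, baseline):
    """Per-field share of documents whose extracted value equals the baseline."""
    accuracy = {}
    for name in FIELD_PATTERNS:
        hits = sum(
            1
            for doc_id, truth in baseline.items()
            if extracted.get(doc_id, {}).get(name) == truth[name]
        )
        accuracy[name] = hits / len(baseline)
    return accuracy


# Made-up OCR output; in doc2 the unit "m" was misread as "rn".
ocr_texts = {
    "doc1": "Well Name: A-12\nDate: 2019-03-04\nDepth: 2450.5 m",
    "doc2": "Well Name: B-07\nDate: 2020-11-30\nDepth: 3100 rn",
}
# Made-up manual baseline for the same two documents.
baseline = {
    "doc1": {"well_name": "A-12", "report_date": "2019-03-04", "depth_m": "2450.5"},
    "doc2": {"well_name": "B-07", "report_date": "2020-11-30", "depth_m": "3100"},
}

conn = sqlite3.connect(":memory:")
extracted = {}
for doc_id, text in ocr_texts.items():
    extracted[doc_id] = extract_fields(text)
    store(conn, doc_id, extracted[doc_id])

accuracy = field_level_accuracy(extracted, baseline)
print(accuracy)  # {'well_name': 1.0, 'report_date': 1.0, 'depth_m': 0.5}
```

Scoring each field separately shows where the workflow fails. In this toy data a single OCR misread lowers only the `depth_m` score, whereas a document-level score would mark the whole of `doc2` as wrong.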