{"title":"使用计算机视觉、自然语言处理和大数据的智能档案生成","authors":"M. Marzouk, M. Elzahed","doi":"10.2118/207365-ms","DOIUrl":null,"url":null,"abstract":"\n Gaining insights from the dense network of interrelated documents involved in E&P projects requires experience, knowledge, and awareness about the existence of the required data. This framework aims to facilitate the decision-making process while consuming shorter time periods and lower costs, without sacrificing the accuracy of the data and decreasing the probability of human errors. The high complexity of E&P Projects results in a dense network of interrelated documents which are produced to cover the various aspects and details of the project. Gaining insights from old data requires experience, knowledge, and awareness about the existence of the required data. Accordingly, the knowledge accumulated over the time from various projects can be considered a key asset, since it can be leveraged to perform more informed decisions. This paper presents a framework that aim at capturing organizational knowledge locked in paper-based datasets and store it in a structured digital format that facilitates its retrieval and enables analyses which help uncover valuable insights. This research aims to generate valuable data from existing archives while causing minimal disturbance to existing business processes and workflows. The framework performs four main functions: image processing, text recognition, Data Analytics and Data storage. Initially the text recognition module; which is performs Image Processing to enhance the quality of the scanned files, and optical character recognition using LSTM which extracts the text contained in images. The Data Analytics Module, then cleanses and mines the extracted text using Big Data Analytics tools. Text Matching and searching is performed on the Spark Dataframe using regular expressions to identify different attributes and their different types. Finally, the data is stored in a SQL Database. In order to measure the workflow's accuracy a manual baseline was generated for a sample project. The accuracy is measured using field-level verification, since it was found to be the most fit-for-purpose, as it allows to measure the accuracy of the workflow on the level of each field.","PeriodicalId":10959,"journal":{"name":"Day 3 Wed, November 17, 2021","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Smart Archive Generation Using Computer Vision, NLP and Big Data\",\"authors\":\"M. Marzouk, M. Elzahed\",\"doi\":\"10.2118/207365-ms\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"\\n Gaining insights from the dense network of interrelated documents involved in E&P projects requires experience, knowledge, and awareness about the existence of the required data. This framework aims to facilitate the decision-making process while consuming shorter time periods and lower costs, without sacrificing the accuracy of the data and decreasing the probability of human errors. The high complexity of E&P Projects results in a dense network of interrelated documents which are produced to cover the various aspects and details of the project. Gaining insights from old data requires experience, knowledge, and awareness about the existence of the required data. 
Accordingly, the knowledge accumulated over the time from various projects can be considered a key asset, since it can be leveraged to perform more informed decisions. This paper presents a framework that aim at capturing organizational knowledge locked in paper-based datasets and store it in a structured digital format that facilitates its retrieval and enables analyses which help uncover valuable insights. This research aims to generate valuable data from existing archives while causing minimal disturbance to existing business processes and workflows. The framework performs four main functions: image processing, text recognition, Data Analytics and Data storage. Initially the text recognition module; which is performs Image Processing to enhance the quality of the scanned files, and optical character recognition using LSTM which extracts the text contained in images. The Data Analytics Module, then cleanses and mines the extracted text using Big Data Analytics tools. Text Matching and searching is performed on the Spark Dataframe using regular expressions to identify different attributes and their different types. Finally, the data is stored in a SQL Database. In order to measure the workflow's accuracy a manual baseline was generated for a sample project. The accuracy is measured using field-level verification, since it was found to be the most fit-for-purpose, as it allows to measure the accuracy of the workflow on the level of each field.\",\"PeriodicalId\":10959,\"journal\":{\"name\":\"Day 3 Wed, November 17, 2021\",\"volume\":\"23 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Day 3 Wed, November 17, 2021\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2118/207365-ms\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Day 3 Wed, November 17, 2021","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2118/207365-ms","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Smart Archive Generation Using Computer Vision, NLP and Big Data
Gaining insights from the dense network of interrelated documents involved in E&P projects requires experience, knowledge, and awareness of where the required data resides. This framework aims to facilitate the decision-making process in less time and at lower cost, without sacrificing data accuracy, while reducing the probability of human error.

The high complexity of E&P projects produces a dense network of interrelated documents covering the various aspects and details of each project. Gaining insights from legacy data requires experience, knowledge, and awareness that the required data exists. The knowledge accumulated over time from various projects can therefore be considered a key asset, since it can be leveraged to make more informed decisions. This paper presents a framework that captures organizational knowledge locked in paper-based datasets and stores it in a structured digital format that facilitates retrieval and enables analyses that help uncover valuable insights. The research aims to generate valuable data from existing archives while causing minimal disturbance to existing business processes and workflows.

The framework performs four main functions: image processing, text recognition, data analytics, and data storage. First, the text recognition module performs image processing to enhance the quality of the scanned files and applies LSTM-based optical character recognition to extract the text contained in the images. The data analytics module then cleanses and mines the extracted text using big data analytics tools; text matching and searching are performed on a Spark DataFrame using regular expressions to identify the different attributes and their types. Finally, the data is stored in a SQL database. To measure the workflow's accuracy, a manual baseline was generated for a sample project. Accuracy is measured using field-level verification, which was found to be the most fit-for-purpose approach because it measures the accuracy of the workflow at the level of each individual field.
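The text recognition step can be illustrated with a short sketch. The paper specifies LSTM-based optical character recognition; the sketch below assumes Tesseract's LSTM engine via pytesseract, with OpenCV used for image enhancement. The file name, denoising filter, and thresholding choice are illustrative assumptions, not the authors' exact pipeline.

    # Minimal OCR sketch: enhance a scanned page, then extract its text with
    # Tesseract's LSTM engine. File name and preprocessing steps are illustrative.
    import cv2
    import pytesseract

    def extract_text(image_path: str) -> str:
        # Load the scan in grayscale and denoise it before recognition.
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        img = cv2.medianBlur(img, 3)
        # Binarize with Otsu thresholding to sharpen the characters.
        _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # --oem 1 selects Tesseract's LSTM-based recognition engine.
        return pytesseract.image_to_string(img, config="--oem 1 --psm 3")

    text = extract_text("scanned_page_001.png")  # hypothetical input file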
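The attribute extraction and storage steps might look roughly like the following PySpark sketch. The column names, regular-expression patterns, sample text, and JDBC connection details are all assumptions used for illustration; the paper only states that regular expressions are applied to a Spark DataFrame and that the structured results are stored in a SQL database.

    # Illustrative Spark sketch: match document attributes with regular
    # expressions and persist the structured result to a SQL database.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract

    spark = SparkSession.builder.appName("smart-archive").getOrCreate()

    # One row per document: an identifier and the raw OCR text (invented sample).
    docs = spark.createDataFrame(
        [("DOC-001", "Well: W-12A Date: 2004-03-17 Depth: 2450 m")],
        ["doc_id", "ocr_text"],
    )

    # Pull out typed fields with regular expressions (hypothetical patterns).
    structured = (
        docs.withColumn("well", regexp_extract("ocr_text", r"Well:\s*(\S+)", 1))
            .withColumn("report_date",
                        regexp_extract("ocr_text", r"Date:\s*(\d{4}-\d{2}-\d{2})", 1))
            .withColumn("depth_m",
                        regexp_extract("ocr_text", r"Depth:\s*(\d+)", 1).cast("int"))
    )

    # Store the structured records in a SQL database over JDBC (assumed connection).
    structured.write.jdbc(
        url="jdbc:postgresql://localhost:5432/archive",
        table="documents",
        mode="append",
        properties={"user": "archive", "password": "secret"},
    )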
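Field-level verification can be sketched as a per-field comparison against the manual baseline: for each field, the fraction of documents whose extracted value matches the baseline value. The data structures and example values below are invented for illustration; the paper does not specify the exact comparison logic.

    # Sketch of field-level verification: compare extracted fields against a
    # manually built baseline and report accuracy per field. Data is invented.
    baseline = {"DOC-001": {"well": "W-12A", "report_date": "2004-03-17", "depth_m": 2450}}
    extracted = {"DOC-001": {"well": "W-12A", "report_date": "2004-03-17", "depth_m": 2400}}

    def field_accuracy(baseline, extracted):
        counts = {}  # field -> (correct, total)
        for doc_id, truth in baseline.items():
            for field, value in truth.items():
                correct, total = counts.get(field, (0, 0))
                total += 1
                if extracted.get(doc_id, {}).get(field) == value:
                    correct += 1
                counts[field] = (correct, total)
        return {field: correct / total for field, (correct, total) in counts.items()}

    print(field_accuracy(baseline, extracted))
    # {'well': 1.0, 'report_date': 1.0, 'depth_m': 0.0}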