Fábio Corrêa Cordeiro , Patrícia Ferreira da Silva , Alexandre Tessarollo , Cláudia Freitas , Elvis de Souza , Diogo da Silva Magalhaes Gomes , Renato Rocha Souza , Flávio Codeço Coelho
{"title":"石油 NLP:石油天然气行业自然语言处理和信息提取资源","authors":"Fábio Corrêa Cordeiro , Patrícia Ferreira da Silva , Alexandre Tessarollo , Cláudia Freitas , Elvis de Souza , Diogo da Silva Magalhaes Gomes , Renato Rocha Souza , Flávio Codeço Coelho","doi":"10.1016/j.cageo.2024.105714","DOIUrl":null,"url":null,"abstract":"<div><p>Most companies struggle to find and extract relevant information from their technical documents. In particular, the Oil and Gas (O&G) industry faces the challenge of dealing with large amounts of data hidden within old and new geoscientific reports collected over decades of operation. Making this information available in a structured format can unlock valuable information among these <em>mountains</em> of data, which is crucial to support a wide range of industrial and academic applications. However, most natural language processing resources were built from general domain corpora extracted from the Internet and primarily written in English. This paper presents <span>Petro NLP</span>, a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese.</p><p>We connected an interdisciplinary team of geoscientists, linguists, computer scientists, petroleum engineers, librarians, and ontologists to build a knowledge graph and several annotated corpora. The <span>Petro NLP</span> resources comprise: (i) <span>Petro KGraph</span>– a knowledge graph populated with entities and relations commonly found on technical reports; and (ii) <span>Petrolês</span>, <span>PetroGold</span>, <span>PetroNER</span>, and <span>PetroRE</span>– sets of corpora containing raw text and documents annotated with morphosyntactic labels, named entities, and relations. These resources are fundamental infrastructure for future research in natural language processing and information extraction in the oil industry. Our ongoing research uses these datasets to train and enhance pre-trained machine learning models that automatically extract information from geoscientific technical documents.</p></div>","PeriodicalId":55221,"journal":{"name":"Computers & Geosciences","volume":"193 ","pages":"Article 105714"},"PeriodicalIF":4.2000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Petro NLP: Resources for natural language processing and information extraction for the oil and gas industry\",\"authors\":\"Fábio Corrêa Cordeiro , Patrícia Ferreira da Silva , Alexandre Tessarollo , Cláudia Freitas , Elvis de Souza , Diogo da Silva Magalhaes Gomes , Renato Rocha Souza , Flávio Codeço Coelho\",\"doi\":\"10.1016/j.cageo.2024.105714\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Most companies struggle to find and extract relevant information from their technical documents. In particular, the Oil and Gas (O&G) industry faces the challenge of dealing with large amounts of data hidden within old and new geoscientific reports collected over decades of operation. Making this information available in a structured format can unlock valuable information among these <em>mountains</em> of data, which is crucial to support a wide range of industrial and academic applications. However, most natural language processing resources were built from general domain corpora extracted from the Internet and primarily written in English. This paper presents <span>Petro NLP</span>, a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese.</p><p>We connected an interdisciplinary team of geoscientists, linguists, computer scientists, petroleum engineers, librarians, and ontologists to build a knowledge graph and several annotated corpora. The <span>Petro NLP</span> resources comprise: (i) <span>Petro KGraph</span>– a knowledge graph populated with entities and relations commonly found on technical reports; and (ii) <span>Petrolês</span>, <span>PetroGold</span>, <span>PetroNER</span>, and <span>PetroRE</span>– sets of corpora containing raw text and documents annotated with morphosyntactic labels, named entities, and relations. These resources are fundamental infrastructure for future research in natural language processing and information extraction in the oil industry. Our ongoing research uses these datasets to train and enhance pre-trained machine learning models that automatically extract information from geoscientific technical documents.</p></div>\",\"PeriodicalId\":55221,\"journal\":{\"name\":\"Computers & Geosciences\",\"volume\":\"193 \",\"pages\":\"Article 105714\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2024-09-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computers & Geosciences\",\"FirstCategoryId\":\"89\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0098300424001973\",\"RegionNum\":2,\"RegionCategory\":\"地球科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers & Geosciences","FirstCategoryId":"89","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0098300424001973","RegionNum":2,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0
摘要
大多数公司都在努力从技术文件中查找和提取相关信息。特别是,石油和天然气(O&G)行业面临的挑战是如何处理几十年来收集的新旧地球科学报告中隐藏的大量数据。以结构化的格式提供这些信息可以从堆积如山的数据中挖掘出有价值的信息,这对支持广泛的工业和学术应用至关重要。然而,大多数自然语言处理资源都是从互联网上提取的通用领域语料库中建立的,而且主要是用英语编写的。我们将一个由地球科学家、语言学家、计算机科学家、石油工程师、图书馆员和本体论专家组成的跨学科团队联系起来,构建了一个知识图谱和若干注释语料库。Petro NLP 资源包括:(i) Petro KGraph--一个知识图谱,其中包含技术报告中常见的实体和关系;(ii) Petrolês、PetroGold、PetroNER 和 PetroRE--包含原始文本和文档的语料集,其中标注了语态句法标签、命名实体和关系。这些资源是未来石油工业自然语言处理和信息提取研究的基础架构。我们正在进行的研究利用这些数据集来训练和增强预训练的机器学习模型,这些模型可自动从地球科学技术文档中提取信息。
Petro NLP: Resources for natural language processing and information extraction for the oil and gas industry
Most companies struggle to find and extract relevant information from their technical documents. In particular, the Oil and Gas (O&G) industry faces the challenge of dealing with large amounts of data hidden within old and new geoscientific reports collected over decades of operation. Making this information available in a structured format can unlock valuable information among these mountains of data, which is crucial to support a wide range of industrial and academic applications. However, most natural language processing resources were built from general domain corpora extracted from the Internet and primarily written in English. This paper presents Petro NLP, a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese.
We connected an interdisciplinary team of geoscientists, linguists, computer scientists, petroleum engineers, librarians, and ontologists to build a knowledge graph and several annotated corpora. The Petro NLP resources comprise: (i) Petro KGraph– a knowledge graph populated with entities and relations commonly found on technical reports; and (ii) Petrolês, PetroGold, PetroNER, and PetroRE– sets of corpora containing raw text and documents annotated with morphosyntactic labels, named entities, and relations. These resources are fundamental infrastructure for future research in natural language processing and information extraction in the oil industry. Our ongoing research uses these datasets to train and enhance pre-trained machine learning models that automatically extract information from geoscientific technical documents.
期刊介绍:
Computers & Geosciences publishes high impact, original research at the interface between Computer Sciences and Geosciences. Publications should apply modern computer science paradigms, whether computational or informatics-based, to address problems in the geosciences.