Geoscience Language Processing for Exploration

Day 3 Wed, November 17, 2021 | Pub Date: 2021-12-09 | DOI: 10.2118/207766-ms

H. Denli, HassanJaved Chughtai, Brian Hughes, Robert Gistri, Peng Xu


Citations: 2

Abstract


Geoscience Language Processing for Exploration

Deep learning, particularly with transformer models, has recently provided step-change capabilities for natural language processing applications such as question answering, query-based summarization, and language translation in general-purpose contexts. We have developed a geoscience-specific language processing solution using such models to enable geoscientists to perform rapid, fully quantitative, and automated analysis of large corpora of data and gain insights.

One of the key transformer-based models is BERT (Bidirectional Encoder Representations from Transformers). It is trained on a large amount of general-purpose text (e.g., Common Crawl). Using such a model for geoscience applications poses challenges. One is that geoscience-specific vocabulary is scarce in general-purpose text (e.g., everyday language); the other is geoscience jargon, in which words take on domain-specific meanings. For example, in everyday language salt is most likely to mean table salt, whereas in geoscience it refers to a subsurface entity.

To alleviate these challenges, we retrained a pre-trained BERT model on our 20M internal geoscientific records. We refer to the retrained model as GeoBERT. We fine-tuned GeoBERT for a number of tasks, including geoscience question answering and query-based summarization.

BERT models are very large; for example, BERT-Large has 340M trained parameters. Geoscience language processing with these models, including GeoBERT, could incur substantial latency if the entire database were processed at every call of the model. To address this, we developed a retriever-reader engine in which an embedding-based similarity search serves as a context retrieval step, narrowing the context for a given query before that context is processed with GeoBERT.

We built a solution integrating the context-retrieval and GeoBERT models. Benchmarks show that it effectively helps geologists identify answers and context for given questions. The prototype also produces summaries at different levels of granularity for a given set of documents. We have also demonstrated that domain-specific GeoBERT outperforms general-purpose BERT for geoscience applications.
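The retriever-reader flow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the bag-of-words `embed` function stands in for learned embeddings, `read` is a placeholder for a fine-tuned reader such as GeoBERT, and the names `embed`, `cosine`, `retrieve`, `read`, `answer`, and `DOCS` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): a term-count vector, standing in for a learned encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=2):
    """Retriever step: keep only the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:top_k]


def read(query, contexts):
    """Reader step (placeholder): a fine-tuned QA model would extract an answer here.
    This stub returns the sentence that shares the most terms with the query."""
    q_terms = set(embed(query))
    sentences = [
        s.strip()
        for context in contexts
        for s in re.split(r"(?<=[.!?])\s+", context)
        if s.strip()
    ]
    return max(sentences, key=lambda s: len(q_terms & set(embed(s))))


def answer(query, documents, top_k=2):
    """Narrow the context with the retriever, then run the reader on that subset only."""
    return read(query, retrieve(query, documents, top_k))


# Hypothetical mini-corpus; the second entry uses "salt" in its everyday sense.
DOCS = [
    "The salt dome forms a structural trap. Salt seals the reservoir below the dome.",
    "Add a pinch of salt to the soup. Serve the soup warm.",
    "The sandstone reservoir has high porosity. Porosity was measured from core samples.",
]

if __name__ == "__main__":
    print(answer("What seals the reservoir?", DOCS))
```

With this toy corpus, the retriever ranks the salt-dome document first for the query "What seals the reservoir?", so the reader only sees the two most similar documents instead of the whole collection. In a real system the expensive reader model is the reason for this split: the cheap similarity search bounds how much text the large model must process per query.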
