Technology Intelligence Analysis Based on Document Embedding Techniques for Oil and Gas Domain

Day 2 Wed, October 30, 2019 | Pub Date: 2019-10-28 | DOI: 10.4043/29707-ms

Fábio Corrêa Cordeiro, Diogo da Silva Magalhães Gomes, Flávio Antônio Machado Gomes, Renata Cristina Texeira


Citations: 1

Abstract



Technology Intelligence Analysis Based on Document Embedding Techniques for Oil and Gas Domain

We propose a methodology based on document embedding techniques for applying Technology Intelligence Analysis in the Oil and Gas (O&G) domain. We build a specialized O&G corpus and train a Vector Space Model (VSM) that represents each document as a vector, such that the distance between two vectors captures their semantic similarity. We explore different analyses on this VSM to infer relations between documents and obtain new insights in a strategic context.

The proposed methodology uses Natural Language Processing (NLP) techniques to obtain strategic insights in a technology intelligence analysis scenario. It consists of generating a VSM induced from a domain-specific Oil and Gas corpus, composed of thousands of scientific articles collected from the Elsevier online database. We explore an approach to represent different entities, such as articles, authors and keywords, in the same vector space, making it possible to correlate them and infer similarity relations based on their cosine distance. An evaluation metric is also provided to assist the training process and hyperparameter optimization.

The highly technical vocabulary of Oil and Gas represents a challenge to NLP applications, because some terms may assume a completely different meaning from the one they have in the general-context domain. In this scenario, gathering an Oil and Gas corpus and training specialized vector space models for this specific domain increases the quality of Technology Intelligence Analysis. The most significant finding is that we were able to make explicit the semantic relationships between different entities of interest in the same VSM, and to link these relationships with additional metadata. An interesting application is to compare the publications of authors affiliated with two or more O&G companies at a given time. These non-trivial correlations are important for gaining strategic insights in a Technology Intelligence Analysis scenario.

The novelty of the proposed methodology is the possibility of exploring new insights by correlating different entities in a technology intelligence scenario for the Oil and Gas domain, using a simple yet efficient approach based on document embedding techniques. The method applies advanced NLP techniques to process more than a hundred thousand documents in a few seconds without requiring complex hardware resources, which would be impractical using traditional techniques.
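The idea of placing articles, authors and keywords in one vector space and comparing them by cosine distance can be illustrated with a small sketch. The abstract does not say how entity vectors are obtained, so the code below assumes, purely for illustration, that an author's vector is the mean of the vectors of the documents they wrote. The document IDs, author names and 3-dimensional vectors are made up, and the helper names (`cosine_similarity`, `centroid`, `entity_vectors`, `most_similar`) are hypothetical rather than taken from the paper. Cosine distance is simply one minus the similarity computed here.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def entity_vectors(doc_vectors, doc_entities):
    """Assumption: an entity (author, keyword, ...) is the mean of its documents' vectors."""
    grouped = defaultdict(list)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            grouped[entity].append(doc_vectors[doc_id])
    return {entity: centroid(vecs) for entity, vecs in grouped.items()}


def most_similar(query, candidates):
    """Rank candidate names by cosine similarity to the query vector, best first."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in candidates.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for trained document embeddings (made-up values, not real model output).
doc_vectors = {
    "doc_drilling_1": [0.9, 0.1, 0.0],
    "doc_drilling_2": [0.8, 0.2, 0.1],
    "doc_seismic_1": [0.1, 0.9, 0.2],
}
# Hypothetical authorship metadata linking documents to authors.
doc_authors = {
    "doc_drilling_1": ["author_a"],
    "doc_drilling_2": ["author_a", "author_b"],
    "doc_seismic_1": ["author_c"],
}

author_vectors = entity_vectors(doc_vectors, doc_authors)

# Because authors and documents share one space, an author can be compared with documents.
ranking = most_similar(author_vectors["author_a"], doc_vectors)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Grouping by company affiliation instead of by author would follow the same pattern: pass a mapping from documents to affiliations to `entity_vectors` and compare the resulting centroids.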

