Technology Intelligence Analysis Based on Document Embedding Techniques for Oil and Gas Domain
Authors: Fábio Corrêa Cordeiro, Diogo da Silva Magalhães Gomes, Flávio Antônio Machado Gomes, Renata Cristina Texeira
Published in: Day 2 Wed, October 30, 2019 (publication date 2019-10-28)
DOI: https://doi.org/10.4043/29707-ms
We propose a methodology based on document embedding techniques for applying Technology Intelligence Analysis in the Oil and Gas (O&G) domain. We build a specialized O&G corpus and train a Vector Space Model (VSM) that represents each document as a vector, such that the distance between two vectors captures their semantic similarity. We explore different analyses on this VSM to infer relations between documents and obtain new insights in a strategic context.
The proposed methodology uses Natural Language Processing (NLP) techniques to obtain strategic insights in a technology intelligence analysis scenario. It consists of generating a vector space model (VSM) induced from a domain-specific Oil and Gas corpus composed of thousands of scientific articles collected from the Elsevier online database. We explore an approach that represents different entities, such as articles, authors and keywords, in the same vector space, making it possible to correlate them and infer similarity relations from their cosine distance. An evaluation metric is also provided to assist the training process and hyperparameter optimization.
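As an illustration of the shared vector space described above, the following is a minimal sketch and not the authors' implementation. It assumes that an author can be placed in the space as the mean of the vectors of their papers; the abstract does not state how entity vectors are obtained. All identifiers (`doc_vectors`, `author_papers`, `nearest_document`) and the three-dimensional toy values are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical document embeddings (toy values, not real model output).
doc_vectors = {
    "paper_drilling_1": [0.9, 0.1, 0.0],
    "paper_drilling_2": [0.8, 0.2, 0.1],
    "paper_seismic_1": [0.1, 0.9, 0.2],
}

# Assumption: an author vector is the mean of that author's paper vectors.
author_papers = {
    "author_a": ["paper_drilling_1", "paper_drilling_2"],
    "author_b": ["paper_seismic_1"],
}
author_vectors = {
    name: mean_vector([doc_vectors[p] for p in papers])
    for name, papers in author_papers.items()
}


def nearest_document(author):
    """Return the id of the document closest to the author's vector."""
    target = author_vectors[author]
    return min(doc_vectors, key=lambda d: cosine_distance(target, doc_vectors[d]))
```

Because authors and documents live in the same space, the same `cosine_distance` function can compare any pair of entities, for example an author with a document or one author with another.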
The highly technical vocabulary of Oil and Gas is a challenge for NLP applications, because some terms take on a completely different meaning from the one they have in the general-context domain. In this scenario, gathering an Oil and Gas corpus and training specialized vector space models for this specific domain increases the quality of Technology Intelligence Analysis. The most significant finding is that we were able to make explicit the semantic relationships between different entities of interest in the same VSM, and to link these relationships with additional metadata. An interesting application is to compare the publications of authors affiliated with two or more O&G companies at a given time. These non-trivial correlations are important for gaining strategic insights in a Technology Intelligence Analysis scenario.
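The company comparison mentioned above could be sketched as follows. This is a hedged illustration under stated assumptions, not the method reported in the paper: each publication is assumed to carry affiliation and year metadata, and each company is summarized by the centroid of its publication vectors for the chosen year. The record layout, the company names and the two-dimensional toy vectors are all hypothetical.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical records: (company, publication year, document vector).
records = [
    ("company_x", 2017, [0.9, 0.1]),
    ("company_x", 2017, [0.7, 0.3]),
    ("company_y", 2017, [0.2, 0.8]),
    ("company_x", 2018, [0.3, 0.7]),
    ("company_y", 2018, [0.1, 0.9]),
]


def company_centroids(records, year):
    """Average the document vectors of each company for one year."""
    grouped = defaultdict(list)
    for company, rec_year, vector in records:
        if rec_year == year:
            grouped[company].append(vector)
    return {
        company: [sum(column) / len(vectors) for column in zip(*vectors)]
        for company, vectors in grouped.items()
    }


def company_similarity(records, year, company_a, company_b):
    """Cosine similarity between two company centroids in a given year."""
    centroids = company_centroids(records, year)
    return cosine_similarity(centroids[company_a], centroids[company_b])
```

With these toy values, `company_similarity(records, 2018, "company_x", "company_y")` is higher than the 2017 value, which in a real corpus might suggest that the two companies' publication topics moved closer together.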
The novelty of the proposed methodology is the possibility of exploring new insights by correlating different entities in a technology intelligence scenario for the Oil and Gas domain, using a simple yet efficient approach based on document embedding techniques. The method applies advanced NLP techniques to process more than a hundred thousand documents in a few seconds without requiring complex hardware resources, a task that would be impractical with traditional techniques.