An Automated Approach for Domain-Specific Knowledge Graph Generation: Graph Measures and Characterization

IF 5.3 · Zone 2 (Chemistry) · Q1 CHEMISTRY, MEDICINAL

Journal of Chemical Information and Modeling Pub Date: 2025-01-28 DOI: 10.1021/acs.jcim.4c01904

Connor O’Ryan, Kevin D. Hayes, Francis G. VanGessel, Ruth M. Doherty, William Wilson, John Fischer, Zois Boukouvalas and Peter W. Chung*

Journal of Chemical Information and Modeling, vol. 65, no. 3, pp. 1243–1257. https://pubs.acs.org/doi/10.1021/acs.jcim.4c01904

Citations: 0

Abstract


An Automated Approach for Domain-Specific Knowledge Graph Generation─Graph Measures and Characterization


In 2020, nearly 3 million scientific and engineering papers were published worldwide (White, K. Publications Output: U.S. Trends And International Comparisons). The vastness of the literature that already exists, the increasing rate of appearance of new publications, and the timely translation of artificial intelligence methods into scientific and engineering communities have ushered in the development of automated methods for mining and extracting information from technical documents. However, domain-specific approaches for extracting knowledge graph representations from semantic information remain limited. In this paper, we develop a natural language processing (NLP) approach to extract knowledge graphs resulting in a semantically structured network (SSN) that can be queried. After a detailed exposition of the modeling method, the approach is demonstrated specifically for the synthetic chemistry of organic molecules from the text of approximately 100,000 full-length patents. In this paper, we focus specifically on characterizing the knowledge graph to develop insights into the linguistic patterns and trends within the data and to establish objective graph characteristics that may enable comparisons among other text-based knowledge graphs across domains. Graph characterization is performed for network motif structures, assortativity, and eigenvector centrality. The structural information provided by the measures reveals language tendencies commonly employed by authors in the text discourse for chemical reactions. These include observations of the prevalence of descriptions of specific compound names, that common solvents and drying agents cut across large numbers of chemical synthesis approaches, and that power-law trends clearly emerge in the limit of larger corpora. The findings provide important quantitative characterizations of knowledge graphs for use in validation in large data settings.
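Two of the graph measures named in the abstract, assortativity and eigenvector centrality, can be illustrated on a toy graph. The sketch below is not the authors' pipeline or data: the edge list, the node labels (`THF`, `MgSO4`, `compound_A` and so on) and all function names are hypothetical, chosen only to mimic a solvent and a drying agent shared across several compounds. It assumes an undirected, unweighted graph, computes degree assortativity as the Pearson correlation of degrees across edge endpoints, and estimates eigenvector centrality by power iteration on A + I (the shift avoids oscillation on bipartite graphs).

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy edge list; NOT taken from the paper's patent corpus.
EDGES = [
    ("THF", "compound_A"),
    ("THF", "compound_B"),
    ("THF", "compound_C"),
    ("MgSO4", "compound_A"),
    ("MgSO4", "compound_B"),
    ("compound_C", "compound_D"),
]


def build_adjacency(edges):
    """Return an undirected adjacency map {node: set(neighbors)}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def degree_assortativity(adj):
    """Pearson correlation of endpoint degrees, counting each edge in both directions."""
    xs, ys = [], []
    for u, nbrs in adj.items():
        for v in nbrs:
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when every node has the same degree
    return cov / sqrt(var_x * var_y)


def eigenvector_centrality(adj, max_iter=1000, tol=1e-12):
    """Power iteration on (A + I), normalized to unit Euclidean length."""
    x = {node: 1.0 for node in adj}
    for _ in range(max_iter):
        new = {node: x[node] + sum(x[nbr] for nbr in adj[node]) for node in adj}
        norm = sqrt(sum(value * value for value in new.values()))
        new = {node: value / norm for node, value in new.items()}
        if max(abs(new[node] - x[node]) for node in adj) < tol:
            return new
        x = new
    return x


adjacency = build_adjacency(EDGES)
print("degree assortativity:", degree_assortativity(adjacency))
for node, score in sorted(eigenvector_centrality(adjacency).items(),
                          key=lambda item: -item[1]):
    print(f"{node:12s} {score:.3f}")
```

In this toy graph the shared solvent node ranks highest in centrality and the assortativity is slightly negative, because the highest-degree node links only to lower-degree compound nodes. Whether the paper's graph shows the same signs and rankings is not something this sketch can establish.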


Source journal

Journal of Chemical Information and Modeling (Chemistry: Chemistry, Multidisciplinary)

CiteScore: 9.80

Self-citation rate: 10.70%

Articles published: 529

Review time: 1.4 months

About the journal: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.