Supporting secondary research in early drug discovery process through a Natural Language Processing based system

Proceedings of the International Conference on Applied Statistics | Pub Date: 2020-12-01 | DOI: 10.2478/icas-2021-0023

Alin-Bogdan Popa

Abstract

Recent decades have been characterised by a constant decline in the productivity of the research and development activities of pharmaceutical companies. This is because the drug discovery process carries an intrinsic risk that should be managed efficiently. Within this process, early-phase projects could be streamlined by doing more secondary research. Such research would integrate chemical and biological knowledge from the scientific literature in order to extract an overview of a given research area and of its evolution, which would in turn help refine research and development operations. Given the vast number of published pharmaceutical studies, it is not easy to identify the important information. For this task, a series of projects have leveraged the advantages of the open pharmacological space through state-of-the-art technologies, the most popular being Knowledge Graph methods. Although extremely useful, this technology requires a substantial investment of time and human resources. An alternative would be to develop a system built from Natural Language Processing blocks. Still, there is no defined framework or reusable code template for the use case of compound development. This study presents the design and development of a system that uses Dynamic Topic Modelling and Named Entity Recognition modules to extract meaningful information from a large volume of unstructured text. Moreover, the dynamic character of the topic modelling technique makes it possible to analyse the evolution of different subject areas over time. To validate the system, a collection of articles from the Pharmaceutical Research Journal was used. Our results show that the system is able to identify the main research areas of the last 20 years, namely crystalline and amorphous systems, insulin resistance, and paracellular permeability. Additionally, the evolution of the subjects is a highly valuable resource and should be used to gain an in-depth understanding of the shifts that have happened in a specific domain. However, a limitation of this system is that it cannot detect an association between two concepts or entities if they do not appear in the same document.
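The abstract gives no implementation details, so the following Python sketch is only a rough illustration of two ideas it mentions: tracking how prominent a subject is across time slices, and finding associations between entities through co-occurrence in the same document. It is not the author's system. The toy corpus `DOCS`, the dictionary `ENTITIES`, and the functions `find_entities`, `entity_share_by_period`, and `cooccurring_pairs` are all hypothetical names; a simple dictionary lookup stands in for a trained Named Entity Recognition model, and per-period document shares stand in for a real Dynamic Topic Model.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy corpus of (publication year, text); not data from the study.
DOCS = [
    (2001, "amorphous solid dispersions improve solubility of crystalline drugs"),
    (2002, "insulin resistance in obese patients"),
    (2015, "paracellular permeability and tight junctions"),
    (2016, "insulin delivery across paracellular permeability barriers"),
]

# Hypothetical entity dictionary standing in for a trained NER model.
ENTITIES = {"insulin", "amorphous", "crystalline", "paracellular permeability"}


def find_entities(text):
    """Return the dictionary entities that occur in the text (substring match)."""
    lowered = text.lower()
    return {entity for entity in ENTITIES if entity in lowered}


def entity_share_by_period(docs, period_length=10):
    """For each time slice, compute the fraction of documents mentioning each entity."""
    docs_per_period = Counter()
    mentions = defaultdict(Counter)
    for year, text in docs:
        period = year - year % period_length  # e.g. 2016 -> 2010
        docs_per_period[period] += 1
        for entity in find_entities(text):
            mentions[period][entity] += 1
    return {
        period: {e: n / docs_per_period[period] for e, n in counts.items()}
        for period, counts in mentions.items()
    }


def cooccurring_pairs(docs):
    """Count entity pairs that appear together within a single document."""
    pairs = Counter()
    for _, text in docs:
        for pair in combinations(sorted(find_entities(text)), 2):
            pairs[pair] += 1
    return pairs
```

In this toy corpus, "crystalline" and "insulin" never share a document, so `cooccurring_pairs(DOCS)` holds no count for that pair. This mirrors the limitation stated in the abstract: an approach based on document-level co-occurrence cannot link concepts that are only discussed in separate documents.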
