A data science roadmap for open science organizations engaged in early-stage drug discovery.

IF 14.7, Region 1 (comprehensive journals), JCR Q1 (Multidisciplinary Sciences)

Nature Communications Pub Date : 2024-07-05 DOI:10.1038/s41467-024-49777-x

Kristina Edfeldt, Aled M Edwards, Ola Engkvist, Judith Günther, Matthew Hartley, David G Hulcoop, Andrew R Leach, Brian D Marsden, Amelie Menge, Leonie Misquitta, Susanne Müller, Dafydd R Owen, Kristof T Schütt, Nicholas Skelton, Andreas Steffen, Alexander Tropsha, Erik Vernet, Yanli Wang, James Wellnitz, Timothy M Willson, Djork-Arné Clevert, Benjamin Haibe-Kains, Lovisa Holmberg Schiavone, Matthieu Schapira


Citations: 0

Abstract



The Structural Genomics Consortium is an international open science research organization with a focus on accelerating early-stage drug discovery, namely hit discovery and optimization. We, as many others, believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how to best benefit from recent advances in AI and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data-sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles openly, and at scale and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team, and where data science is incorporated into the experimental design.
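Two of the considerations listed in the abstract, defining training and test sets and estimating prediction uncertainty, can be illustrated with a small sketch. This is not taken from the paper: it assumes each compound record carries a hypothetical `series` label (for example a chemical series or scaffold identifier), keeps every series entirely on one side of the split so the test set contains chemistry the model has not seen, and uses the spread of an ensemble's predictions as a rough uncertainty signal. The function names, the `series` and `pIC50` fields, and the toy models are all assumptions made for illustration.

```python
import random
import statistics
from collections import defaultdict


def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group appears in both train and test."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    # Shuffle group keys reproducibly, then assign whole groups at a time.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(records)
    train, test = [], []
    for key in keys:
        # Fill the test set group by group until it reaches the target size.
        if len(test) < target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return train, test


def ensemble_prediction(models, x):
    """Return the mean and population standard deviation of the predictions."""
    preds = [model(x) for model in models]
    return statistics.mean(preds), statistics.pstdev(preds)


# Hypothetical toy data: ten compounds from three series.
records = (
    [{"series": "A", "pIC50": 6.0 + 0.1 * i} for i in range(4)]
    + [{"series": "B", "pIC50": 5.0 + 0.1 * i} for i in range(3)]
    + [{"series": "C", "pIC50": 7.0 + 0.1 * i} for i in range(3)]
)
train, test = group_split(records, "series", test_fraction=0.3)

# Hypothetical "models": two simple functions standing in for trained regressors.
models = [lambda x: x, lambda x: x + 2.0]
mean, spread = ensemble_prediction(models, 1.0)
```

A large spread between ensemble members is only a heuristic indicator of uncertainty; a calibrated estimate would need to be validated against held-out data.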


Source journal: Nature Communications (Biological Science Disciplines)
CiteScore: 24.90
Self-citation rate: 2.40%
Articles published: 6928
Review time: 3.7 months

About the journal: Nature Communications, an open-access journal, publishes high-quality research spanning all areas of the natural sciences. Papers featured in the journal showcase significant advances relevant to specialists in each respective field. With a 2-year impact factor of 16.6 (2022) and a median time of 8 days from submission to the first editorial decision, Nature Communications is committed to rapid dissemination of research findings. As a multidisciplinary journal, it welcomes contributions from biological, health, physical, chemical, Earth, social, mathematical, applied, and engineering sciences, aiming to highlight important breakthroughs within each domain.