Finding the dark matter: Large language model-based enzyme kinetic data extractor and its validation.

IF 5.2 | CAS Zone 3 (Biology) | JCR Q1, Biochemistry & Molecular Biology

Protein Science Pub Date : 2025-09-01 DOI:10.1002/pro.70251

Galen Wei, Xinchun Ran, Runeem Ai-Abssi, Zhongyue Yang

Protein Science, volume 34, issue 9, e70251. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12355964/pdf/


Abstract

Despite the vast number of enzymatic kinetic measurements reported across decades of biochemical literature, the majority of relational enzyme kinetic data (linking amino acid sequence, substrate identity, kinetic parameters, and assay conditions) remains uncollected and inaccessible in structured form. This constitutes a significant portion of the "dark matter" of enzymology. Unlocking these hidden data through automated extraction offers an opportunity to expand enzyme dataset diversity and size, critical for building accurate, generalizable models that drive predictive enzyme engineering. To address this limitation, we built EnzyExtract, a large language model-powered pipeline that automates the extraction, verification, and structuring of enzyme kinetics data from scientific literature. By processing 137,892 full-text publications (PDF/XML), EnzyExtract collected more than 218,095 enzyme-substrate-kinetics entries, including 218,095 k_cat and 167,794 K_m values. These entries are mapped to enzymes spanning 3569 unique four-digit EC numbers, with a total of 84,464 entries assigned at least a first-digit EC number. EnzyExtract identified 89,544 unique kinetic entries (k_cat and K_m combined) absent from BRENDA, significantly expanding the known enzymology dataset. The newly curated dataset was compiled into a database named EnzyExtractDB. EnzyExtract demonstrates high accuracy when benchmarked against manually curated datasets and strong consistency with BRENDA-derived data. To create model-ready datasets, enzyme and substrate sequences were aligned to UniProt and PubChem, yielding 92,286 high-confidence, sequence-mapped kinetic entries. To assess the practical utility of our dataset, we retrained several state-of-the-art k_cat predictors (including MESI, DLKcat, and TurNuP) using EnzyExtractDB. Across held-out test sets, all models demonstrate improved predictive performance in terms of RMSE, MAE, and R², highlighting the value of high-quality, large-scale, literature-derived EnzyExtractDB for enhancing predictive modeling of enzyme kinetics. The EnzyExtract source code and the database are openly available at https://github.com/ChemBioHTP/EnzyExtract, and an interactive demo can be accessed via Google Colab at https://colab.research.google.com/drive/1MwKSEZzLPNOseksRshbzkkFoO_cgJhva.
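
The three metrics named in the abstract (RMSE, MAE, and R²) can be illustrated with a minimal sketch. The abstract does not say whether the predictors are scored on raw or log-transformed k_cat, so the log10 transform below is an assumption, as are the function name `regression_metrics` and the example values, which are invented for illustration and are not taken from EnzyExtractDB.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)                  # root mean squared error
    mae = sum(abs(e) for e in errors) / n         # mean absolute error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 is undefined when every true value is identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return rmse, mae, r2


# Hypothetical k_cat values in s^-1 (illustrative only).
measured_kcat = [0.1, 1.0, 10.0, 100.0]
predicted_kcat = [0.1, 10.0, 10.0, 100.0]

# Assumption: metrics are computed on log10(k_cat).
log_true = [math.log10(v) for v in measured_kcat]
log_pred = [math.log10(v) for v in predicted_kcat]

rmse, mae, r2 = regression_metrics(log_true, log_pred)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Lower RMSE and MAE and higher R² on a held-out test set are what the abstract refers to as improved predictive performance. Scoring on a log scale keeps entries that span several orders of magnitude from being dominated by the largest values.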


Source journal

Protein Science (Biology: Biochemistry & Molecular Biology)

CiteScore

12.40

Self-citation rate

1.20%

Articles published

246

Review time

1 month

About the journal: Protein Science, the flagship journal of The Protein Society, is a publication that focuses on advancing fundamental knowledge in the field of protein molecules. The journal welcomes original reports and review articles that contribute to our understanding of protein function, structure, folding, design, and evolution. Additionally, Protein Science encourages papers that explore the applications of protein science in various areas such as therapeutics, protein-based biomaterials, bionanotechnology, synthetic biology, and bioelectronics. The journal accepts manuscript submissions in any suitable format for review, with the requirement of converting the manuscript to journal-style format only upon acceptance for publication. Protein Science is indexed and abstracted in numerous databases, including the Agricultural & Environmental Science Database (ProQuest), Biological Science Database (ProQuest), CAS: Chemical Abstracts Service (ACS), Embase (Elsevier), Health & Medical Collection (ProQuest), Health Research Premium Collection (ProQuest), Materials Science & Engineering Database (ProQuest), MEDLINE/PubMed (NLM), Natural Science Collection (ProQuest), and SciTech Premium Collection (ProQuest).