Galen Wei, Xinchun Ran, Runeem Ai-Abssi, Zhongyue Yang
Finding the dark matter: Large language model-based enzyme kinetic data extractor and its validation.
Protein Science, volume 34, issue 9, e70251. Published 2025-09-01. DOI: https://doi.org/10.1002/pro.70251
Abstract
Despite the vast number of enzymatic kinetic measurements reported across decades of biochemical literature, the majority of relational enzyme kinetic data (linking amino acid sequence, substrate identity, kinetic parameters, and assay conditions) remains uncollected and inaccessible in structured form. This constitutes a significant portion of the "dark matter" of enzymology. Unlocking these hidden data through automated extraction offers an opportunity to expand enzyme dataset diversity and size, both critical for building accurate, generalizable models that drive predictive enzyme engineering. To address this limitation, we built EnzyExtract, a large language model-powered pipeline that automates the extraction, verification, and structuring of enzyme kinetics data from scientific literature. By processing 137,892 full-text publications (PDF/XML), EnzyExtract collected more than 218,095 enzyme-substrate-kinetics entries, including 218,095 kcat and 167,794 Km values. These entries are mapped to enzymes spanning 3569 unique four-digit EC numbers, with a total of 84,464 entries assigned at least a first-digit EC number. EnzyExtract identified 89,544 unique kinetic entries (kcat and Km combined) absent from BRENDA, significantly expanding the known enzymology dataset. The newly curated dataset was compiled into a database named EnzyExtractDB. EnzyExtract demonstrates high accuracy when benchmarked against manually curated datasets and strong consistency with BRENDA-derived data. To create model-ready datasets, enzyme and substrate sequences were aligned to UniProt and PubChem, yielding 92,286 high-confidence, sequence-mapped kinetic entries. To assess the practical utility of our dataset, we retrained several state-of-the-art kcat predictors (including MESI, DLKcat, and TurNuP) using EnzyExtractDB. Across held-out test sets, all models demonstrate improved predictive performance in terms of RMSE, MAE, and R², highlighting the value of the high-quality, large-scale, literature-derived EnzyExtractDB for enhancing predictive modeling of enzyme kinetics. The EnzyExtract source code and the database are openly available at https://github.com/ChemBioHTP/EnzyExtract, and an interactive demo can be accessed via Google Colab at https://colab.research.google.com/drive/1MwKSEZzLPNOseksRshbzkkFoO_cgJhva.
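For readers less familiar with the three metrics named in the abstract, the following is a minimal sketch of how RMSE, MAE, and R² are computed for a regression model. It is not taken from the EnzyExtract code base: the function name `regression_metrics` is hypothetical, the numbers are made up for illustration, and treating the targets as log10-transformed kcat values is an assumption (a common convention for kcat predictors, not something the abstract states).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for paired observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    # Residual sum of squares, shared by RMSE and R^2.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    # Total sum of squares around the mean of the observed values.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return rmse, mae, r2


# Illustrative values only (assumed to be log10 kcat); not data from the paper.
observed = [0.0, 1.0, 2.0, 3.0]
predicted = [0.5, 1.0, 1.5, 3.0]
rmse, mae, r2 = regression_metrics(observed, predicted)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Lower RMSE and MAE and a higher R² on the same held-out test set indicate better predictions, which is the sense in which the retrained models are reported to improve.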
About the journal:
Protein Science, the flagship journal of The Protein Society, is a publication that focuses on advancing fundamental knowledge in the field of protein molecules. The journal welcomes original reports and review articles that contribute to our understanding of protein function, structure, folding, design, and evolution.
Additionally, Protein Science encourages papers that explore the applications of protein science in various areas such as therapeutics, protein-based biomaterials, bionanotechnology, synthetic biology, and bioelectronics.
The journal accepts manuscript submissions in any suitable format for review, with the requirement of converting the manuscript to journal-style format only upon acceptance for publication.
Protein Science is indexed and abstracted in numerous databases, including the Agricultural & Environmental Science Database (ProQuest), Biological Science Database (ProQuest), CAS: Chemical Abstracts Service (ACS), Embase (Elsevier), Health & Medical Collection (ProQuest), Health Research Premium Collection (ProQuest), Materials Science & Engineering Database (ProQuest), MEDLINE/PubMed (NLM), Natural Science Collection (ProQuest), and SciTech Premium Collection (ProQuest).