Toward the Reconciliation of Inconsistent Molecular Structures from Biochemical Databases.

IF 1.4; Region 4, Biology; JCR Q4, Biochemical Research Methods

Journal of Computational Biology. Pub Date: 2024-06-01. Epub Date: 2024-05-17. DOI: 10.1089/cmb.2024.0520

Casper Asbjørn Eriksen, Jakob Lykke Andersen, Rolf Fagerberg, Daniel Merkle

Journal of Computational Biology, pp. 498-512. Publication type: Journal Article.

Citations: 0

Abstract

Information on the structure of molecules, retrieved via biochemical databases, plays a pivotal role in various disciplines, including metabolomics, systems biology, and drug discovery. No such database can be complete and it is often necessary to incorporate data from several sources. However, the molecular structure for a given compound is not necessarily consistent between databases. This article presents StructRecon, a novel tool for resolving unique molecular structures from database identifiers. Currently, identifiers from BiGG, ChEBI, Escherichia coli Metabolome Database (ECMDB), MetaNetX, and PubChem are supported. StructRecon traverses the cross-links between entries in different databases to construct what we call identifier graphs. The goal of these graphs is to offer a more complete view of the total information available on a given compound across all the supported databases. To reconcile discrepancies met during the traversal of the databases, we develop an extensible model for molecular structure supporting multiple independent levels of detail, which allows standardization of the structure to be applied iteratively. In some cases, our standardization approach results in multiple candidate structures for a given compound, in which case a random walk-based algorithm is used to select the most likely structure among incompatible alternatives. As a case study, we applied StructRecon to the EColiCore2 model. We found at least one structure for 98.66% of its compounds, which is more than twice as many as possible when using the databases in more standard ways not considering the complex network of cross-database references captured by our identifier graphs. StructRecon is open-source and modular, which enables support for more databases in the future.
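The two ideas in the abstract, building an identifier graph by following cross-links and choosing among incompatible structures with a random walk, can be illustrated with a minimal sketch. This is not StructRecon's code or API. The identifiers, the cross-reference table, the structure labels, and the function names are all hypothetical, and the scoring uses a generic random walk with restart, chosen here as an assumption because the abstract does not specify the algorithm.

```python
from collections import deque

# Hypothetical cross-reference table: identifier -> identifiers it links to.
# A real tool would read these from database dumps or APIs.
CROSS_REFS = {
    "bigg:cpd1": ["metanetx:m1", "chebi:c1"],
    "metanetx:m1": ["chebi:c1", "pubchem:p1"],
    "chebi:c1": ["pubchem:p1"],
    "pubchem:p1": ["chebi:c1"],
    "ecmdb:e9": ["chebi:c2"],  # unrelated entry, not reachable from bigg:cpd1
}

# Hypothetical structure annotations (placeholder labels, not real structures).
STRUCTURES = {
    "chebi:c1": "structure-A",
    "pubchem:p1": "structure-A",
    "metanetx:m1": "structure-B",
}


def build_identifier_graph(start, cross_refs):
    """Breadth-first traversal of all cross-links reachable from `start`."""
    graph = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in cross_refs.get(node, []):
            graph[node].append(target)
            if target not in graph:
                graph[target] = []
                queue.append(target)
    return graph


def rank_structures(graph, structures, start, damping=0.85, iterations=100):
    """Score candidate structures with a random walk with restart at `start`."""
    prob = {n: 1.0 if n == start else 0.0 for n in graph}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in graph}
        for node, out in graph.items():
            if out:
                # Follow an outgoing cross-link with probability `damping`.
                share = damping * prob[node] / len(out)
                for target in out:
                    nxt[target] += share
                nxt[start] += (1.0 - damping) * prob[node]
            else:
                # Dead end: the walker restarts at the queried identifier.
                nxt[start] += prob[node]
        prob = nxt
    # Sum the visiting probability of every identifier carrying each structure.
    scores = {}
    for node, p in prob.items():
        if node in structures:
            label = structures[node]
            scores[label] = scores.get(label, 0.0) + p
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


graph = build_identifier_graph("bigg:cpd1", CROSS_REFS)
ranked = rank_structures(graph, STRUCTURES, "bigg:cpd1")
print(ranked[0][0])  # prints structure-A
```

In this toy data, the structure attached to two mutually linked entries accumulates most of the walk's probability mass and is ranked first. A real reconciliation would first standardize the structures so that equivalent representations compare equal.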




Source journal

Journal of Computational Biology (Biology; Computer Science, Interdisciplinary Applications)

CiteScore: 3.60

Self-citation rate: 5.90%

Articles published: 113

Review time: 6-12 weeks

About the journal: Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics. Journal of Computational Biology coverage includes:
- Genomics
- Mathematical modeling and simulation
- Distributed and parallel biological computing
- Designing biological databases
- Pattern matching and pattern detection
- Linking disparate databases and data
- New tools for computational biology
- Relational and object-oriented database technology for bioinformatics
- Biological expert system design and use
- Reasoning by analogy, hypothesis formation, and testing by machine
- Management of biological databases