CSU-MS2: A Contrastive Learning Framework for Cross-Modal Compound Identification from MS/MS Spectra to Molecular Structures.

Impact factor 6.7 · CAS Zone 1 (Chemistry) · JCR Q1 (Chemistry, Analytical)
Ting Xie, Hailiang Zhang, Qiong Yang, Jinyu Sun, Yue Wang, Jia Long, Zhimin Zhang, Hongmei Lu
Journal: Analytical Chemistry
Publication date: 2025-06-20
Publication type: Journal Article
DOI: 10.1021/acs.analchem.5c01594 (https://doi.org/10.1021/acs.analchem.5c01594)
Citations: 0

Abstract

Tandem mass spectrometry (MS/MS) is a cornerstone for compound identification in complex mixtures, but conventional spectral matching approaches are constrained by limited library coverage and by the matching algorithms themselves. To address this, we propose CSU-MS2 (contrastively spectral-structural Unification framework for MS/MS Spectra and Molecular Structures), a framework that bridges MS/MS spectra and molecular structures through cross-modal contrastive learning. CSU-MS2 integrates an External Space Attention Aggregation (ESA) module to dynamically align spectral and structural features, enabling direct retrieval of molecular candidates from a unified embedding space. The framework is pretrained on large-scale in silico MS/MS data sets generated by CFM-ID and ICEBERG and then fine-tuned on high-quality experimental data. CSU-MS2 achieves a Recall@1 of 75.45% when matching 1,047 spectra against a reference library containing 1,001,047 compounds, significantly surpassing existing methods such as CFM-ID (68.38%), SIRIUS (64.85%), MetFrag (48.59%), and CMSSP (30.47%). Validation on three external data sets, spanning human metabolomics (MTBLS265), plant metabolites (PMhub), and the CASMI 2022 challenge, demonstrates robust generalizability, with domain-specific retrieval achieving a Recall@10 of 91.67% for blood metabolites. To facilitate compound identification across domains, we have assembled a Spectrum-searchable Structural Feature Database (SSFDB) from 23 structural databases and deployed an open-source web server supporting customizable cross-modal retrieval. All code, models, and SSFDB are publicly accessible, offering a solution for high-throughput compound identification in metabolomics and beyond.
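To make the retrieval setting in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes spectra and structures have already been encoded into vectors in a shared space (the encoders and the ESA module are not reproduced here), ranks library structures by cosine similarity, and computes Recall@k as the fraction of queries whose true structure appears among the top k candidates. The symmetric InfoNCE loss is a common choice for cross-modal contrastive training and is an assumption: the abstract states only that contrastive learning is used. All function names, the temperature value, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recall_at_k(spectrum_embs, library_embs, true_idx, k):
    """Fraction of query spectra whose true structure ranks in the top k."""
    hits = 0
    for query, target in zip(spectrum_embs, true_idx):
        scores = [cosine(query, cand) for cand in library_embs]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(spectrum_embs)


def symmetric_info_nce(spectrum_embs, structure_embs, temperature=0.1):
    """Assumed CLIP-style loss; item i of each list is a matched pair."""
    n = len(spectrum_embs)
    logits = [[cosine(s, m) / temperature for m in structure_embs]
              for s in spectrum_embs]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], with the usual max-shift for stability.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    # Average the spectrum-to-structure and structure-to-spectrum directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


# Toy embeddings, made up for illustration only.
library = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.6, 0.7]]
truth = [0, 1]  # index of the correct library entry for each query

r1 = recall_at_k(queries, library, truth, k=1)
r2 = recall_at_k(queries, library, truth, k=2)
print(r1, r2)
```

In this toy case the second query is closest to the third library entry, so its true structure is missed at k=1 and recovered at k=2. Against a library of about a million candidates, an exhaustive pure-Python scan like this would be replaced by batched matrix operations or an approximate nearest-neighbor index.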
Source journal
Analytical Chemistry (category: Chemistry, Analytical Chemistry)
CiteScore: 12.10
Self-citation rate: 12.20%
Articles published: 1949
Review time: 1.4 months
Journal description: Analytical Chemistry, a peer-reviewed research journal, focuses on disseminating new and original knowledge across all branches of analytical chemistry. Fundamental articles may explore general principles of chemical measurement science and need not directly address existing or potential analytical methodology. They can be entirely theoretical or report experimental results. Contributions may cover various phases of analytical operations, including sampling, bioanalysis, electrochemistry, mass spectrometry, microscale and nanoscale systems, environmental analysis, separations, spectroscopy, chemical reactions and selectivity, instrumentation, imaging, surface analysis, and data processing. Papers discussing known analytical methods should present a significant, original application of the method, a notable improvement, or results on an important analyte.