Open Raman spectral library for biomolecule identification

IF 3.8 · Region 2 (Chemistry) · JCR Q2, Automation & Control Systems

Chemometrics and Intelligent Laboratory Systems · Pub Date: 2025-06-27 · DOI: 10.1016/j.chemolab.2025.105476

Marcelo Terán, José Javier Ruiz, Pablo Loza-Alvarez, David Masip, David Merino

Volume 264, Article 105476. Publisher page: https://www.sciencedirect.com/science/article/pii/S0169743925001613

Citations: 0

Abstract

Raman spectroscopy combined with Multivariate Curve Resolution (MCR) analysis is widely used in biomedical applications. However, assigning biomolecules to the components extracted by MCR can be challenging because no open Raman spectral library for biomolecules exists. Raman experts typically identify unmixed component spectra as biomolecules by comparing them with reference spectra from the literature. This process can be time-consuming and subject to human bias. In this work, we created an open Raman spectral database of 140 biomolecules by implementing an algorithm that digitizes the spectral plots and the most relevant peaks from articles available in the literature. Additionally, we implemented two search algorithms. The first uses the spectral linear kernel, or cosine similarity, on the full spectra. The second is based on peak matching and relies on the intersection over union of the matched peaks, with a defined tolerance for peak matching. Our experimental validation showed 100% top-10 accuracy in molecule identification (e.g. collagen) and 100% accuracy in molecule type identification (e.g. protein), both in pure-biomolecule measurements and when replicating results from prior studies. Objectively narrowing the identification to the top 10 ranked candidates and providing type identification can significantly reduce both the time required for visual identification and the need to purchase reference component samples. We publish our spectral library as an open-source tool so that it can be expanded collaboratively by the research community. It is available at: https://github.com/mteranm/ramanbiolib.
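The two search strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the ramanbiolib implementation: the function names, the greedy one-to-one peak matching, the default tolerance of 5 cm^-1, and the toy peak positions are all assumptions made for illustration, and the cosine score assumes both spectra are already sampled on the same wavenumber grid.

```python
import numpy as np


def cosine_similarity(query, reference):
    """Cosine similarity of two spectra sampled on the same wavenumber grid."""
    q = np.asarray(query, dtype=float)
    r = np.asarray(reference, dtype=float)
    denom = np.linalg.norm(q) * np.linalg.norm(r)
    # An all-zero spectrum has no direction; score it as 0 by convention.
    return float(q @ r / denom) if denom > 0 else 0.0


def peak_iou(query_peaks, reference_peaks, tolerance=5.0):
    """Intersection over union of two peak lists (positions in cm^-1).

    Two peaks match when they differ by at most `tolerance`; each peak
    is matched at most once (greedy two-pointer pass over sorted lists).
    """
    q = sorted(query_peaks)
    r = sorted(reference_peaks)
    matched = 0
    i = j = 0
    while i < len(q) and j < len(r):
        if abs(q[i] - r[j]) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif q[i] < r[j]:
            i += 1
        else:
            j += 1
    union = len(q) + len(r) - matched
    return matched / union if union else 0.0


def rank_candidates(query_peaks, library, tolerance=5.0, top_k=10):
    """Return the top_k (name, score) pairs, best peak-IoU score first."""
    scores = [
        (name, peak_iou(query_peaks, peaks, tolerance))
        for name, peaks in library.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]


# Toy library with made-up peak positions (not taken from the database).
toy_library = {
    "molecule_a": [1003, 1250, 1448],
    "molecule_b": [500, 600],
}
ranking = rank_candidates([1000, 1450, 1660], toy_library)
```

In this toy case two of the three query peaks fall within tolerance of peaks in `molecule_a`, so its score is 2 / (3 + 3 - 2) = 0.5, while `molecule_b` shares no peaks and scores 0. Returning a ranked short list rather than a single best hit mirrors the top-10 evaluation described in the abstract.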

Source journal

Chemometrics and Intelligent Laboratory Systems (Engineering & Technology: Analytical Chemistry)

CiteScore: 7.50

Self-citation rate: 7.70%

Articles published: 169

Review time: 3.4 months

About the journal: Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on the development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines. Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data. The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, Systems Biology, -Omics, etc.).
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration, etc.). Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well-characterized data sets to test the performance of new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.