Machine Learning-Based Identification of Petroleum Distillates and Gasoline Traces Using Measured and Synthetic GC Spectra from Collected Samples.

IF 3.1 | Zone 4, Medicine | JCR Q3 | CHEMISTRY, MEDICINAL

Molecular Informatics Pub Date : 2025-08-01 DOI:10.1002/minf.70008

Omer Kaspi, Yaniv Y Avissar, Arnon Grafit, Ron Chibel, Olga Girshevitz, Hanoch Senderowitz

Molecular Informatics 44(8), e202400371. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12371388/pdf/

Citations: 0

Abstract

Fire cases involving arson are typically handled by forensic experts, who examine spectra of samples collected from fire scenes to test for the presence or absence of ignitable liquids. This is tedious work, since many cases do not involve such liquids. To facilitate this process, we developed a machine learning (ML)-based workflow for classifying samples from their gas chromatography (GC) chromatograms (referred to here as spectra). To this end, annotated spectra of 181 samples containing three groups of liquids (petroleum distillates, gasoline, and an assortment of other substances), collected from fire scenes, as well as two reference databases, were obtained from the Israeli Department of Identification and Forensic Sciences (DIFS). These spectra were used to derive ML-based classification models with three algorithms, namely k-nearest neighbors (kNN), representative spectrum, and random forest (RF), giving rise to reliable predictions. To increase the size of the dataset to a level that would enable the use of more advanced ML algorithms, we used the experimental spectra to develop a new spectra synthesis algorithm and applied it to generate a large dataset of synthetic spectra. This dataset was used to derive new kNN, RF, and representative spectrum models as well as deep learning (DL) models. On an independent test set composed entirely of "real" spectra, these models produced F1-scores of 0.74-0.95, 0.86-0.95, 0.30-0.75, and 0.85-0.96 for kNN, RF, representative spectrum, and DL, respectively. Following the completion of the work, DIFS provided us with a second set of real spectra; evaluating it with the second set of models yielded F1-scores of 0.92-0.96, 0.96-1.00, 0.71-0.82, and 0.95-0.98 for kNN, RF, representative spectrum, and DL, respectively. These results therefore suggest that, for this dataset, performance depends more on the size of the dataset used for model training than on the ML algorithm. We propose that the workflow and spectra synthesis algorithm developed in this work could be readily applied to other forensic domains where samples are characterized by spectra, either solely or in combination with other parameters.
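To make the "representative spectrum" idea and the per-class F1-score concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a class's representative spectrum is the mean of its training spectra and that a sample is assigned to the class whose representative has the highest cosine similarity; the abstract specifies neither choice. The toy chromatograms (one Gaussian peak per class at an arbitrary position) and the names `fit_representatives`, `predict`, and `f1_per_class` are hypothetical and used for illustration only.

```python
import numpy as np

def fit_representatives(X, y):
    """Return class labels and one L2-normalised mean spectrum per class.

    Assumption: 'representative spectrum' = per-class mean (not confirmed by the source).
    """
    classes = np.unique(y)
    reps = np.stack([X[y == c].mean(axis=0) for c in classes])
    reps /= np.linalg.norm(reps, axis=1, keepdims=True)
    return classes, reps

def predict(X, classes, reps):
    """Assign each spectrum to the class with the most similar representative (cosine)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return classes[np.argmax(Xn @ reps.T, axis=1)]

def f1_per_class(y_true, y_pred, classes):
    """One-vs-rest F1 = 2TP / (2TP + FP + FN) for each class."""
    scores = {}
    for c in np.asarray(classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores[c.item()] = float(2 * tp / denom) if denom else 0.0
    return scores

# Toy data only: one noisy Gaussian peak per class at an invented position.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
centers = {"gasoline": 0.25, "distillate": 0.60, "other": 0.85}

def toy_spectrum(center):
    return np.exp(-((t - center) ** 2) / 0.002) + 0.05 * rng.random(t.size)

labels = np.array([name for name in centers for _ in range(20)])
X = np.stack([toy_spectrum(centers[name]) for name in labels])

# Even-indexed samples train the representatives, odd-indexed samples test them.
classes, reps = fit_representatives(X[::2], labels[::2])
y_pred = predict(X[1::2], classes, reps)
scores = f1_per_class(labels[1::2], y_pred, classes)
print(scores)
```

Reporting F1 per class, as in this sketch, is one way to obtain a range of scores for each algorithm such as those quoted in the abstract; whether the authors computed their ranges this way is an assumption.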


Source journal

Molecular Informatics CHEMISTRY, MEDICINAL-MATHEMATICAL & COMPUTATIONAL BIOLOGY

CiteScore

7.30

Self-citation rate

2.80%

Articles published

Review time

3 months

Journal description: Molecular Informatics is a peer-reviewed, international forum for publication of high-quality, interdisciplinary research on all molecular aspects of bio/cheminformatics and computer-assisted molecular design. Molecular Informatics succeeded QSAR & Combinatorial Science in 2010. Molecular Informatics presents methodological innovations that will lead to a deeper understanding of ligand-receptor interactions, macromolecular complexes, molecular networks, design concepts and processes that demonstrate how ideas and design concepts lead to molecules with a desired structure or function, preferably including experimental validation. The journal's scope includes but is not limited to the fields of drug discovery and chemical biology, protein and nucleic acid engineering and design, the design of nanomolecular structures, strategies for modeling of macromolecular assemblies, molecular networks and systems, pharmaco- and chemogenomics, computer-assisted screening strategies, as well as novel technologies for the de novo design of biologically active molecules. As a unique feature, Molecular Informatics publishes so-called "Methods Corner" review-type articles, which feature important technological concepts and advances within the scope of the journal.