Machine learning in forensic toxicology: Applications, experiences, and future directions

Impact factor: 1.8. JCR quartile: Q4 (Toxicology)

Toxicologie Analytique et Clinique. Publication date: 2025-03-01. DOI: 10.1016/j.toxac.2025.01.014

Michael Scholz

Journal article. Toxicologie Analytique et Clinique, Volume 37, Issue 1, Pages S14-S15. Not open access. https://www.sciencedirect.com/science/article/pii/S2352007825000149

Citations: 0

Abstract

The aim is to give a basic overview of the principles of machine learning and its pitfalls, together with successful real-world examples. This should help improve machine-learning literacy within the forensic toxicology community.

The demands on forensic toxicologists are changing rapidly. In the past, it was sufficient to operate a GC-MS or LC-MS instrument, often with extremely user-unfriendly software, to obtain a result; the evaluation of the case could then begin. However, as analytical instruments have become faster, more sensitive, more versatile, and more powerful, forensic toxicology has evolved in parallel. This development has been accompanied by a rapid increase in the volume of data. The trend is particularly evident in high-resolution mass spectrometry and non-targeted analysis, in which a large number of substances can be detected in complex biological samples. Forensic toxicologists are no longer interested only in prescription or illegal drugs, but in the totality of all small molecules in the human body (the so-called metabolome). Under certain circumstances, changes in the metabolome can provide clues to drug use, cause of death, and drunk or even drowsy driving. Data sets of this size can no longer be analyzed manually.

Machine learning (ML), a subfield of artificial intelligence, has proven to be extremely powerful and promising in tackling large, complex, and high-dimensional data sets. ML can make predictions, find patterns, or classify data. The three types of machine learning are supervised, unsupervised, and reinforcement learning. The field has gained prominence over the last decade and comprises many different learning algorithms (e.g. linear regression, logistic regression, decision trees, random forests, support vector machines, naive Bayes, and others). These algorithms are now finding their way into forensic toxicology. However, this transformative technology is not without its challenges. While the underlying principles of ML are easy to understand, there are many pitfalls to avoid in order to ensure that ML actually improves results in forensic toxicology. Many easy-to-make mistakes can cause an ML model to appear to perform well when in reality it does not.
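To illustrate how a model can appear to perform well when it does not, the following is a minimal sketch rather than anything taken from the abstract. It assumes purely synthetic data (random noise features with random labels, so there is nothing to learn) and a hand-written 1-nearest-neighbour classifier; the function names `nearest_neighbour_predict` and `accuracy` are hypothetical. Scored on the data it was trained on, the model looks perfect because it simply memorises the samples; scored on held-out samples, it is close to guessing.

```python
import random


def nearest_neighbour_predict(train_X, train_y, x):
    """Return the label of the training point closest to x (1-NN)."""
    best_index = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best_index]


def accuracy(train_X, train_y, eval_X, eval_y):
    """Fraction of evaluation samples whose predicted label is correct."""
    hits = sum(
        nearest_neighbour_predict(train_X, train_y, x) == y
        for x, y in zip(eval_X, eval_y)
    )
    return hits / len(eval_y)


rng = random.Random(0)

# Hypothetical data: 300 "samples" with 20 noise features and random labels,
# so there is no real signal for any model to learn.
X = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(300)]
y = [rng.randint(0, 1) for _ in range(300)]

# Train on the first 100 samples; the remaining 200 are never seen in training.
train_X, train_y = X[:100], y[:100]
test_X, test_y = X[100:], y[100:]

train_acc = accuracy(train_X, train_y, train_X, train_y)
test_acc = accuracy(train_X, train_y, test_X, test_y)

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {test_acc:.2f}")
```

The gap between the two numbers is the point: only the held-out figure says anything about how the model would behave on new casework.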

The most common pitfalls are inadequate or non-representative training data, poor data quality, and overfitting or underfitting. It is of the utmost importance to split datasets, train algorithms, and validate results correctly. Another problem that severely affects machine-learning algorithms is the curse of dimensionality: the efficiency and effectiveness of algorithms deteriorate as the dimensionality of the data increases, because the volume of the feature space grows exponentially with the number of dimensions and the available samples become sparse. Consequently, the skilled forensic toxicologist must employ dimensionality reduction techniques. One option is to select the most relevant features from the original dataset while discarding irrelevant or redundant ones (feature selection). This reduces the dimensionality of the data, simplifying the model and improving its efficiency. Another is to transform the original high-dimensional data into a lower-dimensional space by creating new features that capture the essential information (feature extraction). It also helps to scale the features to a similar range so that no feature dominates the others, especially in distance-based algorithms. To further ensure robustness in the model training process, missing data should be addressed appropriately through imputation or deletion.
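The preprocessing steps named above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the author's workflow: the feature table (peak area, retention time, a constant flag) is invented, the helper names are hypothetical, mean imputation and a zero-variance filter stand in for the many possible imputation and feature-selection methods, and feature extraction (for example principal component analysis) is not shown. All statistics are computed on the training split only and then reused for the test split, so no information leaks across the split.

```python
import math


def column_means(rows):
    """Mean of each column, ignoring missing values (None)."""
    means = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return means


def impute(rows, means):
    """Replace missing values with the supplied column means."""
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


def column_stds(rows, means):
    """Population standard deviation of each column."""
    n = len(rows)
    return [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(zip(*rows), means)
    ]


def select_and_scale(rows, means, stds, keep):
    """Keep the chosen columns and standardise them (zero mean, unit variance)."""
    return [[(row[j] - means[j]) / stds[j] for j in keep] for row in rows]


# Hypothetical feature table: [peak area, retention time in min, constant flag].
train = [
    [120000.0, 2.1, 1.0],
    [450000.0, None, 1.0],
    [300000.0, 2.5, 1.0],
    [150000.0, 2.3, 1.0],
]
test = [[None, 2.4, 1.0]]

# Imputation: means come from the training split and are reused for the test split.
means = column_means(train)
train_filled = impute(train, means)
test_filled = impute(test, means)

# Mean imputation leaves the column means unchanged, so they can be reused here.
stds = column_stds(train_filled, means)

# Feature selection (simplest possible rule): drop columns with zero variance.
keep = [j for j, s in enumerate(stds) if s > 0.0]

# Scaling: without it, peak area (order 1e5) would swamp retention time (order 1)
# in any Euclidean distance.
train_ready = select_and_scale(train_filled, means, stds, keep)
test_ready = select_and_scale(test_filled, means, stds, keep)

print("kept columns:", keep)
print("scaled test sample:", test_ready[0])
```

Fitting the means and standard deviations on the full dataset before splitting would be one of the easy-to-make mistakes: the test split would then influence the preprocessing and the validation result would be optimistic.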

Examples of successful implementation of ML in forensic toxicology: the combination of machine learning and (high-resolution) mass spectrometry offers a synergy that can be harnessed to optimize workflows by detecting sample adulteration, to improve detection of difficult analyte groups (e.g. synthetic cannabinoid receptor agonists, SCRAs), and to optimize processing of high-dimensional data sets. This approach can help with even the most complex problems in our field, such as detecting the effects of sleepiness on the metabolome and establishing biomarkers of sleepiness.
