Multi-class machine learning classification of PFAS in environmental water samples: a blinded test of performance on unknowns†

Impact factor: 3.5 · JCR quartile: Q3 · JCR category: ENGINEERING, ENVIRONMENTAL

Environmental science. Advances Pub Date : 2024-01-17 DOI:10.1039/D3VA00266G

Tohren C. G. Kibbey, Denis M. O'Carroll, Andrew Safulko and Greg Coyle

Environmental science. Advances, volume 3, pages 366-382. Article page: https://pubs.rsc.org/en/content/articlelanding/2024/va/d3va00266g · PDF: https://pubs.rsc.org/en/content/articlepdf/2024/va/d3va00266g?page=search

Citation count: 0

Abstract

Identifying the origin of PFAS detected in environmental samples is of great interest. This work used a blinded test to explore the ability of a recently developed multiclass classification approach to classify unknown PFAS water samples based on composition. The approach was adapted from previous work to identify similarities between the patterns of unknown samples and classes defined by the compositions of samples from more than one hundred different PFAS data sources, in addition to making an overall assessment of whether the PFAS is likely of AFFF (aqueous film-forming foam) or non-AFFF origin. Methods permitting the use of data with different subsets of analyzed PFAS components allowed the use of a training dataset of more than 13 000 samples from a highly diverse range of sites. For this work, researchers at Brown and Caldwell (BC) provided a set of 252 unknown samples to researchers at The University of Oklahoma (OU) and The University of New South Wales (UNSW) for classification. The unknown samples were provided by clients of BC and also included a number of artificial sample compositions created to test the ability of a rejection method to identify samples too unlike the training dataset for accurate classification. Unknown samples were de-identified and placed in random order before being sent to OU and UNSW researchers. Only after OU and UNSW researchers had sent their classification results to BC did BC provide the actual sample descriptions. Results showed extremely strong performance of the method, both in its ability to identify similarities between unknown samples and samples of known origin and in its ability to make more subtle distinctions among sample origins, for example recognizing unknown samples from an airport wastewater collection system as compositionally similar to known samples from another airport wastewater collection system. A rejection algorithm was tested and found able to identify artificial sample compositions as different from those in the training dataset. This is a critical feature of a practical supervised machine learning application, because it avoids misclassifying unknown samples that are unlike those in the training dataset.
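The abstract does not specify the classifier or the rejection rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a nearest-neighbour comparison of normalized compositions, restricts each comparison to the PFAS components both samples report (one simple way to handle differing analyte subsets), and rejects an unknown whose nearest training sample is farther than a threshold. The function names, the class labels, the concentrations, the `min_shared` guard, and the `reject_above` value of 0.3 are all hypothetical.

```python
import math


def distance(a, b, min_shared=2):
    """Euclidean distance between two compositions, using only the
    components reported in both samples and renormalizing over them.
    Returns infinity when the samples cannot be compared meaningfully."""
    shared = sorted(set(a) & set(b))
    if len(shared) < min_shared:
        return math.inf
    total_a = sum(a[k] for k in shared)
    total_b = sum(b[k] for k in shared)
    if total_a <= 0 or total_b <= 0:
        return math.inf
    return math.sqrt(
        sum((a[k] / total_a - b[k] / total_b) ** 2 for k in shared)
    )


def classify(unknown, training, reject_above=0.3):
    """Return (label, distance) for the nearest training sample, or
    ("rejected", distance) when even the nearest one is too far away."""
    best_label, best_dist = None, math.inf
    for label, sample in training:
        d = distance(unknown, sample)
        if d < best_dist:
            best_label, best_dist = label, d
    if best_dist > reject_above:
        return "rejected", best_dist
    return best_label, best_dist


# Hypothetical training data: (class label, concentrations by component).
# The two sources report different subsets of PFAS components.
training = [
    ("AFFF-like", {"PFOS": 60.0, "PFHxS": 30.0, "PFOA": 10.0}),
    ("non-AFFF-like", {"PFOA": 50.0, "PFBA": 30.0, "PFOS": 20.0}),
]

# Hypothetical unknown resembling the first class.
unknown = {"PFOS": 55.0, "PFHxS": 35.0, "PFOA": 10.0}

# Hypothetical artificial composition unlike anything in the training data.
artificial = {"PFOS": 0.0, "PFHxS": 0.0, "PFOA": 0.0, "PFBA": 100.0}

print(classify(unknown, training))
print(classify(artificial, training))
```

With these made-up values the first call returns the label `AFFF-like` and the second returns `rejected`. In practice a rejection threshold would need to be calibrated against the training data rather than fixed by hand, and the `min_shared` guard is there because two samples that share only one component always look identical after renormalization.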


Source journal

Environmental science. Advances

CiteScore

1.90

Self-citation rate

0.00%

Articles published