classLog: Logistic regression for the classification of genetic sequences

Impact factor: 2.0; JCR quartile: Q4 (Virology)

Frontiers in Virology. Pub Date: 2023-11-06. DOI: 10.3389/fviro.2023.1215012

Michael A. Zeller, Zebulun W. Arendsee, Gavin J.D. Smith, Tavis K. Anderson

Abstract

Introduction

Sequencing and phylogenetic classification have become common tasks in human and animal diagnostic laboratories. Pathogens are routinely sequenced to identify genetic variations of diagnostic significance, and these data are used in real-time genomic contact tracing and surveillance. Under this paradigm, unprecedented volumes of data are generated that require rapid analysis to provide meaningful inference.

Methods

We present a machine-learning logistic regression pipeline that assigns classifications to genetic sequence data. The pipeline offers an intuitive, customizable approach to developing a trained prediction model that runs in linear time and generates accurate output rapidly, even with incomplete data. Our approach was benchmarked on porcine reproductive and respiratory syndrome virus (PRRSv) and swine H1 influenza A virus (IAV) datasets. Trained classifiers were tested against sequences and simulated datasets in which sequence quality was artificially degraded by 0, 10, 20, 30, and 40%.
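The idea of training a logistic regression on encoded sequences and then testing it on artificially degraded input can be sketched in a few lines of plain Python. This is a minimal illustration only, not the classLog implementation: the one-hot nucleotide encoding, the use of `N` to mark degraded positions, the two-class gradient-descent trainer, and the toy sequences are all assumptions made for this example.

```python
import math
import random

ALPHABET = "ACGT"  # assumption: nucleotide input; the package's real encoding may differ


def one_hot(seq):
    """Encode a sequence as a flat 0/1 vector; unknown symbols such as 'N' become all zeros."""
    vec = []
    for ch in seq:
        vec.extend(1.0 if ch == a else 0.0 for a in ALPHABET)
    return vec


def degrade(seq, fraction, rng):
    """Replace the given fraction of positions with 'N' to mimic missing data."""
    n_mask = round(len(seq) * fraction)
    masked = set(rng.sample(range(len(seq)), n_mask))
    return "".join("N" if i in masked else ch for i, ch in enumerate(seq))


def train(seqs, labels, epochs=200, lr=0.5):
    """Fit a binary logistic regression by batch gradient descent."""
    X = [one_hot(s) for s in seqs]
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(X, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-score)) - y  # predicted probability minus label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(seq, w, b):
    """Classify with a single dot product, so the cost is linear in sequence length."""
    score = sum(wi * xi for wi, xi in zip(w, one_hot(seq))) + b
    return 1 if score > 0 else 0


# Toy, made-up training data: two "clades" that differ at positions 0, 1 and 4.
TRAIN_SEQS = ["ACGTACGT", "ACGTACGA", "ACGTACGC", "TTGTCCGT", "TTGTCCGA", "TTGTCCGG"]
TRAIN_LABELS = [0, 0, 0, 1, 1, 1]
w, b = train(TRAIN_SEQS, TRAIN_LABELS)

noisy = degrade("TTGTCCGT", 0.25, random.Random(0))  # 2 of 8 positions masked
print(noisy, "->", predict(noisy, w, b))
```

Because a masked position contributes nothing to the dot product, the sketch still classifies a sequence as long as at least one informative position survives. That is one plausible reason a linear model can tolerate incomplete data.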

Results

When applied to poor-quality sequence data, the classifier achieved more than 85% and up to 95% accuracy on the PRRSv and swine H1 IAV HA datasets, and this increased to near-perfect accuracy when using the full dataset. The model also identifies the amino acid positions used to determine genetic clade identity through a feature selection ranking within the model. These positions can be mapped onto a maximum-likelihood phylogenetic tree, allowing clade-defining mutations to be inferred.
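One simple way to turn a fitted linear model into a ranking of sequence positions is to sum the absolute coefficients of every residue at each position. The sketch below assumes that approach; the abstract does not say how classLog computes its ranking, and the function name `rank_positions` and the coefficient values are made up for illustration.

```python
from collections import defaultdict


def rank_positions(coefs):
    """Sum |coefficient| per position and sort positions from most to least informative."""
    importance = defaultdict(float)
    for (position, _residue), weight in coefs.items():
        importance[position] += abs(weight)
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Illustrative, made-up coefficients keyed by (1-based position, residue).
toy_coefs = {
    (12, "K"): 1.5, (12, "R"): -1.5,
    (45, "S"): 0.25, (45, "T"): -0.25,
    (101, "D"): 0.75, (101, "N"): -0.5,
}

ranking = rank_positions(toy_coefs)
for position, score in ranking:
    print(position, score)
```

The top-ranked positions are the candidates one would then inspect on a phylogenetic tree as possible clade-defining sites.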

Discussion

Our approach is implemented as a Python package, with code available at https://github.com/flu-crew/classLog.
