classLog: Logistic regression for the classification of genetic sequences

Journal impact factor: 2.0; JCR quartile: Q4 (Virology)
Michael A. Zeller, Zebulun W. Arendsee, Gavin J.D. Smith, Tavis K. Anderson
Frontiers in Virology. Published 2023-11-06. DOI: 10.3389/fviro.2023.1215012 (https://doi.org/10.3389/fviro.2023.1215012)

Abstract

Introduction

Sequencing and phylogenetic classification have become common tasks in human and animal diagnostic laboratories. It is routine to sequence pathogens to identify genetic variations of diagnostic significance and to use these data in real-time genomic contact tracing and surveillance. Under this paradigm, unprecedented volumes of data are generated that require rapid analysis to provide meaningful inference.

Methods

We present a machine learning logistic regression pipeline that assigns classifications to genetic sequence data. The pipeline offers an intuitive, customizable approach to developing a trained prediction model that runs in linear time and generates accurate output rapidly, even with incomplete data. Our approach was benchmarked on porcine reproductive and respiratory syndrome virus (PRRSv) and swine H1 influenza A virus (IAV) datasets. Trained classifiers were tested against sequences and simulated datasets in which sequence quality was artificially degraded by 0, 10, 20, 30, and 40%.
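The general idea can be illustrated with a minimal sketch. It is not the classLog implementation. It one-hot encodes aligned sequences, fits a multinomial logistic regression by per-sample gradient descent, and masks a fraction of positions to imitate degraded sequence quality. The toy sequences, the clade labels, every function name, the nucleotide alphabet, and the use of `N` as the masking character are assumptions made for illustration only.

```python
import math
import random

ALPHABET = "ACGT"  # assumed alphabet; any other character (such as N) encodes as all zeros


def one_hot(seq):
    """Flatten an aligned sequence into a 0/1 vector, one slot per (position, residue)."""
    return [1.0 if ch == a else 0.0 for ch in seq for a in ALPHABET]


def degrade(seq, fraction, rng):
    """Mask roughly `fraction` of the positions with 'N' to imitate poor-quality data."""
    k = round(len(seq) * fraction)
    masked = set(rng.sample(range(len(seq)), k))
    return "".join("N" if i in masked else ch for i, ch in enumerate(seq))


def scores(W, b, x):
    """Linear score per class; the cost is linear in the number of features."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def softmax(s):
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]


def train(seqs, labels, epochs=300, lr=0.1):
    """Fit multinomial logistic regression with per-sample gradient descent."""
    classes = sorted(set(labels))
    X = [one_hot(s) for s in seqs]
    W = [[0.0] * len(X[0]) for _ in classes]
    b = [0.0] * len(classes)
    for _ in range(epochs):
        for x, y in zip(X, labels):
            probs = softmax(scores(W, b, x))
            for c, cls in enumerate(classes):
                # Gradient of the log-loss with respect to this class's score.
                err = probs[c] - (1.0 if cls == y else 0.0)
                b[c] -= lr * err
                W[c] = [w - lr * err * v for w, v in zip(W[c], x)]
    return classes, W, b


def predict(model, seq):
    classes, W, b = model
    s = scores(W, b, one_hot(seq))
    return classes[s.index(max(s))]


# Toy, made-up training data: two hypothetical clades of aligned 8-mers.
train_seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "TTGTACCT", "TTGTACCA", "TTGAACCT"]
train_labels = ["cladeA"] * 3 + ["cladeB"] * 3
model = train(train_seqs, train_labels)

# Re-classify the sequences after masking 0-40% of their positions.
rng = random.Random(0)
for fraction in (0.0, 0.1, 0.2, 0.3, 0.4):
    hits = sum(predict(model, degrade(s, fraction, rng)) == y
               for s, y in zip(train_seqs, train_labels))
    print(f"{fraction:.0%} masked: {hits}/{len(train_seqs)} correct")
```

Because a masked position contributes no active features, prediction still runs on incomplete input and relies on whichever informative positions remain. A real benchmark would evaluate on held-out sequences, not the training set as this toy loop does.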

Results

When applied to poor-quality sequence data, the classifier achieved >85% to 95% accuracy on the PRRSv and swine H1 IAV HA datasets, and this increased to near-perfect accuracy when the full dataset was used. The model also identifies the amino acid positions used to determine genetic clade identity through a feature-selection ranking within the model. These positions can be mapped onto a maximum-likelihood phylogenetic tree, allowing clade-defining mutations to be inferred.
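The abstract does not say how the feature ranking is computed. One simple possibility, shown here purely as an assumption, is to score each alignment position by the summed absolute coefficients of its one-hot features across all classes. The function name, the four-letter alphabet, and the coefficient values below are hypothetical.

```python
ALPHABET = "ACGT"  # assumed residue alphabet for this toy example


def rank_positions(weights, alphabet=ALPHABET):
    """Return 0-based alignment positions sorted by summed absolute weight, largest first."""
    n_pos = len(weights[0]) // len(alphabet)
    totals = [0.0] * n_pos
    for class_weights in weights:
        for j, w in enumerate(class_weights):
            # Feature j belongs to position j // len(alphabet) in a flattened one-hot layout.
            totals[j // len(alphabet)] += abs(w)
    return sorted(range(n_pos), key=lambda p: totals[p], reverse=True)


# Hypothetical coefficients: 2 classes, 3 positions, 4 one-hot features per position.
toy_weights = [
    [0.1, 0.0, 0.0, -0.1, 2.0, 0.0, 0.0, -2.0, 0.5, 0.0, -0.5, 0.0],
    [-0.1, 0.0, 0.0, 0.1, -2.0, 0.0, 0.0, 2.0, -0.5, 0.0, 0.5, 0.0],
]
ranking = rank_positions(toy_weights)
print(ranking)  # positions ordered from most to least influential
```

In this toy case the middle position carries the largest coefficients and ranks first. Top-ranked positions are the candidates one would then inspect on a phylogenetic tree.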

Discussion

Our approach is implemented as a Python package, with code available at https://github.com/flu-crew/classLog.
