Classification of HCV infections through sequence image normalization

S. Basodi, P. Icer, P. Skums, Y. Khudyakov, A. Zelikovsky, Yi Pan
Published in: IEEE International Conference on Computational Advances in Bio and Medical Sciences (proceedings), 2017-10-01
DOI: https://doi.org/10.1109/ICCABS.2017.8114313
Citations: 2

Abstract

Identifying Hepatitis C virus (HCV) infections is crucial for detecting viral outbreaks. Because HCV is highly mutable, infections tend to become chronic over time, and the population of genetically related but heterogeneous HCV variants within an infected individual grows. To our knowledge, no reliable diagnostic assay distinguishes acute from chronic HCV infection. A robust classification scheme for staging viral infection requires identifying prominent features, which in this case can be done using domain knowledge. Simple genetic heterogeneity metrics are not sufficient to represent HCV infections accurately as features for classification algorithms, because intra-host populations develop a complex structure shaped by bouts of selective sweeps and negative selection during chronic infection [1], [2]. Although some machine learning models are known to classify sequence data well, applying them directly to viral genomic data is problematic, since the number of viral sequences and the structure of the intra-host viral population vary from sample to sample. We propose a novel preprocessing approach that transforms irregular viral genomic data into normalized image data. This representation makes it possible to apply powerful machine learning algorithms to the problem of classifying recent and chronic HCV infections. Our dataset consists of intra-host HCV populations of the highly heterogeneous genomic region HVR1, sampled by next-generation sequencing from 108 recently and 257 chronically infected individuals. We train several classification models on the transformed image data using stratified 10-fold cross-validation. The SVM classification model achieves the highest accuracy, 98%, with precision, recall, and F1 score all above 95% for both acutely and chronically HCV-infected individuals.
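The abstract does not describe the transformation itself, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes each sample is a list of aligned, equal-length nucleotide strings and maps it to a 4 x L grid of per-position base frequencies, so that samples with different numbers of sequences yield images of the same shape. It also shows one simple way to build stratified folds for the 108 recent and 257 chronic samples mentioned above. The function names `to_frequency_image` and `stratified_folds` are hypothetical.

```python
from collections import Counter
import random

NUCLEOTIDES = "ACGT"


def to_frequency_image(sequences):
    """Map aligned, equal-length sequences to a 4 x L grid of frequencies in [0, 1].

    Hypothetical stand-in for the paper's normalization step: the grid shape
    depends only on the alignment length, not on how many sequences were sampled.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    image = []
    for base in NUCLEOTIDES:
        row = []
        for pos in range(length):
            # Count which bases occur at this alignment position.
            column = Counter(s[pos] for s in sequences)
            row.append(column[base] / len(sequences))
        image.append(row)
    return image


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds that each keep the overall class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so per-class counts differ by at most one.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Two toy samples with different numbers of sequences give same-shaped images.
small = to_frequency_image(["ACGT", "ACGA"])
large = to_frequency_image(["ACGT", "ACGT", "TCGT", "ACGA"])

# Class sizes taken from the abstract: 108 recent and 257 chronic infections.
labels = ["recent"] * 108 + ["chronic"] * 257
folds = stratified_folds(labels, k=10)
recent_per_fold = [sum(labels[i] == "recent" for i in fold) for fold in folds]
```

With these class sizes, every fold holds 10 or 11 recent samples and 25 or 26 chronic ones, so each held-out fold reflects the roughly 30/70 class imbalance of the full dataset. In practice the flattened images would be passed to a classifier such as an SVM, which this sketch leaves out.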