BioSequence2Vec: Efficient Embedding Generation For Biological Sequences

Sarwan Ali, Usama Sardar, Murray Patterson, Imdadullah Khan
DOI: 10.48550/arXiv.2304.00291 (https://doi.org/10.48550/arXiv.2304.00291)
Publication date: 2023-04-01
Venue (as listed): Advances in Knowledge Discovery and Data Mining : 21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings. Part I. Pacific-Asia Conference on Knowledge Discovery and Data Mining (21st : 2017 : Cheju Isl...

Abstract

Representation learning is an important step in the machine learning pipeline. Given the current volume of biological sequencing data, learning an explicit representation is prohibitive because of the dimensionality of the resulting feature vectors. Kernel-based methods, e.g., SVM, are a proven, efficient, and useful alternative for several machine learning (ML) tasks such as sequence classification. Kernel methods pose three challenges: (i) computation time, (ii) memory usage (storing an $n\times n$ matrix), and (iii) the restriction of kernel matrices to kernel-based ML methods (they are difficult to generalize to non-kernel classifiers). While (i) can be addressed with approximate methods, challenge (ii) remains for typical kernel methods. Similarly, although non-kernel-based ML methods can be applied to kernel matrices by extracting principal components (kernel PCA), doing so may lose information and is computationally expensive. In this paper, we propose a general-purpose representation learning approach that retains the qualities of kernel methods while avoiding the computation, memory, and generalizability challenges. The approach computes a low-dimensional embedding of each sequence using random projections of its $k$-mer frequency vector, significantly reducing both the computation needed for the dot product and the memory needed to store the resulting representation. The proposed fast, alignment-free embedding can be used as input to any distance-based (e.g., $k$ nearest neighbors) or non-distance-based (e.g., decision tree) ML method for classification and clustering tasks. Using different forms of biological sequences as input, we perform a variety of real-world classification tasks, such as SARS-CoV-2 lineage and gene family classification, outperforming several state-of-the-art embedding and kernel methods in predictive performance.
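
The idea of projecting $k$-mer frequency vectors into a low-dimensional space can be illustrated with a minimal sketch. The abstract does not specify the projection used, so the code below assumes a generic dense Gaussian random projection; it is not the authors' algorithm. The function names (`kmer_counts`, `make_projection`, `embed`, `dot`), the DNA alphabet, and the values of `k` and `dim` are all hypothetical choices made for illustration.

```python
import itertools
import math
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def make_projection(alphabet, k, dim, seed=0):
    """Assign each possible k-mer a random Gaussian row of length dim.

    Assumption: a plain Gaussian projection, scaled by 1/sqrt(dim) so that
    the dot product of two embeddings equals, in expectation, the dot
    product of the two full k-mer count vectors.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return {
        "".join(kmer): [rng.gauss(0.0, 1.0) * scale for _ in range(dim)]
        for kmer in itertools.product(alphabet, repeat=k)
    }


def embed(seq, k, projection, dim):
    """Project the k-mer count vector of seq down to dim dimensions."""
    vec = [0.0] * dim
    for kmer, count in kmer_counts(seq, k).items():
        row = projection.get(kmer)
        if row is None:
            # Skip k-mers containing characters outside the alphabet.
            continue
        for j in range(dim):
            vec[j] += count * row[j]
    return vec


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    projection = make_projection("ACGT", k=3, dim=16)
    a = embed("ACGTACGTAC", 3, projection, 16)
    b = embed("ACGTTCGTAC", 3, projection, 16)
    print(len(a), dot(a, b))
```

Under these assumptions, each sequence is stored as a vector of length `dim`, so $n$ sequences need $n \times \text{dim}$ numbers rather than an $n \times n$ kernel matrix, and the resulting vectors can be passed to any classifier that accepts fixed-length feature vectors.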