BioSequence2Vec: Efficient Embedding Generation For Biological Sequences

Advances in Knowledge Discovery and Data Mining : 21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings. Part I. Pacific-Asia Conference on Knowledge Discovery and Data Mining (21st : 2017 : Cheju Isl... Pub Date : 2023-04-01 DOI:10.48550/arXiv.2304.00291

Sarwan Ali, Usama Sardar, Murray Patterson, Imdadullah Khan

{"title":"BioSequence2Vec: Efficient Embedding Generation For Biological Sequences","authors":"Sarwan Ali, Usama Sardar, Murray Patterson, Imdadullah Khan","doi":"10.48550/arXiv.2304.00291","DOIUrl":null,"url":null,"abstract":"Representation learning is an important step in the machine learning pipeline. Given the current biological sequencing data volume, learning an explicit representation is prohibitive due to the dimensionality of the resulting feature vectors. Kernel-based methods, e.g., SVM, are a proven efficient and useful alternative for several machine learning (ML) tasks such as sequence classification. Three challenges with kernel methods are (i) the computation time, (ii) the memory usage (storing an $n\\times n$ matrix), and (iii) the usage of kernel matrices limited to kernel-based ML methods (difficult to generalize on non-kernel classifiers). While (i) can be solved using approximate methods, challenge (ii) remains for typical kernel methods. Similarly, although non-kernel-based ML methods can be applied to kernel matrices by extracting principal components (kernel PCA), it may result in information loss, while being computationally expensive. In this paper, we propose a general-purpose representation learning approach that embodies kernel methods' qualities while avoiding computation, memory, and generalizability challenges. This involves computing a low-dimensional embedding of each sequence, using random projections of its $k$-mer frequency vectors, significantly reducing the computation needed to compute the dot product and the memory needed to store the resulting representation. Our proposed fast and alignment-free embedding method can be used as input to any distance (e.g., $k$ nearest neighbors) and non-distance (e.g., decision tree) based ML method for classification and clustering tasks. Using different forms of biological sequences as input, we perform a variety of real-world classification tasks, such as SARS-CoV-2 lineage and gene family classification, outperforming several state-of-the-art embedding and kernel methods in predictive performance.","PeriodicalId":91995,"journal":{"name":"Advances in Knowledge Discovery and Data Mining : 21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings. Part I. Pacific-Asia Conference on Knowledge Discovery and Data Mining (21st : 2017 : Cheju Isl...","volume":"6 1","pages":"173-185"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in Knowledge Discovery and Data Mining : 21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings. Part I. Pacific-Asia Conference on Knowledge Discovery and Data Mining (21st : 2017 : Cheju Isl...","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2304.00291","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Representation learning is an important step in the machine learning pipeline. Given the current biological sequencing data volume, learning an explicit representation is prohibitive due to the dimensionality of the resulting feature vectors. Kernel-based methods, e.g., SVM, are a proven efficient and useful alternative for several machine learning (ML) tasks such as sequence classification. Three challenges with kernel methods are (i) the computation time, (ii) the memory usage (storing an $n\times n$ matrix), and (iii) the usage of kernel matrices limited to kernel-based ML methods (difficult to generalize on non-kernel classifiers). While (i) can be solved using approximate methods, challenge (ii) remains for typical kernel methods. Similarly, although non-kernel-based ML methods can be applied to kernel matrices by extracting principal components (kernel PCA), it may result in information loss, while being computationally expensive. In this paper, we propose a general-purpose representation learning approach that embodies kernel methods' qualities while avoiding computation, memory, and generalizability challenges. This involves computing a low-dimensional embedding of each sequence, using random projections of its $k$-mer frequency vectors, significantly reducing the computation needed to compute the dot product and the memory needed to store the resulting representation. Our proposed fast and alignment-free embedding method can be used as input to any distance (e.g., $k$ nearest neighbors) and non-distance (e.g., decision tree) based ML method for classification and clustering tasks. Using different forms of biological sequences as input, we perform a variety of real-world classification tasks, such as SARS-CoV-2 lineage and gene family classification, outperforming several state-of-the-art embedding and kernel methods in predictive performance.

查看原文本刊更多论文

BioSequence2Vec:高效嵌入生成生物序列

表示学习是机器学习管道中的重要一步。鉴于目前的生物测序数据量，由于所得到的特征向量的维度，学习显式表示是令人望而却步的。基于核的方法，例如SVM，对于一些机器学习(ML)任务(如序列分类)是一种被证明有效和有用的替代方法。核方法的三个挑战是:(i)计算时间，(ii)内存使用(存储一个$n × n$矩阵)，以及(iii)核矩阵的使用仅限于基于核的ML方法(难以在非核分类器上推广)。虽然(i)可以用近似方法解决，但(ii)仍然是典型核方法的挑战。类似地，尽管非基于核的机器学习方法可以通过提取主成分(核PCA)应用于核矩阵，但它可能导致信息丢失，同时计算成本很高。在本文中，我们提出了一种通用的表示学习方法，它体现了核方法的品质，同时避免了计算、内存和泛化的挑战。这涉及到计算每个序列的低维嵌入，使用其k -mer频率向量的随机投影，大大减少了计算点积所需的计算和存储结果表示所需的内存。我们提出的快速且无对齐的嵌入方法可以作为任何距离(例如，$k$近邻)和非距离(例如，决策树)的ML方法的输入，用于分类和聚类任务。使用不同形式的生物序列作为输入，我们执行了各种现实世界的分类任务，例如SARS-CoV-2谱系和基因家族分类，在预测性能方面优于几种最先进的嵌入和核方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Advances in Knowledge Discovery and Data Mining : 21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings. Part I. Pacific-Asia Conference on Knowledge Discovery and Data Mining (21st : 2017 : Cheju Isl...

自引率

0.00%

发文量