Fast training on large genomics data using distributed Support Vector Machines

2016 8th International Conference on Communication Systems and Networks (COMSNETS) Pub Date: 2016-01-01 DOI: 10.1109/COMSNETS.2016.7439943

Nawanol Theera-Ampornpunt, Seong Gon Kim, Asish Ghoshal, S. Bagchi, A. Grama, S. Chaterji


Citations: 3

Abstract

The field of genomics has seen an explosion of high-quality data, driven by tremendous strides in genomic sequencing instruments and in the computational genomics applications meant to make sense of the data. A common use case for genomics data is to determine whether a specific genetic signature is correlated with some disease manifestation. The Support Vector Machine (SVM) is a widely used classifier in the computational literature, and previous studies have applied SVMs successfully to this use case. However, SVMs suffer from a widely recognized scalability problem in both memory use and computation time. Whether training such classifiers can scale to the massive sizes that characterize many genomics data sets is as yet an unanswered question. We answer that question here for a specific dataset, in which the task is to decipher whether a regulatory module with a particular combinatorial epigenetic "pattern" will regulate the expression of a gene. The specifics of the dataset, however, are likely of less relevance to the claims of our work. We take a proposed theoretical technique for efficient training of SVMs, namely Cascade SVM, create our classifier called EP-SVM, and empirically evaluate how it scales to the large genomics dataset. We implement Cascade SVM on the Apache Spark platform and open-source this implementation. Through our evaluation, we bring out the computational cost on each application process, the way the overall workload is distributed among multiple processes, which can potentially execute on different cores or different machines, and the cost of data transfer to different cores or different machines. We believe we are the first to shed light on the computational and network costs of training an SVM on a multi-dimensional genomics dataset. We also evaluate the accuracy of the classifier result as a function of the parameters of the SVM model.
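
To make the cascade idea in the abstract concrete, the following is a minimal single-process sketch in plain Python. It is not the authors' EP-SVM and not their Spark implementation. It only illustrates the general Cascade SVM structure: partition the data, train an SVM on each partition, keep only the support vectors, merge the survivors pairwise, retrain, and repeat until one set remains. Several things here are assumptions made for illustration: the linear kernel, the dual coordinate descent trainer, the absence of the feedback pass that some descriptions of Cascade SVM include, the synthetic two-cluster data, and every function name (`train_linear_svm`, `keep_support_vectors`, `cascade_svm`, `predict`, `make_toy_data`).

```python
import random


def train_linear_svm(points, labels, C=1.0, epochs=50):
    """Train a linear soft-margin SVM by dual coordinate descent.

    A constant feature 1.0 is appended to every point so the bias is
    learned as the last weight. Returns (weights, alphas).
    """
    X = [list(p) + [1.0] for p in points]
    w = [0.0] * len(X[0])
    alpha = [0.0] * len(X)
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(X, labels)):
            q_ii = sum(v * v for v in x)  # >= 1 because of the constant feature
            grad = y * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
            new_alpha = min(max(alpha[i] - grad / q_ii, 0.0), C)
            step = (new_alpha - alpha[i]) * y
            if step != 0.0:
                w = [wj + step * xj for wj, xj in zip(w, x)]
                alpha[i] = new_alpha
    return w, alpha


def keep_support_vectors(points, labels, C):
    """Train on one subset and return only the points with nonzero alpha."""
    _, alpha = train_linear_svm(points, labels, C)
    kept = [i for i, a in enumerate(alpha) if a > 1e-9]
    return [points[i] for i in kept], [labels[i] for i in kept]


def cascade_svm(points, labels, n_partitions=4, C=1.0):
    """Single-pass cascade: returns (weights, number of points in the last layer)."""
    if not 1 <= n_partitions <= len(points):
        raise ValueError("n_partitions must be between 1 and len(points)")
    # Layer 0: round-robin partitioning, then one independent SVM per partition.
    subsets = [(points[k::n_partitions], labels[k::n_partitions])
               for k in range(n_partitions)]
    subsets = [keep_support_vectors(p, l, C) for p, l in subsets]
    # Later layers: merge surviving support vectors pairwise and retrain.
    while len(subsets) > 1:
        merged = [keep_support_vectors(a[0] + b[0], a[1] + b[1], C)
                  for a, b in zip(subsets[0::2], subsets[1::2])]
        if len(subsets) % 2:  # an unpaired subset moves up unchanged
            merged.append(subsets[-1])
        subsets = merged
    final_points, final_labels = subsets[0]
    w, _ = train_linear_svm(final_points, final_labels, C)
    return w, len(final_points)


def predict(w, point):
    """Classify one point as +1 or -1 using the learned weights."""
    score = sum(wj * xj for wj, xj in zip(w, list(point) + [1.0]))
    return 1 if score >= 0 else -1


def make_toy_data(n=200, seed=0):
    """Two well-separated 2-D Gaussian clusters (synthetic, not genomics data)."""
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n):
        y = rng.choice([-1, 1])
        points.append((rng.gauss(3.0 * y, 0.5), rng.gauss(3.0 * y, 0.5)))
        labels.append(y)
    return points, labels


if __name__ == "__main__":
    pts, ys = make_toy_data()
    w, n_kept = cascade_svm(pts, ys, n_partitions=4)
    acc = sum(predict(w, p) == y for p, y in zip(pts, ys)) / len(pts)
    print(f"kept {n_kept} of {len(pts)} points, training accuracy {acc:.2f}")
```

In this sketch each call to `keep_support_vectors` within a layer is independent of the others, so those calls are the natural unit of work to place on separate cores or machines, and the support vector sets passed between layers are the data that would have to be transferred. These correspond to the per-process computation and data-transfer costs that the abstract says the evaluation measures.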


