Benchmarking technology infrastructures for embarrassingly and non-embarrassingly parallel problems in biomedical domain

S. Kazmi, M. Kane, M. Krauthammer
{"title":"Benchmarking technology infrastructures for embarrassingly and non-embarrassingly parallel problems in biomedical domain","authors":"S. Kazmi, M. Kane, M. Krauthammer","doi":"10.1109/BSEC.2013.6618496","DOIUrl":null,"url":null,"abstract":"Having the advantage of large scale open source data available to us in multiple forms, the ultimate goal is to integrate these resources with gene sequence data to enhance our understanding and make viable inferences about the true nature of the processes that generate this data. We are investigating the use of open source subset of the National Institute of Health's National Library of Medicine (NIH/NLM) data for our analysis including text as well as image features to semantically link similar publications. Due to the sheer volume of data as well as the complexity of inference tasks, the initial problem is not in the analysis but lies in making a decision about the computational infrastructure to deploy and in data representation that will help accomplish our goals. Just like any other business process, reducing processing cost and time is of essence. This work benchmarks two open source platforms (A) Apache Hadoop with Apache Mahout, and (B) open source R using bigmemory package for performing non-embarrassingly parallel and embarrassingly parallel machine learning tasks. Singular Value Decomposition (SVD) and k-means are used to represent these two problem classes respectively and average task time is evaluated for the two architectures for a range of input data sizes. In addition, performance of these algorithms using sparse and dense matrix representation is also evaluated for clustering and feature extraction tasks. Our analysis shows that R is not able to process data larger than 2 giga-bytes, with an exponential performance degradation for data larger than 226 mega-bytes. Bigmemory package in R allowed processing of larger data but with similar degradation beyond 226 mega-bytes. As expected, Hadoop/Mahout did not perform well for SVD as compared to k-means due to the tightly coupled nature of data needed at each step and is only justified for processing of very large data sets.","PeriodicalId":431045,"journal":{"name":"2013 Biomedical Sciences and Engineering Conference (BSEC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 Biomedical Sciences and Engineering Conference (BSEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BSEC.2013.6618496","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

With large-scale open-source data available to us in multiple forms, our ultimate goal is to integrate these resources with gene sequence data to enhance our understanding and make viable inferences about the true nature of the processes that generate these data. We are investigating the use of an open-source subset of the National Institutes of Health's National Library of Medicine (NIH/NLM) data for our analysis, including text as well as image features, to semantically link similar publications. Due to the sheer volume of data and the complexity of the inference tasks, the initial problem lies not in the analysis but in choosing the computational infrastructure to deploy and the data representation that will help accomplish our goals. As with any other business process, reducing processing cost and time is of the essence. This work benchmarks two open-source platforms, (A) Apache Hadoop with Apache Mahout and (B) open-source R with the bigmemory package, on non-embarrassingly parallel and embarrassingly parallel machine learning tasks. Singular Value Decomposition (SVD) and k-means represent these two problem classes, respectively, and average task time is evaluated on both architectures over a range of input data sizes. In addition, the performance of these algorithms with sparse and dense matrix representations is evaluated for clustering and feature extraction tasks. Our analysis shows that base R is unable to process data larger than 2 gigabytes and degrades exponentially in performance for data larger than 226 megabytes. The bigmemory package in R allowed larger data to be processed, but with similar degradation beyond 226 megabytes. As expected, Hadoop/Mahout did not perform as well on SVD as on k-means, owing to the tightly coupled data needed at each step of the decomposition, and is only justified for processing very large data sets.
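As a concrete illustration of the single-node side of this comparison, the sketch below times k-means (the embarrassingly parallel task) and SVD (the non-embarrassingly parallel task) in R, both on an ordinary in-memory matrix and on a bigmemory `big.matrix` via `biganalytics::bigkmeans`. This is a minimal sketch, not the authors' benchmark harness: the simulated data, matrix dimensions, number of clusters, and number of singular vectors are illustrative assumptions rather than the paper's parameters.

```r
# Minimal timing sketch: embarrassingly parallel (k-means) vs.
# non-embarrassingly parallel (SVD) tasks in R, with and without bigmemory.
library(bigmemory)     # file-backed / shared-memory matrices outside R's heap
library(biganalytics)  # bigkmeans() for big.matrix objects

set.seed(42)

# Simulated dense feature matrix standing in for the NLM-derived features.
# Dimensions are assumptions; the paper sweeps a range of input sizes.
n_rows <- 10000
n_cols <- 200
x <- matrix(rnorm(n_rows * n_cols), nrow = n_rows, ncol = n_cols)

# (A) Embarrassingly parallel task: k-means clustering with base R.
t_kmeans_base <- system.time(kmeans(x, centers = 10, iter.max = 20))

# Same task on a big.matrix, which keeps the data out of R's heap and
# avoids the in-memory copying behind the degradation reported above.
bx <- as.big.matrix(x)
t_kmeans_big <- system.time(bigkmeans(bx, centers = 10, iter.max = 20))

# (B) Non-embarrassingly parallel task: SVD. Every step needs the full
# matrix, so it cannot be split into independent chunks the way the
# per-point distance computations in k-means can.
t_svd <- system.time(svd(x, nu = 50, nv = 50))

rbind(kmeans_base = t_kmeans_base,
      kmeans_big  = t_kmeans_big,
      svd_base    = t_svd)
```

Repeating this pattern over a sweep of input sizes is what yields the average-task-time comparison the abstract describes; the Hadoop/Mahout side of the study runs the equivalent jobs as distributed MapReduce tasks and is not shown here.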