{"title":"Benchmarking technology infrastructures for embarrassingly and non-embarrassingly parallel problems in biomedical domain","authors":"S. Kazmi, M. Kane, M. Krauthammer","doi":"10.1109/BSEC.2013.6618496","DOIUrl":null,"url":null,"abstract":"Having the advantage of large scale open source data available to us in multiple forms, the ultimate goal is to integrate these resources with gene sequence data to enhance our understanding and make viable inferences about the true nature of the processes that generate this data. We are investigating the use of open source subset of the National Institute of Health's National Library of Medicine (NIH/NLM) data for our analysis including text as well as image features to semantically link similar publications. Due to the sheer volume of data as well as the complexity of inference tasks, the initial problem is not in the analysis but lies in making a decision about the computational infrastructure to deploy and in data representation that will help accomplish our goals. Just like any other business process, reducing processing cost and time is of essence. This work benchmarks two open source platforms (A) Apache Hadoop with Apache Mahout, and (B) open source R using bigmemory package for performing non-embarrassingly parallel and embarrassingly parallel machine learning tasks. Singular Value Decomposition (SVD) and k-means are used to represent these two problem classes respectively and average task time is evaluated for the two architectures for a range of input data sizes. In addition, performance of these algorithms using sparse and dense matrix representation is also evaluated for clustering and feature extraction tasks. Our analysis shows that R is not able to process data larger than 2 giga-bytes, with an exponential performance degradation for data larger than 226 mega-bytes. Bigmemory package in R allowed processing of larger data but with similar degradation beyond 226 mega-bytes. As expected, Hadoop/Mahout did not perform well for SVD as compared to k-means due to the tightly coupled nature of data needed at each step and is only justified for processing of very large data sets.","PeriodicalId":431045,"journal":{"name":"2013 Biomedical Sciences and Engineering Conference (BSEC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 Biomedical Sciences and Engineering Conference (BSEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BSEC.2013.6618496","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3
Abstract
With large-scale open-source data available to us in multiple forms, our ultimate goal is to integrate these resources with gene sequence data to enhance our understanding of, and make viable inferences about, the processes that generate these data. We are investigating the use of an open-source subset of the National Institutes of Health's National Library of Medicine (NIH/NLM) data, including both text and image features, to semantically link similar publications. Due to the sheer volume of data and the complexity of the inference tasks, the initial problem lies not in the analysis itself but in choosing the computational infrastructure to deploy and the data representation that will best accomplish our goals. As in any other business process, reducing processing cost and time is of the essence. This work benchmarks two open-source platforms, (A) Apache Hadoop with Apache Mahout and (B) R with the bigmemory package, on non-embarrassingly parallel and embarrassingly parallel machine learning tasks. Singular Value Decomposition (SVD) and k-means clustering represent these two problem classes, respectively, and average task time is evaluated on both architectures over a range of input data sizes. In addition, the performance of these algorithms with sparse and dense matrix representations is evaluated for clustering and feature-extraction tasks. Our analysis shows that R cannot process data larger than 2 gigabytes and exhibits exponential performance degradation for data larger than 226 megabytes. The bigmemory package allowed R to process larger data, but with similar degradation beyond 226 megabytes. As expected, Hadoop/Mahout performed worse on SVD than on k-means because of the tightly coupled data dependencies at each step, and is justified only for processing very large data sets.
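The sketch below illustrates, in R, the kind of single-node timing sweep the abstract describes for the embarrassingly parallel arm: k-means run on data held in a bigmemory big.matrix, with average elapsed time recorded over a range of input sizes. It is a minimal sketch under stated assumptions: the abstract confirms the use of R with bigmemory and k-means, but the specific routine (biganalytics::bigkmeans), the matrix dimensions, the number of repetitions, and all variable names here are illustrative choices, not the authors' actual benchmark code.

```r
# Hypothetical benchmark sketch: time k-means on big.matrix data for several
# input sizes. Assumes the bigmemory and biganalytics packages are installed;
# sizes, k, and repetition counts are illustrative, not from the paper.

library(bigmemory)     # matrices that can live outside R's usual memory limits
library(biganalytics)  # bigkmeans(): k-means that accepts big.matrix objects

set.seed(42)

row_counts <- c(1e4, 5e4, 1e5)  # assumed sweep of input sizes (rows)
n_features <- 100               # assumed feature dimension
k          <- 10                # assumed number of clusters
reps       <- 3                 # assumed repetitions for averaging task time

timings <- sapply(row_counts, function(n) {
  # Simulate a dense feature matrix and copy it into a big.matrix,
  # the representation that let R go past its in-memory limits.
  x  <- matrix(rnorm(n * n_features), nrow = n, ncol = n_features)
  bx <- as.big.matrix(x, type = "double")

  # Average wall-clock time, analogous to the paper's "average task time".
  mean(replicate(reps,
    system.time(bigkmeans(bx, centers = k, iter.max = 20, nstart = 1))["elapsed"]
  ))
})

print(data.frame(rows = row_counts, mean_elapsed_sec = timings))
```

A similar sweep for the non-embarrassingly parallel arm could time SVD on the same data in dense form (base R's svd()) versus a sparse representation (e.g., a Matrix::sparseMatrix with a truncated solver such as irlba::irlba()), mirroring the sparse-versus-dense comparison reported in the abstract; whether the authors used these particular sparse routines is not stated there.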