Distributed Parallel Analysis Engine for High Energy Physics Using AWS Lambda
Jacek Kusnierz, M. Malawski, V. Padulano, E. T. Saavedra, P. Alonso-Jordá
Proceedings of the 1st Workshop on High Performance Serverless Computing
Publication date: 2020-06-25
DOI: 10.1145/3452413.3464788 (https://doi.org/10.1145/3452413.3464788)
Citation count: 3
Abstract
High-energy physics experiments at CERN produce large volumes of data, and no single machine can analyze a substantial portion of it within a reasonable time. The ROOT framework was recently extended with distributed computing capabilities for massively parallel RDataFrame applications. This approach, built on the MapReduce pattern, makes heavy computations much more approachable, even for newcomers. This paper explores the possibility of running such analyses on serverless services in a public cloud, using a purely stateless environment. So far, the distributed approaches used by RDataFrame have relied on stateful, fully managed computing frameworks such as Apache Spark. Here we show that our newly developed tool can run on fully stateless cloud functions, and our benchmarks demonstrate excellent speedup in the parallel stage of processing.
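
The MapReduce pattern mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the ROOT, RDataFrame, or AWS Lambda API, and it is not the authors' implementation: the function names (`split_ranges`, `mapper`, `reducer`, `run`) and the toy "event" data are assumptions made purely for illustration. The point it shows is that each mapper receives only an entry range and returns a partial histogram, so it keeps no state between calls and could in principle run as an independent cloud function.

```python
from functools import reduce


def split_ranges(n_entries, n_chunks):
    """Split [0, n_entries) into n_chunks nearly equal half-open ranges."""
    base, extra = divmod(n_entries, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional entry each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def mapper(entry_range, n_bins=4):
    """Stateless map step: build a partial histogram for one entry range.

    Hypothetical stand-in for reading events and filling a histogram;
    here the "observable" of an entry is simply its index modulo n_bins.
    """
    start, stop = entry_range
    counts = [0] * n_bins
    for entry in range(start, stop):
        counts[entry % n_bins] += 1
    return counts


def reducer(left, right):
    """Reduce step: merge two partial histograms bin by bin."""
    return [a + b for a, b in zip(left, right)]


def run(n_entries, n_chunks):
    """Map over all ranges, then fold the partial results into one."""
    partials = map(mapper, split_ranges(n_entries, n_chunks))
    return reduce(reducer, partials)
```

In this sketch the built-in `map` runs the mappers sequentially in one process. Because each call depends only on its argument, the same decomposition would allow the map step to be dispatched to separate workers, with only the merge performed centrally.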