{"title":"在HPC平台上执行MapReduce科学数据分析的经验","authors":"Diana Moise","doi":"10.1145/2912152.2912154","DOIUrl":null,"url":null,"abstract":"The growing interest in being able to apply Big Data techniques to scientific data generated using HPC simulations led to the question of whether this is achievable on the same HPC platform, and if so, what is the performance that can be obtained on these systems. The motivation behind this approach is twofold: scientific datasets are often very large, and would take a long time to transfer to external Big Data clusters; furthermore, the ability to perform live analysis on the data as it is being generated on the HPC platform can be crucial to many scientific applications. Using as case-study a Hadoop-based application that analyzes Molecular Dynamics simulations data on the same HPC platform on which it was produced, we present our experiences with performing Big Data analysis on an HPC system. This work also describes the challenges that one has to deal with when performing Hadoop-based computations on scientific data on HPC platforms: data storage, data formats, ingesting data in Hadoop, optimizing the deployment to overcome the limitations of the HPC environment. Our work shows in a first phase that such an instantiation of Big Data analysis on an HPC system is both relevant and feasible; in a second phase, we greatly improve the performance by efficient configuration of HPC resources and tuning of the application. Our findings can be shared as best practices and recommendations in the context of the convergence of the HPC and Big Data environments.","PeriodicalId":443897,"journal":{"name":"Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Experiences with Performing MapReduce Analysis of Scientific Data on HPC Platforms\",\"authors\":\"Diana Moise\",\"doi\":\"10.1145/2912152.2912154\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The growing interest in being able to apply Big Data techniques to scientific data generated using HPC simulations led to the question of whether this is achievable on the same HPC platform, and if so, what is the performance that can be obtained on these systems. The motivation behind this approach is twofold: scientific datasets are often very large, and would take a long time to transfer to external Big Data clusters; furthermore, the ability to perform live analysis on the data as it is being generated on the HPC platform can be crucial to many scientific applications. Using as case-study a Hadoop-based application that analyzes Molecular Dynamics simulations data on the same HPC platform on which it was produced, we present our experiences with performing Big Data analysis on an HPC system. This work also describes the challenges that one has to deal with when performing Hadoop-based computations on scientific data on HPC platforms: data storage, data formats, ingesting data in Hadoop, optimizing the deployment to overcome the limitations of the HPC environment. Our work shows in a first phase that such an instantiation of Big Data analysis on an HPC system is both relevant and feasible; in a second phase, we greatly improve the performance by efficient configuration of HPC resources and tuning of the application. Our findings can be shared as best practices and recommendations in the context of the convergence of the HPC and Big Data environments.\",\"PeriodicalId\":443897,\"journal\":{\"name\":\"Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing\",\"volume\":\"9 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2912152.2912154\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2912152.2912154","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Experiences with Performing MapReduce Analysis of Scientific Data on HPC Platforms
The growing interest in being able to apply Big Data techniques to scientific data generated using HPC simulations led to the question of whether this is achievable on the same HPC platform, and if so, what is the performance that can be obtained on these systems. The motivation behind this approach is twofold: scientific datasets are often very large, and would take a long time to transfer to external Big Data clusters; furthermore, the ability to perform live analysis on the data as it is being generated on the HPC platform can be crucial to many scientific applications. Using as case-study a Hadoop-based application that analyzes Molecular Dynamics simulations data on the same HPC platform on which it was produced, we present our experiences with performing Big Data analysis on an HPC system. This work also describes the challenges that one has to deal with when performing Hadoop-based computations on scientific data on HPC platforms: data storage, data formats, ingesting data in Hadoop, optimizing the deployment to overcome the limitations of the HPC environment. Our work shows in a first phase that such an instantiation of Big Data analysis on an HPC system is both relevant and feasible; in a second phase, we greatly improve the performance by efficient configuration of HPC resources and tuning of the application. Our findings can be shared as best practices and recommendations in the context of the convergence of the HPC and Big Data environments.