{"title":"验证Apache Hadoop的动态检查点机制与失败场景","authors":"Paulo Vinicius Cardoso, P. Barcelos","doi":"10.1109/LATW.2018.8347240","DOIUrl":null,"url":null,"abstract":"New computational paradigms have created data intensive applications which have a demand for efficient and reliable processing platforms. High performance systems, used to answer this demand, have a increasing number of components such as nodes and cores. A multi component system may suffer with reliability and availability issues once the mean time between failures become smaller. Checkpoint and Recovery (CR) is a fault tolerance technique based on backward error recovery that focus on retrieving system safety state from backup saves. This paper shows the Checkpoint and Recovery technique implemented by Apache Hadoop, a framework that allows distributed processing of large datasets across clusters of computers. Hadoop uses the checkpoint technique to provides fault tolerance on Hadoop Distributed File System (HDFS). However, choosing an appropriate checkpoint interval is a major challenge once Hadoop defines the CR attributes statically. Then we propose a dynamic solution for checkpoint attributes configuration on HDFS, whose goal is to make it adaptable to system usage context. We expose a validation of both static and dynamic mechanisms on failure induced scenarios with DataNode crashes in order to determine the overhead of checkpoint and recovery steps.","PeriodicalId":236190,"journal":{"name":"2018 IEEE 19th Latin-American Test Symposium (LATS)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Validation of a dynamic checkpoint mechanism for Apache Hadoop with failure scenarios\",\"authors\":\"Paulo Vinicius Cardoso, P. Barcelos\",\"doi\":\"10.1109/LATW.2018.8347240\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"New computational paradigms have created data intensive applications which have a demand for efficient and reliable processing platforms. High performance systems, used to answer this demand, have a increasing number of components such as nodes and cores. A multi component system may suffer with reliability and availability issues once the mean time between failures become smaller. Checkpoint and Recovery (CR) is a fault tolerance technique based on backward error recovery that focus on retrieving system safety state from backup saves. This paper shows the Checkpoint and Recovery technique implemented by Apache Hadoop, a framework that allows distributed processing of large datasets across clusters of computers. Hadoop uses the checkpoint technique to provides fault tolerance on Hadoop Distributed File System (HDFS). However, choosing an appropriate checkpoint interval is a major challenge once Hadoop defines the CR attributes statically. Then we propose a dynamic solution for checkpoint attributes configuration on HDFS, whose goal is to make it adaptable to system usage context. 
We expose a validation of both static and dynamic mechanisms on failure induced scenarios with DataNode crashes in order to determine the overhead of checkpoint and recovery steps.\",\"PeriodicalId\":236190,\"journal\":{\"name\":\"2018 IEEE 19th Latin-American Test Symposium (LATS)\",\"volume\":\"116 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE 19th Latin-American Test Symposium (LATS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/LATW.2018.8347240\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 19th Latin-American Test Symposium (LATS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/LATW.2018.8347240","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Validation of a dynamic checkpoint mechanism for Apache Hadoop with failure scenarios
Abstract: New computational paradigms have created data-intensive applications that demand efficient and reliable processing platforms. The high-performance systems built to meet this demand have an increasing number of components, such as nodes and cores, and a multi-component system may suffer from reliability and availability issues as its mean time between failures shrinks. Checkpoint and Recovery (CR) is a fault tolerance technique based on backward error recovery that focuses on restoring a safe system state from saved backups. This paper examines the Checkpoint and Recovery technique implemented by Apache Hadoop, a framework for distributed processing of large datasets across clusters of computers. Hadoop uses checkpointing to provide fault tolerance for the Hadoop Distributed File System (HDFS). However, choosing an appropriate checkpoint interval is a major challenge because Hadoop defines the CR attributes statically. We therefore propose a dynamic solution for configuring checkpoint attributes on HDFS, whose goal is to make them adaptable to the system usage context. We present a validation of both the static and dynamic mechanisms under failure-induced scenarios with DataNode crashes in order to determine the overhead of the checkpoint and recovery steps.
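For context, the "CR attributes" that Hadoop fixes statically correspond to HDFS checkpoint configuration properties. The minimal sketch below shows how such attributes are typically set through Hadoop's Configuration API; the property names and values are the standard HDFS ones (e.g. dfs.namenode.checkpoint.period), not settings taken from the paper, so treat this as an illustrative sketch of the static mechanism rather than the authors' implementation.

import org.apache.hadoop.conf.Configuration;

// Sketch: HDFS checkpoint attributes are normally fixed statically in
// hdfs-site.xml (or a Configuration object) before the NameNode starts.
// The values below are the common HDFS defaults, shown for illustration only.
public class StaticCheckpointConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Seconds between consecutive checkpoints of the namespace image.
        conf.setLong("dfs.namenode.checkpoint.period", 3600L);

        // Alternatively, trigger a checkpoint after this many uncheckpointed
        // edit-log transactions, whichever limit is reached first.
        conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000L);

        System.out.println("checkpoint period (s): "
                + conf.getLong("dfs.namenode.checkpoint.period", 3600L));
        System.out.println("checkpoint txns: "
                + conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000L));
    }
}

A dynamic mechanism, as proposed in the paper, would adjust values like these at run time based on the observed system usage context instead of leaving them fixed for the lifetime of the cluster.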