{"title":"MAMS: A Highly Reliable Policy for Metadata Service","authors":"Jiang Zhou, Yong Chen, Weiping Wang, Dan Meng","doi":"10.1109/ICPP.2015.82","DOIUrl":null,"url":null,"abstract":"Most mass data processing applications nowadays often need long, continuous, and uninterrupted data access. Parallel/distributed file systems often use multiple metadata servers to manage the global namespace and provide a reliability guarantee. With the rapid increase of data amount and system scale, the probability of hardware or software failures keeps increasing, which easily leads to multiple points of failures. Metadata service reliability has become a crucial issue as it affects file and directory operations in the event of failures. Existing reliable metadata management mechanisms can provide fault tolerance but have disadvantages in system availability, state consistence, and performance overhead. This paper introduces a new highly reliable policy called MAMS (multiple actives multiple standbys) to ensure multiple metadata service reliability in file systems. Different from traditional strategies, the MAMS divides metadata servers into different replica groups and maintains more than one standby node for failover in each group. Combining the global view with distributed protocols, the MAMS achieves an automatic state transition and service takeover. We have implemented the MAMS policy in a prototyping file system and conducted extensive tests to validate and evaluate it. The experimental results confirm that the MAMS policy can achieve a faster transparent fault tolerance in different error scenarios with less influence on metadata operations. Compared with typical designs in Hadoop Avatar, Hadoop HA, and Boom-FS file systems, the mean time to recovery (MTTR) with the MAMS was reduced by 80.23%, 65.46% and 28.13%, respectively.","PeriodicalId":423007,"journal":{"name":"2015 44th International Conference on Parallel Processing","volume":"170 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 44th International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.2015.82","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4
Abstract
Mass data processing applications increasingly require long, continuous, and uninterrupted data access. Parallel/distributed file systems often use multiple metadata servers to manage the global namespace and to provide a reliability guarantee. As data volume and system scale grow rapidly, the probability of hardware or software failures keeps increasing, which easily leads to multiple points of failure. Metadata service reliability has therefore become a crucial issue, since failures affect file and directory operations. Existing reliable metadata management mechanisms provide fault tolerance but suffer from drawbacks in system availability, state consistency, and performance overhead. This paper introduces a new highly reliable policy called MAMS (multiple actives, multiple standbys) to ensure the reliability of multiple metadata services in file systems. Unlike traditional strategies, MAMS divides metadata servers into replica groups and maintains more than one standby node in each group for failover. By combining a global view with distributed protocols, MAMS achieves automatic state transition and service takeover. We implemented the MAMS policy in a prototype file system and conducted extensive tests to validate and evaluate it. The experimental results confirm that MAMS achieves faster, transparent fault tolerance in different error scenarios with less influence on metadata operations. Compared with typical designs in the Hadoop Avatar, Hadoop HA, and Boom-FS file systems, the mean time to recovery (MTTR) with MAMS was reduced by 80.23%, 65.46%, and 28.13%, respectively.
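
To make the replica-group idea concrete, below is a minimal, hypothetical sketch of the "multiple actives, multiple standbys" structure the abstract describes: each group keeps one active metadata server plus an ordered list of standbys, and a failure of the active triggers promotion of the next standby. This is not the paper's implementation; all identifiers (ReplicaGroup, mds0, and so on) are illustrative assumptions.

```python
# Hypothetical sketch of the MAMS idea: metadata servers are partitioned
# into replica groups, each with one active server and several standbys;
# when the active fails, a standby takes over so metadata service in the
# other groups is unaffected. Names are illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass
class ReplicaGroup:
    name: str
    active: str                                         # currently serving metadata server
    standbys: list[str] = field(default_factory=list)   # warm standbys, in failover order

    def failover(self) -> str:
        """Promote the first standby to active when the active node fails."""
        if not self.standbys:
            raise RuntimeError(f"group {self.name}: no standby left for takeover")
        failed, self.active = self.active, self.standbys.pop(0)
        print(f"group {self.name}: {failed} failed, {self.active} took over")
        return self.active

# The namespace is split across groups, so a single failure only
# triggers a takeover within its own group.
groups = [
    ReplicaGroup("g0", active="mds0", standbys=["mds1", "mds2"]),
    ReplicaGroup("g1", active="mds3", standbys=["mds4", "mds5"]),
]
groups[0].failover()   # mds1 becomes active in group g0; g1 is untouched
```

In the paper's design, failure detection and the decision to promote a standby come from combining a global view with distributed protocols; the sketch above only shows the group structure and the takeover step, not that coordination logic.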