{"title":"极端尺度下流行病失效检测的高效快速近似一致性","authors":"Amogh Katti, D. Lilja","doi":"10.1109/PDP2018.2018.00045","DOIUrl":null,"url":null,"abstract":"This paper proposes a memory efficient failure detection and consensus algorithm, for fail-stop type process failures, based on epidemic protocols. It is suitable for extreme scale systems with reliable networks (no message loss) and high failure frequency. Communication time dominates the execution time at scale. The redundant failure detections and non-uniform information dissemination speed of epidemic algorithms make approximate epidemic-based consensus detection a useful way to trade communication overhead for accuracy. An approximate technique to the consensus detection is also proposed in this paper for faster consensus detection. Results show that the algorithm detects consensus correctly on failed processes with logarithmic scalability. The algorithm is tolerant to process failures both before and during the execution and the number of failures (occurring both before and during execution) have virtually no effect on the consensus detection time at scale. Comparison with similar deterministic consensus detection technique shows that the algorithm detects consensus at the same time with high probability. Further, benefits of the proposed approximate technique increase as system size increases. Compared to the non-approximate version, for a system size of 218 processes, the communication saved is 34% with accuracy loss of the order of 10^-4 in consensus detection.","PeriodicalId":333367,"journal":{"name":"2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Efficient and Fast Approximate Consensus with Epidemic Failure Detection at Extreme Scale\",\"authors\":\"Amogh Katti, D. Lilja\",\"doi\":\"10.1109/PDP2018.2018.00045\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper proposes a memory efficient failure detection and consensus algorithm, for fail-stop type process failures, based on epidemic protocols. It is suitable for extreme scale systems with reliable networks (no message loss) and high failure frequency. Communication time dominates the execution time at scale. The redundant failure detections and non-uniform information dissemination speed of epidemic algorithms make approximate epidemic-based consensus detection a useful way to trade communication overhead for accuracy. An approximate technique to the consensus detection is also proposed in this paper for faster consensus detection. Results show that the algorithm detects consensus correctly on failed processes with logarithmic scalability. The algorithm is tolerant to process failures both before and during the execution and the number of failures (occurring both before and during execution) have virtually no effect on the consensus detection time at scale. Comparison with similar deterministic consensus detection technique shows that the algorithm detects consensus at the same time with high probability. Further, benefits of the proposed approximate technique increase as system size increases. Compared to the non-approximate version, for a system size of 218 processes, the communication saved is 34% with accuracy loss of the order of 10^-4 in consensus detection.\",\"PeriodicalId\":333367,\"journal\":{\"name\":\"2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-03-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PDP2018.2018.00045\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDP2018.2018.00045","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Efficient and Fast Approximate Consensus with Epidemic Failure Detection at Extreme Scale
This paper proposes a memory efficient failure detection and consensus algorithm, for fail-stop type process failures, based on epidemic protocols. It is suitable for extreme scale systems with reliable networks (no message loss) and high failure frequency. Communication time dominates the execution time at scale. The redundant failure detections and non-uniform information dissemination speed of epidemic algorithms make approximate epidemic-based consensus detection a useful way to trade communication overhead for accuracy. An approximate technique to the consensus detection is also proposed in this paper for faster consensus detection. Results show that the algorithm detects consensus correctly on failed processes with logarithmic scalability. The algorithm is tolerant to process failures both before and during the execution and the number of failures (occurring both before and during execution) have virtually no effect on the consensus detection time at scale. Comparison with similar deterministic consensus detection technique shows that the algorithm detects consensus at the same time with high probability. Further, benefits of the proposed approximate technique increase as system size increases. Compared to the non-approximate version, for a system size of 218 processes, the communication saved is 34% with accuracy loss of the order of 10^-4 in consensus detection.