Extensive experimentation with eACID

S. Hussain, M. Qadir
2009 International Conference on Emerging Technologies. Published 2009-12-11. DOI: 10.1109/ICET.2009.5353142 (https://doi.org/10.1109/ICET.2009.5353142)

Abstract

Fault monitoring is one of the main activities of fault-tolerant distributed systems. It is needed to identify suspected or crashed components and to take recovery steps proactively so that the system stays alive. The main objective of fault monitoring is to identify faults quickly and correctly. A monitoring system that is quick to declare faults increases the chance of false alarms, i.e., declaring a fault where none exists. An ideal fault monitoring system therefore identifies faults as quickly as possible without increasing false alarms. A fault monitor typically detects faults by exchanging messages with remote objects and observing the time interval between a message and its response. One of the monitor's major responsibilities is to adapt these intervals to the dynamic network and system conditions and to set them very close to the actual delays in the system. The adapted values, i.e., the timeout and the monitoring interval, must not fluctuate with large amplitude around the actual delays; otherwise the number of false alarms increases or the identification of faults is delayed. The adaptation should also converge to the actual delays very quickly. Adapting the monitoring interval in the same way as the timeout cannot be justified. A distributed system (the network or other components) may sometimes change state abruptly for a very short duration (transient behavior). The fault monitoring system should bypass such transients; otherwise, decisions taken on transients have to be reversed very quickly, which adds overhead both in taking the decision and in reverting it.
Our algorithm, eACID (enhanced Adaptive Convergent Intelligent fault monitoring in Distributed systems), yielded 16% fewer false timeouts and 9% higher utilization of responses than the best-known algorithm, ADAPTATION [Sotama et al.]. eACID adapts the timeout based on previous history, which gives a fair idea of the workload, and uses this to its advantage. The scheme does not take decisions on transient behavior of the system.
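
The abstract does not state eACID's update rule, so the following is only a minimal sketch of the general idea it describes: a timeout that tracks observed response delays from history and does not react to isolated transients. As a stand-in it uses the smoothed mean and deviation estimator familiar from TCP retransmission timers, not the authors' method. The class name `AdaptiveTimeout`, the gains, the spike threshold, and the confirmation count are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AdaptiveTimeout:
    """Illustrative adaptive timeout. This is NOT the eACID algorithm."""
    alpha: float = 0.125       # gain for the smoothed mean delay (assumed)
    beta: float = 0.25         # gain for the smoothed deviation (assumed)
    k: float = 4.0             # safety margin multiplier (assumed)
    spike_factor: float = 3.0  # delay > spike_factor * timeout is a suspected transient (assumed)
    confirm: int = 3           # consecutive spikes needed before adapting to them (assumed)
    mean: Optional[float] = None
    dev: float = 0.0
    _pending: List[float] = field(default_factory=list)

    def timeout(self) -> float:
        """Current timeout: smoothed delay plus a margin for its variability."""
        if self.mean is None:
            raise ValueError("no delay has been observed yet")
        return self.mean + self.k * self.dev

    def observe(self, delay: float) -> float:
        """Feed one measured request/response delay; return the new timeout."""
        if self.mean is None:
            # First sample initialises the estimator.
            self.mean = delay
            self.dev = delay / 2.0
            return self.timeout()

        if delay > self.spike_factor * self.timeout():
            # Suspected transient: hold the sample back instead of adapting.
            self._pending.append(delay)
            if len(self._pending) < self.confirm:
                return self.timeout()
            # The spike persisted, so treat it as a real change in conditions.
            samples, self._pending = self._pending, []
        else:
            # Normal sample: any earlier held-back spikes were transient.
            self._pending = []
            samples = [delay]

        for s in samples:
            err = s - self.mean
            self.mean += self.alpha * err
            self.dev += self.beta * (abs(err) - self.dev)
        return self.timeout()
```

In this sketch a monitor would call `observe()` with each measured delay and suspect a remote object only when a response takes longer than `timeout()`. A single outlier leaves the timeout unchanged, while a run of `confirm` consecutive outliers is absorbed as a sustained change, which is one simple way to avoid taking and then reverting decisions on transients.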