Nail-it-down: nailing and fixing configuration faults in cloud environments
Kalapriya Kannan, A. Bhamidipaty
ACM International Conference on Computing Frontiers, published 2013-05-14
DOI: 10.1145/2482767.2482796 (https://doi.org/10.1145/2482767.2482796)
Citations: 1
Abstract
Faults due to the configuration of resources account for the majority of errors in distributed software systems. Yet the problem of identifying faulty configurations remains largely unsolved. Current approaches to fault identification focus on event correlation techniques, which suffer from the limited granularity of the data generated by software components. As the complexity of cloud environments increases, resource sharing increases manyfold, making it even harder to isolate configuration faults through analysis of events. In this paper, we propose a scalable approach that not only identifies the presence of a configuration fault but also attempts to nail down the parameter that is the source of the observed fault. We leverage knowledge of the shared resources in the environment and use a simple matrix representation to provide near real-time analysis of the faults. This enables the solution to be used for both reactive management and automated proactive problem determination. Experiments through simulations demonstrate that our approach identifies configuration faults in less time and with greater accuracy. Our algorithm gracefully handles the complexity of the problem as the system size grows.
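The abstract does not describe the matrix representation in detail, so the following is only an illustrative sketch, not the authors' algorithm. It assumes a binary incidence matrix whose rows are components and whose columns are the configuration parameters they share, and it ranks each parameter by how many faulty components depend on it minus how many healthy ones do. The names `rank_parameters`, `usage`, and `faulty`, and the example parameter names, are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def rank_parameters(usage: Dict[str, Set[str]],
                    faulty: Set[str]) -> List[Tuple[str, int]]:
    """Rank configuration parameters as suspects for the observed faults.

    usage  maps each component to the parameters it depends on (assumed input).
    faulty is the set of components currently reporting a fault.
    """
    components = sorted(usage)
    params = sorted({p for deps in usage.values() for p in deps})

    # Binary incidence matrix: rows are components, columns are parameters.
    matrix = [[1 if p in usage[c] else 0 for p in params] for c in components]

    scores = []
    for j, param in enumerate(params):
        users = [c for i, c in enumerate(components) if matrix[i][j]]
        faulty_users = sum(1 for c in users if c in faulty)
        healthy_users = len(users) - faulty_users
        # Assumed heuristic: a parameter shared mostly by faulty components
        # is a stronger suspect than one that healthy components also use.
        scores.append((param, faulty_users - healthy_users))

    # Highest score first; ties broken alphabetically for determinism.
    scores.sort(key=lambda s: (-s[1], s[0]))
    return scores


example_usage = {
    "web": {"db.url", "cache.ttl"},
    "api": {"db.url", "auth.key"},
    "batch": {"cache.ttl"},
}
print(rank_parameters(example_usage, faulty={"web", "api"}))
```

In this toy input, `db.url` is used by both faulty components and by no healthy one, so it ranks first. A real system would need a richer model of resource sharing than this single heuristic.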