{"title":"Draco: Statistical diagnosis of chronic problems in large distributed systems","authors":"Soila Kavulya, S. Daniels, Kaustubh R. Joshi, M. Hiltunen, R. Gandhi, P. Narasimhan","doi":"10.1109/DSN.2012.6263927","DOIUrl":"https://doi.org/10.1109/DSN.2012.6263927","url":null,"abstract":"Chronics are recurrent problems that often fly under the radar of operations teams because they do not affect enough users or service invocations to set off alarm thresholds. In contrast with major outages that are rare, often have a single cause, and as a result are relatively easy to detect and diagnose quickly, chronic problems are elusive because they are often triggered by complex conditions, persist in a system for days or weeks, and coexist with other problems active at the same time. In this paper, we present Draco, a scalable engine to diagnose chronics that addresses these issues by using a “top-down” approach that starts by heuristically identifying user interactions that are likely to have failed, e.g., dropped calls, and drills down to identify groups of properties that best explain the difference between failed and successful interactions by using a scalable Bayesian learner. We have deployed Draco in production for the VoIP operations of a major ISP. In addition to providing examples of chronics that Draco has helped identify, we show via a comprehensive evaluation on production data that Draco provided 97% coverage, had fewer than 4% false positives, and outperformed state-of-the-art diagnostic techniques by up to 56% for complex chronics.","PeriodicalId":236791,"journal":{"name":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125227640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An empirical study of the robustness of Inter-component Communication in Android","authors":"A. Maji, F. Arshad, S. Bagchi, Jan S. Rellermeyer","doi":"10.1109/DSN.2012.6263963","DOIUrl":"https://doi.org/10.1109/DSN.2012.6263963","url":null,"abstract":"Over the last three years, Android has established itself as the largest-selling operating system for smartphones. It boasts of a Linux-based robust kernel, a modular framework with multiple components in each application, and a security-conscious design where each application is isolated in its own virtual machine. However, all of these desirable properties would be rendered ineffectual if an application were to deliver erroneous messages to targeted applications and thus cause the target to behave incorrectly. In this paper, we present an empirical evaluation of the robustness of Inter-component Communication (ICC) in Android through fuzz testing methodology, whereby, parameters of the inter-component communication are changed to various incorrect values. We show that not only exception handling is a rarity in Android applications, but also it is possible to crash the Android runtime from unprivileged user processes. Based on our observations, we highlight some of the critical design issues in Android ICC and suggest solutions to alleviate these problems.","PeriodicalId":236791,"journal":{"name":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128706540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EliMet: Security metric elicitation in power grid critical infrastructures by observing system administrators' responsive behavior","authors":"S. Zonouz, A. Houmansadr, P. Haghani","doi":"10.1109/DSN.2012.6263941","DOIUrl":"https://doi.org/10.1109/DSN.2012.6263941","url":null,"abstract":"To protect complex power-grid control networks, efficient security assessment techniques are required. However, efficiently making sure that calculated security measures match the expert knowledge is a challenging endeavor. In this paper, we present EliMet, a framework that combines information from different sources and estimates the extent to which a control network meets its security objective. Initially, during an offline phase, a state-based model of the network is generated, and security-level of each state is measured using a generic and easy-to-compute metric. EliMet then passively observes system operators' online reactive behavior against security incidents, and accordingly refines the calculated security measure values. Finally, to make the values comply with the expert knowledge, EliMet actively queries operators regarding those states for which sufficient information was not gained during the passive observation. Our experimental results show that EliMet can optimally make use of prior knowledge as well as automated inference techniques to minimize human involvement and efficiently deduce the expert knowledge regarding individual states of that particular system.","PeriodicalId":236791,"journal":{"name":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","volume":"21 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124667097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Latent fault detection in large scale services","authors":"Moshe Gabel, A. Schuster, Ran Gilad-Bachrach, N. Bjørner","doi":"10.1109/DSN.2012.6263932","DOIUrl":"https://doi.org/10.1109/DSN.2012.6263932","url":null,"abstract":"Unexpected machine failures, with their resulting service outages and data loss, pose challenges to datacenter management. Existing failure detection techniques rely on domain knowledge, precious (often unavailable) training data, textual console logs, or intrusive service modifications. We hypothesize that many machine failures are not a result of abrupt changes but rather a result of a long period of degraded performance. This is confirmed in our experiments, in which over 20% of machine failures were preceded by such latent faults. We propose a proactive approach for failure prevention. We present a novel framework for statistical latent fault detection using only ordinary machine counters collected as standard practice. We demonstrate three detection methods within this framework. Derived tests are domain-independent and unsupervised, require neither background information nor tuning, and scale to very large services. We prove strong guarantees on the false positive rates of our tests.","PeriodicalId":236791,"journal":{"name":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","volume":"362 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115934154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new symbolic approach for network reliability analysis","authors":"M. Beccuti, A. Bobbio, G. Franceschinis, R. Terruggia","doi":"10.1109/DSN.2012.6263935","DOIUrl":"https://doi.org/10.1109/DSN.2012.6263935","url":null,"abstract":"In this paper we propose an improved BDD approach to the network reliability analysis, that allows the user to compute an exact solution or an approximation based on reliability bounds when network complexity makes the former solution practically impossible. To this purpose, a new algorithm for encoding the connectivity graph on a Binary Decision Diagram (BDD) has been developed; it reduces the computation memory peak with respect to previous approaches based on the same type of data structure without increasing the execution time, and allows us also to derive from a subset of the minpaths/mincuts a lower/upper bound of the network reliability, so that the quality of the obtained approximation can be estimated. Finally, a fair and detailed comparison between our approach and another state of the art approach presented in the literature is documented through a set of benchmarks.","PeriodicalId":236791,"journal":{"name":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117222859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Error injection-based study of soft error propagation in AMD Bulldozer microprocessor module","authors":"C. Constantinescu, Mike Butler, Chris Weller","doi":"10.1109/DSN.2012.6263922","DOIUrl":"https://doi.org/10.1109/DSN.2012.6263922","url":null,"abstract":"Single-event upsets (SEU) and single-event transients (SET) may lead to crashes or even silent data corruption (SDC) in microprocessors. Error detection and recovery features are employed to mitigate the impact of SEU and SET. However, these features add performance, area, power, and cost overheads. As a result, designers must concentrate their efforts on protecting the most sensitive areas of the processor. Simulated error injection was used to study the propagation of the SEU-induced soft errors in the latest AMD microprocessor module, Bulldozer. This paper presents the Bulldozer architecture, error injection methodology, and experimental results. Propagation of soft errors is quantified by derating factors. Error injection is performed both at the module and unit level, derating factors and simulation times being compared. Accuracy is assessed by deriving confidence intervals of the derating factors. The experiments point out the most sensitive units of the Bulldozer module, and allow efficient implementation of the error-handling features.","PeriodicalId":236791,"journal":{"name":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126159296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Taming Mr Hayes: Mitigating signaling based attacks on smartphones","authors":"Collin Mulliner, Steffen Liebergeld, Matthias Lange, Jean-Pierre Seifert","doi":"10.1109/DSN.2012.6263943","DOIUrl":"https://doi.org/10.1109/DSN.2012.6263943","url":null,"abstract":"Malicious injection of cellular signaling traffic from mobile phones is an emerging security issue. The respective attacks can be performed by hijacked smartphones and by malware resident on mobile phones. Until today there are no protection mechanisms in place to prevent signaling based attacks other than implementing expensive additions to the cellular core network. In this work we present a protection system that resides on the mobile phone. Our solution works by partitioning the phone software stack into the application operating system and the communication partition. The application system is a standard fully featured Android system. On the other side, communication to the cellular network is mediated by a flexible monitoring and enforcement system running on the communication partition. We implemented and evaluated our protection system on a real smartphone. Our evaluation shows that it can mitigate all currently known signaling based attacks and in addition can protect users from cellular Trojans.","PeriodicalId":236791,"journal":{"name":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132174593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable optimal countermeasure selection using implicit enumeration on attack countermeasure trees","authors":"A. Roy, Dong Seong Kim, Kishor S. Trivedi","doi":"10.1109/DSN.2012.6263940","DOIUrl":"https://doi.org/10.1109/DSN.2012.6263940","url":null,"abstract":"Constraints such as limited security investment cost precludes a security decision maker from implementing all possible countermeasures in a system. Existing analytical model-based security optimization strategies do not prevail for the following reasons: (i) none of these model-based methods offer a way to find optimal security solution in the absence of probability assignments to the model, (ii) methods scale badly as size of the system to model increases and (iii) some methods suffer as they use attack trees (AT) whose structure does not allow for the inclusion of countermeasures while others translate the non-state-space model (e.g., attack response tree) into a state-space model hence causing state-space explosion. In this paper, we use a novel AT paradigm called attack countermeasure tree (ACT) whose structure takes into account attacks as well as countermeasures (in the form of detection and mitigation events). We use greedy and branch and bound techniques to study several objective functions with goals such as minimizing the number of countermeasures, security investment cost in the ACT and maximizing the benefit from implementing a certain countermeasure set in the ACT under different constraints. We cast each optimization problem into an integer programming problem which also allows us to find optimal solution even in the absence of probability assignments to the model. Our method scales well for large ACTs and we compare its efficiency with other approaches.","PeriodicalId":236791,"journal":{"name":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121582083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A study of soft error consequences in hard disk drives","authors":"T. Tsai, Nawanol Theera-Ampornpunt, S. Bagchi","doi":"10.1109/DSN.2012.6263936","DOIUrl":"https://doi.org/10.1109/DSN.2012.6263936","url":null,"abstract":"Hard disk drives have multiple layers of fault tolerance mechanisms that protect against data loss. However, a few failures occasionally breach the entire set of mechanisms. To prevent such scenarios, we rely on failure prediction mechanisms to raise alarms with sufficient warning to allow the at-risk data to be copied to a safe location. A common failure prediction technique monitors the occurrence of soft errors and triggers an alarm when the soft error rate exceeds a specified threshold. This study uses data collected from a population of over 50,000 customer deployed disk drives to examine the relationship between soft errors and failures, in particular failures manifested as hard errors. The data analysis shows that soft errors alone cannot be used as a reliable predictor of hard errors. However, in those cases where soft errors do accurately predict hard errors, sufficient warning time exists for preventive actions.","PeriodicalId":236791,"journal":{"name":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132755208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-driven consolidation of Java workloads on multicores","authors":"Danilo Ansaloni, L. Chen, E. Smirni, Walter Binder","doi":"10.1109/DSN.2012.6263928","DOIUrl":"https://doi.org/10.1109/DSN.2012.6263928","url":null,"abstract":"Optimal resource allocation and application consolidation on modern multicore systems that host multiple applications is not easy. Striking a balance among conflicting targets such as maximizing system throughput and system utilization while minimizing application response times is a quandary for system administrators. The purpose of this work is to offer a methodology that can automate the difficult process of identifying how to best consolidate workloads in a multicore environment. We develop a simple approach that treats the hardware and the operating system as a black box and uses measurements to profile the application resource demands. The demands become input to a queueing network model that successfully predicts application scalability and that captures the performance impact of consolidated applications on shared on-chip and off-chip resources. Extensive analysis with the widely used DaCapo Java benchmarks on an IBM Power 7 system illustrates the model's ability to accurately predict the system's optimal application mix.","PeriodicalId":236791,"journal":{"name":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133241229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}