T. Angskun, G. Bosilca, G. Fagg, Jelena Pjesivac-Grbovic, J. Dongarra
Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07), published 2007-05-14. DOI: 10.1109/CCGRID.2007.95 (https://doi.org/10.1109/CCGRID.2007.95)
Reliability Analysis of Self-Healing Network using Discrete-Event Simulation
The number of processors embedded in high-performance computing platforms is continuously increasing to accommodate users' desire to solve larger and more complex problems. However, as the number of components increases, so does the probability of failure. Thus, both the scalability and the fault tolerance of software are important issues in this field. To ensure the reliability of software, especially under failure conditions, reliability analysis is needed. Discrete-event simulation offers an attractive alternative to traditional Markovian analytical models, which often have an intractably large state space. In this paper, we analyze the reliability of a self-healing network developed for parallel runtime environments using discrete-event simulation. The network is designed to support the transmission of messages across multiple nodes and, at the same time, to protect against node and process failures. Results demonstrate the flexibility of a discrete-event simulation approach for studying network behavior under failure conditions and with various protocol parameters, message types, and routing algorithms.
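To make the contrast with Markovian models concrete: a model that tracks n independent up/down components needs up to 2^n states, whereas a discrete-event simulation only processes the events that occur in each run. The sketch below is a generic illustration of the technique, not the authors' simulator or network. The ring topology, the exponential failure times, and the names `reachable`, `simulate_once`, and `estimate_reliability` are all assumptions made for this example.

```python
import heapq
import random
from collections import deque


def reachable(alive, src, dst):
    """Breadth-first search over a ring in which only alive nodes forward messages."""
    n = len(alive)
    if not (alive[src] and alive[dst]):
        return False
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        # Each node has two ring neighbours; a failed neighbour is routed around.
        for neighbour in ((node - 1) % n, (node + 1) % n):
            if alive[neighbour] and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def simulate_once(n_nodes, failure_rate, send_time, rng):
    """Run one replication; return True if the message sent at send_time is deliverable."""
    alive = [True] * n_nodes
    events = []  # min-heap of (time, priority, kind, node)
    for node in range(n_nodes):
        # Assumption: node lifetimes are independent and exponentially distributed.
        heapq.heappush(events, (rng.expovariate(failure_rate), 0, "fail", node))
    heapq.heappush(events, (send_time, 1, "send", -1))
    while events:
        _, _, kind, node = heapq.heappop(events)  # advance to the next event in time order
        if kind == "fail":
            alive[node] = False
        else:
            # Hypothetical workload: one message from node 0 to the opposite side of the ring.
            return reachable(alive, 0, n_nodes // 2)
    return False


def estimate_reliability(n_nodes=8, failure_rate=0.05, send_time=1.0, runs=2000, seed=1):
    """Estimate delivery probability as the fraction of successful replications."""
    rng = random.Random(seed)
    delivered = sum(simulate_once(n_nodes, failure_rate, send_time, rng) for _ in range(runs))
    return delivered / runs


if __name__ == "__main__":
    print(estimate_reliability())
```

Changing the protocol parameters, the message workload, or the routing rule only requires new event types or a different `reachable` function, which is the kind of flexibility the abstract attributes to the simulation approach.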