Monitoring High Performance Networks in Large-scale Clusters
F. Gadaud
Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), 2006-05-16
DOI: 10.1109/CCGRID.2006.155 (https://doi.org/10.1109/CCGRID.2006.155)
The number of large-scale clusters is rising. They are integrated into grids or become key components of larger structures. As more users and projects rely on HPC clusters, high availability and security become requirements for their fast-growing adoption and use. In this paper, we focus on high performance networks, on which all HPC clusters are built. We demonstrate that classical instrumentation is inefficient in an HPC environment: it either does not scale or causes a significant loss of performance. Based on this observation, we highlight the properties of clusters: nodes have assigned roles and are coupled at various levels. Moreover, we study the main characteristics of resource usage for each type of node and propose an instrumentation that can be effectively deployed. The result is a set of fine-grained mechanisms adapted to the system architecture and to performance constraints. Relevant information is collected over time. Two properties are verified online and dynamically: coherency and containment. Each induces a type of verification, and both aim at reducing the recovery time from failure and the security risk of a whole cluster. We illustrate our methodology on the QsNet network (K. Magontis et al., 2001) and provide a way to increase the safety of high performance networks and clusters.