Temperature based fault forecasting in computer clusters

2012 15th International Multitopic Conference (INMIC), Pub Date: 2012-12-01, DOI: 10.1109/INMIC.2012.6511446

S. Haider, Naveed Riaz Ansari


Citations: 4

Abstract



Clusters and Grids have one thing in common: both are used to achieve high performance in computing. The scope of a Cluster is relatively narrow compared to a Grid, as Clusters are homogeneous while Grids are heterogeneous. Another emerging area in High Performance Computing (HPC) is Cloud computing, which can be considered a further extension of Grid computing. Apart from the other issues that exist in Clusters, Grids and Clouds, one problem is common to all of them: Fault Tolerance and Handling. Fault Tolerance is the technique, or set of techniques, used when hardware, software, network and other types of problems arise during the operation of Clusters, Grids and Clouds. In this research we focus on fault identification and forecasting from the Cluster point of view and try to establish a technique that forecasts faults in Cluster-based environments on the basis of temperature. Nodes continuously receive the temperature of their attached devices from temperature sensors and check it against the temperature threshold values of those devices. If the temperature of the devices is within the threshold range, we place the machine in the Green zone. If temperatures are approaching the threshold values, we place the machine in the Orange zone, which indicates that the machine may or may not crash because of temperature. When devices have crossed their temperature threshold values, we place the machine in the Red zone, which indicates that the machine is likely to fail at any time due to the failure of one or more hardware devices.
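The three-zone rating described in the abstract can be illustrated with a minimal Python sketch. The abstract gives no numeric values, so everything concrete here is an assumption made for illustration: the names `Zone`, `device_zone` and `machine_zone`, the `margin_c` parameter that defines "approaching" the threshold, the example thresholds, and the rule that a machine takes the worst zone among its devices. It is not the authors' implementation.

```python
from enum import IntEnum


class Zone(IntEnum):
    """Zones from the abstract, ordered so that a larger value is more severe."""
    GREEN = 0
    ORANGE = 1
    RED = 2


def device_zone(temp_c, threshold_c, margin_c=10.0):
    """Rate one device from its temperature reading.

    Assumption: "approaching" means within margin_c degrees of the threshold.
    The abstract does not specify how close counts as approaching.
    """
    if temp_c > threshold_c:
        return Zone.RED       # threshold crossed
    if temp_c >= threshold_c - margin_c:
        return Zone.ORANGE    # approaching the threshold
    return Zone.GREEN         # comfortably within range


def machine_zone(readings, thresholds, margin_c=10.0):
    """Rate a machine as the worst zone among its devices (an assumption).

    readings and thresholds map a device name to degrees Celsius.
    A machine with no readings is rated GREEN.
    """
    zones = (
        device_zone(temp, thresholds[name], margin_c)
        for name, temp in readings.items()
    )
    return max(zones, default=Zone.GREEN)


if __name__ == "__main__":
    # Hypothetical per-device thresholds, chosen only for this example.
    thresholds = {"cpu": 85.0, "disk": 55.0}
    print(machine_zone({"cpu": 60.0, "disk": 40.0}, thresholds).name)
    print(machine_zone({"cpu": 78.0, "disk": 40.0}, thresholds).name)
    print(machine_zone({"cpu": 78.0, "disk": 56.0}, thresholds).name)
```

Taking the maximum over devices matches the abstract's statement that a machine is likely to fail when one or more of its hardware devices has crossed its threshold; a real deployment would read the temperatures from the node's sensors rather than from a hard-coded dictionary.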
