{"title":"Detecting matrix multiplication faults in many-core systems","authors":"F. Sibai","doi":"10.1109/INNOVATIONS.2011.5893843","DOIUrl":null,"url":null,"abstract":"Many-core systems are characterized by a large number of components based on ever-shrinking circuit geometries. System reliability becomes an issue because of the system complexity, the large number of components and nanoscale issues due to soft errors. While information redundancy techniques can be used for fault tolerance, they occupy too much memory space and increase the memory and network bandwidth. Moreover, in many-cores, resources are plentiful encouraging the design of simple cores without hardware fault tolerance. Thus in the absence of information redundancy, software fault detection techniques become necessary to detect errors. Herein, we present fault detection techniques for 2×2 matrix multiplication which we extend to nxn matrix multiplication. These tests can detect transient and some intermittent and permanent hardware faults. These tests are also suitable to computing grids and distributed heterogeneous systems where the result-forming node may run tests in software to validate the sub-results submitted by the grid nodes.","PeriodicalId":173102,"journal":{"name":"2011 International Conference on Innovations in Information Technology","volume":"284 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 International Conference on Innovations in Information Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INNOVATIONS.2011.5893843","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Many-core systems are characterized by a large number of components based on ever-shrinking circuit geometries. System reliability becomes an issue because of the system complexity, the large number of components and nanoscale issues due to soft errors. While information redundancy techniques can be used for fault tolerance, they occupy too much memory space and increase the memory and network bandwidth. Moreover, in many-cores, resources are plentiful encouraging the design of simple cores without hardware fault tolerance. Thus in the absence of information redundancy, software fault detection techniques become necessary to detect errors. Herein, we present fault detection techniques for 2×2 matrix multiplication which we extend to nxn matrix multiplication. These tests can detect transient and some intermittent and permanent hardware faults. These tests are also suitable to computing grids and distributed heterogeneous systems where the result-forming node may run tests in software to validate the sub-results submitted by the grid nodes.