{"title":"On-demand thread-level fault detection in a concurrent programming environment","authors":"Jian Fu, Qiang Yang, R. Poss, C. Jesshope, Chunyuan Zhang","doi":"10.1109/SAMOS.2013.6621132","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621132","url":null,"abstract":"The vulnerability of multi-core processors is increasing due to tighter design margins and greater susceptibility to interference. Moreover, concurrent programming environments are the norm in the exploitation of multi-core systems. In this paper, we present an on-demand thread-level fault detection mechanism for multi-cores. The main contribution is on-demand redundancy, which allows users to set the redundancy scope in the concurrent code. To achieve this we introduce intelligent redundant thread creation and synchronization, which manages concurrency and synchronization between the redundant threads via the master. This framework was implemented in an emulation of a multi-threaded, many-core processor with single, in-order issue cores. It was evaluated by a range of programs in image and signal processing, and encryption. The performance overhead of redundancy is less than 11% for single core execution and is always less than 100% for all scenarios. This efficiency derives from the platform's hardware concurrency management and latency tolerance.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123976400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mapping of PRP/HSR redundancy protocols onto a configurable FPGA/CPU based architecture","authors":"Holger Flatt, J. Jasperneite, Daniel Dennstedt, Tran Dinh Hung","doi":"10.1109/SAMOS.2013.6621114","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621114","url":null,"abstract":"This paper presents the mapping of the seamless redundancy protocols PRP and HSR in combination with IEEE 1588 based clock synchronization onto a configurable CPU/FPGA based Redundancy Box architecture. Whereas core functions of PRP, HSR, and IEEE 1588 are mapped onto the FPGA, a CPU executes the control parts of these protocols. An optionally attached standard switch ASIC provides direct connection to several network devices. For validation purposes, a special embedded platform is proposed that is composed of an FPGA and a commercial off-the-shelf switch ASIC. The results show that even a low-cost Altera Cyclone IV FPGA comprising 74,000 logic elements fulfills the requirements for protocol processing at 100 Mbps per port. Minimum size frames are forwarded by the FPGA up to two times faster than competing implementations. Three connected PRP/HSR RedBoxes and an IEEE 1588 clock master synchronize in the laboratory to within an accuracy of 30 ns. Using several RedBoxes in PRP and HSR mode, seamless redundancy is demonstrated for a PROFINET RT test network and supplemental network components. Overall, the presented RedBox can be flexibly integrated into time-synchronized industrial networks in order to significantly increase the communication reliability.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124719582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concurrent multi-level arrays: Wait-free extensible hash maps","authors":"S. Feldman, P. Laborde, D. Dechev","doi":"10.1109/SAMOS.2013.6621118","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621118","url":null,"abstract":"In this work we present the first design and implementation of a wait-free hash map. Our multiprocessor data structure allows a large number of threads to concurrently put, get, and remove information. Wait-freedom means that all threads make progress in a finite amount of time - an attribute that can be critical in real-time environments. This is opposed to the traditional blocking implementations of shared data structures which suffer from the negative impact of deadlock and related correctness and performance issues. Our design is portable because we only use atomic operations that are provided by the hardware; therefore, our hash map can be utilized by a variety of data-intensive applications including those within the domains of embedded systems and supercomputers. The challenges of providing this guarantee make the design and implementation of wait-free objects difficult. As such, there are few wait-free data structures described in the literature; in particular, there are no wait-free hash maps. It often becomes necessary to sacrifice performance in order to achieve wait-freedom. However, our experimental evaluation shows that our hash map design is, on average, 5 times faster than a traditional blocking design. Our solution outperforms the best available alternative non-blocking designs in a large majority of cases, typically by a factor of 8 or higher.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128905152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cobra: A comprehensive bundle-based reliable architecture","authors":"Andrea Pellegrini, V. Bertacco","doi":"10.1109/samos.2013.6621131","DOIUrl":"https://doi.org/10.1109/samos.2013.6621131","url":null,"abstract":"The declining robustness of transistors and their ever-denser integration threatens the dependability of future microprocessors. Classic multicores offer a simple solution to overcome hardware defects: faulty processors can be disabled without affecting the rest of the system. However, this approach quickly becomes impractical at high fault rates. Recently, distributed computer architectures have been proposed to mitigate the effects of faulty transistors by utilizing fine-grained hardware reconfiguration, managed by fully decoupled control logic. Unfortunately, such solutions trade flexibility for execution locality, and thus they do not scale to large systems. To overcome this issue we propose Cobra, a distributed, scalable, highly parallel reliable architecture. Cobra is a service-based architecture where groups of dynamic instructions flow independently through the system, making use of the available hardware resources. Cobra organizes the system's units dynamically using a novel resource assignment that preserves locality and limits communication overhead. Our experiments show that Cobra is extremely dependable, and outperforms classic multicores when subjected to 5 or more defects per 100 million transistors. We also show that Cobra operates 80% faster than previous state-of-the-art solutions on multi-programmed SPEC CPU2006 workloads and it improves cache hit rate by up to 62%. Our runtime fault detection techniques have a performance impact of only 3%.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128937557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An accurate energy model for streaming applications mapped on MPSoC platforms","authors":"J. Spasić, T. Stefanov","doi":"10.1109/SAMOS.2013.6621124","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621124","url":null,"abstract":"In this paper, we propose a very accurate energy model for streaming applications modeled as Polyhedral Process Networks (PPN) and mapped onto tile-based MPSoC platforms with distributed memory. The energy model is based on the well-defined properties of the PPN application model. To guarantee the accuracy of the energy model, values of important model parameters are obtained by real measurements. The proposed energy model is applicable to different types of processors and communication infrastructures within an MPSoC platform. The energy model is evaluated on FPGA-based MPSoC platforms against real measurements of energy consumption from the FPGA. The obtained energy estimates are highly accurate with an average error of 4% and a standard deviation of 3%. The average model evaluation time per design point takes 2.5 minutes for considered cases, which is very good given the high accuracy of the model.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130326023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-power application-specific FFT processor for LTE applications","authors":"Tomasz Patyk, D. Guevorkian, Teemu Pitkänen, P. Jääskeläinen, J. Takala","doi":"10.1109/SAMOS.2013.6621102","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621102","url":null,"abstract":"In this paper, we describe a processor architecture tailored to a mixed radix-4/2/3 FFT algorithm. The proposed design supports all FFT sizes, namely 128-2048/1536, required by LTE applications. The processor is based on the Transport Triggered Architecture, which was customized with a set of function units designed especially for the application at hand. The processor has been synthesized on an ASIC technology, and both energy efficiency and performance have been evaluated. The developed processor is programmable but shows energy efficiency comparable to fixed-function ASIC implementations.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128849705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Process-based Reconfigurable SystemC Module for simulation speedup","authors":"Efstathios Sotiriou-Xanthopoulos, K. Siozios, G. Economakos, D. Soudris","doi":"10.1109/SAMOS.2013.6621108","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621108","url":null,"abstract":"As Multi-Processor Systems-on-Chip (MPSoC) architectures become more and more complex, Design Space Exploration (DSE) becomes the only viable solution for finding the Pareto-optimal designs. To evaluate each solution with a real dataset, DSE has to simulate the design under test, which is modeled as a Virtual Platform usually written in SystemC. However, simulation is a very slow task that includes non-productive time periods such as system initialization, while platform re-compilation also imposes a significant overhead. In this paper, a Process-based Reconfigurable Module is used to bypass the non-productive simulation parts, thus accelerating the simulation. The effectiveness of the proposed methodology is demonstrated with a series of computationally intensive multimedia applications, where the simulation time improvements reach 34% on average.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"438 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132256767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPUburn: A system to test and mitigate GPU hardware failures","authors":"D. Defour, E. Petit","doi":"10.1109/SAMOS.2013.6621133","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621133","url":null,"abstract":"Due to many factors, such as high transistor density, high frequency, and low voltage, today's processors are more than ever subject to hardware failures. These errors have various impacts depending on the location of the error and the type of processor. Because of the hierarchical structure of the compute units and work scheduling, hardware failures on GPUs affect only part of the application. In this paper we present a new methodology to characterize the hardware failures of Nvidia GPUs based on a software micro-benchmarking platform implemented in OpenCL. We also identify which hardware part of the TESLA architecture is more sensitive to intermittent errors, which usually appear as the processor ages. We obtained these results by accelerating the aging process by running the processors at high temperature. We show that on GPUs, the impact of intermittent errors is limited to a localized architecture tile. Finally, we propose a methodology to detect and record the location of defective units so that they can be avoided, ensuring program correctness on such architectures and improving the GPU fault-tolerance capability and lifespan.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127867371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High speed cycle approximate simulation for cache-incoherent MPSoCs","authors":"Christopher Thompson, Miles Gould, N. Topham","doi":"10.1109/SAMOS.2013.6621110","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621110","url":null,"abstract":"We present a new high speed cycle-approximate simulator, addressing an important, neglected category of multi-core systems: deeply-embedded cache-incoherent MPSoCs. We take advantage of the unique properties of these systems to increase the parallelism of the simulation. In doing so we achieve performance not possible using previous simulation techniques, without compromising the accuracy of the results. We present quantitative performance results across a large range of simulated NoC designs, comprising 1 to 64 cores. On average we simulate at 5.9 MIPS, with simulation speeds reaching 373 MIPS in the best case. Comparing against FPGA implementations we demonstrate that the simulator manages this with an average timing error of only 2.1%.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129993152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An effective model extraction method with state space compression for model checking SystemC TLM designs","authors":"Yanyan Gao, Xi Li","doi":"10.1109/SAMOS.2013.6621107","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621107","url":null,"abstract":"SystemC has become a de-facto standard language for SoC and ASIP designs. Verifying SystemC implementations is key to guaranteeing the correctness of designs and preventing errors from propagating to the lower levels. The gap between a SystemC TLM model and its corresponding formal model makes it hard to perform automated translation between them. SystemC describes process behavior in sequential statements and usually employs intermediate variables, while most model checking languages for hardware only describe parallel behaviors, in which the usage of intermediate variables not only increases state space and may prolong execution time, but also introduces potential errors. For a model checking language which supports parallel description, the elimination of redundant intermediate variables is requisite and also an efficient way to reduce the state space. This paper addresses these issues by (1) proposing an extraction method that translates a description supporting sequential execution into one supporting parallel execution; and (2) identifying and removing redundant intermediate variables. In this paper, a novel mechanism is presented to automatically extract behavior descriptions from SystemC into SMV, a widely used model checking language. We have implemented a tool, SC2SMV, and performed the extraction process with it to demonstrate the effectiveness of the method presented in this paper.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115606275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}