Robert A. Thacker, C. Myers, K. R. Jones, S. Little
{"title":"A new verification method for embedded systems","authors":"Robert A. Thacker, C. Myers, K. R. Jones, S. Little","doi":"10.1109/ICCD.2009.5413154","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413154","url":null,"abstract":"Verification of embedded systems is complicated by the fact that they are composed of digital hardware, analog sensors and actuators, and low level software. In order to verify the interaction of these heterogeneous components, it would be beneficial to have a single modeling formalism that is capable of representing all of these components. To address this need, this paper describes an extended labeled hybrid Petri net (LHPN) model that includes constructs for Boolean, discrete, and continuous variables as well as constructs to model timing. This paper also presents a method to verify these extended LHPNs. Finally, this paper presents a case study to illustrate the application of this model to the verification of a fault-tolerant temperature sensor.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126255141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient calibration of thermal models based on application behavior","authors":"Youngwoo Ahn, Inchoon Yeo, R. Bettati","doi":"10.1109/ICCD.2009.5413179","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413179","url":null,"abstract":"With increasing power densities, raising operating temperatures in chips threaten system reliability. Thermal control therefore has emerged as an important issue in system design and management. For dynamic thermal control to be effective, predictive thermal models of the system are needed. Such models typically use power as input, which renders them difficult to use in practical systems, where power monitoring is not available at processor or chip level. In this paper, we describe a methodology to infer the thermal model based on the monitoring of existing temperature sensors and of instruction counter registers. This allows the thermal model to be easily established, calibrated, and recalibrated at runtime to account for different thermal behavior due to either variations in fabrication or to varying environmental parameters. We validate the proposed methodology through a series of experiments. We also propose and validate an extension of the model and associated methodology for multicore processors.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"123 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126258857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A power-aware hybrid RAM-CAM renaming mechanism for fast recovery","authors":"S. Petit, R. Ubal, J. Sahuquillo, P. López","doi":"10.1109/ICCD.2009.5413160","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413160","url":null,"abstract":"Modern superscalar processors implement register renaming by using either RAM or CAM tables. The design of these structures should address their access time and misprediction recovery penalty. While direct-mapped RAMs provide faster access times, CAMs are more appropriate to avoid recovery penalties. Although they are more complex and slower, CAMs usually match the processor cycle in current designs. However, they do not scale with the number of physical registers and the pipeline width. In this paper we present a new hybrid RAM-CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM provides the current mappings quickly; on mispeculation, a low-complexity CAM enables immediate recovery and further register renaming. Compared to an ideal CAM in a 4-way state-of-the-art superscalar microprocessor, and for almost the same performance (1% slowdown) and area (95% of the ideal CAM size), the proposed scheme consumes about 90% less dynamic energy.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134261502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiplier-less and table-less linear approximation for square and square-root","authors":"I. Park, Tae-Hwan Kim","doi":"10.1109/ICCD.2009.5413129","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413129","url":null,"abstract":"Square and square-root are widely used in digital signal processing and digital communication algorithms, and their efficient realizations are commonly required to reduce the hardware complexity. In the implementation point of view, approximate realizations are often desired if they do not degrade performance significantly. In this paper, we propose new linear approximations for the square and square-root functions. The traditional linear approximations need multipliers to calculate slope offsets and tables to store initial offset values and slope values, whereas the proposed approximations exploit the inherent properties of square-related functions to linearly interpolate with only simple operations, such as shift, concatenation and addition, which are usually supported in modern VLSI systems. Regardless of the bit-width of the number system, more importantly, the maximum relative errors of the proposed approximations are bounded to 6.25% and 3.13% for square and square-root functions, respectively.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131729127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Code density concerns for new architectures","authors":"Vincent M. Weaver, S. Mckee","doi":"10.1109/ICCD.2009.5413117","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413117","url":null,"abstract":"Reducing a program's instruction count can improve cache behavior and bandwidth utilization, lower power consumption, and increase overall performance. Nonetheless, code density is an often overlooked feature in studying processor architectures. We hand-optimize an assembly language embedded benchmark for size on 21 different instruction set architectures, finding up to a factor of three difference in code sizes from ISA alone. We find that the architectural features that contribute most heavily to code density are instruction length, number of registers, availability of a zero register, bit-width, hardware divide units, number of instruction operands, and the availability of unaligned loads and stores. We extend our results to investigate operating system, compiler, and system library effects on code density. We find that the executable starting address, executable format, and system call interface all affect program size. While ISA effects are important, the efficiency of the entire system stack must be taken into account when developing a new dense instruction set architecture.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115751564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel SoC architecture on FPGA for ultra fast face detection","authors":"Chunhui He, Alexandros Papakonstantinou, Deming Chen","doi":"10.1109/ICCD.2009.5413122","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413122","url":null,"abstract":"Face detection is the cornerstone of a wide range of applications such as video surveillance, robotic vision and biometric authentication. One of the biggest challenges in face detection based applications is the speed at which faces can be accurately detected. In this paper, we present a novel SoC (System on Chip) architecture for ultra fast face detection in video or other image rich content. Our implementation is based on an efficient and robust algorithm that uses a cascade of Artificial Neural Network (ANN) classifiers on AdaBoost trained Haar features. The face detector architecture extracts the coarse grained parallelism by efficiently overlapping different computation phases while taking advantage of the finegrained parallelism at the module level. We provide details on the parallelism extraction achieved by our architecture and show experimental results that portray the efficiency of our face detection implementation. For the implementation and evaluation of our architecture we used the Xilinx FX130T Virtex5 FPGA device on the ML510 development board. Our performance evaluations indicate that a speedup of around 100X can be achieved over a SSE-optimized software implementation running on a 2.4GHz Core-2 Quad CPU. The detection speed reaches 625 frames per sec (fps).","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"173 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114178917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The impact of liquid cooling on 3D multi-core processors","authors":"H. Jang, I. Yoon, C. Kim, Seungwon Shin, S. Chung","doi":"10.1109/ICCD.2009.5413115","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413115","url":null,"abstract":"Recently, 3D integration has been regarded as one of the most promising techniques due to its abilities of reducing global wire lengths and lowering power consumption. However, 3D integrated processors inevitably cause higher power density and lower thermal conductivity, since the closer proximity of heat generating dies makes existing thermal hotspots more severe. Without an efficient cooling method inside the package, 3D integrated processors should suffer severe performance degradation by dynamic thermal management as well as reliability problems. In this paper, we analyze the impact of the liquid cooling on a 3D multi-core processor compared to the conventional air cooling. We also evaluate the leakage power consumption and the lifetime reliability depending on the temperature of each functional unit in the 3D multi-core processor. The simulation results show that the liquid cooling reduces the temperature of the L1 instruction cache (the hottest block in this evaluation) by as much as 45 degrees, resulting in 12.8% leakage reduction, on average, compared to the conventional air cooling. Moreover, the reduced temperature of the L1 instruction cache also improves the reliability of electromigration, stress migration, time-dependent dielectric breakdown, thermal cycling, and negative bias temperature instability significantly.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121208006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ahmed Al-Maashri, Guangyu Sun, Xiangyu Dong, V. Narayanan, Yuan Xie
{"title":"3D GPU architecture using cache stacking: Performance, cost, power and thermal analysis","authors":"Ahmed Al-Maashri, Guangyu Sun, Xiangyu Dong, V. Narayanan, Yuan Xie","doi":"10.1109/ICCD.2009.5413147","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413147","url":null,"abstract":"Graphics Processing Units (GPUs) offer tremendous computational and processing power. The architecture requires high communication bandwidth and lower latency between computation units and caches. 3D die-stacking technology is a promising approach to meet such requirements. To the best of our knowledge no other study has investigated the implementation of 3D technology in GPUs. In this paper, we study the impact of stacking caches using the 3D technology on GPU performance. We also investigate the benefits of using 3D stacked MRAM on GPUs. Our work includes cost, power, and thermal analysis of the proposed architectural designs. Our results show a 53% geometric mean performance speedup for iso-cycle time architectures and about 19% for iso-cost architectures.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114943380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VariPipe: Low-overhead variable-clock synchronous pipelines","authors":"Navid Toosizadeh, S. Zaky, Jianwen Zhu","doi":"10.1109/ICCD.2009.5413167","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413167","url":null,"abstract":"Synchronous pipelines usually have a fixed clock frequency determined by the worst-case process-voltage-temperature (PVT) analysis of the most critical path. Higher operating frequencies are possible under typical PVT conditions, especially when the most critical path is not triggered. This paper introduces a design methodology that uses asynchronous design to generate the clock of a synchronous pipeline. The result is a variable clock period that changes cycle-by-cycle according to the current operations in the pipeline and the current PVT conditions. The paper also presents a simple design flow to implement variable-clock systems with standard cells using conventional synchronous design tools. The variable-clock pipeline technique has been tested on a 32-bit microprocessor in 90nm technology. Post-layout simulations with three sets of benchmarks demonstrate that the variable-clock processor has a two-fold performance advantage over its fixed-clock counterpart. The overhead of the added clock generation circuit is merely 2.6% in area and 3% in energy consumption, compared to an earlier proposal that costs 100% overhead.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129121177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kalyana C. Bollapalli, Rajesh Garg, Kanupriya Gulati, S. Khatri
{"title":"On-chip bidirectional wiring for heavily pipelined systems using network coding","authors":"Kalyana C. Bollapalli, Rajesh Garg, Kanupriya Gulati, S. Khatri","doi":"10.1109/ICCD.2009.5413165","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413165","url":null,"abstract":"In this paper, we describe a low-area, reduced-power on-chip point-to-point bidirectional communication scheme for heavily pipelined systems. When data needs to be transmitted bidirectionally between two on-chip locations, the traditional approach resorts to either using two unidirectional wires, or to using a single wire (with a unidirectional transfer at any given time instant). In contrast, our bidirectional communication scheme allows data to be transmitted simultaneously between two on-chip locations, with a single wire performing the bidirectional data transfer. Our approach borrows ideas from the emerging area of network coding (in the field of communication). By utilizing coding units (which also serve the purpose of buffering the signals) along the wire between the two endpoints, we are able to achieve the same throughput as a traditional approach, while reducing the total area utilization by about 49.8% (thereby reducing routing congestion), and the total power consumption by about 11.5%. The area and power results include the contribution of routing wires, coding units, drivers, the clock distribution network and the required reset wire. Our bidirectional communication approach is ideally suited for heavily pipelined data intensive systems.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124734797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}