{"title":"TAP prediction: Reusing conditional branch predictor for indirect branches with Target Address Pointers","authors":"Zichao Xie, Dong Tong, Mingkai Huang, Xiaoyin Wang, Qinqing Shi, Xu Cheng","doi":"10.1109/ICCD.2011.6081386","DOIUrl":"https://doi.org/10.1109/ICCD.2011.6081386","url":null,"abstract":"Indirect-branch prediction is becoming more important for modern processors as more programs are written in object-oriented languages. Previous hardware-based indirect-branch predictors generally require significant hardware storage or use aggressive algorithms which make the processor front-end more complex. In this paper, we propose a fast and cost-efficient indirect-branch prediction strategy, called Target Address Pointer (TAP) Prediction. TAP Prediction reuses the history-based branch direction predictor to detect occurrences of indirect branches, and then stores indirect-branch targets in the Branch Target Buffer (BTB). The key idea of TAP Prediction is to predict the Target Address Pointers, which generate virtual addresses to index the targets stored in the BTB, rather than to predict the indirect-branch targets directly. TAP Prediction also reuses the branch direction predictor to construct several small predictors. When fetching an indirect branch, these small predictors work in parallel to generate the target address pointer. Then TAP prediction accesses the BTB to fetch the predicted indirect-branch target using the generated virtual address. This mechanism could achieve time cost comparable to that of dedicated-storage predictors, without requiring additional large amounts of storage. Our evaluation shows that for three representative direction predictors (Hybrid, Perceptrons, and O-GEHL), TAP schemes improve performance by 18.19%, 21.52%, and 20.59%, respectively, over the baseline processor with the most commonly-used BTB prediction. Compared with previous hardware-based indirect-branch predictors, the TAP-Perceptrons scheme achieves performance improvement equivalent to that provided by a 48KB TTC predictor, and it also outperforms the VPC predictor by 14.02%.","PeriodicalId":354015,"journal":{"name":"2011 IEEE 29th International Conference on Computer Design (ICCD)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133017877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving GPU Robustness by making use of faulty parts","authors":"Artem Durytskyy, M. Zahran, R. Karri","doi":"10.1109/ICCD.2011.6081422","DOIUrl":"https://doi.org/10.1109/ICCD.2011.6081422","url":null,"abstract":"With hundreds of processing units in current state-of-the-art graphics processing units (GPUs), the probability that one or more processing units fail due to permanent faults, during fabrication or post deployment, increases drastically. In our experiments we found that the loss of a single streaming multiprocessor (SM) in an 8-SM GPU resulted in as much as 16% performance loss. The default method for dealing with faulty SMs is to turn them off. Although faulty SMs cannot be trusted to completely execute a single kernel (program assigned to an SM) correctly, we show that we can still make use of these SMs to improve system throughput by generating and supplying high-level hints to other functional SMs. By making the faulty SMs supply hints to functional SMs, we have been able to achieve an average speed-up of about 16% over the baseline case (wherein the faulty SMs are turned off). The proposed technique requires minimal hardware overhead and is highly scalable.","PeriodicalId":354015,"journal":{"name":"2011 IEEE 29th International Conference on Computer Design (ICCD)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115901347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing 3D test wrappers for pre-bond and post-bond test of 3D embedded cores","authors":"D. L. Lewis, Shreepad Panth, Xin Zhao, S. Lim, H. Lee","doi":"10.1109/ICCD.2011.6081381","DOIUrl":"https://doi.org/10.1109/ICCD.2011.6081381","url":null,"abstract":"3D integration is a promising new technology for tightly integrating multiple active silicon layers into a single chip stack. Both the integration of heterogeneous tiers and the partitioning of functional units across tiers lead to significant improvements in functionality, area, performance, and power consumption. Managing the complexity of 3D design is a significant challenge that will require a system-on-chip approach, but the application of SOC design to 3D necessitates extensions to current test methodology. In this paper, we propose extending test wrappers, a popular SOC DFT technique, into the third dimension. We develop an algorithm employing the Best Fit Decreasing and Kernighan-Lin Partitioning heuristics to produce 3D wrappers that minimize test time, maximize reuse of routing resources across test modes, and allow for different TAM bus widths in different test modes. On average the two variants of our algorithm reuse 93% and 92% of the test wrapper wires while delivering test times of just 0.06% and 0.32% above the minimum.","PeriodicalId":354015,"journal":{"name":"2011 IEEE 29th International Conference on Computer Design (ICCD)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129282026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and compact binary-to-BCD conversion circuits for decimal multiplication","authors":"O. Al-Khaleel, Zakaria Al-Qudah, M. Al-khaleel, C. Papachristou, F. Wolff","doi":"10.1109/ICCD.2011.6081401","DOIUrl":"https://doi.org/10.1109/ICCD.2011.6081401","url":null,"abstract":"Decimal arithmetic has received considerable attention recently due to its suitability for many financial and commercial applications. In particular, numerous algorithms have been recently proposed for decimal multiplication. A major approach to decimal multiplication shaped by these proposals is based on performing the decimal digit-by-digit multiplication in binary, converting the binary partial product back to decimal, and then adding the decimal partial products as appropriate to form the final product in decimal. With this approach, the efficiency of binary-to-BCD partial product conversion is critical for the efficiency of the overall multiplication process. A recently proposed algorithm for this conversion is based on splitting the binary partial product into two parts (i.e., two groups of bits), and then computing the contributions of the two parts to the partial BCD result in parallel. This paper proposes two new algorithms (Three-Four split and Four-Three split) based on this principle. We present our proposed architectures that implement these algorithms and compare them to existing algorithms. The synthesis results show that the Three-Four split algorithm runs 15% faster and occupies 26.1% less area than the best performing equivalent circuit found in the literature. Furthermore, the Four-Three split algorithm occupies 37.5% less area than the state-of-the-art equivalent circuit.","PeriodicalId":354015,"journal":{"name":"2011 IEEE 29th International Conference on Computer Design (ICCD)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128577438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy aware task mapping algorithm for heterogeneous MPSoC based architectures","authors":"A. Hussien, A. Eltawil, R. Amin, Jim Martin","doi":"10.1109/ICCD.2011.6081444","DOIUrl":"https://doi.org/10.1109/ICCD.2011.6081444","url":null,"abstract":"Energy Management for multi-mode Software Defined Radio (SDR) systems remains a daunting challenge. In this paper, we focus on the issue of task allocation for multi-processor based systems with hybrid processing resources that can be reconfigured. With the objective of minimizing energy, we propose a fast, energy aware static task mapping heuristic to minimize the average overall energy consumption. Simulation results show that the proposed heuristic is capable of achieving results that are within 20% of the optimal solution while providing orders of magnitude speedup in processing time.","PeriodicalId":354015,"journal":{"name":"2011 IEEE 29th International Conference on Computer Design (ICCD)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121087526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A memristor-based memory cell using ambipolar operation","authors":"P. Junsangsri, F. Lombardi","doi":"10.1109/ICCD.2011.6081390","DOIUrl":"https://doi.org/10.1109/ICCD.2011.6081390","url":null,"abstract":"This paper presents a novel memory cell consisting of a memristor and ambipolar transistors. Macroscopic models are utilized to characterize the operations of this memory cell. A detailed treatment of the two basic memory operations (write and read) with respect to memristor features is provided; particular emphasis is devoted to the threshold characterization of the memristance and the on/off states. Extensive simulation results are provided to assess performance in terms of the write/read times, transistor scaling and power dissipation. The simulation results show that the proposed memory cell achieves superior performance compared with other memristor-based cells found in the technical literature.","PeriodicalId":354015,"journal":{"name":"2011 IEEE 29th International Conference on Computer Design (ICCD)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121504883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blue team red team approach to hardware trust assessment","authors":"Jeyavijayan Rajendran, V. Jyothi, R. Karri","doi":"10.1109/ICCD.2011.6081410","DOIUrl":"https://doi.org/10.1109/ICCD.2011.6081410","url":null,"abstract":"Hardware security techniques are validated using fixed in-house methods. However, the effectiveness of such techniques in the field cannot be the same, as the attacks are dynamic. A red team blue team approach mimics dynamic attack scenarios and thus can be used to validate such techniques by determining the effectiveness of a defense and identifying vulnerabilities in it. By following a red team blue team approach, we validated two trojan detection techniques, namely path delay measurement and ring oscillator frequency monitoring, in the Embedded Systems Challenge (ESC) 2010. In ESC, one team performed the blue team activities and eight other teams performed red team activities. The path delay measurement technique detected all the trojans. The ESC exposed a vulnerability in the RO-based technique which was exploited by the red teams, causing some trojans to be undetected. Post ESC, we developed a technique to fix this vulnerability.","PeriodicalId":354015,"journal":{"name":"2011 IEEE 29th International Conference on Computer Design (ICCD)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132649513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced symbolic simulation of a round-robin arbiter","authors":"Yongjian Li, Naiju Zeng, W. Hung, Xiaoyu Song","doi":"10.1109/ICCD.2011.6081383","DOIUrl":"https://doi.org/10.1109/ICCD.2011.6081383","url":null,"abstract":"In this work, we present our results on formally verifying the hardware design of a round-robin arbiter, which is the core component in many real network systems. Our approach is enhanced STE, which explores fully symbolic simulation for not only one round of round-robin arbitration, but also the sequential behaviors of the arbiter. Our experiments demonstrate that the enhanced STE specification for real-world hardware design can be finished automatically in a reasonable time and memory usage.","PeriodicalId":354015,"journal":{"name":"2011 IEEE 29th International Conference on Computer Design (ICCD)","volume":"169 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132347910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simultaneous continual flow pipeline architecture","authors":"K. Jothi, Mageda Sharafeddine, Haitham Akkary","doi":"10.1109/ICCD.2011.6081387","DOIUrl":"https://doi.org/10.1109/ICCD.2011.6081387","url":null,"abstract":"Since the introduction of the first industrial out-of-order superscalar processors in the 1990s, instruction buffers and cache sizes have kept increasing with every new generation of out-of-order cores. The motivation behind this continuous evolution has been performance of single-thread applications. Performance gains from larger instruction buffers and caches come at the expense of area, power, and complexity. We show that this is not the most energy efficient way to achieve performance. Instead, sizing the instruction buffers to the minimum size necessary for the common case of L1 data cache hits and using new latency-tolerant microarchitecture to handle loads that miss the L1 data cache, improves execution time and energy consumption on SpecCPU 2000 benchmarks by an average of 10% and 12% respectively, compared to a large superscalar baseline. Our non-blocking architecture outperforms other latency tolerant architectures, such as Continual Flow Pipelines, by up to 15% on the same SpecCPU 2000 benchmarks.","PeriodicalId":354015,"journal":{"name":"2011 IEEE 29th International Conference on Computer Design (ICCD)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122402433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using content-aware bitcells to reduce static energy dissipation","authors":"Fahrettin Koc, O. Simsek, O. Ergin","doi":"10.1109/ICCD.2011.6081375","DOIUrl":"https://doi.org/10.1109/ICCD.2011.6081375","url":null,"abstract":"Static energy dissipation is an increasing problem in contemporary processor design with shrinking feature sizes. Many schemes are proposed to cope with leakage in the literature ranging from using sleep transistors to lowering supply voltage. In this paper, we introduce a Conscious SRAM (CSRAM) design to lower static energy dissipation in the storage components of a processor. The proposed bitcell design adapts the body bias of its own transistors according to its contents. We show that the use of the proposed CSRAM cells results in significant reduction in the static energy dissipation of on-chip storage components without significant performance degradation. In order to reduce the area overhead introduced by the CSRAM we propose a simplified version of the cell at the circuit level. We also leverage the fact that the contents of adjacent bits of the stored values are highly dependent on each other, especially on the upper order bits of a value, and propose some architectural level solutions that lower the area overhead to as low as 7%.","PeriodicalId":354015,"journal":{"name":"2011 IEEE 29th International Conference on Computer Design (ICCD)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127676708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}