{"title":"Pull-off buffer: Borrowing cache space to avoid deadlock for fault-tolerant NoC routing","authors":"Airan Shao, Dongsheng Wang, Haixia Wang","doi":"10.1109/ICCD.2016.7753328","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753328","url":null,"abstract":"Advances in semiconductor technology have led to large chip multiprocessor (CMP) employing network-on-chip (NoC) to provide scalable on-chip communication. This higher integration capacity, on the other hand, increases the possibility of faults. To tackle this challenge, fault-tolerant routing in NoC becomes essential, which allows packets to be routed around faulty network components and maintains normal communication. However, to tolerate a large number of faults, the deadlock problem becomes very difficult to deal with. Existing highly fault-tolerant routing solutions employ virtual channel (VC) or topology-agnostic routing for deadlock avoidance, but at the cost of lower network performance and the demand for extra hardware. In this paper, we show that it is possible to design a novel highly fault-tolerant routing method without VC and topology-agnostic routing. We present pull-off buffer (POB), a FIFO buffer borrowing the space already present in cache, to avoid potentially existing deadlocks. POBs borrow cache space only from selected nodes and only after the occurrence of faults. The space of caches at other nodes will not be affected. 
Experimental results show that our solution can provide 2x to 3x higher network throughput and reduce router area and power overhead, when compared against existing highly fault-tolerant routing methods employing VC or topology-agnostic routing.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126523856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WILD: A workload-based learning model to predict dynamic delay of functional units","authors":"Xun Jiao, Yu Jiang, Abbas Rahimi, Rajesh K. Gupta","doi":"10.1109/ICCD.2016.7753279","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753279","url":null,"abstract":"Dynamic critical path analysis in modern processors is needed to reduce margins typically determined by the static timing analysis. Dynamic path analysis, however, is cost-prohibitive. In this paper, we propose WILD, a supervised learning model to predict dynamic delay of functional units (FUs) based on the input workload during execution. We measure the dynamic delay using switching activity generated through gate-level simulation of a post place-and-route design in TSMC 45nm process. We then look for `features' in the input data that influence dynamic path sensitization. Using these features we apply a logistic regression (LR) method to construct a predictive model trained and tested using three datasets: random, Sobel filter and Gaussian filter. We classify dynamic delay into five distinct classes. For a given test input, WILD predicts the class of output dynamic delay. On average across several FUs, 98.0% of WILD predictions are consistent with gate-level simulation. 
Using WILD-directed dynamic frequency scaling can improve instruction-level performance by 13%-44% compared to the state-of-the-art instruction-level timing model.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116837546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fast, fully verifiable, and hardware predictable ASIC design methodology","authors":"P. Yang, M. Marek-Sadowska","doi":"10.1109/ICCD.2016.7753304","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753304","url":null,"abstract":"In this paper, a fast, fully verifiable, and hardware predictable ASIC design methodology is proposed and demonstrated for the Vertical Slit FET (VeSFET) based integrated circuits. The key enablers of this methodology are the unique and powerful capabilities of pillar-based two-side accessible transistor arrays and monolithic 3D integration. VeSFET is a successfully fabricated transistor of this kind. In the proposed methodology, the circuit is first designed on a 3D FPGA platform using a conventional FPGA design flow. With a little extra Back End of Line (BEOL) masking cost, the design implemented on the 3D FPGA is migrated to the final 2D ASIC, which has exactly the same performance and the verification tasks performed on the 3D FPGA platform remain valid for the final 2D ASIC. The 2D ASIC has the same layout as the silicon-proven 3D FPGA, which greatly mitigates the unpredictable factors of fabrication. The proposed methodology retains all the benefits of FPGA design flow. Eleven MCNC benchmark circuits were implemented. 
Compared to the 2D FPGA, the final 2D ASIC implementation and the 3D FPGA design platform are on average 15% faster, consume 17% less power, and occupy 44% less area.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125248388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Voting system design pitfalls: Vulnerability analysis and exploitation of a model platform","authors":"K. Ly, Orlando Arias, Jacob Wurm, Khoa Hoang, Kaveh Shamsi, Yier Jin","doi":"10.1109/ICCD.2016.7753273","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753273","url":null,"abstract":"Homomorphic encryption may be seen as a substantial potential boon to voting systems. If properly used, it allows provably anonymous elections to take place. However, when poorly constructed, using weak cryptographic primitives results in highly vulnerable systems that are prone to attacks. This paper details one attack done against a model of an election system as part of a security competition, where a hardware Trojan has weakened its security. We designed a proof of concept exploit and implemented it on an FPGA, demonstrating weaknesses in the system regardless of the existence of this Trojan.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120843633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Thermal-aware 3D design for side-channel information leakage","authors":"P. Gu, Dylan C. Stow, Russell Barnes, E. Kursun, Yuan Xie","doi":"10.1109/ICCD.2016.7753336","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753336","url":null,"abstract":"Side-channel attacks are important security challenges as they reveal sensitive information about on-chip activities. Among such attacks, the thermal side-channel has been shown to disclose the activities of key functional blocks and even encryption keys. This paper proposes a novel approach to proactively conceal critical activities in the functional layers while minimizing the power dissipation by (i) leveraging inherent characteristics of 3D integration to protect from side-channel attacks and (ii) dynamically generating custom activity patterns to match the activity to be concealed in the functional layers. Experimental analysis shows that 3D technology combined with the proposed run-time algorithm effectively reduces the Side-channel Vulnerability Factor (SVF) below 0.05 and the Spatial Thermal Side-channel Factor (STSF) below 0.59.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115291892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Provenance to boost the Metadata Prefetching in distributed storage systems","authors":"G. Wu, Yuhui Deng, X. Qin","doi":"10.1109/ICCD.2016.7753264","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753264","url":null,"abstract":"Caching and prefetching are effective approaches to boosting the performance of metadata access in distributed storage systems. Many research efforts have been devoted to developing new metadata prefetching methods by considering past file access patterns. However, the existing methods do not consider the correlations between processes and the corresponding files (e.g., file provenance). Therefore, the methods cannot obtain very rich and accurate correlations, thus decreasing the effectiveness of metadata prefetching. This paper presents a Provenance-based Metadata Prefetching (ProMP) scheme, which considers both provenance and the past file access patterns. Through mining the correlations between processes and corresponding files from provenance and past access history, ProMP can achieve accurate and rich correlation information. ProMP is conducive to employing aggressive metadata prefetching to boost the performance by leveraging the correlations. 
Our experimental results show that ProMP performs more effectively with less memory overhead than the existing solutions, while improving the hit rates by up to 49% and 7% in contrast to traditional LRU and a state-of-the-art metadata prefetching algorithm Nexus, respectively.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122722746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPMario: Scale up MapReduce with I/O-Oriented Scheduling for the GPU","authors":"Yang Liu, Hung-Wei Tseng, S. Swanson","doi":"10.1109/ICCD.2016.7753309","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753309","url":null,"abstract":"The popularity of GPUs in general purpose computation has prompted efforts to scale up MapReduce systems with GPUs, but lack of efficient I/O handling results in underutilization of shared system resources in existing systems. This paper presents SPMario, a scale-up GPU MapReduce framework to speed up job execution and boost utilization of system resources with the new I/O Oriented Scheduling. The evaluation on a set of representative benchmarks against a highly-optimized baseline system shows that for the single job cases, SPMario can speed up job execution by up to 2.28×, and boost GPU utilization by 2.12× and I/O utilization by 2.51×. When scheduling two jobs together, I/O Oriented Scheduling outperforms round-robin scheduling by up to 13.54% in total execution time, and by up to 12.27% and 14.92% in GPU and I/O utilization, respectively.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129597521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 64 kb differential single-port 12T SRAM design with a bit-interleaving scheme for low-voltage operation in 32 nm SOI CMOS","authors":"Samira Ataei, J. Stine, Matthew R. Guthaus","doi":"10.1109/ICCD.2016.7753333","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753333","url":null,"abstract":"In this paper, a novel differential single-port 12T SRAM bitcell is presented. This bitcell uses a read buffer to eliminate read disturbance, improves the read stability and achieves read static noise margin equal to its hold static noise margin. Using a column-based select signal this bitcell provides a half-select free feature, facilitating a bit-interleaving structure to reduce multi-bit soft errors by conventional error correcting code techniques. By boosting the wordline and select signal voltage, this bitcell can read and write with no error at 300 mV while data can be held down to 250 mV in standby mode. Bitline leakage suppression in 12T bitcell allows more bitcells per bitline for high density SRAMs and provides faster read operation. This paper also introduces OpenRAM, an open-source memory compiler, that provides a platform for the generation, characterization, and verification of fabricable memory designs across various technologies, sizes, and configurations. 
Using OpenRAM, a 64 kb 12T SRAM macro is designed in IBM 32 nm SOI CMOS technology that operates down to 0.3 V with 50 MHz operating frequency while it functions at 0.9 V with 2.2 GHz operating frequency, as well.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129982090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DOART: A low-power and low-latency Network-on-Chip","authors":"W. Zong, Qiang Xu","doi":"10.1109/ICCD.2016.7753301","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753301","url":null,"abstract":"Packet-switched Network-on-Chip (NoC) is the shared global communication infrastructure for future large-scale chip multi-processors (CMPs). Recently, Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) on repeater-inserted wires to reduce packet delay was proposed. But current NoC with SMART support adds complexity to conventional routers and incurs high power consumption. In this paper, we propose a low-power and low-latency NoC design with SMART support, called Dimension Ordered Asynchronous Repeated Traversal (DOART). First we design a low-power interconnect called Single-cycle Intra-dimension Bridge (SIB) with SMART support, and then we propose an efficient construction framework to connect SIBs generating a large-scale low-power and low-latency NoC. In addition, the proposed DOART supports virtual channel and is protocol and routing-level deadlock-free. Experimental results show that DOART can reduce both the application execution time and network power consumption compared with state-of-the-art NoCs with SMART support.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121065010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware thread reordering to boost OpenCL throughput on FPGAs","authors":"Amir Momeni, H. Tabkhi, G. Schirner, D. Kaeli","doi":"10.1109/ICCD.2016.7753288","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753288","url":null,"abstract":"Availability of OpenCL for FPGAs has raised new questions about the efficiency of massive thread-level parallelism on FPGAs. The general trend is toward creating deep pipelining and in-order execution of many OpenCL threads across a shared data-path. While this can be a very effective approach for regular kernels, its efficiency significantly diminishes for irregular kernels with runtime-dependent control flow. We need to look for new approaches to improve execution efficiency of FPGAs when targeting irregular OpenCL kernels. This paper proposes a novel solution, called Hardware Thread Reordering (HTR), to boost the throughput of the FPGAs when executing irregular kernels possessing non-deterministic runtime control flow. The key insight of HTR is out-of-order OpenCL thread execution over a shared data-path to achieve significantly higher throughput. The thread reordering is performed at a basic-block level granularity. The synthesized basic-blocks are extended with independent pipeline control signals and context registers to bypass the live values of reordered threads. We demonstrate the efficiency of our proposed solution on three parallel irregular kernels. For the experiments, we utilize the LegUp tool to compare the baseline (in-order) data-path with HTR-enhanced data-path. 
Our RTL simulation results demonstrate that the HTR-enhanced data-path achieves up to 11× increase in kernel throughput at a very low overhead (less than 2× increase in FPGA resources).","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122480236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}