{"title":"CNN-MERP: An FPGA-based memory-efficient reconfigurable processor for forward and backward propagation of convolutional neural networks","authors":"Xushen Han, Dajiang Zhou, Shihao Wang, S. Kimura","doi":"10.1109/ICCD.2016.7753296","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753296","url":null,"abstract":"Large-scale deep convolutional neural networks (CNNs) are widely used in machine learning applications. While CNNs involve huge complexity, VLSI (ASIC and FPGA) chips that deliver high-density integration of computational resources are regarded as a promising platform for CNN's implementation. At massive parallelism of computational units, however, the external memory bandwidth, which is constrained by the pin count of the VLSI chip, becomes the system bottleneck. Moreover, VLSI solutions are usually regarded as a lack of the flexibility to be reconfigured for the various parameters of CNNs. This paper presents CNN-MERP to address these issues. CNN-MERP incorporates an efficient memory hierarchy that significantly reduces the bandwidth requirements from multiple optimizations including on/off-chip data allocation, data flow optimization and data reuse. The proposed 2-level reconfigurability is utilized to enable fast and efficient reconfiguration, which is based on the control logic and the multiboot feature of FPGA. As a result, an external memory bandwidth requirement of 1.94MB/GFlop is achieved, which is 55% lower than prior arts. Under limited DRAM bandwidth, a system throughput of 1244GFlop/s is achieved at the Vertex UltraScale platform, which is 5.48 times higher than the state-of-the-art FPGA implementations.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124722331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VARIUS-TC: A modular architecture-level model of parametric variation for thin-channel switches","authors":"S. K. Khatamifard, M. Resch, N. Kim, Ulya R. Karpuzcu","doi":"10.1109/ICCD.2016.7753353","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753353","url":null,"abstract":"Under aggressive miniaturization, unconventional digital switches rapidly come to light, which introduce new sources of variation in design parameters, and hence challenge the manufacturing process further. As a result, performance and power of manufactured hardware becomes greatly unpredictable. Characterizing variation-incurred unpredictability at early stages of the design necessitates dependable architecture-level models of variation, which distill device- and circuit-level details to accurately evaluate system-level implications. In this paper, we introduce a modular architecture-level model of parametric variation to address this challenge. As a case study, we refine our discussion to a representative class of emerging thin-channel switches, FinFETs.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130344391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How logic masking can improve path delay analysis for Hardware Trojan detection","authors":"Arash Nejat, D. Hély, V. Beroulle","doi":"10.1109/ICCD.2016.7753319","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753319","url":null,"abstract":"Hardware Trojan (HT), Integrated Circuit (IC) piracy, and overproduction are three important threats which may happen in untrusted foundries. Modifying structurally the IC design at different abstraction level to counter the HT threats is known as Design-For-Hardware-Trust (DFHT). DFHT methods are used in order to facilitate HT detection methods. In addition, logic masking has been proposed against IC piracy and overproduction. Logic masking modifies the circuit such that it does not work correctly without applying the correct key. In this paper, we propose a DFHT method reusing logic masking approach. The proposed DFHT method modifies the design to improve the HT detection methods that are based on the path delay analysis. The objective of the proposed approach is to generate fake short paths for nets which only belong to long paths, because the delay of shorter paths varies less than longer ones. Our experiments, after technology mapping, show that the proposed DFHT method increases the HT detectability and also provides the advantages of usual logic masking methods.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121019793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A readback based general debugging framework for soft-core processors","authors":"Changgong Li, Alexander Schwarz, C. Hochberger","doi":"10.1109/ICCD.2016.7753342","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753342","url":null,"abstract":"Using Field Programmable Gate Arrays (FPGAs) as implementation platform for systems-on-chip (SoC) has become quite popular. Typically, the software part of the system functionality is executed on a soft-core processor. Debugging such systems becomes more difficult than standard SoCs since regular debugging facilities are not always available for the processor cores and also additional hardware problems can overlap with software bugs. Thus, it is interesting to provide a general debugging framework that can help to identify SW and HW problems. In this contribution, we use the readback feature of modern FPGAs to implement such a general framework while at the same time minimizing the additional HW resources required for the debugging. We interface our debugging facilities with a full featured development environment such that the user can work at a very high level of abstraction.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"571 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116266818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic prefetcher reconfiguration for diverse memory architectures","authors":"Junghoon Lee, Taehoon Kim, Jaehyuk Huh","doi":"10.1109/ICCD.2016.7753270","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753270","url":null,"abstract":"With the advent of stacked memory and new memory architectures, the heterogeneity of memory has been increasing. In the diverse memory technologies, each memory architecture has its own advantages and weaknesses. Considering the trade-offs, future systems are expected to support multiple memory architectures with a hybrid memory system. However, such diversity of memory architectures complicates the performance optimization of on-chip memory hierarchy. One of the key components affected by this trend is the hardware prefetcher. The available memory bandwidth highly affects the effectiveness of prefetchers, and the aggressiveness of prefetchers must be tuned for memory architectures as well as application behaviors. This paper investigates the effect of memory diversity on the prefetcher parameter selection, and proposes a dynamic parameter search mechanism to adjust the prefetch aggressiveness under various memory architectures. Using a general hill climbing scheme periodically, the mechanism adapts to the memory architectures and application behaviors effectively. In addition to such automatic tuning, the study improves the solution for cache pollution exacerbated by the increase of speculative data from more aggressive prefetchers in higher bandwidth memory. With the dynamic parameter search and pollution mitigation, the proposed framework improves the performance of applications by 12.4% on average compared to the prior scheme for tuning prefetch parameters.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125217102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithms for CPU and DRAM DVFS under inefficiency constraints","authors":"R. Begum, Mark Hempstead, Guru Prasad Srinivasa, Geoffrey Challen","doi":"10.1109/ICCD.2016.7753276","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753276","url":null,"abstract":"Dynamic voltage and frequency scaling (DVFS) of both the core and DRAM provides opportunities to trade-off performance in order to save energy. Previous approaches to core and DRAM power management using DVFS used performance, specifically acceptable performance loss, as a constraint. We present energy management algorithms that coordinate core and DRAM frequency scaling under a specified energy budget. Approaches that work under performance constraints, as we will show, are not directly applicable to systems operating under energy constraints, as it is difficult to calculate the correct performance bounds in real-time to stay under an energy budget. Setting arbitrary energy budgets for a diverse set of applications can be harmful to application performance. We use the previously introduced concept of Inefficiency - the additional amount of energy above the minimum required energy that can be used to improve performance - to provide a dynamic energy constraint to our system. We introduce new power management algorithms that search the power and performance space to find the best performing point under this constraint. We demonstrate the efficacy of our algorithms using CPU DVFS and DRAM frequency scaling. We show that our algorithms have 24% lower tuning cost and save up to 5% energy with a little performance loss compared to a state-of-the-art performance constrained system.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116901892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Wireless Network-on-Chip analysis of propagation technique for on-chip communication","authors":"Vasil Pano, I. Yilmaz, Yuqiao Liu, B. Taskin, K. Dandekar","doi":"10.1109/ICCD.2016.7753313","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753313","url":null,"abstract":"Network-on-Chip (NoC) is a communication paradigm capable of facilitating a scalable interconnection infrastructure for multi core processors. Wireless NoCs have been introduced to improve the communication performance over long-distance processing nodes. Current on-chip antennas used in wireless NoCs communicate predominantly through surface waves, where the efficacy of the wireless nodes is partially determined by the radiation efficiency and transmission gain limited due to the conductivity loss of the silicon substrate. Recently, an on-chip propagation technique of radio waves was introduced, through the un-doped silicon layer as opposed to surface-waves prevalent in literature. The through-substrate propagation waves provide a unique solution to overcome the challenge of long-distance communication between processing nodes. In this work, overall improvements are shown compared to traditional wireless NoCs with the placement of antennas on undoped silicon (i.e. communicating through surface waves), simulated in NoC architectures across performance metrics of area, power consumption and latency.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128238941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power-aware virtual machine mapping in the data-center-on-a-chip paradigm","authors":"X. Lin, Yuankun Xue, P. Bogdan, Yanzhi Wang, S. Garg, Massoud Pedram","doi":"10.1109/ICCD.2016.7753286","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753286","url":null,"abstract":"It is projected that hundreds of cores can be integrated into a chip at the sub-20nm technology nodes. However, some challenges exist in the many-core architecture such as maintaining memory coherence, underutilized parallelism, and increased inter-core communication delay. This work proposes the data-center-on-a-chip (DCoC) paradigm employing virtualization technologies commonly used in today's data centers to reduce the overhead of maintaining memory coherence and inter-core communication and improve parallelism. In the DCoC paradigm, user applications with specific resource requirements need to be mapped onto different chips of a data center and different cores of a chip in the form of virtual machines (VMs). By a judicious VM mapping method, the data center performance can be maximized while satisfying the power budget and power density constraints of the chips and the resource requirements of VMs. To tackle the NP-hardness of the VM mapping problem, we propose a two-tier algorithm, which effectively solves the mapping problem with polynomial time complexity.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130595821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-Pattern enabled Self-Recovery multimedia storage system for near-threshold computing","authors":"Na Gong, J. Edstrom, Dongliang Chen, Jinhui Wang","doi":"10.1109/ICCD.2016.7753332","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753332","url":null,"abstract":"The growing popularity of powerful mobile devices such as smart phones and tablet devices has resulted in the exponential growth of demand for video applications. However, due to the intensive computation of the video decoding process, mobile video applications require frequent embedded memory access, which consumes a large amount of power and limits battery life. Various low-voltage memory techniques have been investigated to enhance the energy efficiency of multimedia processing system. Unfortunately, the existing research suffers from high implementation complexity and large area overhead. In this paper, we present a low-cost self-recovery video storage system by investigating meaningful data patterns hidden in mobile video data. Specifically, we propose a two-dimensional data-pattern approach to explore horizontal data-association and vertical data-correlation characteristics. Based on the identified optimal data patterns, we present a simple circuit-level SRAM design to enable self-recovery at low voltages. A 45nm 32kb SRAM is designed that delivers good video quality at near-threshold voltage (0.5 V) with negligible area overhead (3.97%).","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130678304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Error behaviors testing with temperature and magnetism dependency for MRAM","authors":"Xin Shi, Fei Wu, Xidong Guan, C. Xie","doi":"10.1109/ICCD.2016.7753302","DOIUrl":"https://doi.org/10.1109/ICCD.2016.7753302","url":null,"abstract":"Magnetoresistive random access memory (MRAM) has the potential to become a universal memory for future storage system. However, the stability of MRAM is sensitive to temperature and magnetic field. To obtain a strong understanding about how the temperature and magnetic field impact the reliability characteristics of real MRAM devices, Everspin MR4A08BYS35, we present an error behavior model to categorize two types of MRAM errors. Based on our proposed error model, we conduct extensive experiments on real MRAM devices in different temperatures and magnetic fields. Our results show that MRAM lifetime for the chips we tested is demonstrated infinite under normal operation environment. The critical temperature is 75°C and the dominant error type is read error. In contrast, write error is more seriously than read error in magnetic environment. The critical magnetic field intensity is 140Gauss. These results can be used for measuring the fabrication quality of individual MRAM memory chips.","PeriodicalId":297899,"journal":{"name":"2016 IEEE 34th International Conference on Computer Design (ICCD)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123344092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}