{"title":"A 375 nA Input Off Current Schmitt Triger LDO for Energy Harvesting IoT Sensors","authors":"K. Ishibashi, Shiho Takahashi","doi":"10.1109/ISVLSI.2018.00043","DOIUrl":"https://doi.org/10.1109/ISVLSI.2018.00043","url":null,"abstract":"This paper introduces a 375 nA input off current Schmitt trigger LDO, which is suitable for receiving the power from high internal impedance energy harvesting power sources. The Schmitt trigger LDO consumes 375 nA input current at the input voltage of 0.5 V, and occupies 276 x 295 um area using 0.18 um CMOS technology. The proposed Schmitt trigger LDO is used to make an Energy Harvesting Illumination Beat Sensor Node, so that the sensor node wirelessly transmits the data of illumination from 780 to 1540 lx without battery.","PeriodicalId":114330,"journal":{"name":"2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122386003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predictive Modeling for CPU, GPU, and FPGA Performance and Power Consumption: A Survey","authors":"Kenneth O'Neal, P. Brisk","doi":"10.1109/ISVLSI.2018.00143","DOIUrl":"https://doi.org/10.1109/ISVLSI.2018.00143","url":null,"abstract":"CPUs and dedicated accelerators (namely GPUs and FPGAs) continue to grow increasingly large and complex to support today's demanding performance and power requirements. Designers are tasked with evaluating the performance and power of similarly increasingly large design spaces during pre-silicon design for CPUs and GPUs to reduce time-to-market and limit manufacturing costs, or to figure out how to best map applications onto FPGAs using high-level synthesis tools. Typically, cycle-accurate simulators are used to evaluate workloads for pre-silicon CPUs and GPUs and to avoid the overhead of synthesis and place-and-route when targeting FPGAs; however, simulators exhibit prohibitively long run times that limit the number of design points and workloads that can be evaluated in a reasonable timeframe. This survey focuses on predictive modeling as an alternative to cycle-accurate simulation, which enables rapid evaluation of workloads and design points. When applied properly, predictive modeling can improve time to market, and can facilitate more comprehensive design space explorations with far less overhead than simulation. The survey focuses on predictive models applied to CPUs, GPUs, and FPGAs, noting that the general approach has been applied to many other computing platforms as well.","PeriodicalId":114330,"journal":{"name":"2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130226816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploration on Routing Configuration of HNoC with Reasonable Energy Consumption","authors":"Juan Fang, Zeqing Chang, Yanjin Cheng, Hui Zhao","doi":"10.1109/ISVLSI.2018.00140","DOIUrl":"https://doi.org/10.1109/ISVLSI.2018.00140","url":null,"abstract":"The Heterogeneous Network-on-Chip (HNoC) integrates CPU cores, Graphic Processing Unit (GPU) cores, last-level-cache and memory controllers. The heterogeneity of this architecture inevitably brings resource contention and energy shortage. In this work, we evaluate the impact of different router buffer capacities on communication delay and energy consumption. We run benchmarks to simulate different characteristics of the real-world applications, with the aim to balance performance with energy consumption under buffer resource limitations. Our evaluations of HNoC show that when the buffer resources are limited, by allocating more buffer to GPU, the energy consumption decreases by an average of 44.6%, while the performance degradation is negligible.","PeriodicalId":114330,"journal":{"name":"2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130259389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware Implementation of Reconfigurable Separable Convolution","authors":"L. Rao, Bin Zhang, Jizhong Zhao","doi":"10.1109/ISVLSI.2018.00051","DOIUrl":"https://doi.org/10.1109/ISVLSI.2018.00051","url":null,"abstract":"Convolution operations occupy large amounts of computation resource in convolutional neural networks (CNNs). Separable convolution can greatly reduce computational complexity. Unfortunately, most trained kernels in CNNs are not separable. In this paper, least squares approach is applied to decompose a non-separable 2D kernel into two 1D kernels. A reconfigurable convolutional architecture is proposed to convert a 2D convolution into 1D convolution in convolutional layers. Moreover, a denoising CNN is mapped to the proposed convolution architecture. Experimental results show that the hardware architecture can restore a 1280×720 image in 0.83s, which achieves an 8.4× speed-up over GPU implementation. Verification experiments demonstrate that our approach and hardware architecture can drastically reduce the computational complexity in convolution operations without sacrificing the performance.","PeriodicalId":114330,"journal":{"name":"2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127857126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Silicon Debug with Maximally Expanded Internal Observability Using Nearest Neighbor Algorithm","authors":"Ankit Jindal, Binod Kumar, Nitish Jindal, M. Fujita, Virendra Singh","doi":"10.1109/ISVLSI.2018.00019","DOIUrl":"https://doi.org/10.1109/ISVLSI.2018.00019","url":null,"abstract":"One of the most difficult challenges during the process of silicon debug is overcoming the bottleneck of limited visibility of internal states. Although the application of state restoration technique enhances the limited debug data available through on-chip trace buffers, the number of restored signal states is not significant. This paper proposes an approach which addresses the limited observability problem through a machine learning perspective. Based on training with pre-silicon buggy signatures on a relatively smaller design, a model is developed which identifies a set of neighbors for every flip-flop of the design. The application of nearest neighbors principle eliminates the obstacle of unknown signal values despite restoration because these values are obtained from the neighbors. Experimental results on benchmark circuits depict that the proposed approach is able to correctly discover 93% of the total signal values. The methodology is verified with the help of cross-validation of the debug data on designs injected with gate-level error models.","PeriodicalId":114330,"journal":{"name":"2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126234982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MAT: A Multi-strength Adversarial Training Method to Mitigate Adversarial Attacks","authors":"Chang Song, Hsin-Pai Cheng, Huanrui Yang, Sicheng Li, Chunpeng Wu, Qing Wu, Yiran Chen, H. Li","doi":"10.1109/ISVLSI.2018.00092","DOIUrl":"https://doi.org/10.1109/ISVLSI.2018.00092","url":null,"abstract":"Some recent work revealed that deep neural networks (DNNs) are vulnerable to so-called adversarial attacks where input examples are intentionally perturbed to fool DNNs. In this work, we revisit the DNN training process that includes adversarial examples into the training dataset so as to improve DNN's resilience to adversarial attacks, namely, adversarial training. Our experiments show that different adversarial strengths, i.e., perturbation levels of adversarial examples, have different working ranges to resist the attacks. Based on the observation, we propose a multi-strength adversarial training method (MAT) that combines the adversarial training examples with different adversarial strengths to defend against adversarial attacks. Two training structures, mixed MAT and parallel MAT, are developed to facilitate the tradeoffs between training time and hardware cost. Our results show that MAT can substantially minimize the accuracy degradation of deep learning systems to adversarial attacks on MNIST, CIFAR-10, CIFAR-100, and SVHN. The tradeoffs between training time, robustness, and hardware cost are also well discussed on an FPGA platform.","PeriodicalId":114330,"journal":{"name":"2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127716784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PPAP and iPPAP: PLL-Based Protection Against Physical Attacks","authors":"P. Ravi, S. Bhasin, J. Breier, A. Chattopadhyay","doi":"10.1109/ISVLSI.2018.00118","DOIUrl":"https://doi.org/10.1109/ISVLSI.2018.00118","url":null,"abstract":"Digital security practitioners are facing an enormous challenge in the face of the growing repertoire of physical attacks, e.g., Side Channel Attack (SCA) and Fault Injection Attack (FIA). Countermeasures to such threats are usually very different in nature and come with a significant performance penalty. While the FIA countermeasures rely on fault-detecting sensors or concurrent error detection schemes, SCA countermeasures are based on data masking or dual-rail logic circuits. Recently, a low-overhead FIA countermeasure has been proposed that utilises a ring oscillator circuit with Phase-Locked Loop (PLL). In this paper, we extend that countermeasure to further provide protection against SCA, thereby proposing PLL based Protection Against Physical attacks (PPAP). We demonstrate the PPAP on an FPGA prototype under rigorous SCA and FIA testing. We evaluate SCA resistance using the TVLA metric and observe a 2000x increase in SCA protection (in terms of number of traces) with PPAP. We further improve the security of PPAP using statistical analysis through an improved PPAP design (iPPAP) with an increase in SCA resistance of at least 5000x compared to the unprotected implementation with a minimal area overhead.","PeriodicalId":114330,"journal":{"name":"2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127999913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPAP: A Folded Architecture for Efficient Computing of Convolutional Neural Networks","authors":"Yizhi Wang, Jun Lin, Zhongfeng Wang","doi":"10.1109/ISVLSI.2018.00098","DOIUrl":"https://doi.org/10.1109/ISVLSI.2018.00098","url":null,"abstract":"Convolutional neural networks (CNNs) have found extensive applications in practice. However, weight/activation's sparsity and different data precision requirements across layers lead to a large amount of redundant computations. In this paper, we propose an efficient architecture for CNNs, named Folded Precision-Adjustable Processor (FPAP), to skip those unnecessary computations with ease. Computations are folded in the following two aspects to achieve efficient computing. On one hand, the dominant multiply-and-add (MAC) operations are performed bit-serially based on a bit-pair encoding algorithm so that the FPAP can adapt to different numerical precisions without using multipliers with long data width. On the other hand, a 1-D convolution is undertaken by a multi-tap transposed finite impulse response (FIR) filter, which is folded into one tap so that computations involving zero activations and weights can be easily skipped. Equipped with the precision-adjustable MAC unit and the folded FIR filter structure, a well-designed array architecture, consisting of many identical processing elements is developed, which is scalable for different throughput requirements and highly flexible for different numerical precisions. Besides, a novel genetic algorithm based kernel reallocation scheme is introduced to mitigate the load imbalance issue. Our synthesis results demonstrate that the proposed FPAP can significantly reduce the logic complexity and the critical path over the corresponding unfolded design, which only delivers slightly higher throughput when processing sparse and compact models. Our experiments also show that FPAP can scale its energy efficiency from 1.01TOP/s/W to 6.26TOP/s/W under 90nm CMOS technology when different data precisions are used.","PeriodicalId":114330,"journal":{"name":"2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134006911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-block APUF with 2-Level Voltage Supply","authors":"Yunxi Guo, Timothy Dee, A. Tyagi","doi":"10.1109/ISVLSI.2018.00067","DOIUrl":"https://doi.org/10.1109/ISVLSI.2018.00067","url":null,"abstract":"Physical Unclonable Functions (PUFs) are hardware cryptographic primitives for generating unique signatures from device manufacturing variations. Arbiter PUFs (APUFs) are a widely used class of PUF detecting process variations by exploiting the propagation delay differences between signals. However, both FPGA and ASIC implementations of APUFs suffer from systematic bias caused by either asymmetric routing or gradient effects in wafer doping. In this work, we introduce an improved APUF ASIC implementation achieving entropy enhancement without increasing area and power consumption significantly. In this design, a selector chain is divided into multiple blocks to avoid accumulation of systematic variation. Different voltage supplies are chosen for selector chain and arbiter circuit to overcome reliability problems produced by short chains. Cadence Monte Carlo sampling on 256-stage APUFs built in IBM 0.13µm technology shows the proposed Multi-Block (MB-) APUFs provide inter-chip uniqueness and reproducibility similar to double APUF (DAPUF); compared to DAPUF with similar uniqueness performance, MBAPUFs decrease area and power consumption by a factor of 2.","PeriodicalId":114330,"journal":{"name":"2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133142719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-End Industrial Study of Retiming","authors":"Cunxi Yu, Chau-Chin Huang, Gi-Joon Nam, M. Choudhury, Victor N. Kravets, A. Sullivan, M. Ciesielski, G. Micheli","doi":"10.1109/ISVLSI.2018.00046","DOIUrl":"https://doi.org/10.1109/ISVLSI.2018.00046","url":null,"abstract":"Sequential circuits are combinational circuits that are separated by registers. Retiming is considered the most promising technique for optimizing sequential circuits; it involves moving the edge-triggered registers across the combinational logic without changing the functionality. Despite significant efforts spent on sequential optimization since the 1980s, few works have discussed its performance in an end-to-end design flow. The retiming algorithms were mostly evaluated at the logic level. However, it turns out that the retiming results at the logic level could be significantly different from those evaluated at the physical level. This paper provides the findings of how retiming algorithms perform in an end-to-end industrial design flow, with seven industry designs taken from a recent 14nm microprocessor. Experiments are conducted with several complete industrial design flows. The evaluations are made at the end of the physical design flow. The experimental results show that the performance (design quality) of the retiming algorithms varies across the designs. Based on these experimental results, we discover a feature that describes the retiming potentials of sequential designs. This model successfully forecasts whether the given industrial designs could be significantly improved by retiming in an end-to-end design flow, regarding timing, area, and power.","PeriodicalId":114330,"journal":{"name":"2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115541822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}