{"title":"Computing in memory with FeFETs","authors":"D. Reis, M. Niemier, X. Hu","doi":"10.1145/3218603.3218640","DOIUrl":"https://doi.org/10.1145/3218603.3218640","url":null,"abstract":"Data transfer between a processor and memory frequently represents a bottleneck with respect to improving application-level performance. Computing in memory (CiM), where logic and arithmetic operations are performed in memory, could significantly reduce both energy consumption and computational overheads associated with data transfer. Compact, low-power, and fast CiM designs could ultimately lead to improved application-level performance. This paper introduces a CiM architecture based on ferroelectric field effect transistors (FeFETs). The CiM design can serve as a general-purpose random access memory (RAM), and can also perform Boolean operations ((N)AND, (N)OR, X(N)OR, INV) as well as addition (ADD) between words in memory. Unlike existing CiM designs based on other emerging technologies, FeFET-CiM accomplishes the aforementioned operations via a single current reference in the sense amplifier, which leads to more compact designs and lower power. Furthermore, the high Ion/Ioff ratio of FeFETs enables an inexpensive voltage-based sense scheme. Simulation-based case studies suggest that our FeFET-CiM can achieve speed-ups (and energy reductions) of ~119X (~1.6X) and ~1.97X (~1.5X) over ReRAM and STT-RAM CiM designs with respect to in-memory addition of 32-bit words.
Furthermore, our approach offers an average speedup of ~2.5X and energy reduction of ~1.7X when compared to a conventional (not in-memory) approach across a wide range of benchmarks.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":"93 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73558725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPONGE","authors":"Hossein Farrokhbakht, H. M. Kamali, Natalie D. Enright Jerger, S. Hessabi","doi":"10.1093/nq/s1-iii.81.390i","DOIUrl":"https://doi.org/10.1093/nq/s1-iii.81.390i","url":null,"abstract":"","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73333282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 2.6 mW Single-Ended Positive Feedback LNA for 5G Applications","authors":"S. Arshad, A. Beg, R. Ramzan","doi":"10.1145/3218603.3218619","DOIUrl":"https://doi.org/10.1145/3218603.3218619","url":null,"abstract":"This paper presents the design of a single-ended positive feedback Common Gate (CG) Low Noise Amplifier (LNA) for 5G applications. Positive feedback is utilized to achieve the trade-off between the input matching, the gain, and the noise factor (NF) of the LNA. The positive feedback inherently cancels the noise produced by the input CG transistor. The proposed LNA is designed and fabricated in 150 nm CMOS by L-Foundry. At 1.41 GHz, the measured S11 and S22 are better than -20 dB and -8.4 dB, respectively. The highest voltage gain is 16.17 dB with an NF of 3.64 dB. The complete chip has an area of 1 mm2. The LNA's power dissipation is only 2.6 mW with a 1 dB compression point of -13 dBm. The simple, low-power, single-ended architecture of the proposed LNA allows it to be implemented in phased-array and Multiple Input Multiple Output (MIMO) radars, which have limited input and output pads and constrained power budgets for on-board components.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79794227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NNest","authors":"Liu Ke, Xin He, Xuan Zhang","doi":"10.1145/3218603.3218647","DOIUrl":"https://doi.org/10.1145/3218603.3218647","url":null,"abstract":"Deep neural networks (DNNs) have achieved spectacular success in recent years. In response to DNNs' enormous computation demand and memory footprint, numerous inference accelerators have been proposed. However, the diverse nature of DNNs, both at the algorithm level and the parallelization level, makes it hard to arrive at a \"one-size-fits-all\" hardware design. In this paper, we develop NNest, an early-stage design space exploration tool that can speedily and accurately estimate the area/performance/energy of DNN inference accelerators based on high-level network topology and architecture traits, without the need for low-level RTL code. Equipped with a generalized spatial architecture framework, NNest is able to perform fast high-dimensional design space exploration across a wide spectrum of architectural/micro-architectural parameters. Our proposed novel data movement strategies and multi-layer fitting schemes allow NNest to more effectively exploit the parallelism inherent in DNNs.
Results generated by NNest demonstrate: 1) previously undiscovered accelerator design points that can outperform state-of-the-art implementations by 39.3% in energy efficiency; 2) Pareto frontier curves that comprehensively and quantitatively reveal the multi-objective tradeoffs in custom DNN accelerators; 3) holistic design exploration of different levels of quantization techniques, including the recently proposed binary neural network (BNN).","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88332430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Image Sensor Subsampling for DNN-Based Image Classification","authors":"Jiaxian Guo, Hongxiang Gu, M. Potkonjak","doi":"10.1145/3218603.3218618","DOIUrl":"https://doi.org/10.1145/3218603.3218618","url":null,"abstract":"Today's mobile devices are equipped with cameras capable of taking very high-resolution pictures. For computer vision tasks which require relatively low resolution, such as image classification, subsampling is desired to reduce the unnecessary power consumption of the image sensor. In this paper, we study the relationship between subsampling and the performance degradation of image classifiers that are based on deep neural networks (DNNs). We empirically show that subsampling with the same step size leads to very similar accuracy changes for different classifiers. In particular, we could achieve over 15x energy savings just by subsampling while suffering almost no accuracy loss. For even better energy-accuracy trade-offs, we propose AdaSkip, where the row sampling resolution is adaptively changed based on the image gradient. We implement AdaSkip on an FPGA and report its energy consumption.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":"72 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83930169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variation-Aware Pipelined Cores through Path Shaping and Dynamic Cycle Adjustment: Case Study on a Floating-Point Unit","authors":"Ioannis Tsiokanos, L. Mukhanov, Dimitrios S. Nikolopoulos, G. Karakonstantis","doi":"10.1145/3218603.3218617","DOIUrl":"https://doi.org/10.1145/3218603.3218617","url":null,"abstract":"In this paper, we propose a framework for minimizing variation-induced timing failures in pipelined designs, while limiting any overhead incurred by conventional guardband-based schemes. Our approach initially limits the long latency paths (LLPs) and isolates them in as few pipeline stages as possible by shaping the path distribution. Such a strategy facilitates the adoption of a special unit that predicts the excitation of the isolated LLPs and dynamically allows an extra cycle for the completion of only these error-prone paths. Moreover, our framework performs post-layout dynamic timing analysis based on real operands that we extract from a variety of applications. This allows us to estimate the bit error rates under potential delay variations, while considering the dynamic data-dependent path excitation. When applied to the implementation of an IEEE-754 compatible double precision floating-point unit (FPU) in a 45 nm process technology, the path shaping helps to reduce the bit error rates on average by 2.71x compared to the reference design under 8% delay variations.
The integrated LLP prediction unit and the dynamic cycle adjustment avoid such failures and any quality loss at a cost of up to 0.61% throughput and 0.3% area overhead, while saving 37.95% power on average compared to an FPU with pessimistic margins.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89218197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aggressive Slack Recycling via Transparent Pipelines","authors":"Gokul Subramanian Ravi, Mikko H. Lipasti","doi":"10.1145/3218603.3218623","DOIUrl":"https://doi.org/10.1145/3218603.3218623","url":null,"abstract":"In order to operate reliably and produce expected outputs, modern architectures set timing margins conservatively at design time to support extreme variations in workload and environment. Unfortunately, the conservative guard bands set to achieve this reliability create clock cycle slack and are detrimental to performance and energy efficiency. To combat this, we propose Aggressive Slack Recycling via Transparent Pipelines. Our proposal performs timing speculation while allowing data to flow asynchronously via transparent latches, between synchronous boundaries. This allows timing speculation to cater to the average slack across asynchronous operations rather than the slack of the most critical operation - maximizing slack conservation and timing speculation efficiency. We design a slack tracking mechanism which runs in parallel with the transparent data path to estimate the accumulated slack across operation sequences. The mechanism then appropriately clocks synchronous boundaries early to minimize wasted slack and maximize clock cycle savings. We implement our proposal on a spatial fabric and achieve absolute speedups of up to 20% and relative improvements (vs.
competing mechanisms) of up to 75%.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":"9 12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88812918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GAS","authors":"Minxuan Zhou, M. Imani, Saransh Gupta, Tajana Rosing","doi":"10.1145/3218603.3218631","DOIUrl":"https://doi.org/10.1145/3218603.3218631","url":null,"abstract":"Yufutsu gas condensate field has a fractured reservoir which consists of conglomerate and granite. It contains condensate-rich gas. We observed some strange fluid behaviors. The production gas-oil ratio ranges from 900 m3/kl to 1,900 m3/kl according to production area, where reservoir depth differs. Gas production started ten years ago. At the early stage of production, the gas-oil ratio decreased with time. This trend is opposite to the normal behavior of a gas condensate reservoir. We also observed that heavy gas existed above light gas in the tubing at a specific well. The reservoir lies at approximately 4,000 m depth and is about 1,000 m thick, so pressure and temperature vary over a wide range with reservoir depth. Gas compositions at the top and the bottom of the reservoir are markedly different: the lightest component tends to increase with depth and temperature due to thermal diffusion, because methane has a larger thermal diffusion coefficient than the other, heavier components.
This paper reviews the above phenomena and the underlying theory, and reports current efforts to increase condensate recovery using a gas-cycling scheme.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74675870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLINK","authors":"Zhe Chen, Andrew G. Howe, H. T. Blair, J. Cong","doi":"10.1145/3218603.3218637","DOIUrl":"https://doi.org/10.1145/3218603.3218637","url":null,"abstract":"Neurofeedback devices measure brain waves and generate feedback signals in real time, and can be employed as treatments for various neurological diseases. Such devices require high energy efficiency because they need to be worn by or surgically implanted into patients and must support long battery lifetimes. In this paper, we propose CLINK, a compact LSTM inference kernel, to achieve highly energy-efficient EEG signal processing for neurofeedback devices. The LSTM kernel can approximate conventional filtering functions while saving 84% of computational operations. Based on this method, we propose energy-efficient customizable circuits for realizing the CLINK function. We demonstrated a 128-channel EEG processing engine on a Zynq-7030 at 0.8 W, and a scaled-up 2048-channel evaluation on a Virtex-VU9P shows that our design can achieve 215x and 7.9x better energy efficiency than highly optimized implementations on an E5-2620 CPU and a K80 GPU, respectively. We also carried out the CLINK design in a 15-nm technology, and synthesis results show that it can achieve an energy efficiency of 272.8 pJ/inference, which further outperforms our design on the Virtex-VU9P by 99x.","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74797064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Taming the beast: Programming Peta-FLOP class Deep Learning Systems","authors":"Swagath Venkataramani, V. Srinivasan, Jungwook Choi, K. Gopalakrishnan, Leland Chang","doi":"10.1145/3218603.3241338","DOIUrl":"https://doi.org/10.1145/3218603.3241338","url":null,"abstract":"EXTENDED ABSTRACT: The field of Artificial Intelligence (AI) has witnessed tremendous growth in recent years with the advent of Deep Neural Networks (DNNs), which have achieved state-of-the-art performance on challenging cognitive tasks involving images, videos, text and natural language. They are being increasingly deployed in many real-world services and products, and have pervaded the spectrum of computing devices from mobile/IoT devices to server-class platforms. However, DNNs are highly compute- and data-intensive workloads, far outstripping the capabilities of today's computing platforms. For example, state-of-the-art image recognition DNNs require billions of operations to classify a single image. On the other hand, training DNN models demands exa-flops of compute and uses massive datasets requiring hundreds of gigabytes of memory. One approach to address the computational challenges imposed by DNNs is through the design of hardware accelerators, whose compute cores, memory hierarchy and interconnect topology are specialized to match the DNN's compute and communication characteristics. Several such designs, ranging from low-power IP cores to large-scale accelerator systems, have been proposed in the literature. Some factors that enable the design of specialized systems for DNNs are: (i) their computations can be expressed as static data-flow graphs, (ii) their computation patterns are regular with no data-dependent control flows and offer abundant opportunities for data reuse, and (iii) their functionality can be encapsulated within a small set of (tens of) basic functions (e.g., convolution, matrix multiplication). That said, DNNs also exhibit abundant heterogeneity at various levels.
Across layers, the number of input and output channels and the dimensions of each feature are substantially different. Further, each layer comprises operations whose Bytes/FLOP requirements vary by over two orders of magnitude. The heterogeneity in compute characteristics engenders a wide range of possibilities to spatiotemporally map DNNs on accelerator platforms, defined in terms of how computations are split across the different compute elements in the architecture and how computations assigned to a compute element are temporally sequenced in time. We are therefore led to ask whether it is possible to come up with a systematic exploration of the design space of mapping configurations to maximize a DNN's performance on a given accelerator architecture using a variety of different dataflows. How will the computations be partitioned and sequenced across the processing","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73272514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}