{"title":"ADDHard","authors":"Sai Manoj Pudukotai Dinakarrao, A. Jantsch","doi":"10.1145/3194554.3194647","DOIUrl":"https://doi.org/10.1145/3194554.3194647","url":null,"abstract":"Anomaly detection in Electrocardiogram (ECG) signals facilitates the diagnosis of cardiovascular diseases i.e., arrhythmias. Existing methods, although fairly accurate, demand a large number of computational resources. Based on the pre-processing of ECG signal, we present a low-complex digital hardware implementation (ADDHard) for arrhythmia detection. ADDHard has the advantages of low-power consumption and a small foot print. ADDHard is suitable especially for resource constrained systems such as body wearable devices. Its implementation was tested with the MIT-BIH arrhythmia database and achieved an accuracy of 97.28% with a specificity of 98.25% on average.","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"258 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123074987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Cache Management Scheme for Capacitor Equipped Solid State Drives","authors":"Congming Gao, Liang Shi, Yejia Di, Qiao Li, C. Xue, E. Sha","doi":"10.1145/3194554.3194639","DOIUrl":"https://doi.org/10.1145/3194554.3194639","url":null,"abstract":"Within SSDs, random access memory (RAM) has been adopted as cache inside controller for achieving better performance. However, due to the volatility characteristic of RAM, data loss may happen when sudden power interrupts. To solve this issue, capacitor has been equipped inside emerging SSDs as interim supplier. However, the aging issue of capacitor will result in capacitance decreases over time. Once the remaining capacitance is not able to write all dirty pages in the cache back to flash memory, data loss may happen. In order to solve the above issue, an efficient cache management scheme for capacitor equipped SSDs is proposed in this work. The basic idea of the scheme is to bound the number of dirty pages in cache within the capability of the capacitor. Simulation results show that the proposed scheme achieves encourage improvement on lifetime and performance while power interruption induced data loss is avoided.","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126848587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dataflow-Based Mapping of Spiking Neural Networks on Neuromorphic Hardware","authors":"Anup Das, Akash Kumar","doi":"10.1145/3194554.3194627","DOIUrl":"https://doi.org/10.1145/3194554.3194627","url":null,"abstract":"Spiking Neural Networks (SNNs) are powerful computation engines for pattern recognition and image classification applications. Apart from application performance such as recognition and classification accuracy, system performance such as throughput becomes important when executing these applications on a hardware. We propose a systematic design-flow to map SNN-based applications on a crossbar-based neuromorphic hardware, guaranteeing application as well as system performance. Synchronous Dataflow Graphs (SDFGs) are used to model these applications with extended semantics to represent neural network topologies. Self-timed scheduling is then used to analyze throughput, incorporating hardware constraints such as synaptic memory, communication and I/O bandwidth of crossbars. Our design-flow integrates CARLsim, a GPU-accelerated application-level SNN simulator with SDF3, a tool for mapping SDFG on hardware. We conducted experiments with realistic and synthetic SNNs on representative neuromorphic hardware, demonstrating throughput-resource trade-offs for a given application performance. For throughput-constrained applications, we show average 20% reduction of hardware usage with 19% reduction in energy consumption. For throughput-scalable applications, we show an average 53% higher throughput compared to a state-of-the-art approach.","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121446312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 7: Machine Learning and HW Accelerators","authors":"Fatemeh Tehranipoor","doi":"10.1145/3252913","DOIUrl":"https://doi.org/10.1145/3252913","url":null,"abstract":"","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130250538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy and Performance Efficient Computation Offloading for Deep Neural Networks in a Mobile Cloud Computing Environment","authors":"Amir Erfan Eshratifar, Massoud Pedram","doi":"10.1145/3194554.3194565","DOIUrl":"https://doi.org/10.1145/3194554.3194565","url":null,"abstract":"In today's computing technology scene, mobile devices are considered to be computationally weak, while large cloud servers are capable of handling expensive workloads, therefore, intensive computing tasks are typically offloaded to the cloud. Recent advances in learning techniques have enabled Deep Neural Networks (DNNs) to be deployed in a wide range of applications. Commercial speech based intelligent personal assistants (IPA) like Apple's Siri, which employs DNN as its recognition model, operate solely over the cloud. The cloud-only approach may require a large amount of data transfer between the cloud and the mobile device. The mobile-only approach may lack performance efficiency. In addition, the cloud server may be slow at times due to the congestion and limited subscription and mobile devices may have battery usage constraints. In this paper, we investigate the efficiency of offloading only some parts of the computations in DNNs to the cloud. We have formulated an optimal computation offloading framework for forward propagation in DNNs, which adapts to battery usage constraints on the mobile side and limited available resources on the cloud. Our simulation results show that our framework can achieve 1.42x on average and up to 3.07x speedup in the execution time on the mobile device. In addition, it results in 2.11x on average and up to 4.26x reduction in mobile energy consumption.","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132813910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy Consumption and Lifetime Improvement of Coarse-Grained Reconfigurable Architectures Targeting Low-Power Error-Tolerant Applications","authors":"H. Afzali-Kusha, O. Akbari, M. Kamal, M. Pedram","doi":"10.1145/3194554.3194631","DOIUrl":"https://doi.org/10.1145/3194554.3194631","url":null,"abstract":"In this work, the application of a voltage over-scaling (VOS) technique for improving the lifetime and reliability of coarse-grained reconfigurable architectures (GCRAs) is presented. The proposed technique, which may be applied to CGRAs used as accelerators for low-power, error-tolerant applications, reduces the (strongly voltage-dependent) wearout effects and the energy consumption of processing elements (PEs) whenever the error impact on the output quality degradation can be tolerated. This provides us with the ability to lessen the wearout and reduce energy consumption of PEs when accuracy requirement for the results is rather low. Multiple degrees of computational accuracy can be achieved by using different overscaled voltage levels for the PEs. The efficacy of the proposed technique is studied by considering the bias temperature instability. The study is performed for two error-resilient applications. The CGRAs are implemented with 15nm FinFET operating at a nominal supply voltage of 0.8V. In addition, supply voltages of 0.75, 0.7, 0.65, and 0.6V are considered as overscaled voltage levels for this technology. Based on the quality constraint requirements of the benchmarks, optimum overscaled voltage levels for various PEs are determined and utilized. The approach may provide considerable lifetime and energy consumption improvements over those of the conventional exact and approximate computation approaches.","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"1984 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114089276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MC3A","authors":"Lahir Marni, M. Hosseini, T. Mohsenin","doi":"10.1145/3194554.3194577","DOIUrl":"https://doi.org/10.1145/3194554.3194577","url":null,"abstract":"The paper presents \"MC3A\"- Markov Chain Monte Carlo Many Core Accelerator, a high-throughput, domain-specific, programmable manycore accelerator, which effectively generates samples from a provided target distribution. MCMC samplers are used in machine learning, image and signal processing applications that are computationally intensive. In such scenarios, high-throughput samplers are of paramount importance. To achieve a high-throughput platform, we add two domain-specific instructions with dedicated hardware whose functions are extensively used in MCMC algorithms. These instructions bring down the number of clock cycles needed to implement the respective functions by 10x and 21x. A 64-cluster architecture of the MC3A is fully placed and routed in 65 nm, TSMC CMOS technology, where the VLSI layout of each cluster occupies an area of 0.577 mm^2 while consuming a power of 247 mW running at 1 GHz clock frequency. Our proposed MC3A achieves 6x higher throughput than its equivalent predecessor (PENC) and consumes 4x lower energy per sample. Also, when compared to other off-the-shelf platforms, such as Jetson TX1 and TX2 SoC, MC3A results in 195x and 191x higher throughput and consumes 808x and 726x lower energy per sample generation, respectively.","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115834744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Hardware Accelerator for Mini-Batch Gradient Descent","authors":"Sandeep Rasoori, V. Akella","doi":"10.1145/3194554.3194559","DOIUrl":"https://doi.org/10.1145/3194554.3194559","url":null,"abstract":"Iterative first-order methods that use gradient information form the core computation kernels of modern statistical data analytic engines, such as MADLib, Impala, Google Brain, GraphLab, MLlib in Spark, among others. Even the most advanced parallel stochastic gradient descent algorithm, such as Hogwild is not very scalable on conventional chip multiprocessors because of the bottlenecks induced by the memory system when sharing large model vectors. We propose a scalable architecture for large scale parallel gradient descent on a Field Programmable Gate Array (FPGA) by taking advantage of the large amount of embedded memory in modern FPGAs. We propose a novel data layout mechanism that eliminates the need for expensive synchronization and locking of shared data, which makes the architecture scalable. A 32-PE system on the Stratix V FPGA shows about 5x improvement in performance compared to state-of-the-art implementation on a 14 core/28 thread Intel Xeon CPU with 64 GB memory and operating at 2.6 GHz.","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"2015 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121006685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FLexiTASK","authors":"Joel Mandebi Mbongue, Danielle Tchuinkou Kwadjo, C. Bobda","doi":"10.1145/3194554.3194644","DOIUrl":"https://doi.org/10.1145/3194554.3194644","url":null,"abstract":"One of the major obstacles to the adoption of FPGAs in high-performance computing is their programmability. It requires hardware design skills and long compilation times. Overlays have been proposed as a way to abstract FPGA resources. Unfortunately, most of the time, the topologies they use to connect computing cores impose restrictions on where tasks are placed and how they communicate. In this paper, we propose an overlay architecture designed for efficiency and flexibility. It features a novel Network-on-Chip (NoC) infrastructure making flexible, with no limitation, the placement of hardware tasks. The presented architecture allows tasks to communicate with a low latency and eases the reconfiguration of desired areas on the fabric at runtime. After prototyping the proposed architecture on an Altera Cyclone V FPGA, a maximum frequency of 282 MHz has been reached and a speedup ranging from 4x to 195x has been observed in some applications compared to the native execution.","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121303501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Energy Architectures of Linear Classifiers for IoT Applications using Incremental Precision and Multi-Level Classification","authors":"Sandhya Koteshwara, K. Parhi","doi":"10.1145/3194554.3194603","DOIUrl":"https://doi.org/10.1145/3194554.3194603","url":null,"abstract":"This paper presents a novel incremental-precision classification approach that leads to a reduction in energy consumption of linear classifiers for IoT applications. Features are first input to a low-precision classifier. If the classifier successfully classifies the sample, then the process terminates. Otherwise, the classification performance is incrementally improved by using a classifier of higher precision. This process is repeated until the classification is complete. The argument is that many samples can be classified using the low-precision classifier, leading to a reduction in energy. To achieve incremental-precision, a novel data-path decomposition is proposed to design of fixed-width adders and multipliers. These components improve the precision without recalculating the outputs, thus reducing energy. Using a linear classification example, it is shown that the proposed incremental-precision based multi-level classifier approach can reduce energy by about 41% while achieving comparable accuracies as that of a full-precision system.","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"975 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116214655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}