{"title":"A Novel Method for Hardware Acceleration of Convex Hull Algorithm on Reconfigurable Hardware","authors":"Kris Min, Brenda Ly, Joshua Garner, Shahnam Mirzaei","doi":"10.1109/socc49529.2020.9524805","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524805","url":null,"abstract":"This paper presents a novel high-speed implementation of Andrew's monotone chain convex hull software algorithm on an FPGA. A convex hull, in its simplest form, is the smallest convex polygon that contains a set of discrete points, and it has many applications in engineering, mathematics, and science. In its best case, the convex hull algorithm has linear time complexity, assuming the data points are sorted. Our implementation targets the Zynq system-on-chip platform. We accelerate the software algorithm by designing components that work in parallel, using burst transfers, dynamic branch prediction, and resource sharing. Our approach achieves a speedup of 2.18 for 4 levels of parallelism at a 100 MHz clock; higher speedups can be attained by increasing the level of parallelism. To the best of our knowledge, our proposed method is the only available hardware-accelerated implementation that truly optimizes the hull-processing datapath.
This is in contrast with other competitive software accelerations, which reduce the number of data points to be processed through additional preprocessing steps or increase the speedup by using a high-speed interface.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125106926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
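The monotone chain algorithm accelerated above is well documented; a minimal pure-Python reference version (the textbook algorithm, not the paper's FPGA implementation) looks like this:

```python
def cross(o, a, b):
    # z-component of (a - o) x (b - o); > 0 means a counter-clockwise turn
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain: O(n log n), or O(n) if points are pre-sorted.

    Returns hull vertices in counter-clockwise order, starting from the
    lexicographically smallest point.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:                       # build lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()                 # drop points making a non-left turn
        lower.append(p)
    upper = []
    for p in reversed(pts):             # build upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # last point of each half is the first point of the other half
    return lower[:-1] + upper[:-1]
```

The two stack-based passes are independent once the points are sorted, which is the data-parallelism the hardware design exploits.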
{"title":"Processing-in-Memory Accelerator for Dynamic Neural Network with Run-Time Tuning of Accuracy, Power and Latency","authors":"Li Yang, Zhezhi He, Shaahin Angizi, Deliang Fan","doi":"10.1109/socc49529.2020.9524770","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524770","url":null,"abstract":"With the wide deployment of powerful deep neural networks (DNNs) into smart but resource-limited IoT devices, many prior works have proposed to compress DNNs in a hardware-aware manner to reduce computing complexity while maintaining accuracy, using techniques such as weight quantization, pruning, and convolution decomposition. In typical DNN compression methods, however, a smaller but fixed network structure is generated from a relatively large background model for deployment on a resource-limited hardware accelerator. Such optimization lacks the ability to tune the structure on-the-fly to best fit a dynamic allocation of computing hardware resources and workloads. In this paper, we mainly review two of our prior works [1], [2] that address this issue, discussing how to construct a dynamic DNN structure through either uniform or non-uniform channel-selection-based sub-network sampling. The constructed dynamic DNN can tune its computing path to involve different numbers of channels, thus providing the ability to trade off speed, power, and accuracy on-the-fly after model deployment.
Correspondingly, an emerging Spin-Orbit Torque Magnetic Random-Access-Memory (SOT-MRAM) based Processing-In-Memory (PIM) accelerator is also discussed for such a dynamic neural network structure.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127705155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
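As an illustration of uniform channel selection, the sketch below samples a sub-network by keeping only the leading fraction of filters in each layer, shrinking the next layer's input channels to match. The function names and the NumPy weight layout `(out_ch, in_ch, k, k)` are hypothetical, not the authors' code:

```python
import numpy as np

def sample_subnetwork(weights, width_ratio):
    """Uniform channel selection: keep the first `width_ratio` fraction of
    output channels in every layer (illustrative sketch only).

    weights: list of conv kernels, each shaped (out_ch, in_ch, k, k).
    """
    sub = []
    prev = weights[0].shape[1]            # input channels of the first layer stay fixed
    for w in weights:
        out_ch = max(1, int(round(w.shape[0] * width_ratio)))
        sub.append(w[:out_ch, :prev])     # slice both filter and input-channel dims
        prev = out_ch                     # next layer consumes the reduced channels
    return sub
```

Because a narrower path touches proportionally fewer weights and MACs, the same stored model can run at several speed/power/accuracy points after deployment.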
{"title":"Dynamic Supply and Threshold Voltage Scaling towards Runtime Energy Optimization over a Wide Operating Performance Region","authors":"Shoya Sonoda, Jun Shiomi, H. Onodera","doi":"10.1109/socc49529.2020.9524767","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524767","url":null,"abstract":"This paper proposes a runtime voltage-scaling method that optimizes the supply voltage (Vdd) and the threshold voltage (Vth) under a given delay constraint. This paper refers to the optimal voltage pair as the Minimum Energy Point (MEP), and first proposes a closed-form continuous function that determines the MEP over a wide operating performance region, ranging from the above-threshold region down to the subthreshold region. The MEP fluctuates dynamically depending on the operating condition, which is determined by the delay constraint, the activity factor, and the circuit temperature. To track the MEP, this paper then proposes a voltage-scaling technique that, based on the proposed function, sets Vdd and Vth near the MEP without iteratively tuning the voltages. Existing MEP-tracking techniques tune Vdd iteratively, which may not be suitable in terms of (1) the hardware design cost of generating a number of Vdd levels and (2) the MEP tracking time.
Measurement results from a 32-bit RISC processor fabricated in a 65-nm process technology show that the proposed method estimates the MEP within a 5% energy error of the actual MEP operation.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127602342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
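The trade-off behind the MEP can be illustrated with a textbook energy model: dynamic energy grows with Vdd squared, while leakage energy grows as Vth drops and the operation takes longer. The sketch below brute-forces the minimum over a (Vdd, Vth) grid, i.e., exactly the kind of iterative search the paper's closed-form function avoids. All constants and the alpha-power delay model are purely illustrative, not the paper's formulation:

```python
import math

# Illustrative constants (NOT from the paper): alpha-power exponent,
# subthreshold-slope-like factor, capacitance scale, leakage scale.
ALPHA, S, A_CAP, I0 = 1.3, 0.1, 1.0, 1.0

def delay(vdd, vth):
    # alpha-power-law gate delay, valid only for vdd > vth
    return vdd / (vdd - vth) ** ALPHA

def energy(vdd, vth, activity=0.1):
    e_dyn = activity * A_CAP * vdd ** 2                      # switching energy
    e_leak = I0 * math.exp(-vth / S) * vdd * delay(vdd, vth)  # leakage over the op
    return e_dyn + e_leak

def minimum_energy_point(delay_limit):
    """Grid-search the (Vdd, Vth) pair minimizing energy under a delay cap."""
    best = None
    for vdd in (0.2 + 0.01 * i for i in range(101)):
        for vth in (0.05 + 0.01 * j for j in range(46)):
            if vth >= vdd or delay(vdd, vth) > delay_limit:
                continue
            e = energy(vdd, vth)
            if best is None or e < best[0]:
                best = (e, vdd, vth)
    return best  # (energy, vdd, vth)
```

Relaxing the delay constraint shifts the minimum toward lower Vdd and higher Vth, which is why the MEP moves with workload and temperature.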
{"title":"FABLE-DTS: Hardware-Software Co-Design of a Fast and Stable Data Transmission System for FPGAs","authors":"Jiabao Gao, Jian Wang, Md Tanvir Arafin, Jinmei Lai","doi":"10.1109/socc49529.2020.9524764","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524764","url":null,"abstract":"Developers often need collaborative execution between processors and field-programmable gate arrays (FPGAs) to meet resource-intensive computation requirements. The performance of these collaborative executions depends heavily on the efficiency of simultaneous data movement between multiple FPGAs. Hence, this paper presents a fast and stable data transmission system (FABLE-DTS) to address high-speed data-transfer issues in a multi-FPGA environment. First, we design a Dynamic Phase-shift Data Transmission (DPSDTM) hardware module along with a non-linear phase-shift method to obtain optimal interface timing between two FPGAs. Then, we develop the software framework for processor-FPGA collaboration, called DPSDTM-Linux, which implements memory buffer allocation and DPSDTM management. After that, a bus bridge module (BBM) is devised to ensure the compatibility of the proposed DTS with different bus types (i.e., AXI and PLB). Finally, we evaluate the system on a custom IC-testing platform consisting of a ZYNQ-7000 SoC and a Virtex-4 FPGA. We find that the proposed FABLE-DTS provides accurate results in transmission tests and FPGA resource tests, demonstrating the stability of the system under intensive computational tasks.
Additionally, the proposed design is the fastest FPGA-processor DTS reported to date, supporting up to 368.80 MB/s transmission at the clock frequency of the double-data-rate (DDR) interface (i.e., 200 MHz).","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133034141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Inverter-based On-chip Voltage Reference Generator for Low Power Application","authors":"Yuchen Zhao, Z. Zou, Lirong Zheng","doi":"10.1109/socc49529.2020.9524793","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524793","url":null,"abstract":"This paper presents an on-chip voltage reference generator for low-power applications. The circuit consists of an array of inverters, a switched capacitor, and a notch filter. The proposed circuit avoids using any bipolar junction transistors (BJTs) and contains only MOSFETs and capacitors. The inverter is utilized as the core circuit to generate the reference voltage. Variations in temperature and process are suppressed by the switched capacitor, and the notch filter is adopted for ripple reduction. This work is designed and simulated in a 0.18 µm CMOS process. Due to its mostly-digital architecture, the circuit can operate at a sub-1 V supply and generate a reference voltage of 0.504 V with a temperature coefficient of 20 ppm/°C. The power consumption is 83 nW.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"409 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132034078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Metal Inter-Layer Via Utilization Strategies for Three-dimensional Integrated Circuits","authors":"Umamaheswara Rao Tida, M. Vemuri","doi":"10.1109/socc49529.2020.9524756","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524756","url":null,"abstract":"Three-dimensional integrated circuit (3D-IC) technology has gained prominence for future integrated chips (ICs) due to increased transistor density at the same technology node. Conventional 3D-IC implementation involves die stacking with vertical interconnects realized by through-silicon vias (TSVs). One of the main challenges of 3D-IC technology is TSV size: TSVs are large (100-400x larger than standard cells in 45 nm technology) and their diameters do not scale with technology. In addition, many dummy TSVs are inserted to satisfy the minimum-density rules set by foundries, which further increases the overhead. Also, small-form-factor implementations of on-chip devices, especially inductors, are required for heterogeneous integration. In this paper, we discuss utilizing TSVs to form on-chip inductors for various applications. On the other hand, monolithic three-dimensional integrated circuit (M3D-IC) technology is enabled by sequential integration of substrate layers, with devices at different layers connected by metal inter-layer vias (MIVs); an MIV passes through the silicon but is very small compared with a TSV in 3D-ICs. The effective substrate area occupied by MIVs grows with their number. Therefore, in this paper, we also discuss efficient strategies to reduce the silicon footprint overhead of MIVs by reconfiguring the silicon around each MIV to form MIV-capacitor and MIV-transistor devices.
TCAD simulations at a 14 nm channel length demonstrate that the proposed approach reduces the silicon area of an inverter by about 24% compared with the conventional approach for transistor-level M3D-IC technology.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116217614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Inference of Large-Scale and Lightweight Convolutional Neural Networks on FPGA","authors":"Xiao Wu, Yufei Ma, Zhongfeng Wang","doi":"10.1109/socc49529.2020.9524773","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524773","url":null,"abstract":"Convolutional neural networks (CNNs) have achieved significant accuracy improvements in many intelligent applications, at the cost of intensive convolution operations and massive data movement. To deploy CNNs efficiently on low-power embedded platforms in real time, the depthwise separable convolution has been proposed to replace the standard convolution, especially in lightweight CNNs, remarkably reducing computation complexity and model size. However, it is difficult for a general convolution engine to obtain the theoretical performance improvement, as the decreased data dependency of depthwise convolution significantly reduces the data-reuse opportunity. To address this issue, a flexible and high-performance accelerator based on FPGA is proposed to efficiently process the inference of both large-scale and lightweight CNNs. Firstly, by sharing the activation dataflow between the depthwise convolution and pooling layers, the control logic and data bus of the two layers are reused to maximize data utilization and minimize logic overhead. Secondly, these two layers can be processed either directly after standard convolutions, to eliminate external memory accesses, or independently, to gain better flexibility. Thirdly, a performance model is proposed to automatically explore the optimal design options of the accelerator.
The proposed hardware accelerator is evaluated on an Intel Arria 10 SoC FPGA and demonstrates state-of-the-art performance on both large-scale CNNs, e.g., VGG, and lightweight ones, e.g., MobileNet.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125060002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
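The compute saving that motivates depthwise separable convolution is easy to quantify: relative to a standard convolution it needs a fraction 1/Cout + 1/k² of the MACs, since the k x k depthwise stage touches each input channel once and the 1x1 pointwise stage mixes channels. A small sketch (helper names are illustrative, not the paper's code):

```python
def standard_conv_macs(cin, cout, k, h, w):
    # every output pixel of every output channel sums over cin * k * k inputs
    return cin * cout * k * k * h * w

def depthwise_separable_macs(cin, cout, k, h, w):
    depthwise = cin * k * k * h * w   # one k x k filter per input channel
    pointwise = cin * cout * h * w    # 1x1 convolution mixes channels
    return depthwise + pointwise
```

For a typical layer (cin=32, cout=64, k=3), the ratio is 1/64 + 1/9, roughly an 8x reduction, which is exactly why the depthwise stage leaves a general convolution engine with far less data reuse per fetched activation.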
{"title":"Hybrid Stochastic Computing Circuits in Continuous Statistics Domain","authors":"Renyuan Zhang, Tati Erlina, T. Nguyen, Y. Nakashima","doi":"10.1109/socc49529.2020.9524786","DOIUrl":"https://doi.org/10.1109/socc49529.2020.9524786","url":null,"abstract":"A hybrid scheme of stochastic computing (SC) is explored by representing and processing stochastic numbers (SNs) in multiple domains of a continuous statistics space. On the basis of the neuron-MOS mechanism, pulses with arbitrary duty cycles and various frequencies are efficiently generated. By interfering the pulses with multiple keys, such as level and frequency, the SNs are observed in a continuous domain instead of the conventional long discrete bit-streams. Employing this stochastic representation, all three typical SC fashions, namely straight multiplication/summation, Bernstein polynomial expansion, and the finite state machine (FSM), are retrieved by the proposed hybrid schemes. For multiply-accumulate operations (MACs), the combination of pulse strength and duty cycle performs the multiplication; the entanglement among the various combinations above performs the accumulation; and the integral within a specific time window efficiently yields the scale-free MAC result. For retrieving arbitrary functions in SC, the frequency-interfering mechanism and a novel multi-valued logic (MVL) multiplexer are employed to implement Bernstein polynomials with an ultra-compact VLSI circuit. Moreover, a continuous Markov chain is simply implemented by SN switching and a membrane capacitor, realizing a special continuous state machine (CSM) that offers the SC sigmoid function with post-silicon scalability. Circuit simulation results show that the transistor counts of the proposed hybrid SC circuits are reduced to 6.1%, 2.7%, and 8.3% of state-of-the-art works for MAC, Bernstein polynomial, and FSM, respectively.
Meanwhile, performance in accuracy, speed, and power consumption is similar or superior to the state of the art.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115034220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
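For context, conventional discrete-bit-stream SC, the baseline these hybrid circuits compress, multiplies two unipolar stochastic numbers with a single AND gate on independent streams: P(a AND b) = P(a) * P(b). A minimal software sketch of that classic scheme (illustrative only, unrelated to the proposed continuous-domain circuits):

```python
import random

def to_stream(p, n, rng):
    # unipolar SN: each bit is 1 with probability p, so the stream encodes p
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(sa, sb):
    # bitwise AND of independent streams encodes the product of probabilities
    return [a & b for a, b in zip(sa, sb)]

def value(stream):
    # decode an SN back to a probability: fraction of 1s in the stream
    return sum(stream) / len(stream)
```

The precision of this scheme grows only as the square root of the stream length, which is why long bit-streams are needed and why continuous-domain representations can be so much cheaper.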