Jean. C. Scheunemann, Marlon S. Sigales, M. Fonseca, E. D. da Costa
{"title":"Optimizing Encoder and Decoder Blocks for a Power-Efficient Radix-4 Modified Booth Multiplier","authors":"Jean. C. Scheunemann, Marlon S. Sigales, M. Fonseca, E. D. da Costa","doi":"10.1109/SBCCI53441.2021.9529975","DOIUrl":"https://doi.org/10.1109/SBCCI53441.2021.9529975","url":null,"abstract":"The conventional modified Booth multiplier comprises an encoder and a decoder, which produce partial products by adjusting the multiplicand according to 3-bit windows generated in the segmentation of the multiplier. The partial products are added later with the necessary left shifts in each one. They produce accurate results, more useful in processes where the error is not acceptable, like divisions, transformations and filters. This paper proposes optimizing the radix-4 Modified Booth encoder (MBE) and decoder (MBD) for a power-efficient multiplier. Two new topologies are presented for optimizing the encoder and decoder set so that the partial product terms avoid unnecessary operations. The proposed encoder and decoder structures are highly regular, with few logic gates, and easily parallelized for any number of input bits. The results show that the proposed optimizations reveal gains in area and power compared to the conventional multiplexer-based multiplier. 
When applying the proposed multiplier in the butterflies of the Fast Fourier Transform, the proposed multipliers are more efficient with gains in the area, power, and power-delay-product (PDP) compared with the multipliers using the ‘*’ operator from the literature.","PeriodicalId":270661,"journal":{"name":"2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126865916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
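The abstract above describes radix-4 modified Booth multiplication: the multiplier is segmented into overlapping 3-bit windows, each window selects a multiple of the multiplicand, and the shifted partial products are summed. A minimal behavioral sketch of that general scheme follows. It is not the paper's optimized encoder/decoder topology, and the function names and the `bits` parameter are hypothetical, chosen only for illustration.

```python
def booth_radix4_digits(multiplier: int, bits: int) -> list[int]:
    """Recode a two's-complement multiplier into radix-4 Booth digits in {-2..2}."""
    if bits % 2:
        raise ValueError("bits must be even for radix-4 recoding")
    y = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    padded = y << 1                     # append the implicit bit y[-1] = 0
    digits = []
    for i in range(0, bits, 2):
        window = (padded >> i) & 0b111  # bits y[i+1], y[i], y[i-1]
        b_hi = (window >> 2) & 1
        b_mid = (window >> 1) & 1
        b_lo = window & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits


def booth_multiply(multiplicand: int, multiplier: int, bits: int = 8) -> int:
    """Sum the partial products, each shifted left by two bits per window."""
    digits = booth_radix4_digits(multiplier, bits)
    return sum((d * multiplicand) << (2 * k) for k, d in enumerate(digits))


if __name__ == "__main__":
    print(booth_radix4_digits(-3, 8))  # [1, -1, 0, 0]
    print(booth_multiply(7, -3, 8))    # -21
```

Each digit stands for one partial product (0, ±1 or ±2 times the multiplicand), so an n-bit multiplier yields n/2 partial products instead of n. The multiplier must fit in `bits` bits as a signed two's-complement value.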
Luiz Neto, M. Corrêa, Bruno Zatt, Daniel Palomino, L. Agostini, G. Corrêa
{"title":"Configurable Power/Quality-Aware Hardware Design for the AV1 Directional Intra Frame Prediction","authors":"Luiz Neto, M. Corrêa, Bruno Zatt, Daniel Palomino, L. Agostini, G. Corrêa","doi":"10.1109/SBCCI53441.2021.9529997","DOIUrl":"https://doi.org/10.1109/SBCCI53441.2021.9529997","url":null,"abstract":"AOMedia Video 1 (AV1) is a royalty-free video format launched in 2018 that includes several tools to achieve high coding efficiency, but also presents a high computational cost. Because of this, hardware solutions are a good alternative to speed up the encoding process and enable real-time processing. This work presents a configurable hardware design for the AV1 directional intra-frame prediction. Two operation modes were defined on a software-based analysis and implemented in the architecture: the High Quality mode, which offers high coding efficiency, and the Low Power mode, which offers a power dissipation 84.4% lower in comparison to the first mode at the cost of a compression efficiency loss of 2.48%. The proposed architecture is capable of operating in real-time for video resolutions of up to 4K@60fps. 
When synthesized for the TSMC 40nm technology and operating at a target frequency of 1,902 MHz for 4K@60fps videos, the High Quality and the Low Power modes presented a power dissipation of 631.76 mW and 82.76 mW, respectively.","PeriodicalId":270661,"journal":{"name":"2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI)","volume":"579 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132363588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling wave propagation using cellular automata on Chip","authors":"H. Moura, D. Muñoz","doi":"10.1109/SBCCI53441.2021.9529978","DOIUrl":"https://doi.org/10.1109/SBCCI53441.2021.9529978","url":null,"abstract":"Cellular automata (CA) are often used for physical modeling in which space and time are discrete by assumption. An artificial organism can be considered as a circuit that controls a high-level integrated CA architecture, developed for some mathematical purpose. In this work, wave propagation and reflection phenomena have been reproduced using CA systems in a one-dimensional lossless medium. The main objective here was to develop a modeling tool, called vCAgen, which provides the VHDL code for the CA-based system implementation. The CA-based system called CamphslD, composed of 185 cells and two receiver channels, was mapped on a Xilinx Zynq Ultrascale 7EV1156 device, using a 27-bit floating-point arithmetic representation. The proposed CamphslD was effectively mapped on the Zynq device, achieving an operational frequency of 100 MHz and a throughput of 25 MOPS. 
Numerical comparisons allow us to conclude that the proposed CA-system, which is based on simple arithmetic operations, achieves the same results as a reference model based on the well-known d'Alembert one-dimensional discrete wave solution, with a Mean Square Error (MSE) on the order of 10⁻¹³.","PeriodicalId":270661,"journal":{"name":"2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132730293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
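The abstract above mentions a cellular automaton that reproduces wave propagation and reflection in a one-dimensional lossless medium using only simple arithmetic. The sketch below illustrates that general idea with the standard discrete wave update at unit Courant number, where each cell's next value is the sum of its two neighbours minus its own previous value. The update rule, the fixed (zero-valued) boundaries and the function names are assumptions made for illustration; the abstract does not give the rule that vCAgen actually generates, and Python floats stand in for the 27-bit floating-point format.

```python
def step(prev: list[float], curr: list[float]) -> list[float]:
    """One CA update: next[i] = curr[i-1] + curr[i+1] - prev[i]."""
    n = len(curr)
    nxt = [0.0] * n  # assumed fixed ends: boundary cells stay at zero
    for i in range(1, n - 1):
        nxt[i] = curr[i - 1] + curr[i + 1] - prev[i]
    return nxt


def simulate(prev: list[float], curr: list[float], steps: int) -> list[float]:
    """Advance the automaton and return the cell states after `steps` updates."""
    prev, curr = list(prev), list(curr)
    for _ in range(steps):
        prev, curr = curr, step(prev, curr)
    return curr


if __name__ == "__main__":
    cells = 185  # cell count taken from the abstract
    prev = [0.0] * cells
    curr = [0.0] * cells
    prev[10] = 1.0  # pulse one step earlier ...
    curr[11] = 1.0  # ... and now: a right-travelling unit pulse
    state = simulate(prev, curr, 100)
    print(state.index(1.0))  # 111
```

With these assumed boundaries the pulse advances one cell per step and returns inverted after reaching the fixed end, which is the reflection behaviour expected of a clamped lossless string.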
F. Gabbay, A. Mendelson, Basel Salameh, Majd Ganaiem
{"title":"Asymmetric Aging Avoidance EDA Tool","authors":"F. Gabbay, A. Mendelson, Basel Salameh, Majd Ganaiem","doi":"10.1109/SBCCI53441.2021.9529984","DOIUrl":"https://doi.org/10.1109/SBCCI53441.2021.9529984","url":null,"abstract":"The latest process technologies have become highly susceptible to asymmetric aging, whereby the timing of logical elements degrades at unequal rates over the element lifetime, causing severe reliability concerns. Although several tools are available to handle asymmetric aging, such tools mainly rely on circuit or physical design approaches and offer a limited capability to handle large-scale ICs. In this paper, we introduce a flow and a tool to minimize the asymmetric aging effect in data path design structures. The proposed tool can be straightforwardly integrated as part of standard design flows of large-scale ICs. In addition, the tool can automatically analyze various designs at RTL or gate-level and identify logical elements which are susceptible to asymmetric aging. As part of the design flow, the tool automatically embeds a special logical circuitry in the design to eliminate asymmetric aging. Our experimental analysis shows that the proposed design flow can minimize the asymmetric aging effect and eliminate reliability concerns while introducing minor power and silicon area overhead.","PeriodicalId":270661,"journal":{"name":"2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124333340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francisco Carlos Silva Junior, I.S. Silva, R. Jacobi
{"title":"Evaluating the Performance, Energy and Area Tradeoffs of ATHENA in Superscalar Processors","authors":"Francisco Carlos Silva Junior, I.S. Silva, R. Jacobi","doi":"10.1109/SBCCI53441.2021.9529979","DOIUrl":"https://doi.org/10.1109/SBCCI53441.2021.9529979","url":null,"abstract":"Coarse-Grained Reconfigurable Architectures (CGRAs) have been widely used as accelerators, providing energy savings and performance improvements while also offering the flexibility to meet different application requirements. Despite these advantages, CGRAs usually consist of many processing elements, which implies an area overhead that can be prohibitive for their integration in systems with hard area constraints, such as embedded systems and mobile devices. To cope with that, this work evaluates a CGRA for systems with hard area constraints called ATHENA (A Thin rEcoNfigurable Architecture). The thinness concept consists of a CGRA that uses considerably fewer processing elements than the CGRAs found in the literature. ATHENA is attached to a superscalar processor and is dynamically mapped. A design space exploration on ATHENA and the superscalar processor is carried out to evaluate the different area, energy and performance tradeoffs that these solutions can deliver. The results show that, even using fewer processing elements, ATHENA achieved speedups of up to 2.43x while saving up to 32% of energy. 
When compared with other dynamically mapped CGRAs of the state of the art, ATHENA is up to 4x smaller and provides competitive performance.","PeriodicalId":270661,"journal":{"name":"2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128442420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ronald Hassib Galvis Chacón, Agnaldo Vieira Dias, Angela Alves dos Santos, P. C. Secheusk, Silvio Manea, J. A. Diniz, S. Finco
{"title":"A Latching Current Limiter with Telemetries for Space Applications","authors":"Ronald Hassib Galvis Chacón, Agnaldo Vieira Dias, Angela Alves dos Santos, P. C. Secheusk, Silvio Manea, J. A. Diniz, S. Finco","doi":"10.1109/SBCCI53441.2021.9529973","DOIUrl":"https://doi.org/10.1109/SBCCI53441.2021.9529973","url":null,"abstract":"In electrical power systems, current limiting circuits are necessary to protect loads and isolate faults to keep the entire network stable. These types of circuits in space applications require integrated solutions to reduce weight, size and improve reliability. This paper presents the design of a latching current limiter (LCL) as a protection device in DC power distribution systems of scientific satellites. The proposed topology of the LCL allows to adjust the limiting current and the trip-off time and eliminates peak current during a short-circuit condition, protecting the payloads. Moreover, the LCL provides a fault indication signal, current and voltage telemetries to check the status of the load. The control loop and telemetry circuits have been developed in 0.6µm CMOS technology. The functionality of the LCL was verified by simulations and by tests on prototyping circuits.","PeriodicalId":270661,"journal":{"name":"2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133405114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving energy efficiency by transparently sharing SIMD Execution Units in Asymmetric Multicores","authors":"Caio Vieira, Antonio Carlos Schneider Beck","doi":"10.1109/SBCCI53441.2021.9529982","DOIUrl":"https://doi.org/10.1109/SBCCI53441.2021.9529982","url":null,"abstract":"Single-ISA Asymmetric multicore architectures (e.g., ARM big.LITTLE) combine high-performance and energy efficiency in the same chip by providing different microarchitectures so the applications can transparently migrate from one to another accordingly. However, in such architectures, the big core features resource-expensive Execution Units (EU) to support ISA extensions, such as SIMD and FP, which may rarely be used depending on the application at hand. These same extensions are supported by the little core but using power-efficient EUs. Given that, in this work, we propose a decoupled offloading mechanism to allow the big core to use such power-efficient EUs in the little core while its own can be power-gated, maintaining the original migration transparency of the architecture. Since applications may have different phases, thus having more or fewer extension instructions usage, we also propose an arbiter to decide when to activate the decoupled offloading at runtime. We evaluate our technique considering ARM NEON as the ISA extension and ARM A7 and A15 as the little and big cores, respectively. 
Our evaluation shows that, on average, our approach provides 15.9% in energy improvements at the cost of 2.2% in time overhead for mibench benchmarks, which represent embedded application workloads; and, on average, 6.4% in energy gains at the cost of 1.1% in time overhead for polybench benchmarks, which have high NEON usage.","PeriodicalId":270661,"journal":{"name":"2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125384655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Versatile Test Set Generation Tool for Structural Analog Circuit Testing","authors":"Lucas B. Zilch, M. Lubaszewski, T. Balen","doi":"10.1109/SBCCI53441.2021.9529987","DOIUrl":"https://doi.org/10.1109/SBCCI53441.2021.9529987","url":null,"abstract":"This work presents a low-cost automatic test generation tool for structural analog testing. With the SPICE netlist and technology models of the circuit to be tested, a fault list (of size F) is generated, considering a defect modeling provided by the user. The tool interacts with a SPICE simulator, simulating the fault-free and F faulty circuits. The test limits used to calculate the fault coverage may be either defined by the user or automatically computed considering the process variability of the fault-free circuit. The test development considers DC, AC (single tone) and transient (step) stimuli applied at the primary circuit inputs, computing the obtained fault coverage when taking different circuit nodes as observation points. The final test set determination relies on a fault dictionary that helps maximize the fault coverage while minimizing the test application time and exposing undetected faults. A case study, consisting of a second-order Butterworth filter built with a 180 nm fully-differential OpAmp, is presented, considering a resistive defect modeling to generate the fault list. 
For this case study, the tool indicates a fault coverage of 96.25% considering four tests (two AC and two DC).","PeriodicalId":270661,"journal":{"name":"2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125640687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Angelo Elias Dalzotto, Marcelo Ruaro, Leonardo Vian Erthal, F. Moraes
{"title":"Management Application - a New Approach to Control Many-Core Systems","authors":"Angelo Elias Dalzotto, Marcelo Ruaro, Leonardo Vian Erthal, F. Moraes","doi":"10.1109/SBCCI53441.2021.9529989","DOIUrl":"https://doi.org/10.1109/SBCCI53441.2021.9529989","url":null,"abstract":"The increasing core count in many-core systems introduced management challenges, including scalability, portability, and reduced overhead in the user's applications. Works available in the literature seek to manage a given objective, such as power, temperature, communication, and quality-of-service. These management strategies are tightly coupled to the hardware platform and the operating system (OS) running on it. This coupling implies the lack of management modularity, resulting in low flexibility related to modifying management strategies at runtime, and low portability. State-of-the-art shows that few works propose management strategies or frameworks, only evaluating the proposed objective's quality. This work aims to present a new approach to control many-core systems, named Management Application (MA), which can implement multiobjective management decoupled from the hardware and the OS through a set of high-priority tasks. MA transforms the management problem into a distributed application, allowing the management to truly benefit from the high parallel power of many-cores. The MA approach is demonstrated with a proof-of-concept framework. 
Results evaluate the cost to adopt MA, compared to the cluster management, and the benefits of adopting MA to manage a benchmark with real-time constraints revealing improved memory footprint and higher management throughput due to its parallelization.","PeriodicalId":270661,"journal":{"name":"2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI)","volume":"1089 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127771968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ilayda Yaman, Allan Andersen, Lucas Ferreira, Joachim Rodrigues
{"title":"FLoPAD-GRU: A Flexible, Low Power, Accelerated DSP for Gated Recurrent Unit Neural Network","authors":"Ilayda Yaman, Allan Andersen, Lucas Ferreira, Joachim Rodrigues","doi":"10.1109/SBCCI53441.2021.9529981","DOIUrl":"https://doi.org/10.1109/SBCCI53441.2021.9529981","url":null,"abstract":"Recurrent neural networks (RNNs) are efficient for classification of sequential data such as speech and audio due to their high precision on tasks. However, power efficiency, the required memory capacity and bandwidth requirements make them less suitable for battery powered devices. In this work, we introduce FLoPAD-GRU: a system on a chip (SoC) for efficient processing of gated recurrent unit (GRU) networks, that consists of a digital signal processor (DSP), supplemented with an optimized hardware accelerator, which reduces memory accesses and cost. The system is programmable and scalable, which allows for execution of different network sizes. Synthesized in 28 nm CMOS technology, real-time classification is achieved at 4 MHz, with an energy dissipation of 4.1 pJ/classification, an improvement of 15× compared to a pure DSP realization. The memory requirements are reduced by 75%, which results in a silicon area of 0.7 mm² for the entire SoC.","PeriodicalId":270661,"journal":{"name":"2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI)","volume":"150 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113988621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}