qBSA: Logic Design of a 32-bit Block-Skewed RSFQ Arithmetic Logic Unit

2019 IEEE International Superconductive Electronics Conference (ISEC). Pub Date: 2019-07-01. DOI: 10.1109/ISEC46533.2019.8990921

Souvik Kundu, G. Datta, P. Beerel, M. Pedram


Citations: 6

Abstract

Single flux quantum (SFQ) circuits are an attractive beyond-CMOS technology because they promise two orders of magnitude lower power at clock frequencies exceeding 25 GHz. However, every SFQ gate is clocked, creating very deep gate-level pipelines that are difficult to keep full, particularly for sequences that include data-dependent operations. This paper proposes to increase the throughput of SFQ pipelines by redesigning the datapath to accept and operate on the least-significant bits (LSBs) clock cycles earlier than the more significant bits. This skewed datapath approach reduces the latency of the LSB side, whose results can be fed back earlier for use in subsequent data-dependent operations, increasing their throughput. In particular, we propose to group the bits into 4-bit blocks that are operated on concurrently and to create block-skewed datapath units for 32-bit operation. This skewed approach allows a subsequent data-dependent operation to start evaluating as soon as the first 4-bit block completes. Using this general approach, we develop a block-skewed MIPS-compatible 32-bit ALU. Our gate-level Verilog design improves the throughput of 32-bit data-dependent operations by 2x and 1.5x compared to previously proposed 4-bit bit-slice and 32-bit Ladner-Fischer ALUs, respectively. We have quantified the benefit of this design on instructions per cycle (IPC) for various RISC-V benchmarks, assuming a range of non-ALU operation latencies from one to ten cycles. Averaged across benchmarks, our experimental results show that, compared to the 32-bit Ladner-Fischer ALU, our proposed architecture provides IPC improvements ranging from 1.37x, assuming one-cycle non-ALU latency, to 1.2x, assuming ten-cycle non-ALU latency. Moreover, our average IPC improvements compared to a 32-bit ALU based on the 4-bit bit-slice range from 2.93x to 4x.
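The block-skewing idea in the abstract can be illustrated with a small behavioral sketch. This is not the authors' gate-level Verilog design: it is a toy Python model that assumes each 4-bit block takes exactly one step and that block i of an operation enters the datapath one step after block i-1. The names `split_blocks`, `join_blocks`, `skewed_add`, and `dependent_chain` are hypothetical, and the step counts it produces come from this simplified timing assumption, not from the paper's measurements.

```python
BLOCK_BITS = 4          # block width, as stated in the abstract
NUM_BLOCKS = 8          # 8 blocks x 4 bits = 32 bits
MASK = (1 << BLOCK_BITS) - 1


def split_blocks(x):
    """Split a 32-bit value into eight 4-bit blocks, least-significant block first."""
    return [(x >> (BLOCK_BITS * i)) & MASK for i in range(NUM_BLOCKS)]


def join_blocks(blocks):
    """Reassemble a 32-bit value from 4-bit blocks (least-significant block first)."""
    value = 0
    for i, b in enumerate(blocks):
        value |= (b & MASK) << (BLOCK_BITS * i)
    return value


def skewed_add(a_blocks, b_blocks, start_step=0):
    """Add two operands block by block, rippling the carry between blocks.

    Assumed timing (illustrative only): block i is consumed at step
    start_step + i and its result is ready one step later.
    Returns (sum_blocks, ready_steps); the final carry-out is dropped.
    """
    carry = 0
    out, ready = [], []
    for i, (a, b) in enumerate(zip(a_blocks, b_blocks)):
        total = a + b + carry
        out.append(total & MASK)
        carry = total >> BLOCK_BITS
        ready.append(start_step + i + 1)
    return out, ready


def dependent_chain(values, skewed=True):
    """Accumulate values with back-to-back data-dependent adds.

    With skewed=True the next add starts as soon as the first block of the
    previous result is ready; otherwise it waits for the last block.
    Returns (result, step at which the final block is ready).
    """
    acc = split_blocks(values[0])
    ready = [0] * NUM_BLOCKS
    for v in values[1:]:
        start = ready[0] if skewed else ready[-1]
        acc, ready = skewed_add(acc, split_blocks(v), start)
    return join_blocks(acc), ready[-1]


if __name__ == "__main__":
    print(dependent_chain([1, 2, 3], skewed=True))
    print(dependent_chain([1, 2, 3], skewed=False))
```

Under these assumptions, a chain of dependent additions overlaps in time because each operation only needs the previous operation's block i by the time it reaches its own block i. The non-skewed variant, which waits for the full 32-bit result, serializes the operations completely.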


