An Overflow-free Quantized Memory Hierarchy in General-purpose Processors

Marzieh Lenjani, Patricia González, Elaheh Sadredini, M Arif Rahman, M. Stan
2019 IEEE International Symposium on Workload Characterization (IISWC), November 2019
DOI: 10.1109/IISWC47752.2019.9042035
Citations: 4

Abstract

Data movement comprises a significant portion of energy consumption and execution time in modern applications. Accelerator designers exploit quantization to reduce the bitwidth of values and thus the cost of data movement. However, any value that does not fit in the reduced bitwidth causes an overflow (we refer to these values as outliers). Therefore, accelerators use quantization only for applications that are tolerant of overflows. We observe that in most applications the rate of outliers is low and values often fall within a narrow range, providing an opportunity to exploit quantization in general-purpose processors. However, a software implementation of quantization in general-purpose processors has three problems. First, the programmer has to manually implement conversions and the additional instructions that quantize and dequantize values, imposing programmer effort and performance overhead. Second, to cover outliers, the bitwidth of the quantized values often becomes greater than or equal to that of the original values. Third, the programmer has to use standard bitwidths; extracting non-standard bitwidths (i.e., 1–7, 9–15, and 17–31 bits) to represent narrow integers exacerbates the overhead of software-based quantization. The key idea of this paper is hardware support for quantization in the memory hierarchy of general-purpose processors, which represents values with a small and flexible number of bits and stores outliers in their original format in a separate space, preventing any overflow. We minimize metadata and the overhead of locating quantized values through a software-hardware interaction that transfers quantization parameters and data layout to hardware. As a result, our approach has three advantages over cache compression techniques: (i) less metadata, (ii) a higher compression ratio for floating-point values and for cache blocks with multiple data types, and (iii) lower overhead for locating compressed blocks.
It delivers on average 1.40×/1.45×/1.56× speedup and 24/26/30% energy reduction compared to a baseline that uses full-length variables in a 4/8/16-core system. Our approach also provides a 1.23× speedup in a 4-core system over state-of-the-art cache compression techniques, while adding only 0.25% area overhead to the baseline processor.
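To make the core idea concrete, here is a minimal software sketch of outlier-aware quantization. This is an illustration of the general technique described in the abstract, not the paper's hardware design: the function names, placeholder encoding, and dictionary-based outlier table are assumptions for exposition. In-range values are kept as narrow k-bit codes, while any value that would overflow is stored full-width in a separate outlier table, so the round trip is lossless.

```python
# Illustrative sketch of outlier-aware quantization (hypothetical API,
# not the paper's hardware mechanism). Values that fit in `bits` bits
# are kept as narrow codes; overflowing values ("outliers") are stored
# full-width in a side table, preventing any loss from overflow.

def quantize_block(values, bits):
    limit = 1 << bits          # max representable value for an unsigned k-bit code
    packed = []                # narrow codes for in-range values
    outliers = {}              # index -> original full-width value
    for i, v in enumerate(values):
        if 0 <= v < limit:
            packed.append(v)
        else:
            packed.append(0)   # placeholder in the quantized stream
            outliers[i] = v    # keep the original in separate storage
    return packed, outliers

def dequantize_block(packed, outliers):
    # Outlier entries override the placeholder codes on readback.
    return [outliers.get(i, q) for i, q in enumerate(packed)]

vals = [3, 7, 1, 500, 2]       # 500 overflows a 4-bit representation
packed, out = quantize_block(vals, 4)
assert dequantize_block(packed, out) == vals
```

With a low outlier rate, most values occupy only `bits` bits plus a small amount of metadata, which is the property the paper exploits; the hardware version additionally handles flexible bitwidths and block location without per-access software overhead.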