An Overflow-free Quantized Memory Hierarchy in General-purpose Processors

Marzieh Lenjani, Patricia González, Elaheh Sadredini, M Arif Rahman, M. Stan
2019 IEEE International Symposium on Workload Characterization (IISWC), November 2019
DOI: 10.1109/IISWC47752.2019.9042035
Citations: 4

Abstract

Data movement comprises a significant portion of energy consumption and execution time in modern applications. Accelerator designers exploit quantization to reduce the bitwidth of values and thus the cost of data movement. However, any value that does not fit in the reduced bitwidth causes an overflow (we refer to these values as outliers). Therefore, accelerators use quantization only for applications that are tolerant of overflows. We observe that in most applications the rate of outliers is low and values often fall within a narrow range, providing an opportunity to exploit quantization in general-purpose processors. However, a software implementation of quantization in general-purpose processors has three problems. First, the programmer has to manually implement conversions and the additional instructions that quantize and dequantize values, imposing programmer effort and performance overhead. Second, to cover outliers, the bitwidth of the quantized values often becomes greater than or equal to that of the original values. Third, the programmer has to use standard bitwidths; extracting non-standard bitwidths (i.e., 1–7, 9–15, and 17–31 bits) to represent narrow integers exacerbates the overhead of software-based quantization. The key idea of this paper is hardware support for quantization in the memory hierarchy of general-purpose processors, which represents values with a small and flexible number of bits and stores outliers in their original format in a separate space, preventing any overflow. We minimize metadata and the overhead of locating quantized values through a software-hardware interaction that transfers quantization parameters and data layout to hardware. As a result, our approach has three advantages over cache compression techniques: (i) less metadata, (ii) a higher compression ratio for floating-point values and for cache blocks with multiple data types, and (iii) lower overhead for locating compressed blocks.
It delivers on average 1.40×/1.45×/1.56× speedup and 24/26/30% energy reduction compared to a baseline that uses full-length variables in a 4/8/16-core system. Our approach also provides a 1.23× speedup in a 4-core system over state-of-the-art cache compression techniques, while adding only 0.25% area overhead to the baseline processor.
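To make the core idea concrete, here is a minimal software sketch of outlier-aware quantization. This is an illustration of the general technique described in the abstract, not the paper's hardware design: the function names, placeholder encoding, and dictionary-based outlier table are assumptions for exposition. In-range values are kept as narrow k-bit codes, while any value that would overflow is stored full-width in a separate outlier table, so the round trip is lossless.

```python
# Illustrative sketch of outlier-aware quantization (hypothetical API,
# not the paper's hardware mechanism). Values that fit in `bits` bits
# are kept as narrow codes; overflowing values ("outliers") are stored
# full-width in a side table, preventing any loss from overflow.

def quantize_block(values, bits):
    limit = 1 << bits          # max representable value for an unsigned k-bit code
    packed = []                # narrow codes for in-range values
    outliers = {}              # index -> original full-width value
    for i, v in enumerate(values):
        if 0 <= v < limit:
            packed.append(v)
        else:
            packed.append(0)   # placeholder in the quantized stream
            outliers[i] = v    # keep the original in separate storage
    return packed, outliers

def dequantize_block(packed, outliers):
    # Outlier entries override the placeholder codes on readback.
    return [outliers.get(i, q) for i, q in enumerate(packed)]

vals = [3, 7, 1, 500, 2]       # 500 overflows a 4-bit representation
packed, out = quantize_block(vals, 4)
assert dequantize_block(packed, out) == vals
```

With a low outlier rate, most values occupy only `bits` bits plus a small amount of metadata, which is the property the paper exploits; the hardware version additionally handles flexible bitwidths and block location without per-access software overhead.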