一个多线程VLIW软处理器家族

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines Pub Date : 2013-04-28 DOI:10.1109/FCCM.2013.36

Kalin Ovtcharov, Ilian Tili, J. Steffan

{"title":"一个多线程VLIW软处理器家族","authors":"Kalin Ovtcharov, Ilian Tili, J. Steffan","doi":"10.1109/FCCM.2013.36","DOIUrl":null,"url":null,"abstract":"Summary form only given. There is growing commercial interest in using FPGAs for compute acceleration. To ease the programming task for non-hardware-expert programmers, systems are emerging that can map high-level languages such as C and OpenCL to FPGAs-targeting compiler-generated circuits and soft processing engines. Soft processing engines such as CPUs are familiar to programmers, can be reprogrammed quickly without rebuilding the FPGA image, and by their general nature can support multiple software functions in a smaller area than the alternative of multiple per-function synthesized circuits. Finally, compelling processing engines can be incorporated into the output of high-level synthesis systems. For FPGA-based soft compute engines to be compelling they must be computationally dense: they must achieve high throughput per area. For simple CPUs with simple functional units (FUs) it is relatively straightforward to achieve good utilization, and it is not overly-detrimental if a small, single-pipeline-stage FU such as an integer adder is under-utilized. In contrast, larger, more deeply pipelined, more numerous, and more varied FUs can be quite challenging to keep busy-even for an engine capable of extracting instruction-level parallelism (ILP) from an application. Hence a key challenge for FPGA-based compute engines is how to maximize compute density (throughput per-area) by achieving high utilization of a datapath composed of multiple varying FUs of significant and varying pipeline depth. In this work, we propose a highly-parameterizable template architecture of a multi-threaded FPGA-based compute engine designed to highly-utilize varied and deeply pipelined FUs. Our approach to achieving high utilization is to leverage (i) support for multiple thread contexts (ii) thread-level and instruction-level parallelism, and (iii) static compiler analysis and scheduling. We focus on deeply-pipelined, IEEE-754 floating-point FUs of widely-varying latency, executing both Hodgkin-Huxley neuron simulation and Black-Scholes options pricing models as example applications, compiled with our LLVM-based scheduler. Targeting a Stratix IV FPGA, we explore architectural tradeoffs by measuring area and throughput for designs with varying numbers of FUs, thread contexts (T), memory banks (B), and bank multi-porting. To determine the most efficient designs that would be suitable for replicating we measure compute density (application throughput per unit of FPGA area), and report which architectural choices lead to the most computationally-dense designs.The most computationally dense design is not necessarily the one with highest throughput and (i) for maximizing throughput, having each thread reside in its own bank is best; (ii) when only moderate numbers of independent threads are available, the compute engine has higher compute density than a custom hardware implementation eg., 2.3x for 32 threads; (iii) the best FU mix does not necessarily match the FU usage in the dataflow graph of the application; and (iv) architectural parameters.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"75 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A Multithreaded VLIW Soft Processor Family\",\"authors\":\"Kalin Ovtcharov, Ilian Tili, J. Steffan\",\"doi\":\"10.1109/FCCM.2013.36\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Summary form only given. There is growing commercial interest in using FPGAs for compute acceleration. To ease the programming task for non-hardware-expert programmers, systems are emerging that can map high-level languages such as C and OpenCL to FPGAs-targeting compiler-generated circuits and soft processing engines. Soft processing engines such as CPUs are familiar to programmers, can be reprogrammed quickly without rebuilding the FPGA image, and by their general nature can support multiple software functions in a smaller area than the alternative of multiple per-function synthesized circuits. Finally, compelling processing engines can be incorporated into the output of high-level synthesis systems. For FPGA-based soft compute engines to be compelling they must be computationally dense: they must achieve high throughput per area. For simple CPUs with simple functional units (FUs) it is relatively straightforward to achieve good utilization, and it is not overly-detrimental if a small, single-pipeline-stage FU such as an integer adder is under-utilized. In contrast, larger, more deeply pipelined, more numerous, and more varied FUs can be quite challenging to keep busy-even for an engine capable of extracting instruction-level parallelism (ILP) from an application. Hence a key challenge for FPGA-based compute engines is how to maximize compute density (throughput per-area) by achieving high utilization of a datapath composed of multiple varying FUs of significant and varying pipeline depth. In this work, we propose a highly-parameterizable template architecture of a multi-threaded FPGA-based compute engine designed to highly-utilize varied and deeply pipelined FUs. Our approach to achieving high utilization is to leverage (i) support for multiple thread contexts (ii) thread-level and instruction-level parallelism, and (iii) static compiler analysis and scheduling. We focus on deeply-pipelined, IEEE-754 floating-point FUs of widely-varying latency, executing both Hodgkin-Huxley neuron simulation and Black-Scholes options pricing models as example applications, compiled with our LLVM-based scheduler. Targeting a Stratix IV FPGA, we explore architectural tradeoffs by measuring area and throughput for designs with varying numbers of FUs, thread contexts (T), memory banks (B), and bank multi-porting. To determine the most efficient designs that would be suitable for replicating we measure compute density (application throughput per unit of FPGA area), and report which architectural choices lead to the most computationally-dense designs.The most computationally dense design is not necessarily the one with highest throughput and (i) for maximizing throughput, having each thread reside in its own bank is best; (ii) when only moderate numbers of independent threads are available, the compute engine has higher compute density than a custom hardware implementation eg., 2.3x for 32 threads; (iii) the best FU mix does not necessarily match the FU usage in the dataflow graph of the application; and (iv) architectural parameters.\",\"PeriodicalId\":269887,\"journal\":{\"name\":\"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines\",\"volume\":\"75 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-04-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/FCCM.2013.36\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FCCM.2013.36","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

只提供摘要形式。使用fpga进行计算加速的商业兴趣越来越大。为了减轻非硬件专家程序员的编程任务，可以将C和OpenCL等高级语言映射到针对fpga的编译器生成电路和软处理引擎的系统正在出现。像cpu这样的软处理引擎是程序员所熟悉的，可以在不重建FPGA映像的情况下快速重新编程，并且由于其一般性质，可以在比多个功能合成电路的替代方案更小的区域内支持多个软件功能。最后，引人注目的处理引擎可以被纳入高级合成系统的输出。为了使基于fpga的软计算引擎引人注目，它们必须计算密集:它们必须实现每个区域的高吞吐量。对于具有简单功能单元(FU)的简单cpu来说，实现良好的利用率是相对简单的，如果一个小的、单管道级的FU(如整数加法器)没有得到充分利用，也不会造成太大的损害。相比之下，更大、更深入的流水线化、更多数量和更多样化的FUs可能非常难以保持忙碌——即使对于能够从应用程序中提取指令级并行性(ILP)的引擎也是如此。因此，基于fpga的计算引擎面临的一个关键挑战是，如何通过实现由多个具有显著和不同管道深度的不同FUs组成的数据路径的高利用率来最大化计算密度(每区域的吞吐量)。在这项工作中，我们提出了一个基于fpga的多线程计算引擎的高度可参数化的模板架构，旨在高度利用各种深度流水线化的FUs。我们实现高利用率的方法是利用(i)对多线程上下文的支持(ii)线程级和指令级并行性，以及(iii)静态编译器分析和调度。我们专注于深度流水线，IEEE-754浮点FUs具有广泛的延迟，执行Hodgkin-Huxley神经元模拟和Black-Scholes期权定价模型作为示例应用程序，使用我们基于llvm的调度程序编译。针对Stratix IV FPGA，我们通过测量具有不同数量的FUs，线程上下文(T)，内存库(B)和银行多端口的设计的面积和吞吐量来探索架构权衡。为了确定适合复制的最有效的设计，我们测量了计算密度(FPGA面积单位的应用程序吞吐量)，并报告了哪些架构选择导致了最计算密度的设计。计算密度最高的设计不一定是具有最高吞吐量的设计，并且(i)为了最大限度地提高吞吐量，让每个线程驻留在自己的线程组中是最好的;(ii)当只有中等数量的独立线程可用时，计算引擎比自定义硬件实现具有更高的计算密度。， 2.3倍，32线程;(iii)最佳的傅里叶混合不一定符合应用程序数据流图中的傅里叶使用情况;(四)建筑参数。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Multithreaded VLIW Soft Processor Family

Summary form only given. There is growing commercial interest in using FPGAs for compute acceleration. To ease the programming task for non-hardware-expert programmers, systems are emerging that can map high-level languages such as C and OpenCL to FPGAs-targeting compiler-generated circuits and soft processing engines. Soft processing engines such as CPUs are familiar to programmers, can be reprogrammed quickly without rebuilding the FPGA image, and by their general nature can support multiple software functions in a smaller area than the alternative of multiple per-function synthesized circuits. Finally, compelling processing engines can be incorporated into the output of high-level synthesis systems. For FPGA-based soft compute engines to be compelling they must be computationally dense: they must achieve high throughput per area. For simple CPUs with simple functional units (FUs) it is relatively straightforward to achieve good utilization, and it is not overly-detrimental if a small, single-pipeline-stage FU such as an integer adder is under-utilized. In contrast, larger, more deeply pipelined, more numerous, and more varied FUs can be quite challenging to keep busy-even for an engine capable of extracting instruction-level parallelism (ILP) from an application. Hence a key challenge for FPGA-based compute engines is how to maximize compute density (throughput per-area) by achieving high utilization of a datapath composed of multiple varying FUs of significant and varying pipeline depth. In this work, we propose a highly-parameterizable template architecture of a multi-threaded FPGA-based compute engine designed to highly-utilize varied and deeply pipelined FUs. Our approach to achieving high utilization is to leverage (i) support for multiple thread contexts (ii) thread-level and instruction-level parallelism, and (iii) static compiler analysis and scheduling. We focus on deeply-pipelined, IEEE-754 floating-point FUs of widely-varying latency, executing both Hodgkin-Huxley neuron simulation and Black-Scholes options pricing models as example applications, compiled with our LLVM-based scheduler. Targeting a Stratix IV FPGA, we explore architectural tradeoffs by measuring area and throughput for designs with varying numbers of FUs, thread contexts (T), memory banks (B), and bank multi-porting. To determine the most efficient designs that would be suitable for replicating we measure compute density (application throughput per unit of FPGA area), and report which architectural choices lead to the most computationally-dense designs.The most computationally dense design is not necessarily the one with highest throughput and (i) for maximizing throughput, having each thread reside in its own bank is best; (ii) when only moderate numbers of independent threads are available, the compute engine has higher compute density than a custom hardware implementation eg., 2.3x for 32 threads; (iii) the best FU mix does not necessarily match the FU usage in the dataflow graph of the application; and (iv) architectural parameters.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines

自引率

0.00%

发文量