Compiling Efficiently with Arithmetic Emulation for the Custom-Width Connex Vector Processor

WPMVP'19 Pub Date : 2019-02-16 DOI:10.1145/3303117.3306166

Alexandru E. Susu


Citations: 1

Abstract

Compiling sequential C programs with LLVM for the wide Connex vector accelerator, a competitive, customizable architecture for embedded applications with 32 to 4096 16-bit integer lanes, is challenging. Our compiler targets Opincaa, a JIT assembler and coordination C++ library for Connex, which can run programs that are portable with respect to the vector width. For this to work, our back end must handle symbolic C/C++ expressions, represented as adjacent inline assembly strings, which are used as scalar immediate operands in the vector code. Our back end must also lower code to efficiently emulate arithmetic operations on non-native types such as 32-bit integers and 16-bit floating point. To simplify the compiler writer's work, we devise a method that generates the code for lowering these operations inside LLVM's instruction selection pass. We report speedup factors of up to 12.24 when running on a 128-lane Connex processor relative to a dual-core ARM Cortex A9 clocked at a frequency 6.67 times higher, and an average energy efficiency improvement of 1.07 times. Note, however, that a Connex IC can achieve an order of magnitude better energy efficiency than our FPGA implementation.
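To make the idea of emulating a non-native type concrete, here is a minimal sketch of a lane-wise 32-bit integer addition built only from 16-bit operations. It is an illustration under stated assumptions, not the lowering the paper's back end emits: the function names (`split32`, `join32`, `add32_emulated`), the low/high-half register layout, and the compare-based carry detection are all hypothetical choices made for this example, and plain Python lists stand in for vector registers.

```python
MASK16 = 0xFFFF  # each simulated lane holds one 16-bit value


def split32(values):
    """Split 32-bit integers into (low halves, high halves), one entry per lane."""
    lo = [v & MASK16 for v in values]
    hi = [(v >> 16) & MASK16 for v in values]
    return lo, hi


def join32(lo, hi):
    """Recombine per-lane 16-bit halves into 32-bit integers."""
    return [(h << 16) | l for l, h in zip(lo, hi)]


def add32_emulated(a_lo, a_hi, b_lo, b_hi):
    """Lane-wise 32-bit add (wrapping mod 2**32) using only 16-bit arithmetic.

    Hypothetical scheme: add the low halves, detect the carry with an
    unsigned compare, then add the high halves plus that carry.
    """
    out_lo, out_hi = [], []
    for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi):
        lo = (al + bl) & MASK16          # 16-bit add, wraps on overflow
        carry = 1 if lo < al else 0      # wrapped result is smaller than an operand
        hi = (ah + bh + carry) & MASK16  # 16-bit add with carry-in
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


if __name__ == "__main__":
    a = [0x0001FFFF, 0xFFFFFFFF, 123456]
    b = [0x00000001, 0x00000001, 654321]
    result = join32(*add32_emulated(*split32(a), *split32(b)))
    print([hex(r) for r in result])
```

Each 32-bit addition costs several 16-bit operations per lane in this sketch, which is why efficient lowering of such sequences matters for a machine whose native lanes are 16 bits wide.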

