Scalable Low-Latency Persistent Neural Machine Translation on CPU Server with Multiple FPGAs

E. Nurvitadhi, Mishali Naik, Andrew Boutros, Prerna Budhkar, A. Jafari, Dongup Kwon, D. Sheffield, Abirami Prabhakaran, Karthik Gururaj, Pranavi Appana
{"title":"Scalable Low-Latency Persistent Neural Machine Translation on CPU Server with Multiple FPGAs","authors":"E. Nurvitadhi, Mishali Naik, Andrew Boutros, Prerna Budhkar, A. Jafari, Dongup Kwon, D. Sheffield, Abirami Prabhakaran, Karthik Gururaj, Pranavi Appana","doi":"10.1109/ICFPT47387.2019.00054","DOIUrl":null,"url":null,"abstract":"We present a CPU server with multiple FPGAs that is purely software-programmable by a unified framework to enable flexible implementation of modern real-life complex AI that scales to large model size (100M+ parameters), while delivering real-time inference latency (~ms). Using multiple FPGAs, we scale by keeping a large model persistent in on-chip memories across FPGAs to avoid costly off-chip accesses. We study systems with 1 to 8 FPGAs for different devices: Intel® Arria® 10, Stratix® 10, and a research Stratix 10 with an AI chiplet. We present the first multi-FPGA evaluation of a complex NMT with bi-directional LSTMs, attention, and beam search. Our system scales well. Going from 1 to 8 FPGAs allows hosting ~8× larger model with only ~2× latency increase. A batch-1 inference for a 100M-parameter NMT on 8 Stratix 10 FPGAs takes only ~10 ms. This system offers 110× better latency than the only prior NMT work on FPGAs, which uses a high-end FPGA and stores the model off-chip.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International Conference on Field-Programmable Technology (ICFPT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICFPT47387.2019.00054","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

We present a CPU server with multiple FPGAs that is purely software-programmable through a unified framework, enabling flexible implementation of modern, complex real-life AI models that scale to large sizes (100M+ parameters) while delivering real-time inference latency (~ms). With multiple FPGAs, we scale by keeping a large model persistent in on-chip memories across the FPGAs, avoiding costly off-chip accesses. We study systems with 1 to 8 FPGAs across different devices: Intel® Arria® 10, Stratix® 10, and a research Stratix 10 with an AI chiplet. We present the first multi-FPGA evaluation of a complex NMT model with bi-directional LSTMs, attention, and beam search. Our system scales well: going from 1 to 8 FPGAs hosts an ~8× larger model with only an ~2× increase in latency. A batch-1 inference for a 100M-parameter NMT model on 8 Stratix 10 FPGAs takes only ~10 ms. This system offers 110× lower latency than the only prior NMT work on FPGAs, which uses a high-end FPGA and stores the model off-chip.
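The central scaling idea is that aggregate on-chip memory grows linearly with the number of FPGAs, so a model that overflows one device's SRAM can stay fully persistent once its weights are sharded across enough devices. The sketch below illustrates that capacity check; it is a minimal sketch, not the authors' framework, and the fp16 weight format and per-device capacities are illustrative assumptions rather than figures from the paper.

```python
# Minimal sketch of the "persistent model" capacity check: all weights must
# fit in the aggregate on-chip SRAM of the FPGAs to avoid off-chip accesses.
# The precision and per-device capacities below are assumptions, not
# numbers reported in the paper.

FP16_BYTES = 2  # assumed weight precision

# Assumed usable on-chip SRAM per device, in MiB (hypothetical round numbers).
ONCHIP_MIB = {
    "Arria 10": 6.5,
    "Stratix 10": 28.0,
}


def fits_on_chip(num_params: int, device: str, num_fpgas: int) -> bool:
    """True if the whole model fits in the FPGAs' aggregate on-chip memory."""
    model_mib = num_params * FP16_BYTES / 2**20
    return model_mib <= ONCHIP_MIB[device] * num_fpgas


if __name__ == "__main__":
    # A 100M-parameter NMT model, as in the paper's largest configuration.
    for n in (1, 2, 4, 8):
        print(f"{n} x Stratix 10:", fits_on_chip(100_000_000, "Stratix 10", n))
```

Under these assumed capacities, a 100M-parameter fp16 model (~190 MiB) becomes fully on-chip-resident only at 8 devices, which is consistent with the 8-FPGA configuration the abstract reports.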