Shulin Zeng, Guohao Dai, Hanbo Sun, Jun Liu, Hongren Zheng, Yusong Wu, Fan Zhang, Xinhao Yang, Yi Cai, Yu Wang, Huazhong Yang
The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, published 2021-02-17. DOI: 10.1145/3431920.3439480 (https://doi.org/10.1145/3431920.3439480)
3M-AI: A Multi-task and Multi-core Virtualization Framework for Multi-FPGA AI Systems in the Cloud
With the ever-growing demand for online Artificial Intelligence (AI), hardware virtualization support for deep learning accelerators is vital for providing AI capability in the cloud. Three basic features are fundamental for hardware virtualization: multi-task execution, dynamic workloads, and remote access. However, most deep learning accelerators do not support concurrent execution of multiple tasks. Moreover, state-of-the-art multi-DNN scheduling algorithms for NN accelerators consider neither concurrent multi-task execution nor resource allocation for multi-core DNN accelerators. In addition, existing GPU virtualization solutions can introduce a large remote access latency overhead, resulting in a severe drop in system performance. To tackle these challenges, we propose 3M-AI, a Multi-task and Multi-core virtualization framework for Multi-FPGA AI systems in the cloud. 3M-AI enables model parallelism across multiple FPGAs by optimizing data synchronization and movement between FPGAs. 3M-AI exploits a heuristic hardware resource allocation algorithm and an accurate multi-core latency prediction model. 3M-AI significantly reduces the remote API access overhead to nearly 1%, and achieves better NN inference latency at a batch size of 1 compared with GPU virtualization solutions.
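The abstract does not describe the allocation algorithm or the latency model, so the following is only a minimal illustrative sketch of how a heuristic allocator might use a latency predictor to divide cores among concurrent tasks. Everything in it is an assumption made for illustration: the function names `predict_latency` and `allocate_cores`, the toy latency formula (work divided by cores plus a per-core synchronization cost), and the greedy rule of giving each spare core to the currently slowest task. It is not the method used in 3M-AI.

```python
from typing import Dict


def predict_latency(work: float, cores: int, sync_cost: float = 0.5) -> float:
    """Toy latency model (an assumption, not the paper's model).

    Compute time shrinks with more cores, while synchronization
    cost grows linearly with the number of extra cores.
    """
    return work / cores + sync_cost * (cores - 1)


def allocate_cores(workloads: Dict[str, float], total_cores: int) -> Dict[str, int]:
    """Greedy heuristic: hand spare cores to the slowest task that still benefits."""
    if total_cores < len(workloads):
        raise ValueError("need at least one core per task")

    # Every task starts with a single core.
    alloc = {name: 1 for name in workloads}
    free = total_cores - len(workloads)

    while free > 0:
        # Tasks whose predicted latency would drop with one more core.
        candidates = [
            name
            for name, work in workloads.items()
            if predict_latency(work, alloc[name] + 1) < predict_latency(work, alloc[name])
        ]
        if not candidates:
            break  # extra cores would only add synchronization overhead

        # Give the next core to the task with the highest predicted latency.
        slowest = max(candidates, key=lambda n: predict_latency(workloads[n], alloc[n]))
        alloc[slowest] += 1
        free -= 1

    return alloc


if __name__ == "__main__":
    # Hypothetical workloads in arbitrary units.
    print(allocate_cores({"resnet": 12.0, "mobilenet": 3.0}, total_cores=4))
```

In this sketch the heavier task receives the spare cores until either the pool is empty or the predictor reports that another core would not help, which is one simple way a latency model can drive allocation decisions.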