{"title":"CPU-FPGA异构平台上加速蒙特卡罗树搜索","authors":"Yuan Meng, R. Kannan, V. Prasanna","doi":"10.1109/FPL57034.2022.00037","DOIUrl":null,"url":null,"abstract":"Monte Carlo Tree Search (MCTS) methods have achieved great success in many Artificial Intelligence (AI) benchmarks. The in-tree operations become a critical performance bottleneck in realizing parallel MCTS on CPUs. In this work, we develop a scalable CPU-FPGA system for Tree-Parallel MCTS. We propose a novel decomposition and mapping of MCTS data structure and computation onto CPU and FPGA to reduce communication and coordination. High scalability of our system is achieved by encapsulating in-tree operations in an SRAM-based FPGA accelerator. To lower the high data access latency and inter-worker synchronization overheads, we develop several hardware optimizations. We show that by using our accelerator, we obtain up to 35× speedup for in-tree operations, and 3× higher overall system throughput. Our CPU-FPGA system also achieves superior scalability wrt number of parallel workers than state-of-the-art parallel MCTS implementations on CPU.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Accelerating Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform\",\"authors\":\"Yuan Meng, R. Kannan, V. Prasanna\",\"doi\":\"10.1109/FPL57034.2022.00037\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Monte Carlo Tree Search (MCTS) methods have achieved great success in many Artificial Intelligence (AI) benchmarks. The in-tree operations become a critical performance bottleneck in realizing parallel MCTS on CPUs. In this work, we develop a scalable CPU-FPGA system for Tree-Parallel MCTS. We propose a novel decomposition and mapping of MCTS data structure and computation onto CPU and FPGA to reduce communication and coordination. High scalability of our system is achieved by encapsulating in-tree operations in an SRAM-based FPGA accelerator. To lower the high data access latency and inter-worker synchronization overheads, we develop several hardware optimizations. We show that by using our accelerator, we obtain up to 35× speedup for in-tree operations, and 3× higher overall system throughput. Our CPU-FPGA system also achieves superior scalability wrt number of parallel workers than state-of-the-art parallel MCTS implementations on CPU.\",\"PeriodicalId\":380116,\"journal\":{\"name\":\"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/FPL57034.2022.00037\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FPL57034.2022.00037","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Accelerating Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform
Monte Carlo Tree Search (MCTS) methods have achieved great success in many Artificial Intelligence (AI) benchmarks. The in-tree operations become a critical performance bottleneck in realizing parallel MCTS on CPUs. In this work, we develop a scalable CPU-FPGA system for Tree-Parallel MCTS. We propose a novel decomposition and mapping of MCTS data structure and computation onto CPU and FPGA to reduce communication and coordination. High scalability of our system is achieved by encapsulating in-tree operations in an SRAM-based FPGA accelerator. To lower the high data access latency and inter-worker synchronization overheads, we develop several hardware optimizations. We show that by using our accelerator, we obtain up to 35× speedup for in-tree operations, and 3× higher overall system throughput. Our CPU-FPGA system also achieves superior scalability wrt number of parallel workers than state-of-the-art parallel MCTS implementations on CPU.