ONNXim: A Fast, Cycle-level Multi-core NPU Simulator

arXiv - CS - Performance | Pub Date: 2024-06-12 | DOI: arxiv-2406.08051

Hyungkyu Ham, Wonhyuk Yang, Yunseon Shin, Okkyun Woo, Guseul Heo, Sangyeop Lee, Jongse Park, Gwangsun Kim

Abstract

As DNNs are adopted across a widening range of application domains while demanding ever more compute and memory, designing efficient, high-performance NPUs (Neural Processing Units) is becoming more important. However, existing architectural NPU simulators lack support for one or more of the following: high-speed simulation, multi-core modeling, multi-tenant scenarios, detailed DRAM/NoC modeling, and multiple deep learning frameworks. To address these limitations, this work proposes ONNXim, a fast cycle-level simulator for multi-core NPUs in DNN serving systems. For ease of simulation, it takes as input DNN models in the ONNX graph format, which can be generated from various deep learning frameworks. In addition, based on the observation that typical NPU cores process tensor tiles from on-chip scratchpad memory with deterministic compute latency, we forgo detailed modeling of the computation while still preserving simulation accuracy. ONNXim also preserves the dependencies between compute and tile DMAs. Meanwhile, the DRAM and NoC are modeled at cycle level to properly capture contention among multiple cores, which can execute different DNN models for multi-tenancy. Consequently, ONNXim is significantly faster than existing simulators (e.g., up to 384x faster than Accel-sim) and enables various case studies, such as multi-tenant NPUs, that were previously impractical due to slow simulation speed and/or missing functionality. ONNXim is publicly available at https://github.com/PSAL-POSTECH/ONNXim.
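The modeling idea in the abstract can be illustrated with a toy sketch. The code below is not ONNXim's implementation or API; it is a minimal model written under stated assumptions. Each tile's compute is charged a fixed, precomputed latency instead of being simulated cycle by cycle; a tile can only be computed after its DMA load has finished; and DRAM bandwidth is shared cycle by cycle among cores, so contention between tenants shows up in the results. The names `Tile`, `simulate`, `dram_bytes_per_cycle`, and `buffers` are hypothetical, as is the equal-split bandwidth policy, and the NoC is omitted entirely.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    load_bytes: int      # bytes the DMA must bring into scratchpad (assumed > 0)
    compute_cycles: int  # deterministic compute latency, known in advance


def simulate(cores, dram_bytes_per_cycle, buffers=2):
    """Toy model: `cores` is a list of tile lists, one per core.

    Returns the cycle at which each core retires its last tile.
    `buffers` is the number of tiles the scratchpad can hold (2 = double buffering).
    """
    n = len(cores)
    if dram_bytes_per_cycle < max(n, 1):
        raise ValueError("toy model assumes at least 1 byte/cycle per core")

    loaded = [0] * n             # tiles fully resident in scratchpad
    computed = [0] * n           # tiles fully computed
    load_left = [0] * n          # bytes left in the in-flight DMA (0 = idle)
    compute_done = [None] * n    # cycle at which the running compute retires
    finish = [None] * n
    cycle = 0

    while any(f is None for f in finish):
        for c, tiles in enumerate(cores):
            # Retire a compute whose fixed latency has elapsed.
            if compute_done[c] is not None and cycle >= compute_done[c]:
                computed[c] += 1
                compute_done[c] = None
            if finish[c] is None and computed[c] == len(tiles):
                finish[c] = cycle
            # Compute depends on its tile's DMA: start only if the tile is loaded.
            if compute_done[c] is None and computed[c] < loaded[c]:
                compute_done[c] = cycle + tiles[computed[c]].compute_cycles
            # DMA depends on scratchpad space: prefetch only into a free buffer.
            if (load_left[c] == 0 and loaded[c] < len(tiles)
                    and loaded[c] - computed[c] < buffers):
                load_left[c] = tiles[loaded[c]].load_bytes

        # DRAM is modeled per cycle: active DMAs split the bandwidth equally.
        active = [c for c in range(n) if load_left[c] > 0]
        for c in active:
            share = dram_bytes_per_cycle // len(active)
            load_left[c] -= min(share, load_left[c])
            if load_left[c] == 0:
                loaded[c] += 1
        cycle += 1

    return finish
```

In this sketch, one core running a single tile of 64 bytes and 10 compute cycles on a 64 bytes/cycle DRAM finishes at cycle 11, while two cores running the same tile concurrently each finish at cycle 12, because their loads split the bandwidth. The compute side costs one addition per tile rather than one step per cycle, which is the kind of saving the abstract attributes to deterministic compute latency.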
