Benchmarking a DNN for aortic valve calcium lesions segmentation on FPGA-based DPU using the Vitis AI toolchain

IF 6.2 · CAS Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS
Valentina Sisini, Andrea Miola, Giada Minghini, Enrico Calore, Armando Ugo Cavallo, Sebastiano Fabio Schifano, Cristian Zambelli
Journal: Future Generation Computer Systems: The International Journal of eScience, Volume 175, Article 108115
DOI: 10.1016/j.future.2025.108115
Publication date: 2025-09-04 (Journal Article)
Full text: https://www.sciencedirect.com/science/article/pii/S0167739X25004091
Citations: 0

Abstract

Benchmarking a DNN for aortic valve calcium lesions segmentation on FPGA-based DPU using the Vitis AI toolchain
Semantic segmentation assigns a class to every pixel of an image to automatically locate objects in computer vision applications for autonomous vehicles, robotics, agriculture, gaming, and medical imaging. Deep Neural Network (DNN) models, such as Convolutional Neural Networks (CNNs), are widely used for this purpose; among the plethora of models, the U-Net is a standard in biomedical imaging. Nowadays, GPUs efficiently perform segmentation and are the reference architectures for running CNNs, while FPGAs compete as alternative inference platforms, promising higher energy efficiency and lower latency. In this contribution, we evaluate the inference performance of FPGA-based Deep Processing Units (DPUs) implemented on the AMD Alveo U55C, using calcium segmentation in cardiac aortic valve computed tomography scans as a benchmark. We design and implement a U-Net-based application, optimize the hyperparameters to maximize prediction accuracy, apply pruning to simplify the model, and use different numerical quantizations to exploit the low-precision operations supported by DPUs and GPUs to reduce computation time. We describe how to port and deploy the U-Net model on DPUs, and we compare the accuracy, throughput, and energy efficiency achieved with four generations of GPUs and a recent dual 32-core high-end CPU platform. Our results show that a complex DNN like the U-Net can run effectively on DPUs using 8-bit integer computation, achieving a prediction accuracy of approximately 95% in Dice and 91% in IoU scores. These results are comparable to those measured when running the floating-point models on GPUs and CPUs.
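The Dice and IoU scores quoted above both measure overlap between a predicted segmentation mask and the ground truth. A minimal pure-Python sketch of the two metrics over flattened binary masks (the helper names are illustrative, not from the paper's code):

```python
def dice_score(pred, target, eps=1e-7):
    """Dice = 2|A∩B| / (|A| + |B|) for flat 0/1 masks."""
    inter = sum(p and t for p, t in zip(pred, target))
    return (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def iou_score(pred, target, eps=1e-7):
    """IoU = |A∩B| / |A∪B| for flat 0/1 masks."""
    inter = sum(p and t for p, t in zip(pred, target))
    union = sum(p or t for p, t in zip(pred, target))
    return (inter + eps) / (union + eps)

# Toy 2x3 masks flattened to lists (one 0/1 value per pixel)
pred   = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(dice_score(pred, target), 3))  # 0.667
print(round(iou_score(pred, target), 3))   # 0.5
```

Note that Dice is always at least as large as IoU for the same masks, which is consistent with the paper reporting 95% Dice against 91% IoU.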
On the one hand, in terms of computing performance, the DPU achieves an inference latency of approximately 3.5 ms and a throughput of approximately 4.2 kFPS, improving on a 64-core CPU system by approximately 10% in latency and a factor of 2X in throughput, but it still does not surpass the performance of GPUs at the same numerical precision. On the other hand, in terms of energy efficiency, the improvement is approximately a factor of 6.7X over the CPU and 1.6X over the P100 GPU, which is manufactured with the same technological process (16 nm).
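Assuming the reported figures describe steady-state pipelined execution, Little's law relates the quoted latency and throughput to the number of images the DPU keeps in flight. A quick sanity check (the derived quantities below are our arithmetic, not numbers reported in the paper):

```python
# Figures quoted in the abstract for the DPU
latency_s = 3.5e-3        # ~3.5 ms per inference
throughput_fps = 4200.0   # ~4.2 kFPS

# Little's law: concurrency = throughput x latency. A throughput well
# above 1/latency implies batched or pipelined execution on the DPU.
in_flight = throughput_fps * latency_s
single_stream_fps = 1.0 / latency_s

print(round(in_flight, 1))       # ~14.7 images in flight
print(round(single_stream_fps))  # ~286 FPS if images ran strictly one at a time

# Baseline implied by the reported 2X throughput factor over the 64-core CPU
cpu_fps = throughput_fps / 2.0   # ~2100 FPS
```

The gap between ~286 FPS single-stream and ~4.2 kFPS aggregate suggests the DPU result relies on keeping roughly 15 images concurrently in the pipeline.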
Source journal metrics:
CiteScore: 19.90
Self-citation rate: 2.70%
Annual article count: 376
Review time: 10.6 months
Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.