在x86-64多核处理器上优化DGL操作

Proceedings of the 6th International Conference on High Performance Compilation, Computing and Communications Pub Date : 2022-06-23 DOI:10.1145/3546000.3546018

Chao Liu, Huayou Su, Y. Dou, Qinglin Wang

{"title":"在x86-64多核处理器上优化DGL操作","authors":"Chao Liu, Huayou Su, Y. Dou, Qinglin Wang","doi":"10.1145/3546000.3546018","DOIUrl":null,"url":null,"abstract":"Modern x86-64 processors have strong performance due to long vector units. Therefore long vector units are widely used in CNN-like neural network model inference on modern x86-64 processors. However the performance of GNN inference on modern x86-64 processors is poor. Unfortunately, with the development of GNNs and the increase of graph datasets, GNN inference performance meets the serious challenge on x86-64 processors. In this paper, we study the problem of poorly optimized DGL-based GAT models on the x86-64 platform, and analyze the main performance bottlenecks in this case. In order to optimize the performance of DGL on the two main x86-64 platform CPUs of Intel and AMD, we implement a simple and effective task allocator to balance the task load among multiple cores and use vector instructions to optimize the core operators in DGL. In addition, we also propose corresponding optimization ideas for the NUMA architecture. The experimental results show that our optimization method improves the performance of the basic DGL version by up to 3.12x and 2.6x on Intel and AMD platforms.","PeriodicalId":196955,"journal":{"name":"Proceedings of the 6th International Conference on High Performance Compilation, Computing and Communications","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Optimize DGL Operations on x86-64 Multi-Core Processors\",\"authors\":\"Chao Liu, Huayou Su, Y. Dou, Qinglin Wang\",\"doi\":\"10.1145/3546000.3546018\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modern x86-64 processors have strong performance due to long vector units. Therefore long vector units are widely used in CNN-like neural network model inference on modern x86-64 processors. However the performance of GNN inference on modern x86-64 processors is poor. Unfortunately, with the development of GNNs and the increase of graph datasets, GNN inference performance meets the serious challenge on x86-64 processors. In this paper, we study the problem of poorly optimized DGL-based GAT models on the x86-64 platform, and analyze the main performance bottlenecks in this case. In order to optimize the performance of DGL on the two main x86-64 platform CPUs of Intel and AMD, we implement a simple and effective task allocator to balance the task load among multiple cores and use vector instructions to optimize the core operators in DGL. In addition, we also propose corresponding optimization ideas for the NUMA architecture. The experimental results show that our optimization method improves the performance of the basic DGL version by up to 3.12x and 2.6x on Intel and AMD platforms.\",\"PeriodicalId\":196955,\"journal\":{\"name\":\"Proceedings of the 6th International Conference on High Performance Compilation, Computing and Communications\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-06-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 6th International Conference on High Performance Compilation, Computing and Communications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3546000.3546018\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 6th International Conference on High Performance Compilation, Computing and Communications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3546000.3546018","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

现代x86-64处理器由于长向量单元而具有强大的性能。因此，在现代x86-64处理器上，长向量单元被广泛应用于类cnn神经网络模型推理。然而，在现代x86-64处理器上，GNN推理的性能很差。然而，随着GNN的发展和图数据集的增加，GNN的推理性能在x86-64处理器上遇到了严峻的挑战。本文研究了基于dgl的GAT模型在x86-64平台上优化不佳的问题，并分析了这种情况下的主要性能瓶颈。为了优化DGL在Intel和AMD两种主要x86-64平台cpu上的性能，我们实现了一种简单有效的任务分配器来平衡多核之间的任务负载，并使用矢量指令来优化DGL中的核心运算符。此外，我们还对NUMA架构提出了相应的优化思路。实验结果表明，我们的优化方法在Intel和AMD平台上将基本DGL版本的性能分别提高了3.12倍和2.6倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Optimize DGL Operations on x86-64 Multi-Core Processors

Modern x86-64 processors have strong performance due to long vector units. Therefore long vector units are widely used in CNN-like neural network model inference on modern x86-64 processors. However the performance of GNN inference on modern x86-64 processors is poor. Unfortunately, with the development of GNNs and the increase of graph datasets, GNN inference performance meets the serious challenge on x86-64 processors. In this paper, we study the problem of poorly optimized DGL-based GAT models on the x86-64 platform, and analyze the main performance bottlenecks in this case. In order to optimize the performance of DGL on the two main x86-64 platform CPUs of Intel and AMD, we implement a simple and effective task allocator to balance the task load among multiple cores and use vector instructions to optimize the core operators in DGL. In addition, we also propose corresponding optimization ideas for the NUMA architecture. The experimental results show that our optimization method improves the performance of the basic DGL version by up to 3.12x and 2.6x on Intel and AMD platforms.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 6th International Conference on High Performance Compilation, Computing and Communications

自引率

0.00%

发文量