Impact of Vectorization and Multithreading on Performance and Energy Consumption on Jetson Boards

2018 International Conference on High Performance Computing & Simulation (HPCS). Pub Date: 2018-07-01. DOI: 10.1109/HPCS.2018.00055

S. Jubertie, Emmanuel Melin, Naly Raliravaka, E. Bodèle, P. E. Bocanegra


Citations: 1

Abstract

ARM processors are well known for their energy efficiency and are consequently widely used in embedded platforms. Like other processor architectures, they are built with different levels of parallelism, from Instruction Level Parallelism (out-of-order and superscalar capabilities) to Thread Level Parallelism (multicore), to increase their performance levels. These processors are now also targeting the HPC domain and will equip the Fujitsu Post-K supercomputer. Some ARM processors from the Cortex-A series, which equip smartphones and tablets, also provide Data Level Parallelism through SIMD units called NEON. These units can process 128 bits of data at a time, for example four 32-bit floating-point values. Taking advantage of these units requires code vectorization, which may be performed automatically by the compiler or explicitly by using NEON intrinsics. Exploiting all these levels of parallelism may lead to better performance but also to higher energy consumption. This is not an issue in the HPC domain, where application development is driven by the search for the best performance. Development for embedded applications, however, is driven by the search for the best trade-off between energy consumption and performance. In this paper, we study the impact of vectorization and multithreading on both performance and energy consumption on some Nvidia Jetson boards. Results show that, depending on the algorithm and on its implementation, vectorization may bring a speedup similar to that of an OpenMP scalar implementation but with lower energy consumption. Combining vectorization and multithreading may come close to both the best performance level and the lowest energy consumption, but not when the cores run at their maximum frequencies.
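
To make the three code variants named in the abstract concrete, here is a minimal sketch in C. It is not taken from the paper: the SAXPY kernel (y = a*x + y) and the function names `saxpy_scalar`, `saxpy_neon` and `saxpy_omp` are illustrative assumptions, chosen only because the kernel maps directly onto four 32-bit floats per 128-bit NEON register. The NEON path is guarded so that the file still compiles on targets without NEON, where it falls back to the scalar loop.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar version: y[i] = a * x[i] + y[i].
 * With optimization enabled, the compiler may auto-vectorize this loop. */
void saxpy_scalar(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Explicit vectorization with NEON intrinsics (hypothetical example kernel).
 * Each 128-bit register holds four 32-bit floats. */
void saxpy_neon(float a, const float *x, float *y, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);   /* load 4 floats from x */
        float32x4_t vy = vld1q_f32(y + i);   /* load 4 floats from y */
        vy = vmlaq_n_f32(vy, vx, a);         /* vy = vy + vx * a     */
        vst1q_f32(y + i, vy);                /* store 4 results      */
    }
#endif
    /* Remainder elements, or the whole array when NEON is unavailable. */
    for (; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Multithreading combined with vectorization through OpenMP.
 * Without OpenMP support the pragma is skipped and the loop runs serially. */
void saxpy_omp(float a, const float *x, float *y, size_t n) {
#if defined(_OPENMP)
#pragma omp parallel for simd
#endif
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

With GCC, a typical build would use something like `-O3 -fopenmp`; 32-bit ARM targets usually also need `-mfpu=neon` to enable the intrinsics. Measuring the energy side of the trade-off needs board-level power readings, which this sketch does not attempt.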

