arXiv - CS - Performance: Latest Papers

HRA: A Multi-Criteria Framework for Ranking Metaheuristic Optimization Algorithms
arXiv - CS - Performance | Pub Date: 2024-09-18 | DOI: arxiv-2409.11617
Evgenia-Maria K. Goula, Dimitris G. Sotiropoulos

Abstract: Metaheuristic algorithms are essential for solving complex optimization problems in different fields. However, comparing and rating these algorithms remains difficult because of the wide range of performance metrics and problem dimensions usually involved. On the other hand, nonparametric statistical methods and post hoc tests are time-consuming, especially when we only need to identify the top performers among many algorithms. The Hierarchical Rank Aggregation (HRA) algorithm aims to efficiently rank metaheuristic algorithms based on their performance across many criteria and dimensions. HRA employs a hierarchical framework that begins with collecting performance metrics on various benchmark functions and dimensions. Rank-based normalization is applied to each performance measure to ensure comparability, and the robust TOPSIS aggregation is applied to combine these rankings at several hierarchical levels, resulting in a comprehensive ranking of the algorithms. Our study uses data from the CEC 2017 competition to demonstrate the robustness and efficacy of the HRA framework. It examines 30 benchmark functions and evaluates the performance of 13 metaheuristic algorithms across five performance indicators in four distinct dimensions. This presentation highlights the potential of HRA to clarify the comparative advantages and disadvantages of various algorithms, simplifying practitioners' choice of the most appropriate algorithm for a given optimization problem.

Citations: 0
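The two core steps the abstract names, rank-based normalization followed by TOPSIS aggregation, can be sketched roughly as follows. This is a generic illustration, not the authors' code: the paper uses a robust TOPSIS variant and a multi-level hierarchy, both omitted here, and all function names are hypothetical.

```python
def rank_normalize(scores, lower_is_better=True):
    """Convert raw scores on one criterion into ranks in (0, 1], 1.0 = best."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=not lower_is_better)
    n = len(scores)
    ranks = [0.0] * n
    for pos, i in enumerate(order):
        ranks[i] = (n - pos) / n  # best algorithm gets 1.0, worst gets 1/n
    return ranks

def topsis(matrix, weights=None):
    """Classical TOPSIS: rank alternatives (rows) over criteria (columns),
    all criteria benefit-type. Returns closeness coefficients in [0, 1]."""
    m, n = len(matrix), len(matrix[0])
    weights = weights or [1.0 / n] * n
    # vector-normalize each column, then apply criterion weights
    norms = [sum(matrix[i][j] ** 2 for i in range(m)) ** 0.5 or 1.0
             for j in range(n)]
    v = [[weights[j] * matrix[i][j] / norms[j] for j in range(n)]
         for i in range(m)]
    ideal = [max(v[i][j] for i in range(m)) for j in range(n)]
    anti = [min(v[i][j] for i in range(m)) for j in range(n)]
    cc = []
    for i in range(m):
        d_plus = sum((v[i][j] - ideal[j]) ** 2 for j in range(n)) ** 0.5
        d_minus = sum((v[i][j] - anti[j]) ** 2 for j in range(n)) ** 0.5
        cc.append(d_minus / (d_plus + d_minus) if d_plus + d_minus else 0.0)
    return cc
```

Feeding rank-normalized columns into `topsis` yields one aggregate score per algorithm at each level of the hierarchy.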
Can Graph Reordering Speed Up Graph Neural Network Training? An Experimental Study
arXiv - CS - Performance | Pub Date: 2024-09-17 | DOI: arxiv-2409.11129
Nikolai Merkel, Pierre Toussing, Ruben Mayer, Hans-Arno Jacobsen

Abstract: Graph neural networks (GNNs) are a type of neural network capable of learning on graph-structured data. However, training GNNs on large-scale graphs is challenging due to iterative aggregations of high-dimensional features from neighboring vertices within sparse graph structures, combined with neural network operations. The sparsity of graphs frequently results in suboptimal memory access patterns and longer training time. Graph reordering is an optimization strategy aiming to improve the graph data layout. It has been shown to be effective at speeding up graph analytics workloads, but its effect on the performance of GNN training has not been investigated yet. Generalizing reordering results to GNN performance is nontrivial, as multiple aspects must be considered: GNN hyperparameters such as the number of layers, the number of hidden dimensions, and the feature size used in the GNN model; neural network operations; large intermediate vertex states; and GPU acceleration. In our work, we close this gap by performing an empirical evaluation of 12 reordering strategies in two state-of-the-art GNN systems, PyTorch Geometric and Deep Graph Library. Our results show that graph reordering is effective in reducing training time for both CPU- and GPU-based training. Further, we find that GNN hyperparameters influence the effectiveness of reordering, that reordering metrics play an important role in selecting a reordering strategy, that lightweight reordering performs better for GPU-based than for CPU-based training, and that the invested reordering time can in many cases be amortized.

Citations: 0
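To make the idea of reordering concrete: a lightweight strategy of the general kind benchmarked here can be as simple as a degree sort, relabeling vertices so that high-degree hubs receive small, adjacent IDs and their neighbor lists land closer together in memory. This toy sketch is only an illustration of the concept; the paper evaluates 12 real reordering strategies, not this one.

```python
def degree_reorder(adj):
    """Relabel vertices so that high-degree vertices get the smallest IDs.
    adj: dict mapping vertex -> list of neighbour vertices.
    Returns the adjacency with all IDs remapped to the new labels."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    new_id = {v: i for i, v in enumerate(order)}  # old label -> new label
    return {new_id[v]: sorted(new_id[u] for u in adj[v]) for v in adj}
```

On a star graph, for instance, the hub is moved to ID 0 so that every neighbor list references a compact ID range.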
Temporal Load Imbalance on Ondes3D Seismic Simulator for Different Multicore Architectures
arXiv - CS - Performance | Pub Date: 2024-09-17 | DOI: arxiv-2409.11392
Ana Luisa Veroneze Solórzano, Philippe Olivier Alexandre Navaux, Lucas Mello Schnorr

Abstract: The variety of today's multicore architectures motivates researchers to explore parallel scientific applications on different platforms. Load imbalance is one performance issue that can prevent parallel applications from exploiting the computational power of these platforms. Ondes3D is a scientific application for seismic wave simulation used to assess the geological impact of earthquakes. Its parallelism relies on applying a regular domain decomposition to the given geological domain and distributing each sub-domain to an MPI rank. Previous works investigated the significant spatial and temporal imbalance in Ondes3D and suggested new parallelization and load balancing techniques to minimize it, but none explored its execution on different architectures. Our paper evaluates the performance of Ondes3D for two earthquake scenarios on eight different multicore architectures, including Intel, AMD, and ARM processors. We measure the load distribution per MPI rank, evaluate the temporal load imbalance, and compare the execution of the application's kernels. Our results show that the temporal load imbalance in Ondes3D depends on the architecture chosen, with some platforms minimizing such imbalance more effectively.

Citations: 0
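Temporal load imbalance of this kind is commonly quantified with the max/mean metric, evaluated per timestep rather than over whole-run totals, since totals can hide imbalance that shifts between ranks over time. A minimal sketch with hypothetical function names (the paper's exact metric may differ):

```python
def load_imbalance(times):
    """Percent imbalance for one timestep: how much longer the slowest
    rank runs than the average rank (0.0 = perfectly balanced)."""
    mean = sum(times) / len(times)
    return (max(times) / mean) - 1.0

def temporal_imbalance(per_step_times):
    """per_step_times: list of timesteps, each a list of per-rank durations.
    Averaging the per-step imbalance exposes imbalance that whole-run
    totals can mask."""
    return sum(load_imbalance(step) for step in per_step_times) / len(per_step_times)
```

Note how two ranks that alternate between fast and slow steps total the same work overall, yet every individual step is imbalanced.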
The Landscape of GPU-Centric Communication
arXiv - CS - Performance | Pub Date: 2024-09-15 | DOI: arxiv-2409.09874
Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov

Abstract: In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and per cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches between multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library support. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. It then explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers with insights on how best to exploit multi-GPU systems.

Citations: 0
A Global Perspective on the Past, Present, and Future of Video Streaming over Starlink
arXiv - CS - Performance | Pub Date: 2024-09-15 | DOI: arxiv-2409.09846
Liz Izhikevich, Reese Enghardt, Te-Yuan Huang, Renata Teixeira

Abstract: This study presents the first global analysis of on-demand video streaming over Low Earth Orbit (LEO) satellite networks, using data from over one million households across 85 countries. We highlight Starlink's role as a major LEO provider, enhancing connectivity in underserved regions. Our findings reveal that while overall video quality on Starlink matches that of traditional networks, the inherent variability in LEO conditions -- such as throughput fluctuations and packet loss -- leads to an increase in bitrate switches and rebuffers. To further improve the quality of experience for the LEO community, we manipulate existing congestion control and adaptive bitrate streaming algorithms using simulation and real A/B tests deployed on over one million households. Our results underscore the need for video streaming and congestion control algorithms to adapt to rapidly evolving network landscapes, ensuring high-quality service across diverse and dynamic network types.

Citations: 0
Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators
arXiv - CS - Performance | Pub Date: 2024-09-13 | DOI: arxiv-2409.08595
Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann

Abstract: Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, Plasticine-derived, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions, achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several orders of magnitude faster than RTL simulation.

Citations: 0
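The speedup argument, evaluating a handful of representative loop-kernel iterations instead of billions of instructions, reduces at its core to trip-count scaling. A deliberately simplified sketch (the actual models are generated from an accelerator description and a dependency-graph analysis, neither of which is reproduced here; names are hypothetical):

```python
def estimate_latency_s(kernels, freq_hz):
    """Estimate execution time from per-kernel measurements.
    kernels: list of (trip_count, cycles_per_iteration) pairs, one per
    unique loop kernel in the DNN/hardware dependency graph. Only one
    representative iteration per kernel needs to be timed or modeled;
    scaling to the full workload is just multiplication, which is where
    the speedup over cycle-accurate simulation comes from."""
    total_cycles = sum(trips * cyc for trips, cyc in kernels)
    return total_cycles / freq_hz
```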
Computational Algorithms for the Product Form Solution of Closed Queuing Networks with Finite Buffers and Skip-Over Policy
arXiv - CS - Performance | Pub Date: 2024-09-12 | DOI: arxiv-2409.08075
Gianfranco Balbo, Andrea Marin, Diletta Olliaro, Matteo Sereno

Abstract: Closed queuing networks with finite capacity buffers and skip-over policies are fundamental models in the performance evaluation of computer and communication systems. This technical report presents the details of computational algorithms to derive the key performance metrics for such networks. The primary focus is on the efficient computation of the normalization constant, which is critical for determining the steady-state probabilities of the network states under investigation. A convolution algorithm is proposed, which paves the way for the computation of key performance indices, such as queue length distribution and throughput, accommodating the intricacies introduced by finite capacity constraints and skip-over mechanisms. Finally, an extension of the traditional Mean Value Analysis algorithm addressing numerical stability is provided. The approaches discussed here make the investigation of large-scale networks feasible and enable the development of robust implementations of these techniques for practical use.

Citations: 0
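For the classic case without finite buffers, the normalization constant of a product-form closed network is computed with Buzen's convolution algorithm, which the report's algorithm extends to finite capacities and skip-over (that extension is not shown here). A minimal sketch for load-independent stations:

```python
def buzen_G(rhos, N):
    """Normalization constants G(0..N) for a closed product-form network
    of load-independent stations with N circulating customers.
    rhos[i] = visit ratio * mean service time of station i.
    Classic Buzen convolution: G_m(n) = G_{m-1}(n) + rho_m * G_m(n-1)."""
    g = [1.0] + [0.0] * N  # g[n] holds G(n) for the stations folded in so far
    for rho in rhos:
        for n in range(1, N + 1):
            g[n] += rho * g[n - 1]
    return g

def throughput(rhos, N):
    """System throughput X(N) = G(N-1) / G(N)."""
    g = buzen_G(rhos, N)
    return g[N - 1] / g[N]
```

With two identical stations (rho = 1) and N = 2, G enumerates the 3 ways to place 2 customers on 2 stations, so G(2) = 3 and X(2) = 2/3.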
Microarchitectural comparison and in-core modeling of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa
arXiv - CS - Performance | Pub Date: 2024-09-12 | DOI: arxiv-2409.08108
Jan Laukemann, Georg Hager, Gerhard Wellein

Abstract: With Nvidia's release of the Grace Superchip, all three big semiconductor companies in HPC (AMD, Intel, Nvidia) are currently competing in the race for the best CPU. In this work we analyze the performance of these state-of-the-art CPUs and create an accurate in-core performance model for their microarchitectures Zen 4, Golden Cove, and Neoverse V2, extending the Open Source Architecture Code Analyzer (OSACA) tool and comparing it with LLVM-MCA. Starting from the peculiarities, strengths, and weaknesses of a single core, we extend our comparison with a variety of microbenchmarks and to the capabilities of a full node. The "write-allocate (WA) evasion" feature, which can automatically reduce the memory traffic caused by write misses, receives special attention; we show that the Grace Superchip has a next-to-optimal implementation of WA evasion, and that the only way to avoid write-allocates on Zen 4 is the explicit use of non-temporal stores.

Citations: 0
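The effect of write-allocate evasion is easy to quantify with a simple traffic model: on a STREAM-triad-like kernel, every store miss without WA evasion first reads the target cache line from memory, adding one extra data stream. A back-of-the-envelope sketch, assuming 8-byte doubles and no cache reuse (this is standard textbook reasoning, not a calculation from the paper):

```python
def triad_traffic_bytes(n, wa_evasion):
    """Main-memory traffic for a STREAM-triad kernel a[i] = b[i] + s * c[i]
    over n double-precision elements. Without write-allocate evasion each
    store miss first reads the target cache line, costing 8 B/element."""
    read = 16 * n                       # load streams b and c
    write = 8 * n                       # store stream a
    wa = 0 if wa_evasion else 8 * n     # write-allocate read of a
    return read + write + wa
```

The model predicts a 32:24 = 4:3 traffic ratio, i.e. up to one third more attainable triad bandwidth when write allocates are avoided.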
E-QUARTIC: Energy Efficient Edge Ensemble of Convolutional Neural Networks for Resource-Optimized Learning
arXiv - CS - Performance | Pub Date: 2024-09-12 | DOI: arxiv-2409.08369
Le Zhang, Onat Gungor, Flavio Ponzina, Tajana Rosing

Abstract: Ensemble learning is a meta-learning approach that combines the predictions of multiple learners, demonstrating improved accuracy and robustness. Nevertheless, ensembling models like Convolutional Neural Networks (CNNs) results in high memory and computing overhead, preventing their deployment in embedded systems. These devices are usually equipped with small batteries for power supply and may include energy-harvesting modules that extract energy from the environment. In this work, we propose E-QUARTIC, a novel Energy Efficient Edge Ensembling framework to build ensembles of CNNs targeting Artificial Intelligence (AI)-based embedded systems. Our design outperforms single-instance CNN baselines and state-of-the-art edge AI solutions, improving accuracy and adapting to varying energy conditions while maintaining similar memory requirements. We then leverage the multi-CNN structure of the designed ensemble to implement an energy-aware model selection policy for energy-harvesting AI systems. We show that our solution outperforms the state-of-the-art by reducing the system failure rate by up to 40% while ensuring higher average output quality. Finally, we show that the proposed design enables concurrent on-device training and high-quality inference execution at the edge, limiting the performance and energy overheads to less than 0.04%.

Citations: 0
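An energy-aware model selection policy of the general kind described can be sketched as a greedy knapsack over ensemble members: run as many sub-models as the harvested-energy budget allows, preferring those with the best accuracy per unit of energy. This is a hypothetical illustration, not the E-QUARTIC policy itself:

```python
def select_submodels(costs, accuracies, budget):
    """Pick ensemble members to run under an energy budget, greedily
    preferring the best accuracy-per-energy ratio.
    costs: per-member energy cost; accuracies: per-member standalone
    accuracy; budget: currently available energy.
    Returns the sorted indices of the selected members."""
    order = sorted(range(len(costs)),
                   key=lambda i: accuracies[i] / costs[i], reverse=True)
    chosen, spent = [], 0.0
    for i in order:
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return sorted(chosen)
```

When harvested energy is plentiful the whole ensemble runs; as the budget shrinks, the policy degrades gracefully to the most energy-efficient members instead of failing outright.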
Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU
arXiv - CS - Performance | Pub Date: 2024-09-11 | DOI: arxiv-2409.09086
Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo

Abstract: Multimodal Large Language Models (MLLMs) are distinguished by their comprehensive multimodal abilities and are widely used in many real-world applications, including GPT-4o, autonomous driving, and robotics. Despite their impressive performance, multimodal inputs always incur long contexts. Inference under long context requires caching massive Key and Value states (KV cache) of previous tokens, which introduces high latency and excessive memory consumption. For this reason, it is challenging to deploy streaming inference of MLLMs on edge devices, which largely constrains the power and usage of MLLMs in real-world applications. In this paper, we introduce Inf-MLLM, an efficient inference framework for MLLMs that enables streaming inference of MLLMs on a single GPU with infinite context. Inf-MLLM is based on our key observation of the attention pattern in both LLMs and MLLMs, which we call "attention saddles". Thanks to the newly discovered attention pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach to enable MLLMs to capture long-term dependencies. We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance over 4M-token-long texts and multi-round conversations with 1-hour-long videos on a single GPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality compared to existing methods such as StreamingLLM, and a 2x speedup over H2O.

Citations: 0
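A size-constrained KV cache that keeps "recent tokens and relevant tokens" can be sketched as a recency window plus score-based retention of older entries. The paper's actual criterion builds on the "attention saddles" pattern and an attention bias, both omitted here; the function name and scoring are hypothetical:

```python
def evict_kv(scores, capacity, recent):
    """Choose which KV-cache entries to keep, oldest token first.
    Keeps the `recent` newest positions (recency window) plus the
    highest-scoring older positions ('relevant' tokens), up to
    `capacity` entries total. scores: per-token importance, e.g.
    accumulated attention weight. Returns sorted kept positions."""
    n = len(scores)
    keep = set(range(max(0, n - recent), n))  # recency window always stays
    older = sorted(range(max(0, n - recent)),
                   key=lambda i: scores[i], reverse=True)
    for i in older:
        if len(keep) >= capacity:
            break
        keep.add(i)
    return sorted(keep)
```

Because the cache size is bounded by `capacity` regardless of how long the stream runs, memory stays constant while the most attended older tokens survive eviction.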