Optimizing latency and throughput for spawning processes on massively multicore processors

International Workshop on Runtime and Operating Systems for Supercomputers. Pub Date: 2012-06-29. DOI: 10.1145/2318916.2318924

Abhishek Kulkarni, A. Lumsdaine, M. Lang, Latchesar Ionkov


Citations: 2

Abstract

The execution of an SPMD application involves running multiple instances of a process with possibly varying arguments. With the widespread adoption of massively multicore processors, attention has turned to harnessing their abundant compute resources effectively and power-efficiently. Although much work has gone into optimizing distributed process launch using hierarchical techniques, the performance of spawning processes within a single node has received little study. Reducing the latency of spawning a new process locally results in faster global job launch. Further, emerging dynamic and resilient execution models are designed on the premise of maintaining process pools for fault isolation and launching several processes in a relatively short period of time. Optimizing the latency and throughput of process spawning would help improve the overall performance of runtime systems, allow adaptive process-replication reliability, and motivate the design and implementation of process management interfaces in future manycore operating systems.

In this paper, we study several factors that limit efficient spawning of processes on massively multicore architectures. We have developed a library that optimizes launching multiple instances of the same executable. Our microbenchmarks show a 20-80% decrease in process spawn time for multiple executables. We further discuss the effects of memory locality and propose NUMA-aware extensions to optimize launching processes with large memory-mapped segments, including dynamic shared libraries. Finally, we describe vector operating system interfaces for spawning a batch of processes from a given executable on specific cores. Our results show a 50x speedup over the traditional method of launching new processes using the fork and exec system calls.
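For orientation, the traditional fork and exec baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the authors' library or their vector interface: the function name `spawn_batch`, its parameters, and the convention of passing each instance its rank as a trailing argument are all assumptions made for the example. Core pinning uses `os.sched_setaffinity`, which is available only on some platforms (notably Linux), so the sketch skips it elsewhere.

```python
import os
import sys
import time


def spawn_batch(executable, base_args, count, cores=None):
    """Launch `count` instances of `executable` using fork followed by exec.

    Hypothetical helper for illustration only. Each child receives its rank
    as a trailing argument (the "possibly varying arguments" of an SPMD
    launch). If `cores` is given, child i is pinned to cores[i % len(cores)]
    where the platform supports it. Returns (exit_codes, elapsed_seconds).
    """
    start = time.perf_counter()
    pids = []
    for rank in range(count):
        pid = os.fork()
        if pid == 0:  # child process
            try:
                if cores and hasattr(os, "sched_setaffinity"):
                    os.sched_setaffinity(0, {cores[rank % len(cores)]})
                os.execv(executable, [executable, *base_args, str(rank)])
            finally:
                # Reached only if pinning or exec failed; never return
                # into the parent's code path from a child.
                os._exit(127)
        pids.append(pid)

    exit_codes = []
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        exit_codes.append(os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1)
    return exit_codes, time.perf_counter() - start


if __name__ == "__main__":
    # Spawn a few trivial interpreter instances and report the spawn rate.
    codes, elapsed = spawn_batch(sys.executable, ["-c", "pass"], 8)
    print(f"spawned {len(codes)} processes in {elapsed:.4f} s "
          f"({len(codes) / elapsed:.1f} processes/s)")
```

Each iteration pays the full cost of one fork and one exec, and the loop issues them one at a time. That per-process overhead is what a batched, single-call interface for spawning many instances of one executable would aim to amortize.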

