Can PDES scale in environments with heterogeneous delays?

Proceedings of the 1st ACM SIGSIM Conference on Principles of Advanced Discrete Simulation. Pub Date: 2013-05-19. DOI: 10.1145/2486092.2486098

Jingjing Wang, Ketan Bahulkar, D. Ponomarev, N. Abu-Ghazaleh


Citations: 10

Abstract

The performance and scalability of Parallel Discrete Event Simulation (PDES) are often limited by communication latencies and overheads. The emergence of multi-core processors and their expected evolution into many-cores offers the promise of low-latency communication and tight memory integration between cores; these properties should significantly improve the performance of PDES in such environments. However, on clusters of multi-cores (CMs), the latency and processing overheads incurred when communicating between different machines (nodes) far outweigh those between cores on the same chip, especially when commodity networking fabrics and communication software are used. It is unclear whether the low latency among cores on the same node offers any benefit, given that communication links across nodes are significantly worse. In this study, we examine the performance of a multi-threaded implementation of PDES on CMs. We demonstrate that inter-node communication costs impose a substantial bottleneck on PDES and that, without optimizations addressing these long latencies, multi-threaded PDES does not significantly outperform the multi-process version, despite direct communication through shared memory on the individual nodes. We then propose three optimizations: message consolidation and routing, infrequent polling, and latency-sensitive model partitioning. We show that with these optimizations in place, the threaded implementation of PDES significantly outperforms the process-based implementation even on CMs.
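To make two of the named optimizations concrete, here is a minimal Python sketch of what message consolidation and infrequent polling could look like. It is an illustration under stated assumptions, not the authors' implementation: the abstract gives no code, so `ConsolidatingSender`, `should_poll`, `batch_size`, and `poll_interval` are all hypothetical names and parameters, and the sketch ignores the synchronization concerns (such as rollbacks or global virtual time) that a real PDES engine would have to handle when delaying messages.

```python
from collections import defaultdict


class ConsolidatingSender:
    """Hypothetical helper: buffers events bound for remote nodes and
    forwards them in batches, so several events share one network send."""

    def __init__(self, send_fn, batch_size=4):
        # send_fn(node, events) stands in for the real transport (assumption).
        self.send_fn = send_fn
        self.batch_size = batch_size
        self.buffers = defaultdict(list)

    def send(self, dest_node, event):
        """Queue an event; flush once the batch for that node is full."""
        buf = self.buffers[dest_node]
        buf.append(event)
        if len(buf) >= self.batch_size:
            self.flush(dest_node)

    def flush(self, dest_node):
        """Send whatever is buffered for one node as a single message."""
        buf = self.buffers.pop(dest_node, [])
        if buf:
            self.send_fn(dest_node, buf)

    def flush_all(self):
        """Drain every buffer, e.g. before a synchronization point."""
        for node in list(self.buffers):
            self.flush(node)


def should_poll(iteration, poll_interval=8):
    """Infrequent polling: check the network only every poll_interval
    iterations of the event loop instead of on every iteration."""
    return iteration % poll_interval == 0


# Usage: seven events for node 1 with batch_size=3 need three network sends.
sent = []
sender = ConsolidatingSender(
    lambda node, batch: sent.append((node, list(batch))), batch_size=3
)
for ts in range(7):
    sender.send(dest_node=1, event=("event", ts))
sender.flush_all()
```

The trade-off in such a scheme is that larger batches and longer polling intervals amortize per-message overhead but delay event delivery, so suitable values would depend on the model and the network.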

