2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Papers

Characterizing and Modeling Power and Energy for Extreme-Scale In-Situ Visualization
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.113
Vignesh Adhinarayanan, Wu-chun Feng, D. Rogers, J. Ahrens, S. Pakin
Abstract: Plans for exascale computing have identified power and energy as looming problems for simulations running at that scale. In particular, writing to disk all the data generated by these simulations is becoming prohibitively expensive due to the energy consumption of the supercomputer while it idles waiting for data to be written to permanent storage. In addition, the power cost of data movement is also steadily increasing. A solution to this problem is to write only a small fraction of the data generated while still maintaining the cognitive fidelity of the visualization. With domain scientists increasingly amenable towards adopting an in-situ framework that can identify and extract valuable data from extremely large simulation results and write them to permanent storage as compact images, a large-scale simulation will commit to disk a reduced dataset of data extracts that will be much smaller than the raw results, resulting in a savings in both power and energy. The goal of this paper is two-fold: (i) to understand the role of in-situ techniques in combating power and energy issues of extreme-scale visualization and (ii) to create a model for performance, power, energy, and storage to facilitate what-if analysis. Our experiments on a specially instrumented, dedicated 150-node cluster show that while it is difficult to achieve power savings in practice using in-situ techniques, applications can achieve significant energy savings due to shorter write times for in-situ visualization. We present a characterization of power and energy for in-situ visualization; an application-aware, architecture-specific methodology for modeling and analysis of such in-situ workflows; and results that uncover indirect power savings in visualization workflows for high-performance computing (HPC).
Citations: 7
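The in-situ visualization abstract above separates power from energy: peak power may not drop, yet energy falls because the write phase gets shorter. A minimal sketch of that arithmetic follows. It is not the paper's model; the `Phase` class, the phase names, and every number are hypothetical and chosen only to illustrate energy = power x time.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One workflow phase: average power draw (watts) and duration (seconds)."""
    name: str
    power_w: float
    seconds: float

    @property
    def energy_j(self) -> float:
        # Energy is average power multiplied by elapsed time.
        return self.power_w * self.seconds


def total_energy(phases):
    """Sum the energy of all phases, in joules."""
    return sum(p.energy_j for p in phases)


# Hypothetical workloads (assumed values, not measurements from the paper).
post_hoc = [
    Phase("simulate", power_w=20_000.0, seconds=600.0),
    Phase("write raw data", power_w=15_000.0, seconds=300.0),
]
in_situ = [
    Phase("simulate + render", power_w=20_000.0, seconds=630.0),
    Phase("write image extracts", power_w=15_000.0, seconds=10.0),
]

# Peak power is identical in both workflows; only the time spent writing differs.
saving = 1.0 - total_energy(in_situ) / total_energy(post_hoc)
print(f"energy saving: {saving:.1%}")
```

With these made-up inputs the peak power of the two workflows is the same, so any saving comes entirely from the shorter write phase, which is the effect the abstract reports.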
Accommodating Thread-Level Heterogeneity in Coupled Parallel Applications
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.13
S. Gutierrez, K. Davis, D. Arnold, R. Baker, R. Robey, P. McCormick, Daniel Holladay, J. Dahl, R. Zerr, Florian Weik, Christoph Junghans
Abstract: Hybrid parallel program models that combine message passing and multithreading (MP+MT) are becoming more popular, extending the basic message passing (MP) model that uses single-threaded processes for both inter- and intra-node parallelism. A consequence is that coupled parallel applications increasingly comprise MP libraries together with MP+MT libraries with differing preferred degrees of threading, resulting in thread-level heterogeneity. Retroactively matching threading levels between independently developed and maintained libraries is difficult; the challenge is exacerbated because contemporary parallel job launchers provide only static resource binding policies over entire application executions. A standard approach for accommodating thread-level heterogeneity is to under-subscribe compute resources such that the library with the highest degree of threading per process has one processing element per thread. This results in libraries with fewer threads per process utilizing only a fraction of the available compute resources. We present and evaluate a novel approach for accommodating thread-level heterogeneity. Our approach enables full utilization of all available compute resources throughout an application's execution by providing programmable facilities to dynamically reconfigure runtime environments for compute phases with differing threading factors and memory affinities. We show that our approach can improve overall application performance by up to 5.8x in real-world production codes. Furthermore, the practicality and utility of our approach has been demonstrated by continuous production use for over one year, and by more recent incorporation into a number of production codes.
Citations: 12
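The under-subscription baseline described in the thread-level heterogeneity abstract above can be illustrated with a short calculation. This sketch only reproduces the utilization arithmetic of that baseline, not the authors' reconfiguration approach; the function name, library labels, and core counts are hypothetical.

```python
def undersubscribed_utilization(cores, threads_per_process):
    """Fraction of cores busy during each library's phase under static under-subscription.

    The process count is fixed for the whole run so that the most heavily
    threaded library gets one core per thread; libraries with fewer threads
    per process then leave cores idle during their phases.
    """
    max_threads = max(threads_per_process.values())
    processes = cores // max_threads  # static for the entire execution
    return {
        library: processes * threads / cores
        for library, threads in threads_per_process.items()
    }


# Hypothetical node with 16 cores, one MP-only library and one MP+MT library.
utilization = undersubscribed_utilization(16, {"mp_only": 1, "mp_mt": 4})
print(utilization)
```

Under these assumed numbers the single-threaded library keeps only a quarter of the cores busy, which is the kind of waste the abstract says dynamic reconfiguration is meant to remove.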
Data Centric Performance Measurement Techniques for Chapel Programs
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.37
Hui Zhang, J. Hollingsworth
Abstract: Chapel is an emerging PGAS (Partitioned Global Address Space) language whose design goal is to make parallel programming more productive and generally accessible. To date, the implementation effort has focused primarily on correctness over performance. We present a performance measurement technique for Chapel and the idea is also applicable to other PGAS models. The unique feature of our tool is that it associates the performance statistics not to the code regions (functions), but to the variables (including the heap allocated, static, and local variables) in the source code. Unlike code-centric methods, this data-centric analysis capability exposes new optimization opportunities that are useful in resolving data locality problems. This paper introduces our idea and implementations of the approach with three benchmarks. We also include a case study optimizing benchmarks based on the information from our tool. The optimized versions improved the performance by a factor of 1.4x for LULESH, 2.3x for MiniMD, and 2.1x for CLOMP with simple modifications to the source code.
Citations: 7
Memory Compression Techniques for Network Address Management in MPI
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.18
Yanfei Guo, C. Archer, M. Blocksome, Scott Parker, Wesley Bland, Kenneth Raffenetti, P. Balaji
Abstract: MPI allows applications to treat processes as a logical collection of integer ranks for each MPI communicator, while internally translating these logical ranks into actual network addresses. In current MPI implementations the management and lookup of such network addresses use memory sizes that are proportional to the number of processes in each communicator. In this paper, we propose a new mechanism, called AV-Rankmap, for managing such translation. AV-Rankmap takes advantage of logical patterns in rank-address mapping that most applications naturally tend to have, and it exploits the fact that some parts of network address structures are naturally more performance critical than others. It uses this information to compress the memory used for network address management. We demonstrate that AV-Rankmap can achieve performance similar to or better than that of other MPI implementations while using significantly less memory.
Citations: 8
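The MPI abstract above says rank-to-address mappings often follow regular patterns that can be stored compactly. A minimal sketch of one such pattern, a constant stride, is given below. It is not the AV-Rankmap data structure; the class names are hypothetical, and network addresses are modeled as plain integer indices into some global address table, which is an assumption made for illustration.

```python
class TableRankMap:
    """Baseline: one stored entry per rank, so memory grows with communicator size."""

    def __init__(self, table):
        self.table = list(table)

    def lookup(self, rank):
        if not 0 <= rank < len(self.table):
            raise IndexError(rank)
        return self.table[rank]


class StridedRankMap:
    """Compressed form: three integers describe the whole mapping."""

    def __init__(self, base, stride, size):
        self.base, self.stride, self.size = base, stride, size

    def lookup(self, rank):
        if not 0 <= rank < self.size:
            raise IndexError(rank)
        # The address index is computed instead of being read from a table.
        return self.base + rank * self.stride


def compress(table):
    """Return a StridedRankMap when the mapping has a constant stride, else a table."""
    table = list(table)
    if len(table) >= 2:
        stride = table[1] - table[0]
        if all(b - a == stride for a, b in zip(table, table[1:])):
            return StridedRankMap(table[0], stride, len(table))
    return TableRankMap(table)


# Ranks 0..3 of a hypothetical communicator map to every second global process.
rank_map = compress([4, 6, 8, 10])
print(type(rank_map).__name__, rank_map.lookup(3))
```

Irregular mappings fall back to the full table, so the lookup interface stays the same whichever representation is chosen.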
DC^2-MTCP: Light-Weight Coding for Efficient Multi-Path Transmission in Data Center Network
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.40
Jiyan Sun, Yan Zhang, Xin Wang, Shihan Xiao, Zhen Xu, Hongjing Wu, Xin Chen, Yanni Han
Abstract: Multi-path TCP has recently shown great potential to take advantage of the rich path diversity in data center networks (DCN) to increase transmission throughput. However, the small flows, which take a large fraction of data center traffic, will easily get a timeout when split onto multiple paths. Moreover, the dynamic congestions and node failures in DCN will exacerbate the reorder problem of parallel multi-path transmissions for large flows. In this paper, we propose DC^2-MTCP (Data Center Coded Multi-path TCP), which employs a fast and light-weight coding method to address the above challenges while maintaining the benefit of parallel multi-path transmissions. To meet the high flow performance in DCN, we insert a very low ratio of coded packets with a careful selection of the packets to be coded. We further present a progressive decoding algorithm to decode the packets online with a low time complexity. Extensive ns2-based simulations show that with two orders of magnitude lower coding delay, DC^2-MTCP can reduce on average 40% flow completion time for small flows and increase 30% flow throughput for large flows compared to the peer schemes in varying network conditions.
Citations: 5
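The DC^2-MTCP abstract above relies on inserting a small number of coded packets so that a receiver can rebuild a missing packet without waiting for a retransmission. The abstract does not specify the code used, so the sketch below substitutes the simplest generic scheme, a single XOR parity packet over a group of equal-length packets. The function names and the payloads are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Build one coded packet as the XOR of all packets in the group."""
    parity = bytes(len(packets[0]))  # all-zero block of the packet length
    for packet in packets:
        parity = xor_bytes(parity, packet)
    return parity


def recover(received, parity):
    """Rebuild the single missing packet (marked None) from the parity packet.

    Returns (index_of_missing_packet, rebuilt_payload).
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing packet")
    rebuilt = parity
    for packet in received:
        if packet is not None:
            rebuilt = xor_bytes(rebuilt, packet)
    return missing[0], rebuilt


# Three data packets sent over different paths; the second one is lost or late.
group = [b"aaaa", b"bbbb", b"cccc"]
coded = make_parity(group)
print(recover([b"aaaa", None, b"cccc"], coded))
```

One parity packet per group tolerates a single loss in that group; how DC^2-MTCP selects which packets to code and how its progressive decoder works is described only in the paper itself.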
Language-Based Optimizations for Persistence on Nonvolatile Main Memory Systems
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.60
J. Denny, Seyong Lee, J. Vetter
Abstract: Substantial advances in nonvolatile memory (NVM) technologies have motivated widespread integration of NVM into mobile, enterprise, and HPC systems. Recently, considerable research has focused on architectural integration of NVM and respective programming systems, exploiting NVM's trait of persistence correctly and efficiently. In this regard, we design several novel language-based optimization techniques for programming NVM and demonstrate them as an extension of our NVL-C system. Specifically, we focus on optimizing the performance of atomic updates to complex data structures residing in NVM. We build on two variants of automatic undo logging: canonical undo logging, and shadow updates. We show these techniques can be implemented transparently and efficiently, using dynamic selection and other logging optimizations. Our empirical results on several applications gathered on an NVM testbed illustrate that our cost-model-based dynamic selection technique can accurately choose the best logging variant across different NVM modes and input sizes. In comparison to statically choosing canonical undo logging, this improvement reduces execution time to as little as 53% for block-addressable NVM and 73% for emulated byte-addressable NVM on a Fusion-io ioScale device.
Citations: 7
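The NVM abstract above names two ways of making an update atomic: canonical undo logging and shadow updates. The sketch below contrasts the two ideas in their generic textbook form, using an ordinary in-memory dictionary as a stand-in for persistent storage. It is not NVL-C code, the function names are hypothetical, and both variants deep-copy the whole value for simplicity, which a real system would avoid.

```python
import copy


def update_with_undo_log(store, key, mutate):
    """Undo logging: record the old value, then modify the data in place."""
    undo_log = [(key, copy.deepcopy(store[key]))]
    try:
        mutate(store[key])  # in-place update of the live data
    except Exception:
        # A failure mid-update is rolled back from the log.
        for logged_key, old_value in reversed(undo_log):
            store[logged_key] = old_value
        raise


def update_with_shadow(store, key, mutate):
    """Shadow update: modify a private copy, then publish it with one swap."""
    shadow = copy.deepcopy(store[key])
    mutate(shadow)       # a failure here leaves the live data untouched
    store[key] = shadow  # single reference swap makes the update visible


def append_three(values):
    values.append(3)


store = {"list": [1, 2]}
update_with_undo_log(store, "list", append_three)
update_with_shadow(store, "list", append_three)
print(store)
```

Both variants leave the store either fully updated or unchanged; they differ in where the copying cost is paid, which is the trade-off the paper's cost model is said to choose between at run time.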
Autonomic Resource Management for Program Orchestration in Large-Scale Data Analysis
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.89
Masahiro Tanaka, K. Taura, Kentaro Torisawa
Abstract: Large-scale data analysis applications are becoming more and more prevalent in a wide variety of areas. These applications are composed of many currently available programs called analysis components. Thousands of analysis component processes are orchestrated on many compute nodes. This paper proposes a novel self-tuning framework for optimizing an application's throughput in large-scale data analysis. One challenge is developing efficient orchestration that takes into account the diversity of analysis components and the varying performances of compute nodes. In our previous work, we achieved such an orchestration to a certain degree by introducing our own middleware, which wraps each analysis component as a remote procedure call (RPC) service. The middleware also pools the processes to reduce startup overhead, which is a serious obstacle to achieving high throughput. This work tackles the remaining task of tuning the size of the analysis components' process pools to maximize the application's throughput. This is challenging because analysis components differ drastically in turnaround times and memory footprints. The size of the process pool for each type of analysis component should be set by giving consideration to these properties as well as the constraints on both the memory capacity and the processor core counts. In this work, we formulate this task as a linear programming problem and obtain the optimal pool sizes by solving it. Compared to our previous work, we significantly improved the scalability of our framework by reformulating the performance model to work on hundreds of heterogeneous nodes. We also extended the service allocation mechanism to manage the computational load on each compute node and reduce communication overhead. The experimental results show that our approach is scalable to thousands of analysis component processes running on 200 compute nodes across three clusters. Moreover, our approach significantly reduces memory footprint.
Citations: 0
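The data-analysis abstract above sizes each component's process pool subject to memory and core limits. The paper solves this as a linear program whose exact formulation is not given here, so the sketch below substitutes an exhaustive search over integer pool sizes on a single node. The objective (maximize the throughput of the slowest pipeline stage), the component names, and all numbers are assumptions made for illustration.

```python
from itertools import product


def best_pool_sizes(components, mem_capacity, cores):
    """Search integer pool sizes that maximize the bottleneck stage throughput.

    components: list of dicts with "name", "mem" (memory per process) and
    "rate" (items per second per process). One process is assumed to occupy
    one core. Returns (bottleneck_throughput, pool_sizes), or None if no
    assignment fits.
    """
    best = None
    for sizes in product(range(1, cores + 1), repeat=len(components)):
        if sum(sizes) > cores:
            continue  # core-count constraint
        memory = sum(n * c["mem"] for n, c in zip(sizes, components))
        if memory > mem_capacity:
            continue  # memory-capacity constraint
        throughput = min(n * c["rate"] for n, c in zip(sizes, components))
        if best is None or throughput > best[0]:
            best = (throughput, sizes)
    return best


# Hypothetical two-stage pipeline on a node with 16 memory units and 8 cores.
stages = [
    {"name": "parser", "mem": 1, "rate": 10},
    {"name": "tagger", "mem": 4, "rate": 2},
]
print(best_pool_sizes(stages, mem_capacity=16, cores=8))
```

Exhaustive search is only practical for a handful of components; it is used here because it makes the constraints explicit, whereas the paper's linear-programming formulation is what scales to hundreds of nodes.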
A Scalable and Resilient Microarchitecture Based on Multiport Binding for High-Radix Router Design
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.15
Yi Dai, Kefei Wang, G. Qu, Liquan Xiao, Dezun Dong, Xingyun Qi
Abstract: High-radix routers with low latency and high bandwidth play an increasingly important role in the design of large-scale interconnection networks such as those used in supercomputers and datacenters. The tile-based crossbar approach partitions a single large crossbar into many small tiles and can considerably reduce the complexity of arbitration while providing throughput higher than the conventional switch implementation. However, it is not scalable due to power consumption, placement, and routing problems. In this paper, we propose a truly scalable router microarchitecture called Multiport Binding Tile-based Router (MBTR). By aggregating multiple physical ports into a single tile, a high-radix router can be flexibly organized into a different array of tiles, thus the number of tiles and hardware overhead can be considerably reduced. Compared with a hierarchical crossbar, MBTR achieves up to 50%-75% reduction in memory consumption as well as wire area. Simulation results demonstrate MBTR is indistinguishable from the YARC router in terms of throughput and delay, and can even outperform it by reducing potential contention for output ports. We have fabricated an ASIC MBTR chip with 28 nm technology. Internally, it runs at 700 MHz and 30 ns latency without any speedup. We also discuss how the microarchitecture parameters of MBTR can be adjusted based on the power, area, and design complexity constraints of the arbitration logic.
Citations: 8
Computational Challenges in Constructing the Tree of Life
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.128
T. Warnow
Abstract: Estimating the Tree of Life is one of the grand computational challenges in Science, and has applications to many areas of science and biomedical research. Despite intensive research over the last several decades, many problems remain inadequately solved. Relatively small datasets can take hundreds of CPU years (e.g., the Avian Phylogenomics Project analysis of just 48 bird genomes used more than 200 CPU years to construct its tree), and larger datasets will require much more time. Thus, the estimation of the Tree of Life, which contains millions of species each with a genome containing millions of nucleotides, will depend on both novel algorithmic designs and effective use of high performance and distributed computing platforms.
Citations: 2
Bounded Reordering Allows Efficient Reliable Message Transmission
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.14
Keishla D. Ortiz-Lopez, J. Welch
Abstract: In the reliable message transmission problem (RMTP) processors communicate by exchanging messages, but the channel that connects two processors is subject to message loss, duplication, and reordering. Previous work focused on proposing protocols in asynchronous systems, where message size is finite and sequence numbers are bounded. However, if the channel can duplicate messages (but not lose them) and arbitrarily reorder the messages, the problem is unsolvable. We consider a strengthening of the asynchronous model in which reordering of messages is bounded. In this model, we develop an efficient protocol to solve the RMTP when messages may be duplicated but not lost. This result is in contrast to the impossibility of such an algorithm when reordering is unbounded. Our protocol has the pleasing property that no messages need to be sent from the receiver to the sender and it works when message loss is allowed with some minimal modifications.
Citations: 0