2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP): Latest Publications

Bio-Inspired Call-Stack Reconstruction for Performance Analysis
Harald Servat, Germán Llort, Juan Gonzalez, Judit Giménez, Jesús Labarta
DOI: 10.1109/PDP.2016.62 (https://doi.org/10.1109/PDP.2016.62) | Published: 2016-04-04
Abstract: The correlation of performance bottlenecks with their associated source code has become a cornerstone of performance analysis. It allows understanding why the efficiency of an application falls behind the computer's peak performance and ultimately enables optimizing the code. To this end, performance analysis tools collect the processor call-stack and then combine this information with measurements to allow the analyst to comprehend the application behavior. Some tools modify the call-stack at run-time to reduce the collection cost, but at the price of non-portable solutions. In this paper, we present a novel portable approach to associate performance issues with their source code counterparts. We capture a reduced segment of the call-stack (up to three levels) and then process the segments using an algorithm inspired by multi-sequence alignment techniques. The results of our approach are easily mapped to detailed performance views, enabling the analyst to unveil the application behavior and its corresponding region of code. To demonstrate the usefulness of our approach, we have applied the algorithm to several previously unseen in-production applications to characterize them in detail and optimize them with small modifications based on the analyses.
Citations: 5
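The core idea above is to align short call-stack samples the way bioinformatics aligns sequences. As a rough illustration of that idea (not the authors' algorithm, which aligns many segments at once), the following Python sketch aligns two sampled call-stack segments with a Needleman-Wunsch-style dynamic program; the frame names and scoring weights are assumptions chosen for the example.

```python
# Minimal sketch (not the authors' algorithm): global alignment of two
# call-stack segments with a Needleman-Wunsch-style dynamic program.
# Frame names and scores below are illustrative assumptions.

def align_stacks(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best alignment score and the aligned frame sequences."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # Trace back to recover the aligned sequences.
    ai, bj, out_a, out_b = n, m, [], []
    while ai > 0 or bj > 0:
        if ai > 0 and bj > 0 and score[ai][bj] == score[ai-1][bj-1] + (match if a[ai-1] == b[bj-1] else mismatch):
            out_a.append(a[ai-1]); out_b.append(b[bj-1]); ai -= 1; bj -= 1
        elif ai > 0 and score[ai][bj] == score[ai-1][bj] + gap:
            out_a.append(a[ai-1]); out_b.append("-"); ai -= 1
        else:
            out_a.append("-"); out_b.append(b[bj-1]); bj -= 1
    return score[n][m], out_a[::-1], out_b[::-1]

# Two sampled call-stack segments (up to three levels), hypothetical frames:
s1 = ["main", "solve", "dgemm"]
s2 = ["main", "dgemm"]
print(align_stacks(s1, s2))   # aligns "solve" against a gap
```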
Evaluation of Splitting-Up Conjugate Gradient Method on GPUs
A. Wakatani
DOI: 10.1109/PDP.2016.9 (https://doi.org/10.1109/PDP.2016.9) | Published: 2016-04-04
Abstract: This paper describes the implementation of a preconditioned CG (Conjugate Gradient) method on GPUs and evaluates its performance against CPUs. Our CG method uses the SP (Splitting-Up) preconditioner, which is suitable for parallel processing because all dimensions except one are independent. To improve the effective bandwidth to the GPU's global memory, our implementation performs a pseudo matrix transposition before and after the tridiagonal matrix solver, resulting in coalesced memory accesses. In addition, the number of pseudo matrix transpositions can be reduced to only one by using a rotation configuration technique. With these techniques, the speedup of our approach is improved by up to 102.2%.
Citations: 0
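For readers unfamiliar with the method being accelerated, the sketch below shows a generic preconditioned CG loop in NumPy. It is a minimal CPU illustration, not the paper's GPU implementation: a simple Jacobi (diagonal) preconditioner stands in for the Splitting-Up preconditioner, and the transposition and rotation techniques are not reproduced.

```python
# Minimal sketch of a preconditioned CG loop (NumPy, CPU). A Jacobi
# (diagonal) preconditioner stands in for M^{-1}; the paper's SP
# preconditioner and GPU-specific optimizations are not reproduced.
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x                # residual
    z = M_inv @ r                # preconditioned residual
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv @ r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Small SPD test system (illustrative):
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
M_inv = np.diag(1.0 / np.diag(A))   # Jacobi preconditioner
print(pcg(A, b, M_inv))             # expect roughly [0.0909, 0.6364]
```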
Black-Box Optimization of Hadoop Parameters Using Derivative-Free Optimization
Diego Desani, V. Gil-Costa, C. Marcondes, H. Senger
DOI: 10.1109/PDP.2016.35 (https://doi.org/10.1109/PDP.2016.35) | Published: 2016-04-04
Abstract: Since its inception in 2004, MapReduce has emerged as a paramount platform and disruptive technology for executing high-performance applications that process very large volumes of data. Hadoop is one of the most popular and widely adopted open-source MapReduce implementations. Companies that run large applications over hundreds or thousands of machines every day spend considerable effort on performance tuning and optimization to reduce infrastructure costs. However, the framework has around 190 parameters that can be adjusted into a large number of different configurations, which can significantly impact application performance. Optimizing Hadoop parameters requires deep knowledge of a myriad of platform details. In this paper, we propose and evaluate the use of derivative-free optimization (DFO) methods for the automatic setup of Hadoop parameters to optimize application performance. DFO methods provide a simple and efficient way to automatically optimize Hadoop MapReduce programs. Parameter changes are deployed through DevOps tools, which efficiently reconfigure the cluster according to the DFO decisions. In the best scenario in our experiments, the automatic optimization reduces execution time by 71% over the default parameter setup (i.e., a speedup of 3.5 times) on a cluster of 28 nodes, with very low overhead for production environments. These results show that DFO methods and automatic optimization are a promising tool for optimizing performance and reducing costs for Hadoop applications whose behavior does not vary dramatically in daily production environments.
Citations: 6
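The approach treats the Hadoop job as a black box whose execution time is minimized by a derivative-free optimizer. A minimal sketch of that loop is shown below, assuming a hypothetical run_job() that deploys a configuration and returns the measured runtime; Nelder-Mead is used here as a representative derivative-free method, not necessarily the one the authors chose.

```python
# Minimal sketch of derivative-free black-box tuning (not the authors' exact
# pipeline). run_job() and the two parameters shown are hypothetical
# placeholders for "deploy this configuration, run the workload, and return
# its execution time".
from scipy.optimize import minimize

def run_job(config):
    """Placeholder: deploy config (e.g. via a DevOps tool), run the MapReduce
    job, and return the measured execution time in seconds."""
    sort_mb, reduce_tasks = config
    # Synthetic stand-in cost surface so the sketch is runnable:
    return (sort_mb - 200) ** 2 / 1000.0 + (reduce_tasks - 16) ** 2 / 10.0 + 120.0

def objective(x):
    # x[0] ~ a buffer size in MB, x[1] ~ number of reduce tasks (assumed knobs)
    return run_job((x[0], x[1]))

# Nelder-Mead is a classic derivative-free method; start from default values.
result = minimize(objective, x0=[100.0, 4.0], method="Nelder-Mead")
print(result.x, result.fun)   # tuned parameters and predicted runtime
```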
Parallel Improved Schnorr-Euchner Enumeration SE++ for the CVP and SVP
Fábio Correia, Artur Mariano, A. Proença, C. Bischof, E. Agrell
DOI: 10.1109/PDP.2016.95 (https://doi.org/10.1109/PDP.2016.95) | Published: 2016-04-04
Abstract: The Closest Vector Problem (CVP) and the Shortest Vector Problem (SVP) are prime problems in lattice-based cryptanalysis, since they underpin the security of many lattice-based cryptosystems. Despite the importance of these problems, there are only a few CVP-solvers publicly available, and their scalability has never been studied. This paper presents a scalable implementation of an enumeration-based CVP-solver for multi-cores, which can be easily adapted to solve the SVP. In particular, it achieves super-linear speedups in some instances on up to 8 cores and almost linear speedups on 16 cores when solving the CVP on a 50-dimensional lattice. Our results show that enumeration-based CVP-solvers can be parallelized as effectively as enumeration-based solvers for the SVP, based on a comparison with a state-of-the-art SVP-solver. In addition, we show that we can optimize the SVP variant of our solver in such a way that it becomes 35%-60% faster than the fastest enumeration-based SVP-solver to date.
Citations: 10
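To make the problem concrete, the sketch below solves the CVP by exhaustive enumeration on a tiny two-dimensional lattice. It only illustrates what an enumeration-based CVP-solver computes; the Schnorr-Euchner strategy prunes this search far more cleverly, and the basis, target, and radius are made-up values.

```python
# Minimal sketch (not Schnorr-Euchner): a brute-force CVP solver for a tiny
# lattice, just to illustrate what an enumeration-based solver computes.
# The basis, target, and search radius R are illustrative assumptions.
import itertools
import numpy as np

def brute_force_cvp(basis, target, R=5):
    """Enumerate integer coefficient vectors in [-R, R]^n and return the
    lattice vector closest to the target."""
    n = basis.shape[1]
    best_vec, best_dist = None, float("inf")
    for coeffs in itertools.product(range(-R, R + 1), repeat=n):
        v = basis @ np.array(coeffs)
        d = np.linalg.norm(v - target)
        if d < best_dist:
            best_vec, best_dist = v, d
    return best_vec, best_dist

basis = np.array([[2.0, 0.0],
                  [1.0, 3.0]])      # columns are the basis vectors
target = np.array([3.2, 2.7])
print(brute_force_cvp(basis, target))
```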
Row Key Designs of NoSQL Database Tables and Their Impact on Write Performance
Eftim Zdravevski, Petre Lameski, A. Kulakov
DOI: 10.1109/PDP.2016.84 (https://doi.org/10.1109/PDP.2016.84) | Published: 2016-04-04
Abstract: In several NoSQL database systems, among which is HBase, only one index is available per table: the row key, which is also the clustered index. Secondary indexes are not available out of the box. As a result, the row key design is the most important decision when designing tables, because an inappropriate design can have detrimental consequences on performance and costs. Particular row key designs are suitable for different problems, and in this paper we analyze the performance, characteristics, and applicability of each of them. In particular, we investigate the effect of various techniques for modeling row keys: sequences, salting, padding, hashing, and modulo operations. We propose four different designs based on these techniques and analyze their performance on different HBase clusters when loading HDFS files of various sizes. The experiments show that particular designs consistently outperform others on differently sized clusters, both in execution time and in even load distribution across nodes.
Citations: 8
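As a concrete illustration of the techniques listed in the abstract, the sketch below builds HBase-style row keys with sequential, salted, hashed, and modulo-based layouts. The specific key formats and bucket count are illustrative assumptions, not the four designs evaluated in the paper.

```python
# Minimal sketch of a few row-key construction techniques mentioned in the
# abstract (sequences, salting, padding, hashing, modulo). The key layouts
# and bucket count are illustrative assumptions.
import hashlib

NUM_BUCKETS = 16   # assumed number of salt buckets / region prefixes

def sequential_key(record_id: int) -> bytes:
    # Plain sequence, zero-padded so keys sort lexicographically by id.
    return f"{record_id:012d}".encode()

def salted_key(record_id: int) -> bytes:
    # Prefix with a salt derived from the id so consecutive writes spread
    # across NUM_BUCKETS regions instead of hammering a single region server.
    salt = int(hashlib.md5(str(record_id).encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{salt:02d}-{record_id:012d}".encode()

def hashed_key(record_id: int) -> bytes:
    # Full hash of the id: uniform distribution, but range scans by id are lost.
    return hashlib.md5(str(record_id).encode()).hexdigest().encode()

def modulo_key(record_id: int) -> bytes:
    # Bucket by modulo, keep the id as the suffix for per-bucket ordering.
    return f"{record_id % NUM_BUCKETS:02d}-{record_id:012d}".encode()

for rid in (1000, 1001, 1002):
    print(sequential_key(rid), salted_key(rid), modulo_key(rid))
```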
Exploring Cache Size and Core Count Tradeoffs in Systems with Reduced Memory Access Latency
P. C. Santos, M. Alves, M. Diener, L. Carro, P. Navaux
DOI: 10.1109/PDP.2016.55 (https://doi.org/10.1109/PDP.2016.55) | Published: 2016-04-04
Abstract: One of the main challenges for computer architects is how to hide the high average memory access latency from the processor. In this context, Hybrid Memory Cubes (HMCs) can provide substantial energy and bandwidth improvements compared to traditional memory organizations. However, it is not clear how this reduced average memory access latency will impact the last-level cache (LLC). For applications with high cache miss ratios, the latency of searching for data inside the cache memory negatively impacts performance, and the importance of this overhead depends on the memory access latency. In this paper, we evaluate the importance of the L3 cache on a high-performance processor using HMC, also exploring chip-area tradeoffs between cache size and the number of processor cores. We show that the high bandwidth provided by HMC memories can eliminate the need for L3 caches, removing hardware and making room for more processing power. Our evaluations show that, compared to DDR3 memories, performance increases by 37% and the EDP (energy-delay product) improves by 12% while maintaining the same original chip area across a wide range of parallel applications.
Citations: 13
Avionics Applications on a Time-Predictable Chip-Multiprocessor
André Rocha, Cláudio Silva, R. B. Sorensen, J. Sparsø, Martin Schoeberl
DOI: 10.1109/PDP.2016.36 (https://doi.org/10.1109/PDP.2016.36) | Published: 2016-04-04
Abstract: Avionics applications need to be certified for the highest criticality standard. This certification includes schedulability analysis and worst-case execution time (WCET) analysis. WCET analysis is only possible when the software is written to be WCET analyzable and when the platform is time-predictable. In this paper we present prototype avionics applications that have been ported to the time-predictable T-CREST platform. The applications are WCET analyzable, and T-CREST is supported by the aiT WCET analyzer. This combination allows us to provide WCET bounds of avionic tasks, even when executing on a multicore processor.
Citations: 6
MWPF: A Deadlock Avoidance Fully Adaptive Routing Algorithm in Networks-on-Chip
Kamran Nasiri, H. Zarandi
DOI: 10.1109/PDP.2016.69 (https://doi.org/10.1109/PDP.2016.69) | Published: 2016-04-04
Abstract: Fully adaptive routing algorithms for Networks-on-Chip (NoC), classified by the number of packets held in a virtual channel (VC), fall into two main groups: 1) traditional fully adaptive routing algorithms, in which only one packet resides in a VC at a time, and 2) whole packet forwarding (WPF), in which multiple packets can reside in a VC. Our analysis shows that WPF, because multiple packets can be held in a VC, suffers from the full output buffer problem, which increases the overall input packet latency. In this paper, a fully adaptive routing algorithm (MWPF) is presented. Compared with TFA and WPF, our design achieves average latency improvements of 65.3% and 35.4%, respectively, and saturation throughput improvements of 38.4% and 24.3% on standard synthetic traffic patterns. Compared with WPF, it achieves 26% average and 61% maximum latency reductions on SPLASH-2 benchmarks running on a 49-core CMP. Our design also offers higher performance than partially adaptive and deterministic routing algorithms.
Citations: 1
VerCors: A Layered Approach to Practical Verification of Concurrent Software
A. Amighi, S. Blom, M. Huisman
DOI: 10.1109/PDP.2016.107 (https://doi.org/10.1109/PDP.2016.107) | Published: 2016-04-04
Abstract: This paper discusses how several concurrent program verification techniques can be combined in a layered approach, where each layer is especially suited to verify one aspect of concurrent programs, thus making verification of concurrent programs practical. At the bottom layer, we use a combination of implicit dynamic frames and CSL-style resource invariants to reason about data race freedom of programs. We illustrate this on the verification of a lock-free queue implementation. On top of this, layer 2 enables reasoning about resource invariants that express a relationship between thread-local and shared variables. This is illustrated by the verification of a reentrant lock implementation, where thread-locality is used to specify for a thread which locks it holds, while there is a global notion of ownership, expressing for a lock by which thread it is held. Finally, the top layer adds a notion of histories to reason about functional properties. We illustrate how this is used to prove that the lock-free queue preserves the order of elements, without having to reverify the aspects related to data race freedom.
Citations: 18
A General Purpose Branch and Bound Parallel Algorithm
A. Dimopoulos, C. Pavlatos, G. Papakonstantinou
DOI: 10.1109/PDP.2016.33 (https://doi.org/10.1109/PDP.2016.33) | Published: 2016-04-04
Abstract: In this paper a parallel algorithm for branch-and-bound applications is proposed. The algorithm is general purpose and can be used to effortlessly parallelize any sequential branch-and-bound-style algorithm that is written in a certain format. It is a distributed dynamic scheduling algorithm, i.e., each node schedules the load of its own cores; it can be used with different programming platforms and architectures and is a hybrid (OpenMP, MPI) algorithm. To prove its validity and efficiency, the proposed algorithm has been implemented and tested on numerous examples, which are described in detail in this paper. A speedup of about 9 has been achieved for the tested examples on a cluster of three nodes with four cores each.
Citations: 0
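For context, the sketch below shows the sequential branch-and-bound pattern such a framework targets, using the 0/1 knapsack problem as the example. It is not the paper's algorithm or input format; in a parallel version, the subproblems pushed onto the queue would be distributed across nodes and cores while sharing the best bound.

```python
# Minimal sketch (not the paper's framework): sequential branch and bound for
# the 0/1 knapsack problem, showing the structure such a framework would
# parallelize (independent subtrees explored under a shared best bound).
import heapq

def knapsack_bnb(values, weights, capacity):
    n = len(values)
    # Sort items by value density for the fractional-relaxation bound.
    order = sorted(range(n), key=lambda i: values[i] / weights[i], reverse=True)

    def bound(idx, value, room):
        # Optimistic bound: greedily fill the remaining room, fractionally at the end.
        b = value
        for i in order[idx:]:
            if weights[i] <= room:
                room -= weights[i]
                b += values[i]
            else:
                b += values[i] * room / weights[i]
                break
        return b

    best = 0
    # Max-heap on the bound (negated); each entry is a subproblem (node).
    heap = [(-bound(0, 0, capacity), 0, 0, capacity)]
    while heap:
        neg_b, idx, value, room = heapq.heappop(heap)
        if -neg_b <= best or idx == n:   # prune, or record a complete solution
            best = max(best, value)
            continue
        i = order[idx]
        if weights[i] <= room:           # branch: take item i
            heapq.heappush(heap, (-bound(idx + 1, value + values[i], room - weights[i]),
                                  idx + 1, value + values[i], room - weights[i]))
        # branch: skip item i
        heapq.heappush(heap, (-bound(idx + 1, value, room), idx + 1, value, room))
    return best

print(knapsack_bnb(values=[60, 100, 120], weights=[10, 20, 30], capacity=50))  # 220
```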