"Instrumental Data Management and Scientific Workflow Execution: the CEA Case Study"
F. Boito, J. Méhaut, T. Deutsch, B. Videau, F. Desprez
2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019. DOI: 10.1109/IPDPSW.2019.00139

Abstract: In this paper, we study a typical scenario in research facilities: instrumental data is generated by lab equipment such as microscopes, collected by researchers onto USB devices, and analyzed on their own computers. In this scenario, an instrumental data management framework could store the data in an institution-level storage infrastructure and allow analysis tasks to be executed on available processing nodes. This setup has the advantages of promoting reproducible research and efficient usage of the expensive lab equipment, in addition to increasing researchers' productivity. We detail the requirements for such a framework with respect to the needs of our CEA case study, review existing solutions, and recommend Galaxy. We then analyze the performance limitations of the proposed architecture and identify the connection between the centralized storage and the processing nodes as the critical point. We also conduct a performance evaluation on an experimental platform to observe the limitations encountered in practice. We finish by pointing out issues that are not addressed by existing solutions and are therefore future work perspectives for the research field.
{"title":"Evaluation of Circuits on the Reconfigurable Mesh","authors":"Y. Ben-Asher, Esti Stein","doi":"10.1109/IPDPSW.2019.00020","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00020","url":null,"abstract":"The Reconfigurable Mesh (RM) is a grid of Processing Elements (PEs) that use dynamic reconfigurations to create varying bus-segments between its PEs. This allows the RM to perform computations such as sorting or counting in a constant number of steps. It has long been speculated that the RM's dynamic reconfiguration should replace the static reconfiguration architecture of the FPGA. In this work, we show that the RM can be used not only to accelerate specific computations such as sorting or summing but also for speeding up the main function of the FPGA, namely evaluation of Boolean Circuits (BCs). We propose an RM algorithm to evaluate BCs and show that it can be done without size blow-up. Moreover, like in the FPGA, it can be done using a grid of tri-state switching elements, rather than a grid of PEs as is the case with the regular RM. This model is called FPRM, and preliminary ASIC synthesis results illustrate that the FPRM architecture is about 2X faster and also more efficient in power/area than the FPGA routing infrastructure.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123431871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FIFO-Based Hardware Sorters for High Bandwidth Memory","authors":"K. Nakano, Yasuaki Ito, J. Bordim","doi":"10.1109/IPDPSW.2019.00112","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00112","url":null,"abstract":"The main contribution of this paper is to show efficient FIFO-based hardware sorters that sort n elements with w bits each stored in a high bandwidth memory with modest access latency. We assume that each address of the high bandwidth memory can store p elements of w bits each, which can be read or written at the same time. The access latency l of the high bandwidth memory is assumed to take l clock cycles to access p elements in a specified address. Furthermore, burst mode is supported and k (≥ 1) consecutive addresses can be accessed in k+l-1 clock cycles in a pipeline fashion. However, if k addresses are not consecutive, kl clock cycles are necessary to access all of them. Clearly, all n elements arranged n/p addresses can be duplicated in 2(n/p+l-1) clock cycles. We present two types of hardware sorters that sort n=rc elements stored in an r×c matrix of the high bandwidth memory. We first develop Three-Pass-Sort and Four-Pass-Sort that sort an r×c matrix by reading from and witting in it three times and four times, respectively. We implement these two algorithms using FIFO-based mergers that can be configured as pairwise mode and sliding mode. Our hardware sorter based on Three-Pass-Sort runs in 6n/p+3c^2/p^2l+O(c/p(l+log r)+r) clock cycles using a circuit of size O(rwp) provided that r≥c^2. Also, our hardware sorter based on Four-Pass-Sort runs in 8n/p+2c^2l+O(cl+log r+p) clock cycles using a circuit of size O(rw).","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127152603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"A Reinforcement Learning Scheduling Strategy for Parallel Cloud-Based Workflows"
André Nascimento, Victor Olimpio, V. Silva, A. Paes, Daniel de Oliveira
2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019. DOI: 10.1109/IPDPSW.2019.00134

Abstract: Scientific experiments can be modeled as workflows. Such workflows are usually compute- and data-intensive, demanding the use of High-Performance Computing environments such as clusters, grids, and clouds. The latter offers the advantage of elasticity, which allows increasing and/or decreasing the number of Virtual Machines (VMs) on demand. Workflows are typically managed using Scientific Workflow Management Systems (SWfMSs), many of which support cloud-based execution. Each SWfMS has its own scheduler that follows a well-defined cost function. However, such cost functions must consider the characteristics of a dynamic environment, such as live migrations and/or performance fluctuations, which are far from trivial to model. This paper proposes a novel scheduling strategy, named ReASSIgN, based on Reinforcement Learning (RL). By relying on an RL technique, one may assume that there is an optimal (or sub-optimal) solution to the scheduling problem and aim at learning the best scheduling from previous executions, in the absence of a mathematical model of the environment. To this end, we propose an extension of the well-known workflow simulator WorkflowSim that implements an RL strategy for scheduling workflows. Once the scheduling plan is generated, the workflow is executed in the cloud using the SciCumulus SWfMS. We conducted a thorough evaluation of the proposed scheduling strategy using a real astronomy workflow.
{"title":"Approximate and Exact Selection on GPUs","authors":"T. Ribizel, H. Anzt","doi":"10.1109/IPDPSW.2019.00088","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00088","url":null,"abstract":"We present a novel algorithm for parallel selection on GPUs. The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always using the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for - and exploiting the characteristics of - \"pleasant\" data distributions. At the same time, as the SampleSelect does not work on the actual values but the ranks of the elements only, it is robust to the input data and can complete significantly faster for adversarial data distributions. Additionally to the exact SampleSelect, we address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114168780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Teaching High Performance Computing through Parallel Programming Marathons"
L. A. J. Marzulo, Calebe P. Bianchini, Leandro Santiago, V. C. Ferreira, Brunno F. Goldstein, F. França
2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019. DOI: 10.1109/IPDPSW.2019.00058

Abstract: Parallel and distributed programming is essential for exploiting the processing power of modern computing platforms. However, during the first years of a Computer Science course, students usually learn problem-solving techniques, data structures, and programming paradigms that are inherently sequential, hindering the transition to parallel architectures. The Parallel Programming Marathons organized in Brazil are similar to other programming competitions around the world and have been used to teach and stimulate undergraduate and graduate students to learn to "think in parallel" and to develop applications for different parallel architectures, including multicores, clusters, and accelerators. This paper presents the structure of the Parallel Programming Marathon and an overview of how it supports regional and national contests. This work also presents use cases from Parallel and Distributed Computing courses at two Brazilian universities that follow a challenge-based learning approach and employ marathon problems as course assignments. This approach contributed to increasing students' interest in High Performance Computing.
"BRICS – Efficient Techniques for Estimating the Farness-Centrality in Parallel"
Sai Charan Regunta, Sai Harsh Tondomker, Kishore Kothapalli
2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019. DOI: 10.1109/IPDPSW.2019.00110

Abstract: In this paper, we study scalable parallel algorithms for estimating the farness-centrality value of the nodes in a given undirected and connected graph. Our algorithms consider approaches that are more suitable for sparse graphs. To this end, we propose four optimization techniques based on removing redundant nodes, removing identical nodes, removing chain nodes, and making use of a decomposition based on the biconnected components of the input graph. We test our techniques on a collection of real-world graphs, measuring the time taken and the average error percentage. We further analyze the applicability of our techniques to various classes of real-world graphs and suggest why certain techniques work better on certain classes of graphs.
{"title":"Are we Doing the Right Thing? — A Critical Analysis of the Academic HPC Community","authors":"H. Anzt, Goran Flegar","doi":"10.1109/IPDPSW.2019.00122","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00122","url":null,"abstract":"Like in any other research field, academically surviving in the High Performance Computing (HPC) community generally requires to publish papers, in the bast case many of them and in high-ranked journals or at top-tier conferences. As a result, the number of scientific papers published each year in this relatively small community easily outnumbers what a single researcher can read. At the same time, many of the proposed and analyzed strategies, algorithms, and hardware-optimized implementations never make it beyond the prototype stage, as they are abandoned once they served the single purpose of yielding (another) publication. In a time and field where high-quality manpower is a scarce resource, this is extremely inefficient. In this position paper we promote a radical paradigm shift towards accepting high-quality software patches to community software packages as legitimate conference contributions. In consequence, the reputation and appointability of researchers is no longer based on the classical scientific metrics, but on the quality and documentation of open source software contributions — effectively improving and accelerating the collaborative development of community software.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130517162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Active Messages for Offloading on the NEC SX-Aurora TSUBASA","authors":"M. Noack, E. Focht, T. Steinke","doi":"10.1109/IPDPSW.2019.00014","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00014","url":null,"abstract":"The NEC SX-Aurora TSUBASA is a new generation of vector processing architectures that combines a standard Intel Xeon host with the newly developed NEC Vector Engine coprocessor cards. One way to use these coprocessors is offloading suitable parts of the program from the host to the Vector Engines. Currently, the only vendor-provided offloading solutions are the low-level Vector Engine Offloading (VEO) library, and a builtin reverse-offloading mechanism named VHcall. In this work, we extend the portable Heterogeneous Active Messages (HAM) based HAM-Offload framework with support for the NEC SX-Aurora TSUBASA. Therefore, we design, implement, and evaluate two messaging protocols aimed at minimising offloading cost. This sheds some light on how to achieve fast communication between host CPU and the Vector Engines of the NEC SX-Aurora TSUBASA. Compared with VEO, the DMA-based protocol reduces offloading overhead by a factor of 13×. The resulting framework enables users to write portable offload applications with low overhead, that do neither require a language extension like OpenMP, nor a special language like OpenCL. Existing HAM-Offload applications are now ready to run on the NEC SX-Aurora TSUBASA.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130762110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fast Local Algorithm for Track Reconstruction on Parallel Architectures","authors":"D. C. Pérez, N. Neufeld, A. Riscos-Núñez","doi":"10.1109/IPDPSW.2019.00118","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00118","url":null,"abstract":"The reconstruction of particle trajectories, tracking, is a central process in the reconstruction of particle collisions in High Energy Physics detectors. At the LHCb detector in the Large Hadron Collider, bunches of particles collide 30 million times per second. These collisions produce about 10^9 particle trajectories per second that need to be reconstructed in real time, in order to filter and store data. Upcoming improvements in the LHCb detector will deprecate the hardware filter in favour of a full software filter, posing a computing challenge that requires a renovation of current algorithms and the underlying hardware. We present Search by triplet, a local tracking algorithm optimized for parallel architectures. We design our algorithm reducing Read-After-Write dependencies as well as conditional branches, incrementing the potential for parallelization. We analyze the complexity of our algorithm and validate our results. We show the scaling of our algorithm for an increasing number of collision events. We show sustained tests for our algorithm sequence given a simulated dataflow. We develop CPU and GPU implementations of our work, and hide the transmission times between device and host by executing a multi-stream pipeline. Our results provide a reliable basis for an informed assessment on the feasibility of LHCb event reconstruction on parallel architectures, enabling us to develop cost models for upcoming technology upgrades. The created software infrastructure is extensible and permits the addition of subsequent reconstruction algorithms.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128371860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}