Akihiro Hayashi, S. Paul, M. Grossman, J. Shirako, Vivek Sarkar
{"title":"Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages","authors":"Akihiro Hayashi, S. Paul, M. Grossman, J. Shirako, Vivek Sarkar","doi":"10.1145/3152041.3152086","DOIUrl":"https://doi.org/10.1145/3152041.3152086","url":null,"abstract":"With the shift to exascale computer systems, the importance of productive programming models for distributed systems is increasing. Partitioned Global Address Space (PGAS) programming models aim to reduce the complexity of writing distributed-memory parallel programs by introducing global operations on distributed arrays, distributed task parallelism, directed synchronization, and mutual exclusion. However, a key challenge in the application of PGAS programming models is the improvement of compilers and runtime systems. In particular, one open question is how runtime systems meet the requirement of exascale systems, where a large number of asynchronous tasks are executed. While there are various tasking runtimes such as Qthreads, OCR, and HClib, there is no existing comparative study on PGAS tasking/threading runtime systems. To explore runtime systems for PGAS programming languages, we have implemented OCR-based and HClib-based Chapel runtimes and evaluated them with an initial focus on tasking and synchronization implementations. The results show that our OCR and HClib-based implementations can improve the performance of PGAS programs compared to the existing Qthreads backend of Chapel.","PeriodicalId":102432,"journal":{"name":"Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125655243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extending the Open Community Runtime with External Application Support","authors":"J. Dokulil, S. Benkner","doi":"10.1145/3152041.3152088","DOIUrl":"https://doi.org/10.1145/3152041.3152088","url":null,"abstract":"The Open Community Runtime specification prescribes the way a task-parallel application has to be written, in order to give the runtime system the ability to automatically migrate work and data, provide fault tolerance, improve portability, etc. These constraints prevent an application from efficiently starting a new process to run another external program. We have designed an extension of the specification which provides exactly this functionality in a way that fits the task-based model. The bulk of our work is devoted to exploring the way the task-parallel application can interact with an external application without having to resort to using files on a physical drive for data exchange. To eliminate the need to make changes to the external application, the data is exposed via a virtual file system using the filesystem-in-userspace architecture.","PeriodicalId":102432,"journal":{"name":"Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware","volume":"344 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134427706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
B. Peterson, A. Humphrey, John A. Schmidt, M. Berzins
{"title":"Addressing Global Data Dependencies in Heterogeneous Asynchronous Runtime Systems on GPUs","authors":"B. Peterson, A. Humphrey, John A. Schmidt, M. Berzins","doi":"10.1145/3152041.3152082","DOIUrl":"https://doi.org/10.1145/3152041.3152082","url":null,"abstract":"Large-scale parallel applications with complex global data dependencies beyond those of reductions pose significant scalability challenges in an asynchronous runtime system. Internodal challenges include identifying the all-to-all communication of data dependencies among the nodes. Intranodal challenges include gathering together these data dependencies into usable data objects while avoiding data duplication. This paper addresses these challenges within the context of a large-scale, industrial coal boiler simulation using the Uintah asynchronous many-task runtime system on GPU architectures. We show significant reduction in time spent analyzing data dependencies through refinements in our dependency search algorithm. Multiple task graphs are used to eliminate subsequent analysis when task graphs change in predictable and repeatable ways. Using a combined data store and task scheduler redesign reduces data dependency duplication ensuring that problems fit within host and GPU memory. These modifications did not require any changes to application code or sweeping changes to the Uintah runtime system. We report results running on the DOE Titan system on 119K CPU cores and 7.5K GPUs simultaneously. Our solutions can be generalized to other task dependency problems with global dependencies among thousands of nodes which must be processed efficiently at large scale.","PeriodicalId":102432,"journal":{"name":"Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121395189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joshua D. Suetterlein, Joshua Landwehr, A. Márquez, J. Manzano, K. Barker, G. Gao
{"title":"Verification of the Extended Roofline Model for Asynchronous Many Task Runtimes","authors":"Joshua D. Suetterlein, Joshua Landwehr, A. Márquez, J. Manzano, K. Barker, G. Gao","doi":"10.1145/3152041.3152087","DOIUrl":"https://doi.org/10.1145/3152041.3152087","url":null,"abstract":"Asynchronous Many Task (AMT) runtimes promise application designers the ability to better utilize novel hardware resources and to take advantage of the idle times that arise from mismatches between software and hardware components. To foresee possible problems between hardware and software components (described as mismatches), designers usually use models to predict and analyze application behaviors. However, current models are ill suited for AMT runtimes because of their dynamic behavior and agility. To this end, we developed an extended roofline model that aims to provide upper bounds on execution for AMT frameworks. This work focuses on the validation and error characterization of this model, using different statistical techniques and a large set of experiments to evaluate and characterize its error and its sources. We found that in the worst case the error can grow to an order of magnitude; however, there are several techniques to increase the model accuracy given a machine configuration.","PeriodicalId":102432,"journal":{"name":"Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122200793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Risk-based Selective Redundancy for Fault-tolerant Task-parallel HPC Applications","authors":"Omer Subasi, O. Unsal, S. Krishnamoorthy","doi":"10.1145/3152041.3152083","DOIUrl":"https://doi.org/10.1145/3152041.3152083","url":null,"abstract":"Silent data corruption (SDC) and fail-stop errors are the most hazardous error types in high-performance computing (HPC) systems. In this study, we present an automatic, efficient and lightweight redundancy mechanism to mitigate both error types. We propose partial task-replication and checkpointing for task-parallel HPC applications to mitigate silent and fail-stop errors. To avoid the prohibitive costs of complete replication, we introduce a lightweight selective replication mechanism. Using a fully automatic and transparent heuristic, we identify and selectively replicate only the reliability-critical tasks based on a risk metric. Our approach detects and corrects around 70% of silent errors with only 5% average performance overhead. Additionally, the performance overhead of the heuristic itself is negligible.","PeriodicalId":102432,"journal":{"name":"Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125849717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seonmyeong Bak, Harshitha Menon, Sam White, M. Diener, L. Kalé
{"title":"Integrating OpenMP into the Charm++ Programming Model","authors":"Seonmyeong Bak, Harshitha Menon, Sam White, M. Diener, L. Kalé","doi":"10.1145/3152041.3152085","DOIUrl":"https://doi.org/10.1145/3152041.3152085","url":null,"abstract":"The recent trend of rapid increase in the number of cores per chip has resulted in vast amounts of on-node parallelism. These high core counts result in hardware variability that introduces imbalance. Applications are also becoming more complex themselves, resulting in dynamic load imbalance. Load imbalance of any kind can result in loss of performance and decrease in system utilization. In this paper, we propose a new integrated runtime system that adds OpenMP shared-memory parallelism to the Charm++ distributed programming model to improve load balancing on distributed systems. Our proposal utilizes an infrequent periodic assignment of work to cores based on load measurement, in combination with tasks created via OpenMP's parallel loop construct from each core to handle load imbalance. We demonstrate the benefits of using this integrated runtime system on the LLNL ASC proxy application Lassen, achieving speedups of 50% over runs without any load balancing and 10% over existing distributed-memory-only balancing schemes in Charm++.","PeriodicalId":102432,"journal":{"name":"Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129041203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zahra Khatami, Lukas Troska, Hartmut Kaiser, J. Ramanujam, Adrian Serio
{"title":"HPX Smart Executors","authors":"Zahra Khatami, Lukas Troska, Hartmut Kaiser, J. Ramanujam, Adrian Serio","doi":"10.1145/3152041.3152084","DOIUrl":"https://doi.org/10.1145/3152041.3152084","url":null,"abstract":"The performance of many parallel applications depends on loop-level parallelism. However, manually parallelizing all loops may degrade parallel performance, as some of them cannot scale desirably to a large number of threads. In addition, the overheads of manually tuning loop parameters might prevent an application from reaching its maximum parallel performance. We illustrate how machine learning techniques can be applied to address these challenges. In this research, we develop a framework that is able to automatically capture the static and dynamic information of a loop. Moreover, we advocate a novel method by introducing HPX smart executors for determining the execution policy, chunk size, and prefetching distance of an HPX loop to achieve higher performance by feeding static information captured during compilation and runtime-based dynamic information to our learning model. Our evaluated execution results show that using these smart executors can speed up the HPX execution process by around 12% to 35% for the Matrix Multiplication, Stream and 2D Stencil benchmarks compared to setting their HPX loop's execution policy/parameters manually or using HPX auto-parallelization techniques.","PeriodicalId":102432,"journal":{"name":"Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114976450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Panda, K. Schulz, Khaled Hamidouche, H. Subramoni
{"title":"Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware","authors":"D. Panda, K. Schulz, Khaled Hamidouche, H. Subramoni","doi":"10.1145/3152041","DOIUrl":"https://doi.org/10.1145/3152041","url":null,"abstract":"Welcome to the ESPM2 '15 workshop! As the HPC field is heading to Exascale, the role of Programming Models and Middleware is getting more important. The objectives of this workshop are to bring together researchers working in this area and to discuss the state-of-the-art developments in the field. The detailed workshop program is indicated in the previous page. We would like to thank all authors who submitted papers to this workshop. Special thanks go to the program committee members for providing us with high-quality reviews under tight deadlines. For each submitted paper, we were able to collect at least four reviews, and all requested reviews were received despite the tight deadline. Based on the reviews and online discussion among the PC members, a set of five regular papers and two short papers were selected. These papers reflect the state-of-the-art research and developments being conducted in the community in the emerging programming models and middleware area.","PeriodicalId":102432,"journal":{"name":"Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware","volume":"02 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127331056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}