{"title":"MPI Acceleration of Image Classification: Are We Seeing the Resurgence of MPI in Solving Big Data Problems?","authors":"Sameer Kumar","doi":"10.1145/3085158.3091993","DOIUrl":"https://doi.org/10.1145/3085158.3091993","url":null,"abstract":"Recent work has shown the effectiveness of the MPI programming paradigm in accelerating image classification via the Stochastic Gradient Descent optimization technique. Applications such as Caffe, Torch and Tensor Flow, that use Graphic Processing Unit accelerators within the SMP node, have been extended to use MPI across nodes with scalable speedups. In this talk, we will briefly review convolutional neural networks and the stochastic gradient technique to explore optimized solutions for the image classification problem. Next, I will review opportunities and challenges in parallel and distributed asynchronous stochastic gradient descent and the benefits from using MPI libraries. I will also present possible future directions for MPI based deep learning and other Big Data applications.","PeriodicalId":425891,"journal":{"name":"Proceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128549535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How Effective is Design Abstraction in Thrust?: An Empirical Evaluation","authors":"Ajai V. George, Sankar Manoj, S. Gupte, S. Sarkar","doi":"10.1145/3085158.3086159","DOIUrl":"https://doi.org/10.1145/3085158.3086159","url":null,"abstract":"High performance computing applications are far more difficult to write, therefore, practitioners expect a well-tuned software to last long and provide optimized performance even when the hardware is upgraded. It may also be necessary to write software using sufficient abstraction over the hardware so that it is capable of running on heterogeneous architecture. A good design abstraction paradigm strikes a balance between the abstraction and visibility over the hardware. This allows the programmer to write applications without having to understand the hardware nuances while exploiting the computing power optimally. In this paper we have analyzed the power of design abstraction of a popular design abstraction framework called Thrust both from ease of programming and performance perspectives. We have shown that while Thrust framework is good in describing an algorithm compared to the native CUDA or OpenMP version but it has quite a few design limitations. With respect to CUDA it does not provide any abstraction over the shared, texture or constant memory usage to the programmer. We have compared the performance of a Thrust application code in CUDA, OpenMP and the CPP backends with respect to the native versions (implementing exactly same algorithm), written for these backends and found that the current Thrust version performs poorly in most of the cases. While we conclude that the framework is not ready for writing applications that can exploit the optimal performance from the hardware, we also highlight the improvements necessary for the framework to make the performance comparable.","PeriodicalId":425891,"journal":{"name":"Proceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130900587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 1","authors":"Atul Kumar","doi":"10.1145/3248714","DOIUrl":"https://doi.org/10.1145/3248714","url":null,"abstract":"","PeriodicalId":425891,"journal":{"name":"Proceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123863281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 2","authors":"S. Sarkar","doi":"10.1145/3248715","DOIUrl":"https://doi.org/10.1145/3248715","url":null,"abstract":"","PeriodicalId":425891,"journal":{"name":"Proceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications","volume":"291 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116111350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using High Level GPU Tasks to Explore Memory and Communications Options on Heterogeneous Platforms","authors":"Chao Liu, J. Bhimani, M. Leeser","doi":"10.1145/3085158.3086160","DOIUrl":"https://doi.org/10.1145/3085158.3086160","url":null,"abstract":"Heterogeneous computing platforms that use GPUs for acceleration are becoming prevalent. Developing parallel applications for GPU platforms and optimizing GPU related applications for good performance is important. In this work, we develop a set of applications based on a high level task design, which ensures a well defined structure for portability improvement. Together with the GPU task implementation, we utilize a uniform interface to allocate and manage memory blocks that are used by both host and device. In this way we can choose the appropriate types of memory for host/device communication easily and flexibly in GPU tasks. Through asynchronous task execution and CUDA streams, we can explore concurrent GPU kernels for performance improvement when running multiple tasks. We developed a test benchmark set containing nine different kernel applications. Through tests we can learn that pinned memory can improve host/device data transfer for GPU platforms. The performance of unified memory differs a lot on different GPU architectures and is not a good choice if performance is the main focus. The multiple task tests show that applications based on our GPU tasks can effectively make use of the concurrent kernel ability of modern GPUs for better resource utilization.","PeriodicalId":425891,"journal":{"name":"Proceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126492067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"READEX Tool Suite for Energy-efficiency Tuning of HPC Applications","authors":"Anamika Chowdhury, Madhura Kumaraswamy, M. Gerndt","doi":"10.1145/3085158.3091994","DOIUrl":"https://doi.org/10.1145/3085158.3091994","url":null,"abstract":"The European Union Horizon 2020 READEX project is developing a tool suite for dynamic energy tuning of HPC applications. The tool suite performs an analysis during design-time before production run to construct a tuning model encapsulated with the best-found configurations that are then fed to the runtime tuning library. The library switches the configurations at runtime to adapt the application for energy-efficiency.","PeriodicalId":425891,"journal":{"name":"Proceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134515003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PRESGen: A Fully Automatic Equivalence Checker for Validating Optimizing and Parallelizing Transformations","authors":"S. Bandyopadhyay, K. Banerjee","doi":"10.1145/3085158.3086158","DOIUrl":"https://doi.org/10.1145/3085158.3086158","url":null,"abstract":"Petri net has been a popular choice of model of computation (MoC) for representing parallel programs. PRES+ is an extension of the traditional Petri net model which is specially equipped to precisely model embedded systems. Since multi-core and multiprocessor systems have proliferated in the domain of embedded systems as well, it has become critical to validate the optimizing and parallelizing transformations which embedded system specifications go through before being implemented in the hardware. PRES+ model based equivalence checkers for validating such transformations already exist. However, construction of the PRES+ models from the original and the translated codes in these equivalence checkers was not done in an automated manner; thus, leaving scope for inaccurate representation of the PRES+ models since they had to be done manually. Moreover, PRES+ model tends to grow more rapidly with the program size when compared to other MoCs, such as FSMD. To tackle these problems, we propose a method for automated construction of PRES+ models from high-level language programs and using an existing translation scheme to convert PRES+ models to FSMD models, we validate the transformations using a state-of-the-art FSMD equivalence checker. Thus, we have effectively composed an end-to-end fully automatic equivalence checker for validating optimizing and parallelizing transformations. The experimental results demonstrate the practical applicability of our method.","PeriodicalId":425891,"journal":{"name":"Proceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122171318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications","authors":"","doi":"10.1145/3085158","DOIUrl":"https://doi.org/10.1145/3085158","url":null,"abstract":"","PeriodicalId":425891,"journal":{"name":"Proceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122015060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}