{"title":"Essentials of Parallel Graph Analytics","authors":"M. Osama, Serban D. Porumbescu, John Douglas Owens","doi":"10.1109/IPDPSW55747.2022.00061","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00061","url":null,"abstract":"We identify the graph data structure, frontiers, operators, an iterative loop structure, and convergence conditions as essential components of graph analytics systems based on the native-graph approach. Using these essential components, we propose an abstraction that captures all the significant programming models within graph analytics, such as bulk-synchronous, asynchronous, shared-memory, message-passing, and push vs. pull traversals. Finally, we demonstrate the power of our abstraction with an elegant modern C++ implementation of single-source shortest path and its required components.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123851132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EDAML 2022 Invited Speaker 2: AI Algorithm and Accelerator Co-design for Computing on the Edge","authors":"Deming Chen","doi":"10.1109/IPDPSW55747.2022.00195","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00195","url":null,"abstract":"In a conventional top-down design flow, deep-learning algorithms are first designed concentrating on the model accuracy, and then accelerated through hardware accelerators trying to meet various system design targets on power, energy, speed, and cost. However, this approach often does not work well because it ignores the physical constraints that the hardware architectures themselves would have towards the deep neural network (DNN) algorithm design and deployment, especially for the DNNs that will be deployed unto edge devices. Thus, an ideal scenario is that algorithms and their hardware accelerators are developed simultaneously. In this talk, we will present our DNN/Accelerator co-design and co-search methods. Our results have shown great promises for delivering high-performance hardware-tailored DNNs and DNNtailored accelerators naturally and elegantly. One of the DNN models coming out of this co-design method, called SkyNet, won a double championship in the competitive DAC System Design Contest for both the GPU and the FPGA tracks for low-power object detection.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115881091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Coarse Grained Reconfigurable Architecture for SHA-2 Acceleration","authors":"H. Pham, T. Tran, Luc Duong, Y. Nakashima","doi":"10.1109/IPDPSW55747.2022.00117","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00117","url":null,"abstract":"The development of high-speed SHA-2 hardware with high flexibility is urgently needed because SHA-2 functions are widely employed in numerous fields, from loT devices to cryp-to currency. Unfortunately, the existing SHA-2 circuits have difficulty in achieving high flexibility and hardware efficiency. Therefore, this paper proposes a coarse-grained reconfigurable architecture (CGRA) for accelerating SHA-2 computation, named a CGRA SHA-2 accelerator. To effectively support various algorithms and requirements, three optimization techniques are proposed to achieve high flexibility and hardware efficiency. First, an on-demand pro-cessing element array is proposed to enable flexible computation for long and short messages. Second, a dual-ALU processing element (D-PE) is proposed to compute various SHA-2 functions. Third, the pipelined dual-ALU architecture is proposed to reduce the critical paths, leading to remarkably improved performance and hardware efficiency. The accuracy of our proposed accelerator is verified on a real hardware platform (the Xilinx Alveo U280 FPGA). Besides, the experimental results on several FPGAs prove that the proposed CGRA SHA-2 accelerator is significantly higher performance, hardware efficiency, and flexibility than existing works.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117013960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HCW 2022 Keynote Speaker: Heterogeneous Computing for Scientific Machine Learning","authors":"L. White","doi":"10.1109/IPDPSW55747.2022.00011","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00011","url":null,"abstract":"More than ever, the semiconductor industry is asked to answer society's call for more computing capacity and capability, which are driven by rapid digitalization, the widespread adoption of artificial intelligence, and the ever-increasing need for high-fidelity scientific simulations. While facing high demand, the supply of computing capability is being technically challenged by the slowdown of Moore's law and the need for high energy efficiency. This tug-of-war has now pushed the industry towards domain-specific accelerators, perhaps likely past the point of no return. The mix of general-purpose CPUs and high-end GPGPUs, which has pervaded data centers over the past few years, is likely to be expanded to a much richer set of application-specific accelerators, including AI engines, reconfigurable hardware, and even perhaps quantum, annealing, and neuromorphic devices. While acceleration and better efficiency may be enabled by using domain-specific accelerators for selected workloads, a much more holistic (i.e., system-wide) approach will have to be adopted to achieve significant performance gains for complex applications that consist of a variety of workloads where each could benefit from a specific accelerator. As an important example, scientific computing, which increasingly incorporates AI training and inference kernels in a tightly-integrated fashion, provides a rich and exciting laboratory for addressing the challenges of efficiently using highly-heterogeneous systems and for ultimately realizing their promises. Those challenges include co-designing the application, which requires domain experts to collaborate with other experts across the stack for workload mapping and data orchestration, and also adopting a decentralized strategy that embeds processing units where the data need them. Finally, the early experience of those co-design efforts should help the industry devise a longer-term strategy for developing programming models that would relieve application experts from what is often perceived as the burden of hardwareaware development and code optimization.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114330002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"When and How to Retrain Machine Learning-based Cloud Management Systems","authors":"Lidia Kidane, P. Townend, Thijs Metsch, E. Elmroth","doi":"10.1109/IPDPSW55747.2022.00120","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00120","url":null,"abstract":"Cloud management systems increasingly rely on machine learning (ML) models to predict incoming workload rates, load, and other system behaviours for efficient dynamic resource management. Current state-of-the-art prediction models demonstrate high accuracy but assume that data patterns remain stable. However, in production use, systems may face hardware upgrades, changes in user behaviour etc. that lead to concept drifts - significant changes in the characteristics of data streams over time. To mitigate prediction deterioration, ML models need to be updated - but questions of when and how to best retrain these models are unsolved in the context of cloud management. We present a pilot study that addresses these questions for one of the most common models for adaptive prediction - Long Short Term Memory (LSTM) - using synthetic and real-world workload data. Our analysis of when to retrain explores approaches for detecting when retraining is required using both concept drift detection and prediction error thresholds, and at what point retraining should actually take place. Our analysis of how to retrain focuses on the data required for retraining, and what proportion should be taken from before and after the need for retraining is detected. We present initial results that indicate that retraining of existing models can achieve prediction accuracy close to that of newly trained models but for much less cost, and present initial advice for how to provide cloud management systems with support for automatic retraining of ML-based models.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114496301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Green(er) World for A.I.","authors":"Dan Zhao, Nathan C Frey, Joseph McDonald, M. Hubbell, David Bestor, Michael Jones, Andrew Prout, V. Gadepally, S. Samsi","doi":"10.1109/IPDPSW55747.2022.00126","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00126","url":null,"abstract":"As research and practice in artificial intelligence (A.I.) grow in leaps and bounds, the resources necessary to sustain and support their operations also grow at an increasing pace. While innovations and applications from A.I. have brought significant advances, from applications to vision and natural language to improvements to fields like medical imaging and materials engineering, their costs should not be neglected. As we embrace a world with ever-increasing amounts of data as well as research & development of A.I. applications, we are sure to face an ever-mounting energy footprint to sustain these computational budgets, data storage needs, and more. But, is this sustainable and, more importantly, what kind of setting is best positioned to nurture such sustainable A.I. in both research and practice? In this paper, we outline our outlook for Green A.I.—a more sustainable, energy-efficient and energy-aware ecosystem for developing A.I. across the research, computing, and practitioner communities alike—and the steps required to arrive there. We present a bird's eye view of various areas for potential changes and improvements from the ground floor of AI's operational and hardware optimizations for datacenter/HPCs to the current incentive structures in the world of A.I. research and practice, and more. We hope these points will spur further discussion, and action, on some of these issues and their potential solutions.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115756157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AsHES 2022 Keynote Speaker: The Modular Supercomputing Architecture (MSA)","authors":"E. Suarez","doi":"10.1109/IPDPSW55747.2022.00071","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00071","url":null,"abstract":"The Modular Supercomputing Architecture (MSA) is a system design that orchestrates heterogeneous computer resources (CPUs, GPUs, many-core accelerators, disruptive technologies, etc.) at system-level, organizing them in compute modules. Modules are clusters of potentially large size, each configured with a specific type of user requirement in mind. The different modules are interconnected via a high-speed network, and a common software stack brings all modules together creating a unique machine. The MSA aims at supporting a large diversity of applications and has been developed at the Jülich Supercomputing Centre (JSC) through the EU-funded DEEP projects.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127141699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EDAML 2022 Invited Speaker 5: Combining Optimization and Machine Learning in Physical Design","authors":"L. Behjat","doi":"10.1109/IPDPSW55747.2022.00198","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00198","url":null,"abstract":"The exponential increase in computing power and the availability of big data have ignited innovations in EDA. The most recent trend in innovations has involved using machine learning algorithms for solving problems of scale. Machine learning techniques can solve large-scale problems efficiently once they are trained. However, their training takes a large amount of computing power and might not translate well from one type of problem to another. On the other hand, many of the existing algorithms in physical design take advantage of mathematical optimization techniques to improve their solution quality. These techniques can find optimal or near-optimal solutions using fast heuristics. These techniques do not require a large amount of data but need some level of insight into the nature of the problem by the designer. The mathematical optimization techniques rely heavily on the developed models. In this talk, we will discuss how machine learning can be used to develop better models for optimization problems and how optimization techniques can then use the models to generate more data to improve the accuracy and robustness of machine learning techniques. We will first discuss the algorithm-driven nature of the optimization techniques and compare that to the data-driven nature of the machine learning techniques. We will use examples of physical design placement and routing. Then, we will discuss how optimization and ML can be used to solve the problems of scale both in numbers and transistor sizes. We will also discuss how reinforcement learning can be used to come up with new heuristics for solving the problems encountered in physical design. The talk will end with some practical suggestions on how to improve the quality and speed of the design.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126971479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synchronous parallel multisplitting method with convergence acceleration using a local Krylov-based minimization for solving linear systems","authors":"Médane A. Tchakorom, R. Couturier, Jean-Claude Charr","doi":"10.1109/IPDPSW55747.2022.00146","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00146","url":null,"abstract":"Computer simulations of physical phenomena, such as heat transfer, often require the solution of linear equations. These linear equations occur in the form Ax $=mathbf{b}$, where A is a matrix, $mathbf{b}$ is a vector, and $mathbf{x}$ is the vector of unknowns. Iterative methods are the most adapted to solve large linear systems because they can be easily parallelized. This paper presents a variant of the multisplitting iterative method with convergence acceleration using the Krylov-based minimization method. This paper particularly focuses on improving the convergence speed of the method with an implementation based on the PETSc (Portable Extensible Toolkit for Scientific Computation) library. This was achieved by reducing the need for synchronization - data exchange - during the minimization process and adding a preconditioner before the multisplitting method. All experiments were performed either over one or two sites of the Grid5000 platform and up to 128 cores were used. The results for solving a 2D Laplacian problem of size 10242 components, show a speed up of up to 23X and 86X when respectively compared to the algorithm in [8] and to the general multisplitting implementation.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124914648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Optimization for Sparse Data on Heterogeneous GPUs","authors":"Yujing Ma, Florin Rusu, Kesheng Wu, A. Sim","doi":"10.1109/IPDPSW55747.2022.00177","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00177","url":null,"abstract":"Motivated by extreme multi-label classification applications, we consider training deep learning models over sparse data in multi-GPU servers. The variance in the number of non-zero features across training batches and the intrinsic GPU heterogeneity combine to limit accuracy and increase the time to convergence. We address these challenges with Adaptive SGD, an adaptive elastic model averaging stochastic gradient descent algorithm for heterogeneous multi-GPUs that is characterized by dynamic scheduling, adaptive batch size scaling, and normalized model merging. Instead of statically partitioning batches to GPUs, batches are routed based on the relative processing speed. Batch size scaling assigns larger batches to the faster GPUs and smaller batches to the slower ones, with the goal to arrive at a steady state in which all the GPUs perform the same number of model updates. Normalized model merging computes optimal weights for every GPU based on the assigned batches such that the combined model achieves better accuracy. We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy and is scalable with the number of GPUs.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"190 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121624075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}