{"title":"The Viability of Using Online Prediction to Perform Extra Work while Executing BSP Applications","authors":"P. Chen, Pouya Haghi, J.-Y. Chung, Tong Geng, R. West, A. Skjellum, Martin C. Herbordt","doi":"10.1109/HPEC55821.2022.9926405","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926405","url":null,"abstract":"A fundamental problem in parallel processing is the difficulty in efficiently partitioning work: the result is that much of a parallel program's execution time is often spent idle or performing overhead operations. We propose to improve the efficiency of system resource utilization by having idle processes execute extra work. We develop a method whereby the execution of extra work is optimized through performance prediction and the setting of limits (a deadline) on the duration of the extra work execution. In our preliminary experiments of proxy BSP applications on a production supercomputer we find that this approach is promising with all five applications benefiting from this approach, with an average of 12 % improvement.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117070654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Powering Practical Performance: Accelerated Numerical Computing in Pure Python","authors":"Matthew Penn, Chris Milroy","doi":"10.1109/HPEC55821.2022.9926309","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926309","url":null,"abstract":"In this paper, we tackle a generic n-dimensional numerical computing problem to compare performance and analyze tradeoffs between popular frameworks using open source Jupyter notebook examples. Most data science practitioners perform their work in Python because of its high-level abstraction and rich set of numerical computing libraries. However, the choice of library and methodology is driven by complexity-impacting constraints like problem size, latency, memory, physical size, weight, power, hardware, and others. To that end, we demonstrate that a wide selection of GPU-accelerated libraries (RAPIDS, CuPy, Numba, Dask), including the development of hand-tuned CUDA kernels, are accessible to data scientists without ever leaving Python. We address the Python developer community by showing C/C++ is not necessary to access single/multi-GPU acceleration for data science applications. We solve a common numerical computing problem - finding the closest point in array B from every point (and its index) in array A, requiring up to 8.8 trillion distance comparisons - on a GPU-equipped workstation without writing a line of C/C++.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114374474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting Ankle Moment Trajectory with Adaptive Weighted Ensemble of LSTM Networks","authors":"E. Grzesiak, Jennifer Sloboda, H. Siu","doi":"10.1109/HPEC55821.2022.9926370","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926370","url":null,"abstract":"Estimations of ankle moments can provide clinically helpful information on the function of lower extremities and further lead to insight on patient rehabilitation and assistive wearable exoskeleton design. Current methods for estimating ankle moments leave room for improvement, with most recent cutting-edge methods relying on machine learning models trained on wearable sEMG and IMU data. While machine learning eliminates many practical challenges that troubled more traditional human body models for this application, we aim to expand on prior work that showed the feasibility of using LSTM models by employing an ensemble of LSTM networks. We present an adaptive weighted LSTM ensemble network and demonstrate its performance during standing, walking, running, and sprinting. Our result show that the LSTM ensemble outperformed every single LSTM model component within the ensemble. Across every activity, the ensemble reduced median root mean squared error (RMSE) by 0.0017-0.0053 N. m/kg, which is 2.7 – 10.3% lower than the best performing single LSTM model. Hypothesis testing revealed that most reductions in RMSE were statistically significant between the ensemble and other single models across all activities and subjects. Future work may analyze different trajectory lengths and different combinations of LSTM submodels within the ensemble.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125903796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Interactive Autonomous Navigation Simulations on HPC","authors":"W. Brewer, Joel U. Bretheim, John Kaniarz, Peilin Song, Burhman Q. Gates","doi":"10.1109/HPEC55821.2022.9926384","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926384","url":null,"abstract":"We present our work of enabling HPC in an interactive real-time autonomy loop. The workflow consists of many different software components deployed within Singu-larity containers and communicating using both the Robotic Operating System's (ROS) publish-subscribe system and the Message Passing Interface (MPI). We use Singularity's container networking interface (CNI) to enable virtual networking within the containers, so that multiple containers can run the various components using different IP addresses on the same compute node. The Virtual Autonomous Navigation Environment Environmental Sensor Engine (VANE: ESE) is used for physically-realistic simulation of LIDAR along with the Autonomous Navigation Virtual Environment Laboratory (ANVEL) for vehicle simulation. VANE: ESE sends Velodyne UDP LIDAR packets directly to the Robotic Technology Kernel (RTK) and is distributed across multiple compute nodes via MPI along with OpenMP for shared memory parallelism within each compute node. The user interfaces with the navigation environment using an XFCE desk-top with virtual workspaces running over a VNC containerized deployment through a double-hop ssh tunnel, which uses noVNC (a JavaScript-based VNC client) to provide a browser-based client interface. We automate the complete launch process using a custom iLauncher plugin. We benchmark scalable performance with multiple vehicle simulations on four different HPC systems and discuss our findings.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127271769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AutoPager: Auto-tuning Memory-Mapped I/O Parameters in Userspace","authors":"Karim Youssef, Niteya Shah, M. Gokhale, R. Pearce, Wu-chun Feng","doi":"10.1109/HPEC55821.2022.9926409","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926409","url":null,"abstract":"The exponential growth in dataset sizes has shifted the bottleneck of high-performance data analytics from the compute subsystem to the memory and storage subsystems. This bottleneck has led to the proliferation of non-volatile memory (NVM). To bridge the performance gap between the Linux I/O subsystem and NVM, userspace memory-mapped I/O enables application-specific I/O optimizations. Specifically, UMap, an open-source userspace memory-mapping tool, exposes tunable paging parameters to application users, such as page size and degree of paging concurrency. Tuning these parameters is computationally intractable due to the vast search space and the cost of evaluating each parameter combination. To address this challenge, we present Autopager, a tool for auto-tuning userspace paging parameters. Our evaluation, using five data-intensive applications with UMap, shows that Autopager automatically achieves comparable performance to exhaustive tuning with 10 x less tuning overhead. and 16.3 x and 1.52 x speedup over UMap with default parameters and UMap with page-size only tuning, respectively.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"8 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126282235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Edge-Connected Jaccard Similarity for Graph Link Prediction on FPGA","authors":"P. Sathre, Atharva Gondhalekar, Wu-chun Feng","doi":"10.1109/HPEC55821.2022.9926326","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926326","url":null,"abstract":"Graph analysis is a critical task in many fields, such as social networking, epidemiology, bioinformatics, and fraud de-tection. In particular, understanding and inferring relationships between graph elements lies at the core of many graph-based workloads. Real-world graph workloads and their associated data structures create irregular computational patterns that compli-cate the realization of high-performance kernels. Given these complications, there does not exist a de facto “best” architecture, language, or algorithmic approach that simultaneously balances performance, energy efficiency, portability, and productivity. In this paper, we realize different algorithms of edge-connected Jaccard similarity for graph link prediction and characterize their performance across a broad spectrum of graphs on an Intel Stratix 10 FPGA. By utilizing a high-level synthesis (HLS)-driven, high-productivity approach (via the C++-based SYCL language) we rapidly prototype two implementations - a from-scratch edge-centric version and a faithfully-ported commodity GPU implementation - which would have been intractable via a hardware description language. With these implementations, we further consider the benefit and necessity of four HLS-enabled optimizations, both in isolation and in concert - totaling seven distinct synthesized hardware pipelines. Leveraging real-world graphs of up to 516 million edges, we show empirically-measured speedups of up to 9.5 x over the initial HLS implementations when all optimizations work in concert.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130369622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How to Prevent a Sick ASIC","authors":"W. Ellersick","doi":"10.1109/HPEC55821.2022.9926305","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926305","url":null,"abstract":"High performance computing systems increasingly require mixed-signal ASICs to achieve competitive speed, power efficiency and cost. The integration of processing, transceivers, sensors and power management results in dramatic reductions in size, which can yield great savings in power, enabling higher performance. However, few design elements demand such high quality as a mixed-signal ASIC. In this paper, actual near-disasters from decades of integrated circuit design are presented along with methods to prevent potentially severe damage to projects, careers, and even companies. Such stories of failure are rarely told, but acknowledging them is crucial to avoid repeating the mistakes and to reduce ASIC development risk to ultimately ensure success. Key takeaways include planning for failure with designed-in observability, controllability and workarounds; the use of simple and robust circuits; and that organizing the people can be as challenging and important as arranging the transistors.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115075339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved Distributed-memory Triangle Counting by Exploiting the Graph Structure","authors":"Sayan Ghosh","doi":"10.1109/HPEC55821.2022.9926376","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926376","url":null,"abstract":"Graphs are ubiquitous in modeling complex systems and representing interactions between entities to uncover structural information of the domain. Traditionally, graph analytics workloads are challenging to efficiently scale (both strong and weak cases) on distributed memory due to the irregular memory-access driven nature (with little or no computations) of the meth-ods. The structure of graphs and their relative distribution over the processing elements poses another level of complexity, making it difficult to attain sustainable scalability across platforms. In this paper, we discuss enhancements to TriC, a distributed-memory implementation of graph triangle counting using Mes-sage Passing Interface (MPI), which was featured in the 2020 Graph Challenge competition. We have made some incremental enhancements to TriC, primarily adopting a user-defined buffering strategy to overcome the startup problem for large graphs (by fixing the memory for intermediate data), and experimenting with probabilistic data structures such as bloom filter to improve the query response time for assessing edge existence, at the expense of increasing the overall false positive rate. These adjustments have led to a modest improvements in most cases, as compared to the previous version.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134090681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Systolic Array based FPGA accelerator for Yolov3-tiny","authors":"Prithvi Velicheti, Sivani Pentapati, Suresh Purini","doi":"10.1109/HPEC55821.2022.9926371","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926371","url":null,"abstract":"FPGAs are increasingly significant for deploying convolutional neural network (CNN) inference models because of performance demands and power constraints in embedded and data centre applications. Object detection and classification are essential tasks in computer vision. You Only Look Once (YOLO) is a very efficient algorithm for object detection and classification with its variant Yolov3-tiny specially designed for embedded applications. This paper presents the FPGA accelerator for multiple precisions (FIXED-8, FIXED-16, FLOAT32) of YoloV3-tiny. We use a homogenous systolic array architecture with a synchronized pipeline adder tree for convolution, allowing it to be scalable for multiple variants of Yolo with a change in host driver. We evaluated the design on Terasic DE5a-Net-DDR4. The Fixed point (FP-8, FP-16) implementations attain a throughput of 57 GOPs/s (> 23%) and 46.16 GOPs/s (> 340 %). We synthesized the first FLOAT32 imnlementation attaining 11.22 GFLOPs/s.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130849942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of a Novel Scratchpad Memory through Compiler Supported Simulation","authors":"Essa Imhmed, Jonathan J. Cook, Abdel-Hameed A. Badawy","doi":"10.1109/HPEC55821.2022.9926335","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926335","url":null,"abstract":"Local Memory Store (LMStore) is a novel hardware-controlled, compiler-managed Scratchpad memory (SPM) design [1], with an initial research evaluation that showed its possibility for improving program performance. This initial evaluation was performed over memory traces prior to the development of compiler support for LMStore. In this paper, we present compiler support for the LMStore design, and present experimental results that better evaluate LMStore performance. Experimental results on benchmarks from Malardalen benchmark suite [2] executing on the LMStore architecture modeled in Multi2Sim demonstrate that a hybrid LMStore-Cache architecture improves execution time by an average of 19.8 %, compared to a conventional cache-only architecture.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132518558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}