{"title":"Towards a GPU accelerated selective sparsity multilayer perceptron algorithm using K-Nearest Neighbors search","authors":"B. H. Meyer, Wagner M. Nunan Zola","doi":"10.1145/3547276.3548634","DOIUrl":"https://doi.org/10.1145/3547276.3548634","url":null,"abstract":"The use of artificial neural networks and deep learning is common in several areas of knowledge. In many situations, it is necessary to use neural networks with many neurons. For example, the Extreme Classification problems can use neural networks that process more than 500,000 classes and inputs with more than 100,000 dimensions, which can make the training process unfeasible due to the high computational cost required. To overcome this limitation, several techniques were proposed in past works, such as the SLIDE algorithm, whose implementation is based on the construction of hash tables and on CPU parallelism. This work proposes the SLIDE-GPU, which replaces the use of hash tables by algorithms that use GPU to search for approximate neighbors, or approximate nearest neighbors (ANN) search. In addition, SLIDE-GPU also proposes the use of GPU to accelerate the activation step of neural networks. Among the experiments carried out, it was possible to notice a training process acceleration of up to 268% in execution time considering the inference accuracy, although currently maintaining the backpropagation phase with CPU processing. This suggests that further acceleration can be obtained in future work, by using massive parallelism in the entire process. The ANN-based technique provides better inference accuracy at each epoch, which helps producing the global acceleration, besides using the GPU in the neuron activation step. The GPU neuron activation acceleration reached a 28.09 times shorter execution time compared to the CPU implementation on this step alone.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129570742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Support of the Scan Vector Model for RISC-V Vector Extension","authors":"Hung-Ming Lai, Jenq-Kuen Lee","doi":"10.1145/3547276.3548518","DOIUrl":"https://doi.org/10.1145/3547276.3548518","url":null,"abstract":"RISC-V vector extension (RVV) provides wide vector registers, which is applicable for workloads with high data-level parallelism such as machine learning or cloud computing. However, it is not easy for developers to fully utilize the underlying performance of a new architecture. Hence, abstractions such as primitives or software frameworks could be employed to ease this burden. Scan, also known as all-prefix-sum, is a common building block for many parallel algorithms. Blelloch presented an algorithmic model called the scan vector model, which uses scan operations as primitives, and demonstrates that a broad range of applications and algorithms can be implemented by them. In our work, we present an efficient support of the scan vector model for RVV. With this support, parallel algorithms can be developed upon those primitives without knowing the details of RVV while gaining the performance that RVV provides. In addition, we provide an optimization scheme related to the length multiplier feature of RVV, which can further improve the utilization of the vector register files. The experiment shows that our support of scan and segmented scan for RVV can achieve 2.85x and 4.29x speedup, respectively, compared to the sequential implementation. With further optimization using the length multiplier of RVV, we can improve the previous result to 21.93x and 15.09x speedup.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127011867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hybrid Data-flow Visual Programing Language*","authors":"Hongxin Wang, Qiuming Luo, Zheng Du","doi":"10.1145/3547276.3548525","DOIUrl":"https://doi.org/10.1145/3547276.3548525","url":null,"abstract":"In this paper, we introduced a Hybrid Data-flow Visual Programing Language (HDVPL), which is an extended C/C++ language with a visual frontend and a dataflow runtime library. Although, most of the popular dataflow visual programming languages are designed for specialized purposes, HDVPL is for general-purpose programming. Unlike the others, the dataflow node behavior of HDVPL can be customized by programmer. Our intuitive visual interface can easily build a general-purpose dataflow program. It provides a visual editor to create nodes and connect them to form a DAG of dataflow task. This makes the beginner of computer programming capable of building parallel programs easily. With subgraph feature, complex hierarchical graphs can be built with container node. After the whole program is accomplished, the HDVPL can translate it into text-based source code and compile it into object file, which will be linked with HDVPL dataflow runtime library. To visualize dataflow programs in runtime, we integrated our dataflow library with frontend visual editor. The visual frontend will show the detailed information about the running program in console window.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133894347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fast and Secure AKA Protocol for B5G","authors":"Jung-Hsien Wu, Jie Yang, Yung-Chin Chang, Min-Te Sun","doi":"10.1145/3547276.3548440","DOIUrl":"https://doi.org/10.1145/3547276.3548440","url":null,"abstract":"With the popularity of mobile devices, the mobile service requirements are now changing rapidly. This implies that the micro network operator dedicated to a specific sector of users has the potential to improve the 5G architecture in terms of scalability and autonomy. However, the traditional AKA protocol does not allow the micro operator to authenticate mobile users independently. To solve this problem, we propose the Fast AKA protocol, which disseminates a subscriber’s profile among base stations via a Blockchain and mutually authenticates the subscriber and serving base station locally for roaming. The proposed architecture speeds up the authentication process, provides forward/backward secrecy, and resists replay attack as well as man-in-the-middle attack. We believe that Fast AKA can serve as a cornerstone for B5G.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"120 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115839598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A User-Based Bike Return Algorithm for Docked Bike Sharing Systems","authors":"Donghui Chen, Kazuya Sakai","doi":"10.1145/3547276.3548443","DOIUrl":"https://doi.org/10.1145/3547276.3548443","url":null,"abstract":"Recently, the development of Internet connection, intelligence, and sharing in the bicycle industry has assisted bike sharing systems (BSS’s) in establishing a connection between public transport hubs. In this paper, we propose a novel user-based bike return (UBR) algorithm for docked BSS’s which leverages a dynamic price adjustment mechanism so that the system is able to rebalance the number of lent and returned bikes by itself at different docks nearby. The proposed scheme motivates users to return their bikes to other underflow docks close-by their target destinations through a cheaper plan to compensate the shortage in them. Consequentially, the bike sharing system is able to achieve dynamic self-balance and the operational cost of the entire system for operators is reduced while the satisfaction of users is significantly increased. The simulations are conducted using real traces, called Citi Bike, and the results demonstrate that the proposed UBR achieves its design goals.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116190324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenMP Offloading in the Jetson Nano Platform","authors":"Ilias K. Kasmeridis, V. Dimakopoulos","doi":"10.1145/3547276.3548517","DOIUrl":"https://doi.org/10.1145/3547276.3548517","url":null,"abstract":"The nvidia Jetson Nano is a very popular system-on-module and developer kit which brings high-performance specs in a small and power-efficient embedded platform. Integrating a 128-core gpu and a quad-core cpu, it provides enough capabilities to support computationally demanding applications such as AI inference, deep learning and computer vision. While the Jetson Nano family supports a number of apis and libraries out of the box, comprehensive support of OpenMP, one of the most popular apis, is not readily available. In this work we present the implementation of an OpenMP infrastructure that is able to harness both the cpu and the gpu of a Jetson Nano board using the offload facilities of the recent versions of the OpenMP specifications. We discuss the compiler-side transformations of key constructs, the generation of cuda-based code as well as how the runtime support is provided. We also provide experimental results for a number of applications, exhibiting performance comparable with their pure cuda versions.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126680829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extracting High Definition Map Information from Aerial Images","authors":"Guan-Wen Chen, Hsueh-Yi Lai, Tsì-Uí İk","doi":"10.1145/3547276.3548442","DOIUrl":"https://doi.org/10.1145/3547276.3548442","url":null,"abstract":"Compared with traditional digital maps, high definition maps (HD maps) collect information in lane-level instead of road-level, and provide more diverse and detailed road network information, including lane markings, speed limits, rules, and intersection junction. HD maps can be used for driving navigation and autonomous driving cars with high-precision information to improve driving safety. However, it takes a lot of time to construct the HD map, so that the HD map cannot be widely used in applications at present. This paper proposes a method to identify road information through semantic image segmentation algorithm from aerial traffic images, and then convert it into the open source HD map standard format, which is OpenDRIVE. Through experiments, 13 categories of lane markings can be identified with mIoU of 84.3% and mPA of 89.6%.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116259525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Runtime Techniques for Automatic Process Virtualization","authors":"Evan Ramos, Sam White, A. Bhosale, L. Kalé","doi":"10.1145/3547276.3548522","DOIUrl":"https://doi.org/10.1145/3547276.3548522","url":null,"abstract":"Asynchronous many-task runtimes look promising for the next generation of high performance computing systems. But these runtimes are usually based on new programming models, requiring extensive programmer effort to port existing applications to them. An alternative approach is to reimagine the execution model of widely used programming APIs, such as MPI, in order to execute them more asynchronously. Virtualization is a powerful technique that can be used to execute a bulk synchronous parallel program in an asynchronous manner. Moreover, if the virtualized entities can be migrated between address spaces, the runtime can optimize execution with dynamic load balancing, fault tolerance, and other adaptive techniques. Previous work on automating process virtualization has explored compiler approaches, source-to-source refactoring tools, and runtime methods. These approaches achieve virtualization with different tradeoffs in terms of portability (across different architectures, operating systems, compilers, and linkers), programmer effort required, and the ability to handle all different kinds of global state and programming languages. We implement support for three different related runtime methods, discuss shortcomings and their applicability to user-level virtualized process migration, and compare performance to existing approaches. Compared to existing approaches, one of our new methods achieves what we consider the best overall functionality in terms of portability, automation, support for migration, and runtime performance.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121200917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Training reinforcement learning models via an adversarial evolutionary algorithm","authors":"M. Coletti, Chathika Gunaratne, Catherine D. Schuman, Robert M. Patton","doi":"10.1145/3547276.3548635","DOIUrl":"https://doi.org/10.1145/3547276.3548635","url":null,"abstract":"When training for control problems, more episodes used in training usually leads to better generalizability, but more episodes also requires significantly more training time. There are a variety of approaches for selecting the way that training episodes are chosen, including fixed episodes, uniform sampling, and stochastic sampling, but they can all leave gaps in the training landscape. In this work, we describe an approach that leverages an adversarial evolutionary algorithm to identify the worst performing states for a given model. We then use information about these states in the next cycle of training, which is repeated until the desired level of model performance is met. We demonstrate this approach with the OpenAI Gym cart-pole problem. We show that the adversarial evolutionary algorithm did not reduce the number of episodes required in training needed to attain model generalizability when compared with stochastic sampling, and actually performed slightly worse.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"402 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133610297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pipelined Compression in Remote GPU Virtualization Systems using rCUDA: Early Experiences","authors":"Cristian Peñaranda Cebrián, C. Reaño, F. Silla","doi":"10.1145/3547276.3548628","DOIUrl":"https://doi.org/10.1145/3547276.3548628","url":null,"abstract":"The amount of Internet of Things (IoT) devices has been increasing in the last years. These are usually low-performance devices with slow network connections. A common improvement is therefore to perform some computations at the edge of the network (e.g. preprocessing data), thereby reducing the amount of data sent through the network. To enhance the computing capabilities of edge devices, remote virtual Graphics Processing Units (GPUs) can be used. Thus, edge devices can leverage GPUs installed in remote computers. However, this solution requires exchanging data with the remote GPU across the network, which as mentioned is typically slow. In this paper we present a novel approach to improve communication performance of edge devices using rCUDA remote GPU virtualization framework. We implement within this framework on-the-fly pipelined data compression, which is done transparently to applications. We use four popular machine learning samples to carry out an initial performance exploration. The analysis is done using a slow 10 Mbps network to emulate the conditions of these devices. Early results show potential improvements provided some current issues are addressed.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134316450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}