{"title":"I/O Performance Evaluation of Large-Scale Deep Learning on an HPC System","authors":"Minho Bae, Minjoong Jeong, Sangho Yeo, Sangyoon Oh, Oh-Kyoung Kwon","doi":"10.1109/HPCS48598.2019.9188225","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188225","url":null,"abstract":"Recently, deep learning has become important in diverse fields. Because the process requires a huge amount of computing resources, many researchers have proposed methods to utilize large-scale clusters to reduce the training time. Despite many proposals concerning the training process for large-scale clusters, there remain areas to be developed. In this study, we benchmark the performance of Intel-Caffe, which is a generalpurpose distributed deep learning framework on the Nurion supercomputer of the Korea Institute of Science and Technology Information. We particularly focus on identifying the file I/O factors that affect the performance of Intel-Caffe, as well as a performance evaluation in a container-based environment. Finally, to the best of our knowledge, we present the first benchmark results for distributed deep learning in the container-based environment for a large-scale cluster.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125686085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simplifying the multi-GPU programming of a hyperspectral image registration algorithm","authors":"Jorge Fernández-Fabeiro, Arturo González-Escribano, D. Ferraris","doi":"10.1109/HPCS48598.2019.9188064","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188064","url":null,"abstract":"Hyperspectral image registration is a relevant task for real-time applications like environmental disasters management or search and rescue scenarios. Traditional algorithms for this problem were not really devoted to real-time performance. The HYFMGPU algorithm arose as a high-performance GPU-based solution to solve such a lack. Nevertheless, a single-GPU solution is not enough, as sensors are evolving and then generating images with finer resolutions and wider wavelength ranges. An MPI+CUDA multi-GPU implementation of HYFMGPU was previously presented. However, this solution shows the programming complexity of combining MPI with an accelerator programming model. In this paper we present a new and more abstract programming approach for this type of applications, which provides a high efficiency while simplifying the programming of the multi-device parts of the code. The solution uses Hitmap, a library to ease the programming of parallel applications based on distributed arrays. It uses a more algorithm-oriented approach than MPI, including abstractions for the automatic partition and mapping of arrays at runtime with arbitrary granularity, as well as techniques to build flexible communication patterns that transparently adapt to the data partitions. We show how these abstractions apply to this application class. We present a comparison of development effort metrics between the original MPI implementation and the one based on Hitmap, with reductions of up to 95% for the Halstead score in specific work redistribution steps. We finally present experimental results showing that these abstractions are internally implemented in a high efficient way that can reduce the overall performance time in up to 37% comparing with the original MPI implementation.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126219491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel construction of the Symbolic Observation Graph","authors":"Hiba Ouni, Kais Klai, Belhassen Zouari","doi":"10.1109/HPCS48598.2019.9188053","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188053","url":null,"abstract":"Extended AbstractAn efficient way to cope with the combinatorial explosion problem induced by the model checking process is to compute the Symbolic Observation Graph (SOG) which is a condensed representation of the state space graph based on a symbolic encoding of the nodes. Another way is to parallelize the construction/traversal of the state space on multiple processors. In this paper, we combine the two mentioned approaches by proposing three different approaches to parallelize the construction of the SOG. A multi-threaded approach based on a dynamic load balancing and a shared memory architecture, a distributed approach based on a distributed memory architecture and a hybrid approach that combines the two previous approaches.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128100879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inward Fractal Dual Band High Gain Compact Antenna","authors":"M. Madi, Maria Moussa, K. Kabalan","doi":"10.1109/HPCS48598.2019.9188220","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188220","url":null,"abstract":"This paper presents a microstrip antenna with nested two folds fractal shape. The fractal geometry of the patch has a slotted square boundary and spheres connected to its sides. The antenna was first optimized over a 5 x 5 cm2 area, so that dual frequency operation is obtained at the IMT band and in the Radiolocation Service at 1.4 GHz and 3.2 GHz respectively. Applications include breast cancer detection in addition to a wide range of wireless devices due to the radiation pattern and directivity characteristics. The design size is further minimized targeting the ISM frequency of 2.45 GHz. A 4 x 5 cm2 and 4 x 4 cm2 design were successfully simulated with high 10 dB gain. Results of measurements coincide very well with simulations results.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126061965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assembly micro-benchmark generator for characterizing Floating Point Units","authors":"Jean Pourroy, P. Demichel, C. Denis","doi":"10.1109/HPCS48598.2019.9188209","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188209","url":null,"abstract":"Making the right platform choice has always been a challenge for the HPC users no matter the applications vertical they are in. The number of references is very large and making the wrong choice can have adverse effects. Formerly users only had to choose between, for example, the different processors and interconnect vendors. Lately, due to the new Intel Skylake processors the choice has become increasingly difficult as different levels of performance are available within the same vendor platforms. To facilitate selection and give possible directions for the real benchmarked applications we introduce the Kernel Generator, an open source tool generating assembly kernels to help the programmer or the benchmarker understand the behavior of the different micro-architectures. We used our tool to study the behavior of the current micro-architectures and compare it to the current synthetic benchmarks which sometimes are not correctly characterizing a platform nor expose its strengths. The Kernel Generator facilitates the discovery of the platforms performance fit. To insure the relevance of our kernel, we are looking at Ansys Fluent behavior to explain the performance on the different Intel processors. In this case, we have that 4100 and 6100 Intel processors families can have equivalent performance on codes not well vectorized: Fluent being one of them. This demonstrates that we can use our tool for initial profiling and understanding of the different platforms.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131578678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of a Self-Similar GPU Thread Map for Data-parallel m-Simplex Domains","authors":"C. Navarro, B. Bustos, N. Hitschfeld-Kahler","doi":"10.1109/HPCS48598.2019.9188081","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188081","url":null,"abstract":"This work analyzes the possible performance benefits one could obtain by employing a Self-Similar type of GPU thread map on data-parallel m-simplex domains, which is the geometrical representation of several interaction problems. The main contributions of this work are (1) the proposal of a new block-space map H: $mathbb{Z}^{m}mapsto mathbb{Z}^{m}$ based on a self-similar set of sub-orthotopes, and (2) its analysis in terms of performance and thread space, from which we obtain that $mathcal{H}(omega)$ is time and space efficient for 2-simplices and only time efficient for 3-simplices unless the theoretical model is relaxed to allow concurrent parallel spaces. Experimental tests on a 2-simplex domain support the theoretical results, giving up to 30% of speedup over the standard approach. We also show how the map can utilize GPU tensor cores and further accelerate through fast matrix-multiply-accumulate operations. Finally, we show that extending the map to general m-simplices is a non-trivial optimization problem and depends of the choice of two parameters $r, beta$, for which we provide some insights in order to obtain a $mathcal{H}(omega)$ map that can be $m!$ times more space efficient than a bounding-box approach.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"249 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124737608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Performance Computing for Formal Security Assessment","authors":"L. Spalazzi, Francesco Spegni","doi":"10.1109/HPCS48598.2019.9188122","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188122","url":null,"abstract":"Assessing the degree of security of a given system w.r.t. some attacker model and security policy can be done by means of formal methods. For instance, the system can be described as a Markov Decision Process, the security policy by means of a modal logic formula, PCTL⋆, and then a probabilistic model checker can return the probability with which the policy holds in the system. This methodology suffices when all the system parameters and their values are known a priori. On the other side, in case the degree of security of the system depends on the values of the system parameters, the formally security assessment task must output a probability function which takes the system parameters and returns the probability of a successful attack to the security of the system. One simple way to describe such function involves solving many instances of the probabilistic model checking problem, one for each combination of the parameter values. In this scenario, probabilistic model checking, which suffers from the state explosion problem, may become an unfeasible task for traditional workstations or even servers.In this work we introduce the tool SecMC which drives the user in the task of modeling the system under analysis and the required security policies, together with the parameters that affect them. Next, the user can specify the range of values assumed by the parameters, and the tool can take care of iterating the probabilistic model checking task, distributing the computations among different local or remote nodes of a cluster, and collect the results to produce a combined picture of how the level of security varies w.r.t. the parameter values.In this paper we show how the tool can be used in order to formally assess security of probabilistic systems known from the literature, viz. a probabilistic cryptographic protocol, a synchronization algorithm for wireless devices inspired by fireflies in nature, and the privacy of dispersed cloud storages.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133047512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced Autonomous Resource Selection Algorithm for Cooperative Awareness in Vehicular Communication","authors":"Brahmjit Singh, Sandeepika Sharma","doi":"10.1109/HPCS48598.2019.9188190","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188190","url":null,"abstract":"With rapid development in wireless communication, Intelligent Transportations System (ITS) has received significant attention. This system delivers various social and economic benefits including efficient traffic management, lesser prone to accidents, reduced air pollution and enabling of unmanned driving for enhanced leisure. ITS is enabled through real-time communication either Vehicle-to-Vehicle (V2V) or Vehicle-to-Infrastructure (V2I) modes of intra-infrastructure communication. LTE-V2V is seen as a major technology towards ITS implementation. In this paper, we discuss various autonomous resource selection techniques available for Mode-4 V2V communication under LTE-A network. An enhanced resource selection scheme based on exponential averaging in time domain and normalized scaling in frequency domain is being proposed. Simulation results show the efficacy of the proposed algorithm in terms of packet reception ratio (PRR), error rate and update delay. The proposed scheme enables low latency transmission of cooperative awareness messages while maintaining high PRR and low error rate.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115049357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing the data behavior of parallel application for extracting performance knowledge.","authors":"F. Tirado, Alvaro Wong, Dolores Rexachs, E. Luque","doi":"10.1109/HPCS48598.2019.9188166","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188166","url":null,"abstract":"When performance tools are used to analyze an application with thousands of processes, the data generated can be bigger than the memory size of the cluster node, causing this data to be loaded in swap memory. In HPC systems, moving data to swap is not always an option. This problem causes scalability limitations that affect the user experience and it presents serious restrictions for executing on a large scale. In order to obtain knowledge about the application’s performance, the performance tools usually instrument the application to generate the data. When the instrumented parallel application is executed with thousands of processes, the data generated may be higher than the memory size of the compute node used to analyze the data in order to obtain the knowledge. Performance tools such as PAS2P predict the execution time in target machines. In order to predict the performance, PAS2P carries out a data analysis with the data in each application process. The data collected is analyzed sequentially, which results in an inefficient use of system resources. To solve this, we propose designing a parallel method to solve the problem when we manage a high volume of data, decreasing its execution time and increasing scalability, improving the PAS2P toolkit to generate performance knowledge defined by the application’s behavior phases.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134281996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a Scalable and QoS-Aware Load Balancing Platform for Edge Computing Environments","authors":"Charafeddine Mechalikh, Hajer Taktak, Faouzi Moussa","doi":"10.1109/HPCS48598.2019.9188159","DOIUrl":"https://doi.org/10.1109/HPCS48598.2019.9188159","url":null,"abstract":"Edge computing is a new computing paradigm that brings the cloud applications close to the Internet of Things (IoT) devices at the edge of the network. It improves the resources utilization efficiency by using the resources already available at the edge of the network [8]. As a result, it decreases the cloud workload, reduces the latency, and enables a new breed of latency-sensitive applications such as the connected vehicles. Horizontal scalability is another advantage of edge computing. Unlike the cloud and fog computing, the latter takes advantages of the growing number of connected devices, as this growth results in increasing the number of the available resources. Most researches in this field were only interested in finding the optimal tasks offloading destination by minimizing the latency, the resources utilization, and the energy consumption. Therefore, they ignore the effect of the synchronization between the devices, and the applications (i.e. containers) deployment delay. Motivated by the advantages of edge computing, in this paper, we introduce a load balancing platform for IoT-edge computing environments. As opposed to the current trend, we will first focus on the applications deployment and the synchronization between devices in order to provide better scalability, enable a self-manageable IoT network, and meet the quality of service (QoS). According to the simulation results, the proposed approach provides better scalability; it reduces the network utilization and the cloud workload. In addition, it provides better applications deployment delays and a lower latency.","PeriodicalId":371856,"journal":{"name":"2019 International Conference on High Performance Computing & Simulation (HPCS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133464585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}