{"title":"Fine-Grained Parallel Solution for Solving Sparse Triangular Systems on Multicore Platform Using OpenMP Interface","authors":"Sirine Marrakchi, M. Jemni","doi":"10.1109/HPCS.2017.102","DOIUrl":"https://doi.org/10.1109/HPCS.2017.102","url":null,"abstract":"This paper describes and analyses a novel method to improve the parallel performance of solving sparse triangular systems (spTRSV). The main objective of this study is to reduce both the total idle time of the processors and the execution time. The developed solution is also suitable for both sparse and banded structures. To evaluate and validate our contribution, a series of experiments was carried out on a multicore platform using the OpenMP interface. Experimental results show the efficiency of the proposed technique compared with previous studies.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"151 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115918017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iAgile: Mission Critical Military Software Development","authors":"L. Benedicenti, A. Messina, A. Sillitti","doi":"10.1109/HPCS.2017.87","DOIUrl":"https://doi.org/10.1109/HPCS.2017.87","url":null,"abstract":"This paper reports the experience of applying agile methods in the defense sector, which is characterized mostly by embedded and mission-critical software. We describe the experience of creating a Command and Control system for the 4th Logistic Department of the Italian Army's General Staff. The project was approved by the Army as a pilot to determine whether it would be possible to reduce development costs and, at the same time, produce a product more responsive to the changing conditions in the theatre of operations, where confrontation has often become asymmetric and requires much faster reaction times than the conventional approach allows. After 13 five-week sprints, we were able to deliver a complete product that met all user requirements and satisfied the Army's regulatory requirements. Achieving this result required a concerted effort to change the development culture, but even when this effort is counted as part of the development costs, the total development costs were lower than those of the traditional development method. This paper summarizes the experience, quantifying the results whenever possible and supporting the observed positive results with appropriate data.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114415304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory Aware Poisson Solver for Peta-Scale Simulations with one FFT Diagonalizable Direction","authors":"G. Oyarzun, R. Borrell, F. Trias, A. Oliva","doi":"10.1109/HPCS.2017.26","DOIUrl":"https://doi.org/10.1109/HPCS.2017.26","url":null,"abstract":"Problems with some sort of divergence constraint are found in many disciplines; computational fluid dynamics, linear elasticity and electrostatics are examples. Such a constraint leads to a Poisson equation, which is usually one of the most computationally intensive parts of scientific simulation codes. In this work, we present a memory-aware Poisson solver for problems with one Fourier-diagonalizable direction. This diagonalization decomposes the original 3D system into a set of independent 2D subsystems. The proposed algorithm focuses on optimizing memory allocations and transactions by taking into account redundancies among these 2D subsystems. Moreover, we exploit the uniformity of the solver along the periodic direction to vectorize it. Additionally, our novel approach automatically optimizes the choice of the preconditioner used to solve each frequency subsystem and dynamically balances its parallel distribution. Altogether, these techniques constitute a highly efficient and robust HPC Poisson solver that has been successfully tested on up to 16384 CPU cores.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"468 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116603290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Automation Framework for Benchmarking and Optimizing Performance of Remote Desktops in the Cloud","authors":"A. Pandey, Lan Vu, Vivek Puthiyaveettil, Hari Sivaraman, Uday Kurkure, Aravind Bappanadu","doi":"10.1109/HPCS.2017.113","DOIUrl":"https://doi.org/10.1109/HPCS.2017.113","url":null,"abstract":"Amid the current trend of moving everything into the cloud, remote desktops are no exception. Benchmarking virtual remote desktops for performance optimization is an important task for the successful development and deployment planning of the virtual desktop infrastructure (VDI) used to deliver remote desktops. This task is very challenging at cloud scale because of the rapid evolution of VDI software architectures and the very large number of remote desktops to be managed. In this paper, we present a new framework for evaluating VDI performance that can simulate real-world VDI workloads and measure important performance metrics at scale. Its design aims to make performance benchmarking tasks easy to automate and to adapt flexibly to changes in VDI software architecture, addressing two major limitations of the existing solution. For evaluation, we present performance results of this framework.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123622251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Transaction-Based GPU Implementation of Minimum Spanning Forest Algorithm","authors":"Shayan Manoochehri, B. Goodarzi, D. Goswami","doi":"10.1109/HPCS.2017.100","DOIUrl":"https://doi.org/10.1109/HPCS.2017.100","url":null,"abstract":"General Purpose GPUs (GPGPUs) are ideal platforms for parallel execution of applications with regular shared memory access patterns. However, the majority of real-world multithreaded applications access shared memory with irregular patterns. The Minimum Spanning Forest (MSF) calculation arises in many real-world applications. Boruvka's algorithm for calculating the MSF exposes the most parallelism; however, it is a challenging irregular algorithm to implement on GPUs. In this paper we show that a transaction-based design and implementation of Boruvka's algorithm on GPUs can handle some of the challenges arising from irregularity. First, we identify the hotspots of the algorithm that are the main bottlenecks: edge discovery and merge. The edge discovery phase is implemented using lock-free synchronization after extracting certain algebraic properties (e.g., monotonicity) of the computation. The merge phase, however, lacks such algebraic properties, and hence we use a Software Transactional Memory (STM) based synchronization method. STM offers ease of use by guaranteeing deadlock- and livelock-free behavior, as opposed to blocking lock-based synchronization. It also increases programmability by providing high-level abstractions for synchronization, which facilitate a natural transition from algorithm design to implementation. In addition, we employ several optimization techniques in different phases of the algorithm to achieve load balance and better GPU resource utilization. Experimental results show that our GPU-based implementation outperforms both the fastest sequential implementation and the existing STM-based implementation on multicore CPUs when tested on large-scale graphs with diverse densities.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128342912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Data-Driven Task Allocation for Future Many-Cluster On-chip Systems","authors":"A. Scionti, Somnath Mazumdar, A. Portero","doi":"10.1109/HPCS.2017.81","DOIUrl":"https://doi.org/10.1109/HPCS.2017.81","url":null,"abstract":"Continuous demand for higher performance is putting more pressure on hardware designers to provide faster machines with low energy consumption. Recent technological advancements allow a group of silicon dies to be placed on top of a conventional interposer (silicon layer), which provides space to integrate logic and interconnection resources to manage active processing cores. However, such an abundance of resources requires an adequate Program eXecution Model (PXM) as well as an efficient mechanism to allocate resources in the system. From this perspective, fine-grain data-driven PXMs represent an attractive solution to reduce the cost of synchronising concurrent activities. The contribution of this work is twofold. First, a hardware architecture called TALHES (Task ALlocator for HEterogeneous System) is proposed to support scheduling of multi-threaded applications (adhering to an explicit data-driven PXM). TALHES introduces a Network-on-Chip (NoC) extension: i) on-chip 2D-mesh NoCs are used to support locality of computation in the execution of a single task; ii) a global task scheduler integrated into the silicon interposer orchestrates application tasks among different clusters of cores (possibly with different computing capabilities). The second contribution of the paper is a simulation framework tailored to support the analysis of such fine-grain data-driven applications. In this work, Linux Containers are used to abstract and efficiently simulate clusters of cores (i.e., a single die), as well as the behaviour of the global scheduling unit.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121053356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Adaptively Restrained Molecular Dynamics","authors":"Krishnavir Singh, D. F. Marin, S. Redon","doi":"10.1109/HPCS.2017.55","DOIUrl":"https://doi.org/10.1109/HPCS.2017.55","url":null,"abstract":"Force computations are among the most time-consuming parts of Molecular Dynamics (MD) simulations. Adaptively Restrained Molecular Dynamics (ARMD) makes it possible to perform fewer force calculations by adaptively restraining particles' positions. This paper introduces parallel algorithms for single-pass incremental force computations that take advantage of adaptive restraints using the Message Passing Interface (MPI) standard. The proposed algorithms are implemented and validated in LAMMPS; however, they can be applied to other MD simulators. We compared the performance and scalability of our algorithms with those of LAMMPS.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124398453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing the Memory Footprint of an Eikonal Solver","authors":"Daniel Ganellari, G. Haase","doi":"10.1109/HPCS.2017.57","DOIUrl":"https://doi.org/10.1109/HPCS.2017.57","url":null,"abstract":"The numerical solution of the Eikonal equation follows the fast iterative method applied to tetrahedral meshes. Therein, the main operations in each discretization element τ involve various inner products in the M-metric, such as $(\\vec{e}_{k,s}, \\vec{e}_{s,\\ell})_{M_\\tau} = \\vec{e}_{k,s}^{T} \\cdot M_\\tau \\cdot \\vec{e}_{s,\\ell}$, with $\\vec{e}_{s,\\ell}$ as the connecting edge between vertices s and ℓ in element τ. Instead of passing all coordinates of the tetrahedron together with the 6 entries of $M_\\tau$, we precompute these inner products and use only them in the wave front computation. This first change requires fewer memory transfers for each tetrahedron. The second change is motivated by the fact that $(\\vec{e}_{k,s}, \\vec{e}_{s,\\ell})_{M_\\tau}$ ($k \\neq \\ell$) represents an angle of a surface triangle, whereas $(\\vec{e}_{k,s}, \\vec{e}_{k,s})_{M_\\tau}$ represents the length of an edge in the M-metric. Basic geometry and vector arithmetic lead to the conclusion that the angle information can be expressed by combining three edge lengths. Therefore, we only have to precompute the 6 edge lengths of a tetrahedron and compute the remaining 12 angle values on the fly, which reduces the memory footprint per tetrahedron to 6 numbers. The efficient implementation of the two changes requires a local Gray-code numbering of the edges in the tetrahedron and a number of bit shifts to assign the appropriate data. First numerical experiments on CPUs show that the reduced-memory-footprint approach is faster than the original implementation. Detailed investigations as well as a CUDA implementation are ongoing work.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126223545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Limitations of Energy Expenditure Calculation Based on a Mobile Phone Accelerometer","authors":"I. Pires, Virginie Felizardo, Nuno Pombo, N. Garcia","doi":"10.1109/HPCS.2017.29","DOIUrl":"https://doi.org/10.1109/HPCS.2017.29","url":null,"abstract":"Sensors available in a mobile device, e.g., a smartphone, a smartwatch, or others, allow the capture of several signals that may be used to estimate energy expenditure. This paper describes the adaptation of previous research, using different signals and validated against a gold standard, which consists of comparing the units of the data acquired by a tri-axial accelerometer and an electromyography signal with those of the data collected by a mobile device accelerometer. The validation of the system showed that the energy expenditure estimate may not be as accurate as expected. The data related to this research are available in an open repository, and the platform is available for testing. Creating a validated method for measuring energy expenditure during physical activities that is suitable for implementation in a mobile application is an important step toward increasing confidence in mobile applications in this market area.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"161 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127332713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Implementation of SDN-enhanced MPI Broadcast Targeting a Fat-Tree Interconnect","authors":"H. Morimoto, Khureltulga Dashdavaa, Keichi Takahashi, Y. Kido, S. Date, S. Shimojo","doi":"10.1109/HPCS.2017.46","DOIUrl":"https://doi.org/10.1109/HPCS.2017.46","url":null,"abstract":"To meet the rising demands on high-performance computing, the number of computing nodes composing a high-performance computing system has been continuously growing. Simultaneously, the complexity of the network linking such computing nodes, i.e., the interconnect, has also been increasing. Taking the scale-out of computing nodes in future high-performance computing systems into consideration, it is unrealistic to keep adding nodes while following the strategy of building network capacity sufficient to accommodate maximum traffic. We have worked on SDN-enhanced MPI based on the challenging idea that network traffic should be controlled according to the time-variant requirements of applications running on high-performance computing systems. In particular, this paper aims to accelerate MPI_Bcast execution through the use of Software Defined Networking (SDN), targeting a high-performance computing system with a Fat-tree interconnect. The MPI_Bcast proposed in this paper builds a data delivery tree based on traffic information obtained from the SDN switches that compose the deployed interconnect. Our evaluation showed that the proposed MPI_Bcast executed up to 8.6 times faster than our previous MPI_Bcast implementation when 700 Mbps of pseudo traffic was flowing on the Fat-tree interconnect.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134017591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}