{"title":"Bi-Objective Cost Function for Adaptive Routing in Network-on-Chip","authors":"Asma Benmessaoud Gabis;Pierre Bomel;Marc Sevaux","doi":"10.1109/TMSCS.2018.2810223","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2810223","url":null,"abstract":"This paper proposes a new fully adaptive routing protocol for 2D-mesh Network-on-Chip (NoCs). It is inspired from the A-star search algorithm and called Heuristic based Routing Algorithm (HRA). It is distributed, congestion-aware, and fault-tolerant by using only the local information of each router neighbors. HRA does not use Virtual Channels (VCs) but tries to reduce the risk of deadlock by avoiding the 2-nodes and the 4-nodes loops. HRA is based on a bi-objective weighted sum cost function. Its goal is optimizing latency and throughput. Experiments show that HRA ensures a good reliability rate despite the presence of many faulty links. In addition, our approach reports interesting latencies and average throughput values when a non-dominated solution is chosen.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 2","pages":"177-187"},"PeriodicalIF":0.0,"publicationDate":"2018-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2810223","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68025091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design Methodology for Responsive and Rrobust MIMO Control of Heterogeneous Multicores","authors":"Tiago Mück;Bryan Donyanavard;Kasra Moazzemi;Amir M. Rahmani;Axel Jantsch;Nikil Dutt","doi":"10.1109/TMSCS.2018.2808524","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2808524","url":null,"abstract":"Heterogeneous multicore processors (HMPs) are commonly deployed to meet the performance and power requirements of emerging workloads. HMPs demand adaptive and coordinated resource management techniques to control such complex systems. While Multiple-Input-Multiple-Output (MIMO) control theory has been applied to adaptively coordinate resources for \u0000<italic>single-core</i>\u0000 processors, the coordinated management of HMPs poses significant additional challenges for achieving robustness and responsiveness, due to the unmanageable complexity of modeling the system dynamics. This paper presents, for the first time, a methodology to design robust MIMO controllers with rapid response and formal guarantees for coordinated management of HMPs. Our approach addresses the challenges of: (1) system decomposition and identification; (2) selection of suitable sensor and actuator granularity; and (3) appropriate system modeling to make the system identifiable as well as controllable. We demonstrate the practical applicability of our approach on an ARM big.LITTLE HMP platform running Linux, and demonstrate the efficiency and robustness of our method by designing MIMO-based resource managers.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"944-951"},"PeriodicalIF":0.0,"publicationDate":"2018-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2808524","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental Maintenance of Maximal Bicliques in a Dynamic Bipartite Graph","authors":"Apurba Das;Srikanta Tirthapura","doi":"10.1109/TMSCS.2018.2802920","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2802920","url":null,"abstract":"We consider incremental maintenance of maximal bicliques from a dynamic bipartite graph that changes over time due to the addition of edges. When new edges are added to the graph, we seek to enumerate the change in the set of maximal bicliques, without enumerating the set of maximal bicliques that remain unaffected. The challenge in an efficient algorithm is to enumerate the change without explicitly enumerating the set of all maximal bicliques. In this work, we present (1) Near-tight bounds on the magnitude of change in the set of maximal bicliques of a graph, due to a change in the edge set, and an (2) Incremental algorithm for enumerating the change in the set of maximal bicliques. For the case when a constant number of edges are added to the graph, our algorithm is “change-sensitive”, i.e., its time complexity is proportional to the magnitude of change in the set of maximal bicliques. To our knowledge, this is the first incremental algorithm for enumerating maximal bicliques in a dynamic graph, with a provable performance guarantee. Our algorithm is easy to implement, and experimental results show that its performance exceeds that of baseline implementations by orders of magnitude substructures.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"231-242"},"PeriodicalIF":0.0,"publicationDate":"2018-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2802920","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68026460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Docker Container Scheduler for I/O Intensive Applications Running on NVMe SSDs","authors":"Janki Bhimani;Zhengyu Yang;Ningfang Mi;Jingpei Yang;Qiumin Xu;Manu Awasthi;Rajinikanth Pandurangan;Vijay Balakrishnan","doi":"10.1109/TMSCS.2018.2801281","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2801281","url":null,"abstract":"By using fast back-end storage, performance benefits of a lightweight container platform can be leveraged with quick I/O response. Nevertheless, the performance of simultaneously executing multiple instances of same or different applications may vary significantly with the number of containers. The performance may also vary with the nature of applications because different applications can exhibit different nature on SSDs in terms of I/O types (read/write), I/O access pattern (random/sequential), I/O size, etc. Therefore, this paper aims to investigate and analyze the performance characterization of both homogeneous and heterogeneous mixtures of I/O intensive containerized applications, operating with high performance NVMe SSDs and derive novel design guidelines for achieving an optimal and fair operation of the both homogeneous and heterogeneous mixtures. By leveraging these design guidelines, we further develop a new docker controller for scheduling workload containers of different types of applications. Our controller decides the optimal batches of simultaneously operating containers in order to minimize total execution time and maximize resource utilization. Meanwhile, our controller also strives to balance the throughput among all simultaneously running applications. We develop this new docker controller by solving an optimization problem using five different optimization solvers. We conduct our experiments in a platform of multiple docker containers operating on an array of three enterprise NVMe drives. We further evaluate our controller using different applications of diverse I/O behaviors and compare it with simultaneous operation of containers without the controller. Our evaluation results show that our new docker workload controller helps speed-up the overall execution of multiple applications on SSDs.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"313-326"},"PeriodicalIF":0.0,"publicationDate":"2018-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2801281","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application-Arrival Rate Aware Distributed Run-Time Resource Management for Many-Core Computing Platforms","authors":"Vasileios Tsoutsouras;Sotirios Xydis;Dimitrios Soudris","doi":"10.1109/TMSCS.2018.2793189","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2793189","url":null,"abstract":"Modern many-core computing platforms execute a diverse set of dynamic workloads in the presence of varying application arrival rates. This inflicts strict requirements on run-time management to efficiently allocate system resources. On the way towards kilo-core processor architectures, centralized resource management approaches will most probably form a severe performance bottleneck, thus focus has been turned to the study of Distributed Run-Time Resource Management (DRTRM) schemes. In this article, we examine the behavior of a DRTRM of dynamic applications with malleable characteristics against stressing incoming application interval rate scenarios, using Intel SCC as the target many-core system. We show that resource allocation is highly affected by application input rate and propose an application-arrival aware DRTRM framework implementing an effective admission control strategy by carefully utilizing voltage and frequency scaling on parts of its resource allocation infrastructure. Through extensive experimental evaluation, we quantitatively analyze the behavior of the introduced DRTRM scheme and show that it achieves up to 44 percent performance gains while consuming 31 percent less energy, in comparison to a state-of-art DRTRM solution. In comparison to a centralized RTRM, the respective metric values rise up to 62 and 45 percent performance and energy gains, respectively.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"285-298"},"PeriodicalIF":0.0,"publicationDate":"2018-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2793189","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multilevel Parallelism for the Exploration of Large-Scale Graphs","authors":"Massimo Bernaschi;Mauro Bisson;Enrico Mastrostefano;Flavio Vella","doi":"10.1109/TMSCS.2018.2797195","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2797195","url":null,"abstract":"We present the most recent release of our parallel implementation of the BFS and BC algorithms for the study of large scale graphs. Although our reference platform is a high-end cluster of new generation Nvidia GPUs and some of our optimizations are CUDA specific, most of our ideas can be applied to other platforms offering multiple levels of parallelism. We exploit multi level parallel processing through a hybrid programming paradigm that combines highly tuned CUDA kernels, for the computations performed by each node, and explicit data exchange through the Message Passing Interface (MPI), for the communications among nodes. The results of the numerical experiments show that the performance of our code is comparable or better with respect to other state-of-the-art solutions. For the BFS, for instance, we reach a peak performance of 200 Giga Teps on a single GPU and 5.5 Terateps on 1024 Pascal GPUs. We release our source codes both for reproducing the results and for facilitating their usage as a building block for the implementation of other algorithms.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"204-216"},"PeriodicalIF":0.0,"publicationDate":"2018-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2797195","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68026462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable and Performant Graph Processing on GPUs Using Approximate Computing","authors":"Somesh Singh;Rupesh Nasre","doi":"10.1109/TMSCS.2018.2795543","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2795543","url":null,"abstract":"Graph algorithms are being widely used in several application domains. It has been established that parallelizing graph algorithms is challenging. The parallelization issues get exacerbated when graphics processing units (GPUs) are used to execute graph algorithms. While the prior art has shown effective parallelization of several graph algorithms on GPUs, a few algorithms are still expensive. In this work, we address the scalability issues in graph parallelization. In particular, we aim to improve the execution time by tolerating a little approximation in the computation. We study the effects of four heuristic approximations on six graph algorithms with five graphs and show that if an application allows for small inaccuracy, this can be leveraged to achieve considerable performance benefits. We also study the effects of the approximations on GPU-based processing and provide interesting takeaways.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"190-203"},"PeriodicalIF":0.0,"publicationDate":"2018-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2795543","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68026463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speedup and Power Scaling Models for Heterogeneous Many-Core Systems","authors":"Ashur Rafiev;Mohammed A. N. Al-Hayanni;Fei Xia;Rishad Shafik;Alexander Romanovsky;Alex Yakovlev","doi":"10.1109/TMSCS.2018.2791531","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2791531","url":null,"abstract":"Traditional speedup models, such as Amdahl's law, Gustafson's, and Sun and Ni's, have helped the research community and industry better understand system performance capabilities and application parallelizability. As they mostly target homogeneous hardware platforms or limited forms of processor heterogeneity, these models do not cover newly emerging multi-core heterogeneous architectures. This paper reports on novel speedup and energy consumption models based on a more general representation of heterogeneity, referred to as the normal form heterogeneity, that supports a wide range of heterogeneous many-core architectures. The modelling method aims to predict system power efficiency and performance ranges, and facilitates research and development at the hardware and system software levels. The models were validated through extensive experimentation on the off-the-shelf big. LITTLE heterogeneous platform and a dual-GPU laptop, with an average error of 1 percent for speedup and of less than 6.5 percent for power dissipation. A quantitative efficiency analysis targeting the system load balancer on the Odroid XU3 platform was used to demonstrate the practical use of the method.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"436-449"},"PeriodicalIF":0.0,"publicationDate":"2018-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2791531","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68026458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"$mathsf{CHOAMP}$ : Cost Based Hardware Optimization for Asymmetric Multicore Processors","authors":"Jyothi Krishna Viswakaran Sreelatha;Shankar Balachandran;Rupesh Nasre","doi":"10.1109/TMSCS.2018.2791955","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2791955","url":null,"abstract":"Heterogeneous Multiprocessors (HMPs) are popular due to their energy efficiency over Symmetric Multicore Processors (SMPs). Asymmetric Multicore Processors (AMPs) are a special case of HMPs where different kinds of cores share the same instruction set, but offer different power-performance trade-offs. Due to the computational-power difference between these cores, finding an optimal hardware configuration for executing a given parallel program is quite challenging. An inherent difficulty in this problem stems from the fact that the original program is written for SMPs. This challenge is exacerbated by the interplay of several configuration parameters that are allowed to be changed in AMPs. In this work, we propose a probabilistic method named CHOAMP to choose the bestavailable hardware configuration for a given parallel program. Selection of a configuration is guided by a user-provided run-time property such as energy-delay-product (EDP) and CHOAMP aspires to optimize the property in choosing a configuration. The core part of our probabilistic method relies on identifying the behavior of various program constructs in different classes of CPU cores in the AMP, and how it influences the cost function of choice. We implement the proposed technique in a compiler which automatically transforms a code optimized for SMP to run efficiently over an AMP, eliding requirement of any user annotations. CHOAMP transforms the same source program for different hardware configurations based on different user requirement. We evaluate the efficiency of our method for three different run-time properties: execution time, energy consumption, and EDP, in NAS Parallel Benchmarks for OpenMP. Our experimental evaluation shows that CHOAMP achieves an average of 65, 28, and 57 percent improvement over baseline HMP scheduling while optimizing for energy, execution time, and EDP, respectively.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 2","pages":"163-176"},"PeriodicalIF":0.0,"publicationDate":"2018-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2791955","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68021417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Execution Trace Graph of Dataflow Process Networks","authors":"Simone Casale-Brunet;Marco Mattavelli","doi":"10.1109/TMSCS.2018.2790921","DOIUrl":"https://doi.org/10.1109/TMSCS.2018.2790921","url":null,"abstract":"The paper introduces and specifies a formalism that provides complete representations of dataflow process network (DPN) program executions, by means of directed acyclic graphs. Such graphs, also known as execution trace graphs (ETG), are composed of nodes representing each action firing and by directed arcs representing the dataflow program execution constraints between two action firings. Action firings are atomic operations that encompass the algorithmic part of the action executions applied to both, the input data and the actor state variables. The paper describes how an ETG can be effectively derived from a dataflow program, specifies the type of dependencies that need to be included, and the processing that need to be applied so that an ETG become capable of representing all the admissible trajectories that dynamic dataflow programs can execute. The paper also describes how some characteristics of the ETG, related to specific implementations of the dataflow program, can be evaluated by means of high-level and architecture-independent executions of the program. Furthermore, some examples are provided showing how the analysis of the ETGs can support efficient explorations, reductions, and optimizations of the design space, providing results in terms of design alternatives, without requiring any partial implementation or reduction of the expressiveness of the original DPN dataflow program.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"340-354"},"PeriodicalIF":0.0,"publicationDate":"2018-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2790921","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}