{"title":"Moving to memoryland: in-memory computation for existing applications","authors":"P. Trancoso","doi":"10.1145/2742854.2742874","DOIUrl":"https://doi.org/10.1145/2742854.2742874","url":null,"abstract":"Migrating computation to memory was proposed a long time ago as a way to overcome the memory bandwidth and latency bottleneck, as well as increase the computation parallelism. While the concept had been applied to several research projects it is only recently that the technological hurdles have been solved and we are able to see products arriving the market. While in most cases we need to concentrate on developing new algorithms and porting applications to new models as to fully exploit the potentials of the new products, we will still want to be able to execute efficiently existing applications. As such, in this work we focus on the analysis of the in-memory computation characteristics of existing applications in a way to evaluate how we would be able to have them move to \"Memoryland\". We present a tool that analyses the locality of the memory accesses for the different routines in an application. The results observed from the execution of this tool on different applications are that while certain applications seem to be able to fit in a small granularity architecture (small memory-to-computation ratio), others have routines that require a large amount of data. Thus we believe that hierarchical in-memory processing architectures are a good fit for the demands of the different applications. 
In addition, results have shown that for most applications we can limit our analysis to the routines that issue the most memory accesses.","PeriodicalId":417279,"journal":{"name":"Proceedings of the 12th ACM International Conference on Computing Frontiers","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126879554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous energy-efficient cache design in warehouse scale computers","authors":"Jing Wang, Xiaoyan Zhu, Yanjun Liu, Jiaqi Zhang, Minhua Wu, Wei-gong Zhang, Keni Qiu","doi":"10.1145/2742854.2742889","DOIUrl":"https://doi.org/10.1145/2742854.2742889","url":null,"abstract":"Energy efficiency is becoming the key design concern for modern warehouse-scale computer (WSC) systems, where tens of thousands of server processors consume a significant portion of the total power. Voltage scaling is one of the most effective mechanisms to improve energy efficiency at the cost of cell failures in large cache arrays. In this paper, we leverage the observation that there exists a diverse spectrum of tolerance to cache errors in large internet services to design a heterogeneous energy-efficient cache enforced by variable-strength error-correcting codes. The operating system may use the page coloring mechanism to control mapping applications to cache regions with differential reliability.","PeriodicalId":417279,"journal":{"name":"Proceedings of the 12th ACM International Conference on Computing Frontiers","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124878111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing irregular applications for energy and performance on the Tilera many-core architecture","authors":"D. Chavarría-Miranda, Ajay Panyala, M. Halappanavar, J. Manzano, Antonino Tumeo","doi":"10.1145/2742854.2742865","DOIUrl":"https://doi.org/10.1145/2742854.2742865","url":null,"abstract":"Optimizing applications simultaneously for energy and performance is a complex problem. High performance, parallel, irregular applications are notoriously hard to optimize due to their data-dependent memory accesses, lack of structured locality and complex data structures and code patterns. Irregular kernels are growing in importance in applications such as machine learning, graph analytics and combinatorial scientific computing. Performance- and energy-efficient implementation of these kernels on modern, energy efficient, many-core platforms is therefore an important and challenging problem. We present results from optimizing two irregular applications -- the Louvain method for community detection (Grappolo), and high-performance conjugate gradient (HPCCG) -- on the Tilera many-core system. We have significantly extended MIT's OpenTuner auto-tuning framework to conduct a detailed study of platform-independent and platform-specific optimizations to improve performance as well as reduce total energy consumption. We explore the optimization design space along three dimensions: memory layout schemes, compiler-based code transformations, and optimization of parallel loop schedules. 
Using auto-tuning, we demonstrate whole-node energy savings of up to 41% relative to a baseline instantiation, and up to 31% relative to manually optimized variants.","PeriodicalId":417279,"journal":{"name":"Proceedings of the 12th ACM International Conference on Computing Frontiers","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123298555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Authentication and privacy preserving message transfer scheme for vehicular ad hoc networks (VANETs)","authors":"Kuldeep Singh, P. Saini, S. Rani, Awadhesh Kumar Singh","doi":"10.1145/2742854.2745718","DOIUrl":"https://doi.org/10.1145/2742854.2745718","url":null,"abstract":"Vehicular Ad hoc Networks (VANETs) are likely to be deployed for real-time applications in the coming years, thus, forming the most relevant form of mobile ad hoc networks. In such hostile environment, security is a major concern. The paper presents a novel architecture for VANETs to achieve authentication and privacy preserving message transfer among the vehicles. We have designed a four-phase protocol which employs Elliptic Curve Cryptography (ECC). Also, the paper discusses the performance of ECC over RSA in terms of key size and computation from the existing data set. Furthermore, the paper presents a static analysis to prove robustness and efficiency of the proposed approach.","PeriodicalId":417279,"journal":{"name":"Proceedings of the 12th ACM International Conference on Computing Frontiers","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131557441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced GPU-based distributed breadth first search","authors":"M. Bernaschi, Giancarlo Carbone, Enrico Mastrostefano, M. Bisson, M. Fatica","doi":"10.1145/2742854.2742887","DOIUrl":"https://doi.org/10.1145/2742854.2742887","url":null,"abstract":"There is growing interest in studying large scale graphs having millions of vertices and billions of edges, up to the point that a specific benchmark, called Graph500, has been defined to measure the performance of graph algorithms on modern computing architectures. At first glance, Graphics Processing Units (GPUs) are not an ideal platform for the execution of graph algorithms that are characterized by low arithmetic intensity and irregular memory access patterns. For studying really large graphs, multiple GPUs are required to overcome the memory size limitations of a single GPU. In the present paper, we propose several techniques to minimize the communication among GPUs.","PeriodicalId":417279,"journal":{"name":"Proceedings of the 12th ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130603447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast packet forwarding engine based on software circuits","authors":"M. Makkes, A. Varbanescu, C. D. Laat, R. Meijer","doi":"10.1145/2742854.2742862","DOIUrl":"https://doi.org/10.1145/2742854.2742862","url":null,"abstract":"Forwarding packets is part of the performance critical path of routing devices, and affects the network performance at any scale. This operation is typically performed by dedicated routing boxes, which are fast, but expensive and inflexible. Recent work has shown that in many cases commodity hardware is becoming an alternative to these specialized boxes. In this work, we present a new technique - based on bitslicing - to improve the performance of forward decision-making on modern commodity hardware. Specifically, we propose to replace memory lookups with logical operations, by evaluating the packet header information as a Boolean circuit. Being less memory-intensive, our algorithm has the potential to achieve high performance on both modern CPUs and GPUs. To measure and qulify the performance of our algorithm, we implemented it in OpenCL and performed a large set of experiments on 5 different platforms - two CPUs and three GPUs. Our results show that bitslicing has the ability to outperform the traditional, memory lookup approach in 70% of the cases, depending on the type of traffic and routing parameters.","PeriodicalId":417279,"journal":{"name":"Proceedings of the 12th ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129564868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A matrix multiplier case study for an evaluation of a configurable dataflow-machine","authors":"L. Verdoscia, R. Vaccaro, R. Giorgi","doi":"10.1145/2742854.2747287","DOIUrl":"https://doi.org/10.1145/2742854.2747287","url":null,"abstract":"Configurable computing has become a subject of a great deal of research given its potential to greatly accelerate a wide variety of applications that require high throughput. In this context, the dataflow approach is still promising to accelerate the kernel of applications in the field of HPC. That tanks to a computational dataflow engine able to execute dataflow program graphs directly in a custom hardware. On the other hand, evaluating radically different models of computation remains yet an open issue. In this paper we present as case study the matrix multiplication that constitutes the fundamental kernel of the linear algebra. The evaluation takes into account the execution of the matrix product both in non-pipelined and pipelined modes. Results obtained running the execution of the two modes on an FPGA-based demonstrator show the validity of the configurable Dataflow-Machine. Moreover, at the same throughput, the power consumption is expected to be lower than in clock-based systems.","PeriodicalId":417279,"journal":{"name":"Proceedings of the 12th ACM International Conference on Computing Frontiers","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134325733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal allocation of virtual resources using genetic algorithm in cloud environments","authors":"K. D. Babu, D. Kumar, Suresh Veluru","doi":"10.1145/2742854.2744722","DOIUrl":"https://doi.org/10.1145/2742854.2744722","url":null,"abstract":"Optimal resource utilization is one of the biggest challenges for executing tasks within the cloud. The resource provider is responsible for providing the resources by creating virtual machines for executing task over a cloud. To utilize the resources optimally, the resource provider has to take care of the process of allocating resources to Virtual Machine Manager (VMM). In this paper, an efficient way to utilize the resources, within the cloud, has been proposed considering remaining resources should be maximum at a single machine but not distributed. As a framework to virtual resource mapping, a Simple Genetic Algorithm is applied to solve the heuristic of allocating problem. We may also use conversion of multiple parameters into single equivalent parameter so that number of inputs and comparisons will be reduced.","PeriodicalId":417279,"journal":{"name":"Proceedings of the 12th ACM International Conference on Computing Frontiers","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132131589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HRF: a resource allocation scheme for moldable jobs","authors":"Song Wu, Qiong Tuo, Hai Jin, Chuxiong Yan, Qizheng Weng","doi":"10.1145/2742854.2742870","DOIUrl":"https://doi.org/10.1145/2742854.2742870","url":null,"abstract":"Moldable jobs, which allow the number of allocated processors to be adjusted before running in clusters, have attracted increasing concern in parallel job scheduling research. Compared with traditional rigid jobs where the number of allocated processors is fixed, moldable jobs are more flexible and therefore have more potential for improving their average turnaround time (a crucial metric to describe performance of jobs in a cluster). Average turnaround time of moldable jobs depends greatly on resource allocation schemes. Unfortunately, existing schemes do not perform well in reducing average turnaround time, either because they only consider a single job's turnaround time instead of the average turnaround time of all jobs, or because they just aim at fairness between short and long jobs instead of their average turnaround time. In this paper, we investigate how resource allocation affects the average turnaround time of moldable jobs in clusters, and propose a scheme named HRF (highest revenue first), which allocates processors according to the highest revenue of shortening runtime. 
In our simulations, experimental results show that HRF can reduce average turnaround time up to 71% when compared with state-of-the-art schemes.","PeriodicalId":417279,"journal":{"name":"Proceedings of the 12th ACM International Conference on Computing Frontiers","volume":"50 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116318304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data access optimization in a processing-in-memory system","authors":"Zehra Sura, A. Jacob, Tong Chen, Bryan S. Rosenburg, Olivier Sallenave, C. Bertolli, S. Antão, J. Brunheroto, Yoonho Park, K. O'Brien, R. Nair","doi":"10.1145/2742854.2742863","DOIUrl":"https://doi.org/10.1145/2742854.2742863","url":null,"abstract":"The Active Memory Cube (AMC) system is a novel heterogeneous computing system concept designed to provide high performance and power-efficiency across a range of applications. The AMC architecture includes general-purpose host processors and specially designed in-memory processors (processing lanes) that would be integrated in a logic layer within 3D DRAM memory. The processing lanes have large vector register files but no power-hungry caches or local memory buffers. Performance depends on how well the resulting higher effective memory latency within the AMC can be managed. In this paper, we describe a combination of programming language features, compiler techniques, operating system interfaces, and hardware design that can effectively hide memory latency for the processing lanes in an AMC system. We present experimental data to show how this approach improves the performance of a set of representative benchmarks important in high performance computing applications. 
As a result, we are able to achieve high performance together with power efficiency using the AMC architecture.","PeriodicalId":417279,"journal":{"name":"Proceedings of the 12th ACM International Conference on Computing Frontiers","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132866585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}