{"title":"Proposition for a sequential accelerator in future general-purpose manycore processors and the problem of migration-induced cache misses","authors":"P. Michaud, Yiannakis Sazeides, André Seznec","doi":"10.1145/1787275.1787330","DOIUrl":"https://doi.org/10.1145/1787275.1787330","url":null,"abstract":"As the number of transistors on a chip doubles with every technology generation, the number of on-chip cores also increases rapidly, making possible in a foreseeable future to design processors featuring hundreds of general-purpose cores. However, though a large number of cores speeds up parallel code sections, Amdahl's law requires speeding up sequential sections too. We argue that it will become possible to dedicate a substantial fraction of the chip area and power budget to achieve high sequential performance. Current general-purpose processors contain a handful of cores designed to be continuously active and run in parallel. This leads to power and thermal constraints that limit the core's performance. We propose removing these constraints with a sequential accelerator (SACC). A SACC consists of several cores designed for ultimate sequential performance. These cores cannot run continuously. A single core is active at any time, the rest of the cores are inactive and power-gated. We migrate the execution periodically to another core to spread heat generation uniformly over the whole SACC area, thus addressing the temperature issue. The SACC will be viable only if it yields significant sequential performance. Migration-induced cache misses may limit performance gains. We propose some solutions to mitigate this problem. We also investigate a migration method using thermal sensors, such that the migration interval depends on the ambient temperature and the migration penalty is negligible under normal thermal conditions.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128722987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Power 1","authors":"M. Alderighi","doi":"10.1145/3251915","DOIUrl":"https://doi.org/10.1145/3251915","url":null,"abstract":"","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133784928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient parallel implementation of multilayer backpropagation networks on SpiNNaker","authors":"Xin Jin, M. Luján, L. Plana, Alexander D. Rast, S. Welbourne, S. Furber","doi":"10.1145/1787275.1787297","DOIUrl":"https://doi.org/10.1145/1787275.1787297","url":null,"abstract":"This paper presents an efficient implementation and performance analysis of mapping multilayer perceptron networks with the backpropagation learning rule on SpiNNaker - a massively parallel architecture dedicated for neural network simulation. A new algorithm called pipelined checker-boarding partitioning scheme is proposed for efficient mapping. The new mapping algorithm relies on a checker-board partitioning scheme, but the key advantage comes from introducing a pipelined mode. The six-stage pipelined mode captures the parallelism within each partition of the weight matrix, allowing the overlapping of communication and computation. Not only does the proposed mapping localize communication, but it can also hide a part of or even all the communication for high efficiency.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124137473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Keynote","authors":"N. Amato","doi":"10.1145/3251909","DOIUrl":"https://doi.org/10.1145/3251909","url":null,"abstract":"","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"285 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122401891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ERBIUM: a deterministic, concurrent intermediate representation for portable and scalable performance","authors":"Cupertino Miranda, Philippe Dumont, Albert Cohen, M. Duranton, Antoniu Pop","doi":"10.1145/1787275.1787312","DOIUrl":"https://doi.org/10.1145/1787275.1787312","url":null,"abstract":"Optimizing compilers and runtime libraries do not shield programmers from the complexity of multi-core hardware; as a result the need for manual, target-specific optimizations increases with every processor generation. High-level languages are being designed to express concurrency and locality without reference to a particular architecture. But compiling such abstractions into efficient code requires a portable, intermediate representation: this is essential for modular composition (separate compilation), for optimization frameworks independent of the source language, and for just-in-time compilation of bytecode languages. This paper introduces Erbium, an intermediate representation for compilers, a low-level language for efficiency programmers, and a lightweight runtime implementation. It relies on a data structure for scalable and deterministic concurrency, called Event Recording, exposing the data-level, task and pipeline parallelism suitable to a given target. We provide experimental evidence of the productivity, scalability and efficiency advantages of Erbium, relying on a prototype implementation in GCC 4.3.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132176245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards greener data centers with storage class memory: minimizing idle power waste through coarse-grain management in fine-grain scale","authors":"In-Hwan Doh, Young Jin Kim, Eunsam Kim, Jongmoo Choi, Donghee Lee, S. Noh","doi":"10.1145/1787275.1787340","DOIUrl":"https://doi.org/10.1145/1787275.1787340","url":null,"abstract":"Studies have shown much of today's data centers are over-provisioned and underutilized. Over-provisioning cannot be avoided as these centers must anticipate peak load with bursty behavior. Under-utilization, to date, has also been unavoidable as systems always had to be ready for that sudden burst of requests that loom just around the corner. Previous research has pointed to turning off systems as one solution, albeit, an infeasible one due to its irresponsiveness. In this paper, we present the feasibility of using new Storage Class Memory (SCM, which encompasses specific developments such as PCM, MRAM, or FeRAM) technology to turn systems on and off with minimum overhead. This feature is used to control systems on the whole (in comparison to previous fine-grained component-wise control) in finer time scale for high responsiveness with minimized power lost to idleness. Our empirical study is done by executing \"real trace\"-like workloads on a prototype \"data center\" of embedded systems deploying FeRAM. We quantify the energy savings and performance trade-off by turning idle systems off. We show that our energy savings approach consumes energy in proportion to user requests with configurable service of quality. Based on observations made on this data center, we discuss the requirements for real deployment. Finally, our conclusion is that SCM should not be viewed as just a replacement of RAM, but rather, as a component that could potentially open a whole new field of applications.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126463978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple sleep modes leakage control in peripheral circuits of a all major SRAM-based processor units","authors":"H. Homayoun, Avesta Sasan, Aseem Gupta, A. Veidenbaum, F. Kurdahi, N. Dutt","doi":"10.1145/1787275.1787339","DOIUrl":"https://doi.org/10.1145/1787275.1787339","url":null,"abstract":"Leakage currents in on-chip SRAMs: caches, branch predictor, register files and TLBs, are major contributors to the energy dissipated by processors in deep sub-micron technologies. High leakage also increases chip temperature and some SRAM-based structures become thermal hotspots. Previous work has addressed major sources of SRAM leakage in memory cells and bit-lines, making remaining SRAM components, in particular large drivers, the primary source of leakage. This paper proposes an approach to reduce this source of leakage in all major SRAM-based units of the processor, controlling them in a uniform way, yet treating each unit individually based on its behavior and memory organization. The new approach uses multiple bias voltages in sleep transistors allowing a trade-off between leakage reduction and wakeup delay in multi-stage peripheral drivers. Four low-power modes are defined, from basic to ultra low power, and SRAMs dynamically transition between these modes to minimize leakage without sacrificing performance. A novel control mechanism monitors and predicts future processor behavior for mode control. The leakage reduction in individual units is evaluated and shown to vary from 25% for IL1 to 75% for L2 caches. Resulting temperature reduction, including the effect of positive feedback between temperature and leakage power, is evaluated. A significant temperature reduction is achieved in each unit. It is also shown to reduce hot spots in the instruction TLB and branch predictor.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124690523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Operating system support for mitigating software scalability bottlenecks on asymmetric multicore processors","authors":"J. C. Saez, Alexandra Fedorova, M. Prieto, Hugo Vegas","doi":"10.1145/1787275.1787281","DOIUrl":"https://doi.org/10.1145/1787275.1787281","url":null,"abstract":"Asymmetric multicore processors (AMP) promise higher performance per watt than their symmetric counterparts, and it is likely that future processors will integrate a few fast out-of-order cores, coupled with a large number of simpler, slow cores, all exposing the same instruction-set architecture (ISA). It is well known that one of the most effective ways to leverage the effectiveness of these systems is to use fast cores to accelerate sequential phases of parallel applications, and to use slow cores for running parallel phases. At the same time, we are not aware of any implementation of this parallelism-aware (PA) scheduling policy in an operating system. So the questions as to whether this policy can be delivered efficiently by the operating system to unmodified applications, and what the associated overheads are remain open. To answer these questions we created two different implementations of the PA policy in OpenSolaris and evaluated it on real hardware, where asymmetry was emulated via CPU frequency scaling. This paper reports our findings with regard to benefits and drawbacks of this scheduling policy.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125054159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self organization on a swarm computing fabric: a new way to look at fault tolerance","authors":"D. Pani, Simone Secchi, L. Raffo","doi":"10.1145/1787275.1787343","DOIUrl":"https://doi.org/10.1145/1787275.1787343","url":null,"abstract":"Recent studies have demonstrated the possibility to exploit Swarm Intelligence (SI) as an inspiration for the design of scalable VLSI tiled architectures exhibiting multitasking, adaptability, absence of centralized low-level control and fault-tolerance. SI approach to fault-tolerance, in principle, can be regarded as a reconfiguration-free cell-exclusion mechanism. The key elements at the basis of a reconfiguration free solution are: loose structure of the system, homogeneity, cooperative behaviors and self organization. In this paper, these self organization aspects, introduced in a recently developed multi-agent VLSI tiled architecture for array processing, expressly developed resorting to the SI inspiration, are presented along with some theoretical and experimental results. The architecture presents two forms of cell-exclusion (bypass and block of faulty elements), implementing self-adaptive behaviors rather than reconfiguration to face faults preserving system functionality. The proposed approach, exploiting indirect communications to provide workload spreading into the computing fabric, is also successful in reducing the effects of the presence of faulty elements without spare resources and with limited performance degradation.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115185160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supporting lock-free composition of concurrent data objects","authors":"Daniel Cederman, P. Tsigas","doi":"10.1145/1787275.1787286","DOIUrl":"https://doi.org/10.1145/1787275.1787286","url":null,"abstract":"Lock-free data objects offer several advantages over their blocking counterparts, such as being immune to deadlocks and convoying and, more importantly, being highly concurrent. However, composing the operations they provide into larger atomic operations, while still guaranteeing efficiency and lock-freedom, is a challenging algorithmic task. We present a lock-free methodology for composing highly concurrent linearizable objects together by unifying their linearization points. This makes it possible to relatively easily introduce atomic lock-free move operations to a wide range of concurrent objects. Experimental evaluation has shown that the operations originally supported by the data objects keep their performance behavior under our methodology.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116948366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}