{"title":"Novel low-cost aging sensor","authors":"M. Omaña, Daniele Rossi, N. Bosio, C. Metra","doi":"10.1145/1787275.1787299","DOIUrl":"https://doi.org/10.1145/1787275.1787299","url":null,"abstract":"Performance degradation of integrated circuits due to aging effects, such as Negative Bias Temperature Instability (NBTI), is becoming of great concern for current and future CMOS technology. Here we introduce an aging sensor able to detect such degradations in the combinational part of a critical data-path. It requires lower area than recently proposed alternative solutions, and a lower or comparable power consumption.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131636147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A heterogeneous parallel system running open mpi on a broadband network of embedded set-top devices","authors":"Richard Neill, Alexander Shabarshin, L. Carloni","doi":"10.1145/1787275.1787324","DOIUrl":"https://doi.org/10.1145/1787275.1787324","url":null,"abstract":"We present a heterogeneous parallel computing system that combines a traditional computer cluster with a broadband network of embedded set-top box (STB) devices. As multiple service operators (MSO) manage millions of these devices across wide geographic areas, the computational power of such a massively-distributed embedded system could be harnessed to realize a centrally-managed, energy-efficient parallel processing platform that supports a variety of application domains which are of interest to MSOs, consumers, and the high-performance computing research community. We investigate the feasibility of this idea by building a prototype system that includes a complete head-end cable system with a DOCSIS-2.0 network combined with an interoperable implementation of a subset of Open MPI running on the STB embedded operating system. We evaluate the performance and scalability of our system compared to a traditional cluster by solving approximately various instances of the Multiple Sequence Alignment bioinformatics problem, while the STBs continue simultaneously to operate their primary functions: decode MPEG streams for television display and run an interactive user interface. Based on our experimental results and given the technology trends in embedded computing we argue that our approach to leverage a broadband network of embedded devices in a heterogeneous distributed system offers the benefits of both parallel computing clusters and distributed Internet computing.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114398654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining deblurring and denoising for handheld HDR imaging in low light conditions","authors":"P. Lakshman","doi":"10.1145/1787275.1787303","DOIUrl":"https://doi.org/10.1145/1787275.1787303","url":null,"abstract":"This paper proposes a probability formulation that unifies both single-image deblurring and multi-image denoising using variational inference. Based on this formulation, a new algorithm for deblurring a noisy and blurry image pair is presented. Besides, we provide also an approach that combines existing optical flow and image denoising techniques for High Dynamic Range imaging.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114883821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Caches and branches 2","authors":"C. Trinitis","doi":"10.1145/3251913","DOIUrl":"https://doi.org/10.1145/3251913","url":null,"abstract":"","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122880814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Keynote","authors":"G. Bilardi","doi":"10.1145/3251918","DOIUrl":"https://doi.org/10.1145/3251918","url":null,"abstract":"","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"14 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116859043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Kavadias, M. Katevenis, M. Zampetakis, Dimitrios S. Nikolopoulos
{"title":"On-chip communication and synchronization mechanisms with cache-integrated network interfaces","authors":"S. Kavadias, M. Katevenis, M. Zampetakis, Dimitrios S. Nikolopoulos","doi":"10.1145/1787275.1787328","DOIUrl":"https://doi.org/10.1145/1787275.1787328","url":null,"abstract":"Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of two worlds the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized NI functions. This paper presents our architecture, which provides local and remote scratchpad access, to either individual words or multi-word blocks through RDMA copy. Furthermore, we introduce event responses, as a mechanism for software configurable synchronization primitives. We present three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, memory barriers for explicitly-selected accesses of arbitrary size, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and evaluated the on-chip communication performance on the prototype as well as on a CMP simulator with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks, and efficient exploitation of the hardware bandwidth.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"200 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134041697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interval-based models for run-time DVFS orchestration in superscalar processors","authors":"G. Keramidas, Vasileios Spiliopoulos, S. Kaxiras","doi":"10.1145/1787275.1787338","DOIUrl":"https://doi.org/10.1145/1787275.1787338","url":null,"abstract":"We develop two simple interval-based models for dynamic superscalar processors. These models allow us to: i) predict with great accuracy performance and power consumption under various frequency and voltage combinations and ii) implement targeted DVFS policies at run-time. The models analyze program execution in intervals - steady-state and miss-event intervals. Intervals are signalled by miss events (L2-misses in our case) that upset the \"steady state\" execution of the program and are ended when the pipeline reaches again a steady state. The first model is fed by an approximation of the stall cycles (the time the processor instruction window is blocked) due to long-latency L2-misses. The second model improves on this approximation using as input the occupancy of the L2's miss-handling registers (MSHRs). Despite their simplicity these models prove to be accurate in predicting the performance (and energy) for any target frequency/voltage setting, yielding average errors of 2.1% and 0.2% respectively. Besides modelling, we show that the methodology we propose is powerful enough to implement (at run-time) various DVFS policies: \"operate at optimal EDP\" or \"ED2P,\" or even \"reduce ED2P within specific performance constraints.\" Approaches based on the two models require minimal hardware cost: two counters for measuring the duration of the steady state and the miss-event intervals and some comparison logic. To validate our methodology we use a cycle-accurate simulator and the benchmarks provided by the SPEC2K suite. Our results indicate that our proposed run-time mechanism is able to orchestrate different DVFS policies with great success yielding negligible errors - bellow 1.5% on average.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115444214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
James Dinan, P. Balaji, E. Lusk, P. Sadayappan, R. Thakur
{"title":"Hybrid parallel programming with MPI and unified parallel C","authors":"James Dinan, P. Balaji, E. Lusk, P. Sadayappan, R. Thakur","doi":"10.1145/1787275.1787323","DOIUrl":"https://doi.org/10.1145/1787275.1787323","url":null,"abstract":"The Message Passing Interface (MPI) is one of the most widely used programming models for parallel computing. However, the amount of memory available to an MPI process is limited by the amount of local memory within a compute node. Partitioned Global Address Space (PGAS) models such as Unified Parallel C (UPC) are growing in popularity because of their ability to provide a shared global address space that spans the memories of multiple compute nodes. However, taking advantage of UPC can require a large recoding effort for existing parallel applications. In this paper, we explore a new hybrid parallel programming model that combines MPI and UPC. This model allows MPI programmers incremental access to a greater amount of memory, enabling memory-constrained MPI codes to process larger data sets. In addition, the hybrid model offers UPC programmers an opportunity to create static UPC groups that are connected over MPI. As we demonstrate, the use of such groups can significantly improve the scalability of locality-constrained UPC codes. This paper presents a detailed description of the hybrid model and demonstrates its effectiveness in two applications: a random access benchmark and the Barnes-Hut cosmological simulation. Experimental results indicate that the hybrid model can greatly enhance performance; using hybrid UPC groups that span two cluster nodes, RA performance increases by a factor of 1.33 and using groups that span four cluster nodes, Barnes-Hut experiences a twofold speedup at the expense of a 2% increase in code size.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117074908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Global management of cache hierarchies","authors":"M. Zahran, S. Mckee","doi":"10.1145/1787275.1787315","DOIUrl":"https://doi.org/10.1145/1787275.1787315","url":null,"abstract":"Cache memories currently treat all blocks as if they were equally important. This assumption of equally important blocks is not always valid. For instance, not all blocks deserve to be in L1 cache. We therefore propose globalized block placement. We present a global placement algorithm for managing blocks in a cache hierarchy by deciding where in the hierarchy an incoming block should be placed. Our technique makes decisions by adapting to access patterns of different blocks. The contributions of this paper are fourfold. First, we motivate our solution by demonstrating the importance of a globalized placement scheme. Second, we present a method to categorize cache block behavior into one of four categories. Third, we present one potential design exploiting this categorization. Finally, we demonstrate the performance of our design. The proposed scheme enhances overall system performance (IPC) by an average of 12% over a traditional LRU scheme while reducing traffic between L1 cache and L2 cache by an average of 20%, using SPEC CPU benchmark suite. All of this is achieved with a table as small as 3 KB.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116920514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Navaridas, L. Plana, J. Miguel-Alonso, M. Luján, S. Furber
{"title":"SpiNNaker: impact of traffic locality, causality and burstiness on the performance of the interconnection network","authors":"J. Navaridas, L. Plana, J. Miguel-Alonso, M. Luján, S. Furber","doi":"10.1145/1787275.1787278","DOIUrl":"https://doi.org/10.1145/1787275.1787278","url":null,"abstract":"The SpiNNaker system is a biologically-inspired massively parallel architecture of bespoke multi-core System-on-Chips. The aim of its design is to simulate up to a billion spiking neurons in (biological) real-time. Packets, in SpiNNaker, represent neural spikes and these travel through the two-dimensional triangular torus network that connects the over 65 thousand nodes housed in the largest size of SpiNNaker. The research question that we explore is the impact that spatial locality, temporal causality and burstiness of the traffic have on the performance of such interconnection network. Given the limited knowledge of neuron activity patterns, we propose and use synthetic traffic patterns which resemble biological neural traffic and allow tuning of spatial locality. Causality is explored by means of temporal patterns that maintain a specified overall network load while allowing at the node level autonomous causal traffic generation. Part of the traffic is generated automatically, but the remaining traffic is triggered by a spike arrival in the form of a packet or a burst of packets; as neural stimuli do. In this way, we generate non-uniform traffic patterns with an evolving concentration of activity at nodes which contain more active parts of the spiking neural network. Given the application domain, the simulation-based study focuses on the real-time behavior of the system rather than focusing on standard HPC network metrics. The results show that the interconnection network of SpiNNaker can operate without dropping packets with traffic loads that exceed more than 3.5 times those required to simulate 109 spiking neurons, despite using non-local traffic. We also find that increments in the degree of traffic causality do not affect the performance of the system, but burstiness in the traffic can hurt performance.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129961487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}