A. Shabbir, S. Stuijk, Akash Kumar, B. Theelen, B. Mesman, H. Corporaal
{"title":"A predictable communication assist","authors":"A. Shabbir, S. Stuijk, Akash Kumar, B. Theelen, B. Mesman, H. Corporaal","doi":"10.1145/1787275.1787301","DOIUrl":"https://doi.org/10.1145/1787275.1787301","url":null,"abstract":"Modern multi-processor systems need to provide guaranteed services to their users. A communication assist (CA) helps in achieving tight timing guarantees. In this paper, we present a CA for a tile-based MP-SoC. Our CA has smaller memory requirements and a lower latency than existing CAs. The CA has been implemented in hardware. We compare it with two existing DMA controllers. When compared with these DMAs, our CA is up to 44% smaller in terms of equivalent gate count.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131228835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Porting existing cache-oblivious linear algebra HPC modules to larrabee architecture","authors":"A. Heinecke, C. Trinitis, J. Weidendorfer","doi":"10.1145/1787275.1787298","DOIUrl":"https://doi.org/10.1145/1787275.1787298","url":null,"abstract":"Cache-obliviousness represents an important but relatively new concept for cache optimization. As cache-oblivious algorithms perform well on architectures with arbitrary cache configurations, the programming effort required for porting and optimizing for future architectures can be significantly reduced. In [8] and [9], fast parallel cache-oblivious linear algebra modules have been presented. The underlying matrix storing schemes are based on space filling curves. For matrix multiplication, all cache misses can be avoided, whereas for the LU decomposition algorithm the number of cache misses is minimized. It has been shown that the resulting codes work very well on several kinds of systems ranging from laptops to supercomputers. In this paper, we will show that the runtime characteristics of our existing cache-oblivious codes can be preserved on newer Intel processors. Special emphasis is put on the first many-core processor architecture with complete hardware-based cache coherency: The Larrabee Architecture. As the latter is expected to be available as a PCIe card connected to the host system, porting had to take into account transfer of data structures between different memory address spaces. 
Unfortunately, Larrabee was canceled as a graphics device for 2010, but Intel is expected to outline further steps about Larrabee during 2010.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122251937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ananth Nallamuthu, M. C. Smith, Scott S. Hampton, P. Agarwal, S. Alam
{"title":"Energy efficient biomolecular simulations with FPGA-based reconfigurable computing","authors":"Ananth Nallamuthu, M. C. Smith, Scott S. Hampton, P. Agarwal, S. Alam","doi":"10.1145/1787275.1787294","DOIUrl":"https://doi.org/10.1145/1787275.1787294","url":null,"abstract":"Reconfigurable computing (RC) is being investigated as a hardware solution for improving time-to-solution for biomolecular simulations. A number of popular molecular dynamics (MD) codes are used to study various aspects of biomolecules. These codes are now capable of simulating nanosecond time-scale trajectories per day on conventional microprocessor-based hardware, but biomolecular processes often occur at the microsecond time-scale or longer. A wide gap exists between the desired and achievable simulation capability; therefore, there is considerable interest in alternative algorithms and hardware for improving the time-to-solution of MD codes. The fine-grain parallelism provided by Field Programmable Gate Arrays (FPGA) combined with their low power consumption make them an attractive solution for improving the performance of MD simulations. In this work, we use an FPGA-based coprocessor to accelerate the compute-intensive calculations of LAMMPS, a popular MD code, achieving up to 5.5 fold speed-up on the non-bonded force computations of the particle mesh Ewald method and up to 2.2 fold speed-up in overall time-to-solution, and potentially an increase by a factor of 9 in power-performance efficiencies for the pair-wise computations. 
The results presented here provide an example of the multi-faceted benefits to an application in a heterogeneous computing environment.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"256 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115333869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying statistical machine learning to multicore voltage & frequency scaling","authors":"Michael Moeng, R. Melhem","doi":"10.1145/1787275.1787336","DOIUrl":"https://doi.org/10.1145/1787275.1787336","url":null,"abstract":"Dynamic Voltage/Frequency Scaling (DVFS) is a useful tool for improving system energy efficiency, especially in multi-core chips where energy is more of a limiting factor. Per-core DVFS, where cores can independently scale their voltages and frequencies, is particularly effective. We present a DVFS policy using machine learning, which learns the best frequency choices for a machine as a decision tree. Machine learning is used to predict the frequency which will minimize the expected energy per user-instruction (epui) or energy per (user-instruction)² (epui²). While each core independently sets its frequency and voltage, a core is sensitive to other cores' frequency settings. Also, we examine the viability of using only partial training to train our policy, rather than full profiling for each program. We evaluate our policy on a 16-core machine running multiprogrammed, multithreaded benchmarks from the PARSEC benchmark suite against a baseline fixed frequency as well as a recently-proposed greedy policy. 
For 1ms DVFS intervals, our technique improves system epui² by 14.4% over the baseline no-DVFS policy and 11.3% on average over the greedy policy.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"356 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116239416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Flavius Opritoiu, M. Vladutiu, L. Prodan, M. Udrescu
{"title":"A high-speed AES architecture implementation","authors":"Flavius Opritoiu, M. Vladutiu, L. Prodan, M. Udrescu","doi":"10.1145/1787275.1787300","DOIUrl":"https://doi.org/10.1145/1787275.1787300","url":null,"abstract":"We present in this paper a high-performance implementation of the Advanced Encryption Standard (AES). The design goal is directed toward efficient implementation of an AES cryptocore. The proposed architecture exhibits parallelism by concurrently processing all the bytes of a data block and computes each round key on-the-fly. The design implements both AES encryption and decryption by efficiently sharing the complex design modules. The proposed high-speed iterative implementation performing the AES operations in 11 clock cycles was synthesized for ALTERA's Cyclone II FPGA.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116590071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Processor architecture","authors":"S. Mckee","doi":"10.1145/3251919","DOIUrl":"https://doi.org/10.1145/3251919","url":null,"abstract":"","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128720580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Gummaraju, B. Sander, L. Morichetti, Benedict R. Gaster, Lee W. Howes
{"title":"Efficient implementation of GPGPU synchronization primitives on CPUs","authors":"J. Gummaraju, B. Sander, L. Morichetti, Benedict R. Gaster, Lee W. Howes","doi":"10.1145/1787275.1787295","DOIUrl":"https://doi.org/10.1145/1787275.1787295","url":null,"abstract":"The GPGPU model represents a style of execution where thousands of threads execute in a data-parallel fashion, with a large subset (typically 10s to 100s) needing frequent synchronization. As the GPGPU model evolves to target both GPUs and CPUs as acceleration targets, thread synchronization becomes an important problem when running on CPUs. CPUs have little hardware support for synchronization, so it must be emulated in software, reducing application performance. This paper presents software techniques to implement the GPGPU synchronization primitives on CPUs, while maintaining application debug-ability. Performing limit studies using real hardware, we evaluate the potential performance benefits of an efficient barrier primitive.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121908031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Power 2","authors":"R. Gioiosa","doi":"10.1145/3251916","DOIUrl":"https://doi.org/10.1145/3251916","url":null,"abstract":"","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121680291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient pattern matching on GPUs for intrusion detection systems","authors":"Antonino Tumeo, Oreste Villa, D. Sciuto","doi":"10.1145/1787275.1787296","DOIUrl":"https://doi.org/10.1145/1787275.1787296","url":null,"abstract":"In this paper we present an efficient implementation of the Aho-Corasick pattern matching algorithm on Graphics Processing Units (GPU), showing how we redesigned the algorithm and the data structures to fit on the architecture and comparing it with an equivalent implementation on the CPU. We show that with a synthetic dataset, our implementation obtains a speedup up to 6.67 with respect to the CPU solution.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"24 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132592718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hyperscalar multi-core architecture","authors":"J. Chiu, Yu-Liang Chou, Ding-Siang Su","doi":"10.1145/1787275.1787291","DOIUrl":"https://doi.org/10.1145/1787275.1787291","url":null,"abstract":"This paper proposes a reconfigurable multi-core architecture, called hyperscalar, that enables many scalar cores to be united dynamically as a larger superscalar processor to accelerate a thread. To accomplish this, we propose the virtual shared register files (VSRF) that allow the instructions of a thread executed in the united cores to logically face a uniform set of register files. We also propose the instruction analyzer (IA) with the capability of detecting and tagging the dependence information to the newly fetched instructions. According to the tags, instructions in the united cores can issue requests to obtain their remote operands via the VSRF. The reconfigurable feature of hyperscalar can cover a spectrum of workloads well, providing high single-thread performance when TLP is low and high throughput when TLP is high. Simulation results show that an 8-core hyperscalar chip multiprocessor's 2-, 4-, and 8-core-united configurations achieve 94%, 90%, and 83% of the performance of the monolithic 2-, 4-, and 8-issue out-of-order superscalar processors with lower area costs and better support for software diversity.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126379518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}