ASPLOS XII Pub Date: 2006-10-23 DOI: 10.1145/1168857.1168889
E. Schuchman, T. N. Vijaykumar
{"title":"A program transformation and architecture support for quantum uncomputation","authors":"E. Schuchman, T. N. Vijaykumar","doi":"10.1145/1168857.1168889","DOIUrl":"https://doi.org/10.1145/1168857.1168889","url":null,"abstract":"Quantum computing's power comes from new algorithms that exploit quantum mechanical phenomena for computation. Quantum algorithms are different from their classical counterparts in that quantum algorithms rely on algorithmic structures that are simply not present in classical computing. Just as classical program transformations and architectures have been designed for common classical algorithm structures, quantum program transformations and quantum architectures should be designed with quantum algorithms in mind. Because quantum algorithms come with these new algorithmic structures, resultant quantum program transformations and architectures may look very different from their classical counterparts. This paper focuses on uncomputation, a critical and prevalent structure in quantum algorithms, and considers how program transformations and architecture support should be designed to accommodate uncomputation. In this paper, we show a simple quantum program transformation that exposes independence between uncomputation and later computation. We then propose a multicore architecture tailored to this exposed parallelism and propose a scheduling policy that efficiently maps such parallelism to the multicore architecture. Our policy achieves parallelism between uncomputation and later computation while reducing cumulative communication distance. Our scheduling and architecture allow significant speedup of quantum programs (between 1.8x and 2.8x speedup in Shor's factoring algorithm), while reducing cumulative communication distance by 26%.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128562913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII Pub Date: 2006-10-23 DOI: 10.1145/1168857.1168905
M. Kawahito, H. Komatsu, T. Moriyama, H. Inoue, T. Nakatani
{"title":"A new idiom recognition framework for exploiting hardware-assist instructions","authors":"M. Kawahito, H. Komatsu, T. Moriyama, H. Inoue, T. Nakatani","doi":"10.1145/1168857.1168905","DOIUrl":"https://doi.org/10.1145/1168857.1168905","url":null,"abstract":"Modern processors support hardware-assist instructions (such as TRT and TROT instructions on IBM zSeries) to accelerate certain functions such as delimiter search and character conversion. Such special instructions have often been used in high-performance libraries, but they have not been exploited well in optimizing compilers except for some limited cases. We propose a new idiom recognition technique derived from a topological embedding algorithm [4] to detect idiom patterns in the input program more aggressively than in previous approaches. Our approach can detect a pattern even if the code segment does not exactly match the idiom. For example, we can detect a code segment that includes additional code within the idiom pattern. We implemented our new idiom recognition approach based on the Java Just-In-Time (JIT) compiler that is part of the J9 Java Virtual Machine, and we supported several important idioms for special hardware-assist instructions on the IBM zSeries and on some models of the IBM pSeries. To demonstrate the effectiveness of our technique, we performed two experiments. The first one is to see how many more patterns we can detect compared to the previous approach. The second one is to see how much performance improvement we can achieve over the previous approach. For the first experiment, we used the Java Compatibility Kit (JCK) API tests. For the second one, we used the IBM XML parser, SPECjvm98, and SPECjbb2000. In summary, relative to a baseline implementation using exact pattern matching, our algorithm converted 75% more loops in JCK tests. We also observed significant performance improvement of the XML parser by 64%, of SPECjvm98 by 1%, and of SPECjbb2000 by 2% on average on a z990. Finally, we observed that the JIT compilation time increases by only 0.32% to 0.44%.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133683597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII Pub Date: 2006-10-23 DOI: 10.1145/1168857.1168892
J. F. Cantin, Mikko H. Lipasti, James E. Smith
{"title":"Stealth prefetching","authors":"J. F. Cantin, Mikko H. Lipasti, James E. Smith","doi":"10.1145/1168857.1168892","DOIUrl":"https://doi.org/10.1145/1168857.1168892","url":null,"abstract":"Prefetching in shared-memory multiprocessor systems is an increasingly difficult problem. As system designs grow to incorporate larger numbers of faster processors, memory latency and interconnect traffic increase. While aggressive prefetching techniques can mitigate the increasing memory latency, they can harm performance by wasting precious interconnect bandwidth and prematurely accessing shared data, causing state downgrades at remote nodes that force later upgrades. This paper investigates Stealth Prefetching, a new technique that utilizes information from Coarse-Grain Coherence Tracking (CGCT) for prefetching data aggressively, stealthily, and efficiently in a broadcast-based shared-memory multiprocessor system. Stealth Prefetching utilizes CGCT to identify regions of memory that are not shared by other processors, aggressively fetches these lines from DRAM in open-page mode, and moves them close to the processor in anticipation of future references. Our analysis with commercial, scientific, and multiprogrammed workloads shows that Stealth Prefetching provides an average speedup of 20% over an aggressive baseline system with conventional prefetching.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133452855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII Pub Date: 2006-10-23 DOI: 10.1145/1168857.1168860
Keith Adams, Ole Agesen
{"title":"A comparison of software and hardware techniques for x86 virtualization","authors":"Keith Adams, Ole Agesen","doi":"10.1145/1168857.1168860","DOIUrl":"https://doi.org/10.1145/1168857.1168860","url":null,"abstract":"Until recently, the x86 architecture has not permitted classical trap-and-emulate virtualization. Virtual Machine Monitors for x86, such as VMware® Workstation and Virtual PC, have instead used binary translation of the guest kernel code. However, both Intel and AMD have now introduced architectural extensions to support classical virtualization. We compare an existing software VMM with a new VMM designed for the emerging hardware support. Surprisingly, the hardware VMM often suffers lower performance than the pure software VMM. To determine why, we study architecture-level events such as page table updates, context switches and I/O, and find their costs vastly different among native, software VMM and hardware VMM execution. We find that the hardware support fails to provide an unambiguous performance advantage for two primary reasons: first, it offers no support for MMU virtualization; second, it fails to co-exist with existing software techniques for MMU virtualization. We look ahead to emerging techniques for addressing this MMU virtualization problem in the context of hardware-assisted virtualization.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134401484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII Pub Date: 2006-10-23 DOI: 10.1145/1168857.1168878
M. Mishra, T. Callahan, Tiberiu Chelcea, Girish Venkataramani, S. Goldstein, M. Budiu
{"title":"Tartan: evaluating spatial computation for whole program execution","authors":"M. Mishra, T. Callahan, Tiberiu Chelcea, Girish Venkataramani, S. Goldstein, M. Budiu","doi":"10.1145/1168857.1168878","DOIUrl":"https://doi.org/10.1145/1168857.1168878","url":null,"abstract":"Spatial Computing (SC) has been shown to be an energy-efficient model for implementing program kernels. In this paper we explore the feasibility of using SC for more than small kernels. To this end, we evaluate the performance and energy efficiency of entire applications on Tartan, a general-purpose architecture which integrates a reconfigurable fabric (RF) with a superscalar core. Our compiler automatically partitions and compiles an application into an instruction stream for the core and a configuration for the RF. We use a detailed simulator to capture both timing and energy numbers for all parts of the system. Our results indicate that a hierarchical RF architecture, designed around a scalable interconnect, is instrumental in harnessing the benefits of spatial computation. The interconnect uses static configuration and routing at the lower levels and a packet-switched, dynamically-routed network at the top level. Tartan is most energy-efficient when almost all of the application is mapped to the RF, indicating the need for the RF to support most general-purpose programming constructs. Our initial investigation reveals that such a system can provide, on average, an order of magnitude improvement in energy-delay compared to an aggressive superscalar core on single-threaded workloads.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129370523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII Pub Date: 2006-10-23 DOI: 10.1145/1168857.1168875
Katherine E. Coons, Xia Chen, D. Burger, K. McKinley, Sundeep K. Kushwaha
{"title":"A spatial path scheduling algorithm for EDGE architectures","authors":"Katherine E. Coons, Xia Chen, D. Burger, K. McKinley, Sundeep K. Kushwaha","doi":"10.1145/1168857.1168875","DOIUrl":"https://doi.org/10.1145/1168857.1168875","url":null,"abstract":"Growing on-chip wire delays are motivating architectural features that expose on-chip communication to the compiler. EDGE architectures are one example of communication-exposed microarchitectures in which the compiler forms dataflow graphs that specify how the microarchitecture maps instructions onto a distributed execution substrate. This paper describes a compiler scheduling algorithm called spatial path scheduling that factors in previously fixed locations - called anchor points - for each placement. This algorithm extends easily to different spatial topologies. We augment this basic algorithm with three heuristics: (1) local and global ALU and network link contention modeling, (2) global critical path estimates, and (3) dependence chain path reservation. We use simulated annealing to explore possible performance improvements and to motivate the augmented heuristics and their weighting functions. We show that the spatial path scheduling algorithm augmented with these three heuristics achieves a 21% average performance improvement over the best prior algorithm and comes within an average of 5% of the annealed performance for our benchmarks.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126990156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII Pub Date: 2006-10-20 DOI: 10.1145/1168857.1168890
Shashidhar Mysore, B. Agrawal, N. Srivastava, Sheng-Chih Lin, K. Banerjee, T. Sherwood
{"title":"Introspective 3D chips","authors":"Shashidhar Mysore, B. Agrawal, N. Srivastava, Sheng-Chih Lin, K. Banerjee, T. Sherwood","doi":"10.1145/1168857.1168890","DOIUrl":"https://doi.org/10.1145/1168857.1168890","url":null,"abstract":"While the number of transistors on a chip increases exponentially over time, the productivity that can be realized from these systems has not kept pace. To deal with the complexity of modern systems, software developers are increasingly dependent on specialized development tools such as security profilers, memory leak identifiers, data flight recorders, and dynamic type analysis. Many of these tools require full-system data which covers multiple interacting threads, processes, and processors. Reducing the performance penalty and complexity of these software tools is critical to those developing next-generation applications, and many researchers have proposed adding specialized hardware to assist in profiling and introspection. Unfortunately, while this additional hardware would be incredibly beneficial to developers, the cost of this hardware must be paid on every single die that is manufactured. In this paper, we argue that a new way to attack this problem is with the addition of specialized analysis hardware built on separate active layers stacked vertically on the processor die using 3D IC technology. This provides a modular \"snap-on\" functionality that could be included with developer systems, and omitted from consumer systems to keep the cost impact to a minimum. In this paper we describe the advantage of using inter-die vias for introspection and we quantify the impact they can have in terms of the area, power, temperature, and routability of the resulting systems. We show that hardware stubs could be inserted into commodity processors at design time that would allow analysis layers to be bonded to development chips, and that these stubs would increase area and power by no more than 0.021 mm² and 0.9% respectively.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127421200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}