{"title":"Integrated predicated and speculative execution in the IMPACT EPIC architecture","authors":"David I. August, D. Connors, S. Mahlke, J. Sias, K. Crozier, B. Cheng, Patrick R. Eaton, Q. B. Olaniran, W. Hwu","doi":"10.1109/ISCA.1998.694777","DOIUrl":"https://doi.org/10.1109/ISCA.1998.694777","url":null,"abstract":"Explicitly Parallel Instruction Computing (EPIC) architectures require the compiler to express program instruction level parallelism directly to the hardware. EPIC techniques which enable the compiler to represent control speculation, data dependence speculation, and predication have individually been shown to be very effective. However these techniques have not been studied in combination with each other. This paper presents the IMPACT EPIC Architecture to address the issues involved in designing processors based on these EPIC concepts. In particular we focus on new execution and recovery models in which microarchitectural support for predicated execution is also used to enable efficient recovery from exceptions caused by speculatively executed instructions. This paper demonstrates that a coherent framework to integrate the three techniques can be elegantly designed to achieve much better performance than each individual technique could alone provide.","PeriodicalId":393075,"journal":{"name":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117028146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effects of architectural and technological advances on the HP/Convex Exemplar's memory and communication performance","authors":"Gheith A. Abandah, E. Davidson","doi":"10.1145/279358.279401","DOIUrl":"https://doi.org/10.1145/279358.279401","url":null,"abstract":"Advances in microarchitecture, packaging, and manufacturing processes enable designers to build new systems with higher performance and scalability. Using microbenchmark techniques, we contrast the memory and communication performance of two generations of the HP/Convex Exemplar scalable parallel processing system. The SPP1000 and SPP2000 have significant architectural and implementation differences, but maintain upward binary compatibility. The SPP2000 employs manufacturing and packaging advances to obtain shorter system interconnects with wider data paths and improved functionality thereby reducing the latency and increasing the bandwidth of remote communication. Although the memory latency is not significantly improved, newer out-of-order execution processors coupled with nonblocking caches achieve much higher memory bandwidth. The SPP2000 has a richer system interconnect topology that allows scalability to a larger number of processors. The SPP2000 also employs innovations in its coherence protocols to improve synchronization and communication performance. This paper characterizes the performance effects of these changes, and identifies some remaining inefficiencies, in the cache coherence protocol and the node configuration, that future systems should address.","PeriodicalId":393075,"journal":{"name":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123201549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Branch prediction based on universal data compression algorithms","authors":"E. Federovsky, M. Feder, S. Weiss","doi":"10.1145/279358.279370","DOIUrl":"https://doi.org/10.1145/279358.279370","url":null,"abstract":"Data compression and prediction are closely related. Thus prediction methods based on data compression algorithms have been suggested for the branch prediction problem. In this work we consider two universal compression algorithms: prediction by partial matching (PPM), and a recently developed method, context tree weighting (CTW). We describe the prediction algorithms induced by these methods. We also suggest adaptive algorithms variations of the basic methods that attempt to fit limited memory constraints and to match the non-stationary nature of the branch sequence. Furthermore, we show how to incorporate address information and to combine other relevant data. Finally, we present simulation results for selected programs from the SPECint95, SYSmark/32, SYSmark/NT, and transactional processing benchmarks. Our results are most promising in programs with difficult to predict branch behavior.","PeriodicalId":393075,"journal":{"name":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131420506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low load latency through sum-addressed memory (SAM)","authors":"William L. Lynch, G. Lauterbach, J. Chamdani","doi":"10.1145/279358.279406","DOIUrl":"https://doi.org/10.1145/279358.279406","url":null,"abstract":"Load latency contributes significantly to execution time. Because most cache accesses hit, cache-hit latency becomes an important component of expected load latency. Most modern microprocessors have base+offset addressing loads; thus effective cache-hit latency includes an addition as well as the RAM access. This paper introduces a new technique used in the UltraSPARC III microprocessor Sum-Addressed Memory (SAM), which performs true addition using the decoder of the RAM array, with very low latency. We compare SAM with other methods for reducing the add part of load latency. These methods include sum-prediction with recovery, and bitwise indexing with duplicate-tolerance. The results demonstrate the superior performance of SAM.","PeriodicalId":393075,"journal":{"name":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129745811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An analysis of database workload performance on simultaneous multithreaded processors","authors":"J. Lo, L. Barroso, S. Eggers, K. Gharachorloo, H. Levy, S. Parekh","doi":"10.1145/279358.279367","DOIUrl":"https://doi.org/10.1145/279358.279367","url":null,"abstract":"Simultaneous multithreading (SMT) is an architectural technique in which the processor issues multiple instructions from multiple threads each cycle. While SMT has been shown to be effective on scientific workloads, its performance on database systems is still an open question. In particular, database systems have poor cache performance, and the addition of multithreading has the potential to exacerbate cache conflicts. This paper examines database performance on SMT processors using traces of the Oracle database management system. Our research makes three contributions. First, it characterizes the memory-system behavior of database systems running on-line transaction processing and decision support system workloads. Our data show that while DBMS workloads have large memory footprints, there is substantial data reuse in a small, cacheable \"critical\" working set. Second, we show that the additional data cache conflicts caused by simultaneous-multithreaded instruction scheduling can be nearly eliminated by the proper choice of software-directed policies for virtual-to-physical page mapping and per-process address offsetting. Our results demonstrate that with the best policy choices, D-cache miss rates on an 8-context SMT are roughly equivalent to those on a single-threaded superscalar. Multithreading also leads to better interthread instruction cache sharing, reducing I-cache miss rates by up to 35%. Third, we show that SMT's latency tolerance is highly effective for database applications. For example, using a memory-intensive OLTP workload, an 8-context SMT processor achieves a 3-fold increase in instruction throughput over a single-threaded superscalar with similar resources.","PeriodicalId":393075,"journal":{"name":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127124987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Increasing TLB reach using superpages backed by shadow memory","authors":"M. Swanson, L. Stoller, J. Carter","doi":"10.1109/ISCA.1998.694775","DOIUrl":"https://doi.org/10.1109/ISCA.1998.694775","url":null,"abstract":"The amount of memory that can be accessed without causing a translation lookaside buffer (TLB) fault, the reach of a TLB, is failing to keep pace with the increasingly large working sets of applications. We propose to extend TLB reach via a novel Memory Controller TLB (MTLB) that lets us aggressively create superpages from non-contiguous, unaligned regions of physical memory. This flexibility increases the OS's ability to use superpages on arbitrary application data. The MTLB supports shadow pages, regions of physical address space for which the MTLB remaps accesses to \"real\" physical pages. The MTLB preserves per-base-page referenced and dirty bits, which enables the OS to swap shadow-backed superpages a page at a time, unlike conventional superpages. Simulation of five applications, including two SPECint95 benchmarks, demonstrated that a modest-sized MTLB improves performance of applications with moderate-to-high TLB miss rates by 5-20%. Simulation also showed that this mechanism can more than double the effective reach of a processor TLB with no modification to the processor MMU.","PeriodicalId":393075,"journal":{"name":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114646711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic history-length fitting: a third level of adaptivity for branch prediction","authors":"Toni Juan, K. Sanjeevan, J. Navarro","doi":"10.1109/ISCA.1998.694771","DOIUrl":"https://doi.org/10.1109/ISCA.1998.694771","url":null,"abstract":"Accurate branch prediction is essential for obtaining high performance in pipelined superscalar processors that execute instructions speculatively. Some of the best current predictors combine a part of the branch address with a fixed amount of global history of branch outcomes in order to make a prediction. These predictors cannot perform uniformly well across all workloads because the best amount of history to be used depends on the code, the input data and the frequency of context switches. Consequently, all predictors that use a fixed history length are therefore unable to perform up to their maximum potential. We introduce a method-called DHLF-that dynamically determines the optimum history length during execution, adapting to the specific requirements of any code, input data and system workload. Our proposal adds an extra level of adaptivity to two-level adaptive branch predictors. The DHLF method can be applied to any one of the predictors that combine global branch history with the branch address. We apply the DHLF method to gshare (dhlf-gshare) and obtain near-optimal results for all SPECint95 benchmarks, with and without context switches. Some results are also presented for gskewed (dhlf-gskewed), confirming that other predictors can benefit from our proposal.","PeriodicalId":393075,"journal":{"name":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","volume":"105 12S1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122240489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Switcherland: a QoS communication architecture for workstation clusters","authors":"H. Eberle, E. Oertli","doi":"10.1145/279358.279373","DOIUrl":"https://doi.org/10.1145/279358.279373","url":null,"abstract":"Computer systems have become powerful enough to process continuous data streams such as video or animated graphics. While processing power and communication bandwidth of today's systems typically are sufficient, quality of service (QoS) guarantees as required for handling such data types cannot be provided by these systems in adequate ways. We present Switcherland, a scalable communication architecture based on crossbar switches that provides QoS guarantees for workstation clusters in the form of reserved bandwidth and bounded transmission delays. Similar to the ATM technology Switcherland provides QoS guarantees with the help of service classes, that is, data transfers are characterized as variable bit rare traffic or constant bit rate traffic. However, unlike LAN technologies, Switcherland is optimized for cluster computing in that (i) it serves as a backplane interconnection fabric as well as a LAN, (ii) it extends support for service classes by also covering the end nodes of the network, (iii) it provides low latency in the order of one microsecond per switch, and (iv) it uses a communication model based on a global memory to simplify programming.","PeriodicalId":393075,"journal":{"name":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129809284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting spatial locality in data caches using spatial footprints","authors":"Sanjeev Kumar, C. Wilkerson","doi":"10.1109/ISCA.1998.694794","DOIUrl":"https://doi.org/10.1109/ISCA.1998.694794","url":null,"abstract":"Modern cache designs exploit spatial locality by fetching large blocks of data called cache lines on a cache miss. Subsequent references to words within the same cache line result in cache hits. Although this approach benefits from spatial locality, less than half of the data brought into the cache gets used before eviction. The unused portion of the cache line negatively impacts performance by wasting bandwidth and polluting the cache by replacing potentially useful data that would otherwise remain in the cache. This paper describes an alternative approach to exploit spatial locality available in data caches. On a cache miss, our mechanism, called Spatial Footprint Predictor (SFP), predicts which portions of a cache block will get used before getting evicted. The high accuracy of the predictor allows us to exploit spatial locality exhibited in larger blocks of data yielding better miss ratios without significantly impacting the memory access latencies. Our evaluation of this mechanism shows that the miss rate of the cache is improved, on average, by 18% in addition to a significant reduction in the bandwidth requirement.","PeriodicalId":393075,"journal":{"name":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124305864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory system characterization of commercial workloads","authors":"L. Barroso, K. Gharachorloo, Edouard Bugnion","doi":"10.1109/ISCA.1998.694758","DOIUrl":"https://doi.org/10.1109/ISCA.1998.694758","url":null,"abstract":"This paper presents a detailed performance study of three important classes of commercial workloads: online transaction processing (OLTP), decision support systems (DSS), and Web index search. We use the Oracle commercial database engine for our OLTP and DSS workloads, and the AltaVista search engine for our Web index search workload. This study characterizes the memory system behavior of these workloads through a large number of architectural experiments on Alpha multiprocessors augmented with full system simulations to determine the impact of architectural trends. We also identify a set of simplifications that make these workloads more amenable to monitoring and simulation without affecting representative memory system behavior. We observe that systems optimized for OLTP versus DSS and index search workloads may lead to diverging designs, specifically in the size and speed requirements for off-chip caches.","PeriodicalId":393075,"journal":{"name":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116686261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}