Sangyeun Cho, Joel R. Martin, Ruibin Xu, Mohammad Hammoud, R. Melhem
{"title":"CA-RAM: A High-Performance Memory Substrate for Search-Intensive Applications","authors":"Sangyeun Cho, Joel R. Martin, Ruibin Xu, Mohammad Hammoud, R. Melhem","doi":"10.1109/ISPASS.2007.363753","DOIUrl":"https://doi.org/10.1109/ISPASS.2007.363753","url":null,"abstract":"This paper proposes a specialized memory structure called CA-RAM (content addressable random access memory) to accelerate search operations present in many important real-world applications. Search operations can occupy a significant portion of total execution time and energy consumption, while posing a difficult performance problem to tackle using traditional memory hierarchy concepts. In essence, CA-RAM is a direct hardware implementation of the well-known hashing technique. Searchable records are stored in CA-RAM at a location determined by a hash function, defined on their search key. After a database has been built, looking up a record in CA-RAM typically involves a single memory access followed by a parallel key matching operation. Compared with a conventional CAM (content addressable memory) solution, CA-RAM capitalizes on dense SRAM and DRAM designs, and achieves comparable search performance while occupying much smaller area and consuming significantly less power. This paper presents detailed design aspects of CA-RAM, to be integrated in future general-purpose and application-specific processors and systems. To further motivate and justify our approach, we present two real examples of using CA-RAM to build a high-performance search accelerator targeting: IP address lookup in core routers and trigram lookup in a large speech recognition system","PeriodicalId":439151,"journal":{"name":"2007 IEEE International Symposium on Performance Analysis of Systems & Software","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121295222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Last-Touch Correlated Data Streaming","authors":"M. Ferdman, B. Falsafi","doi":"10.1109/ISPASS.2007.363741","DOIUrl":"https://doi.org/10.1109/ISPASS.2007.363741","url":null,"abstract":"Recent research advocates address-correlating predictors to identify cache block addresses for prefetch. Unfortunately, address-correlating predictors require correlation data storage proportional in size to a program's active memory footprint. As a result, current proposals for this class of predictor are either limited in coverage due to constrained on-chip storage requirements or limited in prediction lookahead due to long off-chip correlation data lookup. In this paper, we propose last-touch correlated data streaming (LT-cords), a practical address-correlating predictor. The key idea of LT-cords is to record correlation data off chip in the order they will be used and stream them into a practically-sized on-chip table shortly before they are needed, thereby obviating the need for scalable on-chip tables and enabling low-latency lookup. We use cycle-accurate simulation of an 8-way out-of-order superscalar processor to show that: (1) LT-cords with 214KB of on-chip storage can achieve the same coverage as a last-touch predictor with unlimited storage, without sacrificing predictor lookahead, and (2) LT-cords improves performance by 60% on average and 385% at best in the benchmarks studied","PeriodicalId":439151,"journal":{"name":"2007 IEEE International Symposium on Performance Analysis of Systems & Software","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130188911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seongbeom Kim, Fang Liu, Yan Solihin, R. Iyer, Li Zhao, W. Cohen
{"title":"Accelerating Full-System Simulation through Characterizing and Predicting Operating System Performance","authors":"Seongbeom Kim, Fang Liu, Yan Solihin, R. Iyer, Li Zhao, W. Cohen","doi":"10.1109/ISPASS.2007.363731","DOIUrl":"https://doi.org/10.1109/ISPASS.2007.363731","url":null,"abstract":"The ongoing trend of increasing computer hardware and software complexity has resulted in the increase in complexity and overheads of cycle-accurate processor system simulation, especially in full-system simulation which not only simulates user applications, but also the operating system (OS) and system libraries. This paper seeks to address how to accelerate full-system simulation through studying, characterizing, and predicting the performance behavior of OS services. Through studying the performance behavior of OS services, we found that each OS service exhibits multiple but limited behavior points that are repeated frequently. OS services also exhibit application-specific performance behavior and largely irregular patterns of occurrences. We exploit the observation to speed up full system simulation. A simulation run is divided into two non-overlapping periods: a learning period in which performance behavior of instances of an OS service are characterized and recorded, and a prediction period in which detailed simulation is replaced with a much faster emulation mode. During a prediction period, the behavior signature of an instance of an OS service is obtained through emulation while performance of the instance is predicted based on its signature and records of the OS service's past performance behavior. Statistically-rigorous algorithms are used to determine when to switch between learning and prediction periods. We test our simulation acceleration method with a set of OS-intensive applications and a recent version of Linux OS running on top of a detailed processor and memory hierarchy model implemented on Simics, a popular full-system simulator. On average, the method needs the learning periods to cover only 11% of OS service invocations in order to produce highly accurate performance estimates. This leads to an estimated simulation speedup of 4.9times, with an average performance prediction error of only 3.2%, and a worst case error of 4.2%","PeriodicalId":439151,"journal":{"name":"2007 IEEE International Symposium on Performance Analysis of Systems & Software","volume":"23 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134633909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Erez Perelman, Jeremy Lau, H. Patil, A. Jaleel, Greg Hamerly, B. Calder
{"title":"Cross Binary Simulation Points","authors":"Erez Perelman, Jeremy Lau, H. Patil, A. Jaleel, Greg Hamerly, B. Calder","doi":"10.1109/ISPASS.2007.363748","DOIUrl":"https://doi.org/10.1109/ISPASS.2007.363748","url":null,"abstract":"Architectures are usually compared by running the same workload on each architecture and comparing performance. When a single compiled binary of a program is executed on many different architectures, techniques like SimPoint can be used to find a small set of samples that represent the majority of the program's execution. Architectures can be compared by simulating their behavior on the code samples selected by SimPoint, to quickly determine which architecture has the best performance. Architectural design space exploration becomes more difficult when different binaries must be used for the same program. These cases arise when evaluating architectures that include ISA extensions, and when evaluating compiler optimizations. This problem domain is the focus of our paper. When multiple binaries are used to evaluate a program, one approach is to create a separate set of simulation points for each binary. This approach works reasonably well for many applications, but breaks down when the simulation points chosen for the different binaries emphasize different parts of the program's execution. This problem can be avoided if simulation points are selected consistently across the different binaries, to ensure that the same parts of program execution are represented in all binaries. In this paper we present an approach that finds a single set of simulation points to be used across all binaries for a single program. This allows for simulation of the same parts of program execution despite changes in the binary due to ISA changes or compiler optimizations","PeriodicalId":439151,"journal":{"name":"2007 IEEE International Symposium on Performance Analysis of Systems & Software","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126224830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}