ASPLOS VI Latest Publications

Fine-grain access control for distributed shared memory
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195575
I. Schoinas, B. Falsafi, A. Lebeck, S. Reinhardt, J. Larus, D. Wood
Abstract: This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require little or no additional hardware. These techniques permit efficient implementation of shared memory on a wide range of parallel systems, thereby providing shared-memory codes with a portability previously limited to message passing.
This paper categorizes techniques based on where access control is enforced and where access conflicts are handled. We incorporated three techniques that require no additional hardware into Blizzard, a system that supports distributed shared memory on the CM-5. The first adds a software lookup before each shared-memory reference by modifying the program's executable. The second uses the memory's error correcting code (ECC) as cache-block valid bits. The third is a hybrid. The software technique ranged from slightly faster to two times slower than the ECC approach. Blizzard's performance is roughly comparable to a hardware shared-memory machine. These results argue that clusters of workstations or personal computers with networks comparable to the CM-5's will be able to support the same shared-memory interfaces as supercomputers.
Citations: 292
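The software-lookup technique in this abstract inserts a state check before each shared-memory reference by rewriting the executable. The C sketch below is only an illustration of that check, under assumed names (state_table, shared_load, and handle_read_miss are hypothetical, and the 32-byte block size is arbitrary); a real system instruments loads and stores in the binary and invokes a full coherence protocol on a miss.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fine-grain access-control sketch: every shared load first
 * consults a per-block state table, as the software-lookup technique does
 * after binary rewriting.  Names and sizes are illustrative only. */

#define BLOCK_SHIFT 5                 /* 32-byte "cache blocks" */
#define NUM_BLOCKS  1024

enum block_state { STATE_INVALID, STATE_READONLY, STATE_READWRITE };

static uint8_t  state_table[NUM_BLOCKS];
static uint32_t shared_region[NUM_BLOCKS * (1 << BLOCK_SHIFT) / sizeof(uint32_t)];

/* Stand-in for the coherence-protocol miss handler: a real system would
 * fetch the block from its home node before marking it readable. */
static void handle_read_miss(size_t block)
{
    state_table[block] = STATE_READONLY;
}

/* The check that binary rewriting would insert before each shared load. */
static uint32_t shared_load(size_t word_index)
{
    size_t byte_addr = word_index * sizeof(uint32_t);
    size_t block = byte_addr >> BLOCK_SHIFT;

    if (state_table[block] == STATE_INVALID)
        handle_read_miss(block);      /* software "access fault" */
    return shared_region[word_index];
}

int main(void)
{
    printf("value = %u\n", shared_load(7));
    return 0;
}
```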
Surpassing the TLB performance of superpages with less operating system support
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195531
Madhusudhan Talluri, M. Hill
Abstract: Many commercial microprocessor architectures have added translation lookaside buffer (TLB) support for superpages. Superpages differ from segments because their size must be a power of two multiple of the base page size and they must be aligned in both virtual and physical address spaces. Very large superpages (e.g., 1MB) are clearly useful for mapping special structures, such as kernel data or frame buffers. This paper considers the architectural and operating system support required to exploit medium-sized superpages (e.g., 64KB, i.e., sixteen times a 4KB base page size). First, we show that superpages improve TLB performance only after invasive operating system modifications that introduce considerable overhead.
We then propose two subblock TLB designs as alternate ways to improve TLB performance. Analogous to a subblock cache, a complete-subblock TLB associates a tag with a superpage-sized region but has valid bits, physical page number, attributes, etc., for each possible base page mapping. A partial-subblock TLB entry is much smaller than a complete-subblock TLB entry, because it shares physical page number and attribute fields across base page mappings. A drawback of a partial-subblock TLB is that base page mappings can share a TLB entry only if they map to consecutive physical pages and have the same attributes. We propose a physical memory allocation algorithm, page reservation, that makes this sharing more likely. When page reservation is used, experimental results show partial-subblock TLBs perform better than superpage TLBs, while requiring simpler operating system changes. If operating system changes are inappropriate, however, complete-subblock TLBs perform best.
Citations: 251
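As a rough illustration of the partial-subblock TLB entry described above, the C sketch below models one entry with a single region tag, a shared physical page number, shared attributes, and one valid bit per base page, so a base page can hit only if it maps to the corresponding consecutive physical page. Field names, sizes, and the lookup helper are assumptions, not the paper's hardware design.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Sketch of a partial-subblock TLB entry: 4 KB base pages, 16 base pages per
 * superpage-sized region (the paper's 64 KB example).  One physical page
 * number and one attribute set are shared by all base pages in the region. */

#define BASE_PAGE_SHIFT 12
#define SUBBLOCK_FACTOR 16            /* 64 KB region = 16 x 4 KB */

struct partial_subblock_entry {
    uint64_t region_tag;              /* virtual page number / 16 */
    uint16_t valid_bits;              /* one valid bit per base page */
    uint64_t base_ppn;                /* PPN of base page 0 of the region */
    uint32_t attributes;              /* shared protection/cache attributes */
};

/* Returns true on a TLB hit and fills *ppn with the translation. */
static bool tlb_lookup(const struct partial_subblock_entry *e,
                       uint64_t vaddr, uint64_t *ppn)
{
    uint64_t vpn = vaddr >> BASE_PAGE_SHIFT;
    uint64_t region = vpn / SUBBLOCK_FACTOR;
    unsigned subpage = (unsigned)(vpn % SUBBLOCK_FACTOR);

    if (e->region_tag != region)
        return false;                          /* wrong region */
    if (!(e->valid_bits & (1u << subpage)))
        return false;                          /* base page not mapped here */
    *ppn = e->base_ppn + subpage;              /* consecutive physical pages */
    return true;
}

int main(void)
{
    struct partial_subblock_entry e = {
        .region_tag = 0x12340 / SUBBLOCK_FACTOR,
        .valid_bits = 0xFFFF,                  /* all 16 base pages valid */
        .base_ppn   = 0x8000,
        .attributes = 0,
    };
    uint64_t ppn;
    uint64_t vaddr = (uint64_t)0x12345 << BASE_PAGE_SHIFT;

    if (tlb_lookup(&e, vaddr, &ppn))
        printf("vpn 0x12345 -> ppn 0x%llx\n", (unsigned long long)ppn);
    return 0;
}
```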
Contrasting characteristics and cache performance of technical and multi-user commercial workloads
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195524
A.M.G. Maynard, Colette M. Donnelly, B. Olszewski
Abstract: Experience has shown that many widely used benchmarks are poor predictors of the performance of systems running commercial applications. Research into this anomaly has long been hampered by a lack of address traces from representative multi-user commercial workloads. This paper presents research, using traces of industry-standard commercial benchmarks, which examines the characteristic differences between technical and commercial workloads and illustrates how those differences affect cache performance.
Commercial and technical environments differ in their respective branch behavior, operating system activity, I/O, and dispatching characteristics. A wide range of uniprocessor instruction and data cache geometries were studied. The instruction cache results for commercial workloads demonstrate that instruction cache performance can no longer be neglected because these workloads have much larger code working sets than technical applications. For database workloads, a breakdown of kernel and user behavior reveals that the application component can exhibit behavior similar to the operating system and therefore, can experience miss rates equally high. This paper also indicates that "dispatching" or process switching characteristics must be considered when designing level-two caches. The data presented shows that increasing the associativity of second-level caches can reduce miss rates significantly. Overall, the results of this research should help system designers choose a cache configuration that will perform well in commercial markets.
Citations: 226
Performance of a hardware-assisted real-time garbage collector
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195504
William J. Schmidt, K. Nilsen
Abstract: Hardware-assisted real-time garbage collection offers high throughput and small worst-case bounds on the times required to allocate dynamic objects and to access the memory contained within previously allocated objects. Whether the proposed technology is cost effective depends on various choices between configuration alternatives. This paper reports the performance of several different configurations of the hardware-assisted real-time garbage collection system subjected to several different workloads. Reported measurements demonstrate that hardware-assisted real-time garbage collection is a viable alternative to traditional explicit memory management techniques, even for low-level languages like C++.
Citations: 60
Trap-driven simulation with Tapeworm II
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195521
R. Uhlig, D. Nagle, T. Mudge, S. Sechrest
Abstract: Tapeworm II is a software-based simulation tool that evaluates the cache and TLB performance of multiple-task and operating system intensive workloads. Tapeworm resides in an OS kernel and causes a host machine's hardware to drive simulations with kernel traps instead of with address traces, as is conventionally done. This allows Tapeworm to quickly and accurately capture complete memory referencing behavior with a limited degradation in overall system performance. This paper compares trap-driven simulation, as implemented in Tapeworm, with the more common technique of trace-driven memory simulation with respect to speed, accuracy, portability and flexibility.
Citations: 34
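Tapeworm replaces address traces with kernel traps. The sketch below is a user-level analogue of that idea on a POSIX system, not Tapeworm's in-kernel implementation: pages are mapped inaccessible, the first touch of each page raises SIGSEGV, and the handler stands in for the simulator hook before unprotecting the page. Calling mprotect from a signal handler is tolerated on common systems but is not guaranteed by POSIX.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* User-level analogue of trap-driven simulation: the "simulated memory"
 * starts out inaccessible, so the first touch of each page traps.  The
 * handler records the event (where a simulator would run) and then
 * unprotects the page so the program continues. */

static char *region;
static size_t page_size;
static size_t region_bytes;
static volatile long trap_count;

static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr < region || addr >= region + region_bytes)
        _exit(1);                            /* a real crash, not our trap */
    char *page = region + ((size_t)(addr - region) & ~(page_size - 1));
    trap_count++;                            /* simulator hook would go here */
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region_bytes = 8 * page_size;
    region = mmap(NULL, region_bytes, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return 1;

    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    for (size_t i = 0; i < region_bytes; i += 64)
        region[i] = 1;                       /* first touch of each page traps */

    printf("pages touched (traps taken): %ld\n", trap_count);
    return 0;
}
```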
The performance impact of flexibility in the Stanford FLASH multiprocessor
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195569
M. Heinrich, J. Kuskin, D. Ofelt, J. Heinlein, J. Baxter, J. Singh, R. Simoni, K. Gharachorloo, D. Nakahira, M. Horowitz, Anoop Gupta, M. Rosenblum, J. Hennessy
Abstract: A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%–12% slower than the idealized machine.
Citations: 136
Integration of message passing and shared memory in the Stanford FLASH multiprocessor
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195494
J. Heinlein, K. Gharachorloo, Scott Dresser, Anoop Gupta
Abstract: The advantages of using message passing over shared memory for certain types of communication and synchronization have provided an incentive to integrate both models within a single architecture. A key goal of the FLASH (FLexible Architecture for SHared memory) project at Stanford is to achieve this integration while maintaining a simple and efficient design. This paper presents the hardware and software mechanisms in FLASH to support various message passing protocols. We achieve low overhead message passing by delegating protocol functionality to the programmable node controllers in FLASH and by providing direct user-level access to this messaging subsystem. In contrast to most earlier work, we provide an integrated solution that handles the interaction of the messaging protocols with virtual memory, protected multiprogramming, and cache coherence. Detailed simulation studies indicate that this system can sustain message-transfer rates of several hundred megabytes per second, effectively utilizing projected network bandwidths for next generation multiprocessors.
Citations: 119
Where is time spent in message-passing and shared-memory programs?
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195501
S. Chandra, J. Larus, Anne Rogers
Abstract: Message passing and shared memory are two techniques parallel programs use for coordination and communication. This paper studies the strengths and weaknesses of these two mechanisms by comparing equivalent, well-written message-passing and shared-memory programs running on similar hardware. To ensure that our measurements are comparable, we produced two carefully tuned versions of each program and measured them on closely-related simulators of a message-passing and a shared-memory machine, both of which are based on the same underlying hardware assumptions.
We examined the behavior and performance of each program carefully. Although the cost of computation in each pair of programs was similar, synchronization and communication differed greatly. We found that message-passing's advantage over shared-memory is not clear-cut. Three of the four shared-memory programs ran at roughly the same speed as their message-passing equivalent, even though their communication patterns were different.
Citations: 94
Improving the accuracy of static branch prediction using branch correlation
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195549
C. Young, Michael D. Smith
Abstract: Recent work in history-based branch prediction uses novel hardware structures to capture branch correlation and increase branch prediction accuracy. We present a profile-based code transformation that exploits branch correlation to improve the accuracy of static branch prediction schemes. Our general method encodes branch history information in the program counter through the duplication and placement of program basic blocks. For correlation histories of eight branches, our experimental results achieve up to a 14.7% improvement in prediction accuracy over conventional profile-based prediction without any increase in the dynamic instruction count of our benchmark applications. In the majority of these applications, code duplication increases code size by less than 30%. For the few applications with code segments that exhibit exponential branching paths and no branch correlation, simple compile-time heuristics can eliminate these branches as code-transformation candidates.
Citations: 132
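The transformation described above duplicates basic blocks so that branch history is encoded in the program counter and each copy of a correlated branch can carry its own static prediction. The C fragment below is a hand-written analogue of that effect at the source level, using the GCC/Clang __builtin_expect extension as a stand-in for a per-copy static hint; the paper's method actually operates on profiled executables rather than source code.

```c
#include <stdio.h>

/* Suppose the second branch (y > 0) is almost always taken when the first
 * branch (x > 0) was taken, and almost never taken otherwise.  A single
 * static hint must pick one behavior; duplicating the block that contains
 * the second branch gives each history path its own copy, and each copy can
 * carry its own hint. */

#define LIKELY(c)   __builtin_expect(!!(c), 1)
#define UNLIKELY(c) __builtin_expect(!!(c), 0)

/* Original code: one copy of the correlated branch, one hint for all paths. */
static int original(int x, int y)
{
    int r = 0;
    if (x > 0)
        r += 1;
    if (y > 0)                  /* correlated with (x > 0); no good single hint */
        r += 2;
    return r;
}

/* Transformed code: the branch is duplicated, so which copy executes encodes
 * the outcome of the earlier branch, and each copy gets its own prediction. */
static int transformed(int x, int y)
{
    int r = 0;
    if (x > 0) {
        r += 1;
        if (LIKELY(y > 0))      /* on this path the branch is usually taken */
            r += 2;
    } else {
        if (UNLIKELY(y > 0))    /* on this path it is usually not taken */
            r += 2;
    }
    return r;
}

int main(void)
{
    printf("%d %d\n", original(3, 4), transformed(3, 4));
    return 0;
}
```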
DCG: an efficient, retargetable dynamic code generation system
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195567
D. Engler, Todd A. Proebsting
Abstract: Dynamic code generation allows aggressive optimization through the use of runtime information. Previous systems typically relied on ad hoc code generators that were not designed for retargetability, and did not shield the client from machine-specific details. We present a system, dcg, that allows clients to specify dynamically generated code in a machine-independent manner. Our one-pass code generator is easily retargeted and extremely efficient (code generation costs approximately 350 instructions per generated instruction). Experiments show that dynamic code generation increases some application speeds by over an order of magnitude.
Citations: 103
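dcg exposes a machine-independent interface, which the abstract summarizes; the sketch below does not reproduce that interface. It only demonstrates the underlying mechanism of dynamic code generation on x86-64 under POSIX: write a few instruction bytes into an executable buffer at run time and call them. Systems that enforce W^X will reject the writable-and-executable mapping, so treat this strictly as an illustration.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Minimal runtime code generation, shown at the raw-machine-code level for
 * x86-64.  Build an instruction sequence in memory at run time, then call it
 * as a function.  dcg instead lets clients describe the code to generate in
 * a machine-independent way and retargets the back end. */

typedef int (*int_fn)(void);

int main(void)
{
    /* mov eax, 42 ; ret */
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    memcpy(buf, code, sizeof code);          /* "generate" the code */
    int_fn generated = (int_fn)buf;
    printf("generated function returned %d\n", generated());

    munmap(buf, 4096);
    return 0;
}
```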