ASPLOS VI Latest Publications

Fine-grain access control for distributed shared memory
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195575
I. Schoinas, B. Falsafi, A. Lebeck, S. Reinhardt, J. Larus, D. Wood
Abstract: This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require little or no additional hardware. These techniques permit efficient implementation of shared memory on a wide range of parallel systems, thereby providing shared-memory codes with a portability previously limited to message passing.
This paper categorizes techniques based on where access control is enforced and where access conflicts are handled. We incorporated three techniques that require no additional hardware into Blizzard, a system that supports distributed shared memory on the CM-5. The first adds a software lookup before each shared-memory reference by modifying the program's executable. The second uses the memory's error correcting code (ECC) as cache-block valid bits. The third is a hybrid. The software technique ranged from slightly faster to two times slower than the ECC approach. Blizzard's performance is roughly comparable to a hardware shared-memory machine. These results argue that clusters of workstations or personal computers with networks comparable to the CM-5's will be able to support the same shared-memory interfaces as supercomputers.
Citations: 292
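The software-lookup technique in this abstract inserts a state check before each shared-memory reference by rewriting the executable. The C sketch below is only an illustration of that check, under assumed names (state_table, shared_load, and handle_read_miss are hypothetical, and the 32-byte block size is arbitrary); a real system instruments loads and stores in the binary and invokes a full coherence protocol on a miss.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fine-grain access-control sketch: every shared load first
 * consults a per-block state table, as the software-lookup technique does
 * after binary rewriting.  Names and sizes are illustrative only. */

#define BLOCK_SHIFT 5                 /* 32-byte "cache blocks" */
#define NUM_BLOCKS  1024

enum block_state { STATE_INVALID, STATE_READONLY, STATE_READWRITE };

static uint8_t  state_table[NUM_BLOCKS];
static uint32_t shared_region[NUM_BLOCKS * (1 << BLOCK_SHIFT) / sizeof(uint32_t)];

/* Stand-in for the coherence-protocol miss handler: a real system would
 * fetch the block from its home node before marking it readable. */
static void handle_read_miss(size_t block)
{
    state_table[block] = STATE_READONLY;
}

/* The check that binary rewriting would insert before each shared load. */
static uint32_t shared_load(size_t word_index)
{
    size_t byte_addr = word_index * sizeof(uint32_t);
    size_t block = byte_addr >> BLOCK_SHIFT;

    if (state_table[block] == STATE_INVALID)
        handle_read_miss(block);      /* software "access fault" */
    return shared_region[word_index];
}

int main(void)
{
    printf("value = %u\n", shared_load(7));
    return 0;
}
```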
Surpassing the TLB performance of superpages with less operating system support
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195531
Madhusudhan Talluri, M. Hill
Abstract: Many commercial microprocessor architectures have added translation lookaside buffer (TLB) support for superpages. Superpages differ from segments because their size must be a power of two multiple of the base page size and they must be aligned in both virtual and physical address spaces. Very large superpages (e.g., 1MB) are clearly useful for mapping special structures, such as kernel data or frame buffers. This paper considers the architectural and operating system support required to exploit medium-sized superpages (e.g., 64KB, i.e., sixteen times a 4KB base page size). First, we show that superpages improve TLB performance only after invasive operating system modifications that introduce considerable overhead.
We then propose two subblock TLB designs as alternate ways to improve TLB performance. Analogous to a subblock cache, a complete-subblock TLB associates a tag with a superpage-sized region but has valid bits, physical page number, attributes, etc., for each possible base page mapping. A partial-subblock TLB entry is much smaller than a complete-subblock TLB entry, because it shares physical page number and attribute fields across base page mappings. A drawback of a partial-subblock TLB is that base page mappings can share a TLB entry only if they map to consecutive physical pages and have the same attributes. We propose a physical memory allocation algorithm, page reservation, that makes this sharing more likely. When page reservation is used, experimental results show partial-subblock TLBs perform better than superpage TLBs, while requiring simpler operating system changes. If operating system changes are inappropriate, however, complete-subblock TLBs perform best.
Citations: 251
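As a rough illustration of the partial-subblock TLB entry described above, the C sketch below models one entry with a single region tag, a shared physical page number, shared attributes, and one valid bit per base page, so a base page can hit only if it maps to the corresponding consecutive physical page. Field names, sizes, and the lookup helper are assumptions, not the paper's hardware design.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Sketch of a partial-subblock TLB entry: 4 KB base pages, 16 base pages per
 * superpage-sized region (the paper's 64 KB example).  One physical page
 * number and one attribute set are shared by all base pages in the region. */

#define BASE_PAGE_SHIFT 12
#define SUBBLOCK_FACTOR 16            /* 64 KB region = 16 x 4 KB */

struct partial_subblock_entry {
    uint64_t region_tag;              /* virtual page number / 16 */
    uint16_t valid_bits;              /* one valid bit per base page */
    uint64_t base_ppn;                /* PPN of base page 0 of the region */
    uint32_t attributes;              /* shared protection/cache attributes */
};

/* Returns true on a TLB hit and fills *ppn with the translation. */
static bool tlb_lookup(const struct partial_subblock_entry *e,
                       uint64_t vaddr, uint64_t *ppn)
{
    uint64_t vpn = vaddr >> BASE_PAGE_SHIFT;
    uint64_t region = vpn / SUBBLOCK_FACTOR;
    unsigned subpage = (unsigned)(vpn % SUBBLOCK_FACTOR);

    if (e->region_tag != region)
        return false;                          /* wrong region */
    if (!(e->valid_bits & (1u << subpage)))
        return false;                          /* base page not mapped here */
    *ppn = e->base_ppn + subpage;              /* consecutive physical pages */
    return true;
}

int main(void)
{
    struct partial_subblock_entry e = {
        .region_tag = 0x12340 / SUBBLOCK_FACTOR,
        .valid_bits = 0xFFFF,                  /* all 16 base pages valid */
        .base_ppn   = 0x8000,
        .attributes = 0,
    };
    uint64_t ppn;
    uint64_t vaddr = (uint64_t)0x12345 << BASE_PAGE_SHIFT;

    if (tlb_lookup(&e, vaddr, &ppn))
        printf("vpn 0x12345 -> ppn 0x%llx\n", (unsigned long long)ppn);
    return 0;
}
```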
Contrasting characteristics and cache performance of technical and multi-user commercial workloads
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195524
A.M.G. Maynard, Colette M. Donnelly, B. Olszewski
Abstract: Experience has shown that many widely used benchmarks are poor predictors of the performance of systems running commercial applications. Research into this anomaly has long been hampered by a lack of address traces from representative multi-user commercial workloads. This paper presents research, using traces of industry-standard commercial benchmarks, which examines the characteristic differences between technical and commercial workloads and illustrates how those differences affect cache performance.
Commercial and technical environments differ in their respective branch behavior, operating system activity, I/O, and dispatching characteristics. A wide range of uniprocessor instruction and data cache geometries were studied. The instruction cache results for commercial workloads demonstrate that instruction cache performance can no longer be neglected because these workloads have much larger code working sets than technical applications. For database workloads, a breakdown of kernel and user behavior reveals that the application component can exhibit behavior similar to the operating system and therefore, can experience miss rates equally high. This paper also indicates that "dispatching" or process switching characteristics must be considered when designing level-two caches. The data presented shows that increasing the associativity of second-level caches can reduce miss rates significantly. Overall, the results of this research should help system designers choose a cache configuration that will perform well in commercial markets.
Citations: 226
Performance of a hardware-assisted real-time garbage collector
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195504
William J. Schmidt, K. Nilsen
Abstract: Hardware-assisted real-time garbage collection offers high throughput and small worst-case bounds on the times required to allocate dynamic objects and to access the memory contained within previously allocated objects. Whether the proposed technology is cost effective depends on various choices between configuration alternatives. This paper reports the performance of several different configurations of the hardware-assisted real-time garbage collection system subjected to several different workloads. Reported measurements demonstrate that hardware-assisted real-time garbage collection is a viable alternative to traditional explicit memory management techniques, even for low-level languages like C++.
Citations: 60
Trap-driven simulation with Tapeworm II
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195521
R. Uhlig, D. Nagle, T. Mudge, S. Sechrest
Abstract: Tapeworm II is a software-based simulation tool that evaluates the cache and TLB performance of multiple-task and operating system intensive workloads. Tapeworm resides in an OS kernel and causes a host machine's hardware to drive simulations with kernel traps instead of with address traces, as is conventionally done. This allows Tapeworm to quickly and accurately capture complete memory referencing behavior with a limited degradation in overall system performance. This paper compares trap-driven simulation, as implemented in Tapeworm, with the more common technique of trace-driven memory simulation with respect to speed, accuracy, portability and flexibility.
Citations: 34
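Tapeworm replaces address traces with kernel traps. The sketch below is a user-level analogue of that idea on a POSIX system, not Tapeworm's in-kernel implementation: pages are mapped inaccessible, the first touch of each page raises SIGSEGV, and the handler stands in for the simulator hook before unprotecting the page. Calling mprotect from a signal handler is tolerated on common systems but is not guaranteed by POSIX.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* User-level analogue of trap-driven simulation: the "simulated memory"
 * starts out inaccessible, so the first touch of each page traps.  The
 * handler records the event (where a simulator would run) and then
 * unprotects the page so the program continues. */

static char *region;
static size_t page_size;
static size_t region_bytes;
static volatile long trap_count;

static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr < region || addr >= region + region_bytes)
        _exit(1);                            /* a real crash, not our trap */
    char *page = region + ((size_t)(addr - region) & ~(page_size - 1));
    trap_count++;                            /* simulator hook would go here */
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region_bytes = 8 * page_size;
    region = mmap(NULL, region_bytes, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return 1;

    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    for (size_t i = 0; i < region_bytes; i += 64)
        region[i] = 1;                       /* first touch of each page traps */

    printf("pages touched (traps taken): %ld\n", trap_count);
    return 0;
}
```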
The performance impact of flexibility in the Stanford FLASH multiprocessor
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195569
M. Heinrich, J. Kuskin, D. Ofelt, J. Heinlein, J. Baxter, J. Singh, R. Simoni, K. Gharachorloo, D. Nakahira, M. Horowitz, Anoop Gupta, M. Rosenblum, J. Hennessy
Abstract: A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%–12% slower than the idealized machine.
Citations: 136
Integration of message passing and shared memory in the Stanford FLASH multiprocessor
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195494
J. Heinlein, K. Gharachorloo, Scott Dresser, Anoop Gupta
Abstract: The advantages of using message passing over shared memory for certain types of communication and synchronization have provided an incentive to integrate both models within a single architecture. A key goal of the FLASH (FLexible Architecture for SHared memory) project at Stanford is to achieve this integration while maintaining a simple and efficient design. This paper presents the hardware and software mechanisms in FLASH to support various message passing protocols. We achieve low overhead message passing by delegating protocol functionality to the programmable node controllers in FLASH and by providing direct user-level access to this messaging subsystem. In contrast to most earlier work, we provide an integrated solution that handles the interaction of the messaging protocols with virtual memory, protected multiprogramming, and cache coherence. Detailed simulation studies indicate that this system can sustain message-transfer rates of several hundred megabytes per second, effectively utilizing projected network bandwidths for next generation multiprocessors.
Citations: 119
Where is time spent in message-passing and shared-memory programs?
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195501
S. Chandra, J. Larus, Anne Rogers
Abstract: Message passing and shared memory are two techniques parallel programs use for coordination and communication. This paper studies the strengths and weaknesses of these two mechanisms by comparing equivalent, well-written message-passing and shared-memory programs running on similar hardware. To ensure that our measurements are comparable, we produced two carefully tuned versions of each program and measured them on closely-related simulators of a message-passing and a shared-memory machine, both of which are based on the same underlying hardware assumptions.
We examined the behavior and performance of each program carefully. Although the cost of computation in each pair of programs was similar, synchronization and communication differed greatly. We found that message-passing's advantage over shared-memory is not clear-cut. Three of the four shared-memory programs ran at roughly the same speed as their message-passing equivalent, even though their communication patterns were different.
Citations: 94
Improving the accuracy of static branch prediction using branch correlation
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195549
C. Young, Michael D. Smith
Abstract: Recent work in history-based branch prediction uses novel hardware structures to capture branch correlation and increase branch prediction accuracy. We present a profile-based code transformation that exploits branch correlation to improve the accuracy of static branch prediction schemes. Our general method encodes branch history information in the program counter through the duplication and placement of program basic blocks. For correlation histories of eight branches, our experimental results achieve up to a 14.7% improvement in prediction accuracy over conventional profile-based prediction without any increase in the dynamic instruction count of our benchmark applications. In the majority of these applications, code duplication increases code size by less than 30%. For the few applications with code segments that exhibit exponential branching paths and no branch correlation, simple compile-time heuristics can eliminate these branches as code-transformation candidates.
Citations: 132
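The transformation described above duplicates basic blocks so that branch history is encoded in the program counter and each copy of a correlated branch can carry its own static prediction. The C fragment below is a hand-written analogue of that effect at the source level, using the GCC/Clang __builtin_expect extension as a stand-in for a per-copy static hint; the paper's method actually operates on profiled executables rather than source code.

```c
#include <stdio.h>

/* Suppose the second branch (y > 0) is almost always taken when the first
 * branch (x > 0) was taken, and almost never taken otherwise.  A single
 * static hint must pick one behavior; duplicating the block that contains
 * the second branch gives each history path its own copy, and each copy can
 * carry its own hint. */

#define LIKELY(c)   __builtin_expect(!!(c), 1)
#define UNLIKELY(c) __builtin_expect(!!(c), 0)

/* Original code: one copy of the correlated branch, one hint for all paths. */
static int original(int x, int y)
{
    int r = 0;
    if (x > 0)
        r += 1;
    if (y > 0)                  /* correlated with (x > 0); no good single hint */
        r += 2;
    return r;
}

/* Transformed code: the branch is duplicated, so which copy executes encodes
 * the outcome of the earlier branch, and each copy gets its own prediction. */
static int transformed(int x, int y)
{
    int r = 0;
    if (x > 0) {
        r += 1;
        if (LIKELY(y > 0))      /* on this path the branch is usually taken */
            r += 2;
    } else {
        if (UNLIKELY(y > 0))    /* on this path it is usually not taken */
            r += 2;
    }
    return r;
}

int main(void)
{
    printf("%d %d\n", original(3, 4), transformed(3, 4));
    return 0;
}
```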
DCG: an efficient, retargetable dynamic code generation system
ASPLOS VI Pub Date: 1994-11-01 DOI: 10.1145/195473.195567
D. Engler, Todd A. Proebsting
Abstract: Dynamic code generation allows aggressive optimization through the use of runtime information. Previous systems typically relied on ad hoc code generators that were not designed for retargetability, and did not shield the client from machine-specific details. We present a system, dcg, that allows clients to specify dynamically generated code in a machine-independent manner. Our one-pass code generator is easily retargeted and extremely efficient (code generation costs approximately 350 instructions per generated instruction). Experiments show that dynamic code generation increases some application speeds by over an order of magnitude.
Citations: 103
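dcg exposes a machine-independent interface, which the abstract summarizes; the sketch below does not reproduce that interface. It only demonstrates the underlying mechanism of dynamic code generation on x86-64 under POSIX: write a few instruction bytes into an executable buffer at run time and call them. Systems that enforce W^X will reject the writable-and-executable mapping, so treat this strictly as an illustration.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Minimal runtime code generation, shown at the raw-machine-code level for
 * x86-64.  Build an instruction sequence in memory at run time, then call it
 * as a function.  dcg instead lets clients describe the code to generate in
 * a machine-independent way and retargets the back end. */

typedef int (*int_fn)(void);

int main(void)
{
    /* mov eax, 42 ; ret */
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    memcpy(buf, code, sizeof code);          /* "generate" the code */
    int_fn generated = (int_fn)buf;
    printf("generated function returned %d\n", generated());

    munmap(buf, 4096);
    return 0;
}
```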