Computation spreading: employing hardware migration to specialize CMP cores on-the-fly

ASPLOS XII | Pub Date: 2006-10-23 | DOI: 10.1145/1168857.1168893
Koushik Chakraborty, Philip M. Wells, G. Sohi
Citations: 117

Abstract

In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different times, we observe extensive code reuse among different processors, causing redundancy (e.g., in our server workloads, 45-65% of all instruction blocks are accessed by all processors). Moreover, largely independent fragments of computation compete for the same private resources, causing destructive interference. Together, this redundancy and interference lead to poor utilization of private microarchitecture resources such as caches and branch predictors.

We present Computation Spreading (CSP), which employs hardware migration to distribute a thread's dissimilar fragments of computation across the multiple processing cores of a chip multiprocessor (CMP), while grouping similar computation fragments from different threads together. This paper focuses on a specific example of CSP for OS-intensive server applications: separating application-level (user) computation from the OS calls it makes. When performing CSP, each core becomes temporally specialized to execute certain computation fragments, and the same core is repeatedly used for such fragments. We examine two specific thread assignment policies for CSP, and show that these policies, across four server workloads, are able to reduce instruction misses in private L2 caches by 27-58%, private L2 load misses by 0-19%, and branch mispredictions by 9-25%.
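The intuition behind CSP can be illustrated with a small toy model (this is a hypothetical sketch, not the paper's simulator or policies). Below, several threads interleave user computation with OS calls over shared code regions; under the canonical per-thread assignment every private cache ends up holding both code regions, while a CSP-style assignment that migrates OS fragments to OS-specialized cores shrinks each core's instruction footprint and eliminates blocks replicated on every core:

```python
# Toy model of Computation Spreading (hypothetical sketch, not the
# paper's simulator). Threads alternate user computation with OS calls
# over shared code; we compare per-core instruction footprints under a
# canonical thread-to-core assignment vs. a CSP-style assignment that
# sends OS fragments to OS-specialized cores.

USER_BLOCKS = [f"user_{i}" for i in range(40)]  # shared application code
OS_BLOCKS = [f"os_{i}" for i in range(40)]      # shared kernel code


def thread_trace(tid):
    # Each thread touches largely the same code blocks as its peers,
    # modeling the redundancy the paper observes (45-65% of blocks
    # accessed by all processors).
    trace = []
    for i in range(100):
        trace.append(("user", USER_BLOCKS[(tid + i) % len(USER_BLOCKS)]))
        trace.append(("os", OS_BLOCKS[(tid * 3 + i) % len(OS_BLOCKS)]))
    return trace


def footprints(policy, n_threads=4, n_cores=4):
    # Run every thread's trace through the assignment policy and record
    # which instruction blocks land in each core's private cache.
    cores = [set() for _ in range(n_cores)]
    for tid in range(n_threads):
        for kind, block in thread_trace(tid):
            cores[policy(tid, kind, n_cores)].add(block)
    return cores


def replicated_everywhere(cores):
    # Blocks resident in every private cache: pure redundancy.
    return len(set.intersection(*cores))


# Canonical: the OS pins each thread to one core; user and OS code mix.
canonical = footprints(lambda tid, kind, n: tid % n)

# CSP-style: user fragments run on the lower half of the cores, OS
# fragments migrate to the upper half, so each core specializes.
csp = footprints(
    lambda tid, kind, n: (tid % (n // 2)) + (n // 2 if kind == "os" else 0)
)

print("canonical per-core footprint:", [len(c) for c in canonical])  # [80, 80, 80, 80]
print("CSP       per-core footprint:", [len(c) for c in csp])        # [40, 40, 40, 40]
print("blocks on all cores, canonical:", replicated_everywhere(canonical))  # 80
print("blocks on all cores, CSP:", replicated_everywhere(csp))              # 0
```

In the toy model, specialization halves each private cache's instruction footprint and removes all fully replicated blocks; the paper's measured 27-58% reduction in private L2 instruction misses reflects the same effect on real server workloads.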