BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models

Joo Hwan Lee, Jaewoong Sim, Hyesoon Kim
{"title":"BSSync:用有界过期一致性模型处理机器学习工作负载的近内存","authors":"Joo Hwan Lee, Jaewoong Sim, Hyesoon Kim","doi":"10.1109/PACT.2015.42","DOIUrl":null,"url":null,"abstract":"Parallel machine learning workloads have become prevalent in numerous application domains. Many of these workloads are iterative convergent, allowing different threads to compute in an asynchronous manner, relaxing certain read-after-write data dependencies to use stale values. While considerable effort has been devoted to reducing the communication latency between nodes by utilizing asynchronous parallelism, inefficient utilization of relaxed consistency models within a single node have caused parallel implementations to have low execution efficiency. The long latency and serialization caused by atomic operations have a significant impact on performance. The data communication is not overlapped with the main computation, which reduces execution efficiency. The inefficiency comes from the data movement between where they are stored and where they are processed. In this work, we propose Bounded Staled Sync (BSSync), a hardware support for the bounded staleness consistency model, which accompanies simple logic layers in the memory hierarchy. BSSync overlaps the long latency atomic operation with the main computation, targeting iterative convergent machine learning workloads. Compared to previous work that allows staleness for read operations, BSSync utilizes staleness for write operations, allowing stale-writes. We demonstrate the benefit of the proposed scheme for representative machine learning workloads. On average, our approach outperforms the baseline asynchronous parallel implementation by 1.33x times.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"59","resultStr":"{\"title\":\"BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models\",\"authors\":\"Joo Hwan Lee, Jaewoong Sim, Hyesoon Kim\",\"doi\":\"10.1109/PACT.2015.42\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Parallel machine learning workloads have become prevalent in numerous application domains. Many of these workloads are iterative convergent, allowing different threads to compute in an asynchronous manner, relaxing certain read-after-write data dependencies to use stale values. While considerable effort has been devoted to reducing the communication latency between nodes by utilizing asynchronous parallelism, inefficient utilization of relaxed consistency models within a single node have caused parallel implementations to have low execution efficiency. The long latency and serialization caused by atomic operations have a significant impact on performance. The data communication is not overlapped with the main computation, which reduces execution efficiency. The inefficiency comes from the data movement between where they are stored and where they are processed. In this work, we propose Bounded Staled Sync (BSSync), a hardware support for the bounded staleness consistency model, which accompanies simple logic layers in the memory hierarchy. BSSync overlaps the long latency atomic operation with the main computation, targeting iterative convergent machine learning workloads. 
Compared to previous work that allows staleness for read operations, BSSync utilizes staleness for write operations, allowing stale-writes. We demonstrate the benefit of the proposed scheme for representative machine learning workloads. On average, our approach outperforms the baseline asynchronous parallel implementation by 1.33x times.\",\"PeriodicalId\":385398,\"journal\":{\"name\":\"2015 International Conference on Parallel Architecture and Compilation (PACT)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-10-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"59\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 International Conference on Parallel Architecture and Compilation (PACT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PACT.2015.42\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2015.42","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 59

Abstract

Parallel machine learning workloads have become prevalent in numerous application domains. Many of these workloads are iterative convergent, allowing different threads to compute in an asynchronous manner, relaxing certain read-after-write data dependencies to use stale values. While considerable effort has been devoted to reducing the communication latency between nodes by utilizing asynchronous parallelism, inefficient utilization of relaxed consistency models within a single node has caused parallel implementations to have low execution efficiency. The long latency and serialization caused by atomic operations have a significant impact on performance. The data communication is not overlapped with the main computation, which reduces execution efficiency. The inefficiency comes from moving data between where it is stored and where it is processed. In this work, we propose Bounded Staled Sync (BSSync), hardware support for the bounded staleness consistency model that adds simple logic layers to the memory hierarchy. BSSync overlaps the long-latency atomic operations with the main computation, targeting iterative convergent machine learning workloads. Compared to previous work that allows staleness for read operations, BSSync utilizes staleness for write operations, allowing stale writes. We demonstrate the benefit of the proposed scheme for representative machine learning workloads. On average, our approach outperforms the baseline asynchronous parallel implementation by 1.33x.
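The stale-write idea the abstract describes can be emulated in ordinary multithreaded software. The sketch below is a minimal illustration using plain C++ threads, not the paper's near-memory logic layers: each worker buffers its parameter updates privately (stale writes) and flushes them to the shared model with atomics only once per staleness bound, so the expensive atomic traffic is batched off the main loop's critical path. All names here (STALENESS_BOUND, worker, the toy update rule) are illustrative assumptions, not an API from the paper.

```cpp
// Minimal software sketch of bounded-staleness stale writes, assuming a toy
// shared parameter vector updated by several workers.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int STALENESS_BOUND = 4;  // iterations a write may stay private
constexpr int DIM = 8;              // toy parameter vector size
constexpr int ITERS = 1000;         // iterations per worker

std::atomic<long> shared_params[DIM];  // shared model (integer for atomics)

void worker(int id) {
    long local_delta[DIM] = {0};  // buffered (stale) writes
    int stale_iters = 0;
    for (int it = 0; it < ITERS; ++it) {
        // Compute a toy "update" from a possibly stale view of the model.
        for (int d = 0; d < DIM; ++d) {
            long stale_read = shared_params[d].load(std::memory_order_relaxed);
            local_delta[d] += (stale_read % 3) + id + 1;  // placeholder rule
        }
        // Under bounded staleness, flush buffered writes only every
        // STALENESS_BOUND iterations instead of issuing atomics each step.
        if (++stale_iters == STALENESS_BOUND) {
            for (int d = 0; d < DIM; ++d) {
                shared_params[d].fetch_add(local_delta[d],
                                           std::memory_order_relaxed);
                local_delta[d] = 0;
            }
            stale_iters = 0;
        }
    }
    // Final flush so no buffered update is lost when the worker finishes.
    for (int d = 0; d < DIM; ++d)
        shared_params[d].fetch_add(local_delta[d], std::memory_order_relaxed);
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(worker, t);
    for (auto& th : threads) th.join();
    std::printf("param[0] = %ld\n", shared_params[0].load());
}
```

Raising STALENESS_BOUND trades freshness of the shared model for fewer atomic operations; iterative convergent workloads tolerate this because stale values only slow convergence rather than break correctness, which is the trade-off the bounded staleness model makes explicit.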