Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects
Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
arXiv - CS - Operating Systems · 2024-09-12 · https://doi.org/arxiv-2409.08141
Abstract
Conventional wisdom holds that an efficient interface between an OS running on a CPU and a high-bandwidth I/O device should be based on Direct Memory Access (DMA), descriptor rings, and interrupts: DMA offloads transfers from the CPU, descriptor rings provide buffering and queuing, and interrupts facilitate asynchronous interaction between cores and the device with a lightweight notification mechanism. In this paper we question this wisdom in the light of modern hardware and workloads, particularly in cloud servers. We argue that the assumptions that led to this model are obsolete, and that in many use cases programmed I/O, where the CPU explicitly transfers data and control information to and from a device via loads and stores, actually results in a more efficient system. We quantitatively demonstrate these advantages using three use cases: fine-grained RPC-style invocation of functions on an accelerator, offloading of operators in a streaming dataflow engine, and a network interface targeting serverless functions. Moreover, we show that while these advantages are significant over a modern PCIe peripheral bus, a truly cache-coherent interconnect offers substantial additional efficiency gains.
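
To make the contrast concrete, here is a minimal C sketch of the two interface styles the abstract describes. The register layout, field names, and the pio_send helper are hypothetical illustrations, not the interface evaluated in the paper; a production driver would also need memory barriers and an appropriate (uncached or write-combining) MMIO mapping.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical register layout of a device whose MMIO window is
     * mapped into the CPU's address space (uncached over PCIe, or
     * cacheable over a coherent interconnect). Offsets and fields
     * are illustrative only. */
    struct dev_regs {
        volatile uint64_t doorbell;   /* store here to notify the device   */
        volatile uint64_t status;     /* load here to poll for completion  */
        volatile uint8_t  buf[4096];  /* device-resident data window       */
    };

    /* Programmed I/O: the CPU moves both payload and control with
     * ordinary stores, then detects completion with loads. There is
     * no descriptor ring, no DMA engine, and no interrupt. */
    static void pio_send(struct dev_regs *dev, const void *msg, size_t len)
    {
        memcpy((void *)dev->buf, msg, len);  /* data moved by CPU stores   */
        /* NOTE: real code needs a store fence here so the payload is
         * visible to the device before the doorbell write lands. */
        dev->doorbell = len;                 /* control info via a store   */
        while (dev->status == 0)             /* completion by polling      */
            ;
    }

    /* For contrast, the conventional DMA model posts a descriptor
     * that merely points at host memory; the device fetches the
     * payload itself and raises an interrupt when it is done. */
    struct descriptor {
        uint64_t addr;    /* physical address of the payload buffer     */
        uint32_t len;     /* payload length in bytes                    */
        uint32_t flags;   /* e.g. an ownership bit flipped by the device */
    };

In the PIO sketch the CPU core is occupied for the duration of the transfer, which is exactly the trade the paper argues in favor of when cores are cheap and transfers are small; the descriptor-ring structure shows the extra indirection (host buffer, descriptor fetch, interrupt) that the conventional model pays per operation.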