Dirigent: Lightweight Serverless Orchestration
Lazar Cvetković, François Costa, Mihajlo Djokic, Michal Friedman, Ana Klimovic
arXiv - CS - Operating Systems, arXiv:2404.16393, published 2024-04-25
Abstract
While Function as a Service (FaaS) platforms can initialize function sandboxes on worker nodes in 10-100s of milliseconds, the latency to schedule functions in real FaaS clusters can be orders of magnitude higher. We find that the current approach of building FaaS cluster managers on top of legacy orchestration systems like Kubernetes leads to high scheduling delay at high sandbox churn, which is typical in FaaS clusters. While generic cluster managers use hierarchical abstractions and multiple internal components to manage and reconcile state with frequent persistent updates, this becomes a bottleneck for FaaS, where cluster state frequently changes as sandboxes are created on the critical path of requests. Based on our root-cause analysis of performance issues in existing FaaS cluster managers, we propose Dirigent, a clean-slate system architecture for FaaS orchestration with three key principles. First, Dirigent optimizes internal cluster manager abstractions to simplify state management. Second, it eliminates persistent state updates on the critical path of function invocations, leveraging the fact that FaaS abstracts sandboxes from users to relax exact state reconstruction guarantees. Finally, Dirigent runs monolithic control and data planes to minimize internal communication overheads and maximize throughput. We compare Dirigent to state-of-the-art FaaS platforms and show that Dirigent reduces 99th percentile per-function scheduling latency for a production workload by 2.79x compared to AWS Lambda and can spin up 2500 sandboxes per second at low latency, which is 1250x more than with Knative.
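
The second principle, keeping persistent state updates off the invocation critical path, can be illustrated with a minimal sketch. The Go code below is hypothetical (Orchestrator, SandboxRecord, and pickNode are illustrative names, not Dirigent's actual API): it keeps authoritative sandbox state in memory and defers durable writes to a background write-behind queue, relying on the relaxed state-reconstruction guarantee the abstract describes.

```go
// Hypothetical sketch, not Dirigent's code: illustrates deferring durable
// state updates off the invocation critical path, since FaaS hides sandboxes
// from users and exact state reconstruction after a crash is not required.
package main

import (
	"fmt"
	"time"
)

// SandboxRecord is a minimal, illustrative description of a created sandbox.
type SandboxRecord struct {
	Function string
	Node     string
	Created  time.Time
}

// Orchestrator keeps authoritative state in memory and flushes it to durable
// storage in the background, off the request path.
type Orchestrator struct {
	persistQueue chan SandboxRecord // buffered: enqueue is non-blocking in the common case
}

func NewOrchestrator() *Orchestrator {
	o := &Orchestrator{persistQueue: make(chan SandboxRecord, 1024)}
	go o.persistLoop()
	return o
}

// Invoke places a sandbox for a function. The only work on the critical path
// is the in-memory scheduling decision; durability is deferred.
func (o *Orchestrator) Invoke(function string) SandboxRecord {
	rec := SandboxRecord{Function: function, Node: pickNode(), Created: time.Now()}
	// Asynchronous write-behind instead of a synchronous database commit.
	select {
	case o.persistQueue <- rec:
	default:
		// Queue full: drop the update. State can be rediscovered from worker
		// nodes after a failure, which is the guarantee FaaS lets us relax.
	}
	return rec
}

func (o *Orchestrator) persistLoop() {
	for rec := range o.persistQueue {
		// Stand-in for a real durable write (e.g., to a log or database).
		fmt.Printf("persisted: %s on %s\n", rec.Function, rec.Node)
	}
}

func pickNode() string { return "worker-1" } // placeholder scheduling decision

func main() {
	o := NewOrchestrator()
	o.Invoke("thumbnail-generator")
	time.Sleep(10 * time.Millisecond) // let the background flush run in this demo
}
```

The drop-on-full choice in this sketch mirrors the relaxed guarantee: losing an in-flight persistence update does not affect user-visible correctness, because sandbox state can be rediscovered from the workers after a control-plane failure.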