{"title":"NeIL: Intelligent Replica Selection for Distributed Applications","authors":"Faraz Ahmed;Lianjie Cao;Ayush Goel;Puneet Sharma","doi":"10.1109/TMLCN.2024.3479109","DOIUrl":null,"url":null,"abstract":"Distributed applications such as cloud gaming, streaming, etc., are increasingly using edge-to-cloud infrastructure for high availability and performance. While edge infrastructure brings services closer to the end-user, the number of sites on which the services need to be replicated has also increased. This makes replica selection challenging for clients of the replicated services. Traditional replica selection methods including anycast based methods and DNS re-directions are performance agnostic, and clients experience degraded network performance when network performance dynamics are not considered in replica selection. In this work, we present a client-side replica selection framework NeIL, that enables network performance aware replica selection. We propose to use bandits with experts based Multi-Armed Bandit (MAB) algorithms and adapt these algorithms for replica selection at individual clients without centralized coordination. We evaluate our approach using three different setups including a distributed Mininet setup where we use publicly available network performance data from the Measurement Lab (M-Lab) to emulate network conditions, a setup where we deploy replica servers on AWS, and finally we present results from a global enterprise deployment. Our experimental results show that in comparison to greedy selection, NeIL performs better than greedy for 45% of the time and better than or equal to greedy selection for 80% of the time resulting in a net gain in end-to-end network performance. On AWS, we see similar results where NeIL performs better than or equal to greedy for 75% of the time. We have successfully deployed NeIL in a global enterprise remote device management service with over 4000 client devices and our analysis shows that NeIL achieves significantly better tail service quality by cutting the \n<inline-formula> <tex-math>$99th$ </tex-math></inline-formula>\n percentile tail latency from 5.6 seconds to 1.7 seconds.","PeriodicalId":100641,"journal":{"name":"IEEE Transactions on Machine Learning in Communications and Networking","volume":"2 ","pages":"1580-1594"},"PeriodicalIF":0.0000,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10714467","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Machine Learning in Communications and Networking","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10714467/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Distributed applications such as cloud gaming, streaming, etc., are increasingly using edge-to-cloud infrastructure for high availability and performance. While edge infrastructure brings services closer to the end-user, the number of sites on which the services need to be replicated has also increased. This makes replica selection challenging for clients of the replicated services. Traditional replica selection methods including anycast based methods and DNS re-directions are performance agnostic, and clients experience degraded network performance when network performance dynamics are not considered in replica selection. In this work, we present a client-side replica selection framework NeIL, that enables network performance aware replica selection. We propose to use bandits with experts based Multi-Armed Bandit (MAB) algorithms and adapt these algorithms for replica selection at individual clients without centralized coordination. We evaluate our approach using three different setups including a distributed Mininet setup where we use publicly available network performance data from the Measurement Lab (M-Lab) to emulate network conditions, a setup where we deploy replica servers on AWS, and finally we present results from a global enterprise deployment. Our experimental results show that in comparison to greedy selection, NeIL performs better than greedy for 45% of the time and better than or equal to greedy selection for 80% of the time resulting in a net gain in end-to-end network performance. On AWS, we see similar results where NeIL performs better than or equal to greedy for 75% of the time. We have successfully deployed NeIL in a global enterprise remote device management service with over 4000 client devices and our analysis shows that NeIL achieves significantly better tail service quality by cutting the
$99th$
percentile tail latency from 5.6 seconds to 1.7 seconds.