{"title":"Caching or re-computing: Online cost optimization for running big data tasks in IaaS clouds","authors":"Xiankun Fu, Li Pan, Shijun Liu","doi":"10.1016/j.jnca.2024.104080","DOIUrl":null,"url":null,"abstract":"High computing power and large storage capacity are necessary for running big data tasks, which leads to high infrastructure costs. Infrastructure-as-a-Service (IaaS) clouds can provide configuration environments and computing resources needed for running big data tasks, while saving users from expensive software and hardware infrastructure investments. Many studies show that the cost of computation can be reduced by caching intermediate results and reusing them instead of repeating computations. However, the storage cost incurred by caching a large number of intermediate results over a long period of time may exceed the cost of computation, ultimately leading to an increase in total cost instead. For making optimal caching decisions, future usage profiles for big data tasks are needed, but it is generally very hard to predict them precisely. In this paper, to address this problem, we propose two practical online algorithms, one deterministic and the other randomized, which can determine whether to cache intermediate results to reduce the total cost of big data tasks without requiring any future information. 
We prove theoretically that the competitive ratio of the proposed deterministic (randomized) algorithm is <mml:math altimg=\"si1.svg\" display=\"inline\"><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mn>2</mml:mn><mml:mo>−</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>−</mml:mo><mml:mi>η</mml:mi></mml:mrow><mml:mrow><mml:mi>δ</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>−</mml:mo><mml:mfrac><mml:mrow><mml:mi>η</mml:mi></mml:mrow><mml:mrow><mml:mi>β</mml:mi></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math> (resp., <mml:math altimg=\"si2.svg\" display=\"inline\"><mml:mfrac><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mo>−</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:math>). Using real-world Wikipedia data as well as synthetic datasets, we verify the effectiveness of our proposed algorithms through a large number of experiments based on the price of Alibaba’s public IaaS cloud products.","PeriodicalId":54784,"journal":{"name":"Journal of Network and Computer Applications","volume":"30 1","pages":""},"PeriodicalIF":7.7000,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Network and Computer Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1016/j.jnca.2024.104080","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Caching or re-computing: Online cost optimization for running big data tasks in IaaS clouds
High computing power and large storage capacity are necessary for running big data tasks, which leads to high infrastructure costs. Infrastructure-as-a-Service (IaaS) clouds can provide the configuration environments and computing resources needed for running big data tasks, while saving users from expensive software and hardware infrastructure investments. Many studies show that the cost of computation can be reduced by caching intermediate results and reusing them instead of repeating computations. However, the storage cost incurred by caching a large number of intermediate results over a long period of time may exceed the cost of computation, ultimately increasing the total cost. Making optimal caching decisions requires future usage profiles for big data tasks, but these are generally very hard to predict precisely. In this paper, to address this problem, we propose two practical online algorithms, one deterministic and the other randomized, which can determine whether to cache intermediate results to reduce the total cost of big data tasks without requiring any future information. We prove theoretically that the competitive ratio of the proposed deterministic (randomized) algorithm is $\min\left(2-\frac{1-\eta}{\delta},\ 2-\frac{\eta}{\beta}\right)$ (resp., $\frac{e}{e-1}$). Using real-world Wikipedia data as well as synthetic datasets, we verify the effectiveness of our proposed algorithms through a large number of experiments based on the prices of Alibaba’s public IaaS cloud products.
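The abstract does not spell out the two algorithms, so the following is only a minimal sketch of the classic ski-rental idea that the cache-or-recompute trade-off resembles, not the authors' method. It assumes a single intermediate result, a constant storage price per unit time (`storage_rate`), and a fixed price to recompute the result (`recompute_cost`). All function and parameter names are hypothetical, and the parameters η, δ, and β from the stated competitive ratio are not modeled, since the abstract does not define them.

```python
import math
import random


def gap_cost(gap, keep_time, storage_rate, recompute_cost):
    """Cost for one gap between two consecutive uses of a cached result.

    The result is kept for at most keep_time after its last use.
    """
    if gap <= keep_time:
        # Still cached when the next request arrives: pay storage only.
        return storage_rate * gap
    # Evicted before the next request: pay storage so far, then recompute.
    return storage_rate * keep_time + recompute_cost


def offline_cost(gap, storage_rate, recompute_cost):
    """Optimal cost when the gap is known in advance (assumed baseline)."""
    return min(storage_rate * gap, recompute_cost)


def deterministic_keep_time(storage_rate, recompute_cost):
    """Break-even rule: evict once storage paid equals one recomputation."""
    return recompute_cost / storage_rate


def randomized_keep_time(storage_rate, recompute_cost, rng=random):
    """Sample an eviction time with density e^x / (e - 1) on [0, 1],
    scaled by the break-even time (classic randomized ski rental)."""
    u = rng.random()
    x = math.log(1.0 + u * (math.e - 1.0))  # inverse-CDF sampling
    return x * recompute_cost / storage_rate


def expected_randomized_cost(gap, storage_rate, recompute_cost, steps=10_000):
    """Expected cost of the randomized rule, by midpoint-rule integration."""
    break_even = recompute_cost / storage_rate
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) / steps
        density = math.exp(x) / (math.e - 1.0)
        cost = gap_cost(gap, x * break_even, storage_rate, recompute_cost)
        total += cost * density
    return total / steps
```

Under these assumptions, the break-even rule never pays more than twice the offline optimum for any gap, and the randomized rule pays about e/(e−1) ≈ 1.58 times the optimum in expectation. Neither rule needs to know when the next request will arrive, which is the sense in which the decision is made online.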
About the journal:
The Journal of Network and Computer Applications welcomes research contributions, surveys, and notes in all areas relating to computer networks and their applications. Sample topics include new design techniques; interesting or novel applications, components, or standards; computer networks with tools such as the WWW; emerging standards for internet protocols; wireless networks; mobile computing; emerging computing models such as cloud computing and grid computing; and applications of networked systems for remote collaboration and telemedicine. The journal is abstracted and indexed in Scopus, Engineering Index, Web of Science, Science Citation Index Expanded, and INSPEC.