A Formal Model for Fault Tolerant Parallel Matrix Factorization

2022 26th International Conference on Engineering of Complex Computer Systems (ICECCS). Pub Date: 2022-03-01. DOI: 10.1109/ICECCS54210.2022.00016

Camille Coti, L. Petrucci, Daniel Alberto Torres González


Citations: 0

Abstract

As exascale platforms are in sight, high-performance computing needs to take failures into account and provide fault-tolerant applications and environments. Checkpoint-restart approaches do not require modifying the application, but are expensive at large scale. Application-based fault tolerance is more specific to the application and is expected to achieve better performance. In this paper, we address fault-tolerant matrix factorization with algorithms that present good performance, both during failure-free executions and when failures happen. A challenge when designing fault-tolerant algorithms is to make sure they are resilient to any failure scenario. Therefore, we design a model for these algorithms and prove they can tolerate failures at any moment, as long as enough processes are still alive.
