Reproducibility, energy efficiency and performance of pseudorandom number generators in machine learning: a comparative study of Python, NumPy, TensorFlow, and PyTorch implementations
{"title":"Reproducibility, energy efficiency and performance of pseudorandom number generators in machine learning: a comparative study of python, numpy, tensorflow, and pytorch implementations","authors":"Benjamin Antunes, David R. C Hill","doi":"arxiv-2401.17345","DOIUrl":null,"url":null,"abstract":"Pseudo-Random Number Generators (PRNGs) have become ubiquitous in machine\nlearning technologies because they are interesting for numerous methods. The\nfield of machine learning holds the potential for substantial advancements\nacross various domains, as exemplified by recent breakthroughs in Large\nLanguage Models (LLMs). However, despite the growing interest, persistent\nconcerns include issues related to reproducibility and energy consumption.\nReproducibility is crucial for robust scientific inquiry and explainability,\nwhile energy efficiency underscores the imperative to conserve finite global\nresources. This study delves into the investigation of whether the leading\nPseudo-Random Number Generators (PRNGs) employed in machine learning languages,\nlibraries, and frameworks uphold statistical quality and numerical\nreproducibility when compared to the original C implementation of the\nrespective PRNG algorithms. Additionally, we aim to evaluate the time\nefficiency and energy consumption of various implementations. Our experiments\nencompass Python, NumPy, TensorFlow, and PyTorch, utilizing the Mersenne\nTwister, PCG, and Philox algorithms. Remarkably, we verified that the temporal\nperformance of machine learning technologies closely aligns with that of\nC-based implementations, with instances of achieving even superior\nperformances. On the other hand, it is noteworthy that ML technologies consumed\nonly 10% more energy than their C-implementation counterparts. However, while\nstatistical quality was found to be comparable, achieving numerical\nreproducibility across different platforms for identical seeds and algorithms\nwas not achieved.","PeriodicalId":501256,"journal":{"name":"arXiv - CS - Mathematical Software","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Mathematical Software","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2401.17345","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Pseudo-Random Number Generators (PRNGs) have become ubiquitous in machine learning technologies because numerous methods rely on them. The field of machine learning holds the potential for substantial advancements across various domains, as exemplified by recent breakthroughs in Large Language Models (LLMs). However, despite the growing interest, persistent concerns include issues related to reproducibility and energy consumption. Reproducibility is crucial for robust scientific inquiry and explainability, while energy efficiency underscores the imperative to conserve finite global resources. This study investigates whether the leading PRNGs employed in machine learning languages, libraries, and frameworks preserve statistical quality and numerical reproducibility when compared to the original C implementations of the respective PRNG algorithms. Additionally, we evaluate the time efficiency and energy consumption of the various implementations. Our experiments encompass Python, NumPy, TensorFlow, and PyTorch, using the Mersenne Twister, PCG, and Philox algorithms. Remarkably, we verified that the temporal performance of the machine learning technologies closely aligns with that of the C-based implementations, in some instances even surpassing it. Moreover, the ML technologies consumed only 10% more energy than their C-implementation counterparts. However, while statistical quality was found to be comparable, numerical reproducibility across different platforms for identical seeds and algorithms was not achieved.
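
To make the cross-platform reproducibility question concrete, the sketch below (not the authors' benchmark code; the seed value, sample size, and particular generator choices are illustrative assumptions) seeds Python's random module, NumPy, PyTorch, and TensorFlow with the same value and prints their uniform samples. Even when two libraries nominally use the same algorithm, the floating-point streams generally differ, which is the kind of gap the study examines.

# Minimal, illustrative reproducibility check across four technologies.
# Seed (12345), sample size (5), and generator choices are assumptions
# made for this sketch, not values taken from the paper.
import random

import numpy as np
import tensorflow as tf
import torch

SEED = 12345
N = 5

# Python standard library: Mersenne Twister (MT19937)
random.seed(SEED)
py_sample = [random.random() for _ in range(N)]

# NumPy: Philox counter-based bit generator wrapped in a Generator
np_sample = np.random.Generator(np.random.Philox(SEED)).random(N)

# PyTorch: default CPU generator (also MT19937-based)
torch.manual_seed(SEED)
torch_sample = torch.rand(N)

# TensorFlow: stateful generator using the Philox algorithm
tf_sample = tf.random.Generator.from_seed(SEED, alg="philox").uniform((N,))

print("Python  (MT19937):", py_sample)
print("NumPy   (Philox): ", np_sample)
print("PyTorch (MT19937):", torch_sample.numpy())
print("TF      (Philox): ", tf_sample.numpy())
# Even with identical seeds, the outputs differ across libraries because each
# implementation seeds, buffers, and converts raw integer outputs to floats
# in its own way (and at different precisions, e.g. float64 vs float32).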