Reproducibility, energy efficiency and performance of pseudorandom number generators in machine learning: a comparative study of python, numpy, tensorflow, and pytorch implementations
Authors: Benjamin Antunes, David R. C Hill
Journal: arXiv - CS - Mathematical Software
Publication date: 2024-01-30
DOI: arxiv-2401.17345
Abstract
Pseudo-Random Number Generators (PRNGs) have become ubiquitous in machine
learning technologies because they are useful for numerous methods. The
field of machine learning holds the potential for substantial advancements
across various domains, as exemplified by recent breakthroughs in Large
Language Models (LLMs). However, despite the growing interest, concerns about
reproducibility and energy consumption persist.
Reproducibility is crucial for robust scientific inquiry and explainability,
while energy efficiency underscores the imperative to conserve finite global
resources. This study investigates whether the leading PRNGs employed in
machine learning languages, libraries, and frameworks preserve statistical
quality and numerical reproducibility when compared to the original C
implementation of the respective PRNG algorithms. We also evaluate the time
efficiency and energy consumption of the various implementations. Our experiments
encompass Python, NumPy, TensorFlow, and PyTorch, utilizing the Mersenne
Twister, PCG, and Philox algorithms. We found that the time performance of
machine learning technologies closely matches that of C-based implementations
and in some instances even exceeds it. Energy consumption, on the other hand,
was only 10% higher for ML technologies than for their C-implementation
counterparts. However, while statistical quality was found to be comparable,
numerical reproducibility across different platforms for identical seeds and
algorithms was not achieved.
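
The last finding can be made concrete with a small sketch. It illustrates one way an identical seed and algorithm can produce different streams; it is not presented as the cause reported in the study. The reference C code for the Mersenne Twister (mt19937ar.c) offers two seeding routines, `init_genrand` and `init_by_array`, and CPython's `random` module seeds integers through the array variant. The code below is a pure-Python port of those routines written for this example; the helper name `outputs` and the seed value 5489 are assumptions chosen for illustration.

```python
import random

N, M = 624, 397        # state size and twist offset of MT19937
MASK32 = 0xFFFFFFFF


def init_genrand(seed):
    """Single-integer seeding, ported from init_genrand in mt19937ar.c."""
    mt = [seed & MASK32]
    for i in range(1, N):
        prev = mt[-1]
        mt.append((1812433253 * (prev ^ (prev >> 30)) + i) & MASK32)
    return mt


def init_by_array(key):
    """Array seeding, ported from init_by_array in mt19937ar.c."""
    mt = init_genrand(19650218)
    i, j = 1, 0
    for _ in range(max(N, len(key))):
        prev = mt[i - 1]
        mt[i] = ((mt[i] ^ ((prev ^ (prev >> 30)) * 1664525)) + key[j] + j) & MASK32
        i, j = i + 1, (j + 1) % len(key)
        if i >= N:
            mt[0], i = mt[N - 1], 1
    for _ in range(N - 1):
        prev = mt[i - 1]
        mt[i] = ((mt[i] ^ ((prev ^ (prev >> 30)) * 1566083941)) - i) & MASK32
        i += 1
        if i >= N:
            mt[0], i = mt[N - 1], 1
    mt[0] = 0x80000000
    return mt


def outputs(state, count):
    """Twist the state once and return the first `count` tempered outputs."""
    if count > N:
        raise ValueError("this sketch only generates one block of outputs")
    mt = list(state)
    for k in range(N):
        y = (mt[k] & 0x80000000) | (mt[(k + 1) % N] & 0x7FFFFFFF)
        mt[k] = mt[(k + M) % N] ^ (y >> 1) ^ (0x9908B0DF if y & 1 else 0)
    result = []
    for y in mt[:count]:
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        y ^= y >> 18
        result.append(y)
    return result


seed = 5489
from_int = outputs(init_genrand(seed), 3)        # reference C, integer seeding
from_array = outputs(init_by_array([seed]), 3)   # reference C, array seeding
rng = random.Random(seed)
cpython = [rng.getrandbits(32) for _ in range(3)]  # Python's built-in generator

print("init_genrand :", from_int)
print("init_by_array:", from_array)
print("random.Random:", cpython)
print("matches array seeding  :", cpython == from_array)
print("matches integer seeding:", cpython == from_int)
```

Both seeding routines drive the same generator, yet only the array-seeded stream coincides with what `random.Random(5489)` returns. A comparison across implementations therefore has to fix the seeding procedure as well as the seed value and the algorithm.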