Integrated Reproducibility with Self-describing Machine Learning Models

J. Wonsil, J. Sullivan, Margo Seltzer, A. Pocock
Proceedings of the 2023 ACM Conference on Reproducibility and Replicability · Published 2023-06-27 · DOI: 10.1145/3589806.3600039

Abstract

Researchers and data scientists frequently want to collaborate on machine learning models. However, in the presence of sharing and simultaneous experimentation, it is challenging both to determine if two models were trained identically and to reproduce precisely someone else’s training process. We demonstrate how provenance collection that is tightly integrated into a machine learning library facilitates reproducibility. We present MERIT, a reproducibility system that leverages a robust configuration system and extensive provenance collection to exactly reproduce models, given only a model object. We integrate MERIT with Tribuo, an open-source Java-based machine learning library. Key features of this integrated reproducibility framework include controlling for sources of non-determinism in a multi-threaded environment and exposing the training differences between two models in a human-readable form. Our system allows simple reproduction of deployed Tribuo models without any additional information, ensuring data science research is reproducible. Our framework is open-source and available under an Apache 2.0 license.
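The core idea — a model object that carries enough provenance (seed, hyperparameters, configuration) to retrain itself exactly, plus a human-readable diff of two models' training configurations — can be illustrated with a minimal, self-contained sketch. This is not Tribuo's or MERIT's actual API; the class and method names (`SelfDescribingModel`, `reproduce`, `diffProvenance`) are hypothetical, chosen only to mirror the concepts in the abstract.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch of a "self-describing" model: the trained object
// carries the full provenance needed to retrain it bit-for-bit.
public class SelfDescribingModel {
    final Map<String, String> provenance; // configuration captured at training time
    final double[] weights;               // the "trained" parameters

    SelfDescribingModel(Map<String, String> provenance, double[] weights) {
        this.provenance = provenance;
        this.weights = weights;
    }

    // Deterministic training: all randomness flows from the recorded seed.
    static SelfDescribingModel train(long seed, int dim, double lr) {
        Map<String, String> prov = new TreeMap<>();
        prov.put("seed", Long.toString(seed));
        prov.put("dim", Integer.toString(dim));
        prov.put("learning-rate", Double.toString(lr));
        Random rng = new Random(seed);
        double[] w = new double[dim];
        for (int i = 0; i < dim; i++) {
            w[i] = lr * rng.nextGaussian();
        }
        return new SelfDescribingModel(prov, w);
    }

    // Reproduce a model given only the model object: replay its provenance.
    static SelfDescribingModel reproduce(SelfDescribingModel m) {
        return train(Long.parseLong(m.provenance.get("seed")),
                     Integer.parseInt(m.provenance.get("dim")),
                     Double.parseDouble(m.provenance.get("learning-rate")));
    }

    // Human-readable diff of the training configurations of two models.
    static String diffProvenance(SelfDescribingModel a, SelfDescribingModel b) {
        StringBuilder sb = new StringBuilder();
        for (String key : a.provenance.keySet()) {
            String va = a.provenance.get(key);
            String vb = b.provenance.get(key);
            if (!va.equals(vb)) {
                sb.append(key).append(": ").append(va)
                  .append(" -> ").append(vb).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SelfDescribingModel m1 = train(42L, 4, 0.1);
        SelfDescribingModel m2 = reproduce(m1);
        System.out.println("identical weights: "
                + Arrays.equals(m1.weights, m2.weights));
        SelfDescribingModel m3 = train(42L, 4, 0.5);
        System.out.print(diffProvenance(m1, m3));
    }
}
```

The real system must do considerably more than this sketch: the paper notes that exact reproduction requires controlling sources of non-determinism in a multi-threaded training environment, which a single seeded `Random` does not capture.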