DescribeML: a tool for describing machine learning datasets

Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings
Pub Date: 2022-10-23
DOI: 10.1145/3550356.3559087

Joan Giner-Miguelez, A. Gómez, Jordi Cabot

Citations: 10

Abstract

Datasets play a central role in the training and evaluation of machine learning (ML) models. However, they are also the root cause of many undesired model behaviors, such as biased predictions. To address this, the ML community is proposing a data-centric cultural shift in which data issues are given the attention they deserve, for instance by proposing standard descriptions for datasets. Inspired by these proposals, we present a model-driven tool to precisely describe machine learning datasets in terms of their structure, data provenance, and social concerns. Our tool aims to help any ML initiative leverage and benefit from this data-centric shift (e.g., by selecting the most appropriate dataset for a new project or by better replicating other ML results). The tool is implemented with the Langium workbench as a Visual Studio Code plugin and is published as open source.
