DescribeML: a tool for describing machine learning datasets

Joan Giner-Miguelez, A. Gómez, Jordi Cabot
{"title":"DescribeML: a tool for describing machine learning datasets","authors":"Joan Giner-Miguelez, A. Gómez, Jordi Cabot","doi":"10.1145/3550356.3559087","DOIUrl":null,"url":null,"abstract":"Datasets play a central role in the training and evaluation of machine learning (ML) models. But they are also the root cause of many undesired model behaviors, such as biased predictions. To overcome this situation, the ML community is proposing a data-centric cultural shift, where data issues are given the attention they deserve, for instance, proposing standard descriptions for datasets. In this sense, and inspired by these proposals, we present a model-driven tool to precisely describe machine learning datasets in terms of their structure, data provenance, and social concerns. Our tool aims to facilitate any ML initiative to leverage and benefit from this data-centric shift in ML (e.g., selecting the most appropriate dataset for a new project or better replicating other ML results). The tool is implemented with the Langium workbench as a Visual Studio Code plugin and published as an open-source.","PeriodicalId":182662,"journal":{"name":"Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3550356.3559087","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

Datasets play a central role in the training and evaluation of machine learning (ML) models. But they are also the root cause of many undesired model behaviors, such as biased predictions. To overcome this situation, the ML community is proposing a data-centric cultural shift, where data issues are given the attention they deserve, for instance, proposing standard descriptions for datasets. In this sense, and inspired by these proposals, we present a model-driven tool to precisely describe machine learning datasets in terms of their structure, data provenance, and social concerns. Our tool aims to facilitate any ML initiative to leverage and benefit from this data-centric shift in ML (e.g., selecting the most appropriate dataset for a new project or better replicating other ML results). The tool is implemented with the Langium workbench as a Visual Studio Code plugin and published as an open-source.
描述机器学习数据集的工具
数据集在机器学习(ML)模型的训练和评估中起着核心作用。但它们也是许多不受欢迎的模型行为的根本原因,比如有偏见的预测。为了克服这种情况,ML社区正在提出以数据为中心的文化转变,其中数据问题得到应有的关注,例如,为数据集提出标准描述。在这个意义上,受这些建议的启发,我们提出了一个模型驱动的工具,以精确地描述机器学习数据集的结构、数据来源和社会关注。我们的工具旨在促进任何ML计划利用和受益于ML中的这种以数据为中心的转变(例如,为新项目选择最合适的数据集或更好地复制其他ML结果)。该工具作为Visual Studio Code插件在Langium工作台上实现,并作为开源发布。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信