DescribeML: a tool for describing machine learning datasets
Joan Giner-Miguelez, A. Gómez, Jordi Cabot
Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings, 2022-10-23
DOI: 10.1145/3550356.3559087 (https://doi.org/10.1145/3550356.3559087)
Citations: 10
Abstract
Datasets play a central role in the training and evaluation of machine learning (ML) models, but they are also the root cause of many undesired model behaviors, such as biased predictions. To address this, the ML community is proposing a data-centric cultural shift in which data issues are given the attention they deserve, for instance by proposing standard descriptions for datasets. Inspired by these proposals, we present a model-driven tool to precisely describe machine learning datasets in terms of their structure, data provenance, and social concerns. Our tool aims to help any ML initiative leverage and benefit from this data-centric shift (e.g., by selecting the most appropriate dataset for a new project or better replicating other ML results). The tool is implemented with the Langium workbench as a Visual Studio Code plugin and published as open source.