Data Labeling for Machine Learning Engineers: Project-Based Curriculum and Data-Centric Competitions

Anastasia Zhdanovskaya, Daria Baidakova, Dmitry Ustalov
{"title":"Data Labeling for Machine Learning Engineers: Project-Based Curriculum and Data-Centric Competitions","authors":"Anastasia Zhdanovskaya, Daria Baidakova, Dmitry Ustalov","doi":"10.1609/aaai.v37i13.26886","DOIUrl":null,"url":null,"abstract":"The process of training and evaluating machine learning (ML) models relies on high-quality and timely annotated datasets. While a significant portion of academic and industrial research is focused on creating new ML methods, these communities rely on open datasets and benchmarks. However, practitioners often face issues with unlabeled and unavailable data specific to their domain. We believe that building scalable and sustainable processes for collecting data of high quality for ML is a complex skill that needs focused development. To fill the need for this competency, we created a semester course on Data Collection and Labeling for Machine Learning, integrated into a bachelor program that trains data analysts and ML engineers. The course design and delivery illustrate how to overcome the challenge of putting university students with a theoretical background in mathematics, computer science, and physics through a program that is substantially different from their educational habits. Our goal was to motivate students to focus on practicing and mastering a skill that was considered unnecessary to their work. We created a system of inverse ML competitions that showed the students how high-quality and relevant data affect their work with ML models, and their mindset changed completely in the end. Project-based learning with increasing complexity of conditions at each stage helped to raise the satisfaction index of students accustomed to difficult challenges. 
During the course, our invited industry practitioners drew on their first-hand experience with data, which helped us avoid overtheorizing and made the course highly applicable to the students’ future career paths.","PeriodicalId":74506,"journal":{"name":"Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence","volume":"110 1","pages":"15886-15893"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1609/aaai.v37i13.26886","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The process of training and evaluating machine learning (ML) models relies on high-quality and timely annotated datasets. While a significant portion of academic and industrial research is focused on creating new ML methods, these communities rely on open datasets and benchmarks. However, practitioners often face issues with unlabeled and unavailable data specific to their domain. We believe that building scalable and sustainable processes for collecting data of high quality for ML is a complex skill that needs focused development. To fill the need for this competency, we created a semester course on Data Collection and Labeling for Machine Learning, integrated into a bachelor program that trains data analysts and ML engineers. The course design and delivery illustrate how to overcome the challenge of putting university students with a theoretical background in mathematics, computer science, and physics through a program that is substantially different from their educational habits. Our goal was to motivate students to focus on practicing and mastering a skill that was considered unnecessary to their work. We created a system of inverse ML competitions that showed the students how high-quality and relevant data affect their work with ML models, and their mindset changed completely in the end. Project-based learning with increasing complexity of conditions at each stage helped to raise the satisfaction index of students accustomed to difficult challenges. During the course, our invited industry practitioners drew on their first-hand experience with data, which helped us avoid overtheorizing and made the course highly applicable to the students’ future career paths.