Replacing Mechanical Turkers? Challenges in the Evaluation of Models with Semantic Properties

Fred Morstatter, Huan Liu
Journal of Data and Information Quality (JDIQ), pp. 1-4. Published 2016-10-12. DOI: 10.1145/2935752 (https://doi.org/10.1145/2935752). Cited by: 1.

Abstract

Some machine-learning algorithms offer more than just predictive power: they also provide insight into the underlying data. One example is topic modeling, such as Latent Dirichlet Allocation (LDA) [Blei et al. 2003], whose topics are often inspected as part of the analysis that researchers perform on their data. More recently, word embedding algorithms such as Word2Vec [Mikolov et al. 2013] have produced models with semantic properties. These algorithms are immensely useful; they tell us something about the environment from which they generate their predictions. One pressing challenge is how to evaluate the quality of the semantic information they produce. When we employ algorithms for their semantic properties, it is important that those properties be understandable to a human. Currently, there are no established approaches for carrying out this evaluation automatically, so the evaluation (if done at all) is usually carried out via user studies. While this type of evaluation is sound, it is expensive in both time and money. Recruiting crowdsourced workers to complete tasks on crowdsourcing sites adds a great deal of time to the research process, and the workers do not work for free: each individual task costs real currency that could be spent on other parts of the research endeavor. Together, the time and financial costs make these crowdsourced experiments difficult to perform, and they greatly reduce the probability that future researchers will be able to reproduce them.