{"title":"Replacing Mechanical Turkers? Challenges in the Evaluation of Models with Semantic Properties","authors":"Fred Morstatter, Huan Liu","doi":"10.1145/2935752","DOIUrl":null,"url":null,"abstract":"Some machine-learning algorithms offer more than just predictive power. For example, algorithms provide additional insight into the underlying data. Examples of these algorithms are topic modeling algorithms such as Latent Dirichlet Allocation (LDA) [Blei et al. 2003], whose topics are often inspected as part of the analysis that many researchers perform on their data. Recently, deep learning algorithms such as word embedding algorithms like Word2Vec [Mikolov et al. 2013] have produced models with semantic properties. These algorithms are immensely useful; they tell us something about the environment from which they generate their predictions. One pressing challenge is how to evaluate the quality of the semantic information produced by these algorithms. When we employ algorithms for their semantic properties, it is important that these properties can be understood by a human. Currently, there are no established approaches to carry out this evaluation automatically. This evaluation (if done at all) is usually carried out via user studies. While this type of evaluation is sound, it is expensive from the perspective of both time and cost. It takes a great deal of time to recruit crowdsourced workers to complete tasks on crowdsourcing sites. This adds a huge amount of time to the research process. Furthermore, crowdsourced workers do not work for free. Each individual task costs real currency that could be spent on other parts of the research endeavor. Both the time and financial cost associated with these crowdsourced experiments mean that these types of experiments are difficult to perform. They greatly reduce the probability that future researchers will be able to reproduce these experiments.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"80 1","pages":"1 - 4"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Data and Information Quality (JDIQ)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2935752","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Some machine-learning algorithms offer more than just predictive power; they also provide insight into the underlying data. Examples include topic modeling algorithms such as Latent Dirichlet Allocation (LDA) [Blei et al. 2003], whose topics are often inspected as part of the analysis many researchers perform on their data. More recently, deep learning approaches such as the Word2Vec word embedding algorithm [Mikolov et al. 2013] have produced models with semantic properties. These algorithms are immensely useful: beyond their predictions, they tell us something about the data from which those predictions are generated. One pressing challenge is how to evaluate the quality of the semantic information these algorithms produce. When we employ algorithms for their semantic properties, it is important that a human can understand those properties. Currently, there are no established approaches to carry out this evaluation automatically, so the evaluation (if it is done at all) is usually carried out via user studies. While this type of evaluation is sound, it is expensive in both time and money. Recruiting crowdsourced workers to complete tasks on crowdsourcing sites takes a great deal of time, which adds substantially to the length of the research process. Furthermore, crowdsourced workers do not work for free: each individual task costs real currency that could otherwise be spent on other parts of the research endeavor. Together, the time and financial costs of these crowdsourced experiments make them difficult to perform and greatly reduce the probability that future researchers will be able to reproduce them.
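To make concrete what "semantic properties" a human judge would be asked to inspect, the sketch below is a minimal illustration, assuming the gensim library (version 4.0 or later) and a tiny toy corpus invented here purely for demonstration. It trains an LDA topic model and a Word2Vec embedding and prints the artifacts, top topic words and nearest neighbors, that crowdsourced evaluations typically present to workers; it is not the evaluation procedure discussed in the article itself.

```python
# Minimal sketch of the kind of "semantic" outputs a human judge inspects.
# Assumes gensim >= 4.0; the tiny corpus below is hypothetical and
# illustrative only -- real studies use far larger collections.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

# Toy documents (hypothetical), already tokenized.
texts = [
    ["stock", "market", "trading", "shares", "price"],
    ["market", "price", "shares", "investor", "trading"],
    ["game", "team", "score", "season", "coach"],
    ["coach", "team", "season", "game", "player"],
]

# --- Topic model (LDA): humans typically inspect the top words per topic ---
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=20, random_state=0)
for k in range(lda.num_topics):
    top_words = [word for word, _ in lda.show_topic(k, topn=5)]
    print(f"Topic {k}: {top_words}")

# --- Word embeddings (Word2Vec): humans typically inspect nearest neighbors ---
w2v = Word2Vec(sentences=texts, vector_size=16, window=3,
               min_count=1, epochs=50, seed=0)
print("Neighbors of 'market':", w2v.wv.most_similar("market", topn=3))
```

A crowdsourced study would show word lists and neighbor lists like these to workers, for example asking them to rate coherence or spot an odd word out; that manual judgment step is exactly where the time and monetary costs described above are incurred.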