Designing a scalable crowdsourcing platform

C. V. Pelt, A. Sorokin
{"title":"设计可扩展的众包平台","authors":"C. V. Pelt, A. Sorokin","doi":"10.1145/2213836.2213951","DOIUrl":null,"url":null,"abstract":"Computers are extremely efficient at crawling, storing and processing huge volumes of structured data. They are great at exploiting link structures to generate valuable knowledge. Yet there are plenty of data processing tasks that are difficult today. Labeling sentiment, moderating images, and mining structured content from the web are still too hard for computers. Automated techniques can get us a long way in some of those, but human inteligence is required when an accurate decision is ultimately important. In many cases that decision is easy for people and can be made quickly - in a few seconds to few minutes. By creating millions of simple online tasks we create a distributed computing machine. By shipping the tasks to millions of contributers around the globe, we make this human computer available 24/7 to make important decisions about your data. In this talk, I will describe our approach to designing CrowdFlower - a scalable crowdsourcing platform - as it evolved over the last 4 years. We think about crowdsourcing in terms of Quality, Cost and Speed. They are the ultimate design objectives of a human computer. Unfortunately, we can't have all 3. A general price-constrained task requiring 99.9% accuracy and 10 minute turnaround is not possible today. I will discuss design decisions behind CrowdFlower that allow us to pursue any two of these objectives. I will briefly present examples of common crowdsourced tasks and tools built into the platform to make the design of complex tasks easy, tools such as CrowdFlower Markup Language(CML). Quality control is the single most important challenge in Crowdsourcing. To enable an unidentified crowd of people to produce meaningful work, we must be certain that we can filter out bad contributors and produce high quality output. Initially we only used consensus. As the diversity and size of our crowd grew, so did the number of people attempting fraud. CrowdFlower developed \"Gold standard\" to block attempts of fraud. The use of gold allowed us to train contributors for the details of specific domains. By defining expected responses for a subset of the work and providing explanations of why a given response was expected, we are able distribute tasks to an ever-expanding anonymous workforce without sacrificing quality.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"43","resultStr":"{\"title\":\"Designing a scalable crowdsourcing platform\",\"authors\":\"C. V. Pelt, A. Sorokin\",\"doi\":\"10.1145/2213836.2213951\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Computers are extremely efficient at crawling, storing and processing huge volumes of structured data. They are great at exploiting link structures to generate valuable knowledge. Yet there are plenty of data processing tasks that are difficult today. Labeling sentiment, moderating images, and mining structured content from the web are still too hard for computers. Automated techniques can get us a long way in some of those, but human inteligence is required when an accurate decision is ultimately important. 
In many cases that decision is easy for people and can be made quickly - in a few seconds to few minutes. By creating millions of simple online tasks we create a distributed computing machine. By shipping the tasks to millions of contributers around the globe, we make this human computer available 24/7 to make important decisions about your data. In this talk, I will describe our approach to designing CrowdFlower - a scalable crowdsourcing platform - as it evolved over the last 4 years. We think about crowdsourcing in terms of Quality, Cost and Speed. They are the ultimate design objectives of a human computer. Unfortunately, we can't have all 3. A general price-constrained task requiring 99.9% accuracy and 10 minute turnaround is not possible today. I will discuss design decisions behind CrowdFlower that allow us to pursue any two of these objectives. I will briefly present examples of common crowdsourced tasks and tools built into the platform to make the design of complex tasks easy, tools such as CrowdFlower Markup Language(CML). Quality control is the single most important challenge in Crowdsourcing. To enable an unidentified crowd of people to produce meaningful work, we must be certain that we can filter out bad contributors and produce high quality output. Initially we only used consensus. As the diversity and size of our crowd grew, so did the number of people attempting fraud. CrowdFlower developed \\\"Gold standard\\\" to block attempts of fraud. The use of gold allowed us to train contributors for the details of specific domains. By defining expected responses for a subset of the work and providing explanations of why a given response was expected, we are able distribute tasks to an ever-expanding anonymous workforce without sacrificing quality.\",\"PeriodicalId\":212616,\"journal\":{\"name\":\"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-05-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"43\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2213836.2213951\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2213836.2213951","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 43

Abstract

Computers are extremely efficient at crawling, storing, and processing huge volumes of structured data. They are great at exploiting link structures to generate valuable knowledge. Yet plenty of data processing tasks remain difficult today. Labeling sentiment, moderating images, and mining structured content from the web are still too hard for computers. Automated techniques can get us a long way on some of these, but human intelligence is required when an accurate decision is ultimately important. In many cases that decision is easy for people and can be made quickly, in a few seconds to a few minutes. By creating millions of simple online tasks we create a distributed computing machine. By shipping the tasks to millions of contributors around the globe, we make this human computer available 24/7 to make important decisions about your data.

In this talk, I will describe our approach to designing CrowdFlower, a scalable crowdsourcing platform, as it has evolved over the last four years. We think about crowdsourcing in terms of quality, cost, and speed; they are the ultimate design objectives of a human computer. Unfortunately, we can't have all three: a general price-constrained task requiring 99.9% accuracy and a 10-minute turnaround is not possible today. I will discuss the design decisions behind CrowdFlower that allow us to pursue any two of these objectives. I will also briefly present examples of common crowdsourced tasks and of tools built into the platform to make the design of complex tasks easy, such as the CrowdFlower Markup Language (CML).

Quality control is the single most important challenge in crowdsourcing. To enable an unidentified crowd of people to produce meaningful work, we must be certain that we can filter out bad contributors and produce high-quality output. Initially we used only consensus. As the diversity and size of our crowd grew, so did the number of people attempting fraud. CrowdFlower developed the "gold standard" to block fraud attempts. The use of gold also allowed us to train contributors on the details of specific domains. By defining expected responses for a subset of the work and explaining why a given response is expected, we are able to distribute tasks to an ever-expanding anonymous workforce without sacrificing quality.
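The abstract names two quality-control mechanisms, consensus and gold-standard questions, without giving implementation details. The sketch below is a minimal illustration of the general idea, not CrowdFlower's actual code: the record layout, the 70% accuracy cutoff, and all identifiers are assumptions made for the example. It screens contributors by their accuracy on hidden gold questions and then aggregates the surviving judgments by majority vote.

```python
from collections import Counter, defaultdict

# Hypothetical judgment records: (contributor_id, task_id, response).
# Gold tasks are a hidden subset whose expected response is known in advance.
GOLD_ANSWERS = {"task_7": "positive", "task_12": "negative"}
MIN_GOLD_ACCURACY = 0.7  # assumed cutoff; the real policy is not given in the abstract

def score_contributors(judgments):
    """Return each contributor's accuracy on the gold tasks they answered."""
    hits, seen = Counter(), Counter()
    for contributor, task, response in judgments:
        if task in GOLD_ANSWERS:
            seen[contributor] += 1
            hits[contributor] += int(response == GOLD_ANSWERS[task])
    return {c: hits[c] / seen[c] for c in seen}

def aggregate(judgments):
    """Drop contributors who fail the gold check, then take a majority vote per task."""
    accuracy = score_contributors(judgments)
    trusted = {c for c, acc in accuracy.items() if acc >= MIN_GOLD_ACCURACY}
    votes = defaultdict(Counter)
    for contributor, task, response in judgments:
        if contributor in trusted and task not in GOLD_ANSWERS:
            votes[task][response] += 1
    # Consensus: the most common response among trusted contributors wins.
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}

if __name__ == "__main__":
    data = [
        ("alice", "task_7", "positive"), ("alice", "task_1", "positive"),
        ("bob",   "task_7", "negative"), ("bob",   "task_1", "negative"),
        ("carol", "task_7", "positive"), ("carol", "task_1", "positive"),
    ]
    # bob misses the gold question, so only alice and carol are counted for task_1.
    print(aggregate(data))
```

As the abstract notes, gold questions also double as training: each one carries an explanation of why the expected response is correct, which can be shown to contributors who answer it incorrectly.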