Scaling big data mining infrastructure: the Twitter experience

SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining. Pub Date: 2013-04-30. DOI: 10.1145/2481244.2481247

Jimmy J. Lin, D. Ryaboy

Volume: 15 1. Pages: 6-19. Publication type: Journal Article.

Citations: 191

Abstract

The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this paper, we discuss the evolution of our infrastructure and the development of capabilities for data mining on "big data". One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life "in the trenches" is occupied by much preparatory work that precedes the application of data mining algorithms and followed by substantial effort to turn preliminary models into robust solutions. In this context, we discuss two topics: First, schemas play an important role in helping data scientists understand petabyte-scale data stores, but they're insufficient to provide an overall "big picture" of the data available to generate insights. Second, we observe that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated together into production workflows---we refer to this as "plumbing". This paper has two goals: For practitioners, we hope to share our experiences to flatten bumps in the road for those who come after us. For academic researchers, we hope to provide a broader context for data mining in production environments, pointing out opportunities for future work.

Source journal

SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining

Self-citation rate

0.00%