Building Python-Based Topologies for Massive Processing of Social Media Data in Real Time

Proceedings of the 5th Spanish Conference on Information Retrieval | Pub Date: 2018-06-26 | DOI: 10.1145/3230599.3230618

Rodrigo Martínez-Castaño, J. C. Pichel, D. Losada


Citations: 1

Abstract



In this paper we propose a streaming approach for real-time processing of huge amounts of data. CATENAE is a library for easily building and executing Python topologies (e.g., a web crawler or a classifier). Topologies are designed to be deployed inside Docker containers, so horizontal scaling, granular resource assignment and isolation can be achieved easily. Furthermore, each micromodule can have its own dependencies (including the Python version), and the user can limit resources such as CPU or memory per instance. We describe an implementation of a use case composed of two topologies: (1) a crawler for tracking users in social media and (2) an early risk detector of depression. We also explain how CATENAE topologies can be connected to non-Python systems.
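The idea of a topology built from small chained modules can be illustrated with a minimal sketch. It does not use the CATENAE API and is not the authors' implementation: it relies only on the Python standard library, runs each module in a thread rather than in a Docker container, and every name in it (`run_module`, `run_topology`, `normalize`, `flag_risk`, `RISK_TERMS`) is hypothetical. The keyword check is a toy placeholder, not a depression detector.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the stream


def run_module(transform, inbox, outbox):
    """Consume items from inbox, apply transform, forward results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next module
            return
        result = transform(item)
        if result is not None:  # a module returns None to drop an item
            outbox.put(result)


def run_topology(transforms, items):
    """Chain transforms with queues, one thread per module, and collect the output."""
    queues = [queue.Queue() for _ in range(len(transforms) + 1)]
    threads = [
        threading.Thread(target=run_module, args=(t, queues[i], queues[i + 1]))
        for i, t in enumerate(transforms)
    ]
    for th in threads:
        th.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)

    outputs = []
    while True:
        out = queues[-1].get()
        if out is STOP:
            break
        outputs.append(out)
    for th in threads:
        th.join()
    return outputs


def normalize(post):
    """Hypothetical first stage: clean the text and drop empty posts."""
    text = post["text"].strip().lower()
    return {"user": post["user"], "text": text} if text else None


RISK_TERMS = {"hopeless", "worthless"}  # toy keyword list (assumption)


def flag_risk(post):
    """Hypothetical second stage: mark posts containing any toy risk term."""
    words = set(post["text"].split())
    return {**post, "flagged": bool(words & RISK_TERMS)}


if __name__ == "__main__":
    posts = [
        {"user": "a", "text": "  I feel HOPELESS today "},
        {"user": "b", "text": "   "},
        {"user": "c", "text": "Nice weather"},
    ]
    for result in run_topology([normalize, flag_risk], posts):
        print(result)
```

Because each stage only reads from one queue and writes to the next, a stage could in principle be replaced by a separate process or container without touching its neighbours, which is the property the abstract attributes to topologies of micromodules.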

