A Distributed Text Mining System for Online Web Textual Data Analysis

2010 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery. Pub Date: 2010-10-10. DOI: 10.1109/CYBERC.2010.11

Bin Zhou, Yan Jia, Chunyang Liu, Xu Zhang


Citations: 11

Abstract



Real-world Web mining applications usually have to satisfy several requirements at once, such as massive data processing, low system latency, and high scalability. To meet these requirements, we propose a distributed text mining system with a layered architecture that divides the system functions into three layers: the crawling and storage layer, the basic mining layer, and the analysis service layer. Message-oriented middleware is used between the components and services of these layers so that they communicate in a loosely coupled way. To cope with data-intensive workloads and storage failures, a distributed file system stores and manages the raw text data and the various indexes. As a case study, we also discuss the design and implementation of an experimental online topic detection application, which can be scaled to handle thousands of Internet news and forum channels and to perform online analysis.
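The three-layer, message-passing structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes Python's in-process `queue.Queue` as a stand-in for the message-oriented middleware and a plain dictionary as a stand-in for the distributed file system, and every function name (`crawl_and_store`, `basic_mining`, `analysis_service`, `run_pipeline`) is hypothetical. The "analysis" here is a simple term count, chosen only to keep the example short; it is not the topic detection method of the paper.

```python
import queue
import threading
from collections import Counter

STOP = object()  # sentinel message that tells the next layer to shut down


def crawl_and_store(pages, raw_store, out_q):
    """Crawling and storage layer: persist raw text, then announce it."""
    for doc_id, text in pages:
        raw_store[doc_id] = text   # assumption: dict stands in for the distributed file system
        out_q.put(doc_id)          # publish only the document id, not the payload
    out_q.put(STOP)


def basic_mining(raw_store, in_q, out_q):
    """Basic mining layer: turn each stored document into term counts."""
    while (doc_id := in_q.get()) is not STOP:
        terms = Counter(raw_store[doc_id].lower().split())
        out_q.put((doc_id, terms))
    out_q.put(STOP)


def analysis_service(in_q, top_n=3):
    """Analysis service layer: aggregate term counts across all documents."""
    totals = Counter()
    while (msg := in_q.get()) is not STOP:
        _doc_id, terms = msg
        totals.update(terms)
    return totals.most_common(top_n)


def run_pipeline(pages, top_n=3):
    """Wire the three layers together through two queues."""
    raw_store = {}
    crawl_q, mined_q = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=crawl_and_store, args=(pages, raw_store, crawl_q)),
        threading.Thread(target=basic_mining, args=(raw_store, crawl_q, mined_q)),
    ]
    for worker in workers:
        worker.start()
    result = analysis_service(mined_q, top_n)  # runs in the calling thread
    for worker in workers:
        worker.join()
    return result


if __name__ == "__main__":
    sample = [("news-1", "Flood warning flood"), ("forum-7", "flood relief")]
    print(run_pipeline(sample, top_n=2))
```

Each layer only sees the queue it reads from and the queue it writes to, which is the loose coupling the abstract attributes to the middleware: a layer could be replaced or run as several workers without the others changing. In a real deployment the queues would be a networked message broker and the dictionary a replicated file store.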

