A Distributed Text Mining System for Online Web Textual Data Analysis

Bin Zhou, Yan Jia, Chunyang Liu, Xu Zhang
Published in: 2010 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery
Publication date: 2010-10-10
DOI: 10.1109/CYBERC.2010.11 (https://doi.org/10.1109/CYBERC.2010.11)
Citations: 11

Abstract

Real-world Web mining applications usually have different requirements, such as massive data processing, low system latency, and high scalability. To meet these requirements, we propose a distributed text mining system with a layered architecture that divides the system functions into three layers: the crawling and storage layer, the basic mining layer, and the analysis service layer. Message-oriented middleware is used between the components and services of these layers so that they communicate in a loosely coupled way. To cope with data-intensive workloads and storage failures, a distributed file system stores and manages the raw text data and the various indexes. As a case study, we also discuss the design and implementation of an experimental online topic detection application, which can be scaled to handle thousands of Internet news and forum channels and perform online analysis.
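The three-layer design described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: Python's in-process `queue.Queue` stands in for the message-oriented middleware, a plain dictionary stands in for the distributed file system, and simple term counting stands in for topic detection. All function names (`crawl_and_store`, `basic_mining`, `analysis_service`, `run_pipeline`) are hypothetical and chosen only for illustration.

```python
import queue
import threading
from collections import Counter


def crawl_and_store(pages, store, out_q):
    """Crawling and storage layer: save raw text, then announce each doc id."""
    for doc_id, text in pages:
        store[doc_id] = text  # dict is an assumed stand-in for a distributed file system
        out_q.put(doc_id)     # message to the next layer (queue stands in for middleware)
    out_q.put(None)           # sentinel: no more documents


def basic_mining(store, in_q, out_q):
    """Basic mining layer: turn each stored document into term counts."""
    while (doc_id := in_q.get()) is not None:
        terms = Counter(store[doc_id].lower().split())
        out_q.put((doc_id, terms))
    out_q.put(None)


def analysis_service(in_q, top_n):
    """Analysis service layer: aggregate term counts into the most frequent terms."""
    totals = Counter()
    while (msg := in_q.get()) is not None:
        _, terms = msg
        totals.update(terms)
    return [term for term, _ in totals.most_common(top_n)]


def run_pipeline(pages, top_n=3):
    """Wire the layers together; each layer only sees its queues, not its peers."""
    store = {}
    crawl_to_mine = queue.Queue()
    mine_to_analyze = queue.Queue()
    workers = [
        threading.Thread(target=crawl_and_store, args=(pages, store, crawl_to_mine)),
        threading.Thread(target=basic_mining, args=(store, crawl_to_mine, mine_to_analyze)),
    ]
    for w in workers:
        w.start()
    result = analysis_service(mine_to_analyze, top_n)
    for w in workers:
        w.join()
    return result


if __name__ == "__main__":
    sample = [
        ("a", "storm hits coast"),
        ("b", "storm warning issued"),
        ("c", "coast storm damage"),
    ]
    print(run_pipeline(sample, top_n=2))
```

Because each layer communicates only through its queues, a layer could in principle be replaced or replicated without changing the others, which is the kind of loose coupling the abstract attributes to the middleware.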