Social Media Data Analysis Using MapReduce Programming Model and Training a Tweet Classifier Using Apache Mahout

Umit Demirbaga, D. Jha
Published in: 2018 IEEE 8th International Symposium on Cloud and Service Computing (SC2)
Publication date: 2018-11-01
DOI: 10.1109/SC2.2018.00024 (https://doi.org/10.1109/SC2.2018.00024)
Citations: 6

Abstract

Twitter, a micro-blogging service, generates a large amount of data every minute because it lets people express their thoughts and feelings on any topic quickly and clearly. Extracting the desired information from this data requires high-performance parallel computing tools together with machine learning algorithms. Emerging big data processing frameworks (e.g. Hadoop) can handle such data effectively. In this paper, we first introduce a novel approach that uses the MapReduce algorithm to automatically classify Twitter data obtained from the British Geological Survey (BGS) by tweet post date and by the country in which each tweet was posted. The data were collected using specific keywords such as landslide, landslides, mudslide, landfall, landslip, and soil sliding. We then propose a model that determines whether a tweet is landslide-related, using the Naïve Bayes machine learning algorithm with an n-gram language model on Mahout. The paper also describes an algorithm for the pre-processing steps that make the semi-structured Twitter text data ready for classification. The proposed methods let the BGS and other interested parties see the names and number of countries from which tweets were sent, the number of tweets sent from each country, and the dates and time intervals of the tweets, and classify tweets by whether they relate to landslides.
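The grouping of tweets by country and post date described in the abstract can be illustrated with a small MapReduce-style sketch. This is not the authors' implementation and does not run on Hadoop: it is plain Python that mimics the map, shuffle, and reduce phases in memory. The record fields `country` and `created_at`, the timestamp format, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime

# Assumed timestamp format for the hypothetical "created_at" field.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def map_tweet(tweet):
    """Map step: emit ((country, date), 1) for one tweet record."""
    posted = datetime.strptime(tweet["created_at"], TIMESTAMP_FORMAT)
    yield (tweet["country"], posted.date().isoformat()), 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one (country, date) key."""
    return key, sum(values)


def count_tweets(tweets):
    """Run map, shuffle, and reduce over an iterable of tweet records."""
    pairs = (pair for tweet in tweets for pair in map_tweet(tweet))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical records; real tweet data would carry many more fields.
    sample = [
        {"country": "GB", "created_at": "2017-01-05 10:00:00"},
        {"country": "GB", "created_at": "2017-01-05 18:30:00"},
        {"country": "IN", "created_at": "2017-01-06 09:15:00"},
    ]
    for (country, date), total in sorted(count_tweets(sample).items()):
        print(country, date, total)
```

Keying on the (country, date) pair means one pass yields per-country and per-day totals at once; summing those totals over either component would give the number of tweets per country or per date.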