Event Detection in Twitter: A Keyword Volume Approach

2018 IEEE International Conference on Data Mining Workshops (ICDMW). Pub Date: 2018-11-01. DOI: 10.1109/ICDMW.2018.00172

A. Hossny, Lewis Mitchell


Citations: 23

Abstract

Event detection using social media streams needs a set of informative features with strong signals that need minimal preprocessing and are highly associated with events of interest. Identifying these informative features as keywords from Twitter is challenging, as people use informal language to express their thoughts and feelings. This informality includes acronyms, misspelled words, synonyms, transliteration and ambiguous terms. In this paper, we propose an efficient method to select the keywords frequently used in Twitter that are mostly associated with events of interest such as protests. The volume of these keywords is tracked in real time to identify the events of interest in a binary classification scheme. We use keywords within word-pairs to capture the context. The proposed method is to binarize vectors of daily counts for each word-pair by applying a spike detection temporal filter, then use the Jaccard metric to measure the similarity of the binary vector for each word-pair with the binary vector describing event occurrence. The top n word-pairs are used as features to classify any day to be an event or non-event day. The selected features are tested using multiple classifiers such as Naive Bayes, SVM, Logistic Regression, KNN and decision trees. They all produced AUC ROC scores up to 0.91 and F1 scores up to 0.79. The experiment is performed using the English language in multiple cities such as Melbourne, Sydney and Brisbane as well as the Indonesian language in Jakarta. The two experiments, comprising different languages and locations, yielded similar results.
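The feature-selection steps described in the abstract (binarize daily counts with a spike filter, score each word-pair against the event vector with the Jaccard metric, keep the top n) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the spike detection filter, so the rule used here (a count above the mean plus k standard deviations of the preceding window) is an assumption, as are the function names, the parameters `window` and `k`, and the toy data.

```python
from statistics import mean, pstdev


def binarize_spikes(counts, window=3, k=1.0):
    """Flag day t with 1 when its count exceeds mean + k * std of the
    preceding `window` days. Assumed spike rule, not the paper's filter."""
    flags = []
    for t, count in enumerate(counts):
        history = counts[max(0, t - window):t]
        if len(history) < window:
            flags.append(0)  # not enough history to judge this day
            continue
        threshold = mean(history) + k * pstdev(history)
        flags.append(1 if count > threshold else 0)
    return flags


def jaccard(a, b):
    """Jaccard similarity of two equal-length binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def top_word_pairs(daily_counts, event_days, n):
    """Rank word-pairs by Jaccard similarity between their binarized
    daily counts and the binary event vector; return the best n."""
    scores = {
        pair: jaccard(binarize_spikes(counts), event_days)
        for pair, counts in daily_counts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical toy data: ten days, with events on days 3 and 7.
event_days = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
daily_counts = {
    ("protest", "rally"): [2, 3, 2, 20, 3, 2, 3, 25, 2, 3],
    ("march", "city"): [1, 1, 1, 8, 1, 1, 1, 1, 1, 1],
    ("coffee", "morning"): [5, 5, 5, 5, 5, 9, 5, 5, 5, 5],
}
ranked = top_word_pairs(daily_counts, event_days, n=2)
print(ranked)
```

In this toy data the pair that spikes on both event days ranks first, the pair that spikes on only one event day ranks second, and the pair whose only spike falls on a non-event day is dropped. The selected pairs' binary vectors would then serve as the input features for the classifiers named in the abstract.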

