Using logistic regression method to classify tweets into the selected topics

S. Indra, Liza Wikarsa, Mcomp Bcs, Rinaldo Turang
Published in: 2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS)
Publication date: 2016-10-01
DOI: 10.1109/ICACSIS.2016.7872727 (https://doi.org/10.1109/ICACSIS.2016.7872727)
Citations: 35

Abstract

Health, music, sport, and technology are widely discussed on social networking sites, especially Twitter. Sharing information on these topics can enrich one's knowledge and raise awareness of current trends in one's areas of interest. This research therefore aims to develop a web-based application that classifies users' tweets into these four topics using a machine learning method, Logistic Regression. The application applies four main processes: fetching tweets, preprocessing, text feature extraction, and machine learning. The training data consisted of 1800 labeled tweets for each topic. The preprocessing phase included removal of URLs, punctuation, and stop words; tokenization; and stemming. The application then automatically converted the preprocessed tweets into feature vectors using Bag of Words, and these vectors were passed to the Logistic Regression algorithm for classification. The trained classifier was evaluated on 1800 tweets, 450 per topic. Measured with a confusion matrix, the accuracy of classifying tweets into the selected topics was 92%, which is considered very high.
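The pipeline described in the abstract (preprocessing, Bag of Words, Logistic Regression, confusion-matrix evaluation) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list, the crude suffix-stripping `stem` function, the softmax (multinomial) formulation trained by stochastic gradient descent, the learning rate, and the eight toy tweets are all hypothetical choices made for the example. The abstract does not say which stemmer, stop-word list, or Logistic Regression variant was used.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption; the paper does not publish its list).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "at"}

def stem(token):
    """Crude suffix stripping, a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(tweet):
    text = re.sub(r"https?://\S+", " ", tweet.lower())  # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # remove punctuation
    tokens = text.split()                                # tokenize on whitespace
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def build_vocab(token_lists):
    terms = sorted({t for tokens in token_lists for t in tokens})
    return {term: i for i, term in enumerate(terms)}

def bag_of_words(tokens, vocab):
    """Term-count vector; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for term, count in Counter(tokens).items():
        if term in vocab:
            vec[vocab[term]] = float(count)
    return vec

def class_scores(x, W, b):
    return [sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(W))]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(X, y, n_classes, lr=0.1, epochs=300):
    """Multinomial logistic regression fitted by per-example gradient descent."""
    W = [[0.0] * len(X[0]) for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = softmax(class_scores(x, W, b))
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                b[k] -= lr * err
                for j, v in enumerate(x):
                    if v:
                        W[k][j] -= lr * err * v
    return W, b

def predict(x, W, b):
    scores = class_scores(x, W, b)
    return max(range(len(scores)), key=scores.__getitem__)

def confusion_matrix(y_true, y_pred, n_classes):
    cm = [[0] * n_classes for _ in range(n_classes)]  # rows: true, columns: predicted
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def accuracy(cm):
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

TOPICS = ["health", "music", "sport", "technology"]
# Toy data invented for this sketch; the paper used 1800 labeled tweets per topic.
train_tweets = [
    ("Eating vegetables keeps your heart healthy http://example.com", 0),
    ("Doctors recommend daily exercise for good health", 0),
    ("Listening to the new album, great guitar songs!", 1),
    ("The band played amazing songs at the concert", 1),
    ("Our team scored two goals in the football match", 2),
    ("Great football match, the team won the league", 2),
    ("The new smartphone has a faster processor", 3),
    ("Software update improves the smartphone battery", 3),
]
test_tweets = [
    ("Healthy heart needs exercise", 0),
    ("Love these guitar songs", 1),
    ("The team won the match", 2),
    ("New smartphone software", 3),
]

train_tokens = [preprocess(text) for text, _ in train_tweets]
vocab = build_vocab(train_tokens)
X_train = [bag_of_words(tokens, vocab) for tokens in train_tokens]
W, b = train(X_train, [label for _, label in train_tweets], len(TOPICS))

y_true = [label for _, label in test_tweets]
y_pred = [predict(bag_of_words(preprocess(text), vocab), W, b) for text, _ in test_tweets]
cm = confusion_matrix(y_true, y_pred, len(TOPICS))
print("confusion matrix:", cm)
print("accuracy:", accuracy(cm))
```

Accuracy here is the sum of the confusion matrix diagonal divided by the total number of evaluated tweets, the same quantity the abstract reports as 92% on its 1800-tweet test set. The toy data is far too small to say anything about real performance; it only shows how the four stages connect.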