Using logistic regression method to classify tweets into the selected topics
Authors: S. Indra, Liza Wikarsa, Mcomp Bcs, Rinaldo Turang
Venue: 2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS)
Published: 2016-10-01
DOI: https://doi.org/10.1109/ICACSIS.2016.7872727
Citations: 35
Abstract
Health, music, sport, and technology are widely discussed on social networking sites, especially Twitter. Sharing information on these topics can enrich one's knowledge and raise awareness of current trends in an area of interest. This research therefore aims to develop a web-based application that classifies netizens' tweets into these four topics using a machine learning method, Logistic Regression. The application applies four main processes: fetching tweets, preprocessing, text feature extraction, and machine learning. The training data consisted of 1,800 labeled tweets per topic. Preprocessing included removal of URLs, punctuation, and stop words, as well as tokenization and stemming. The application then automatically converted the preprocessed tweets into feature vectors using Bag of Words, and these vectors were passed to the Logistic Regression algorithm for classification. The trained classifier was evaluated on 1,800 tweets, 450 per topic. Measured with a confusion matrix, the classification accuracy was 92%, which the authors consider very high.
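The abstract does not give implementation details, so the following is only a minimal, standard-library sketch of the stages it names after tweet fetching: preprocessing, Bag of Words feature extraction, Logistic Regression training, and confusion-matrix evaluation. Everything specific here is an assumption made for illustration: the stop-word list, the crude suffix-stripping stand-in for a real stemmer, the softmax (multinomial) form of logistic regression, the learning rate and epoch count, and the eight invented toy tweets. None of it is the authors' code or data.

```python
import math
import re
from collections import Counter

TOPICS = ["health", "music", "sport", "technology"]
# Assumed stop-word list; the paper does not say which list it used.
STOP_WORDS = {"a", "an", "and", "the", "is", "at", "to", "of", "in", "for", "my", "this"}

def preprocess(tweet):
    """Remove URLs and punctuation, tokenize, drop stop words, crudely stem."""
    text = re.sub(r"https?://\S+", " ", tweet.lower())
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Placeholder stemmer (assumption): strip a trailing "ing" or "s" from longer tokens.
    return [re.sub(r"(ing|s)$", "", t) if len(t) > 4 else t for t in tokens]

def build_vocab(docs):
    """Map each distinct token to a column index."""
    return {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

def vectorize(doc, vocab):
    """Bag of Words: one count per vocabulary token; unknown tokens are ignored."""
    vec = [0.0] * len(vocab)
    for tok, n in Counter(doc).items():
        if tok in vocab:
            vec[vocab[tok]] = float(n)
    return vec

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scores_for(x, W, b):
    return [sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(W))]

def train(X, y, n_classes, epochs=300, lr=0.1):
    """Multinomial logistic regression fitted with plain stochastic gradient descent."""
    W = [[0.0] * len(X[0]) for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = softmax(scores_for(x, W, b))
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == label else 0.0)
                b[k] -= lr * grad
                for j, v in enumerate(x):
                    if v:
                        W[k][j] -= lr * grad * v
    return W, b

def predict(x, W, b):
    s = scores_for(x, W, b)
    return max(range(len(s)), key=s.__getitem__)

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true topics, columns are predicted topics."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

# Invented toy tweets (NOT the paper's data); labels index into TOPICS.
TOY_TWEETS = [
    ("Doctor says vitamin helps flu recovery http://t.co/x", 0),
    ("Vaccine for flu, ask your doctor!", 0),
    ("Loving this album and the concert tonight", 1),
    ("Guitar solo at the concert was amazing", 1),
    ("Great goal in the football match", 2),
    ("Coach praised the team after the match", 2),
    ("Faster smartphone chip boosts laptop speed", 3),
    ("Software update fixes the smartphone bug", 3),
]

docs = [preprocess(text) for text, _ in TOY_TWEETS]
labels = [label for _, label in TOY_TWEETS]
vocab = build_vocab(docs)
X = [vectorize(d, vocab) for d in docs]
W, b = train(X, labels, n_classes=len(TOPICS))
preds = [predict(x, W, b) for x in X]
cm = confusion_matrix(labels, preds, len(TOPICS))
print("accuracy on the toy training set:", accuracy(cm))
```

The toy run scores the classifier on its own training tweets only to show how the pieces connect; the paper instead evaluates on a separate set of 1,800 tweets. Accuracy is read off the confusion matrix as the diagonal sum divided by the total, so the reported 92% on 1,800 test tweets corresponds to roughly 1,656 correctly classified tweets.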