Theme Identification using Machine Learning Techniques

Siti Hajar Jayady, Hasmawati Antong
Journal of Integrated and Advanced Engineering (JIAE), published 2021-11-30. DOI: 10.51662/jiae.v1i2.24 (https://doi.org/10.51662/jiae.v1i2.24)
Citations: 12

Abstract

With the abundance of online research platforms, much information presented in PDF files, such as articles and journals, can be obtained easily. As a result, students completing research projects often have many downloaded PDF articles on their laptops. However, manually identifying the target articles within such a collection is tiring, as most articles consist of several pages that need to be analyzed. Reading each article to determine whether it relates to a theme, and then organizing the articles by theme, is time- and energy-consuming. To address this problem, a PDF file organizer that implements a theme identifier is needed. This work therefore focuses on automatic text classification using machine learning methods to build a theme identifier, employed in the PDF file organizer, that classifies articles into two themes: augmented reality and machine learning. A total of 1000 text documents covering both themes were used to build the classification model. A preprocessing step was performed for data cleaning, followed by TF-IDF feature extraction for text vectorization and to reduce vector sparsity. 80% of the dataset was used for training, and the remainder was used to validate the trained models. The classification models proposed in this work are Linear SVM and Multinomial Naïve Bayes. The accuracy of the models was evaluated using a confusion matrix. For the Linear SVM model, grid-search optimization was performed to determine the optimal value of the Cost parameter.
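The pipeline summarized in the abstract (cleaning, TF-IDF vectorization, an 80/20 split, a Multinomial Naïve Bayes classifier, and a confusion matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, the toy corpus is synthetic, and every function name (`tokenize`, `fit_idf`, `tfidf`, `train_nb`, `predict`, `confusion_matrix`) is a hypothetical helper introduced here. The Linear SVM model and the grid search over its Cost parameter are not covered by this sketch.

```python
import math
import random
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and keep alphabetic tokens only (a stand-in for data cleaning)."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return cleaned.split()

def fit_idf(docs):
    """Smoothed inverse document frequency for each term seen in training."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc, idf):
    """Map a token list to an L2-normalised TF-IDF vector stored as a dict."""
    weights = {t: c * idf[t] for t, c in Counter(doc).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

def train_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes with additive smoothing over TF-IDF weights."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    totals = {c: defaultdict(float) for c in classes}
    for vec, y in zip(vectors, labels):
        for t, w in vec.items():
            totals[y][t] += w
    loglik = {}
    for c in classes:
        denom = sum(totals[c].values()) + alpha * len(vocab)
        loglik[c] = {t: math.log((totals[c][t] + alpha) / denom) for t in vocab}
    return prior, loglik

def predict(vec, model):
    """Return the class with the highest log-posterior score."""
    prior, loglik = model
    scores = {c: prior[c] + sum(w * loglik[c][t] for t, w in vec.items())
              for c in prior}
    return max(scores, key=scores.get)

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Synthetic two-theme corpus (assumption: real input would be text from PDFs).
ar_terms = "augmented reality headset overlay marker tracking display camera".split()
ml_terms = "machine learning classifier training gradient dataset model feature".split()
random.seed(0)
corpus = []
for label, terms in (("AR", ar_terms), ("ML", ml_terms)):
    for _ in range(50):
        corpus.append((" ".join(random.choices(terms, k=12)), label))
random.shuffle(corpus)

# 80% of the documents for training, the remainder for validation.
split = int(0.8 * len(corpus))
train, test = corpus[:split], corpus[split:]

train_docs = [tokenize(text) for text, _ in train]
train_labels = [label for _, label in train]
idf = fit_idf(train_docs)
model = train_nb([tfidf(d, idf) for d in train_docs], train_labels, vocab=list(idf))

classes = ["AR", "ML"]
y_true = [label for _, label in test]
y_pred = [predict(tfidf(tokenize(text), idf), model) for text, _ in test]
matrix = confusion_matrix(y_true, y_pred, classes)
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(test)
print(matrix, accuracy)
```

Accuracy is read off the confusion matrix as the sum of the diagonal divided by the number of validation documents. Because the two synthetic vocabularies do not overlap, this toy task is far easier than classifying real articles, so the result says nothing about the accuracy reported in the paper. In practice, a library such as scikit-learn (`TfidfVectorizer`, `MultinomialNB`, `LinearSVC`, `GridSearchCV`) would likely replace these hand-written helpers.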