Theme Identification using Machine Learning Techniques

Journal of Integrated and Advanced Engineering (JIAE), Pub Date: 2021-11-30, DOI: 10.51662/jiae.v1i2.24

Siti Hajar Jayady, Hasmawati Antong

Citations: 12

Abstract

With the abundance of online research platforms, a great deal of information in PDF form, such as journal articles, can be obtained easily. As a result, students working on research projects tend to accumulate many downloaded PDF articles on their laptops. However, manually identifying the target articles within such a collection is tiring, as most articles run to several pages that must be analyzed. Reading each article to determine whether it relates to a theme, and then organizing the articles by theme, consumes time and energy. To address this problem, a PDF file organizer that implements a theme identifier is needed. This work therefore focuses on automatic text classification using machine learning methods to build a theme identifier, employed in the PDF file organizer, that classifies articles into two themes: augmented reality and machine learning. A total of 1000 text documents covering both themes were used to build the classification model. A preprocessing step was performed for data cleaning, followed by TF-IDF feature extraction for text vectorization and to reduce vector sparsity. 80% of the dataset was used for training, and the remainder was used to validate the trained models. The classification models proposed in this work are Linear SVM and Multinomial Naïve Bayes. The accuracy of the models was evaluated using a confusion matrix. For the Linear SVM model, grid-search optimization was performed to determine the optimal value of the cost parameter.
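The abstract does not say how the pipeline was implemented, so the following is only a minimal sketch of the steps it names: cleaning, TF-IDF vectorization, an 80/20 split, a Multinomial Naïve Bayes classifier, and a confusion matrix. It uses only the Python standard library. The ten-sentence corpus, the stopword list, the smoothed IDF formula, the fixed (unshuffled) split, and every function name are assumptions made for illustration, not details taken from the paper. The Linear SVM and its grid search over the cost parameter are omitted here.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for the paper's 1000 extracted documents.
CORPUS = [
    ("Augmented reality overlays virtual objects on a live camera view", "AR"),
    ("Machine learning models learn patterns from labeled training data", "ML"),
    ("A headset display renders virtual objects for augmented reality users", "AR"),
    ("A classifier is trained on labeled data with supervised machine learning", "ML"),
    ("Marker tracking aligns virtual content with the camera in augmented reality", "AR"),
    ("Neural network training minimizes a loss function over the data", "ML"),
    ("Augmented reality applications render holograms through a mobile camera", "AR"),
    ("Support vector machines and naive Bayes are machine learning classifiers", "ML"),
    ("The headset tracks markers to place virtual holograms in the room", "AR"),
    ("The model accuracy is evaluated on held out test data after training", "ML"),
]
# Assumed stopword list; a real cleaning step would be more thorough.
STOPWORDS = {"a", "an", "and", "are", "after", "for", "from", "in", "is", "of",
             "on", "out", "over", "the", "through", "to", "with"}

def tokenize(text):
    """Cleaning: lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def fit_idf(texts):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    df = Counter(term for text in texts for term in set(tokenize(text)))
    return {t: math.log((1 + len(texts)) / (1 + c)) + 1 for t, c in df.items()}

def tfidf(text, idf):
    """L2-normalized TF-IDF vector as a sparse dict; unseen terms are ignored."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def train_nb(vectors, labels, vocab_size, alpha=1.0):
    """Multinomial Naive Bayes with additive smoothing over TF-IDF weights."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            sums[label][term] += weight
    model = {}
    for label, count in Counter(labels).items():
        model[label] = {
            "log_prior": math.log(count / len(labels)),
            "weights": sums[label],
            "denom": sum(sums[label].values()) + alpha * vocab_size,
            "alpha": alpha,
        }
    return model

def predict(vec, model):
    """Return the label with the highest log-posterior score."""
    def score(label):
        m = model[label]
        return m["log_prior"] + sum(
            w * math.log((m["weights"].get(t, 0.0) + m["alpha"]) / m["denom"])
            for t, w in vec.items()
        )
    return max(model, key=score)

# 80/20 split. The corpus alternates classes, so a plain slice stays balanced;
# a real pipeline would shuffle (and ideally stratify) before splitting.
cut = int(0.8 * len(CORPUS))
train, test = CORPUS[:cut], CORPUS[cut:]

idf = fit_idf([text for text, _ in train])
model = train_nb([tfidf(text, idf) for text, _ in train],
                 [label for _, label in train], vocab_size=len(idf))

def classify(text):
    """Vectorize with the training IDF table, then predict a theme label."""
    return predict(tfidf(text, idf), model)

# Confusion matrix as counts of (true label, predicted label) pairs.
confusion = Counter((label, classify(text)) for text, label in test)
accuracy = sum(n for (true, pred), n in confusion.items() if true == pred) / len(test)
print(dict(confusion), accuracy)
```

The IDF table is fitted on the training split only, so the held-out documents do not influence the features used to evaluate them. In practice a library such as scikit-learn would likely replace these hand-written pieces and would also cover the Linear SVM and its grid search.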
