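Implementation of K-Means Clustering Method for Trend Analysis of Thesis Topics (Case Study: Faculty of Computer Science, University of Jember)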

BERKALA SAINSTEK. Pub Date: 2022-12-10. DOI: 10.19184/bst.v10i4.29524

M. Irianto, Achmad Maududie, Fajrin Nurman Arifin


Citations: 1



Implementation of K-Means Clustering Method for Trend Analysis of Thesis Topics (Case Study: Faculty of Computer Science, University of Jember)

Advances in information technology have produced a large number of digital documents, thesis documents in particular, which makes it easy for students to choose similar topics rather than varied ones. Thesis documents can be grouped by topic from their abstracts, and the resulting groups can be visualized over time to reveal the trend of each topic. Data were collected by web scraping the University of Jember repository, yielding 490 thesis documents by students of the Faculty of Computer Science, University of Jember. Preprocessing uses text mining steps: cleaning, filtering, stemming, and tokenizing. Each word is then weighted with Term Frequency - Inverse Document Frequency (TF-IDF), the weights are normalized with Z-Score, and their dimensionality is reduced with Principal Component Analysis. Outliers are removed before the documents are clustered. Documents are grouped with K-Means Clustering using Cosine Similarity as the distance measure, and the clustering is evaluated with the Silhouette Coefficient. Tests over various values of k gave the best result at k = 2, with a Silhouette value of 0.80. Topics are then detected with Latent Dirichlet Allocation for each cluster formed. Each cluster is visualized as a line chart with a linear trend line and analyzed to determine its trend. The analysis shows that the topic of Decision Support System Development is trending down, while the topic of IT Performance Measurement and Forecasting is trending up. The authors conclude that the topic of Decision Support System Development needs to be reduced so that other topics can emerge.
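
The clustering core of this pipeline (TF-IDF weighting, K-Means with a cosine distance, and choosing k by the Silhouette Coefficient) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the toy documents, the function names, the smoothed IDF formula, and the farthest-point initialisation are all assumptions made for illustration, and the Z-Score, PCA, outlier-removal, LDA, and trend-line steps are omitted.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one L2-normalised TF-IDF vector per tokenised document."""
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))
    # Assumed IDF variant: log(N / df) + 1, so terms found everywhere keep some weight.
    idf = {t: math.log(len(docs) / df[t]) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = [tf[t] / len(d) * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def init_centroids(vectors, k):
    """Deterministic farthest-point initialisation (an assumption, not from the paper)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(
            vectors, key=lambda v: min(cosine_distance(v, c) for c in centroids)))
    return [list(c) for c in centroids]

def kmeans_cosine(vectors, k, max_iter=100):
    """K-Means that assigns each point to the centroid with the smallest cosine distance."""
    centroids = init_centroids(vectors, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(vectors, labels):
    """Mean silhouette coefficient under cosine distance (0 for singleton clusters)."""
    scores = []
    for i, v in enumerate(vectors):
        by_cluster = {}
        for j, w in enumerate(vectors):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(cosine_distance(v, w))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                  # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())    # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical, already preprocessed abstracts (two obvious topics).
docs = [
    ["decision", "support", "system", "ranking"],
    ["decision", "support", "system", "criteria"],
    ["decision", "support", "selection", "system"],
    ["forecasting", "performance", "measurement", "sales"],
    ["forecasting", "performance", "measurement", "audit"],
    ["performance", "forecasting", "demand", "measurement"],
]
vectors = tfidf(docs)
scores = {k: silhouette(vectors, kmeans_cosine(vectors, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On this toy data the two-topic split scores highest, which mirrors how the abstract describes selecting k by comparing Silhouette values across several candidates. A production pipeline would more likely rely on an established library than on hand-written loops like these.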

