Forming Dataset of The Undergraduate Thesis using Simple Clustering Methods

International Journal of Innovation in Enterprise System. Pub Date: 2023-01-31. DOI: 10.25124/ijies.v7i01.187

Tio Dharmawan, Chinta 'Aliyyah Candramaya, Vandha Pradwiyasma Widharta


Citations: 0

Abstract

Universities collect large amounts of undergraduate thesis data but have yet to process it in a way that makes it easier for students to find the references they need. This study groups thesis documents and compares the grouping produced by experts with the grouping produced by simple clustering methods. Experts established the ground truth using OR Boolean retrieval and keyword generation. The best clustering was identified through experiments with the K-Means, K-Medoids, and DBSCAN clustering methods and the Euclidean, Manhattan, City Block, and Cosine Similarity metrics. The clustering with the best Silhouette Score was then compared against the ground truth to measure how accurately each document was categorized. The K-Means clustering method with the Cosine Similarity metric gave the best results, with a Silhouette Score of 0.105534. The comparison between the ground truth and the best clustering shows an accuracy of 33.42%. The results show that simple clustering methods cannot handle data with negative skewness and leptokurtic kurtosis.
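The two evaluation steps named in the abstract, scoring a clustering with the Silhouette Score under a cosine metric and then comparing it with expert labels, can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how clusters were matched to ground-truth categories, so the majority-vote mapping below is an assumption, and the toy vectors, category names, and function names are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def silhouette_score(points, labels, dist=cosine_distance):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def majority_vote_accuracy(cluster_labels, truth_labels):
    """Assumed matching rule: each cluster takes its most common true category."""
    by_cluster = {}
    for c, t in zip(cluster_labels, truth_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(truth_labels)


# Hypothetical toy data: four "documents" as 2-D term-weight vectors.
docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
clusters = [0, 0, 1, 1]                                       # e.g. K-Means output
truth = ["networks", "networks", "databases", "networks"]     # expert labels

score = silhouette_score(docs, clusters)
accuracy = majority_vote_accuracy(clusters, truth)
print(f"silhouette={score:.4f} accuracy={accuracy:.2%}")
```

A Silhouette Score near 1 indicates compact, well-separated clusters, while a value near 0, such as the 0.105534 reported above, indicates heavily overlapping clusters. Note also that Manhattan and City Block are two names for the same L1 distance.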
