Forming Dataset of The Undergraduate Thesis using Simple Clustering Methods

Tio Dharmawan, Chinta 'Aliyyah Candramaya, Vandha Pradwiyasma Widharta
International Journal of Innovation in Enterprise System, published 2023-01-31
DOI: 10.25124/ijies.v7i01.187 (https://doi.org/10.25124/ijies.v7i01.187)
Citations: 0

Abstract

Universities collect large numbers of undergraduate theses but have yet to process this data in a way that helps students find relevant references. This study groups thesis documents with simple clustering methods and compares the result with a grouping produced by experts. The experts built the ground truth using OR Boolean retrieval and keyword generation. Experiments with the K-Means, K-Medoids, and DBSCAN clustering methods, using the Euclidean, Manhattan, City Block, and Cosine Similarity metrics, were run to find the best clustering. The clustering with the best Silhouette Score was then compared with the ground truth to measure how accurately each document was categorized. K-Means with the Cosine Similarity metric gave the best result, with a Silhouette Score of 0.105534. Comparing this best clustering with the ground truth gives an accuracy of 33.42%. The results indicate that simple clustering methods cannot handle data with negative skewness and leptokurtic kurtosis.
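The abstract ranks clusterings by Silhouette Score under a cosine metric. As a rough illustration of what that score measures, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names `cosine_distance` and `silhouette_score` and the toy term-count vectors are assumptions made up for this example, and the paper's actual features and tooling are not given in the abstract.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette_score(points, labels, distance=cosine_distance):
    """Mean silhouette coefficient over all points.

    For point i: a = mean distance to the other members of its own cluster,
    b = smallest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters contribute 0.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: silhouette defined as 0
        a = sum(distance(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


# Toy term-count vectors, invented for illustration (not the paper's data).
docs = [
    [3.0, 0.0, 1.0],
    [2.0, 0.0, 1.0],
    [0.0, 4.0, 1.0],
    [0.0, 3.0, 2.0],
]
good = silhouette_score(docs, [0, 0, 1, 1])  # similar documents grouped together
bad = silhouette_score(docs, [0, 1, 0, 1])   # similar documents split apart
print(f"well-separated: {good:.3f}, mixed: {bad:.3f}")
```

Scores range from -1 to 1, with values near 1 indicating compact, well-separated clusters. Against that scale, the reported best value of 0.105534 suggests heavily overlapping clusters, which is consistent with the low agreement with the expert ground truth.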