The relationship of text categorization using Dewey Decimal Classification techniques
Authors: J. Watthananon
Published in: 2014 Twelfth International Conference on ICT and Knowledge Engineering (2014-11-01)
DOI: 10.1109/ICTKE.2014.7001538 (https://doi.org/10.1109/ICTKE.2014.7001538)
Citations: 4
Abstract
Today, the massive volume of data and information (recently termed “Big Data”) causes accessibility and retrieval problems when it is poorly managed. The relational structure of such data is too complicated to explain or analyze with simple or traditional methods, and its diverse formats make uniform display difficult. Bag of Words (BOW), the most widely used data sorting method, is simple but overlooks the significance of synonymy. The objective of this study is to propose a method for categorizing massively scattered data in the form of electronic documents. The Dewey Decimal Classification (DDC) technique is applied to support the linking of related data. DDC was used to process and analyze the data and to display the results as a Mind Map. Accuracy was tested on data from the “Wikipedia Selection for schools”, a subset of Wikipedia, to compare four models: Dewey Decimal Classification (DDC), Support Vector Machine (SVM), K-Means clustering, and hierarchical clustering. The results indicated that DDC yielded the highest accuracy (75.02%), followed by the hierarchical model (74.66%), while K-Means and SVM both yielded similar accuracy (72.66%). K-Means clustering had the shortest processing time of the four models (16.09 seconds).
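The abstract's point about Bag of Words overlooking synonymy can be illustrated with a minimal sketch. It is not the author's method: it only scores a document against the ten standard DDC main classes by keyword overlap, with and without folding synonyms together. The keyword lists, the synonym map, and the function names `tokenize` and `classify` are all hypothetical and invented for illustration.

```python
import re
from collections import Counter

# The ten DDC main classes (standard top-level divisions).
DDC_MAIN_CLASSES = {
    "000": "Computer science, information & general works",
    "100": "Philosophy & psychology",
    "200": "Religion",
    "300": "Social sciences",
    "400": "Language",
    "500": "Science",
    "600": "Technology",
    "700": "Arts & recreation",
    "800": "Literature",
    "900": "History & geography",
}

# Hypothetical keyword lists for a few classes (illustration only).
CLASS_KEYWORDS = {
    "000": {"computer", "software", "data"},
    "500": {"physics", "chemistry", "biology", "planet"},
    "600": {"medicine", "engineering", "farming"},
    "900": {"history", "war", "empire", "geography"},
}

# Hypothetical synonym map: folds variant words onto one keyword.
SYNONYMS = {"pc": "computer", "program": "software", "conflict": "war"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text, use_synonyms=True):
    """Return the DDC main-class code with the most keyword hits, or None."""
    tokens = tokenize(text)
    if use_synonyms:
        tokens = [SYNONYMS.get(t, t) for t in tokens]
    counts = Counter(tokens)  # the bag of words
    scores = {
        code: sum(counts[word] for word in words)
        for code, words in CLASS_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    doc = "A program for the pc"
    print(classify(doc, use_synonyms=False))  # plain BOW finds no keyword
    code = classify(doc, use_synonyms=True)
    print(code, DDC_MAIN_CLASSES[code])
```

With plain token counts, "program" and "pc" match nothing in the keyword lists, so the document stays unclassified. Once synonyms are folded onto shared keywords, the same document lands in class 000.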