Document Classification through Building Specified N-Gram

Byeongkyu Ko, Dongjin Choi, Chang Choi, Junho Choi, Pankoo Kim
Published in: 2012 Sixth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing
Publication date: 2012-07-04
DOI: 10.1109/IMIS.2012.142 (https://doi.org/10.1109/IMIS.2012.142)
Citations: 2

Abstract

This paper proposes a method for classifying textual documents using a specified n-gram data set. Web documents hold great potential, and the amount of valuable information on the web has grown consistently over the years. Because the web is so large, however, finding the documents that match what a user wants has become more difficult. Many approaches have been suggested to overcome this obstacle, and the most important task among them is classifying textual documents into predefined categories. Although many statistical approaches have been introduced over the years, no perfect solution has been found yet. In this paper, we suggest a method for textual document classification using an n-gram model. N-gram frequency data has great potential for finding similarities between documents, so we construct our own n-gram data sets from research papers. When an unknown document enters the system, the system extracts n-grams from it. These n-grams are then compared with the n-grams in the previously built data sets using the proposed similarity measure. The precision of this method reaches 86%.
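
The pipeline described in the abstract (build per-category n-gram data sets, extract n-grams from an unknown document, compare the two) can be illustrated with a minimal Python sketch. The abstract does not define the proposed similarity measure, so cosine similarity over n-gram frequencies is used here purely as a stand-in. The word-level bigrams, the function names, and the toy training data are also assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
import math
import re


def extract_ngrams(text, n=2):
    """Return a Counter of word-level n-grams (as tuples) found in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram frequency Counters.

    Stand-in only: the paper's own similarity measure is not given in the abstract.
    """
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_category_profiles(labelled_docs, n=2):
    """Merge the n-gram counts of all training documents in each category."""
    profiles = {}
    for category, text in labelled_docs:
        profiles.setdefault(category, Counter()).update(extract_ngrams(text, n))
    return profiles


def classify(text, profiles, n=2):
    """Score an unknown document against every category profile."""
    grams = extract_ngrams(text, n)
    scores = {cat: cosine_similarity(grams, prof) for cat, prof in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy training data; the paper builds its data sets from research papers.
training = [
    ("networking", "wireless sensor network routing protocol for mobile nodes"),
    ("networking", "routing protocol design for wireless sensor network"),
    ("nlp", "document classification using n gram language model"),
    ("nlp", "language model for text document classification"),
]
profiles = build_category_profiles(training)
label, scores = classify("a new language model for document classification", profiles)
print(label, scores)
```

In this toy run the unknown document shares bigrams such as "language model" and "document classification" only with the hypothetical "nlp" category, so that category receives the highest score. Changing `n` or swapping in another similarity function changes the ranking behaviour but not the overall structure.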