PKIP: feature selection in text categorization for item banks

Atorn Nuntiyagul, K. Naruedomkul, N. Cercone, Damras Wongsawang
Published in: 17th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'05)
Publication date: 2005-11-14
DOI: 10.1109/ICTAI.2005.95 (https://doi.org/10.1109/ICTAI.2005.95)
Citations: 2

Abstract

We propose an alternative approach to text categorization for item banks. An item bank is a collection of textual data in which each item consists of short sentences and contains only a few words relevant to categorization; some items may belong to several categories. Traditional categorization techniques cannot provide sufficiently accurate results on such data because of this "lack of words" problem. We observe that items in the same category always share the same group of terms (keywords), and that similar locations of these terms within phrases suggest a high probability that the items belong to the same category. Based on this observation, we propose a new methodology, PKIP (patterned keywords in phrase), to improve categorization accuracy and overcome the "lack of words" problem. The k highest-weighted words in each category are selected as keywords, and their patterns are mapped for feature selection. The value of k affects the classification result. The item bank categorization process is based on supervised machine learning. The sample item bank used in this research is a collection of Thai primary school mathematics problems, and we use the SVM implementation in the Weka machine learning software package as our classifier. The results show that our approach produces acceptable classification results, with the best result obtained when k = 12.
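The keyword-selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so the weight used here (term frequency within a category, scaled by how few categories contain the word) is an assumption, as are the function names `top_k_keywords` and `keyword_features` and the toy English items. The sketch also omits the positional pattern mapping that PKIP applies to keywords within phrases; it only shows top-k selection and a simple binary keyword feature vector.

```python
import math
from collections import Counter, defaultdict


def top_k_keywords(items, k):
    """Return {category: list of the k highest-weight words}.

    items is a list of (text, category) pairs. The weight is an
    ASSUMPTION (not the paper's formula): term frequency in the
    category times (log(categories / categories containing word) + 1).
    """
    term_freq = defaultdict(Counter)
    for text, category in items:
        term_freq[category].update(text.lower().split())

    n_categories = len(term_freq)
    category_freq = Counter()  # number of categories containing each word
    for counts in term_freq.values():
        category_freq.update(counts.keys())

    keywords = {}
    for category, counts in term_freq.items():
        weight = {
            word: tf * (math.log(n_categories / category_freq[word]) + 1.0)
            for word, tf in counts.items()
        }
        # Highest weight first; ties are broken alphabetically.
        ranked = sorted(weight, key=lambda w: (-weight[w], w))
        keywords[category] = ranked[:k]
    return keywords


def keyword_features(text, keywords):
    """Binary vector over the sorted union of all category keywords."""
    vocabulary = sorted({w for words in keywords.values() for w in words})
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


# Hypothetical toy item bank, for illustration only.
items = [
    ("find the sum of 3 and 4", "addition"),
    ("what is the sum of 10 and 5", "addition"),
    ("find the area of a square", "geometry"),
    ("what is the area of a circle", "geometry"),
]
keywords = top_k_keywords(items, k=2)
features = keyword_features("the sum of 7 and 2", keywords)
```

With so few words per item, a small k keeps only the most category-specific terms, while a larger k admits words shared across categories; this is one way to see why the choice of k affects the classification result. The resulting vectors could be passed to any supervised classifier, such as the SVM mentioned in the abstract.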