PKIP: feature selection in text categorization for item banks

17th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'05). Pub Date: 2005-11-14. DOI: 10.1109/ICTAI.2005.95

Atorn Nuntiyagul, K. Naruedomkul, N. Cercone, Damras Wongsawang


Citation count: 2

Abstract

We propose an alternative approach to text categorization for item banks. An item bank is a collection of textual data in which each item consists of short sentences and contains only a few words relevant to categorization; some items may belong to several categories. Traditional categorization techniques cannot provide sufficiently accurate results on such data because of this "lack of words" problem. We observe that items in the same category always share the same group of terms (keywords), and that similar locations of these terms within phrases suggest a high probability that the items belong to the same category. Based on this observation, we propose a new methodology, PKIP (patterned keywords in phrase), to improve categorization accuracy and to overcome the "lack of words" problem. The k highest-weighted words in each category are selected as keywords, and their patterns are mapped for feature selection. The value of k affects the classification result. The item bank categorization process is based on a supervised machine learning technique. The sample item bank used in this research is a collection of Thai primary-school mathematics problems, and we use the SVM implementation in the Weka machine learning software package as our classifier. The results show that our approach produces acceptable classification results, and that the best classification result is obtained when k = 12.
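
The keyword-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how word weights are computed or how patterns are encoded, so the sketch assumes raw within-category term frequency as the weight and represents a pattern as the ordered sequence of keywords found in an item. The function names (top_k_keywords, keyword_pattern) and the English sample items are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def top_k_keywords(items, k):
    """Pick the k highest-weighted words per category.

    items: list of (text, category) pairs.
    ASSUMPTION: weight = raw term frequency within the category;
    the paper's actual weighting scheme is not given in the abstract.
    """
    counts = defaultdict(Counter)
    for text, category in items:
        counts[category].update(text.lower().split())

    keywords = {}
    for category, counter in counts.items():
        # Sort by descending count, then alphabetically so ties are deterministic.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        keywords[category] = [word for word, _ in ranked[:k]]
    return keywords


def keyword_pattern(text, keywords):
    """Return the category keywords in the order they occur in the item.

    ASSUMPTION: this ordered tuple is an illustrative stand-in for the
    paper's "pattern" of keyword locations within a phrase.
    """
    vocab = set(keywords)
    return tuple(tok for tok in text.lower().split() if tok in vocab)


# Hypothetical toy item bank (the paper uses Thai mathematics problems).
items = [
    ("add 3 apples and 4 apples", "addition"),
    ("add 2 pens and 5 pens", "addition"),
    ("subtract 2 from 9", "subtraction"),
    ("subtract 4 from 7", "subtraction"),
]

keywords = top_k_keywords(items, k=2)
pattern = keyword_pattern("subtract 5 from 8", keywords["subtraction"])
```

In a full pipeline, one pattern per category would be encoded as a feature for each item and passed to a classifier such as an SVM; the parameter k plays the role of the k in the abstract, for which the authors report the best result at k = 12.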

