Non-standard words as features for text categorization

2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) Pub Date : 2014-08-28 DOI:10.1109/MIPRO.2014.6859744

Slobodan Beliga, Sanda Martinčić-Ipšić


Citations: 4

Abstract


Non-standard words as features for text categorization

This paper presents the categorization of Croatian texts using non-standard words (NSWs) as features. NSWs include numbers, dates, acronyms, abbreviations, currency, and similar tokens. NSWs in Croatian are identified according to the Croatian NSW taxonomy. For this research, 390 text documents were collected to form the SKIPEZ collection, which has 6 classes: official, literary, informative, popular, educational, and scientific. Text categorization experiments were conducted on three representations of the SKIPEZ collection: the first uses NSW frequencies as features; the second uses statistical measures of NSWs (variance, coefficient of variation, standard deviation, etc.); and the third combines the first two feature sets. The Naive Bayes, CN2, C4.5, kNN, Classification Trees, and Random Forest algorithms were used in the experiments. The best results were achieved with the first feature set (NSW frequencies), with a categorization accuracy of 87%. This suggests that NSWs should be considered as features in highly inflectional languages such as Croatian. NSW-based features reduce the dimensionality of the feature space without standard lemmatization procedures, so the bag-of-NSWs representation should be considered in further Croatian text categorization experiments.
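The three representations described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions: the category names and regular expressions in `NSW_PATTERNS` are hypothetical stand-ins rather than the Croatian NSW taxonomy, the function names are invented for illustration, and the statistical measures are computed here over the per-category counts of a single document, which the abstract does not specify.

```python
import re
from statistics import mean, pstdev, pvariance

# Hypothetical NSW categories and patterns, for illustration only.
# They are NOT the Croatian NSW taxonomy used in the paper.
# Order matters: more specific patterns run first so that, for example,
# the digits inside a date are not also counted as plain numbers.
NSW_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}\.\s?\d{1,2}\.\s?\d{4}\.?"),
    "currency": re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:kn|EUR|USD)\b"),
    "acronym": re.compile(r"\b[A-ZČĆĐŠŽ]{2,}\b"),
    "number": re.compile(r"\b\d+(?:[.,]\d+)?\b"),
}


def nsw_frequencies(text):
    """First representation: count of each NSW category in one document."""
    freqs = {}
    remaining = text
    for name, pattern in NSW_PATTERNS.items():
        # Blank out each match so later patterns cannot count it again.
        remaining, n_matches = pattern.subn(" ", remaining)
        freqs[name] = n_matches
    return freqs


def nsw_statistics(freqs):
    """Second representation (assumed form): statistics over category counts."""
    values = list(freqs.values())
    mu = mean(values)
    sd = pstdev(values)
    return {
        "variance": pvariance(values),
        "std_dev": sd,
        # Coefficient of variation; defined as 0.0 when no NSWs are found.
        "coeff_variation": sd / mu if mu else 0.0,
    }


def combined_features(text):
    """Third representation: frequencies and statistics in one feature dict."""
    freqs = nsw_frequencies(text)
    return {**freqs, **nsw_statistics(freqs)}


if __name__ == "__main__":
    sample = ("Dana 28.08.2014. HNB je objavio tečaj: 100 EUR iznosi "
              "762,50 kn, a 3 banke su ga potvrdile.")
    print(combined_features(sample))
```

Each document is reduced to a handful of values regardless of vocabulary size, which is the dimensionality reduction the abstract refers to. The resulting feature dictionaries could then be passed to any of the listed classifiers.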

