Non-standard words as features for text categorization

Slobodan Beliga, Sanda Martinčić-Ipšić
Published in: 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)
Publication date: 2014-08-28
DOI: 10.1109/MIPRO.2014.6859744 (https://doi.org/10.1109/MIPRO.2014.6859744)
Citations: 4

Abstract

This paper presents the categorization of Croatian texts using Non-Standard Words (NSWs) as features. Non-standard words include numbers, dates, acronyms, abbreviations, currency amounts, etc. NSWs in the Croatian language are determined according to the Croatian NSW taxonomy. For this research, 390 text documents were collected to form the SKIPEZ collection, which has 6 classes: official, literary, informative, popular, educational and scientific. The text categorization experiment was conducted on three different representations of the SKIPEZ collection: in the first representation, the frequencies of NSWs are used as features; in the second, statistical measures of NSWs (variance, coefficient of variation, standard deviation, etc.) are used as features; the third combines the first two feature sets. The Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms were used in the experiments. The best results are achieved with the first feature set (NSW frequencies), with a categorization accuracy of 87%. This suggests that NSWs should be considered as features in highly inflectional languages such as Croatian. NSW-based features reduce the dimensionality of the feature space without standard lemmatization procedures, and therefore the bag-of-NSWs should be considered for further Croatian text categorization experiments.
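The three representations described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the regular expressions are a simplified, hypothetical stand-in for the Croatian NSW taxonomy (which the abstract does not spell out), the function names are invented, and the statistical measures are computed over the per-category counts of a single document, which may differ from how the authors computed them.

```python
import re
import statistics

# Hypothetical, simplified NSW patterns (NOT the paper's Croatian NSW taxonomy).
# Order matters: more specific categories come first so that, for example,
# the digits inside a date are not counted again as plain numbers.
NSW_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}\.\s?\d{1,2}\.\s?\d{4}\.?"),
    "currency": re.compile(r"\b\d+(?:[.,]\d+)*\s?(?:(?:kn|EUR|USD)\b|[€$])"),
    "acronym": re.compile(r"\b[A-ZČĆĐŠŽ]{2,}\b"),
    "abbreviation": re.compile(r"\b(?:npr|itd|tj|dr|prof|sl)\.", re.IGNORECASE),
    "number": re.compile(r"\b\d+(?:[.,]\d+)*\b"),
}


def nsw_frequencies(text):
    """First representation: count of each NSW category in one document."""
    freqs = {}
    remaining = text
    for name, pattern in NSW_PATTERNS.items():
        # subn blanks out each match and reports how many were replaced,
        # so later (more general) patterns cannot match the same span.
        remaining, count = pattern.subn(" ", remaining)
        freqs[name] = count
    return freqs


def nsw_statistics(freqs):
    """Second representation: summary statistics of the category counts.

    Assumption: population statistics over the per-category counts.
    """
    values = list(freqs.values())
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)
    return {
        "variance": statistics.pvariance(values),
        "std_dev": std_dev,
        # Coefficient of variation is undefined for a zero mean; use 0.0 there.
        "coeff_variation": std_dev / mean if mean else 0.0,
    }


def combined_features(text):
    """Third representation: frequencies and statistics in one feature dict."""
    freqs = nsw_frequencies(text)
    return {**freqs, **nsw_statistics(freqs)}


if __name__ == "__main__":
    sample = "Dr. Ivić je 28. 8. 2014. platio 1.250,50 kn za HAKOM, npr. 3 puta."
    print(combined_features(sample))
```

Each document yields a short, fixed-length feature vector, which is the dimensionality reduction the abstract refers to: no vocabulary and no lemmatization are needed. Such vectors could then be passed to any of the classifiers the abstract lists.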