The Analysis of Text Categorization Represented With Word Embeddings Using Homogeneous Classifiers

Z. H. Kilimci, S. Akyokuş
{"title":"基于同构分类器的词嵌入文本分类分析","authors":"Z. H. Kilimci, S. Akyokuş","doi":"10.1109/INISTA.2019.8778329","DOIUrl":null,"url":null,"abstract":"Text data mining is the process of extracting and analyzing valuable information from text. A text data mining process generally consists of lexical and syntax analysis of input text data, the removal of non-informative linguistic features and the representation of text data in appropriate formats, and eventually analysis and interpretation of the output. Text categorization, text clustering, sentiment analysis, and document summarization are some of the important applications of text mining. In this study, we analyze and compare the performance of text categorization by using different single classifiers, an ensemble of classifiers, a neural probabilistic representation model called word2vec on English texts. The neural probabilistic based model namely, word2vec, enables the representation of terms of a text in a new and smaller space with word embedding vectors instead of using original terms. After the representation of text data in new feature space, the training procedure is carried out with the well-known classification algorithms, namely multivariate Bernoulli naïve Bayes, support vector machines and decision trees and an ensemble algorithm such as bagging, random subspace and random forest. A wide range of comparative experiments are conducted on English texts to analyze the effectiveness of word embeddings on text classification. The evaluation of experimental results demonstrates that an ensemble of algorithms models with word embeddings performs better than other classification algorithms that uses traditional methods on English texts.","PeriodicalId":262143,"journal":{"name":"2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"The Analysis of Text Categorization Represented With Word Embeddings Using Homogeneous Classifiers\",\"authors\":\"Z. H. Kilimci, S. Akyokuş\",\"doi\":\"10.1109/INISTA.2019.8778329\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Text data mining is the process of extracting and analyzing valuable information from text. A text data mining process generally consists of lexical and syntax analysis of input text data, the removal of non-informative linguistic features and the representation of text data in appropriate formats, and eventually analysis and interpretation of the output. Text categorization, text clustering, sentiment analysis, and document summarization are some of the important applications of text mining. In this study, we analyze and compare the performance of text categorization by using different single classifiers, an ensemble of classifiers, a neural probabilistic representation model called word2vec on English texts. The neural probabilistic based model namely, word2vec, enables the representation of terms of a text in a new and smaller space with word embedding vectors instead of using original terms. After the representation of text data in new feature space, the training procedure is carried out with the well-known classification algorithms, namely multivariate Bernoulli naïve Bayes, support vector machines and decision trees and an ensemble algorithm such as bagging, random subspace and random forest. 
A wide range of comparative experiments are conducted on English texts to analyze the effectiveness of word embeddings on text classification. The evaluation of experimental results demonstrates that an ensemble of algorithms models with word embeddings performs better than other classification algorithms that uses traditional methods on English texts.\",\"PeriodicalId\":262143,\"journal\":{\"name\":\"2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA)\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-07-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/INISTA.2019.8778329\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INISTA.2019.8778329","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Text data mining is the process of extracting and analyzing valuable information from text. A text data mining process generally consists of lexical and syntactic analysis of the input text, the removal of non-informative linguistic features, the representation of the text in an appropriate format, and finally the analysis and interpretation of the output. Text categorization, text clustering, sentiment analysis, and document summarization are among the important applications of text mining. In this study, we analyze and compare the performance of text categorization on English texts using different single classifiers and ensembles of classifiers together with a neural probabilistic representation model called word2vec. The word2vec model represents the terms of a text in a new, lower-dimensional space with word embedding vectors instead of the original terms. After the text data are represented in this new feature space, training is carried out with well-known classification algorithms, namely multivariate Bernoulli naïve Bayes, support vector machines, and decision trees, and with ensemble algorithms such as bagging, random subspace, and random forest. A wide range of comparative experiments is conducted on English texts to analyze the effectiveness of word embeddings for text classification. The evaluation of the experimental results demonstrates that ensembles of classifiers combined with word embeddings perform better than classification algorithms that use traditional representations on English texts.
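The pipeline the abstract describes can be illustrated with a minimal sketch, not the authors' implementation: documents are mapped into the word2vec embedding space (here by averaging their word vectors, one common choice), and the resulting features are fed to the single classifiers and ensemble methods named above. The toy corpus, labels, and all hyperparameters (vector_size=50, n_estimators, etc.) are hypothetical placeholders, and bagging and the random-subspace method are both expressed through scikit-learn's BaggingClassifier as one possible realization.

import numpy as np
from gensim.models import Word2Vec
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Toy corpus: tokenized English documents with binary labels (hypothetical data).
docs = [
    ["the", "match", "ended", "in", "a", "draw"],
    ["the", "team", "won", "the", "championship"],
    ["stocks", "fell", "after", "the", "earnings", "report"],
    ["the", "market", "rallied", "on", "strong", "growth"],
] * 10
labels = np.array([0, 0, 1, 1] * 10)  # 0 = sports, 1 = finance

# Learn word embeddings: the "new and smaller space" of the abstract.
w2v = Word2Vec(docs, vector_size=50, window=5, min_count=1, seed=42)

def doc_vector(tokens, model):
    """Average the embeddings of the in-vocabulary tokens of one document."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d, w2v) for d in docs])

# Single classifiers and ensembles named in the abstract. BernoulliNB binarizes
# the continuous embedding features at 0; bagging (bootstrap samples) and the
# random-subspace method (feature subsampling without bootstrapping) are both
# built from BaggingClassifier over decision trees.
models = {
    "bernoulli_nb": BernoulliNB(binarize=0.0),
    "svm": SVC(kernel="rbf", C=1.0),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    "random_subspace": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50,
        bootstrap=False, max_features=0.5),
    "random_forest": RandomForestClassifier(n_estimators=100),
}

# Compare the representations/classifiers with cross-validated accuracy.
for name, clf in models.items():
    scores = cross_val_score(clf, X, labels, cv=5, scoring="accuracy")
    print(f"{name:16s} accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")

Averaging word vectors is only one way to turn word embeddings into document features; the paper's exact aggregation, preprocessing, and evaluation protocol may differ from this sketch.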