Integrating rich document representations for text classification

2016 IEEE Systems and Information Engineering Design Symposium (SIEDS) | Pub Date: 2016-04-29 | DOI: 10.1109/SIEDS.2016.7489319

Suqi Jiang, Jason Lewris, Michael Voltmer, Hongning Wang

Citations: 10

Abstract

This paper addresses the problem of deriving high-quality information from unstructured text by integrating rich document representations to improve machine learning text classification. Previous research has applied Neural Network Language Models (NNLMs) to document classification, and word vector representations have been used to measure semantic relationships in text. However, the two have not previously been combined and shown to improve text classification performance. Our belief is that the inference and clustering abilities of word vectors, coupled with the power of a neural network, can produce more accurate classification predictions. The first phase of our work focused on word vector representations for classification. This approach included analyzing two distinct text sources with pre-labeled binary outcomes, establishing a benchmark metric, and comparing the benchmark against a classifier that uses word vector representations in its feature space. The results showed promise: the word vector approach obtained an area under the curve (AUC) of 0.95, compared with 0.93 for the benchmark. The second phase of the project used an extension of the phase-one neural network model to represent each document in its entirety rather than word by word. Preliminary results indicated a slight improvement over the baseline model of approximately 2-3 percent.
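As a rough illustration of the phase-one idea, the sketch below turns each document into a feature vector by averaging its word vectors and then scores a ranking with the area under the ROC curve. The abstract does not say how word vectors were aggregated or which classifier was used, so the averaging step, the toy two-dimensional vectors in `TOY_VECTORS`, and the use of one vector coordinate as a stand-in score are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List, Sequence

# Hypothetical 2-d word vectors; a real system would learn these from a corpus.
TOY_VECTORS: Dict[str, List[float]] = {
    "great": [0.9, 0.1],
    "good": [0.8, 0.2],
    "bad": [0.1, 0.9],
    "awful": [0.0, 1.0],
    "movie": [0.5, 0.5],
}


def document_vector(tokens: Sequence[str],
                    vectors: Dict[str, List[float]],
                    dim: int = 2) -> List[float]:
    """Average the vectors of known tokens (an assumed aggregation step)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of positive/negative pairs in which the
    positive example receives the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


docs = [["great", "movie"], ["good", "movie"],
        ["bad", "movie"], ["awful", "movie"]]
labels = [1, 1, 0, 0]

# Stand-in for a trained classifier: score by the first averaged coordinate.
scores = [document_vector(d, TOY_VECTORS)[0] for d in docs]
print(auc(labels, scores))
```

In practice the averaged vectors would be passed to a trained classifier, and the AUC would be computed on held-out documents for both the benchmark features and the word vector features so the two can be compared on equal terms.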
