Distributional Representations of Words for Short Text Classification

VS@HLT-NAACL  Pub Date: 2015-06-01  DOI: 10.3115/v1/W15-1505
Chenglong Ma, Weiqun Xu, Peijia Li, Yonghong Yan
{"title":"用于短文本分类的词的分布表示","authors":"Chenglong Ma, Weiqun Xu, Peijia Li, Yonghong Yan","doi":"10.3115/v1/W15-1505","DOIUrl":null,"url":null,"abstract":"Traditional supervised learning approaches to common NLP tasks depend heavily on manual annotation, which is labor intensive and time consuming, and often suffer from data sparseness. In this paper we show how to mitigate the problems in short text classification (STC) through word embeddings ‐ distributional representations of words learned from large unlabeled data. The word embeddings are trained from the entire English Wikipedia text. We assume that a short text document is a specific sample of one distribution in a Bayesian framework. A Gaussian process approach is used to model the distribution of words. The task of classification becomes a simple problem of selecting the most probable Gaussian distribution. This approach is compared with those based on the classical maximum entropy (MaxEnt) model and the Latent Dirichlet Allocation (LDA) approach. Our approach achieved better performance and also showed advantages in dealing with unseen words.","PeriodicalId":299646,"journal":{"name":"VS@HLT-NAACL","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"33","resultStr":"{\"title\":\"Distributional Representations of Words for Short Text Classification\",\"authors\":\"Chenglong Ma, Weiqun Xu, Peijia Li, Yonghong Yan\",\"doi\":\"10.3115/v1/W15-1505\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Traditional supervised learning approaches to common NLP tasks depend heavily on manual annotation, which is labor intensive and time consuming, and often suffer from data sparseness. In this paper we show how to mitigate the problems in short text classification (STC) through word embeddings ‐ distributional representations of words learned from large unlabeled data. The word embeddings are trained from the entire English Wikipedia text. We assume that a short text document is a specific sample of one distribution in a Bayesian framework. A Gaussian process approach is used to model the distribution of words. The task of classification becomes a simple problem of selecting the most probable Gaussian distribution. This approach is compared with those based on the classical maximum entropy (MaxEnt) model and the Latent Dirichlet Allocation (LDA) approach. 
Our approach achieved better performance and also showed advantages in dealing with unseen words.\",\"PeriodicalId\":299646,\"journal\":{\"name\":\"VS@HLT-NAACL\",\"volume\":\"32 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"33\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"VS@HLT-NAACL\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3115/v1/W15-1505\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"VS@HLT-NAACL","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3115/v1/W15-1505","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 33

Abstract

Traditional supervised learning approaches to common NLP tasks depend heavily on manual annotation, which is labor intensive and time consuming, and often suffer from data sparseness. In this paper we show how to mitigate the problems in short text classification (STC) through word embeddings ‐ distributional representations of words learned from large unlabeled data. The word embeddings are trained from the entire English Wikipedia text. We assume that a short text document is a specific sample of one distribution in a Bayesian framework. A Gaussian process approach is used to model the distribution of words. The task of classification becomes a simple problem of selecting the most probable Gaussian distribution. This approach is compared with those based on the classical maximum entropy (MaxEnt) model and the Latent Dirichlet Allocation (LDA) approach. Our approach achieved better performance and also showed advantages in dealing with unseen words.
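The abstract only outlines the classification idea, so the following is a minimal, hypothetical sketch of treating each class as a probability distribution over word embeddings and classifying a short text by the most probable distribution. It assumes pre-trained embeddings (e.g., learned from Wikipedia text) are available as a dict from word to vector, and it substitutes a simple per-class diagonal-covariance Gaussian for the Gaussian process model the paper actually uses; the names `doc_vectors`, `fit_class_gaussians`, and `classify` are illustrative, not from the paper.

```python
# Hypothetical sketch: per-class Gaussians over word embeddings for short text
# classification. This is a simplification of the paper's Gaussian process approach.
import numpy as np
from scipy.stats import multivariate_normal


def doc_vectors(doc, embeddings):
    """Return the embedding vectors of a document's in-vocabulary words."""
    return np.array([embeddings[w] for w in doc if w in embeddings])


def fit_class_gaussians(train_docs, embeddings, eps=1e-3):
    """Fit one Gaussian per class to the word embeddings of its training documents.

    train_docs: dict mapping class label -> list of tokenized documents.
    Assumes every training document contains at least one in-vocabulary word.
    A diagonal covariance plus a small ridge keeps the fit well conditioned.
    """
    gaussians = {}
    for label, docs in train_docs.items():
        vecs = np.vstack([doc_vectors(d, embeddings) for d in docs])
        mean = vecs.mean(axis=0)
        var = vecs.var(axis=0) + eps  # diagonal covariance with ridge
        gaussians[label] = multivariate_normal(mean=mean, cov=np.diag(var))
    return gaussians


def classify(doc, gaussians, embeddings):
    """Pick the class whose Gaussian gives the highest total log-likelihood
    to the document's word vectors (out-of-vocabulary words are skipped)."""
    vecs = doc_vectors(doc, embeddings)
    scores = {label: g.logpdf(vecs).sum() for label, g in gaussians.items()}
    return max(scores, key=scores.get)
```

Because a new document is scored by the log-density of its word vectors under each class distribution, words that never occurred in the labeled training data can still contribute as long as they have an embedding, which is consistent with the advantage on unseen words reported in the abstract.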