Unsupervised Topic Model Based Text Network Construction for Learning Word Embeddings

2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA). Pub Date: 2019-12-01. DOI: 10.1109/ICMLA.2019.00032

S. Chung, Michael D'Arcy

Abstract

Distributed word embeddings have proven remarkably effective at capturing word-level semantic and syntactic regularities in language for many natural language processing tasks. One recently proposed semi-supervised representation learning method, Predictive Text Embedding (PTE), utilizes both semantically labeled and unlabeled data in information networks to learn text embeddings, and produces state-of-the-art performance compared to other embedding methods. However, PTE uses supervised label information to construct one of the networks, and many other possible ways of constructing such information networks are left untested. We present two unsupervised methods for constructing a large-scale semantic information network from documents using topic models, which have emerged as a powerful technique for finding useful structure in an unstructured text collection by learning distributions over words. The first method uses Latent Dirichlet Allocation (LDA) to build a topic model over text, and constructs a word-topic network with edge weights proportional to the word-topic probability distributions. The second method trains an unsupervised neural network to learn the word-document distribution, with a single hidden layer representing a topic distribution. The two weight matrices of the neural network are directly reinterpreted as the edge weights of heterogeneous text networks. These networks can be used to train word embeddings that form an effective low-dimensional representation preserving the semantic closeness of words and documents for NLP tasks. We conduct extensive experiments to evaluate the effectiveness of our methods.
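As a rough illustration of the first method, the sketch below turns a topic-by-word probability matrix into weighted word-topic edges. It is not the authors' implementation: the matrix values are made up, the function names and the `min_weight` pruning threshold are hypothetical, and the abstract does not say how (or whether) low-probability edges are pruned. It only assumes that an LDA model has already produced p(word | topic) estimates.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (word node, topic node, edge weight)


def build_word_topic_edges(
    topic_word: List[List[float]],
    vocab: List[str],
    min_weight: float = 0.0,
) -> List[Edge]:
    """Turn a topic-by-word probability matrix into weighted bipartite edges.

    topic_word[k][v] is assumed to hold p(word v | topic k), as an LDA
    implementation would estimate it. min_weight is a hypothetical pruning
    threshold; the abstract does not specify one.
    """
    edges: List[Edge] = []
    for k, row in enumerate(topic_word):
        if len(row) != len(vocab):
            raise ValueError("each topic row needs one entry per vocabulary word")
        for v, prob in enumerate(row):
            if prob > min_weight:
                # Edge weight is taken directly from the word-topic probability.
                edges.append((vocab[v], f"topic_{k}", prob))
    return edges


def weighted_degree(edges: List[Edge]) -> Dict[str, float]:
    """Sum the edge weights attached to each word node."""
    degree: Dict[str, float] = {}
    for word, _topic, weight in edges:
        degree[word] = degree.get(word, 0.0) + weight
    return degree


# Toy, made-up distributions over a four-word vocabulary and two topics.
vocab = ["network", "graph", "topic", "word"]
topic_word = [
    [0.50, 0.40, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
]
edges = build_word_topic_edges(topic_word, vocab, min_weight=0.1)
for word, topic, weight in edges:
    print(f"{word} -- {topic}  (weight {weight})")
```

The resulting edge list could then be handed to a network embedding procedure of the kind PTE uses. The second method would, under the same reading, fill two such edge lists from the input-to-hidden and hidden-to-output weight matrices of the neural network instead of from LDA.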
