Effects of Positivization on the Paragraph Vector Model

2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA). Pub Date: 2019-07-03. DOI: 10.1109/INISTA.2019.8778304

Aydın Gerek, Mehmet Can Yüney, Erencan Erkaya, M. Ganiz


Citations: 0

Abstract


Natural language processing (NLP) is an important field of artificial intelligence. One of the fundamental problems in NLP is to create vector (distributed) representations of words so that the vectors of words with similar meanings lie close together in space. Among the most popular approaches for creating these representations are word embedding models such as word2vec and fastText. Similarly, the paragraph vector model (doc2vec) is used to create distributed representations of documents while simultaneously creating distributed representations of the words in those documents. These models create dense, low-dimensional (usually in the low hundreds of dimensions) vector representations, which may include negative values. In this study we focus on these negative values and introduce a family of regularization methods in which the document, word, and/or context vectors of the paragraph vector model are forced to have only positive components. We measure the effects of these methods on several tasks: text classification, semantic similarity, and analogy. Although positivization greatly increases the sparsity of the word embeddings, and would be expected to result in a loss of information, our results show almost no reduction in the performance of the regularized embeddings on these tasks. We also observe an increase in classification accuracy in one case. We foresee that these approaches can be beneficial in machine learning systems that require non-negative vectors.
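The abstract does not say how the positivity constraint is enforced, so the following is only a minimal sketch under a stated assumption: negative components are clipped to zero, which is the projection step commonly used in projected gradient methods. The function names (`positivize`, `sparsity`, `cosine`) and the random toy vectors are hypothetical and are not taken from the paper. The clipping here is applied once to fixed vectors, whereas the paper describes regularization of the model itself. The sketch only illustrates why such a constraint tends to increase sparsity while still allowing similarity comparisons.

```python
import math
import random


def positivize(vector):
    """Clip negative components to zero (assumed projection; not the paper's stated method)."""
    return [max(0.0, x) for x in vector]


def sparsity(vectors):
    """Fraction of components that are exactly zero across all vectors."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0.0)
    return zeros / total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy stand-in for trained embeddings: 5 random 8-dimensional vectors.
rng = random.Random(0)
embeddings = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(5)]
clipped = [positivize(v) for v in embeddings]

print("sparsity before:", sparsity(embeddings))
print("sparsity after: ", sparsity(clipped))
print("cosine(0, 1) after clipping:", cosine(clipped[0], clipped[1]))
```

Because every clipped component is non-negative, the cosine similarity between any two clipped vectors falls in [0, 1] rather than [-1, 1], which is one reason a downstream system that requires non-negative inputs could consume such vectors directly.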
