Designing a Deep Neural Network Model for Finding Semantic Similarity Between Short Persian Texts Using a Parallel Corpus

2021 7th International Conference on Web Research (ICWR) | Pub Date: 2021-05-19 | DOI: 10.1109/ICWR51868.2021.9443108

Zahra Sadat Hosseini Moghadam Emami, Shohreh Tabatabayiseifi, M. Izadi, Mohammad Tavakoli


Citations: 1


Designing a Deep Neural Network Model for Finding Semantic Similarity Between Short Persian Texts Using a Parallel Corpus

Text processing, one of the central problems in artificial intelligence, has received considerable attention in recent decades. Numerous methods and algorithms have been proposed for semantic textual similarity, a sub-branch of text processing. Because of the special features of the Persian language and its non-standard writing system, finding semantic similarity is even more challenging in Persian. Producing a suitable corpus for training a semantic similarity model is also of great importance. The main purpose of this study is to propose a method for measuring the semantic similarity between short Persian texts. To do so, we first build an appropriate corpus and then propose an efficient neural-network-based approach. The proposed method involves three steps. The first step is data collection and the construction of a parallel corpus. In the second, pre-processing step, the data is normalized. Finally, semantic similarity recognition is performed by a neural network using vector representations of the words. The model is built on the resulting corpus, which is made up of movie and TV show subtitles and contains 35,266 sentence pairs. The F-measure of the proposed approach on PAN2016 is 75.98% with 4 tags and 98.87% with 2 tags. On the parallel corpus with 2 tags, the model achieved an F-measure of 98.86%.
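The abstract does not say what the normalization in the pre-processing step consists of. As a rough illustration of why Persian's non-standard writing system matters, the Python sketch below unifies a few character variants that often make identical Persian words compare as different strings. The mapping table, the choice to replace the zero-width non-joiner with a space, the conversion of digits to ASCII, and the function name `normalize_persian` are all assumptions made for this example, not the authors' procedure.

```python
import re

# Assumed mapping (illustrative, not taken from the paper): Arabic code points
# that are commonly unified with their Persian counterparts.
CHAR_MAP = {
    0x064A: "\u06CC",  # Arabic yeh          -> Persian yeh
    0x0649: "\u06CC",  # Arabic alef maksura -> Persian yeh
    0x0643: "\u06A9",  # Arabic kaf          -> Persian keheh
}
# Arabic-Indic (U+0660..) and Persian (U+06F0..) digits -> ASCII digits.
for i in range(10):
    CHAR_MAP[0x0660 + i] = str(i)
    CHAR_MAP[0x06F0 + i] = str(i)

# Short-vowel diacritics (U+064B..U+0652) and tatweel (U+0640) are dropped.
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")


def normalize_persian(text: str) -> str:
    """Return a normalized copy of `text` (hypothetical helper)."""
    text = text.translate(CHAR_MAP)        # unify character variants
    text = DIACRITICS.sub("", text)        # strip diacritics and tatweel
    text = text.replace("\u200c", " ")     # assumption: treat ZWNJ as a space
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


if __name__ == "__main__":
    arabic_spelling = "\u0643\u062A\u0627\u0628"   # "book" typed with Arabic kaf
    persian_spelling = "\u06A9\u062A\u0627\u0628"  # "book" typed with Persian keheh
    same = normalize_persian(arabic_spelling) == normalize_persian(persian_spelling)
    print("spellings match after normalization:", same)
```

Subtitle text typed on different keyboards tends to mix these variants, so a step of this kind would keep the two spellings of one word from being given separate vector representations. Which variants to unify is a design choice that depends on the corpus.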
