Zahra Sadat Hosseini Moghadam Emami, Shohreh Tabatabayiseifi, M. Izadi, Mohammad Tavakoli
2021 7th International Conference on Web Research (ICWR). Published 2021-05-19. DOI: 10.1109/ICWR51868.2021.9443108 (https://doi.org/10.1109/ICWR51868.2021.9443108)
Designing a Deep Neural Network Model for Finding Semantic Similarity Between Short Persian Texts Using a Parallel Corpus
Text processing, one of the main topics in artificial intelligence, has received considerable attention in recent decades. Numerous methods and algorithms have been proposed for semantic textual similarity, a sub-branch of text processing. Because of the special features of the Persian language and its non-standard writing system, finding semantic similarity is even more challenging in Persian. Moreover, producing a suitable corpus for training a semantic similarity model is of great importance. The main purpose of this study is to propose a method for measuring the semantic similarity between short Persian texts. To do so, we first build an appropriate corpus and then propose an efficient approach based on neural networks. The proposed method involves three steps. The first step is data collection and construction of a parallel corpus. In the second step, pre-processing, the data is normalized. Finally, semantic similarity recognition is performed by the neural network using vector representations of the words. The proposed model is built upon the produced corpus, which is made of movie and TV show subtitles and contains 35,266 sentence pairs. The F-measure of the proposed approach on PAN2016 is 75.98% with 4 tags and 98.87% with 2 tags. We also achieved an F-measure of 98.86% for our model tested on the parallel corpus with 2 tags.
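The abstract does not say what the normalization in the pre-processing step consists of. As a rough illustration only, the sketch below shows the kind of character-level normalization often applied to Persian text before vectorization: mapping Arabic code points to their Persian counterparts, stripping diacritics and the elongation character, and collapsing whitespace. The function name `normalize_persian` and the specific rules are assumptions for this example, not the authors' pipeline.

```python
import re

# Assumed mappings (not taken from the paper): Arabic code points that
# frequently appear in Persian text, mapped to their Persian equivalents.
_CHAR_MAP = {
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0649": "\u06CC",  # Arabic alef maksura -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
}
# Arabic-Indic digits (U+0660..U+0669) -> Persian digits (U+06F0..U+06F9)
_CHAR_MAP.update({chr(0x0660 + i): chr(0x06F0 + i) for i in range(10)})
_TRANSLATION = str.maketrans(_CHAR_MAP)

# Short-vowel and related marks, fathatan (U+064B) through sukun (U+0652)
_DIACRITICS = re.compile("[\u064B-\u0652]")
_TATWEEL = "\u0640"  # elongation character (kashida)


def normalize_persian(text: str) -> str:
    """Apply a few common, illustrative normalization rules to Persian text."""
    text = text.translate(_TRANSLATION)       # unify letter and digit variants
    text = _DIACRITICS.sub("", text)          # drop diacritics
    text = text.replace(_TATWEEL, "")         # drop elongation
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text


if __name__ == "__main__":
    # Input uses Arabic kaf and yeh plus a doubled space.
    print(normalize_persian("\u0643\u062A\u0627\u0628  \u0639\u0644\u064A"))
```

Unifying letter variants this way means that two spellings of the same word resolve to the same token, and therefore to the same word vector, which matters when the sentence pairs come from informally written sources such as subtitles.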