A Framework For Scalable Similarity Evaluation in Text Graphs

2021 7th International Conference on Web Research (ICWR). Pub Date: 2021-05-19. DOI: 10.1109/ICWR51868.2021.9443144

Mahdi Samani, Nasser Ghadiri



Abstract


A Framework For Scalable Similarity Evaluation in Text Graphs

Graphs and graph databases are applicable across a wide range of domains, including text mining and web mining. Representing relationships between entities as graphs provides enriched models for emerging tasks in web search and information retrieval. Natural language processing algorithms use graphs to model the structural relationships of texts efficiently, which improves performance. However, improving the accuracy of graph construction and weight allocation remains a fundamental challenge. Existing methods for these tasks offer limited efficiency and do not scale to large graphs. In this study, we propose a novel graph-based method for modeling text and running a query to evaluate the similarity of text segments. In this method, the graph corresponding to the text is first created by modeling words and named entities with a state-of-the-art pre-trained BERT model. Graph nodes are then weighted in two stages. In the first stage, nodes with greater generality obtain higher weights. The second stage uses the graph obtained from the query text: nodes are considered important if they are specifically related to the query text. After the important nodes in the graph are determined, the semantic similarity between the query text and the texts in the database is measured. The whole framework runs as a natural language processing pipeline on the scalable Apache Spark platform. The efficiency of the model was evaluated in both distributed and non-distributed configurations, along with its scalability on a Spark cluster. Evaluation of accuracy using the Pearson correlation coefficient shows that the proposed method outperforms its competitors.
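The abstract gives no formulas, so the following is only a minimal sketch of how query-driven node weighting and Pearson-based evaluation could fit together. It is not the authors' implementation. The function names, the dictionary layout of the nodes, the use of cosine similarity, and the way the two weighting stages are combined are all assumptions made for illustration. Node vectors stand in for BERT embeddings, and `base_weights` stands in for the first-stage generality weights.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def query_weights(doc_nodes, query_nodes):
    """Hypothetical second-stage weights: each document node is scored by its
    best cosine match to any query node, clipped at zero."""
    return {
        name: max(0.0, max((cosine(vec, q) for q in query_nodes.values()), default=0.0))
        for name, vec in doc_nodes.items()
    }


def weighted_similarity(doc_nodes, query_nodes, base_weights):
    """Assumed combination rule: first-stage weight times query relevance,
    normalised by the total first-stage weight."""
    relevance = query_weights(doc_nodes, query_nodes)
    total = sum(base_weights.get(name, 1.0) for name in doc_nodes)
    if not total:
        return 0.0
    score = sum(base_weights.get(name, 1.0) * relevance[name] for name in doc_nodes)
    return score / total


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy two-dimensional vectors in place of real BERT embeddings.
doc = {"graph": [1.0, 0.0], "spark": [0.0, 1.0]}
query = {"graph": [1.0, 0.0]}
stage_one = {"graph": 3.0, "spark": 1.0}  # made-up generality weights

score = weighted_similarity(doc, query, stage_one)
```

In an evaluation like the one the abstract describes, `pearson` would compare the framework's similarity scores against reference similarity judgments for the same text pairs. A distributed version would apply the same per-document scoring across partitions of a Spark dataset.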
