Parallel hypernym relation extraction based on partition index dividing

2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE). Pub Date: 2023-01-06. DOI: 10.1109/ICCECE58074.2023.10135486

Juxin Yin, Lei Pan


Abstract

To address the shortcomings of existing pattern-based methods in handling massive data, a parallel hypernym relation extraction method on Spark is proposed. To improve extraction accuracy, an improved credibility algorithm (ppmit) is designed around Spark's RDD programming model to identify inverse hypernym relations. To address the data skew that arises when calculating the credibility value of hypernym relations, a data partitioning strategy, the PID (Partition Index Dividing) algorithm, is proposed: it calculates the partition balance, adds a partition index to the overflow data, and redivides the data so that the partitions stay balanced, thereby reducing computation time. Experiments conducted on the Chinese Wikipedia dataset show that the proposed method preserves extraction accuracy while effectively improving the running efficiency of pattern-based extraction.
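The abstract does not give the details of PID, so the following is only a minimal plain-Python sketch of the general idea it names: measure partition balance, tag overflow records with a new partition index, and redivide. It does not use Spark and is not the authors' implementation. The function names `hash_partition` and `rebalance`, the character-sum hash, and the capacity threshold (the ceiling of the mean partition size) are all assumptions made for illustration.

```python
from collections import defaultdict
from math import ceil


def hash_partition(records, num_partitions):
    """Assign (key, value) records to partitions by a deterministic key hash.

    The character-sum hash is a stand-in chosen for reproducibility;
    it is an assumption, not the partitioner used in the paper.
    """
    partitions = defaultdict(list)
    for key, value in records:
        index = sum(ord(ch) for ch in key) % num_partitions
        partitions[index].append((key, value))
    return [partitions[i] for i in range(num_partitions)]


def rebalance(partitions):
    """Move overflow records into underfull partitions.

    Every record is returned as (partition_index, record), so the tag
    says where the record now lives. The capacity threshold is assumed
    to be the ceiling of the mean partition size.
    """
    total = sum(len(part) for part in partitions)
    capacity = ceil(total / len(partitions))
    balanced = [[] for _ in partitions]
    overflow = []
    for index, part in enumerate(partitions):
        for position, record in enumerate(part):
            if position < capacity:
                balanced[index].append((index, record))
            else:
                overflow.append(record)
    # Fill partitions that still have room, in index order.
    target = 0
    for record in overflow:
        while len(balanced[target]) >= capacity:
            target += 1
        balanced[target].append((target, record))
    return balanced


# Hypothetical skewed input: one hypernym candidate dominates the data.
records = [("animal", word) for word in
           ["dog", "cat", "horse", "cow", "sheep", "goat"]]
records += [("fruit", "apple"), ("tool", "hammer")]

before = hash_partition(records, 4)
after = rebalance(before)
print("sizes before:", [len(part) for part in before])
print("sizes after: ", [len(part) for part in after])
```

Because records that share a key can end up in different partitions after redividing, any per-key statistic such as a credibility value would need a second merge step over the partial results. How the paper handles that step is not stated in the abstract.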


