Parallel hypernym relation extraction based on partition index dividing
Juxin Yin, Lei Pan
2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE), published 2023-01-06
DOI: https://doi.org/10.1109/ICCECE58074.2023.10135486
Abstract
To address the limitations of existing pattern-based methods when handling massive data, a parallel hypernym relation extraction method on Spark is proposed. To improve extraction accuracy, an improved credibility algorithm (ppmit), built on Spark's RDD programming model, is designed to identify inverse hypernym relations. To address the data skew that arises when calculating the credibility values of hypernym relations, a data partitioning strategy, the PID (Partition Index Dividing) algorithm, is proposed: it calculates the partition balance, adds a partition index to the overflow data, and redivides the data so that the partitions stay balanced, thereby reducing the calculation time. Experiments on the Chinese Wikipedia dataset show that the proposed method preserves extraction accuracy while effectively improving the running efficiency of pattern-based extraction.
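The abstract names the steps of PID only at a high level, so the following is a minimal sketch of the general idea and not the authors' implementation. It assumes that "partition balance" means an even share of records per partition (total divided by partition count, rounded up) and that "overflow data" means the records beyond that share. It uses plain Python lists in place of Spark RDD partitions. The function names `hash_partition` and `rebalance_with_partition_index` are hypothetical.

```python
import math
import zlib
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Place each (key, value) record by a deterministic hash of its key."""
    partitions = defaultdict(list)
    for key, value in records:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return [partitions[i] for i in range(num_partitions)]


def rebalance_with_partition_index(partitions):
    """Tag overflow records with a target partition index, then redivide."""
    total = sum(len(p) for p in partitions)
    # Assumed balance target: an even share of all records per partition.
    capacity = math.ceil(total / len(partitions))

    # Records up to the balance target stay; the rest are overflow.
    kept = [p[:capacity] for p in partitions]
    overflow = [record for p in partitions for record in p[capacity:]]

    # Attach to each overflow record the index of a partition with spare room.
    room = [capacity - len(p) for p in kept]
    tagged = []
    target = 0
    for key, value in overflow:
        while room[target] == 0:
            target += 1
        tagged.append(((key, target), value))
        room[target] -= 1

    # Redivide: the attached index, not the key hash, picks the destination.
    result = [[((key, i), value) for key, value in p] for i, p in enumerate(kept)]
    for (key, index), value in tagged:
        result[index].append(((key, index), value))
    return result


# One hot key ("animal") skews the initial hash partitioning.
records = [("animal", i) for i in range(9)] + [("fruit", 0), ("tool", 0), ("city", 0)]
parts = hash_partition(records, 4)
balanced = rebalance_with_partition_index(parts)
print([len(p) for p in parts], "->", [len(p) for p in balanced])
```

Because one key can now be spread over several partitions, any per-key aggregation, such as a credibility count, would need a second step that merges the partial results by the original key. That merge is an assumption of this sketch and is not described in the abstract.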