A new unsupervised Algorithm for extracting relationship words between two entities

2021 3rd International Conference on Advances in Computer Technology, Information Science and Communication (CTISC) Pub Date : 2021-04-01 DOI:10.1109/CTISC52352.2021.00037

Fan Wu, Taihao Zheng, L. Yao, Honghai Feng



Abstract

Purpose: To use a popular supervised learning algorithm such as BERT to extract relationships between concepts (triple relation extraction), the relation types must be labeled manually. If some relation words are not labeled in the training stage, they will probably not be recognized in the test stage, and the corresponding entities will not be recognized either. This paper proposes a new unsupervised algorithm that extracts as many relation words between two entities as possible, especially those that are easily overlooked. Methods: The disease-cause relationship was taken as an example, and 10,204 valid sentences containing diseases and their corresponding causes were collected by a web crawler. Relation words were extracted in an unsupervised manner under syntactic, semantic, and lexical constraints, and the automatically extracted results were summarized. Results: Some specific relation words that are ignored in the manual labeling stage were found; conjoined relation words that often appear together in the texts were recognized; and some types and features of relation words were obtained. These types and features can support relation labeling in the supervised learning stage, help expand the relevant knowledge graphs, and improve the accuracy of information retrieval.
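The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that candidate relation words are the tokens lying between the two entity mentions in a sentence, and it stands in for the lexical constraint with a small stop-word list. The function names, the stop-word list, and the English example sentences are all hypothetical and invented for illustration; the syntactic and semantic constraints mentioned in the abstract are not modeled.

```python
import re
from collections import Counter

# Hypothetical stop-word list standing in for a lexical constraint.
STOP_WORDS = {"the", "a", "an", "of", "is", "are", "be", "by", "to", "in", "and"}


def candidate_relation_words(sentence, entity_a, entity_b):
    """Return lowercased tokens between the two entity mentions,
    with stop words removed. Returns [] if either entity is missing."""
    text = sentence.lower()
    a, b = entity_a.lower(), entity_b.lower()
    ia, ib = text.find(a), text.find(b)
    if ia < 0 or ib < 0:
        return []
    # Take the text between whichever entity comes first and the other one.
    if ia < ib:
        span = text[ia + len(a):ib]
    else:
        span = text[ib + len(b):ia]
    tokens = re.findall(r"[a-z]+", span)
    return [t for t in tokens if t not in STOP_WORDS]


def count_relation_words(examples):
    """Aggregate candidate relation words over (sentence, entity_a, entity_b) triples."""
    counts = Counter()
    for sentence, entity_a, entity_b in examples:
        counts.update(candidate_relation_words(sentence, entity_a, entity_b))
    return counts


# Invented toy data; the paper's corpus is not reproduced here.
examples = [
    ("Lung cancer is often caused by smoking.", "lung cancer", "smoking"),
    ("Smoking can induce lung cancer.", "smoking", "lung cancer"),
    ("Malaria is caused by Plasmodium parasites.", "malaria", "Plasmodium parasites"),
]
counts = count_relation_words(examples)
print(counts.most_common())
```

Ranking candidates by frequency across many sentences is one simple way to surface relation words that a manual labeling scheme might have missed. A realistic version would replace the stop-word filter with part-of-speech or dependency-based constraints.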


