HDSKG: Harvesting domain specific knowledge graph from content of webpages

2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER) Pub Date: 2017-02-01 DOI: 10.1109/SANER.2017.7884609

Xuejiao Zhao, Zhenchang Xing, M. A. Kabir, Naoya Sawada, J. Li, Shang-Wei Lin

Pages: 56-67

Citations: 49

Abstract

Knowledge graphs are useful in many applications, such as search result ranking, recommendation, and exploratory search. A knowledge graph integrates structural information about concepts from multiple information sources and links these concepts together. Extracting domain-specific relation triples (subject, verb phrase, object) is one of the key techniques for constructing a domain-specific knowledge graph. This research proposes HDSKG, an automatic method for discovering domain-specific concepts and their relation triples from the content of webpages. We combine a dependency parser with a rule-based method to chunk candidate relation triples, then extract advanced features of these candidates and use a machine learning algorithm to estimate their domain relevance. To evaluate the method, we apply HDSKG to Stack Overflow (a Q&A website about computer programming). As a result, we construct a knowledge graph of the software engineering domain with 35279 relation triples, 44800 concepts, and 9660 unique verb phrases. The experimental results show that both the precision and recall of HDSKG (0.78 and 0.7, respectively) are higher than those of openIE (0.11 and 0.6, respectively). The performance is particularly good on complex sentences. Furthermore, thanks to the self-training technique used in the classifier, HDSKG can easily be applied to other domains with less training data.
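To make the reported evaluation numbers easier to compare, the sketch below shows how precision and recall over relation triples are conventionally computed, and derives an F1 score from the figures quoted in the abstract. The `Triple` type, the helper functions, and the toy triples are hypothetical illustrations and are not taken from the paper; the F1 values are simple arithmetic on the reported precision and recall, not results the authors state.

```python
from typing import NamedTuple, Set, Tuple


class Triple(NamedTuple):
    """A (subject, verb phrase, object) relation triple (hypothetical type)."""
    subject: str
    verb_phrase: str
    obj: str


def precision_recall(extracted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float]:
    """Standard set-based precision and recall of extracted triples."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy, made-up triples for illustration only (not data from the paper).
gold = {
    Triple("python", "is", "programming language"),
    Triple("numpy", "is library for", "python"),
    Triple("git", "tracks", "source code"),
}
extracted = {
    Triple("python", "is", "programming language"),
    Triple("git", "tracks", "source code"),
    Triple("python", "has", "answer"),  # not in the gold set
}
toy_precision, toy_recall = precision_recall(extracted, gold)

# F1 derived from the precision/recall pairs quoted in the abstract.
hdskg_f1 = f1(0.78, 0.70)
openie_f1 = f1(0.11, 0.60)

print(f"toy precision={toy_precision:.2f}, recall={toy_recall:.2f}")
print(f"HDSKG F1={hdskg_f1:.2f}, openIE F1={openie_f1:.2f}")
```

Under these definitions, the recall gap between the two systems is modest while the precision gap is large, which is what drives the difference in the derived F1 values.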


Source publication

2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER)

Self-citation rate

0.00%

Number of publications