Yang Yang;Wei Shen;Junfeng Shu;Yinan Liu;Edward Curry;Guoliang Li
{"title":"CMVC+:基于对比学习的开放知识库规范化多视图聚类框架","authors":"Yang Yang;Wei Shen;Junfeng Shu;Yinan Liu;Edward Curry;Guoliang Li","doi":"10.1109/TKDE.2025.3543423","DOIUrl":null,"url":null,"abstract":"Open information extraction (OIE) methods extract plenty of OIE triples <italic><inline-formula><tex-math>$< $</tex-math><alternatives><mml:math><mml:mo><</mml:mo></mml:math><inline-graphic></alternatives></inline-formula>noun phrase, relation phrase, noun phrase<inline-formula><tex-math>$> $</tex-math><alternatives><mml:math><mml:mo>></mml:mo></mml:math><inline-graphic></alternatives></inline-formula></i> from unstructured text, which compose large open knowledge bases (OKBs). Noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. It is found that two views of knowledge (i.e., a fact view based on the fact triple and a context view based on the fact triple's source context) provide complementary information that is vital to the task of OKB canonicalization, which clusters synonymous noun phrases and relation phrases into the same group and assigns them unique identifiers. In order to leverage these two views of knowledge jointly, we propose CMVC+, a novel unsupervised framework for canonicalizing OKBs without the need for manually annotated labels. Specifically, we propose a multi-view CHF K-Means clustering algorithm to mutually reinforce the clustering of view-specific embeddings learned from each view by considering the clustering quality in a fine-grained manner. Furthermore, we propose a novel contrastive learning module to refine the learned view-specific embeddings and further enhance the canonicalization performance. We demonstrate the superiority of our framework through extensive experiments on multiple real-world OKB data sets against state-of-the-art methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2296-2310"},"PeriodicalIF":8.9000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CMVC+: A Multi-View Clustering Framework for Open Knowledge Base Canonicalization Via Contrastive Learning\",\"authors\":\"Yang Yang;Wei Shen;Junfeng Shu;Yinan Liu;Edward Curry;Guoliang Li\",\"doi\":\"10.1109/TKDE.2025.3543423\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Open information extraction (OIE) methods extract plenty of OIE triples <italic><inline-formula><tex-math>$< $</tex-math><alternatives><mml:math><mml:mo><</mml:mo></mml:math><inline-graphic></alternatives></inline-formula>noun phrase, relation phrase, noun phrase<inline-formula><tex-math>$> $</tex-math><alternatives><mml:math><mml:mo>></mml:mo></mml:math><inline-graphic></alternatives></inline-formula></i> from unstructured text, which compose large open knowledge bases (OKBs). Noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. It is found that two views of knowledge (i.e., a fact view based on the fact triple and a context view based on the fact triple's source context) provide complementary information that is vital to the task of OKB canonicalization, which clusters synonymous noun phrases and relation phrases into the same group and assigns them unique identifiers. In order to leverage these two views of knowledge jointly, we propose CMVC+, a novel unsupervised framework for canonicalizing OKBs without the need for manually annotated labels. Specifically, we propose a multi-view CHF K-Means clustering algorithm to mutually reinforce the clustering of view-specific embeddings learned from each view by considering the clustering quality in a fine-grained manner. Furthermore, we propose a novel contrastive learning module to refine the learned view-specific embeddings and further enhance the canonicalization performance. We demonstrate the superiority of our framework through extensive experiments on multiple real-world OKB data sets against state-of-the-art methods.\",\"PeriodicalId\":13496,\"journal\":{\"name\":\"IEEE Transactions on Knowledge and Data Engineering\",\"volume\":\"37 5\",\"pages\":\"2296-2310\"},\"PeriodicalIF\":8.9000,\"publicationDate\":\"2025-02-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Knowledge and Data Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10891880/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Knowledge and Data Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10891880/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
CMVC+: A Multi-View Clustering Framework for Open Knowledge Base Canonicalization Via Contrastive Learning
Open information extraction (OIE) methods extract plenty of OIE triples $< $<noun phrase, relation phrase, noun phrase$> $> from unstructured text, which compose large open knowledge bases (OKBs). Noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. It is found that two views of knowledge (i.e., a fact view based on the fact triple and a context view based on the fact triple's source context) provide complementary information that is vital to the task of OKB canonicalization, which clusters synonymous noun phrases and relation phrases into the same group and assigns them unique identifiers. In order to leverage these two views of knowledge jointly, we propose CMVC+, a novel unsupervised framework for canonicalizing OKBs without the need for manually annotated labels. Specifically, we propose a multi-view CHF K-Means clustering algorithm to mutually reinforce the clustering of view-specific embeddings learned from each view by considering the clustering quality in a fine-grained manner. Furthermore, we propose a novel contrastive learning module to refine the learned view-specific embeddings and further enhance the canonicalization performance. We demonstrate the superiority of our framework through extensive experiments on multiple real-world OKB data sets against state-of-the-art methods.
期刊介绍:
The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools, and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.