A simple and efficient approach to unsupervised instance matching and its application to linked data of power plants

IF 2.1 | Region 3 (Computer Science) | JCR Q3 (Computer Science, Artificial Intelligence)

Journal of Web Semantics, vol. 80, Article 100815 | Pub Date: 2024-02-20 | DOI: 10.1016/j.websem.2024.100815

Andreas Eibeck, Shaocong Zhang, Mei Qi Lim, Markus Kraft

Citations: 0

Abstract

A simple and efficient approach to unsupervised instance matching and its application to linked data of power plants

Knowledge graphs store and link semantically annotated data about real-world entities from a variety of domains and on a large scale. The World Avatar is based on a dynamic decentralised knowledge graph and on semantic technologies to realise complex cross-domain scenarios. Accurate computational results for such scenarios require the availability of complete, high-quality data. This work focuses on instance matching, one of the subtasks of automatically populating the knowledge graph with data from a wide spectrum of external sources. Instance matching compares two data sets and seeks to identify instances (data, records) referring to the same real-world entity. We introduce AutoCal, a new instance matcher which does not require labelled data and runs out of the box for a wide range of domains without tuning method-specific parameters. AutoCal achieves results competitive with recently proposed unsupervised matchers from the field of Machine Learning. We also select an unsupervised state-of-the-art matcher from the field of Deep Learning for a thorough comparison. Our results show that neither AutoCal nor the state-of-the-art matcher is superior regarding matching quality, while AutoCal has only moderate hardware requirements and runs 2.7 to 60 times faster. In summary, AutoCal is particularly well suited to use in an automated environment. We present its prototypical integration into the World Avatar and apply AutoCal to the domain of power plants, which is relevant for practical environmental scenarios of the World Avatar.
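
To make the task of instance matching concrete, the following is a minimal sketch of a generic matcher that compares records from two data sets by name. It is not AutoCal and does not reproduce the paper's method, which the abstract does not describe. The token-based Jaccard similarity, the function names, the fixed `threshold`, and the sample records are all assumptions chosen for illustration. The hand-set threshold is the kind of method-specific parameter that AutoCal is reported to work without.

```python
from typing import Dict, List, Tuple


def tokens(name: str) -> set:
    """Lower-case a name and split it into a set of word tokens."""
    return set(name.lower().replace("-", " ").split())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_instances(
    left: Dict[str, str],
    right: Dict[str, str],
    threshold: float = 0.5,  # assumed cut-off, set by hand for this sketch
) -> List[Tuple[str, str, float]]:
    """Pair each left record with its best-scoring right record.

    A pair is kept only if its similarity reaches the threshold.
    """
    matches = []
    for left_id, left_name in left.items():
        best_id, best_score = None, 0.0
        for right_id, right_name in right.items():
            score = jaccard(tokens(left_name), tokens(right_name))
            if score > best_score:
                best_id, best_score = right_id, score
        if best_id is not None and best_score >= threshold:
            matches.append((left_id, best_id, best_score))
    return matches


# Hypothetical records, invented for illustration only.
dataset_a = {"a1": "Riverside Power Station", "a2": "Northfield Wind Farm"}
dataset_b = {"b1": "riverside power station unit", "b2": "Lakeview Solar Park"}

result = match_instances(dataset_a, dataset_b)
print(result)
```

Here the record `a1` is paired with `b1` because three of their four distinct tokens coincide, while `a2` shares no tokens with any record in the second data set and stays unmatched. A realistic matcher would compare several properties (for power plants, perhaps name, location, and capacity) and would need a blocking step to avoid comparing every pair.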

Source journal

Journal of Web Semantics (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 6.20

Self-citation rate: 12.00%

Review time: 14.6 weeks

About the journal: The Journal of Web Semantics is an interdisciplinary journal based on research and applications of various subject areas that contribute to the development of a knowledge-intensive and intelligent service Web. These areas include knowledge technologies, ontology, agents, databases and the semantic grid; disciplines such as information retrieval, language technology, human-computer interaction and knowledge discovery are of major relevance as well. All aspects of Semantic Web development are covered. The publication of large-scale experiments and their analysis is also encouraged, to clearly illustrate scenarios and methods that introduce semantics into existing Web interfaces, contents and services. The journal emphasizes the publication of papers that combine theories, methods and experiments from different subject areas in order to deliver innovative semantic methods and applications.