Generating Schema Labels through Dataset Content Analysis

Companion Proceedings of the The Web Conference 2018 | Pub Date: 2018-04-23 | DOI: 10.1145/3184558.3191601

Zhiyu Chen, Haiyan Jia, J. Heflin, Brian D. Davison


Citations: 22

Abstract

Impoverished descriptions and convoluted schema labels are common challenges in data-centric tasks such as schema matching and data linking, especially when datasets span domains. To address these issues, we consider the task of schema label generation. Typically, schema labels are created by dataset providers and help users understand a dataset. The motivation behind the task is that many data linking systems require overlapping information between two datasets and rely on unique identifiers of schema labels. Moreover, schema labels in different datasets commonly have different identifiers even when they refer to the same concept. With no naming standard for schema labels, unintelligible labels are widespread in real-world datasets. For example, many schema labels contain abbreviations and compound nouns that hinder automated matching of attributes in corresponding datasets. Through schema label generation, more common (and thus more understandable) schema labels can be provided to allow for broader schema matches in contexts such as dataset search and data linking. We develop a variety of features based on analysis of dataset content to enable machine learning methods to recommend useful labels. We test our approach on two real-world data collections and demonstrate that our method outperforms the alternative approach.
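
The abstract does not list the features or the learning method used, so the following is only a minimal sketch of the general idea of recommending a label from column content. Everything in it is an assumption made for illustration: the four features, the nearest-neighbour ranking, and the names column_features and recommend_labels are hypothetical and are not taken from the paper.

```python
import math


def _is_number(text):
    """Return True if the cell text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def column_features(values):
    """Summarize a column's content as a small feature vector.

    Illustrative features (assumed, not from the paper): share of numeric
    cells, scaled mean cell length, share of distinct cells, share of digits.
    """
    cells = [str(v).strip() for v in values if str(v).strip()]
    if not cells:
        return [0.0, 0.0, 0.0, 0.0]
    n = len(cells)
    total_chars = sum(len(c) for c in cells)
    numeric = sum(1 for c in cells if _is_number(c)) / n
    mean_len = min((total_chars / n) / 20.0, 1.0)  # cap so all features lie in [0, 1]
    distinct = len(set(cells)) / n
    digit_ratio = sum(ch.isdigit() for c in cells for ch in c) / total_chars
    return [numeric, mean_len, distinct, digit_ratio]


def recommend_labels(values, labeled_columns, k=3):
    """Rank known labels by Euclidean distance between feature vectors."""
    target = column_features(values)
    scored = sorted(
        (math.dist(target, column_features(sample)), label)
        for label, sample in labeled_columns
    )
    ranked = []
    for _, label in scored:
        if label not in ranked:  # keep each label once, closest first
            ranked.append(label)
    return ranked[:k]


# Hypothetical labeled columns standing in for a training collection.
labeled = [
    ("zip_code", ["18015", "18018", "19104"]),
    ("city", ["Bethlehem", "Allentown", "Philadelphia"]),
    ("price", ["12.50", "3.99", "120.00"]),
]
suggestions = recommend_labels(["10001", "94105", "60601"], labeled, k=2)
```

With these toy inputs, a column of five-digit codes lands closest to the "zip_code" examples, so that label is suggested first. A real system would replace the distance ranking with a trained classifier and a much richer feature set.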

