Modular framework for similarity-based dataset discovery using external knowledge

IF 1.5 | Zone 4, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Technologies and Applications Pub Date : 2022-02-15 DOI:10.1108/dta-09-2021-0261

M. Nečaský, P. Škoda, D. Bernhauer, Jakub Klímek, T. Skopal

Volume: 38 1. Pages: 506-535. Publication type: Journal Article. Open access: no.

Citations: 1

Abstract

Purpose: Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemas, shared attributes, vocabulary, structure and semantics. Existing dataset catalogs provide basic search functionality that relies on keyword search over brief, incomplete or misleading textual metadata attached to the datasets, so the search results are often insufficient. However, dataset discovery can be improved in many ways, for example by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, and a wide range of feature extraction methods and description models.

Design/methodology/approach: In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components that can be assembled into custom pipelines for dataset representation and discovery.

Findings: The study proposes several proof-of-concept pipelines, including an experimental evaluation, which showcase the usage of the framework.

Originality/value: To the best of the authors' knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework aims to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.
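The idea of assembling interchangeable components into a discovery pipeline can be illustrated with a minimal Python sketch. This is not the API of the authors' prototype: every name below (`Pipeline`, `title_tokens`, `make_expander`, `jaccard`) is hypothetical, the "external knowledge" is a hand-written synonym table standing in for a real knowledge base, and Jaccard similarity is just one possible similarity component chosen for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical types for illustration only; not taken from the paper's prototype.
Dataset = Dict[str, str]
Descriptor = Set[str]


def title_tokens(dataset: Dataset) -> Descriptor:
    """Extraction component: lowercase word tokens of the dataset title."""
    return set(dataset.get("title", "").lower().split())


def make_expander(synonyms: Dict[str, Set[str]]) -> Callable[[Descriptor], Descriptor]:
    """Enrichment component: add related terms from an external lookup table."""
    def expand(tokens: Descriptor) -> Descriptor:
        expanded = set(tokens)
        for token in tokens:
            expanded |= synonyms.get(token, set())
        return expanded
    return expand


def jaccard(a: Descriptor, b: Descriptor) -> float:
    """Similarity component: |a ∩ b| / |a ∪ b|, defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass
class Pipeline:
    """A pipeline is just a choice of one component per stage."""
    extract: Callable[[Dataset], Descriptor]
    enrich: Callable[[Descriptor], Descriptor]
    similarity: Callable[[Descriptor, Descriptor], float]

    def describe(self, dataset: Dataset) -> Descriptor:
        return self.enrich(self.extract(dataset))

    def rank(self, query: Dataset, catalog: List[Dataset]) -> List[Tuple[str, float]]:
        """Return (dataset id, score) pairs, most similar first."""
        q = self.describe(query)
        scored = [(d["id"], self.similarity(q, self.describe(d))) for d in catalog]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy catalog and query, invented for this example.
catalog = [
    {"id": "a", "title": "Bus timetables Prague"},
    {"id": "b", "title": "Air quality measurements"},
]
query = {"id": "q", "title": "Prague public transport"}
synonyms = {"bus": {"transport"}, "transport": {"bus"}}

# Two pipelines that differ only in the enrichment component.
plain = Pipeline(title_tokens, lambda tokens: tokens, jaccard)
enriched = Pipeline(title_tokens, make_expander(synonyms), jaccard)

print(plain.rank(query, catalog))
print(enriched.rank(query, catalog))
```

In this toy setting, swapping the identity enrichment for the synonym expander raises the score of dataset `a` from 0.2 to 0.6 while dataset `b` stays at 0.0. Comparing pipelines that differ in a single component is the kind of experiment the abstract describes.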

Source journal

Data Technologies and Applications Social Sciences-Library and Information Sciences

CiteScore: 3.80

Self-citation rate: 6.20%

Journal description: Previously published as: Program. Online from: 2018. Subject area: Information & Knowledge Management, Library Studies.