Discovering Missing Values in Semi-Structured Databases

RIAO Conference | Pub Date: 2007-05-30 | DOI: 10.5555/1931390.1931456

Xing Yi, James Allan, V. Lavrenko

Citations: 5

Abstract

We explore the problem of discovering multiple missing values in a semi-structured database. For this task, we formally develop the Structured Relevance Model (SRM), which is built on a hypothetical generative model for semi-structured records. SRM is based on the idea that plausible values for a given field can be inferred from the context provided by the other fields in the record. Small-scale experiments on IMDb (the Internet Movie Database) show that SRM matched the performance of three state-of-the-art relational learning approaches on movie label prediction tasks. Large-scale experiments on a snapshot of the National Science Digital Library (NSDL) repository show that, compared with state-of-the-art machine learning approaches, SRM is highly effective at discovering possible values for free-text fields, even with quite modest amounts of training data.
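The core idea in the abstract, inferring a missing field from the context of the other fields, can be illustrated with a small sketch. This is not the SRM formulation from the paper, whose details are not given here. It is a simplified stand-in that assumes each training record is weighted by how well smoothed per-field unigram models of that record explain the observed fields of the incomplete record, and that candidate values are then ranked by the total weight of the records carrying them. The function names, the field names (`title`, `description`, `subject`), the smoothing parameter `mu`, and the toy records are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def field_log_likelihood(query_tokens, record_tokens, vocab_size, mu=1.0):
    """Log-probability of the query tokens under an additively smoothed
    unigram model estimated from one field of a training record."""
    counts = Counter(record_tokens)
    total = len(record_tokens)
    return sum(
        math.log((counts[tok] + mu) / (total + mu * vocab_size))
        for tok in query_tokens
    )


def rank_missing_values(query, training_records, target_field):
    """Rank candidate values for target_field, using the query's other
    fields as context (illustrative only, not the published SRM)."""
    if not training_records:
        return []
    context_fields = [f for f in query if f != target_field]

    # Vocabulary over all context fields, needed for smoothing.
    vocab = set()
    for rec in training_records + [query]:
        for f in context_fields:
            vocab.update(rec.get(f, "").lower().split())
    vocab_size = max(len(vocab), 1)

    # Weight each training record by how well it explains the query's fields.
    log_weights = [
        sum(
            field_log_likelihood(
                query[f].lower().split(),
                rec.get(f, "").lower().split(),
                vocab_size,
            )
            for f in context_fields
        )
        for rec in training_records
    ]
    shift = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_weights]

    # Accumulate weight per candidate value of the missing field.
    scores = defaultdict(float)
    for rec, w in zip(training_records, weights):
        value = rec.get(target_field)
        if value:
            scores[value] += w
    norm = sum(scores.values())
    if norm == 0:
        return []
    return sorted(
        ((value, s / norm) for value, s in scores.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )


# Hypothetical toy records; the real experiments used IMDb and NSDL data.
records = [
    {"title": "introduction to photosynthesis",
     "description": "how plants convert light into energy",
     "subject": "biology"},
    {"title": "plant cells and light",
     "description": "plants chloroplasts light energy",
     "subject": "biology"},
    {"title": "solving linear equations",
     "description": "algebra practice with equations",
     "subject": "mathematics"},
]
query = {"title": "light and plants",
         "description": "energy from light in plants"}  # "subject" is missing

ranked = rank_missing_values(query, records, "subject")
for value, score in ranked:
    print(f"{value}: {score:.3f}")
```

The ranked list plays the role of the "plausible values" mentioned in the abstract: each candidate for the missing field receives a normalized score, so several values can be proposed for one record rather than a single hard prediction.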

