Integrating spreadsheet data via accurate and low-effort extraction

Zhe Chen, Michael J. Cafarella
{"title":"Integrating spreadsheet data via accurate and low-effort extraction","authors":"Zhe Chen, Michael J. Cafarella","doi":"10.1145/2623330.2623617","DOIUrl":null,"url":null,"abstract":"Spreadsheets contain valuable data on many topics. However, spreadsheets are difficult to integrate with other data sources. Converting spreadsheet data to the relational model would allow data analysts to use relational integration tools. We propose a two-phase semiautomatic system that extracts accurate relational metadata while minimizing user effort. Based on an undirected graphical model, our system enables downstream spreadsheet integration applications. First, the automatic extractor uses hints from spreadsheets' graphical style and recovered metadata to extract the spreadsheet data as accurately as possible. Second, the interactive repair identifies similar regions in distinct spreadsheets scattered across large spreadsheet corpora, allowing a user's single manual repair to be amortized over many possible extraction errors. Our experiments show that a human can obtain the accurate extraction with just 31% of the manual operations required by a standard classification based technique on two real-world datasets.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"64 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"68","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2623330.2623617","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 68

Abstract

Spreadsheets contain valuable data on many topics. However, spreadsheets are difficult to integrate with other data sources. Converting spreadsheet data to the relational model would allow data analysts to use relational integration tools. We propose a two-phase semiautomatic system that extracts accurate relational metadata while minimizing user effort. Based on an undirected graphical model, our system enables downstream spreadsheet integration applications. First, the automatic extractor uses hints from spreadsheets' graphical style and recovered metadata to extract the spreadsheet data as accurately as possible. Second, the interactive repair identifies similar regions in distinct spreadsheets scattered across large spreadsheet corpora, allowing a user's single manual repair to be amortized over many possible extraction errors. Our experiments show that a human can obtain the accurate extraction with just 31% of the manual operations required by a standard classification based technique on two real-world datasets.
通过精确和低工作量的提取集成电子表格数据
电子表格包含许多主题的有价值的数据。然而,电子表格很难与其他数据源集成。将电子表格数据转换为关系模型将允许数据分析师使用关系集成工具。我们提出了一个两阶段的半自动系统,它可以提取准确的关系元数据,同时最大限度地减少用户的工作量。基于无向图形模型,我们的系统支持下游电子表格集成应用程序。首先,自动提取器使用电子表格图形样式的提示和恢复的元数据尽可能准确地提取电子表格数据。其次,交互式修复识别分散在大型电子表格语料库中的不同电子表格中的相似区域,允许用户的单个手动修复分摊到许多可能的提取错误上。我们的实验表明,在两个真实世界的数据集上,一个人只需要31%的人工操作,就可以获得基于标准分类技术的准确提取。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信