Making Sense of Entities and Quantities in Web Tables
Yusra Ibrahim, Mirek Riedewald, G. Weikum
Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016-10-24
DOI: 10.1145/2983323.2983772 (https://doi.org/10.1145/2983323.2983772)
Citations: 39
Abstract
HTML tables and spreadsheets on the Internet or in enterprise intranets often contain valuable information, but are created ad hoc. As a result, they usually lack systematic names for column headers and a clear vocabulary for cell values. This limits the re-use of such tables and creates a huge heterogeneity problem when comparing or aggregating multiple tables. This paper aims to overcome this problem by automatically canonicalizing header names and cell values onto concepts, classes, entities and uniquely represented quantities registered in a knowledge base. To this end, we devise a probabilistic graphical model that captures coherence dependencies between cells in tables and candidate items in the space of concepts, entities and quantities. We give specific consideration to quantities, which are mapped into a "measure, value, unit" triple over a taxonomy of physical (e.g., power consumption), monetary (e.g., revenue), temporal (e.g., date) and dimensionless (e.g., counts) measures. Our experiments with Web tables from diverse domains demonstrate the viability of our method and its benefits over baselines.
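To make the "measure, value, unit" triple concrete, here is a minimal sketch of how a single quantity cell might be canonicalized. It is not the method of the paper, which uses a probabilistic graphical model over whole tables; it is a simple rule-based illustration. The names `Quantity`, `UNITS`, `SCALES` and `canonicalize`, as well as the small unit table, are assumptions made up for this example.

```python
import re
from typing import NamedTuple, Optional


class Quantity(NamedTuple):
    measure: str  # kind of measure, e.g. "power", "money", "count"
    value: float  # magnitude expressed in the canonical unit
    unit: str     # canonical unit chosen for the measure


# Hypothetical unit table: surface form -> (measure, canonical unit, factor).
# A real system would draw this from a knowledge base or unit taxonomy.
UNITS = {
    "W": ("power", "watt", 1.0),
    "kW": ("power", "watt", 1e3),
    "MW": ("power", "watt", 1e6),
    "$": ("money", "USD", 1.0),
    "USD": ("money", "USD", 1.0),
    "": ("count", "dimensionless", 1.0),  # bare numbers are treated as counts
}

# Hypothetical scale words that may follow the number.
SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

# Optional "$" prefix, a number with optional thousands separators,
# an optional scale word, and an optional alphabetic unit suffix.
_PATTERN = re.compile(
    r"^\s*(\$)?\s*([-+]?\d[\d,]*(?:\.\d+)?)\s*"
    r"(thousand|million|billion)?\s*([A-Za-z]*)\s*$"
)


def canonicalize(cell: str) -> Optional[Quantity]:
    """Map a raw cell string to a Quantity, or None if it is not recognized."""
    match = _PATTERN.match(cell)
    if match is None:
        return None
    prefix, number, scale, suffix = match.groups()
    unit_key = prefix or suffix or ""
    if unit_key not in UNITS:
        return None
    measure, unit, factor = UNITS[unit_key]
    value = float(number.replace(",", "")) * factor
    if scale:
        value *= SCALES[scale]
    return Quantity(measure, value, unit)


if __name__ == "__main__":
    for raw in ["1.5 kW", "$2.5 million", "1,200", "n/a"]:
        print(repr(raw), "->", canonicalize(raw))
```

Under these assumptions, cells such as "1.5 kW" and "1500 W" map to the same triple, which is what makes values from different tables comparable. Unlike the model described in the abstract, this sketch looks at each cell in isolation and ignores the column header and neighboring cells, which is the context needed to resolve ambiguous cases.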