Extracting attribute-value pairs from product specifications on the web

P. Petrovski, Christian Bizer
Published in: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, 2017-08-23
DOI: 10.1145/3106426.3106449 (https://doi.org/10.1145/3106426.3106449)
Citations: 22

Abstract

Comparison shopping portals integrate product offers from large numbers of e-shops in order to support consumers in their buying decisions. Product offers often consist of a title and a free-text product description, both describing product attributes that are considered relevant by the specific vendor. In addition, product offers might contain structured or semi-structured product specifications in the form of HTML tables and HTML lists. As product specifications often cover more product attributes than free-text descriptions, being able to extract attribute-value pairs from these specifications is a critical prerequisite for achieving good results in tasks such as product matching, product categorisation, faceted product search, and product recommendation. In this paper, we present an approach for extracting attribute-value pairs from product specifications on the Web. We use supervised learning to classify the HTML tables and HTML lists within a web page as product specifications or not. In order to extract attribute-value pairs from the HTML fragments identified by the specification detector, we again use supervised learning to classify columns as attribute columns or value columns. Compared to DEXTER, the current state-of-the-art approach for extracting attribute-value pairs from product specifications, we introduce several new features for specification detection and support the extraction of attribute-value pairs from specifications having more than two columns. This allows us to improve the F-score by up to 10% for extracting attribute-value pairs from tables and by up to 3% for lists. In addition, we report the results of using duplicate-based schema matching to align the product attribute schemata of 32 different e-shops. This experiment confirms the suitability of duplicate-based schema matching for product data integration.
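
To make the final extraction step concrete, the following is a minimal sketch and not the authors' implementation. It assumes the specification table has already been detected and that each column has already been labelled as an attribute column or a value column; in the paper both decisions come from supervised classifiers, which are not reproduced here. The names `TableRows`, `extract_pairs`, and `column_roles` are hypothetical and chosen for illustration only. The sketch pairs each value cell with the nearest attribute cell to its left, which is one plausible way to handle tables with more than two columns.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and normalise whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_pairs(html, column_roles):
    """Return (attribute, value) tuples from one specification table.

    column_roles is an assumed input: one label per column, either
    "attribute" or "value", standing in for the column classifier's output.
    """
    parser = TableRows()
    parser.feed(html)
    parser.close()
    pairs = []
    for row in parser.rows:
        attribute = None
        for role, cell in zip(column_roles, row):
            if role == "attribute":
                attribute = cell
            elif role == "value" and attribute and cell:
                # Pair the value with the closest attribute to its left.
                pairs.append((attribute, cell))
    return pairs
```

For a four-column table laid out as attribute, value, attribute, value, calling `extract_pairs(table_html, ["attribute", "value", "attribute", "value"])` yields two pairs per row. Rows with empty attribute or value cells are skipped rather than guessed at.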