Data Source Selection for Large-Scale Deep Web Data Integration
Xuefeng Xian, Pengpeng Zhao, Wei Fang, Jie Xin, Zhiming Cui
2009 Second Pacific-Asia Conference on Web Mining and Web-based Application, published 2009-06-06
DOI: 10.1109/WMWA.2009.25 (https://doi.org/10.1109/WMWA.2009.25)
Citations: 6
Abstract
The deep web has become an important resource on the web because of its rich, high-quality information, and this has given rise to a new application area in data mining and data integration. Hundreds or thousands of data sources may provide data relevant to a particular domain on the web, so a primary challenge in large-scale deep web data integration is to determine the order in which the user should integrate candidate data sources. In this paper, we develop a most-benefit approach (MBA) for ordering candidate data sources for user integration. At the core of this approach is a utility function that quantifies the utility of a given state of the integration system; we devise such a utility function based on the number of query results. We show in practice how to apply MBA efficiently in concert with this utility function to order data sources. A detailed experimental evaluation on real datasets shows that the ordering of data sources produced by this MBA-based method yields an integration system with significantly higher utility than a wide range of other ordering strategies.
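The abstract does not give the exact utility function or the ordering procedure, so the following is only a minimal sketch of one plausible reading: a greedy ordering that repeatedly picks the source with the largest marginal gain in utility. Everything here is an assumption made for illustration, including the names `utility` and `most_benefit_order`, the choice to measure utility as the count of distinct results over a fixed query workload, and the toy data.

```python
from typing import Dict, List, Set

# Hypothetical representation: each source maps a query to the set of
# result identifiers it returns for that query.
SourceResults = Dict[str, Set[str]]


def utility(integrated: List[SourceResults], queries: List[str]) -> int:
    """Assumed utility: number of distinct results, summed over the workload."""
    total = 0
    for query in queries:
        seen: Set[str] = set()
        for source in integrated:
            seen |= source.get(query, set())
        total += len(seen)
    return total


def most_benefit_order(
    sources: Dict[str, SourceResults], queries: List[str]
) -> List[str]:
    """Greedily order sources by marginal utility gain (illustrative only)."""
    remaining = dict(sources)
    chosen: List[SourceResults] = []
    order: List[str] = []
    while remaining:
        base = utility(chosen, queries)
        # Iterating over sorted names makes ties resolve alphabetically.
        best = max(
            sorted(remaining),
            key=lambda name: utility(chosen + [remaining[name]], queries) - base,
        )
        order.append(best)
        chosen.append(remaining.pop(best))
    return order


if __name__ == "__main__":
    # Toy data, invented for this example.
    toy_sources = {
        "A": {"q1": {"r1", "r2"}, "q2": {"r5"}},
        "B": {"q1": {"r1", "r2", "r3"}, "q2": {"r5", "r6"}},
        "C": {"q1": {"r4"}, "q2": set()},
    }
    print(most_benefit_order(toy_sources, ["q1", "q2"]))
```

In this toy workload, source B is chosen first because it alone contributes the most distinct results. Source C then ranks ahead of A because A adds nothing that B has not already supplied. This recomputes utility from scratch at every step, which is fine for a sketch but would need incremental bookkeeping to scale to thousands of sources.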