A strategy for identification of Web query interfaces using supervised learning
H. Marin-Castro, V. Sosa-Sosa, I. Lopez-Arevalo
2011 7th International Conference on Next Generation Web Services Practices, published 2011-12-01
DOI: https://doi.org/10.1109/NWESP.2011.6088183
The Deep Web is an enormous and constantly growing source of information. It comprises a large number of databases on the Web that are accessed through Web query interfaces related to different domains. The content of the Deep Web cannot be reached by traditional search engines, which makes it almost impossible for common users to access this useful information. There are several problems related to searching for content in the Deep Web. One of them is the automatic identification of Web query interfaces, which are a means of accessing the information in the Deep Web. The task of classifying the HTML forms contained in a Web page as Web query interfaces is challenging due to their enormous heterogeneity. This paper introduces a strategy that automatically identifies Web query interfaces independently of their domain. We select a suitable set of HTML elements and use them to build characteristic (feature) vectors that serve as input to a supervised classifier, which determines whether or not a Web page contains a Web query interface. The experimental results show that the proposed strategy is efficient and accurate, achieving better classification results than previously reported work.
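The pipeline outlined in the abstract (count selected HTML elements, build a characteristic vector, pass it to a supervised classifier) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration, not the authors' method: the abstract names neither the selected elements nor the classifier, so the FEATURES tuple, the nearest-centroid classifier, the "query" and "non_query" labels, and the toy training pages are all hypothetical stand-ins.

```python
from html.parser import HTMLParser

# Hypothetical feature set: the abstract does not list the HTML elements
# the authors actually selected.
FEATURES = ("text", "password", "select", "checkbox", "radio", "submit", "textarea")


class FormFeatureParser(HTMLParser):
    """Counts selected form controls that appear inside <form> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside a <form>
        self.counts = dict.fromkeys(FEATURES, 0)

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.depth += 1
        elif self.depth > 0:
            if tag == "input":
                # An <input> without a type attribute defaults to a text field.
                kind = (dict(attrs).get("type") or "text").lower()
            else:
                kind = tag
            if kind in self.counts:
                self.counts[kind] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self.depth > 0:
            self.depth -= 1


def feature_vector(html):
    """Returns the control counts for one page, ordered as in FEATURES."""
    parser = FormFeatureParser()
    parser.feed(html)
    return [parser.counts[name] for name in FEATURES]


def train_centroids(samples):
    """Averages the feature vectors of each label; samples is [(html, label)]."""
    sums, totals = {}, {}
    for html, label in samples:
        acc = sums.setdefault(label, [0.0] * len(FEATURES))
        for i, value in enumerate(feature_vector(html)):
            acc[i] += value
        totals[label] = totals.get(label, 0) + 1
    return {label: [v / totals[label] for v in acc] for label, acc in sums.items()}


def classify(html, centroids):
    """Returns the label whose centroid is closest in squared Euclidean distance."""
    vec = feature_vector(html)

    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy, hand-written training pages (assumed data, not from the paper).
TOY_SAMPLES = [
    ('<form><input type="text" name="q"><select name="c"><option>a</option>'
     '</select><input type="submit"></form>', "query"),
    ('<form><input name="title"><input name="author"><select><option>x</option>'
     '</select><input type="submit"></form>', "query"),
    ('<form><input type="text" name="user"><input type="password" name="pw">'
     '<input type="submit"></form>', "non_query"),
    ('<form><input type="text"><input type="password"><input type="checkbox">'
     '<input type="submit"></form>', "non_query"),
]

if __name__ == "__main__":
    centroids = train_centroids(TOY_SAMPLES)
    page = ('<form><input type="text" name="from"><input type="text" name="to">'
            '<select name="class"><option>eco</option></select>'
            '<input type="submit" value="Search"></form>')
    print(feature_vector(page), classify(page, centroids))
```

Nearest-centroid classification is used here only because it fits in a few lines of standard-library code. Any supervised classifier that accepts fixed-length numeric vectors could take its place, and a realistic evaluation would need a labeled corpus of real Web pages rather than four toy forms.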