Tong Liu, Derong Shen, Jing Shan, Tiezheng Nie, Yue Kou
{"title":"基于实体识别与集成的自底向上Web数据提取方法","authors":"Tong Liu, Derong Shen, Jing Shan, Tiezheng Nie, Yue Kou","doi":"10.1109/WISA.2011.37","DOIUrl":null,"url":null,"abstract":"Nowadays, most popular methods for web data extraction (WDE) are top-down ones depending on structure. However, these techniques are not scalable enough when coming to complex pages. Consequently, we put forward a bottom-up approach for WDE based on entity recognition and integration to avoid over dependency to structure of web pages. The approach proposed focuses on primary text sequences labeling first and also gives consideration to repetitive patterns of them as well. We propose a Two-Level extraction model for entity recognition and repetitive pattern extraction algorithm for entity integration. Our approach can effectively reduce the attribute labeling mistakes. Also, we demonstrate our approach by scientifically experimental results. The conclusion is that our approach perform better than the traditional extraction techniques, especially on complex Web pages. Keywords-web data extraction; entity recognition; entity integration; bottom-up","PeriodicalId":242633,"journal":{"name":"2011 Eighth Web Information Systems and Applications Conference","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Bottom-up Approach of Web Data Extraction based on Entity Recognition and Integration\",\"authors\":\"Tong Liu, Derong Shen, Jing Shan, Tiezheng Nie, Yue Kou\",\"doi\":\"10.1109/WISA.2011.37\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Nowadays, most popular methods for web data extraction (WDE) are top-down ones depending on structure. However, these techniques are not scalable enough when coming to complex pages. Consequently, we put forward a bottom-up approach for WDE based on entity recognition and integration to avoid over dependency to structure of web pages. The approach proposed focuses on primary text sequences labeling first and also gives consideration to repetitive patterns of them as well. We propose a Two-Level extraction model for entity recognition and repetitive pattern extraction algorithm for entity integration. Our approach can effectively reduce the attribute labeling mistakes. Also, we demonstrate our approach by scientifically experimental results. The conclusion is that our approach perform better than the traditional extraction techniques, especially on complex Web pages. Keywords-web data extraction; entity recognition; entity integration; bottom-up\",\"PeriodicalId\":242633,\"journal\":{\"name\":\"2011 Eighth Web Information Systems and Applications Conference\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 Eighth Web Information Systems and Applications Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/WISA.2011.37\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 Eighth Web Information Systems and Applications Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WISA.2011.37","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Bottom-up Approach of Web Data Extraction based on Entity Recognition and Integration
Nowadays, most popular methods for web data extraction (WDE) are top-down ones depending on structure. However, these techniques are not scalable enough when coming to complex pages. Consequently, we put forward a bottom-up approach for WDE based on entity recognition and integration to avoid over dependency to structure of web pages. The approach proposed focuses on primary text sequences labeling first and also gives consideration to repetitive patterns of them as well. We propose a Two-Level extraction model for entity recognition and repetitive pattern extraction algorithm for entity integration. Our approach can effectively reduce the attribute labeling mistakes. Also, we demonstrate our approach by scientifically experimental results. The conclusion is that our approach perform better than the traditional extraction techniques, especially on complex Web pages. Keywords-web data extraction; entity recognition; entity integration; bottom-up