HTML Block Similarity Estimation
Kiril Griazev, Simona Ramanauskait
2018 IEEE 6th Workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE), 2018-11-01
DOI: 10.1109/AIEEE.2018.8592241 (https://doi.org/10.1109/AIEEE.2018.8592241)
Automatic data extraction is an important task, but websites contain a lot of secondary information of little value, so correctly identifying information blocks is essential. One of several techniques for doing so is HTML block comparison, which identifies blocks by estimating a similarity score between them. This paper proposes an algorithm for HTML block similarity estimation that supports three methods: structure similarity; structure and tag similarity; and structure, tag, and content similarity. The proposed algorithm is also tested against other open-source algorithms on the same data.
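
To make the three comparison levels concrete, the following is a minimal sketch and not the algorithm proposed in the paper, whose details the abstract does not give. It assumes each block can be flattened into a token sequence (nesting shape only, shape plus tag names, or shape plus tag names plus text words) and that the two sequences are scored with the generic sequence matcher from Python's standard library. The names `tokenize`, `block_similarity`, and the mode strings are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Hypothetical mode names mirroring the three levels in the abstract.
MODES = ("structure", "tag", "content")


class _Tokenizer(HTMLParser):
    """Flattens an HTML fragment into a token list for one comparison mode."""

    def __init__(self, mode):
        super().__init__()
        self.mode = mode
        self.tokens = []

    def _open(self, tag):
        # In "structure" mode only the nesting shape is kept.
        return "(" if self.mode == "structure" else f"<{tag}>"

    def _close(self, tag):
        return ")" if self.mode == "structure" else f"</{tag}>"

    def handle_starttag(self, tag, attrs):
        self.tokens.append(self._open(tag))
        if tag in VOID_TAGS:
            # Close void elements immediately so the shape stays balanced.
            self.tokens.append(self._close(tag))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.tokens.append(self._close(tag))

    def handle_data(self, data):
        # Text is compared only in "content" mode, word by word.
        if self.mode == "content":
            self.tokens.extend(data.split())


def tokenize(html, mode):
    """Return the token sequence of an HTML fragment for the given mode."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    parser = _Tokenizer(mode)
    parser.feed(html)
    parser.close()
    return parser.tokens


def block_similarity(block_a, block_b, mode="structure"):
    """Score two HTML blocks in [0, 1]; 1.0 means identical token sequences."""
    matcher = SequenceMatcher(None, tokenize(block_a, mode),
                              tokenize(block_b, mode), autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    a = "<div><h2>Title</h2><p>Hello world</p></div>"
    b = "<div><h3>Other</h3><span>Bye now</span></div>"
    for mode in MODES:
        print(mode, round(block_similarity(a, b, mode), 3))
```

In this sketch, two blocks with the same nesting but different tag names score 1.0 in the structure mode and lower in the tag and content modes, which is the kind of distinction the three methods are meant to capture. A tree-based measure such as tree edit distance would be a natural alternative to flat sequence matching.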