HTML Block Similarity Estimation

2018 IEEE 6th Workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE). Pub Date: 2018-11-01. DOI: 10.1109/AIEEE.2018.8592241

Kiril Griazev, Simona Ramanauskait

Citations: 1

Abstract

Automatic data extraction is an important task, but websites contain a lot of secondary information that has little value, so it is important to identify information blocks correctly. One technique for doing so is HTML block comparison, which identifies blocks by estimating a similarity score between them. This paper proposes an algorithm for HTML block similarity estimation using three methods: structure similarity; structure and tag similarity; and structure, tag, and content similarity. Additionally, the proposed algorithm is tested against other open-source algorithms by analyzing the same data.
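
To make the three comparison levels concrete, the following is a minimal Python sketch, not the algorithm proposed in the paper. It assumes that structure can be approximated by the sequence of element nesting depths, that tag similarity compares (depth, tag name) pairs, and that content similarity compares whitespace-separated words. The function name `block_similarity`, the use of `difflib.SequenceMatcher`, and the equal-weight averaging of the component scores are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _BlockParser(HTMLParser):
    """Collects depth, tag, and word sequences from one HTML block."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.depths = []  # structure only: nesting depth of each element
        self.tags = []    # structure + tag: (depth, tag name) pairs
        self.words = []   # content: words found in text nodes

    def handle_starttag(self, tag, attrs):
        self.depths.append(self.depth)
        self.tags.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        self.words.extend(data.split())


def _ratio(a, b):
    """Sequence similarity in [0, 1]; two empty sequences count as equal."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def block_similarity(html_a, html_b):
    """Return the three hypothetical scores for a pair of HTML blocks."""
    pa, pb = _BlockParser(), _BlockParser()
    pa.feed(html_a)
    pb.feed(html_b)
    structure = _ratio(pa.depths, pb.depths)
    tags = _ratio(pa.tags, pb.tags)
    content = _ratio(pa.words, pb.words)
    # Equal weighting is an assumption made for this sketch only.
    return {
        "structure": structure,
        "structure_tag": (structure + tags) / 2,
        "structure_tag_content": (structure + tags + content) / 3,
    }


if __name__ == "__main__":
    a = "<div><h2>Title</h2><p>Some text</p></div>"
    b = "<div><h3>Title</h3><p>Other text</p></div>"
    print(block_similarity(a, b))
```

In this sketch, two blocks with the same nesting but different tag names or text score 1.0 on structure alone and lower on the two richer measures, which is the kind of distinction the three methods are meant to expose.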

