Title: A Deep Web Data Extraction Framework Enhancement Method
Authors: Salar Faisal Noori, Bazeer Ahamed B
Journal: Journal of Computational Science and Intelligent Technologies
DOI: https://doi.org/10.53409/mnaa/jcsit/e202203013342
Citations: 0
Abstract
Existing solutions to the data extraction problem analyze the HTML DOM tree and the tags of the response page. Although these techniques can produce good results, they depend heavily on the HTML specification. To extract deep web data effectively, this research proposes a two-stage methodology. In the first stage, the proposed system performs “normal crawling” for the user’s text query, and an improved weighting function (ITF-IDF) adopted by the crawler is used to select the important websites. In the second stage, “data region extraction” is carried out to gather data records. The proposed data extractor identifies visual blocks from their visual characteristics and groups them into clusters of similar format based on format trees and appearance similarity. The visual blocks in the cluster with the highest weight are then extracted as data records. Experiments show that the proposed framework outperforms earlier data extraction approaches.
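The two stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define ITF-IDF, so the first function substitutes plain smoothed TF-IDF, and the names `tfidf_scores`, `VisualBlock`, `similar`, and `extract_records` are hypothetical. The `tag_path` field is only a stand-in for a format tree, and the cluster weight (total rendered area) and the 10% size tolerance are assumptions made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


def tfidf_scores(query, pages):
    """Stage 1 stand-in: score pages against the query with plain TF-IDF.

    This is standard smoothed TF-IDF, not the paper's ITF-IDF.
    """
    docs = [Counter(text.lower().split()) for text in pages]
    n_docs = len(docs)
    scores = []
    for doc in docs:
        total = sum(doc.values()) or 1
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d)  # document frequency
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
            score += (doc[term] / total) * idf
        scores.append(score)
    return scores


@dataclass(frozen=True)
class VisualBlock:
    text: str
    width: int
    height: int
    font_size: int
    tag_path: str  # assumed stand-in for the block's format tree


def similar(a, b, tol=0.1):
    """Assumed rule: same tag path and font size, sizes within tol of each other."""
    if a.tag_path != b.tag_path or a.font_size != b.font_size:
        return False
    return (abs(a.width - b.width) <= tol * max(a.width, b.width)
            and abs(a.height - b.height) <= tol * max(a.height, b.height))


def extract_records(blocks):
    """Stage 2 stand-in: greedily cluster blocks, return the heaviest cluster."""
    clusters = []
    for block in blocks:
        for cluster in clusters:
            if similar(cluster[0], block):  # compare with the cluster's first block
                cluster.append(block)
                break
        else:
            clusters.append([block])
    # Assumed weight: total rendered area covered by the cluster.
    return max(clusters, key=lambda c: sum(b.width * b.height for b in c), default=[])
```

On a page with one navigation bar, one footer, and several similarly sized result items, `extract_records` would return the result items, since together they cover the largest area. A real extractor would take block geometry from a rendering engine and would need a weight that also reflects content, which the abstract does not specify.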