{"title":"Web Crawling Algorithm Fusing TF-IDF and Word2Vec Feature Extraction","authors":"Xinyue Feng","doi":"10.13052/jwe1540-9589.2452","DOIUrl":null,"url":null,"abstract":"Current research focuses on how to efficiently extract and crawl network information because, with the growth of the Internet, network information is becoming more and more diverse. To address the problem of incorrect data extraction and topic judgment of web crawlers, this study proposes a novel approach based on a file inverse frequency algorithm and Word2Vec feature extraction. The new method improves the retrieval capability of web crawlers by using the file inverse frequency algorithm and uses Word2Vec to extract data features, which improves the data extraction capability of current crawlers. The results showed that the F1 values of the research use model were 25.8% and 26.2% higher than those of the digital filtering algorithm, respectively. The total number of localization resources for the research use strategy was 2800 and the network coverage was 81%, which was 12% higher than the optimal strategy. The research use strategy had a shorter retrieval time and the model could recognize the vocabulary of the keywords. Finally, the model used by the research also had a good model processing capability when compared to other models. In summary, the new model built by the research can improve the data retrieval ability and data extraction ability of the web crawler, which provides new research ideas for future web information extraction.","PeriodicalId":49952,"journal":{"name":"Journal of Web Engineering","volume":"24 5","pages":"713-738"},"PeriodicalIF":1.0000,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11135464","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Web Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11135464/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0
Abstract
Current research focuses on how to efficiently extract and crawl network information because, with the growth of the Internet, network information is becoming more and more diverse. To address the problem of incorrect data extraction and topic judgment of web crawlers, this study proposes a novel approach based on a file inverse frequency algorithm and Word2Vec feature extraction. The new method improves the retrieval capability of web crawlers by using the file inverse frequency algorithm and uses Word2Vec to extract data features, which improves the data extraction capability of current crawlers. The results showed that the F1 values of the research use model were 25.8% and 26.2% higher than those of the digital filtering algorithm, respectively. The total number of localization resources for the research use strategy was 2800 and the network coverage was 81%, which was 12% higher than the optimal strategy. The research use strategy had a shorter retrieval time and the model could recognize the vocabulary of the keywords. Finally, the model used by the research also had a good model processing capability when compared to other models. In summary, the new model built by the research can improve the data retrieval ability and data extraction ability of the web crawler, which provides new research ideas for future web information extraction.
期刊介绍:
The World Wide Web and its associated technologies have become a major implementation and delivery platform for a large variety of applications, ranging from simple institutional information Web sites to sophisticated supply-chain management systems, financial applications, e-government, distance learning, and entertainment, among others. Such applications, in addition to their intrinsic functionality, also exhibit the more complex behavior of distributed applications.