Web Crawling Algorithm Fusing TF-IDF and Word2Vec Feature Extraction

IF 1 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Web Engineering | Pub Date: 2025-07-01 | DOI: 10.13052/jwe1540-9589.2452

Xinyue Feng

Journal of Web Engineering, vol. 24, no. 5, pp. 713-738. Journal Article. Full text: https://ieeexplore.ieee.org/document/11135464/

Citations: 0

Abstract

As the Internet grows, network information is becoming increasingly diverse, and current research focuses on how to extract and crawl it efficiently. To address incorrect data extraction and topic judgment in web crawlers, this study proposes a novel approach based on the term frequency-inverse document frequency (TF-IDF) algorithm and Word2Vec feature extraction. The method uses TF-IDF to improve the crawler's retrieval capability and Word2Vec to extract data features, which improves the data extraction capability of current crawlers. The results showed that the F1 values of the proposed model were 25.8% and 26.2% higher, respectively, than those of the digital filtering algorithm. The total number of resources located by the proposed strategy was 2800 and its network coverage was 81%, which was 12% higher than that of the optimal strategy. The proposed strategy also had a shorter retrieval time, and the model could recognize the vocabulary of the keywords. Finally, the proposed model showed good processing capability compared with other models. In summary, the new model can improve the data retrieval and data extraction abilities of web crawlers, which provides new research ideas for future web information extraction.
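The abstract does not say how the TF-IDF weights and the Word2Vec features are combined, so the following is only a minimal sketch of one common way to fuse them: represent each page as the TF-IDF-weighted average of its word vectors and rank pages by cosine similarity to a topic description. Everything here is an assumption made for illustration: the function names (`tfidf`, `weighted_embedding`, `cosine`, `score_pages`), the smoothed IDF formula, and the tiny hand-written `TOY_VECTORS` table, which merely stands in for embeddings that a trained Word2Vec model would supply.

```python
import math
from collections import Counter

# Hypothetical 2-D "embeddings"; a real crawler would load trained Word2Vec vectors.
TOY_VECTORS = {
    "crawler": [1.0, 0.1], "web": [0.9, 0.2], "spider": [0.95, 0.15],
    "recipe": [0.1, 1.0], "cake": [0.0, 0.9], "page": [0.5, 0.5],
}

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (smoothed IDF, an assumption)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return weights

def weighted_embedding(weights, vectors, dim):
    """TF-IDF-weighted average of word vectors; words without a vector are skipped."""
    acc = [0.0] * dim
    total = 0.0
    for term, w in weights.items():
        vec = vectors.get(term)
        if vec is None:
            continue
        total += w
        for i in range(dim):
            acc[i] += w * vec[i]
    return [x / total for x in acc] if total > 0 else acc

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def score_pages(pages, topic, vectors=TOY_VECTORS, dim=2):
    """Score each tokenized page against a tokenized topic description."""
    weights = tfidf(pages + [topic])          # topic is treated as one more document
    topic_vec = weighted_embedding(weights[-1], vectors, dim)
    return [cosine(weighted_embedding(w, vectors, dim), topic_vec)
            for w in weights[:-1]]

if __name__ == "__main__":
    pages = [["web", "spider", "page"], ["cake", "recipe", "page"]]
    for page, score in zip(pages, score_pages(pages, ["web", "crawler"])):
        print(page, round(score, 3))
```

In a focused crawler, a score like this could decide whether a fetched page is on topic and whether its outgoing links are worth queueing. The threshold and queueing policy are design choices that the abstract does not specify.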


Source journal

Journal of Web Engineering (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore

1.80

Self-citation rate

12.50%

Publication volume

Review time

9 months

Journal description: The World Wide Web and its associated technologies have become a major implementation and delivery platform for a large variety of applications, ranging from simple institutional information Web sites to sophisticated supply-chain management systems, financial applications, e-government, distance learning, and entertainment, among others. Such applications, in addition to their intrinsic functionality, also exhibit the more complex behavior of distributed applications.