Web Crawling Algorithm Fusing TF-IDF and Word2Vec Feature Extraction

IF 1 4区 计算机科学 Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING
Xinyue Feng
{"title":"Web Crawling Algorithm Fusing TF-IDF and Word2Vec Feature Extraction","authors":"Xinyue Feng","doi":"10.13052/jwe1540-9589.2452","DOIUrl":null,"url":null,"abstract":"Current research focuses on how to efficiently extract and crawl network information because, with the growth of the Internet, network information is becoming more and more diverse. To address the problem of incorrect data extraction and topic judgment of web crawlers, this study proposes a novel approach based on a file inverse frequency algorithm and Word2Vec feature extraction. The new method improves the retrieval capability of web crawlers by using the file inverse frequency algorithm and uses Word2Vec to extract data features, which improves the data extraction capability of current crawlers. The results showed that the F1 values of the research use model were 25.8% and 26.2% higher than those of the digital filtering algorithm, respectively. The total number of localization resources for the research use strategy was 2800 and the network coverage was 81%, which was 12% higher than the optimal strategy. The research use strategy had a shorter retrieval time and the model could recognize the vocabulary of the keywords. Finally, the model used by the research also had a good model processing capability when compared to other models. In summary, the new model built by the research can improve the data retrieval ability and data extraction ability of the web crawler, which provides new research ideas for future web information extraction.","PeriodicalId":49952,"journal":{"name":"Journal of Web Engineering","volume":"24 5","pages":"713-738"},"PeriodicalIF":1.0000,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11135464","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Web Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11135464/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0

Abstract

Current research focuses on how to efficiently extract and crawl network information because, with the growth of the Internet, network information is becoming more and more diverse. To address the problem of incorrect data extraction and topic judgment of web crawlers, this study proposes a novel approach based on a file inverse frequency algorithm and Word2Vec feature extraction. The new method improves the retrieval capability of web crawlers by using the file inverse frequency algorithm and uses Word2Vec to extract data features, which improves the data extraction capability of current crawlers. The results showed that the F1 values of the research use model were 25.8% and 26.2% higher than those of the digital filtering algorithm, respectively. The total number of localization resources for the research use strategy was 2800 and the network coverage was 81%, which was 12% higher than the optimal strategy. The research use strategy had a shorter retrieval time and the model could recognize the vocabulary of the keywords. Finally, the model used by the research also had a good model processing capability when compared to other models. In summary, the new model built by the research can improve the data retrieval ability and data extraction ability of the web crawler, which provides new research ideas for future web information extraction.
融合TF-IDF和Word2Vec特征提取的网络爬行算法
随着互联网的发展,网络信息变得越来越多样化,如何高效地提取和抓取网络信息成为当前研究的重点。针对网络爬虫的数据提取和主题判断错误问题,本研究提出了一种基于文件逆频率算法和Word2Vec特征提取的新方法。该方法利用文件逆频率算法提高了网络爬虫的检索能力,并利用Word2Vec提取数据特征,提高了现有爬虫的数据提取能力。结果表明,研究使用模型的F1值分别比数字滤波算法高25.8%和26.2%。研究使用策略的本地化资源总数为2800个,网络覆盖率为81%,比最优策略提高了12%。研究使用策略的检索时间较短,模型能够识别关键词的词汇。最后,与其他模型相比,本研究使用的模型也具有良好的模型处理能力。综上所述,研究建立的新模型可以提高网络爬虫的数据检索能力和数据提取能力,为未来的网络信息提取提供了新的研究思路。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Journal of Web Engineering
Journal of Web Engineering 工程技术-计算机:理论方法
CiteScore
1.80
自引率
12.50%
发文量
62
审稿时长
9 months
期刊介绍: The World Wide Web and its associated technologies have become a major implementation and delivery platform for a large variety of applications, ranging from simple institutional information Web sites to sophisticated supply-chain management systems, financial applications, e-government, distance learning, and entertainment, among others. Such applications, in addition to their intrinsic functionality, also exhibit the more complex behavior of distributed applications.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信