A topic-specific web crawler using deep convolutional networks

Saed Alqaraleh, Hatice Meltem Nergiz Sirin
DOI: 10.34028/iajit/20/3/3
Journal: Int. Arab J. Inf. Technol., 8(1), pp. 310-318, 2023-01-01
Citations: 0

Abstract

This paper presents a new focused crawler that efficiently supports the Turkish language. The architecture is divided into multiple units: a control unit, crawler unit, link extractor unit, link sorter unit, and natural language processing unit. The crawler's units work in parallel to process the massive number of published websites. In addition, the proposed Convolutional Neural Network (CNN) based natural language processing unit can accurately classify Turkish text and web pages. Extensive experiments on three datasets were performed to evaluate the developed approach. The first dataset contains 50,000 Turkish web pages downloaded by the developed crawler, while the other two are publicly available and consist of 28,567 and 22,431 Turkish web pages, respectively. In addition, the Vector Space Model (VSM) in general, and state-of-the-art word embedding techniques in particular, were investigated to find the most suitable representation for the Turkish language. Overall, the results indicate that the developed approach achieves good performance, robustness, and stability when processing the Turkish language. Bidirectional Encoder Representations from Transformers (BERT) was found to be the most appropriate embedding for building an efficient Turkish text classification system. Finally, our experiments showed that the developed natural language processing unit outperforms seven state-of-the-art CNN classification systems, with an accuracy improvement of 10% over the second-best system and 47% over the lowest-performing one.
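The abstract describes a link sorter unit that orders the crawl frontier by topic relevance, with relevance supplied by a CNN/BERT classifier. The sketch below illustrates only the priority-frontier idea: the `LinkSorterUnit` class, the `relevance` keyword heuristic, and the example URLs are hypothetical stand-ins (the paper's actual scorer is a trained neural classifier), chosen so the example stays self-contained and runnable.

```python
import heapq

# Hypothetical topic terms standing in for a trained CNN/BERT relevance
# model; a simple keyword-overlap heuristic keeps the sketch dependency-free.
TOPIC_TERMS = {"haber", "ekonomi", "spor"}

def relevance(text: str) -> float:
    """Fraction of words in `text` that match the topic vocabulary."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in TOPIC_TERMS for w in words) / len(words)

class LinkSorterUnit:
    """Priority frontier: the most topic-relevant links are crawled first."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = 0  # tie-breaker keeps insertion order stable

    def push(self, url: str, anchor_text: str) -> None:
        score = relevance(anchor_text)
        # heapq is a min-heap, so negate the score for max-first ordering.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        _, _, url = heapq.heappop(self._heap)
        return url

frontier = LinkSorterUnit()
frontier.push("https://example.com/a", "son dakika haber ekonomi")  # 2/4 relevant
frontier.push("https://example.com/b", "iletisim hakkinda")         # 0/2 relevant
frontier.push("https://example.com/c", "spor haber")                # 2/2 relevant
print(frontier.pop())  # → https://example.com/c (highest relevance)
```

In the full system, each unit (control, crawler, link extractor, link sorter, NLP) would run concurrently and the heuristic `relevance` would be replaced by the classifier's probability that a page belongs to the target topic.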