Semantic crawling: An approach based on Named Entity Recognition

Giulia Di Pietro, C. Aliprandi, Antonio Ercole De Luca, Matteo Raffaelli, Tiziana Soru
Published in: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014)
Publication date: 2014-08-17
DOI: 10.1109/ASONAM.2014.6921661 (https://doi.org/10.1109/ASONAM.2014.6921661)
Citations: 9

Abstract

Law Enforcement Agencies (LEAs) are increasingly reliant on information and communication technologies and are affected by a society shaped by the Internet. The wealth of information available from open sources, if properly gathered and processed, can provide valuable intelligence and help in drawing inferences from existing closed source intelligence. Today the intelligence cycle is characterized by manual collection and integration of data. Named Entity Recognition (NER) plays a fundamental role in Open Source Intelligence (OSINT) solutions for fighting crime. This paper describes the implementation of a NER-based focused web crawler under the EU FP7 Security Research Project CAPER (Collaborative information, Acquisition, Processing, Exploitation and Reporting for the prevention of organized crime). The crawler supports two modes: (1) looking for documents starting from a URL down to a configurable depth, also specifying a keyword that must be contained in the page and in the related links; and (2) looking for a configurable number of documents starting from a keyword, entrusting the keyword search to one of the principal search engines and thus behaving as a meta-search engine. In addition, the crawler is able to retrieve only those documents that contain information semantically relevant to the query (in other words, the required keyword with the required sense). This is achieved through the use of NER technologies. In this paper we present the CAPER NER-based Semantic Crawler, which has been proven to be a suitable tool for focused crawling, allowing LEAs to drastically reduce data collection and integration efforts.
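
The first crawling mode described in the abstract (start from a URL, follow links down to a given depth, keep only pages where the keyword occurs with the required sense) can be illustrated with a minimal sketch. This is not the CAPER implementation, which the abstract does not detail. Everything named here is an assumption made for illustration: the `focused_crawl` and `toy_entity_type` functions, the in-memory `SITE` dictionary that stands in for real HTTP fetching, the `example.test` URLs, and the choice to follow links only from pages that contain the keyword. The regex-based tagger is a toy stand-in for a real NER component.

```python
import re
from collections import deque
from typing import Callable, List, Optional, Tuple

# Hypothetical page representation: (page text, outgoing links).
Page = Tuple[str, List[str]]


def toy_entity_type(text: str, keyword: str) -> str:
    """Toy stand-in for a NER model: guess the entity type of `keyword`
    from simple context cues. A real system would call a trained tagger."""
    kw = re.escape(keyword)
    # Keyword followed by a capitalized word, e.g. "Paris Hilton".
    if re.search(rf"\b{kw}\s+[A-Z][a-z]+", text):
        return "PERSON"
    # Keyword preceded by a locative preposition, e.g. "in Paris".
    if re.search(rf"\b(?:in|to|from)\s+{kw}\b", text):
        return "LOCATION"
    return "UNKNOWN"


def focused_crawl(
    fetch: Callable[[str], Optional[Page]],
    start_url: str,
    keyword: str,
    wanted_type: str,
    max_depth: int,
    tagger: Callable[[str, str], str] = toy_entity_type,
) -> List[str]:
    """Breadth-first crawl from `start_url` down to `max_depth` link levels.

    A page is kept only if it contains `keyword` and the tagger assigns the
    keyword the wanted entity type. Links are followed only from pages that
    contain the keyword (an assumption of this sketch)."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    hits: List[str] = []
    while queue:
        url, depth = queue.popleft()
        page = fetch(url)
        if page is None:
            continue  # unknown or unreachable page
        text, links = page
        if keyword not in text:
            continue  # keyword absent: neither keep nor expand
        if tagger(text, keyword) == wanted_type:
            hits.append(url)
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return hits


# Hypothetical in-memory "web" used instead of real network access.
SITE = {
    "http://example.test/a": ("Police travelled to Paris on Monday.",
                              ["http://example.test/b", "http://example.test/c"]),
    "http://example.test/b": ("Paris Hilton attended the event.",
                              ["http://example.test/d"]),
    "http://example.test/c": ("No relevant content here.",
                              ["http://example.test/e"]),
    "http://example.test/d": ("The suspects met in Paris last year.", []),
    "http://example.test/e": ("A warehouse in Paris was searched.", []),
}

if __name__ == "__main__":
    print(focused_crawl(SITE.get, "http://example.test/a", "Paris", "LOCATION", 2))
```

In this toy site, a depth-2 crawl for "Paris" as a LOCATION keeps pages `a` and `d`. Page `b` mentions the keyword with a different sense (a person), and page `e` is never reached because its only inbound link comes from a page without the keyword. Swapping `toy_entity_type` for a call into an actual NER library is the natural extension point.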