Designing a Modular and Distributed Web Crawler Focused on Unstructured Cybersecurity Intelligence
Don Jenkins, L. Liebrock, V. Urias
2021 International Carnahan Conference on Security Technology (ICCST), published 2021-10-11
DOI: 10.1109/ICCST49569.2021.9717379 (https://doi.org/10.1109/ICCST49569.2021.9717379)
Citation count: 1
Abstract
Cybersecurity-related information available on the Internet has many use cases. Tasks involving natural language processing and machine learning require large amounts of structured, labeled data. However, recent data is in short supply because it is difficult to retrieve, sanitize, and label. Data on the Internet is generally diverse and unstructured, and storing it in a form that is readily usable for research and development is not straightforward. We propose architectural considerations for developing a distributed system consisting of web crawlers, web scrapers, and various post-processing components, along with possible implementations of these considerations. Our team developed such a system, which applies structure to open source intelligence data from the Internet and stores it in Splunk, an easily searchable software platform.
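
To make the crawler, scraper, and post-processing split concrete, the following is a minimal single-process sketch in Python, not the authors' implementation. Every name in it (`Record`, `scrape`, `postprocess`, `crawl`, `FAKE_SITE`) is hypothetical, the record schema is an assumption, and the `sink` callable merely stands in for whatever indexes events into a searchable platform such as Splunk. In a distributed deployment the in-memory frontier and sink would presumably be replaced by shared queues; that part is not shown.

```python
import json
from collections import deque
from dataclasses import asdict, dataclass
from html.parser import HTMLParser
from typing import Callable, List, Optional
from urllib.parse import urljoin


@dataclass
class Record:
    """Structured form of one scraped page (hypothetical schema)."""
    url: str
    title: str
    text: str


class _PageParser(HTMLParser):
    """Collects the title, visible text, and outgoing links of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []
        self.title_parts: List[str] = []
        self.text_parts: List[str] = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth:
            self.text_parts.append(data)


def scrape(url: str, html: str):
    """Scraper stage: turn raw HTML into a Record plus absolute links."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    record = Record(url, "".join(parser.title_parts), " ".join(parser.text_parts))
    return record, [urljoin(url, link) for link in parser.links]


def postprocess(record: Record) -> Record:
    """Post-processing stage: collapse runs of whitespace in title and text."""
    return Record(record.url, " ".join(record.title.split()), " ".join(record.text.split()))


def crawl(seed: str, fetch: Callable[[str], Optional[str]],
          sink: Callable[[str], None], max_pages: int = 10) -> int:
    """Crawler stage: breadth-first traversal from seed, bounded by max_pages."""
    frontier = deque([seed])
    seen = {seed}
    processed = 0
    while frontier and processed < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:  # page unavailable: skip it
            continue
        record, links = scrape(url, html)
        sink(json.dumps(asdict(postprocess(record))))  # one JSON event per page
        processed += 1
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return processed


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.test/": "<html><head><title> Advisories </title></head>"
                            "<body><p>Index</p><a href='/a'>A</a><a href='/b'>B</a></body></html>",
    "http://example.test/a": "<title>A</title><p>Patch   released</p><a href='/'>home</a>",
}

if __name__ == "__main__":
    events: List[str] = []
    crawl("http://example.test/", FAKE_SITE.get, events.append)
    for event in events:
        print(event)
```

Passing `fetch` and `sink` in as callables keeps each stage replaceable on its own, which is one way to read the modularity the abstract describes: a real fetcher or a real indexing client could be swapped in without touching the scraping or post-processing code.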