A Hierarchical Database of One Million Websites
Jack B. Harrison, Joseph R. Harrison, Madison G. Boswell, Alan J. Michaels
2022 IEEE Secure Development Conference (SecDev), October 2022
DOI: 10.1109/SecDev53368.2022.00025
Citations: 0
Abstract
As part of a broader cyber-policy experiment on the Use and Abuse of Personal Information, we are seeking efficient methods to generate a hierarchical and malleable database of over one million websites for use in a future large-scale, semi-automated establishment of fake online accounts. Available directories of reputable Internet sites are often incomplete, outdated, or poorly categorized. This paper describes the design and challenges associated with a custom web scraper that refines Curlie [1], an online repository of websites, into a concise, readable format. The scraper crawls Curlie recursively and in a distributed fashion for unique URLs and plain-text names and parses them into our database. We will use the hierarchy functionality of this new database to answer future research questions focused on website stewardship of personal information (PI). This data normalization challenge is one of many we have encountered in the larger open-source intelligence (OSINT) Use and Abuse (U&A) collection framework.
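The abstract describes a scraper that recursively walks Curlie's category hierarchy, collecting unique site URLs and plain-text names into a hierarchical database. The paper does not publish its implementation, but the recursive-descent idea can be sketched roughly as below. This is a minimal illustration, not the authors' code: the HTML shape (external site links vs. internal category links), the `fetch` callable, and the nested-dict "database" are all assumptions made for the example.

```python
from html.parser import HTMLParser


class CategoryPageParser(HTMLParser):
    """Parse one directory category page.

    Collects (url, name) pairs for external site listings and relative
    paths for subcategory links. The markup conventions assumed here
    (external links start with "http", subcategory links with "/") are
    illustrative, not Curlie's actual page structure.
    """

    def __init__(self):
        super().__init__()
        self.sites = []          # (url, plain-text name) pairs on this page
        self.subcategories = []  # relative paths of child categories
        self._href = None
        self._capture = False    # True while inside an external <a> tag

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if href.startswith("http"):      # external site listing
            self._href = href
            self._capture = True
        elif href.startswith("/"):       # internal subcategory link
            self.subcategories.append(href)

    def handle_data(self, data):
        if self._capture and data.strip():
            self.sites.append((self._href, data.strip()))
            self._capture = False

    def handle_endtag(self, tag):
        if tag == "a":
            self._capture = False


def crawl(fetch, path, tree, seen=None):
    """Recursively populate a nested-dict hierarchy of categories.

    `fetch` maps a category path to its HTML; in a real scraper this
    would be an HTTP request (and, per the paper, the recursion would
    be distributed across workers). `seen` deduplicates URLs globally
    so each site appears only once in the database.
    """
    if seen is None:
        seen = set()
    parser = CategoryPageParser()
    parser.feed(fetch(path))
    unique = [(u, n) for u, n in parser.sites if u not in seen]
    seen.update(u for u, _ in unique)
    node = tree.setdefault(path, {"sites": unique, "children": {}})
    for sub in parser.subcategories:
        crawl(fetch, sub, node["children"], seen)
```

A driver could seed the crawl at a top-level category and serialize the resulting tree; the nested dicts preserve the parent/child structure that the paper's "hierarchy functionality" relies on.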