What's in a Domain? Analysis of URL Features

John Hawkins
Published in Artificial Intelligence, NLP, Data Science and Cloud Computing Technology, 2023-08-19. DOI: 10.5121/csit.2023.131409 (https://doi.org/10.5121/csit.2023.131409)

Abstract

Many data science problems require processing log data derived from web pages, APIs, or other internet traffic sources. URLs are one of the few ubiquitous data fields that describe internet activity, so a wide variety of machine learning applications depend on processing them effectively. While URLs are structurally rich, that structure can be both domain-specific and subject to change over time, making feature engineering for URLs an ongoing challenge. In this research we outline the key structural components of URLs and discuss the information available within each. We describe methods for generating features from these URL components and share an open-source implementation of these ideas. In addition, we describe a method for exploring URL feature importance that allows the information available inside URLs to be compared and analysed. We experiment with a collection of URL classification datasets and demonstrate the utility of these tools. The package and source code are available at https://pypi.org/project/url2features.
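To make the idea of component-wise URL features concrete, the following is a minimal sketch that uses only the Python standard library. It is not the url2features API: the function name `url_features` and the feature names are hypothetical, chosen for illustration, and the "tld" value is a naive last-label guess rather than a public-suffix lookup.

```python
from urllib.parse import urlsplit, parse_qs


def url_features(url: str) -> dict:
    """Split a URL into structural components and derive simple features.

    Hypothetical helper for illustration only; not part of url2features.
    """
    parts = urlsplit(url)

    # Host component: lower-cased hostname without port or credentials.
    host = parts.hostname or ""
    labels = host.split(".") if host else []

    # Path component: non-empty segments between slashes.
    path_segments = [seg for seg in parts.path.split("/") if seg]

    # Query component: parameter names mapped to lists of values.
    query = parse_qs(parts.query, keep_blank_values=True)

    return {
        "scheme": parts.scheme,
        "host": host,
        # Naive assumption: the last host label stands in for the TLD.
        "tld": labels[-1] if labels else "",
        "host_depth": len(labels),
        "path_depth": len(path_segments),
        "query_param_count": len(query),
        "has_fragment": bool(parts.fragment),
        "url_length": len(url),
        "digit_ratio": sum(c.isdigit() for c in url) / len(url) if url else 0.0,
    }


if __name__ == "__main__":
    example = "https://pypi.org/project/url2features/?tab=files#top"
    for name, value in url_features(example).items():
        print(f"{name}: {value}")
```

Each feature is tied to a single component (scheme, host, path, query, fragment), so a downstream importance analysis can attribute predictive signal to the part of the URL it came from.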