Clustering the Unknown - The Youtube Case

2019 International Conference on Computing, Networking and Communications (ICNC) Pub Date : 2019-02-18 DOI:10.1109/ICCNC.2019.8685364

A. Dvir, Angelos K. Marnerides, Ran Dubin, Nehor Golan


Citations: 7

Abstract


Clustering the Unknown - The Youtube Case

Recent stringent end-user security and privacy requirements have caused a dramatic rise in encrypted video streams, among which encrypted YouTube traffic is one of the most prevalent. Despite the encryption, metadata derived from such traffic flows can be used to identify the title of a video, enabling each video stream to be classified as a single video title from a given title set. Nonetheless, scenarios in which no video title set is available, and a supervised approach is therefore not feasible, are both frequent and challenging. In this paper we go beyond previous studies and demonstrate the feasibility of clustering unknown video streams into subgroups even though no information about the titles is available. We address this problem by exploring Natural Language Processing (NLP) formulations and Word2vec techniques to compose a novel statistical feature that is then used to cluster unknown video streams. Through experiments on real datasets we demonstrate that our methodology is capable of clustering 72 video titles out of 100 from a dataset of 10,000 video streams. Thus, we argue that the proposed methodology could make a meaningful contribution to the newly rising and demanding domain of encrypted Internet traffic classification.
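To make the idea of applying NLP formulations to traffic metadata more concrete, the following is a minimal sketch and not the authors' method. It assumes, purely for illustration, that each encrypted stream is summarized as a list of burst sizes in bytes, that each burst is bucketed into a discrete "word", and that streams are grouped by the cosine similarity of their word-count vectors. The Word2vec-based feature described in the abstract is not reproduced here; plain bag-of-words counts stand in for it so the example stays self-contained. The names `tokenize`, `cluster_streams`, `bin_width`, and `threshold`, as well as the sample data, are hypothetical.

```python
from collections import Counter
from math import sqrt


def tokenize(burst_sizes, bin_width=50_000):
    """Map each burst size (bytes) to a discrete 'word' by bucketing.

    The bucketing scheme and bin_width are assumptions for illustration.
    """
    return [f"b{size // bin_width}" for size in burst_sizes]


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_streams(streams, threshold=0.8, bin_width=50_000):
    """Greedy single-pass clustering without any title labels.

    Each stream joins the first cluster whose representative (that
    cluster's first member) is at least `threshold` similar; otherwise
    it starts a new cluster.
    """
    vectors = [Counter(tokenize(s, bin_width)) for s in streams]
    representatives = []  # index of the first stream in each cluster
    labels = []
    for i, vec in enumerate(vectors):
        for label, rep in enumerate(representatives):
            if cosine(vec, vectors[rep]) >= threshold:
                labels.append(label)
                break
        else:
            representatives.append(i)
            labels.append(len(representatives) - 1)
    return labels


# Made-up burst sequences: the first two and the last two resemble each other.
streams = [
    [120_000, 480_000, 130_000, 900_000],
    [125_000, 470_000, 135_000, 910_000],
    [300_000, 310_000, 305_000, 20_000],
    [302_000, 312_000, 308_000, 25_000],
]
labels = cluster_streams(streams)
print(labels)
```

In this toy setup the cluster labels carry no title information; they only say which streams look alike, which is the sense in which unknown streams can be grouped without a title set. A learned embedding such as Word2vec would replace the raw counts where nearby buckets should be treated as similar words.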


Source journal

2019 International Conference on Computing, Networking and Communications (ICNC)

Self-citation rate

0.00%

Publication count