Practical OSINT investigation - Similarity calculation using Reddit user profile data

IS&T International Symposium on Electronic Imaging. Pub Date: 2023-01-16. DOI: 10.2352/ei.2023.35.3.mobmu-356

Valeria Vishnevskaya, Klaus Schwarz, Reiner Creutzburg

Citations: 0

Abstract

This paper presents a practical Open Source Intelligence (OSINT) use case: measuring user similarity from open profile data on the Reddit social network. This proof-of-concept (PoC) work combines open Reddit data with part of the state-of-the-art BERT model. Using the PRAW Python library, the project fetches users' comments and posts. These texts are then converted into a feature vector that represents all of a user's posts and comments. The main idea is to create a comparable similarity score for each pair of users based on their comments and posts. For example, if we fix one user and calculate the scores of all pairs that user forms with other users, we obtain a total order on the set of those pairs. This total order can be read as a degree of written similarity to the chosen user. The set of "similar" users for a particular user can be used to recommend people who may interest that user. The similarity score also has a "transitive property": if $user_1$ is "similar" to $user_2$ and $user_2$ is "similar" to $user_3$, then the inner properties of our model guarantee that $user_1$ and $user_3$ are fairly "similar" too. In this way, the score can be used to cluster a set of users into sets of "similar" users. It could be used in recommendation algorithms, or to tune existing algorithms to take a cluster's peculiarities into account. The model can also be extended to calculate feature vectors for subreddits. In that way, we can find subreddits similar to a user and recommend them to that user.
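The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the similarity metric or the pooling method, so cosine similarity and mean pooling are assumptions here, and the function names (`mean_vector`, `cosine_similarity`, `rank_by_similarity`) and the toy three-dimensional vectors are hypothetical stand-ins for BERT-derived embeddings of texts fetched with PRAW.

```python
import math


def mean_vector(vectors):
    """Average equal-length text embeddings into one profile vector (assumed pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumed metric); 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(target, profiles):
    """Return (user, score) pairs for every other user, most similar first."""
    scores = [
        (name, cosine_similarity(profiles[target], vector))
        for name, vector in profiles.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy embeddings: each inner list stands in for the embedding of one post or comment.
profiles = {
    "user_1": mean_vector([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    "user_2": mean_vector([[0.7, 0.3, 0.0]]),
    "user_3": mean_vector([[0.0, 0.0, 1.0]]),
}

# Fixing user_1 and scoring every pair it forms with other users yields the ordering.
ranking = rank_by_similarity("user_1", profiles)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Under this assumed metric, the "transitive property" takes the form of a bound: the angle between the vectors of $user_1$ and $user_3$ is at most the sum of the angles between $user_1$ and $user_2$ and between $user_2$ and $user_3$, so two small angles imply a third that is also small.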
