Practical OSINT investigation - Similarity calculation using Reddit user profile data

Valeria Vishnevskaya, Klaus Schwarz, Reiner Creutzburg
IS&T International Symposium on Electronic Imaging. Published 2023-01-16. DOI: 10.2352/ei.2023.35.3.mobmu-356 (https://doi.org/10.2352/ei.2023.35.3.mobmu-356)

Abstract

This paper presents a practical Open Source Intelligence (OSINT) use case: measuring user similarity from open profile data on the Reddit social network. This proof-of-concept (PoC) work combines open data from Reddit with part of the state-of-the-art BERT model. Using the PRAW Python library, the project fetches users' comments and posts. These texts are then converted into a feature vector that represents all of a user's posts and comments. The main idea is to compute a comparable similarity score for each pair of users based on their comments and posts. For example, if we fix one user and calculate the scores of all pairs formed with other users, we obtain a total order on the set of those pairs. This total order can be read as the degree of written similarity to the chosen user. The set of "similar" users for a particular user can be used to recommend people who may interest that user. The similarity score also has a "transitive property": if $user_1$ is "similar" to $user_2$ and $user_2$ is "similar" to $user_3$, then the inner properties of our model guarantee that $user_1$ and $user_3$ are fairly "similar" too. The score can therefore be used to cluster a set of users into groups of "similar" users. It could be used in recommendation algorithms or to tune existing algorithms to take a cluster's peculiarities into account. The model can also be extended to calculate feature vectors for subreddits. In that way, we can find subreddits similar to a user and recommend them to that user.
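The ranking step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the similarity measure or the pooling strategy, so cosine similarity and mean pooling are assumed here, and the three-dimensional toy vectors stand in for the BERT-derived vectors of texts that the paper fetches with PRAW. The names `mean_pool`, `cosine_similarity`, and `rank_similar_users` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average per-text vectors into one user-level feature vector.

    Assumption: mean pooling; the abstract does not say how texts are combined.
    """
    if not vectors:
        raise ValueError("need at least one text vector per user")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b.

    Assumption: cosine similarity is used as the pair score.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def rank_similar_users(
    target: str, user_vectors: Dict[str, Vector]
) -> List[Tuple[str, float]]:
    """Score every other user against `target`, most similar first."""
    target_vec = user_vectors[target]
    scores = [
        (name, cosine_similarity(target_vec, vec))
        for name, vec in user_vectors.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy per-text vectors (hypothetical stand-ins for BERT outputs).
text_vectors = {
    "user_1": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "user_2": [[0.9, 0.1, 0.0]],
    "user_3": [[0.0, 0.0, 1.0], [0.0, 0.2, 0.8]],
}

user_vectors = {name: mean_pool(vecs) for name, vecs in text_vectors.items()}
ranking = rank_similar_users("user_1", user_vectors)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Fixing `user_1` and sorting the pair scores gives the ordering the abstract refers to: in this toy data `user_2` ranks ahead of `user_3`. The same user-level vectors could be passed to a clustering routine, and a subreddit vector built the same way could be ranked against a user vector.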