Using Naive Bayes to detect spammy names in social networks

Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security. Pub Date: 2013-11-04. DOI: 10.1145/2517312.2517314

D. Freeman


Citations: 53

Abstract

Many social networks are predicated on the assumption that a member's online information reflects his or her real identity. In such networks, members who fill their name fields with fictitious identities, company names, phone numbers, or just gibberish are violating the terms of service, polluting search results, and degrading the value of the site to real members. Finding and removing these accounts on the basis of their spammy names can both improve the site experience for real members and prevent further abusive activity. In this paper we describe a set of features that can be used by a Naive Bayes classifier to find accounts whose names do not represent real people. The model can detect both automated and human abusers and can be used at registration time, before other signals such as social graph or clickstream history are present. We use member data from LinkedIn to train and validate our model and to choose parameters. Our best-scoring model achieves AUC 0.85 on a sequestered test set. We ran the algorithm on live LinkedIn data for one month in parallel with our previous name scoring algorithm based on regular expressions. The false positive rate of our new algorithm (3.3%) was less than half that of the previous algorithm (7.0%). When the algorithm is run on email usernames as well as user-entered first and last names, it provides an effective way to catch not only bad human actors but also bots that have poor name and email generation algorithms.
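The abstract does not spell out the feature set, so the following is only a minimal sketch of the general idea: a multinomial Naive Bayes scorer over character n-grams of a name, with add-one smoothing. The choice of trigrams, the `^`/`$` boundary markers, the class name `NameNaiveBayes`, and the tiny training lists are all illustrative assumptions, not the paper's features, data, or parameters.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, boundary-padded string."""
    padded = "^" + text.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NameNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (label 1 = spam, 0 = real)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {0: Counter(), 1: Counter()}  # n-gram counts per class
        self.docs = {0: 0, 1: 0}                    # training names per class
        self.vocab = set()

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            grams = char_ngrams(name, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def log_odds(self, name):
        """Log of P(spam | name) / P(real | name); positive favors spam."""
        score = math.log(self.docs[1] / self.docs[0])  # class prior ratio
        v = len(self.vocab) + 1  # one extra slot for unseen n-grams
        totals = {c: sum(self.counts[c].values()) for c in (0, 1)}
        for gram in char_ngrams(name, self.n):
            p_spam = (self.counts[1][gram] + 1) / (totals[1] + v)
            p_real = (self.counts[0][gram] + 1) / (totals[0] + v)
            score += math.log(p_spam / p_real)
        return score


# Made-up toy data, for illustration only.
real_names = ["John Smith", "Maria Garcia", "Wei Chen", "Anna Kowalski"]
spam_names = ["asdfgh qwerty", "xxxx zzzz", "qwqwqw asdasd", "zxcvb jkjkjk"]

model = NameNaiveBayes(n=3).fit(
    real_names + spam_names,
    [0] * len(real_names) + [1] * len(spam_names),
)

for candidate in ["John Chen", "asdasd qwqw"]:
    print(candidate, round(model.log_odds(candidate), 2))
```

Because the score depends only on the name string, a scorer of this shape could in principle run at registration time, before any social-graph or clickstream signal exists. The same function could also be applied to an email username as an additional input, as the abstract suggests. A decision threshold on the log-odds would have to be chosen on held-out data.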

