Fast Information Retrieval and Social Network Mining via Cosine Similarity Upper Bound
Weizhong Zhao, M. VenkataSwamy, Gang Chen, Xiaowei Xu
2013 International Conference on Social Computing
Published: 2013-09-08
DOI: 10.1109/SocialCom.2013.147
Citations: 3
Abstract
Similarity search is a key function in many applications, including databases, pattern recognition, and recommendation systems. In this paper, we first propose ε-query, a similarity search based on the popular cosine similarity, for information retrieval and social network analysis. In contrast to traditional similarity search, an ε-query returns all results whose cosine similarity with the query is larger than a threshold ε. The major contribution of this paper is an efficient ε-query processing algorithm that uses an upper bound on cosine similarity for binary data. Our evaluation on two of the largest publicly available real datasets, ClueWeb09 and Twitter, demonstrated that the proposed method can achieve a speedup of several orders of magnitude over the traditional approach. Finally, we applied the proposed method to information retrieval from ClueWeb and to finding community structures in Twitter. The outcomes further confirmed the effectiveness of the proposed method.
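To make the idea of an ε-query over binary data concrete, the sketch below filters candidates with a cheap upper bound before computing the exact cosine similarity. The abstract does not state the bound the paper uses, so this example substitutes a standard size-based bound as an assumption: for binary vectors represented as sets, the overlap cannot exceed the smaller set, hence cos(q, x) ≤ sqrt(min(|q|, |x|) / max(|q|, |x|)). The function names (`cosine_binary`, `size_upper_bound`, `epsilon_query`) and the dictionary-of-sets layout are hypothetical, chosen only for illustration.

```python
from math import sqrt


def cosine_binary(a: frozenset, b: frozenset) -> float:
    """Exact cosine similarity of two binary vectors stored as sets of nonzero indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def size_upper_bound(na: int, nb: int) -> float:
    """Upper bound on cosine similarity that needs only the two set sizes.

    Assumption: this is a generic size-based bound, not necessarily the
    bound proposed in the paper. It holds because |a & b| <= min(na, nb).
    """
    if na == 0 or nb == 0:
        return 0.0
    return sqrt(min(na, nb) / max(na, nb))


def epsilon_query(query: frozenset, items: dict, eps: float) -> list:
    """Return (key, similarity) pairs whose cosine similarity with query exceeds eps."""
    results = []
    for key, item in items.items():
        # If even the upper bound does not exceed eps, the exact
        # similarity cannot either, so the intersection is skipped.
        if size_upper_bound(len(query), len(item)) <= eps:
            continue
        sim = cosine_binary(query, item)
        if sim > eps:
            results.append((key, sim))
    return results
```

Because the bound depends only on set sizes, items could also be sorted or bucketed by size so that whole ranges are skipped without being visited; the loop above keeps the simpler per-item check.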