{"title":"PRED","authors":"Q. Yuan, Wei Zhang, Chao Zhang, Xinhe Geng, Gao Cong, Jiawei Han","doi":"10.1145/3018661.3018680","DOIUrl":"https://doi.org/10.1145/3018661.3018680","url":null,"abstract":"The availability of massive geo-annotated social media data sheds light on studying human mobility patterns. Among them, the periodic pattern, i.e., an individual visiting a geographical region with some specific time interval, has been recognized as one of the most important. Mining periodic patterns has a variety of applications, such as location prediction, anomaly detection, and location- and time-aware recommendation. However, it is a challenging task: the regions of a person and the periods of each region are both unknown. The interdependency between them makes the task even harder. Hence, existing methods are far from satisfactory for detecting periodic patterns from the low-sampling and noisy social media data. We propose a Bayesian non-parametric model, named Periodic REgion Detection (PRED), to discover periodic mobility patterns by jointly modeling the geographical and temporal information. Our method differs from previous studies in that it is non-parametric and thus does not require prior knowledge about an individual's mobility (e.g., number of regions, period length, region size). Meanwhile, it models the time gap between two consecutive records rather than the exact visit time, making it less sensitive to data noise. Extensive experimental results on both synthetic and real-world datasets show that PRED outperforms the state-of-the-art methods significantly in four tasks: periodic region discovery, outlier movement finding, period detection, and location prediction.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117110768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Related Event Discovery","authors":"Cheng Li, Michael Bendersky, Vijay Garg, Sujith Ravi","doi":"10.1145/3018661.3018713","DOIUrl":"https://doi.org/10.1145/3018661.3018713","url":null,"abstract":"We consider the problem of discovering local events on the web, where events are entities extracted from webpages. Examples of such local events include small venue concerts, farmers markets, sports activities, etc. Given an event entity, we propose a graph-based framework for retrieving a ranked list of related events that a user is likely to be interested in attending. Due to the difficulty of obtaining ground-truth labels for event entities, which are temporal and are constrained by location, our retrieval framework is unsupervised, and its graph-based formulation addresses (a) the challenge of feature sparseness and noisiness, and (b) the semantic mismatch problem in a self-contained and principled manner. To validate our methods, we collect human annotations and conduct a comprehensive empirical study, analyzing the performance of our methods with regard to relevance, recall, and diversity. This study shows that our graph-based framework is significantly better than any individual feature source, and can be further improved with minimal supervision.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127167747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synthesis of Forgiving Data Extractors","authors":"Adi Omari, Sharon Shoham, Eran Yahav","doi":"10.1145/3018661.3018740","DOIUrl":"https://doi.org/10.1145/3018661.3018740","url":null,"abstract":"We address the problem of synthesizing a robust data-extractor from a family of websites that contain the same kind of information. This problem is common when trying to aggregate information from many web sites, for example, when extracting information for a price-comparison site. Given a set of example annotated web pages from multiple sites in a family, our goal is to synthesize a robust data extractor that performs well on all sites in the family (not only on the provided example pages). The main challenge is the need to trade off precision for generality and robustness. Our key contribution is the introduction of forgiving extractors that dynamically adjust their precision to handle structural changes, without sacrificing precision on the training set. Our approach uses decision tree learning to create a generalized extractor and converts it into a forgiving extractor, in the form of an XPath query. The forgiving extractor captures a series of pruned decision trees with monotonically decreasing precision, and monotonically increasing recall, and dynamically adjusts precision to guarantee sufficient recall. We have implemented our approach in a tool called TREEX and applied it to synthesize extractors for real-world large scale web sites. We evaluate the robustness and generality of the forgiving extractors by evaluating their precision and recall on: (i) different pages from sites in the training set, (ii) pages from different versions of sites in the training set, and (iii) pages from different (unseen) sites. We compare the results of our synthesized extractor to those of classifier-based extractors, and pattern-based extractors, and show that TREEX significantly improves extraction accuracy.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"694 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122024924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigation of User Search Behavior While Facing Heterogeneous Search Services","authors":"Xin Li, Yiqun Liu, Rongjie Cai, Shaoping Ma","doi":"10.1145/3018661.3018673","DOIUrl":"https://doi.org/10.1145/3018661.3018673","url":null,"abstract":"With Web users' search tasks becoming increasingly complex, a single information source cannot necessarily satisfy their information needs. Searchers may rely on heterogeneous sources to complete their tasks, such as search engines, Community Question Answering (CQA), encyclopedia sites, and crowdsourcing platforms. Previous works focus on interaction behaviors with federated search results, including how to compose a federated Web search result page and what factors affect users' interaction behavior on aggregated search interfaces. However, little is known about which factors are crucial in determining users' search outcomes while facing multiple heterogeneous search services. In this paper, we design a lab-based user study to analyze what explicit and implicit factors affect search outcomes (information gain and user satisfaction) when users have access to heterogeneous information sources. In the study, each participant can access three different kinds of search services: a general search engine (Bing), a general CQA portal (Baidu Knows), and a high-quality CQA portal (Zhihu). Using questionnaires and interaction log data, we extract explicit and implicit signals to analyze how users' search outcomes are correlated with their behaviors on different information sources. Experimental results indicate that users' search experiences on CQA portals (such as users' perceived usefulness and number of result clicks) positively affect search outcome (information gain), while search satisfaction is significantly correlated with some other factors such as users' familiarity, interest and difficulty of the task. Besides, users' search satisfaction can be more accurately predicted by the implicit factors than search outcomes.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122072916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Matrix Completion for Signed Link Prediction","authors":"Jing Wang, Jie Shen, Ping Li, Huan Xu","doi":"10.1145/3018661.3018681","DOIUrl":"https://doi.org/10.1145/3018661.3018681","url":null,"abstract":"This work studies the binary matrix completion problem underlying a large body of real-world applications such as signed link prediction and information propagation. That is, each entry of the matrix indicates a binary preference such as \"like\" or \"dislike\", \"trust\" or \"distrust\". However, the performance of existing matrix completion methods may be hindered owing to three practical challenges: 1) the observed data have binary labels (i.e., not real values); 2) the data are typically sampled non-uniformly (i.e., positive links dominate the negative ones) and 3) a network may have a huge volume of data (i.e., memory and computational issues). In order to remedy these problems, we propose a novel framework which (i) maximizes the resemblance between predicted and observed matrices as well as penalizing the logistic loss to fit the binary data to produce binary estimates; (ii) constrains the matrix max-norm and maximizes the F-score to handle non-uniformness and (iii) presents an online optimization technique, hence mitigating the memory cost. Extensive experiments performed on four large-scale datasets with up to hundreds of thousands of users demonstrate the superiority of our framework over the state-of-the-art matrix completion based methods and popular link prediction approaches.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126787015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting and Characterizing Eating-Disorder Communities on Social Media","authors":"Tao Wang, M. Brede, Antonella Ianni, E. Mentzakis","doi":"10.1145/3018661.3018706","DOIUrl":"https://doi.org/10.1145/3018661.3018706","url":null,"abstract":"Eating disorders are complex mental disorders and responsible for the highest mortality rate among mental illnesses. Recent studies reveal that user-generated content on social media provides useful information in understanding these disorders. Most previous studies focus on studying communities of people who discuss eating disorders on social media, while few studies have explored community structures and interactions among individuals who suffer from this disease over social media. In this paper, we first develop a snowball sampling method to automatically gather individuals who self-identify as eating disordered in their profile descriptions, as well as their social network connections with one another on Twitter. Then, we verify the effectiveness of our sampling method by: 1. quantifying differences between the sampled eating disordered users and two sets of reference data collected for non-disordered users in social status, behavioral patterns and psychometric properties; 2. building predictive models to classify eating disordered and non-disordered users. Finally, leveraging the data of social connections between eating disordered individuals on Twitter, we present the first homophily study among eating-disorder communities on social media. Our findings shed new light on how an eating-disorder community develops on social media.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129075232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Directed Edge Recommender System","authors":"Ios Kotsogiannis, E. Zheleva, Ashwin Machanavajjhala","doi":"10.1145/3018661.3018729","DOIUrl":"https://doi.org/10.1145/3018661.3018729","url":null,"abstract":"Recommender systems have become ubiquitous in online applications where companies personalize the user experience based on explicit or inferred user preferences. Most modern recommender systems concentrate on finding relevant items for each individual user. In this paper, we describe the problem of directed edge recommendations where the system recommends the best item that a user can gift, share or recommend to another user that he/she is connected to. We propose algorithms that utilize the preferences of both the sender and the recipient by integrating individual user preference models (e.g., based on items each user purchased for themselves) with models of sharing preferences (e.g., gift purchases for others) into the recommendation process. We compare our work to group recommender systems and social network edge labeling, showing that incorporating the task context leads to more accurate recommendations.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129184796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural Survival Recommender","authors":"How Jing, Alex Smola","doi":"10.1145/3018661.3018719","DOIUrl":"https://doi.org/10.1145/3018661.3018719","url":null,"abstract":"The ability to predict future user activity is invaluable when it comes to content recommendation and personalization. For instance, knowing when users will return to an online music service and what they will listen to increases user satisfaction and therefore user retention. We present a model based on Long-Short Term Memory to estimate when a user will return to a site and what their future listening behavior will be. In doing so, we aim to solve the problem of Just-In-Time recommendation, that is, to recommend the right items at the right time. We use tools from survival analysis for return time prediction and exponential families for future activity analysis. We show that the resulting multitask problem can be solved accurately, when applied to two real-world datasets.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127908544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Air Travel Choice Behavior with Mixed Kernel Density Estimations","authors":"Zhenni Feng, Yanmin Zhu, Jian Cao","doi":"10.1145/3018661.3018671","DOIUrl":"https://doi.org/10.1145/3018661.3018671","url":null,"abstract":"Understanding air travel choice behavior of air passengers is of great significance for various purposes such as travel demand prediction and trip recommendation. Existing approaches based on surveys can only provide aggregate level air travel choice behavior of passengers and they fail to provide comprehensive information for personalized services. In this paper we focus on modeling individual level air travel choice behavior of passengers, which is valuable for recommendations and personalized services. We employ a probabilistic model to represent individual level air travel choice behavior based on a large dataset of historical booking records, leveraging several key factors, such as takeoff time, arrival time, elapsed time between reservation and takeoff, price, and seat class. However, each passenger has only a limited number of historical booking records, causing a serious data sparsity problem. To this end, we propose a mixed kernel density estimation (mix-KDE) approach for each passenger with a mixture model that combines probabilistic estimation of both regularity of the individual himself and social conformity of similar passengers. The proposed model is trained and evaluated via the expectation-maximization (EM) algorithm with a huge dataset of booking records of over 10 million air passengers from a popular online travel agency in China. Experimental results demonstrate that our mix-KDE approach outperforms the Gaussian mixture model (GMM) and the simple kernel density estimation in the presence of the sparsity issue.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129973455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Embedding of Embedding (EOE): Joint Embedding for Coupled Heterogeneous Networks","authors":"Linchuan Xu, Xiaokai Wei, Jiannong Cao, Philip S. Yu","doi":"10.1145/3018661.3018723","DOIUrl":"https://doi.org/10.1145/3018661.3018723","url":null,"abstract":"Network embedding is increasingly employed to assist network analysis as it is effective to learn latent features that encode linkage information. Various network embedding methods have been proposed, but they are only designed for a single network scenario. In the era of big data, different types of related information can be fused together to form a coupled heterogeneous network, which consists of two different but related sub-networks connected by inter-network edges. In this scenario, the inter-network edges can act as complementary information in the presence of intra-network ones. This complementary information is important because it can make latent features more comprehensive and accurate. And it is more important when the intra-network edges are absent, which can be referred to as the cold-start problem. In this paper, we thus propose a method named embedding of embedding (EOE) for coupled heterogeneous networks. In the EOE, latent features encode not only intra-network edges, but also inter-network ones. To tackle the challenge of heterogeneities of two networks, the EOE incorporates a harmonious embedding matrix to further embed the embeddings that only encode intra-network edges. Empirical experiments on a variety of real-world datasets demonstrate that the EOE consistently outperforms single-network embedding methods in applications including visualization, link prediction, multi-class classification, and multi-label classification.","PeriodicalId":344017,"journal":{"name":"Proceedings of the Tenth ACM International Conference on Web Search and Data Mining","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129311664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}