{"title":"Project Success Prediction in Crowdfunding Environments","authors":"Yan Li, Vineeth Rakesh, C. Reddy","doi":"10.1145/2835776.2835791","DOIUrl":"https://doi.org/10.1145/2835776.2835791","url":null,"abstract":"Crowdfunding has gained widespread attention in recent years. Despite the huge success of crowdfunding platforms, the percentage of projects that succeed in achieving their desired goal amount is only around 40%. Moreover, many of these crowdfunding platforms follow \"all-or-nothing\" policy which means the pledged amount is collected only if the goal is reached within a certain predefined time duration. Hence, estimating the probability of success for a project is one of the most important research challenges in the crowdfunding domain. To predict the project success, there is a need for new prediction models that can potentially combine the power of both classification (which incorporate both successful and failed projects) and regression (for estimating the time for success). In this paper, we formulate the project success prediction as a survival analysis problem and apply the censored regression approach where one can perform regression in the presence of partial information. We rigorously study the project success time distribution of crowdfunding data and show that the logistic and log-logistic distributions are a natural choice for learning from such data. We investigate various censored regression models using comprehensive data of 18K Kickstarter (a popular crowdfunding platform) projects and 116K corresponding tweets collected from Twitter. We show that the models that take complete advantage of both the successful and failed projects during the training phase will perform significantly better at predicting the success of future projects compared to the ones that only use the successful projects. 
We provide a rigorous evaluation on many sets of relevant features and show that adding a few temporal features that are obtained at the project's early stages can dramatically improve the performance.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76904034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Predictive Power of Massive Data about our Fine-Grained Behavior","authors":"F. Provost","doi":"10.1145/2835776.2835846","DOIUrl":"https://doi.org/10.1145/2835776.2835846","url":null,"abstract":"What really is it about \"big data\" that makes it different from traditional data? In this talk I illustrate one important aspect: massive ultra-fine-grained data on individuals' behaviors holds remarkable predictive power. I examine several applications to marketing-related tasks, showing how machine learning methods can extract the predictive power and how the value of the data \"asset\" seems different from the value of traditional data used for predictive modeling. I then dig deeper into explaining the predictions made from massive numbers of fine-grained behaviors by applying a counter-factual framework for explaining model behavior based on treating the individual behaviors as evidence that is combined by the model. This analysis shows that the fine-grained behavior data incorporate various sorts of information that we traditionally have sought to capture by other means. For example, for marketing modeling the behavior data effectively incorporate demographics, psychographics, category interest, and purchase intent. Finally, I discuss the flip side of the coin: the remarkable predictive power based on fine-grained information on individuals raises new privacy concerns. In particular, I discuss privacy concerns based on inferences drawn about us (in contrast to privacy concerns stemming from violations to data confidentiality). The evidence counterfactual approach used to explain the predictions also can be used to provide online consumers with transparency into the reasons why inferences are drawn about them. 
In addition, it offers the possibility to design novel solutions such as a privacy-friendly \"cloaking device\" to inhibit inferences from being drawn based on particular behaviors.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79623044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Offline Political Systems by Mining Online Political Data","authors":"D. Lazer, Oren Tsur, Tina Eliassi-Rad","doi":"10.1145/2835776.2855112","DOIUrl":"https://doi.org/10.1145/2835776.2855112","url":null,"abstract":"\"Man is by nature a political animal\", as asserted by Aristotle. This political nature manifests itself in the data we produce and the traces we leave online. In this tutorial, we address a number of fundamental issues regarding mining of political data: What types of data could be considered political? What can we learn from such data? Can we use the data for prediction of political changes, etc? How can these prediction tasks be done efficiently? Can we use online socio-political data in order to get a better understanding of our political systems and of recent political changes? What are the pitfalls and inherent shortcomings of using online data for political analysis? In recent years, with the abundance of data, these questions, among others, have gained importance, especially in light of the global political turmoil and the upcoming 2016 US presidential election. 
We introduce relevant political science theory, describe the challenges within the framework of computational social science and present state of the art approaches bridging social network analysis, graph mining, and natural language processing.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79076232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal Formation and Evolution of Online Communities","authors":"Hossein Fani","doi":"10.1145/2835776.2855089","DOIUrl":"https://doi.org/10.1145/2835776.2855089","url":null,"abstract":"Researchers have already studied the identification of online communities and the possible impact or influence relationships from several perspectives. For instance, communities of users that are formed based on shared relationships and topological similarities, or communities that consist of users that share similar content. However, little work has been done on detection of communities that simultaneously share topical and temporal similarities. Furthermore, these studies have not explored the causation relationship between the communities. Causation provides systematic explanation as to why communities are formed and helps to predict future communities. This proposal will address two main research questions: i) how can communities that share topical and temporal similarities be identified, and ii) how can causation relation between different online communities be detected and modelled. We model users' behaviour towards topics of interest through multivariate time series to identify like-minded communities. Further, we employ Granger's concept of causality to infer causation between detected communities from corresponding users' time series. Granger causality is the prominent approach in time series modelling and rests on a firm statistical foundation. 
We assess the proposed community detection methods through comparison with the state of the art and verify the causal model through its prediction accuracy.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"107 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81501166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop","authors":"Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, Walid G. Aref","doi":"10.1145/2835776.2835841","DOIUrl":"https://doi.org/10.1145/2835776.2835841","url":null,"abstract":"Despite the importance and widespread use of range data, e.g., time intervals, spatial ranges, etc., little attention has been devoted to studying the processing and querying of range data in the context of big data. The main challenge lies in the traditional index structures, e.g., B-Tree and R-Tree, which are centralized by nature and hence are almost crippled when deployed in a distributed environment. To address this challenge, this paper presents Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes the query execution time. Kangaroo is query workload-aware, i.e., it results in partitioning layouts that minimize the query processing time of given query patterns. In this paper, we study the design challenges Kangaroo addresses in order to be deployed on top of a distributed file system, i.e., HDFS. We also study four different partitioning schemes that Kangaroo can support. 
With extensive experiments using real range data of more than one billion records and real query workload of more than 30,000 queries, we show that the partitioning schemes of Kangaroo can significantly reduce the I/O of range queries on range data.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86230403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Querying and Tracking Influencers in Social Streams","authors":"Karthik Subbian, C. Aggarwal, J. Srivastava","doi":"10.1145/2835776.2835788","DOIUrl":"https://doi.org/10.1145/2835776.2835788","url":null,"abstract":"Influence analysis is an important problem in social network analysis due to its impact on viral marketing and targeted advertisements. Most of the existing influence analysis methods determine the influencers in a static network with an influence propagation model based on pre-defined edge propagation probabilities. However, none of these models can be queried to find influencers in a context- and time-sensitive fashion from streaming social data. In this paper, we propose an approach to maintain real-time influence scores of users in a social stream using a topic and time-sensitive approach, while the network and topics are constantly evolving over time. We show that our approach is efficient in terms of online maintenance and effective in terms of various types of real-time context- and time-sensitive queries. We evaluate our results on both social and collaborative network data sets.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"1101 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76744016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collaborative Denoising Auto-Encoders for Top-N Recommender Systems","authors":"Yao Wu, Christopher DuBois, A. Zheng, M. Ester","doi":"10.1145/2835776.2835837","DOIUrl":"https://doi.org/10.1145/2835776.2835837","url":null,"abstract":"Most real-world recommender services measure their performance based on the top-N results shown to the end users. Thus, advances in top-N recommendation have far-ranging consequences in practical applications. In this paper, we present a novel method, called Collaborative Denoising Auto-Encoder (CDAE), for top-N recommendation that utilizes the idea of Denoising Auto-Encoders. We demonstrate that the proposed model is a generalization of several well-known collaborative filtering models but with more flexible components. Thorough experiments are conducted to understand the performance of CDAE under various component settings. Furthermore, experimental results on several public datasets demonstrate that CDAE consistently outperforms state-of-the-art top-N recommendation methods on a variety of common evaluation metrics.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73517419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Past and Future of Systems for Current Events","authors":"Mor Naaman","doi":"10.1145/2835776.2835850","DOIUrl":"https://doi.org/10.1145/2835776.2835850","url":null,"abstract":"People share in social media an overwhelming amount of content from real-world events. These events range from major global events like an uprising or an earthquake, to local events and emergencies such as a fire or a parade; from media events like the Oscar's, to events that enjoy little media coverage such as a conference or a music concert. This shared media represents an important part of our society, culture and history. At the same time, this social media content is still fragmented across services, hard to find, and difficult to consume and understand.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"118 4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74597107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keynote Speaker Bio","authors":"Yiling Chen","doi":"10.1145/2835776.2835845","DOIUrl":"https://doi.org/10.1145/2835776.2835845","url":null,"abstract":"Chen has served as the Tutorial Chair of the ACM Conference on Electronic Commerce (EC), 2008, and on the Program Committee for the International World Wide Web Conference (WWW), 2008, and the International Workshop on Internet and Network Economics (WINE), 2008. She has co-organized the 2nd and the 3rd Workshops on Prediction Markets, 2007-2008. She has also been a reviewer for Management Science, Information Systems Research, Decision Support Systems, Information Systems and e-Business Management, and various conferences. Chen’s awards include Outstanding Paper Award, ACM Conference on Electronic Commerce (EC), 2008; Honorable Mention, Decision Science Institute Doctoral Dissertation Competition, 2006; and eBRC Doctoral Support Award, eBusiness Research Center at Penn State University, 2004.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74501704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crowdsourcing High Quality Labels with a Tight Budget","authors":"Qi Li, Fenglong Ma, Jing Gao, Lu Su, Christopher J. Quinn","doi":"10.1145/2835776.2835797","DOIUrl":"https://doi.org/10.1145/2835776.2835797","url":null,"abstract":"In the past decade, commercial crowdsourcing platforms have revolutionized the ways of classifying and annotating data, especially for large datasets. Obtaining labels for a single instance can be inexpensive, but for large datasets, it is important to allocate budgets wisely. With limited budgets, requesters must trade-off between the quantity of labeled instances and the quality of the final results. Existing budget allocation methods can achieve good quantity but cannot guarantee high quality of individual instances under a tight budget. However, in some scenarios, requesters may be willing to label fewer instances but of higher quality. Moreover, they may have different requirements on quality for different tasks. To address these challenges, we propose a flexible budget allocation framework called Requallo. Requallo allows requesters to set their specific requirements on the labeling quality and maximizes the number of labeled instances that achieve the quality requirement under a tight budget. The budget allocation problem is modeled as a Markov decision process and a sequential labeling policy is produced. The proposed policy greedily searches for the instance to query next as the one that can provide the maximum reward for the goal. The Requallo framework is further extended to consider worker reliability so that the budget can be better allocated. 
Experiments on two real-world crowdsourcing tasks as well as a simulated task demonstrate that when the budget is tight, the proposed Requallo framework outperforms existing state-of-the-art budget allocation methods from both quantity and quality aspects.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84581041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}