Weinan Zhang, Amr Ahmed, Jie Yang, V. Josifovski, Alex Smola
{"title":"Annotating Needles in the Haystack without Looking: Product Information Extraction from Emails","authors":"Weinan Zhang, Amr Ahmed, Jie Yang, V. Josifovski, Alex Smola","doi":"10.1145/2783258.2788580","DOIUrl":"https://doi.org/10.1145/2783258.2788580","url":null,"abstract":"Business-to-consumer (B2C) emails are usually generated by filling structured user data (e.g.purchase, event) into templates. Extracting structured data from B2C emails allows users to track important information on various devices. However, it also poses several challenges, due to the requirement of short response time for massive data volume, the diversity and complexity of templates, and the privacy and legal constraints. Most notably, email data is legally protected content, which means no one except the receiver can review the messages or derived information. In this paper we first introduce a system which can extract structured information automatically without requiring human review of any personal content. Then we focus on how to annotate product names from the extracted texts, which is one of the most difficult problems in the system. Neither general learning methods, such as binary classifiers, nor more specific structure learning methods, suchas Conditional Random Field (CRF), can solve this problem well. To accomplish this task, we propose a hybrid approach, which basically trains a CRF model using the labels predicted by binary classifiers (weak learners). However, the performance of weak learners can be low, therefore we use Expectation Maximization (EM) algorithm on CRF to remove the noise and improve the accuracy, without the need to label and inspect specific emails. In our experiments, the EM-CRF model can significantly improve the product name annotations over the weak learners and plain CRFs.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131573872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"E-commerce in Your Inbox: Product Recommendations at Scale","authors":"Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan L. Bhamidipati, Jaikit Savla, Varun Bhagwan, Doug Sharp","doi":"10.1145/2783258.2788627","DOIUrl":"https://doi.org/10.1145/2783258.2788627","url":null,"abstract":"In recent years online advertising has become increasingly ubiquitous and effective. Advertisements shown to visitors fund sites and apps that publish digital content, manage social networks, and operate e-mail services. Given such large variety of internet resources, determining an appropriate type of advertising for a given platform has become critical to financial success. Native advertisements, namely ads that are similar in look and feel to content, have had great success in news and social feeds. However, to date there has not been a winning formula for ads in e-mail clients. In this paper we describe a system that leverages user purchase history determined from e-mail receipts to deliver highly personalized product ads to Yahoo Mail users. We propose to use a novel neural language-based algorithm specifically tailored for delivering effective product recommendations, which was evaluated against baselines that included showing popular products and products predicted based on co-occurrence. We conducted rigorous offline testing using a large-scale product purchase data set, covering purchases of more than 29 million users from 172 e-commerce websites. Ads in the form of product recommendations were successfully tested on online traffic, where we observed a steady 9% lift in click-through rates over other ad formats in mail, as well as comparable lift in conversion rates. Following successful tests, the system was launched into production during the holiday season of 2014.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132638428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maximum Likelihood Postprocessing for Differential Privacy under Consistency Constraints","authors":"Jaewoo Lee, Yue Wang, Daniel Kifer","doi":"10.1145/2783258.2783366","DOIUrl":"https://doi.org/10.1145/2783258.2783366","url":null,"abstract":"When analyzing data that has been perturbed for privacy reasons, one is often concerned about its usefulness. Recent research on differential privacy has shown that the accuracy of many data queries can be improved by post-processing the perturbed data to ensure consistency constraints that are known to hold for the original data. Most prior work converted this post-processing step into a least squares minimization problem with customized efficient solutions. While improving accuracy, this approach ignored the noise distribution in the perturbed data. In this paper, to further improve accuracy, we formulate this post-processing step as a constrained maximum likelihood estimation problem, which is equivalent to constrained L1 minimization. Instead of relying on slow linear program solvers, we present a faster generic recipe (based on ADMM) that is suitable for a wide variety of applications including differentially private contingency tables, histograms, and the matrix mechanism (linear queries). An added benefit of our formulation is that it can often take direct advantage of algorithmic tricks used by the prior work on least-squares post-processing. An extensive set of experiments on various datasets demonstrates that this approach significantly improve accuracy over prior work.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"169 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132775395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Lalmas, Janette Lehmann, G. Shaked, F. Silvestri, Gabriele Tolomei
{"title":"Promoting Positive Post-Click Experience for In-Stream Yahoo Gemini Users","authors":"M. Lalmas, Janette Lehmann, G. Shaked, F. Silvestri, Gabriele Tolomei","doi":"10.1145/2783258.2788581","DOIUrl":"https://doi.org/10.1145/2783258.2788581","url":null,"abstract":"Click-through rate (CTR) is the most common metric used to assess the performance of an online advert; another performance of an online advert is the user post-click experience. In this paper, we describe the method we have implemented in Yahoo Gemini to measure the post-click experience on Yahoo mobile news streams via an automatic analysis of advert landing pages. We measure the post-click experience by means of two well-known metrics, dwell time and bounce rate. We show that these metrics can be used as proxy of an advert post-click experience, and that a negative post-click experience has a negative effect on user engagement and future ad clicks. We then put forward an approach that analyses advert landing pages, and show how these can affect dwell time and bounce rate. Finally, we develop a prediction model for advert quality based on dwell time, which was deployed on Yahoo mobile news stream app running on iOS. The results show that, using dwell time as a proxy of post-click experience, we can prioritise higher quality ads. We demonstrate the impact of this on users via A/B testing.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133159560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Émilien Antoine, A. Jatowt, Shoko Wakamiya, Yukiko Kawai, Toyokazu Akiyama
{"title":"Portraying Collective Spatial Attention in Twitter","authors":"Émilien Antoine, A. Jatowt, Shoko Wakamiya, Yukiko Kawai, Toyokazu Akiyama","doi":"10.1145/2783258.2783418","DOIUrl":"https://doi.org/10.1145/2783258.2783418","url":null,"abstract":"Microblogging platforms such as Twitter have been recently frequently used for detecting real-time events. The spatial component, as reflected by user location, usually plays a key role in such systems. However, an often neglected source of spatial information are location mentions expressed in tweet contents. In this paper we demonstrate a novel visualization system for analyzing how Twitter users collectively talk about space and for uncovering correlations between geographical locations of Twitter users and the locations they tweet about. Our exploratory analysis is based on the development of a model of spatial information extraction and representation that allows building effective visual analytics framework for large scale datasets. We show visualization results based on half a year long dataset of Japanese tweets and a four months long collection of tweets from USA. The proposed system allows observing many space related aspects of tweet messages including the average scope of spatial attention of social media users and variances in spatial interest over time. The analytical framework we provide and the findings we outline can be valuable for scientists from diverse research areas and for any users interested in geographical and social aspects of shared online data.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127064807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved Bounds on the Dot Product under Random Projection and Random Sign Projection","authors":"A. Kabán","doi":"10.1145/2783258.2783364","DOIUrl":"https://doi.org/10.1145/2783258.2783364","url":null,"abstract":"Dot product is a key building block in a number of data mining algorithms from classification, regression, correlation clustering, to information retrieval and many others. When data is high dimensional, the use of random projections may serve as a universal dimensionality reduction method that provides both low distortion guarantees and computational savings. Yet, contrary to the optimal guarantees that are known on the preservation of the Euclidean distance cf. the Johnson-Lindenstrauss lemma, the existing guarantees on the dot product under random projection are loose and incomplete in the current data mining and machine learning literature. Some recent literature even suggested that the dot product may not be preserved when the angle between the original vectors is obtuse. In this paper we provide improved bounds on the dot product under random projection that matches the optimal bounds on the Euclidean distance. As a corollary, we elucidate the impact of the angle between the original vectors on the relative distortion of the dot product under random projection, and we show that the obtuse vs. acute angles behave symmetrically in the same way. In a further corollary we make a link to sign random projection, where we generalise earlier results. Numerical simulations confirm our theoretical results. Finally we give an application of our results to bounding the generalisation error of compressive linear classifiers under the margin loss.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124444182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Qahtan, Basma Alharbi, Suojin Wang, Xiangliang Zhang
{"title":"A PCA-Based Change Detection Framework for Multidimensional Data Streams: Change Detection in Multidimensional Data Streams","authors":"A. Qahtan, Basma Alharbi, Suojin Wang, Xiangliang Zhang","doi":"10.1145/2783258.2783359","DOIUrl":"https://doi.org/10.1145/2783258.2783359","url":null,"abstract":"Detecting changes in multidimensional data streams is an important and challenging task. In unsupervised change detection, changes are usually detected by comparing the distribution in a current (test) window with a reference window. It is thus essential to design divergence metrics and density estimators for comparing the data distributions, which are mostly done for univariate data. Detecting changes in multidimensional data streams brings difficulties to the density estimation and comparisons. In this paper, we propose a framework for detecting changes in multidimensional data streams based on principal component analysis, which is used for projecting data into a lower dimensional space, thus facilitating density estimation and change-score calculations. The proposed framework also has advantages over existing approaches by reducing computational costs with an efficient density estimator, promoting the change-score calculation by introducing effective divergence metrics, and by minimizing the efforts required from users on the threshold parameter setting by using the Page-Hinkley test. The evaluation results on synthetic and real data show that our framework outperforms two baseline methods in terms of both detection accuracy and computational costs.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124549692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shayok Chakraborty, V. Balasubramanian, Adepu Ravi Sankar, S. Panchanathan, Jieping Ye
{"title":"BatchRank: A Novel Batch Mode Active Learning Framework for Hierarchical Classification","authors":"Shayok Chakraborty, V. Balasubramanian, Adepu Ravi Sankar, S. Panchanathan, Jieping Ye","doi":"10.1145/2783258.2783298","DOIUrl":"https://doi.org/10.1145/2783258.2783298","url":null,"abstract":"Active learning algorithms automatically identify the salient and exemplar instances from large amounts of unlabeled data and thus reduce human annotation effort in inducing a classification model. More recently, Batch Mode Active Learning (BMAL) techniques have been proposed, where a batch of data samples is selected simultaneously from an unlabeled set. Most active learning algorithms assume a flat label space, that is, they consider the class labels to be independent. However, in many applications, the set of class labels are organized in a hierarchical tree structure, with the leaf nodes as outputs and the internal nodes as clusters of outputs at multiple levels of granularity. In this paper, we propose a novel BMAL algorithm (BatchRank) for hierarchical classification. The sample selection is posed as an NP-hard integer quadratic programming problem and a convex relaxation (based on linear programming) is derived, whose solution is further improved by an iterative truncated power method. Finally, a deterministic bound is established on the quality of the solution. Our empirical results on several challenging, real-world datasets from multiple domains, corroborate the potential of the proposed framework for real-world hierarchical classification applications.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123924667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transitive Transfer Learning","authors":"Ben Tan, Yangqiu Song, Erheng Zhong, Qiang Yang","doi":"10.1145/2783258.2783295","DOIUrl":"https://doi.org/10.1145/2783258.2783295","url":null,"abstract":"Transfer learning, which leverages knowledge from source domains to enhance learning ability in a target domain, has been proven effective in various applications. One major limitation of transfer learning is that the source and target domains should be directly related. If there is little overlap between the two domains, performing knowledge transfer between these domains will not be effective. Inspired by human transitive inference and learning ability, whereby two seemingly unrelated concepts can be connected by a string of intermediate bridges using auxiliary concepts, in this paper we study a novel learning problem: Transitive Transfer Learning (abbreviated to TTL). TTL is aimed at breaking the large domain distances and transfer knowledge even when the source and target domains share few factors directly. For example, when the source and target domains are documents and images respectively, TTL could use some annotated images as the intermediate domain to bridge them. To solve the TTL problem, we propose a learning framework to mimic the human learning process. The framework is composed of an intermediate domain selection component and a knowledge transfer component. Extensive empirical evidence shows that the framework yields state-of-the-art classification accuracies on several classification data sets.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"55 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129532365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amit Dhurandhar, B. Graves, Rajesh Kumar Ravi, Gopikrishnan Maniachari, M. Ettl
{"title":"Big Data System for Analyzing Risky Procurement Entities","authors":"Amit Dhurandhar, B. Graves, Rajesh Kumar Ravi, Gopikrishnan Maniachari, M. Ettl","doi":"10.1145/2783258.2788563","DOIUrl":"https://doi.org/10.1145/2783258.2788563","url":null,"abstract":"An accredited biennial 2014 study by the Association of Certified Fraud Examiners claims that on average 5% of a company's revenue is lost because of unchecked fraud every year. The reason for such heavy losses are that it takes around 18 months for a fraud to be caught and audits catch only 3% of the actual fraud. This begs the need for better tools and processes to be able to quickly and cheaply identify potential malefactors. In this paper, we describe a robust tool to identify procurement related fraud/risk, though the general design and the analytical components could be adapted to detecting fraud in other domains. Besides analyzing standard transactional data, our solution analyzes multiple public and private data sources leading to wider coverage of fraud types than what generally exists in the marketplace. Moreover, our approach is more principled in the sense that the learning component, which is based on investigation feedback has formal guarantees. Though such a tool is ever evolving, a deployment of this tool over the past 12 months has found many interesting cases from compliance risk and fraud point of view across more than 150 countries and 65000+ vendors, increasing the number of true positives found by over 80% compared with other state-of-the-art tools that the domain experts were previously using.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128937699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}