{"title":"Learning better transliterations","authors":"Jeff Pasternack, D. Roth","doi":"10.1145/1645953.1645978","DOIUrl":"https://doi.org/10.1145/1645953.1645978","url":null,"abstract":"We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic, requiring no knowledge of the source or target languages, and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of \"productions\", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability. To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2(|w|-1) possible segmentations to consider for each word, by using dynamic programming each iteration of EM takes O(m^6 * n) time, where m is the length of the longest word in the data and n is the number of word pairs, and is very fast in practice. Furthermore, discovering transliterations takes only O(m^4 * w) time, where w is the number of candidate words to choose from, and generating a transliteration takes O(m2 * k2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115264479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering web queries","authors":"John S. Whissell, C. Clarke, Azin Ashkan","doi":"10.1145/1645953.1646069","DOIUrl":"https://doi.org/10.1145/1645953.1646069","url":null,"abstract":"Despite the wide applicability of clustering methods, their evaluation remains a problem. In this paper, we present a metric for the evaluation of clustering methods. The data set to be clustered is viewed as a sample from a larger population, with clustering quality measured in terms of our predicted ability to discriminate between members of this population. We measure this property by training a classifier to recognize each cluster and measuring the accuracy of this classifier, normalized by a notion of expected accuracy. To demonstrate the applicability of this metric we apply it to Web queries. We investigated a commercially oriented data set of 1700 queries and a general data set of 4000 queries. Both sets are taken from the logs of a commercial Web search engine. Clustering is based on the contents of search engine result pages generated by executing the queries on the search engine from which they were taken. Multiple clustering algorithms are crossed with various weighting schemes to produce multiple clusterings of each query set. Our metric is used evaluate these clusterings. The results on the commercially oriented data set are compared to two pre-existing manual labelings, and are also used in an ad clickthrough experiment.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"258 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115268505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting KNN text classification accuracy by using supervised term weighting schemes","authors":"Iyad Batal, M. Hauskrecht","doi":"10.1145/1645953.1646296","DOIUrl":"https://doi.org/10.1145/1645953.1646296","url":null,"abstract":"The increasing availability of digital documents in the last decade has prompted the development of machine learning techniques to automatically classify and organize text documents. The majority of text classification systems rely on the vector space model, which represents the documents as vectors in the term space. Each vector component is assigned a weight that reflects the importance of the term in the document. Typically, these weights are assigned using an information retrieval (IR) approach, such as the famous tf-idf function. In this work, we study two weighting schemes based on information gain and chi-square statistics. These schemes take advantage of the category label information to weight the terms according to their distributions across the different categories. We show that using these supervised weights instead of conventional unsupervised weights can greatly improve the performance of the k-nearest neighbor (KNN) classifier. Experimental evaluations, carried out on multiple text classification tasks, demonstrate the benefits of this approach in creating accurate text classifiers.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"202 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121026817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Building domain-oriented sentiment lexicon by improved information bottleneck","authors":"Weifu Du, Songbo Tan","doi":"10.1145/1645953.1646221","DOIUrl":"https://doi.org/10.1145/1645953.1646221","url":null,"abstract":"This paper describes an adapted information bottleneck approach for construction of domain-oriented sentiment lexicon. The basic idea is to use three kinds of relationships (WWinter, WDinter and WDintra,) to infer the semantic orientation of the out-of-domain words. The experimental results demonstrate that proposed method could dramatically improve the accuracy of the baseline approach on the construction of out-of-domain sentiment lexicon.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121068754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large margin transductive transfer learning","authors":"Brian Quanz, Jun Huan","doi":"10.1145/1645953.1646121","DOIUrl":"https://doi.org/10.1145/1645953.1646121","url":null,"abstract":"Recently there has been increasing interest in the problem of transfer learning, in which the typical assumption that training and testing data are drawn from identical distributions is relaxed. We specifically address the problem of transductive transfer learning in which we have access to labeled training data and unlabeled testing data potentially drawn from different, yet related distributions, and the goal is to leverage the labeled training data to learn a classifier to correctly predict data from the testing distribution. We have derived efficient algorithms for transductive transfer learning based on a novel viewpoint and the Support Vector Machine (SVM) paradigm, of a large margin hyperplane classifier in a feature space. We show that our method can out-perform some recent state-of-the-art approaches for transfer learning on several data sets, with the added benefits of model and data separation and the potential to leverage existing work on support vector machines.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121141989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"L2 norm regularized feature kernel regression for graph data","authors":"Hongliang Fei, Jun Huan","doi":"10.1145/1645953.1646029","DOIUrl":"https://doi.org/10.1145/1645953.1646029","url":null,"abstract":"Features in many real world applications such as Cheminformatics, Bioinformatics and Information Retrieval have complex internal structure. For example, frequent patterns mined from graph data are graphs. Such graph features have different number of nodes and edges and usually overlap with each other. In conventional data mining and machine learning applications, the internal structure of features are usually ignored. In this paper we consider a supervised learning problem where the features of the data set have intrinsic complexity, and we further assume that the feature intrinsic complexity may be measured by a kernel function. We hypothesize that by regularizing model parameters using the information of feature complexity, we can construct simple yet high quality model that captures the intrinsic structure of the data. Towards the end of testing this hypothesis, we focus on a regression task and have designed an algorithm that incorporate the feature complexity in the learning process, using a kernel matrix weighted L2 norm for regularization, to obtain improved regression performance over conventional learning methods that does not consider the additional information of the feature. We have tested our algorithm using 5 different real-world data sets and have demonstrate the effectiveness of our method.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123567954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interactive, topic-based visual text summarization and analysis","authors":"Shixia Liu, Michelle X. Zhou, Shimei Pan, Weihong Qian, Weijia Cai, X. Lian","doi":"10.1145/1645953.1646023","DOIUrl":"https://doi.org/10.1145/1645953.1646023","url":null,"abstract":"We are building an interactive, visual text analysis tool that aids users in analyzing a large collection of text. Unlike existing work in text analysis, which focuses either on developing sophisticated text analytic techniques or inventing novel visualization metaphors, ours is tightly integrating state-of-the-art text analytics with interactive visualization to maximize the value of both. In this paper, we focus on describing our work from two aspects. First, we present the design and development of a time-based, visual text summary that effectively conveys complex text summarization results produced by the Latent Dirichlet Allocation (LDA) model. Second, we describe a set of rich interaction tools that allow users to work with a created visual text summary to further interpret the summarization results in context and examine the text collection from multiple perspectives. As a result, our work offers two unique contributions. First, we provide an effective visual metaphor that transforms complex and even imperfect text summarization results into a comprehensible visual summary of texts. Second, we offer users a set of flexible visual interaction tools as the alternatives to compensate for the deficiencies of current text summarization techniques. We have applied our work to a number of text corpora and our evaluation shows the promise of the work, especially in support of complex text analyses.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123715563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rongwei Cen, Yiqun Liu, Min Zhang, Bo Zhou, Liyun Ru, Shaoping Ma
{"title":"Exploring relevance for clicks","authors":"Rongwei Cen, Yiqun Liu, Min Zhang, Bo Zhou, Liyun Ru, Shaoping Ma","doi":"10.1145/1645953.1646246","DOIUrl":"https://doi.org/10.1145/1645953.1646246","url":null,"abstract":"Mining feedback information from user click-through data is an important issue for modern Web retrieval systems in terms of architecture analysis, performance evaluation and algorithm optimization. For commercial search engines, user click-through data contains useful information as well as large amount of inevitable noises. This paper proposes an approach to recognize reliable and meaningful user clicks (referred to as Relevant Clicks, RCs) in click-through data. By modeling user click-through behavior on search result lists, we propose several features to separate RCs from click noises. A learning algorithm is presented to estimate the quality of user clicks. Experimental results on large scale dataset show that: 1) our model effectively identifies RCs in noisy click-through data; 2) Different from previous click-through analysis efforts, our approach works well for both hot queries and long-tail queries.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"229 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116133555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: KM graph mining","authors":"Michael L. Lyu","doi":"10.1145/3261228","DOIUrl":"https://doi.org/10.1145/3261228","url":null,"abstract":"","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122986297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An improved feedback approach using relevant local posts for blog feed retrieval","authors":"Yeha Lee, Seung-Hoon Na, Jong-Hyeok Lee","doi":"10.1145/1645953.1646278","DOIUrl":"https://doi.org/10.1145/1645953.1646278","url":null,"abstract":"Blog feed search aims to identify a blog feed with a recurring interest in a given topic. In this paper, we investigate the \"pseudo-relevance feedback\" for blog feed search task, where its unit of relevance judgment is not based on a blog post but a blog feed (the collection of all its constituent posts). This paper focuses on two characteristics of feed search task, blog feed's topical diversity and multifaceted property of query. We propose a novel feed-level selection of local posts which uses only highly relevant local posts in each top-ranked feed, in order to capture the correct and diverse relevant information to a given topic. Experimental results show that the proposed approach outperforms traditional feedback approaches. Especially, the proposed approach gives 2% further increase of nDCG over the best performing result of TREC '08 Blog Distillation Task.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122097325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}