Proceedings of the 18th ACM conference on Information and knowledge management最新文献

Learning better transliterations 学习更好的音译

Proceedings of the 18th ACM conference on Information and knowledge management Pub Date : 2009-11-02 DOI: 10.1145/1645953.1645978

Jeff Pasternack, D. Roth

{"title":"Learning better transliterations","authors":"Jeff Pasternack, D. Roth","doi":"10.1145/1645953.1645978","DOIUrl":"https://doi.org/10.1145/1645953.1645978","url":null,"abstract":"We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic, requiring no knowledge of the source or target languages, and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of \"productions\", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability. To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2(|w|-1) possible segmentations to consider for each word, by using dynamic programming each iteration of EM takes O(m^6 * n) time, where m is the length of the longest word in the data and n is the number of word pairs, and is very fast in practice. Furthermore, discovering transliterations takes only O(m^4 * w) time, where w is the number of candidate words to choose from, and generating a transliteration takes O(m2 * k2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115264479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 16

Clustering web queries 聚类web查询

Proceedings of the 18th ACM conference on Information and knowledge management Pub Date : 2009-11-02 DOI: 10.1145/1645953.1646069

John S. Whissell, C. Clarke, Azin Ashkan

{"title":"Clustering web queries","authors":"John S. Whissell, C. Clarke, Azin Ashkan","doi":"10.1145/1645953.1646069","DOIUrl":"https://doi.org/10.1145/1645953.1646069","url":null,"abstract":"Despite the wide applicability of clustering methods, their evaluation remains a problem. In this paper, we present a metric for the evaluation of clustering methods. The data set to be clustered is viewed as a sample from a larger population, with clustering quality measured in terms of our predicted ability to discriminate between members of this population. We measure this property by training a classifier to recognize each cluster and measuring the accuracy of this classifier, normalized by a notion of expected accuracy. To demonstrate the applicability of this metric we apply it to Web queries. We investigated a commercially oriented data set of 1700 queries and a general data set of 4000 queries. Both sets are taken from the logs of a commercial Web search engine. Clustering is based on the contents of search engine result pages generated by executing the queries on the search engine from which they were taken. Multiple clustering algorithms are crossed with various weighting schemes to produce multiple clusterings of each query set. Our metric is used evaluate these clusterings. The results on the commercially oriented data set are compared to two pre-existing manual labelings, and are also used in an ad clickthrough experiment.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"258 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115268505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 6

Boosting KNN text classification accuracy by using supervised term weighting schemes 利用监督词权方案提高KNN文本分类准确率

Proceedings of the 18th ACM conference on Information and knowledge management Pub Date : 2009-11-02 DOI: 10.1145/1645953.1646296

Iyad Batal, M. Hauskrecht

引用次数: 30

Building domain-oriented sentiment lexicon by improved information bottleneck 通过改进信息瓶颈构建面向领域的情感词典

Proceedings of the 18th ACM conference on Information and knowledge management Pub Date : 2009-11-02 DOI: 10.1145/1645953.1646221

Weifu Du, Songbo Tan

引用次数: 34

Large margin transductive transfer learning 大余量转换迁移学习

Proceedings of the 18th ACM conference on Information and knowledge management Pub Date : 2009-11-02 DOI: 10.1145/1645953.1646121

Brian Quanz, Jun Huan

引用次数: 116

L2 norm regularized feature kernel regression for graph data 图数据的L2范数正则化特征核回归

Proceedings of the 18th ACM conference on Information and knowledge management Pub Date : 2009-11-02 DOI: 10.1145/1645953.1646029

Hongliang Fei, Jun Huan

{"title":"L2 norm regularized feature kernel regression for graph data","authors":"Hongliang Fei, Jun Huan","doi":"10.1145/1645953.1646029","DOIUrl":"https://doi.org/10.1145/1645953.1646029","url":null,"abstract":"Features in many real world applications such as Cheminformatics, Bioinformatics and Information Retrieval have complex internal structure. For example, frequent patterns mined from graph data are graphs. Such graph features have different number of nodes and edges and usually overlap with each other. In conventional data mining and machine learning applications, the internal structure of features are usually ignored. In this paper we consider a supervised learning problem where the features of the data set have intrinsic complexity, and we further assume that the feature intrinsic complexity may be measured by a kernel function. We hypothesize that by regularizing model parameters using the information of feature complexity, we can construct simple yet high quality model that captures the intrinsic structure of the data. Towards the end of testing this hypothesis, we focus on a regression task and have designed an algorithm that incorporate the feature complexity in the learning process, using a kernel matrix weighted L2 norm for regularization, to obtain improved regression performance over conventional learning methods that does not consider the additional information of the feature. We have tested our algorithm using 5 different real-world data sets and have demonstrate the effectiveness of our method.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123567954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9

Interactive, topic-based visual text summarization and analysis 交互式，基于主题的可视化文本摘要和分析

Proceedings of the 18th ACM conference on Information and knowledge management Pub Date : 2009-11-02 DOI: 10.1145/1645953.1646023

Shixia Liu, Michelle X. Zhou, Shimei Pan, Weihong Qian, Weijia Cai, X. Lian

{"title":"Interactive, topic-based visual text summarization and analysis","authors":"Shixia Liu, Michelle X. Zhou, Shimei Pan, Weihong Qian, Weijia Cai, X. Lian","doi":"10.1145/1645953.1646023","DOIUrl":"https://doi.org/10.1145/1645953.1646023","url":null,"abstract":"We are building an interactive, visual text analysis tool that aids users in analyzing a large collection of text. Unlike existing work in text analysis, which focuses either on developing sophisticated text analytic techniques or inventing novel visualization metaphors, ours is tightly integrating state-of-the-art text analytics with interactive visualization to maximize the value of both. In this paper, we focus on describing our work from two aspects. First, we present the design and development of a time-based, visual text summary that effectively conveys complex text summarization results produced by the Latent Dirichlet Allocation (LDA) model. Second, we describe a set of rich interaction tools that allow users to work with a created visual text summary to further interpret the summarization results in context and examine the text collection from multiple perspectives. As a result, our work offers two unique contributions. First, we provide an effective visual metaphor that transforms complex and even imperfect text summarization results into a comprehensible visual summary of texts. Second, we offer users a set of flexible visual interaction tools as the alternatives to compensate for the deficiencies of current text summarization techniques. We have applied our work to a number of text corpora and our evaluation shows the promise of the work, especially in support of complex text analyses.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123715563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 111

Exploring relevance for clicks 探索点击的相关性

Proceedings of the 18th ACM conference on Information and knowledge management Pub Date : 2009-11-02 DOI: 10.1145/1645953.1646246

Rongwei Cen, Yiqun Liu, Min Zhang, Bo Zhou, Liyun Ru, Shaoping Ma

引用次数: 7

Session details: KM graph mining 会话细节:KM图挖掘

Proceedings of the 18th ACM conference on Information and knowledge management Pub Date : 2009-11-02 DOI: 10.1145/3261228

Michael L. Lyu

引用次数: 0

An improved feedback approach using relevant local posts for blog feed retrieval 一种改进的反馈方法，使用相关的本地帖子进行博客提要检索

Proceedings of the 18th ACM conference on Information and knowledge management Pub Date : 2009-11-02 DOI: 10.1145/1645953.1646278

Yeha Lee, Seung-Hoon Na, Jong-Hyeok Lee

引用次数: 7