Snigdha Chaturvedi, T. Faruquie, L. V. Subramaniam, M. Mohania
{"title":"Estimating accuracy for text classification tasks on large unlabeled data","authors":"Snigdha Chaturvedi, T. Faruquie, L. V. Subramaniam, M. Mohania","doi":"10.1145/1871437.1871551","DOIUrl":"https://doi.org/10.1145/1871437.1871551","url":null,"abstract":"Rule based systems for processing text data encode the knowledge of a human expert into a rule base to take decisions based on interactions of the input data and the rule base. Similarly, supervised learning based systems can learn patterns present in a given dataset to make decisions on similar and other related data. Performances of both these classes of models are largely dependent on the training examples seen by them, based on which the learning was performed. Even though trained models might fit well on training data, the accuracies they yield on a new test data may be considerably different. Computing the accuracy of the learnt models on new unlabeled datasets is a challenging problem requiring costly labeling, and which is still likely to only cover a subset of the new data because of the large sizes of datasets involved. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks: document classification and postal address segmentation, and using both supervised machine learning methods and human generated rule based models.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133598655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Web search solved?: all result rankings the same?","authors":"H. Zaragoza, B. B. Cambazoglu, R. Baeza-Yates","doi":"10.1145/1871437.1871507","DOIUrl":"https://doi.org/10.1145/1871437.1871507","url":null,"abstract":"The objective of this work is to derive quantitative statements about what fraction of web search queries issued to the state-of-the-art commercial search engines lead to excellent results or, on the contrary, poor results. To be able to make such statements in an automated way, we propose a new measure that is based on lower and upper bound analysis over the standard relevance measures. Moreover, we extend this measure to carry out comparisons between competing search engines by introducing the concept of disruptive sets, which we use to estimate the degree to which a search engine solves queries that are not solved by its competitors. We report empirical results on a large editorial evaluation of the three largest search engines in the US market.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133432761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qiong Cheng, M. Ogihara, Jinpeng Wei, A. Zelikovsky
{"title":"WS-GraphMatching: a web service tool for graph matching","authors":"Qiong Cheng, M. Ogihara, Jinpeng Wei, A. Zelikovsky","doi":"10.1145/1871437.1871779","DOIUrl":"https://doi.org/10.1145/1871437.1871779","url":null,"abstract":"Some emerging applications deal with graph data and relie on graph matching and mining. The service-oriented graph matching and mining tool has been required. In this demo we present the web service tool WS-GraphMatching which supports the efficient and visualized matching of polytrees, series-parallel graphs, and arbitrary graphs with bounded feedback vertex set. Its embedded matching algorithms take in account the similarity of vertex-to-vertex and graph structures, allowing path contraction, vertex deletion, and vertex insertions. It provides one-to-one matching queries as well as queries in batch modes including one-to-many matching mode and many-to-many matching mode. It can be used for predicting unknown structured information, comparing and finding conserved patterns, and resolving ambiguous identification of vertices. WS-GraphAligner is available as web-server at: http://odysseus.cs.fiu.edu:8080/MinePW/pages/gmapping/GMMain.html.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131871288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Searching consumer image collections using web-based concept expansion","authors":"Mark D. Wood, A. Loui, S. Hibino","doi":"10.1145/1871437.1871525","DOIUrl":"https://doi.org/10.1145/1871437.1871525","url":null,"abstract":"As consumers accumulate more and more personal imagery, searching for specific images has become increasingly difficult. Consumers typically provide little or no annotations, and automated classifiers and concept tagging tools are limited in their scope and vocabulary. This work addresses this sparsity of semantic information by leveraging domain-specific information provided by online photo-sharing communities. Such information enables improved search by allowing user-provided search terms to be expanded into a set of semantically related concepts, using relevant semantic relationships provided by millions of users. Our system first extracts metadata using a modest number of image and event-based semantic classifiers, as well as any meaningful file or folder names. When users pose text-based queries, our system retrieves images from their personal image collections by leveraging Flickr's tag dataset for concept expansion. This approach enables users to search their collections without having to manually annotate their pictures. We compare the retrieval performance of using a Flickr-based concept expander with the performance obtained without concept expansion and with using a WordNet-based concept expander. The results demonstrate that common sense knowledge gleaned from online photo sharing communities can enable meaningful image search on consumer image collections, searches that would be impossible using only the available image metadata.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115711046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Duo Zhang, Jimeng Sun, ChengXiang Zhai, A. Bose, Nikos Anerousis
{"title":"PTM: probabilistic topic mapping model for mining parallel document collections","authors":"Duo Zhang, Jimeng Sun, ChengXiang Zhai, A. Bose, Nikos Anerousis","doi":"10.1145/1871437.1871696","DOIUrl":"https://doi.org/10.1145/1871437.1871696","url":null,"abstract":"Many applications generate a large volume of parallel document collections. A parallel document collection consists of two sets of documents where the documents in each set correspond to each other and form semantic pairs (e.g., pairs of problem and solution descriptions in a help-desk setting). Although much work has been done on text mining, little previous work has attempted to mine such a novel kind of text data. In this paper, we propose a new probabilistic topic model, called Probabilistic Topic Mapping (PTM) model, to mine parallel document collections to simultaneously discover latent topics in both sets of documents as well as the mapping of topics in one set to those in the other. We evaluate the PTM model on one real parallel document collection in IT service domain. We show that PTM can effectively discover meaningful topics, as well as their mappings, and it's also useful for improving text matching and retrieval when there's a vocabulary gap.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"162 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124550551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ontology emergence from folksonomies","authors":"Kaipeng Liu, Binxing Fang, Weizhe Zhang","doi":"10.1145/1871437.1871578","DOIUrl":"https://doi.org/10.1145/1871437.1871578","url":null,"abstract":"The folksonomies built from the large-scale social annotations made by collaborating users are perfect data sources for bootstrapping Semantic Web applications. In this paper, we develop an ontology induction approach to harvest the emergent semantics from the folksonomies. We propose a latent subsumption hierarchy model to uncover the implicit structure of tag space and develop our ontology induction approach on basis of this model. We identify tag subsumptions with a set-theoretical approach and model the tag space as a tag subsumption graph. While turning this graph into a concept hierarchy, we address the problem of inconsistent subsumptions and propose a random walk based tag generality ranking procedure to settle it. We propose an agglomerative hierarchical clustering algorithm utilizing the result of tag generality ranking to generate the concept hierarchy. We conduct experiments on the Delicious dataset. The results of both qualitative and quantitative evaluation demonstrate the effectiveness of the proposed approach.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114424511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Web page classification on child suitability","authors":"Carsten Eickhoff, P. Serdyukov, A. D. Vries","doi":"10.1145/1871437.1871638","DOIUrl":"https://doi.org/10.1145/1871437.1871638","url":null,"abstract":"Children spend significant amounts of time on the Internet. Recent studies showed, that during these periods they are often not under adult supervision. This work presents an automatic approach to identifying suitable web pages for children based on topical and non-topical web page aspects. We discuss the characteristics of children's web sites with respect to recent findings in children's psychology and cognitive sciences. We finally evaluate our approach in a large-scale user study, finding, that it compares favourably to state of the art methods while approximating human performance.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114441413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Constructing classification features using minimal predictive patterns","authors":"Iyad Batal, M. Hauskrecht","doi":"10.1145/1871437.1871549","DOIUrl":"https://doi.org/10.1145/1871437.1871549","url":null,"abstract":"Choosing good features to represent objects can be crucial to the success of supervised machine learning methods. Recently, there has been a great interest in applying data mining techniques to construct new classification features. The rationale behind this approach is that patterns (feature-value combinations) could capture more underlying semantics than single features. Hence the inclusion of some patterns can improve the classification performance. Currently, most methods adopt a two-phases approach by generating all frequent patterns in the first phase and selecting the discriminative patterns in the second phase. However, this approach has limited success because it is usually very difficult to correctly identify important predictive patterns in a large set of highly correlated frequent patterns. In this paper, we introduce the minimal predictive patterns framework to directly mine a compact set of highly predictive patterns. The idea is to integrate pattern mining and feature selection in order to filter out non-informative and redundant patterns while being generated. We propose some pruning techniques to speed up the mining process. Our extensive experimental evaluation on many datasets demonstrates the advantage of our method by outperforming many well known classifiers.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114891892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical auto-tagging: organizing Q&A knowledge for everyone","authors":"Kyosuke Nishida, Ko Fujimura","doi":"10.1145/1871437.1871697","DOIUrl":"https://doi.org/10.1145/1871437.1871697","url":null,"abstract":"We propose a hierarchical auto-tagging system, TagHats, to improve users' knowledge sharing. Our system assigns three different levels of tags to Q&A documents: category, theme, and keyword. Multiple category tags can organize a document according to multiple viewpoints, and multiple theme and keyword tags can identify what the document is about clearly. Moreover, these hierarchical tags will be helpful in organizing documents to support everyone because different users have different demands in terms of tag specificity. Our system consists of a hierarchical classification method for assigning category and theme tags, a new keyword extraction method that considers the structure of Q&A documents, and a new method for selecting theme tag candidates from each category. Experiments with the documents of Oshiete! goo demonstrate that our system is able to assign hierarchical tags to the documents appropriately and is capable of outperforming baseline methods significantly.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123590468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SEQUEL: query completion via pattern mining on multi-column structural data","authors":"Chuancong Gao, Qingyan Yang, Jianyong Wang","doi":"10.1145/1871437.1871782","DOIUrl":"https://doi.org/10.1145/1871437.1871782","url":null,"abstract":"In this demonstration, we propose an interactive query completion system on structural data like DBLP, called SEQUEL. It is novel in several aspects: with patterns mined on the structural data using newly devised algorithm, SEQUEL offers high-utility completions composed with not only words but also phrases, and requires no explicit indications of corresponding columns. Instead of using query logs exploited previously for unstructured data, more effective completions are provided based on patterns mined directly from the records. Moreover, an effective index structure helps SEQUEL respond fast at millisecond level for each keystroke.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123640896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}