Zheng Chen, Shengping Liu, Wenyin Liu, G. Pu, Wei-Ying Ma
{"title":"Building a web thesaurus from web link structure","authors":"Zheng Chen, Shengping Liu, Wenyin Liu, G. Pu, Wei-Ying Ma","doi":"10.1145/860435.860447","DOIUrl":"https://doi.org/10.1145/860435.860447","url":null,"abstract":"Thesaurus has been widely used in many applications, including information retrieval, natural language processing, and question answering. In this paper, we propose a novel approach to automatically constructing a domain-specific thesaurus from the Web using link structure information. The proposed approach is able to identify new terms and reflect the latest relationship between terms as the Web evolves. First, a set of high quality and representative websites of a specific domain is selected. After filtering out navigational links, link analysis is applied to each website to obtain its content structure. Finally, the thesaurus is constructed by merging the content structures of the selected websites. The experimental results on automatic query expansion based on our constructed thesaurus show 20% improvement in search precision compared to the baseline.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116396997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A unified model for metasearch and the efficient evaluation of retrieval systems via the hedge algorithm","authors":"J. Aslam, Virgil Pavlu, R. Savell","doi":"10.1145/860435.860517","DOIUrl":"https://doi.org/10.1145/860435.860517","url":null,"abstract":"We present a unified framework for simultaneously solving both the pooling problem (the construction of efficient document pools for the evaluation of retrieval systems) and metasearch (the fusion of ranked lists returned by retrieval systems in order to increase performance). The implementation is based on the Hedge algorithm for online learning, which has the advantage of convergence to bounded error rates approaching the performance of the best linear combination of the underlying systems. The choice of a loss function closely related to the average precision measure of system performance ensures that the judged document set performs well, both in constructing a metasearch list and as a pool for the accurate evaluation of retrieval systems. Our experimental results on TREC data demonstrate excellent performance in all measures---evaluation of systems, retrieval of relevant documents, and generation of metasearch lists.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122659390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A light weight PDA-friendly collection fusion technique","authors":"J. Antoniuk, M. Nascimento","doi":"10.1145/860435.860540","DOIUrl":"https://doi.org/10.1145/860435.860540","url":null,"abstract":"This short paper presents a light weight technique to merge results lists obtained from querying different databases. The motivation for such a technique is a general purpose search engine for Palm-OS based PDAs.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127671547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Document clustering based on non-negative matrix factorization","authors":"W. Xu, Xin Liu, Yihong Gong","doi":"10.1145/860435.860485","DOIUrl":"https://doi.org/10.1145/860435.860485","url":null,"abstract":"In this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF), each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. Our experimental evaluations show that the proposed document clustering method surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracies.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125605634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering and structuring information flow among bioinformatics resources","authors":"Joan C. Bartlett, Elaine Toms","doi":"10.1145/860435.860526","DOIUrl":"https://doi.org/10.1145/860435.860526","url":null,"abstract":"In this poster, we present a model of the flow of information among bioinformatics resources in the context of a specific scientific problem. Combining task analysis with traditional, qualitative research, we determined the extent to which the bioinformatics analysis process could be automated. The model represents a semi-automated process, involving fourteen distinct data processing steps, and forms the framework for an interface to bioinformatics information.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125620275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessing the effectiveness of pen-based input queries","authors":"Stephen Levin, Paul D. Clough, M. Sanderson","doi":"10.1145/860435.860539","DOIUrl":"https://doi.org/10.1145/860435.860539","url":null,"abstract":"In this poster, we describe an experiment exploring the effectiveness of a pen-based text input device for use in query construction. Standard TREC queries were written, recognised, and subsequently retrieved upon. Comparisons between retrieval effectiveness based on the recognised writing and a typed text baseline were made. On average, effectiveness was 75% of the baseline. Other statistics on the quality and nature of recognition are also reported.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"789 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130834756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using asymmetric distributions to improve text classifier probability estimates","authors":"Paul N. Bennett","doi":"10.1145/860435.860457","DOIUrl":"https://doi.org/10.1145/860435.860457","url":null,"abstract":"Text classifiers that give probability estimates are more readily applicable in a variety of scenarios. For example, rather than choosing one set decision threshold, they can be used in a Bayesian risk model to issue a run-time decision which minimizes a user-specified cost function dynamically chosen at prediction time. However, the quality of the probability estimates is crucial. We review a variety of standard approaches to converting scores (and poor probability estimates) from text classifiers to high quality estimates and introduce new models motivated by the intuition that the empirical score distribution for the \"extremely irrelevant\", \"hard to discriminate\", and \"obviously relevant\" items are often significantly different. Finally, we analyze the experimental performance of these models over the outputs of two text classifiers. The analysis demonstrates that one of these models is theoretically attractive (introducing few new parameters while increasing flexibility), computationally efficient, and empirically preferable.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133346588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An architecture for peer-to-peer information retrieval","authors":"I. Klampanos, J. Jose","doi":"10.1145/860435.860521","DOIUrl":"https://doi.org/10.1145/860435.860521","url":null,"abstract":"P2P networking is one of the most rapidly developing areas of modern computing. By the utilisation of the exponentially increasing Internet nodes (users) as well as the ever powerful home computer systems and mobile devices, the P2P paradigm attempts to create open and collaborative networks of the most diverse functionality nature. In this study we propose an architecture for IR over large semi-collaborating P2P networks based on clustering. By the term “semi-collaborating” we mean networks where, although peers have to collaborate in order to achieve overall effectiveness, they do not have to share any proprietary information with the rest of the network, nor do they have to be consistent with respect to the IR systems they use. Also, we reason toward the usefulness of clustering in open P2P networks by relying on two basic assumptions (introduced in Section 3.1).","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115015328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic ranking of retrieval systems in imperfect environments","authors":"Rabia Nuray-Turan, F. Can","doi":"10.1145/860435.860510","DOIUrl":"https://doi.org/10.1145/860435.860510","url":null,"abstract":"The empirical investigation of the effectiveness of information retrieval (IR) systems requires a test collection, a set of query topics, and a set of relevance judgments made by human assessors for each query. Previous experiments show that differences in human relevance assessments do not affect the relative performance of retrieval systems. Based on this observation, we propose and evaluate a new approach to replace the human relevance judgments by an automatic method. Ranking of retrieval systems with our methodology correlates positively and significantly with that of human-based evaluations. In the experiments, we assume a Web-like imperfect environment: the indexing information for all documents is available for ranking, but some documents may not be available for retrieval. Such conditions can be due to document deletions or network problems. Our method of simulating imperfect environments can be used for Web search engine assessment and in estimating the effects of network conditions (e.g., network unreliability) on IR system performance.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114643989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian extension to the language model for ad hoc information retrieval","authors":"H. Zaragoza, D. Hiemstra, Michael E. Tipping","doi":"10.1145/860435.860439","DOIUrl":"https://doi.org/10.1145/860435.860439","url":null,"abstract":"We propose a Bayesian extension to the ad-hoc Language Model. Many smoothed estimators used for the multinomial query model in ad-hoc Language Models (including Laplace and Bayes-smoothing) are approximations to the Bayesian predictive distribution. In this paper we derive the full predictive distribution in a form amenable to implementation by classical IR models, and then compare it to other currently used estimators. In our experiments the proposed model outperforms Bayes-smoothing, and its combination with linear interpolation smoothing outperforms all other estimators.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129044252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}