{"title":"Combining document representations for known-item search","authors":"Paul Ogilvie, Jamie Callan","doi":"10.1145/860435.860463","DOIUrl":"https://doi.org/10.1145/860435.860463","url":null,"abstract":"This paper investigates the pre-conditions for successful combination of document representations formed from structural markup for the task of known-item search. As this task is very similar to work in meta-search and data fusion, we adapt several hypotheses from those research areas and investigate them in this context. To investigate these hypotheses, we present a mixture-based language model and also examine many of the current meta-search algorithms. We find that compatible output from systems is important for successful combination of document representations. We also demonstrate that combining low performing document representations can improve performance, but not consistently. We find that the techniques best suited for this task are robust to the inclusion of poorly performing document representations. We also explore the role of variance of results across systems and its impact on the performance of fusion, with the surprising result that the correct documents have higher variance across document representations than highly ranking incorrect documents.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114927389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech-based and video-supported indexing of multimedia broadcast news","authors":"Yoshihiko Hayashi, K. Ohtsuki, K. Bessho, Osamu Mizuno, Y. Matsuo, S. Matsunaga, Minoru Hayashi, T. Hasegawa, Naruhiro Ikeda","doi":"10.1145/860435.860541","DOIUrl":"https://doi.org/10.1145/860435.860541","url":null,"abstract":"This paper describes an automatic content indexing system for news programs, with a special emphasis on its segmentation process. The process can successfully segment an entire news program into topic-centered news stories; the primary tool is a linguistic topic segmentation algorithm. Experiments show that the resulting speech-based segments are fairly accurate, and scene change points supplied by an external video processor can be of help in improving segmentation effectiveness.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128662220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A maximal figure-of-merit learning approach to text categorization","authors":"Sheng Gao, Wen-Chin Wu, Chin-Hui Lee, Tat-Seng Chua","doi":"10.1145/860435.860469","DOIUrl":"https://doi.org/10.1145/860435.860469","url":null,"abstract":"A novel maximal figure-of-merit (MFoM) learning approach to text categorization is proposed. Different from the conventional techniques, the proposed MFoM method attempts to integrate any performance metric of interest (e.g. accuracy, recall, precision, or F1 measure) into the design of any classifier. The corresponding classifier parameters are learned by optimizing an overall objective function of interest. To solve this highly nonlinear optimization problem, we use a generalized probabilistic descent algorithm. The MFoM learning framework is evaluated on the Reuters-21578 task with LSI-based feature extraction and a binary tree classifier. Experimental results indicate that the MFoM classifier gives improved F1 and enhanced robustness over the conventional one. It also outperforms the popular SVM method in micro-averaging F1. Other extensions to design discriminative multiple-category MFoM classifiers for application scenarios with new performance metrics could be envisioned too.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115243271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keynote Address - exploring, modeling, and using the web graph","authors":"A. Broder","doi":"10.1145/860435.860436","DOIUrl":"https://doi.org/10.1145/860435.860436","url":null,"abstract":"The Web graph, meaning the graph induced by Web pages as nodes and their hyperlinks as directed edges, has become a fascinating object of study for many people: physicists, sociologists, mathematicians, computer scientists, and information retrieval specialists. Recent results range from theoretical (e.g.: models for the graph, semi-external algorithms), to experimental (e.g.: new insights regarding the rate of change of pages, new data on the distribution of degrees), to practical (e.g.: improvements in crawling technology). The goal of this talk is to convey an introduction to the state of the art in this area and to sketch the current issues in collecting, representing, analyzing, and modeling this graph. 
Although graph analytic methods are essential tools in the Web IR arsenal, they are well known to the SIGIR community and will not be discussed here in any detail; instead, we will explore some challenges and opportunities for using IR methods and techniques in the exploration of the Web graph, in particular in dealing with legitimate and \"spam\" perturbations of the \"natural\" process of birth and death of nodes and links, and conversely, the challenges and opportunities of using graph methods in support of IR on the Web and in the enterprise.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124084481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robustness of regularized linear classification methods in text categorization","authors":"Jian Zhang, Yiming Yang","doi":"10.1145/860435.860471","DOIUrl":"https://doi.org/10.1145/860435.860471","url":null,"abstract":"Real-world applications often require the classification of documents under conditions of a small number of features, mislabeled documents, and rare positive examples. This paper investigates the robustness of three regularized linear classification methods (SVM, ridge regression and logistic regression) under these conditions. We compare these methods in terms of their loss functions and score distributions, and establish the connection between their optimization problems and generalization error bounds. Several sets of controlled experiments on the Reuters-21578 corpus are conducted to investigate the robustness of these methods. Our results show that ridge regression seems to be the most promising candidate for rare class problems.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128468546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image classification using hybrid neural networks","authors":"Chih-Fong Tsai, K. McGarry, J. Tait","doi":"10.1145/860435.860536","DOIUrl":"https://doi.org/10.1145/860435.860536","url":null,"abstract":"Use of semantic content is one of the major issues which needs to be addressed for improving image retrieval effectiveness. We present a new approach to classify images based on the combination of image processing techniques and hybrid neural networks. Multiple keywords are assigned to an image to represent its main contents, i.e. semantic content. Images are divided into a number of regions and colour and texture features are extracted. The first classifier, a self-organising map (SOM) clusters similar images based on the extracted features. Then, regions of the representative images of these clusters were labeled and used to train the second classifier, composed of several support vector machines (SVMs). Initial experiments on the accuracy of keyword assignment for a small vocabulary are reported.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130441366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An empirical study on retrieval models for different document genres: patents and newspaper articles","authors":"Makoto Iwayama, Atsushi Fujii, N. Kando, Yuzo Marukawa","doi":"10.1145/860435.860482","DOIUrl":"https://doi.org/10.1145/860435.860482","url":null,"abstract":"Reflecting the rapid growth in the utilization of large test collections for information retrieval since the 1990s, extensive comparative experiments have been performed to explore the effectiveness of various retrieval models. However, most collections were intended for retrieving newspaper articles and technical abstracts. In this paper, we describe the process of producing a test collection for patent retrieval, the NTCIR-3 Patent Retrieval Collection, which includes two years of Japanese patent applications and 31 topics produced by professional patent searchers. We also report experimental results obtained by using this collection to re-examine the effectiveness of existing retrieval models in the context of patent retrieval. The relative superiority among existing retrieval models did not significantly differ depending on the document genre, that is, patents and newspaper articles. Issues related to patent retrieval are also discussed.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125743978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SE-LEGO: creating metasearch engines on demand","authors":"Zonghuan Wu, Vijay V. Raghavan, Chun Du, C. KomanduruSai, W. Meng, Hai He, Clement T. Yu","doi":"10.1145/860435.860555","DOIUrl":"https://doi.org/10.1145/860435.860555","url":null,"abstract":"As a system that provides unified access to multiple existing search systems, a metasearch engine can relieve ordinary users of the formidable task of identifying useful sources and searching them individually. At present, the largest metasearch engines such as ProFusion (www.profusion.com) and SavvySearch (www.search.com) can connect to about 1,000 search engines. This means that only a small fraction of the information sources on the Web, including both the Surface Web and the Deep Web, are connected, as the number of such sources is estimated to be on the order of hundreds of thousands [1]. Most of these Websites have their own search capabilities and provide search interfaces. Many of these Websites provide high quality information that has been frequently queried by specialists and researchers in particular fields. Present major metasearch engines usually do not connect to these specialized Websites. Currently, building a metasearch engine is an expensive and labor-intensive job that needs diverse expertise. As a result, it is difficult for an ordinary Web user to create a metasearch engine based on the search engines of the user’s choice. 
Some metasearch engine companies (e.g., ProFusion) allow users to build customized metasearch engines, but only search engines in a pre-compiled list can be used because the capability to connect to these search engines needs to be established in advance.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"97 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134086030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond independent relevance: methods and evaluation metrics for subtopic retrieval","authors":"ChengXiang Zhai, William W. Cohen, J. Lafferty","doi":"10.1145/860435.860440","DOIUrl":"https://doi.org/10.1145/860435.860440","url":null,"abstract":"We present a non-traditional retrieval problem we call subtopic retrieval. The subtopic retrieval problem is concerned with finding documents that cover many different subtopics of a query topic. In such a problem, the utility of a document in a ranking is dependent on other documents in the ranking, violating the assumption of independent relevance which is assumed in most traditional retrieval methods. Subtopic retrieval poses challenges for evaluating performance, as well as for developing effective algorithms. We propose a framework for evaluating subtopic retrieval which generalizes the traditional precision and recall metrics by accounting for intrinsic topic difficulty as well as redundancy in documents. We propose and systematically evaluate several methods for performing subtopic retrieval using statistical language models and a maximal marginal relevance (MMR) ranking strategy. 
A mixture model combined with query likelihood relevance ranking is shown to modestly outperform a baseline relevance ranking on a data set used in the TREC interactive track.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132670085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Re-examining the potential effectiveness of interactive query expansion","authors":"I. Ruthven","doi":"10.1145/860435.860475","DOIUrl":"https://doi.org/10.1145/860435.860475","url":null,"abstract":"Much attention has been paid to the relative effectiveness of interactive query expansion versus automatic query expansion. Although interactive query expansion has the potential to be an effective means of improving a search, in this paper we show that, on average, human searchers are less likely than systems to make good expansion decisions. To enable good expansion decisions, searchers must have adequate instructions on how to use interactive query expansion functionalities. We show that simple instructions on using interactive query expansion do not necessarily help searchers make good expansion decisions and discuss difficulties found in making query expansion decisions.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114820682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}