{"title":"Question classification using support vector machines","authors":"Dell Zhang, Wee Sun Lee","doi":"10.1145/860435.860443","DOIUrl":"https://doi.org/10.1145/860435.860443","url":null,"abstract":"Question classification is very important for question answering. This paper presents our research work on automatic question classification through machine learning approaches. We have experimented with five machine learning algorithms: Nearest Neighbors (NN), Naive Bayes (NB), Decision Tree (DT), Sparse Network of Winnows (SNoW), and Support Vector Machines (SVM) using two kinds of features: bag-of-words and bag-of-ngrams. The experiment results show that with only surface text features the SVM outperforms the other four methods for this task. Further, we propose to use a special kernel function called the tree kernel to enable the SVM to take advantage of the syntactic structures of questions. We describe how the tree kernel can be computed efficiently by dynamic programming. The performance of our approach is promising, when tested on the questions from the TREC QA track.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129393616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing term vectors for efficient and robust filtering","authors":"David A. Evans, Jeffrey Bennett, David A. Hull","doi":"10.1145/860435.860546","DOIUrl":"https://doi.org/10.1145/860435.860546","url":null,"abstract":"We describe an efficient, robust method for selecting and optimizing terms for a classification or filtering task. Terms are extracted from positive examples in training data based on several alternative term-selection algorithms, then combined additively after a simple term-score normalization step to produce a merged and ranked master term vector. The score threshold for the master vector is set via beta-gamma regulation over all the available training data. The process avoids parameter calibrations and protracted training. It also results in compact profiles for run-time evaluation of test (new) documents. Results on TREC-2002 filtering-task datasets demonstrate substantial improvements over TREC-median results and rival both idealized IR-based results and optimized (and expensive) SVM-based classifiers in general effectiveness.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124634033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparative study on content-based music genre classification","authors":"Tao Li, M. Ogihara, Qi Li","doi":"10.1145/860435.860487","DOIUrl":"https://doi.org/10.1145/860435.860487","url":null,"abstract":"Content-based music genre classification is a fundamental component of music information retrieval systems and has been gaining importance and enjoying a growing amount of attention with the emergence of digital music on the Internet. Currently little work has been done on automatic music genre classification, and in addition, the reported classification accuracies are relatively low. This paper proposes a new feature extraction method for music genre classification, DWCHs. DWCHs stands for Daubechies Wavelet Coefficient Histograms. DWCHs capture the local and global information of music signals simultaneously by computing histograms on their Daubechies wavelet coefficients. Effectiveness of this new feature and of previously studied features are compared using various machine learning classification algorithms, including Support Vector Machines and Linear Discriminant Analysis. It is demonstrated that the use of DWCHs significantly improves the accuracy of music genre classification.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116795151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Searchers' criteria for assessing web pages","authors":"A. Tombros, I. Ruthven, J. Jose","doi":"10.1145/860435.860513","DOIUrl":"https://doi.org/10.1145/860435.860513","url":null,"abstract":"We investigate the criteria used by online searchers when assessing the relevance of web pages to information-seeking tasks. Twenty-four searchers were given three tasks each, and indicated the features of web pages which they employed when deciding about the usefulness of the pages. These tasks were presented within the context of a simulated work-task situation. The results of this study provide a set of criteria used by searchers to decide about the utility of web pages. Such criteria have implications for the design of systems that use or recommend web pages, as well as for authors of web pages.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131230067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Search strategies in content-based image retrieval","authors":"Sharon McDonald, J. Tait","doi":"10.1145/860435.860452","DOIUrl":"https://doi.org/10.1145/860435.860452","url":null,"abstract":"This paper describes two studies that looked at users' ability to formulate visual queries with a Content-Based Image Retrieval system that uses dominant image colour as the primary indexing key. The first experiment examined users' performance with two visual search tools, a sketch tool and a structured browsing tool, with different types of image query. The results showed that while users were able to successfully search on the basis of colour, and were able to formulate visual queries, their ability to do so was affected by search task type. Search task type was also shown to be related to search tool choice. However, the results of study two showed that while users were able to complete all of the tasks, there was evidence to suggest that a degree of compromise was present in the users' choice of image that was largely due to problems relating to query formulation.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134109418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An investigation of broad coverage automatic pronoun resolution for information retrieval","authors":"Richard J. Edens, Helen Gaylard, G. Jones, Adenike M. Lam-Adesina","doi":"10.1145/860435.860511","DOIUrl":"https://doi.org/10.1145/860435.860511","url":null,"abstract":"Term weighting methods have been shown to give significant increases in information retrieval performance. The presence of pronominal references in documents reduces the term frequencies of associated words with a consequent effect on term weights and information retrieval behaviour. This investigation explores the impact on information retrieval performance of broad coverage automatic pronoun resolution. Results indicate that this approach has potential to improve both precision at fixed cutoff levels and average precision.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134488329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Salton Award Lecture - Information retrieval and computer science: an evolving relationship","authors":"W. Bruce Croft","doi":"10.1145/860435.860437","DOIUrl":"https://doi.org/10.1145/860435.860437","url":null,"abstract":"Following the tradition of these acceptance talks, I will be giving my thoughts on where our field is going. Any discussion of the future of information retrieval (IR) research, however, needs to be placed in the context of its history and relationship to other fields. Although IR has had a very strong relationship with library and information science, its relationship to computer science (CS) and its relative standing as a sub-discipline of CS has been more dynamic. IR is quite an old field, and when a number of CS departments were forming in the 60s, it was not uncommon for a faculty member to be pursuing research related to IR. Early ACM curriculum recommendations for CS contained courses on information retrieval, and encyclopedias described IR and database systems as different aspects of the same field. By the 70s, there were only a few IR researchers in CS departments in the U.S., database systems was a separate (and thriving) field, and many felt that IR had stagnated and was largely irrelevant. The truth, in fact, was far from that. The IR research community was a small, but dedicated, group of researchers in the U.S. and Europe who were motivated by a desire to understand the process of information retrieval and to build systems that would help people find the right information in text databases. This was (and is) a hard goal and led to different evaluation metrics and methodologies than the database community. Progress in the field was hampered by a lack of large-scale testbeds and tests were limited to databases containing at most a few hundred document abstracts. In the 80s AI boom, IR was still not a mainstream area, despite its focus on a human task involving natural language. IR focused on a statistical approach to language rather than the much more popular knowledge-based approach. The fact that IR conferences mix papers on effectiveness as measured by human judgments with papers measuring performance of file organizations for large-scale systems has meant that IR has always been difficult to classify into simple categories such as \"systems\" or \"AI\" that are often used in CS departments. Since the early 90s, just about everything has changed. Large, full-text databases were finally made available for experimentation through DARPA funding and TREC. This has had an enormous positive impact on the quantity and quality of IR research. The advent of the Web search engine has validated the longstanding claims made by IR researchers that simple queries and ranking were the right techniques for information access in a largely unstructured information world. What has not changed is that there are still relatively few IR researchers in CS departments. There are, however, many more people in CS departments doing IR-related research, which is just about the same thing. Conferences in databases, machine learning, computational linguistics, and data mining publish a number of IR papers done by people who would not primarily consider themselves a","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128746201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A frequency-based and a poisson-based definition of the probability of being informative","authors":"T. Roelleke","doi":"10.1145/860435.860478","DOIUrl":"https://doi.org/10.1145/860435.860478","url":null,"abstract":"This paper reports on theoretical investigations about the assumptions underlying the inverse document frequency (idf). We show that an intuitive idf-based probability function for the probability of a term being informative assumes disjoint document events. By assuming documents to be independent rather than disjoint, we arrive at a Poisson-based probability of being informative. The framework is useful for understanding and deciding the parameter estimation and combination in probabilistic retrieval models.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129248414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of anchor text for web search","authors":"Nadav Eiron, K. McCurley","doi":"10.1145/860435.860550","DOIUrl":"https://doi.org/10.1145/860435.860550","url":null,"abstract":"It has been observed that anchor text in web documents is very useful in improving the quality of web text search for some classes of queries. By examining properties of anchor text in a large intranet, we hope to shed light on why this is the case. Our main premise is that anchor text behaves very much like real user queries and consensus titles. Thus an understanding of how anchor text is related to a document will likely lead to better understanding of how to translate a user’s query into high quality search results. Our approach is experimental, based on a study of a large corporate intranet, including the content as well as a large stream of queries against that content. We conduct experiments to investigate several aspects of anchor text, including their relationship to titles, the frequency of queries that can be satisfied by anchortext alone, and the homogeneity of results fetched by anchor text.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114844352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"User-assisted query translation for interactive CLIR","authors":"Daqing He, Jianqiang Wang, Douglas W. Oard, Michael Nossal","doi":"10.1145/860435.860552","DOIUrl":"https://doi.org/10.1145/860435.860552","url":null,"abstract":"We view interactive Cross-Language Information Retrieval (CLIR) as an iterative process in which the searcher and the retrieval system collaborate to find documents that satisfy the searcher’s needs, regardless of the language in which those documents are written. Our motivation is that humans and machines can bring complementary strengths to this process. Machines are excellent at repetitive tasks that are well specified; humans bring creativity and exceptional pattern recognition capabilities. Properly coupling these capabilities can result in a synergy that greatly exceeds the ability of either human or machine alone. Designers of CLIR systems can select from a variety of fully automatic techniques to overcome problems with unknown terms and translation ambiguity, but automatic processing of this sort risks reducing the searcher’s understanding of system operation. This, in turn, tends to work against the synergy that we seek to accomplish. We are therefore exploring more transparent approaches to support interactive CLIR.","PeriodicalId":209809,"journal":{"name":"Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval","volume":"20 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126200000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}