{"title":"CoBWeb-a crawler for the Brazilian Web","authors":"A. D. Silva, Eveline Veloso, P. B. Golgher, B. Ribeiro-Neto, Alberto H. F. Laender, N. Ziviani","doi":"10.1109/SPIRE.1999.796594","DOIUrl":"https://doi.org/10.1109/SPIRE.1999.796594","url":null,"abstract":"One of the key components of current Web search engines is the document collector. The paper describes CoBWeb, an automatic document collector whose architecture is distributed and highly scalable. CoBWeb aims at collecting large amounts of documents per time period while observing operational and ethical limits in the crawling process. CoBWeb is part of the SIAM (Information Systems in Mobile Computing Environments) search engine which is being implemented to support the Brazilian Web. Thus, several results related to the Brazilian Web are presented.","PeriodicalId":131279,"journal":{"name":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124668121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A unifying framework for compressed pattern matching","authors":"T. Kida, Yusuke Shibata, M. Takeda, A. Shinohara, S. Arikawa","doi":"10.1109/SPIRE.1999.796582","DOIUrl":"https://doi.org/10.1109/SPIRE.1999.796582","url":null,"abstract":"We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary based compressions, and propose a compressed pattern matching algorithm for the framework. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW) (J. Ziv and A. Lempel, 1978), byte-pair encoding, and the static dictionary based method. Technically, our pattern matching algorithm extends that for LZW compressed text presented by A. Amir et al. (1996).","PeriodicalId":131279,"journal":{"name":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133626698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fast algorithm on average for all-against-all sequence matching","authors":"Ricardo Baeza-Yates, G. Gonnet","doi":"10.1109/SPIRE.1999.796573","DOIUrl":"https://doi.org/10.1109/SPIRE.1999.796573","url":null,"abstract":"We present an algorithm which attempts to align pairs of subsequences from a database of genetic sequences. The algorithm simulates the classical dynamic programming alignment algorithm over a suffix array of the database. We provide a detailed average case analysis which shows that the running time of the algorithm is subquadratic with respect to the database size. A similar algorithm solves the approximate string matching problem in sublinear average time.","PeriodicalId":131279,"journal":{"name":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131358470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The EC query language applied to old manuscripts","authors":"J. Vegas, P. Fuente, Ricardo Baeza-Yates","doi":"10.1109/SPIRE.1999.796597","DOIUrl":"https://doi.org/10.1109/SPIRE.1999.796597","url":null,"abstract":"We show the possibilities of the EC query language in a very structured environment as a catalog of old manuscripts. The EC language can deal with simple queries and with more complex ones, as approximate searches. We have done two classes of experiments. The first one shows that the structure does not change the statistical behaviour of the system with regard to the frequency of the words. The second kind of experiments tends to show the statistical behaviour of the database when we use different structural elements in the queries.","PeriodicalId":131279,"journal":{"name":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116826472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effects of term segmentation on Chinese/English cross-language information retrieval","authors":"Douglas W. Oard, Jianqiang Wang","doi":"10.1109/SPIRE.1999.796590","DOIUrl":"https://doi.org/10.1109/SPIRE.1999.796590","url":null,"abstract":"The majority of recent Cross-Language Information Retrieval (CLIR) research has focused on European languages. CLIR problems that involve East Asian languages such as Chinese introduce additional challenges, because written Chinese texts lack boundaries between terms. The paper examines three Chinese segmentation techniques in combination with two variants of dictionary-based Chinese to English query translation. The results indicate that failure to segment terms, particularly technical terms and names, can have a cascading effect that reduces retrieval effectiveness. Task-tuned segmentation algorithms and alternative term weighting strategies are suggested as productive directions for future work.","PeriodicalId":131279,"journal":{"name":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130270014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bounds for parametric sequence comparison","authors":"David Fernández-Baca, T. Seppäläinen, G. Slutzki","doi":"10.1109/SPIRE.1999.796578","DOIUrl":"https://doi.org/10.1109/SPIRE.1999.796578","url":null,"abstract":"We consider the problem of computing a global alignment between two or more sequences subject to varying mismatch and indel penalties. We prove a tight $3(n/2\\pi)^{2/3} + O(n^{1/3} \\log n)$ bound on the worst-case number of distinct optimum alignments for two sequences of length $n$ as the parameters are varied. This refines an $O(n^{2/3})$ upper bound by D. Gusfield et al. (1994). Our lower bound requires an unbounded alphabet. For strings over a binary alphabet, we prove an $\\Omega(n^{1/2})$ lower bound. For the parametric global alignment of $k \\geq 2$ sequences under sum-of-pairs scoring, we prove a $3((k/2)n/2\\pi)^{2/3} + O(k^{2/3} n^{1/3} \\log n)$ upper bound on the number of distinct optimality regions and an $\\Omega(n^{2/3})$ lower bound. Based on experimental evidence, we conjecture that for two random sequences, the number of optimality regions is approximately $\\sqrt{n}$ with high probability.","PeriodicalId":131279,"journal":{"name":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129917907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Top-down extraction of semi-structured data","authors":"B. Ribeiro-Neto, Alberto H. F. Laender, A. D. Silva","doi":"10.1109/SPIRE.1999.796593","DOIUrl":"https://doi.org/10.1109/SPIRE.1999.796593","url":null,"abstract":"We propose an innovative approach to extracting semi-structured data from Web sources. The idea is to collect a couple of example objects from the user and to use this information to extract new objects from new pages or texts. We propose a top-down strategy that extracts complex objects, decomposing them in objects less complex, until atomic objects have been extracted. Through experimentation, we demonstrate that with a small number of given examples, our strategy is able to extract most of the objects present in a Web source given as input.","PeriodicalId":131279,"journal":{"name":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122135369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Emotional awareness in collaborative systems","authors":"O. García, J. Favela, R. Machorro","doi":"10.1109/SPIRE.1999.796607","DOIUrl":"https://doi.org/10.1109/SPIRE.1999.796607","url":null,"abstract":"Emotions play an important role in human interaction. Both, our own emotional state and our perception of that of others with which we collaborate influence the outcome of cooperative work. With the growing interest in providing computational support for the recognition and representation of emotions, there is a clear interest in adding such facilities to groupware systems and to evaluate the positive and negative effects of using this additional channel of communication. We discuss the issues involved in supporting a new type of collaborative awareness in groupware, namely, emotional awareness. We also present two emotion-based sample applications, and discussion to further motivate work in this area within the collaborative community.","PeriodicalId":131279,"journal":{"name":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132655508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient method for in memory construction of suffix arrays","authors":"Hideo Itoh, Hozumi Tanaka","doi":"10.1109/SPIRE.1999.796581","DOIUrl":"https://doi.org/10.1109/SPIRE.1999.796581","url":null,"abstract":"The suffix array is a string-indexing structure and a memory efficient alternative to the suffix tree. It has many advantages for text processing. We propose an efficient algorithm for sorting suffixes. We call this algorithm the two-stage suffix sort. One of our ideas is to exploit the specific relationships between adjacent suffixes. Our algorithm makes it possible to use the suffix array for much larger texts and suggests new areas of application. Our experiments on several text data sets (including 514-MB Japanese newspapers) demonstrate that our algorithm is 4.5 to 6.9 times faster than Quicksort, and 2.5 to 3.6 times faster than K. Sadakane's (1998) algorithm, which is considered to be the fastest algorithm in previous work.","PeriodicalId":131279,"journal":{"name":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125570310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"String-oriented databases","authors":"A. Rajasekar","doi":"10.1109/SPIRE.1999.796591","DOIUrl":"https://doi.org/10.1109/SPIRE.1999.796591","url":null,"abstract":"Relational databases and Datalog view each attribute as indivisible. This view, though useful in several applications, does not provide a suitable database paradigm for use in genetic, multimedia or scientific databases. Data in these applications are unstructured; querying on sub-strings of attribute values is often necessary. Moreover due to imprecision and incompleteness in the data, approximate reasoning also becomes indispensable. Our aim is to view strings as database objects that can be compared, divided, subsumed, interpreted and approximated. Allowing such operations on strings enriches the semantics and increases the expressive power of database languages. We develop an extension to the relational algebra, augmenting it with the concept of a string expression with a rich structure of string variables, mapping functions, interpreted string operations and approximate evaluations. We study properties of such expressions and show that many of the well-known properties of relational algebra hold in the extension. We also discuss an extension to Datalog(String) and an implementation of a prototype system called S-log. S-log integrates pattern matching in Datalog framework. We contend that string oriented database systems would be useful in applications that require efficient sub-structure analysis, such as aligning DNA strings using motifs, retrieving and synthesizing iconic images based on content.","PeriodicalId":131279,"journal":{"name":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129265384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}