{"title":"Genome rearrangements distance by fusion, fission, and transposition is easy","authors":"Zanoni Dias, J. Meidanis","doi":"10.1109/SPIRE.2001.989776","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989776","url":null,"abstract":"Given two genomes represented as circularly ordered sequences of genes, we show a polynomial time algorithm for the minimum weight series of fusions, fissions, and transpositions (with transpositions weighing twice as much as fusions and fissions) that transforms one genome into the other. The algorithm is based on classical results of permutation group theory and is the first polynomial result for a genome rearrangement problem involving transpositions. It has been observed in real biological instances that transpositions occur with about half the frequency of reversals. Although we are not using reversals in this study, this observation motivated the double weight assigned to transpositions.","PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122214562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A stemming algorithm for the portuguese language","authors":"Viviane Moreira Orengo, C. Huyck","doi":"10.1109/SPIRE.2001.989755","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989755","url":null,"abstract":"Stemming algorithms are traditionally used in Information Retrieval with the goal of enhancing recall, as they conflate the variant forms of a word into a common representation. This paper describes the development of a simple and effective suffix-stripping algorithm for Portuguese. The stemmer is evaluated using a method proposed by Paice [9]. The results show that it performs significantly better than the Portuguese version of the Porter algorithm.","PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126870550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed query processing using partitioned inverted files","authors":"C. Badue, Ricardo Baeza-Yates, B. Ribeiro-Neto, N. Ziviani","doi":"10.1109/SPIRE.2001.989733","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989733","url":null,"abstract":"In this paper, we study query processing in a distributed text database. The novelty is a real distributed architecture implementation that offers concurrent query service. The distributed system adopts a network of workstations model and the client-server paradigm. The document collection is indexed with an inverted file. We adopt two distinct strategies of index partitioning in the distributed system, namely local index partitioning and global index partitioning. In both strategies, documents are ranked using the vector space model along with a document filtering technique for fast ranking. We evaluate and compare the impact of the two index partitioning strategies on query processing performance. Experimental results on retrieval efficiency show that, within our framework, the global index partitioning outperforms the local index partitioning.","PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131440674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adding security to compressed information retrieval systems","authors":"R. Milidiú, C. G. Mello, José Rodrigues Fernandes","doi":"10.1109/SPIRE.2001.989778","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989778","url":null,"abstract":"Word-based Huffman coding has widespread use in information retrieval systems. Besides its compressing power, it also enables the implementation of both indexing and searching schema in the compressed file. In this work, an algorithm that adds security to compressed data is proposed. It shows a small loss in coding, decoding and compression performances. The algorithm uses homophonic substitution, canonical Huffman codes and a secret key for enciphering.","PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130273732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Storing semistructured data in relational databases","authors":"K. V. Magalhães, Alberto H. F. Laender, A. D. Silva","doi":"10.1109/SPIRE.2001.989749","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989749","url":null,"abstract":"This paper presents an approach to storing semistructured data in relational databases. We focus on semistructured data as extracted from Web pages by a tool called DEByE (Data Extraction By Example), and organized according to its data model, the DEByE Object Model (DEByEOM). The approach presented here consists in representing the structure of the objects extracted by DEByE by a relational schema and populating the corresponding database accordingly. We also show how to retrieve such objects by automatically transforming high-level query specifications (query patterns) into SQL queries that are executed over the relational database. Results of experiments carried out to evaluate our approach are also described.","PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132298584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speed-up of Aho-Corasick pattern matching machines by rearranging states","authors":"T. Nishimura, S. Fukamachi, T. Shinohara","doi":"10.1109/SPIRE.2001.989753","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989753","url":null,"abstract":"This article describes a speed-up of string pattern matching obtained by rearranging states in the Aho-Corasick pattern matching machine, which is a kind of finite automaton. We realized a speed-up of string pattern matching using data compression. Although a finite state model yields a higher compression ratio, it does not lead to a speed-up of string pattern matching, because the pattern matching machine becomes very large when the compression codes are complex. Frequently used states are scattered across random access memory (RAM). Such states are close to the initial state of the pattern matching machine. We rearrange states so as to collect frequently used states together for CPU cache efficiency. We renumber states in breadth-first order. In experiments, the elapsed time is reduced to about 55% in the case of a compressed English text.","PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122165342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On using two-phase filtering in indexed approximate string matching with application to searching unique oligonucleotides","authors":"H. Hyyro","doi":"10.1109/spire.2001.989742","DOIUrl":"https://doi.org/10.1109/spire.2001.989742","url":null,"abstract":"We discuss using an indexing scheme to accelerate approximate search over a static text in the case of using unit cost edit distance as the measure of similarity between strings. First we generally consider the filtering criteria that can be used as a basis for the index, and then propose using filtering twice before the final checking phase. The last part consists of presenting an indexed approximate string matching application in bioinformatics, which is the search of unique oligonucleotides. We present practical comparisons and results for using different filtering schemes in this application. Our tests have involved a total of 15 different genomes, from which we present some results involving the largest two of these: The genome of Saccharomyces cerevisiae (baker's yeast) and a recent draft of the human genome, the latter being also the main target of the application.","PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121070692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparative study of topic identification on newspaper and e-mail","authors":"B. Bigi, A. Brun, J. Haton, K. Smaïli, I. Zitouni","doi":"10.1109/SPIRE.2001.989770","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.989770","url":null,"abstract":"This work presents several statistical methods for topic identification on two kinds of textual data: newspaper articles and e-mails. Five methods are tested on these two corpora: topic unigrams, cache model, TFIDF classifier, topic perplexity, and weighted model. Our work aims to study these methods by confronting them with very different data. This study is very fruitful for our research. Statistical topic identification methods depend not only on a corpus, but also on its type. One of the methods achieves a topic identification of 80% on a general newspaper corpus but does not exceed 30% on an e-mail corpus. Another method gives the best result on e-mails, but does not behave the same way on a newspaper corpus. We also show in this paper that almost all our methods achieve good results in retrieving the first two manually annotated labels.","PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133594057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exact distribution of deletion sizes for unavoidable strings","authors":"Christine E. Heitsch","doi":"10.1109/SPIRE.2001.10014","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.10014","url":null,"abstract":"We constructively prove the exact distribution of deletion sizes for unavoidable strings, under the reductive decidability method of Zimin and Bean et al. Bounds such as these on the unique initial reductions of unavoidable strings were instrumental in proving the computational intractability of the reduction algorithm. We also provide the necessary supporting results, including some useful approximations on the deletion sizes of individual strings. This work improves upon previous results that, although sufficient to establish the desired exponential lower bound, were far from optimal.","PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132638381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic thesaurus for automatic expanded query in information retrieval","authors":"Marco González, Vera Lúcia Strube de Lima","doi":"10.1109/SPIRE.2001.10023","DOIUrl":"https://doi.org/10.1109/SPIRE.2001.10023","url":null,"abstract":"This article proposes (a) a semantic structuring for thesauri and (b) a procedure that handles it, for automatic query expansion in information retrieval. The thesaurus for this experiment was built manually, based on a traditional dictionary, adopting aspects from the Generative Lexicon Theory by James Pustejovsky as well as concepts from object oriented software modeling. We show how to select new terms for query expansion and to calculate their weights. This last task is performed according to intersections of the derived lexical sets and to the depth level for descriptors search with respect to each considered term. Also, an evaluation of the use of these resources is presented.","PeriodicalId":107511,"journal":{"name":"Proceedings Eighth Symposium on String Processing and Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115286041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}