{"title":"Contextual Information for Named Entity Recognition in Biomedical Texts","authors":"R. Goulart, Vera Lúcia Strube de Lima","doi":"10.1109/STIL.2009.28","DOIUrl":"https://doi.org/10.1109/STIL.2009.28","url":null,"abstract":"This article presents a study on Named Entities (NE) recognition using contextual information present on a Biomedical corpus. Related work indicates that the use of context (words surrounding a word) can assist the NE recognition. This work presents experimental results to evaluate the impact of different context settings, using machine learning, for the NE recognition.","PeriodicalId":265848,"journal":{"name":"2009 Seventh Brazilian Symposium in Information and Human Language Technology","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124207013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical Machine Translation: Little Changes Big Impacts","authors":"Helena de Medeiros Caseli, Israel Aono Nunes","doi":"10.1109/STIL.2009.24","DOIUrl":"https://doi.org/10.1109/STIL.2009.24","url":null,"abstract":"In this paper we describe some experiments carried out to test the impact of automatic casing and punctuation changes when training and testing statistical translation models. The experiments described here concern the translation from/to English and Brazilian Portuguese texts but since the superficial changes investigated are language independent, we believe that the conclusions can be applied to many other pairs of languages. These experiments were designed aiming at setting a baseline scenario for future training and testing of more complex statistical translation models such as the factored ones. From the experiments presented here it is possible to see that case and punctuation changes have a significant impact on automatic translation results.","PeriodicalId":265848,"journal":{"name":"2009 Seventh Brazilian Symposium in Information and Human Language Technology","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123574320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of Stopwords Removal on the Statistical Approach for Automatic Term Extraction","authors":"Í. Braga","doi":"10.1109/STIL.2009.8","DOIUrl":"https://doi.org/10.1109/STIL.2009.8","url":null,"abstract":"The construction of terminological products is important to the organization and spreading of knowledge. This task can be leveraged by the automatic extraction of terms, which has been considered a Natural Language Processing problem. In this paper, the interaction between the statistical approach to term extraction and the process of stopword removal is investigated. Experiments conducted on two corpora show that stopword removal improves performance when extracting bigram terms, no matter if the removal is done before or after the application of a statistical metric. As a result of this investigation, it is possible to recommend more appropriate statistical metrics for the case where it is possible to remove stopwords and for the case that this removal cannot be done.","PeriodicalId":265848,"journal":{"name":"2009 Seventh Brazilian Symposium in Information and Human Language Technology","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128512644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimedia Collections of Indigenous Languages: An Organization Proposal","authors":"Ellison Cleyton Barbosa dos Santos","doi":"10.1109/STIL.2009.7","DOIUrl":"https://doi.org/10.1109/STIL.2009.7","url":null,"abstract":"This paper describes the Sistema de Informação do Acervo de Línguas Indígenas (SIALI), a database designed to organize the storage of linguistic and ethnographic media data. The database was implemented in MS Access and offers a personalized mechanism for controlling the organization and storage of data, based on library techniques. The physical design presented here identifies the core configuration required in a database, including organization, storage, management, retrieval of data, and other features that are important for a database on storage media.","PeriodicalId":265848,"journal":{"name":"2009 Seventh Brazilian Symposium in Information and Human Language Technology","volume":"819 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127298326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Token Classification Approach to Dependency Parsing","authors":"R. Milidiú, C. M. P. Crestana, C. D. Santos","doi":"10.1109/STIL.2009.29","DOIUrl":"https://doi.org/10.1109/STIL.2009.29","url":null,"abstract":"The Dependency-based syntactic parsing task consists in identifying a head word for each word in an input sentence. Hence, its output is a rooted tree where the nodes are the words in the sentence. State-of-the-art dependency parsing systems use transition-based or graph-based models. We present a token classification approach to dependency parsing, where any classification algorithm can be used. To evaluate its effectiveness, we apply the Entropy Guided Transformation Learning algorithm to the CoNLL 2006 corpus, using the Unlabelled Attachment Score as the accuracy metric. Our results show that the generated models are close to the average CoNLL system performance. Additionally, these findings also indicate that the token classification approach is a promising one.","PeriodicalId":265848,"journal":{"name":"2009 Seventh Brazilian Symposium in Information and Human Language Technology","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127492785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenges to the Creation of a Frame-Based Lexicon for the Portuguese Language: A study of the Judgement and Assessing Frames","authors":"A. Bertoldi, R. Chishman","doi":"10.1109/STIL.2009.40","DOIUrl":"https://doi.org/10.1109/STIL.2009.40","url":null,"abstract":"This paper presents a comparative study of Judgment and Assessing frames in English and Portuguese. The aim is to verify the possibility of using the FrameNet frames to construct a lexical database for Brazilian Portuguese. The research corpus is composed of 50 legal documents, totaling 1,055,535 tokens and 39,108 types. Through a contrastive method the Judgment and Assessing frames were selected and translation equivalents for the English lexical units were established. The points considered in this research were the polysemy and the semantic relations of words. Polysemy is the main difficulty in applying FrameNet frames to the description of Portuguese.","PeriodicalId":265848,"journal":{"name":"2009 Seventh Brazilian Symposium in Information and Human Language Technology","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125965381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The C-ORAL-BRASIL Corpus: Methodological Basis for the Treatment of Spontaneous Speech","authors":"M. Mittmann, Tommaso Raso, Heliana Mello","doi":"10.1109/STIL.2009.22","DOIUrl":"https://doi.org/10.1109/STIL.2009.22","url":null,"abstract":"This paper highlights the primary methods employed in the C-ORAL-BRASIL compiling process, i.e., recording, transcribing and segmenting oral texts. The C-ORAL-BRASIL is a Brazilian Portuguese corpus of spontaneous speech, designed for the study of informational structure. It is representative of the diaphasic variation, seeking to cover as many different communicative situations as possible. This paper presents and exemplifies the processes of transcription and segmentation of speech into prosodic units as employed in our on-going research. It concludes with illustrations of some questions that the corpus will enable us to answer.","PeriodicalId":265848,"journal":{"name":"2009 Seventh Brazilian Symposium in Information and Human Language Technology","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121680529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the Extraction of Semantic Relations between Portuguese Words by Means of a Dictionary","authors":"Hugo Gonçalo Oliveira, Diana Santos, P. Gomes","doi":"10.1109/STIL.2009.30","DOIUrl":"https://doi.org/10.1109/STIL.2009.30","url":null,"abstract":"This paper presents PAPEL, a lexical resource for Portuguese, consisting of relations between terms, extracted by (semi) automatic means from a general dictionary. After a short overview of the building process, a quantitative overview is given together with some examples. Evaluation is then presented and discussed: for synonymy, we used a public thesaurus, Tep, for the other relations, we queried Portuguese corpora through the AC/DC interface.","PeriodicalId":265848,"journal":{"name":"2009 Seventh Brazilian Symposium in Information and Human Language Technology","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126826995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Studying Portuguese as Used: the AC/DC service","authors":"L. Costa, Diana Santos, Paulo Rocha","doi":"10.1109/STIL.2009.25","DOIUrl":"https://doi.org/10.1109/STIL.2009.25","url":null,"abstract":"The AC/DC service has been giving access to Portuguese corpora through the Web since 1999. This paper describes the tasks related to processing and making the texts publicly available. It also provides an overview on the interface with which the users can query the corpora and finalizes pointing future directions. O AC/DC é um serviço que desde 1999 dá acesso a corpos em português através da Internet. Neste artigo descrevemos sucintamente o processo pelo qual os textos são processados e tornados públicos e a interface através da qual se podem fazer as pesquisas. Concluímos lançando pontes para o desenvolvimento futuro deste serviço.","PeriodicalId":265848,"journal":{"name":"2009 Seventh Brazilian Symposium in Information and Human Language Technology","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132817132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Text Classification by Literary Period Using PPM-C Data Compression","authors":"B. Barufaldi, E. F. Santana, José Rogério B. B. Filho, J. V. D. Poel, Milton Marques Júnior, L. Batista","doi":"10.1109/STIL.2009.39","DOIUrl":"https://doi.org/10.1109/STIL.2009.39","url":null,"abstract":"Methods and techniques for data compression have been used for pattern recognition, including automatic text classification. The performance of Prediction by Partial Matching (PPM) as a text classifier has already been demonstrated by many works, including authorship attribution for Portuguese texts. The classes involved in the classification process need not be restricted to a single author. By including two or more authors in one class, one can create a literature style. This work presents a literature style classifier for texts from Brazilian literature by using the PPM-C statistical model.","PeriodicalId":265848,"journal":{"name":"2009 Seventh Brazilian Symposium in Information and Human Language Technology","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131421797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}