Linguamatica Pub Date: 2016-12-31 DOI: 10.21814/LM.8.2.230
N. Hartmann
{"title":"Solo Queue at ASSIN: Combinando Abordagens Tradicionais e Emergentes","authors":"N. Hartmann","doi":"10.21814/LM.8.2.230","DOIUrl":"https://doi.org/10.21814/LM.8.2.230","url":null,"abstract":"In this paper we present a proposal to automatically label the similarity between a pair of sentences and the results obtained on ASSIN 2016 sentence similarity shared-task. Our proposal consists of using a classical feature of bag-of-words, the TF-IDF model; and an emergent feature, obtained from processing word embeddings. The TF-IDF is used to relate texts which share words. Word embeddings are known by capture the syntax and semantics of a word. Following Mikolov et al. (2013), the sum of embedding vectors can model the meaning of a sentence. Using both features, we are able to capture the words shared between sentences and their semantics. We use linear regression to solve this problem, once the dataset is labeled as real numbers between 1 and 5. Our results are promising. Although the usage of embeddings has not overcome our baseline system, when we combined it with TF-IDF, our system achieved better results than only using TF-IDF. Our results achieved the first collocation of ASSIN 2016 for sentence similarity shared-task applied on brazilian portuguese sentences and second collocation when applying to Portugal portuguese sentences.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"29 1","pages":"59-64"},"PeriodicalIF":0.6,"publicationDate":"2016-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68372457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linguamatica Pub Date: 2016-07-22 DOI: 10.21814/LM.8.1.227
Carlos-Emiliano González-Gallardo, Juan-Manuel Torres-Moreno, Azucena Montes Rendón, Gerardo E Sierra
{"title":"Perfilado de autor multilingüe en redes sociales a partir de n-gramas de caracteres y de etiquetas gramaticales","authors":"Carlos-Emiliano González-Gallardo, Juan-Manuel Torres-Moreno, Azucena Montes Rendón, Gerardo E Sierra","doi":"10.21814/LM.8.1.227","DOIUrl":"https://doi.org/10.21814/LM.8.1.227","url":null,"abstract":"En este articulo presentamos un algoritmo que combina las caracteristicas estilisticas representadas por los n-gramas de caracteres y los n-gramas de etiquetas gramaticales (POS) para clasificar documentos multilengua de redes sociales. En ambos grupos de n-gramas se aplico una normalizacion dinamica dependiente del contexto para extraer la mayor cantidad de informacion estilistica posible codificada en los documentos (emoticonos, inundamiento de caracteres, uso de letras mayusculas, referencias a usuarios, ligas a sitios externos, hashtags, etc.). El algoritmo fue aplicado sobre dos corpus diferentes: los tweets del corpus de entrenamiento de la tarea Author Profiling de PAN-CLEF 2015 y el corpus de \"Comentarios de la Ciudad de Mexico en el tiempo\" (CCDMX). Los resultados presentan una exactitud muy alta, cercana al 90%.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"8 1","pages":"21-29"},"PeriodicalIF":0.6,"publicationDate":"2016-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68372105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linguamatica Pub Date: 2016-07-22 DOI: 10.21814/LM.8.1.221
Hernani Costa, Isabel Dúran Muñoz, Gloria Corpas Pastor, Ruslan Mitkov
{"title":"Compilação de Corpos Comparáveis Especializados: Devemos sempre confiar nas Ferramentas de Compilação Semi-automáticas?","authors":"Hernani Costa, Isabel Dúran Muñoz, Gloria Corpas Pastor, Ruslan Mitkov","doi":"10.21814/LM.8.1.221","DOIUrl":"https://doi.org/10.21814/LM.8.1.221","url":null,"abstract":"Decisions at the outset of compiling a comparable corpus are of crucial importance for how the corpus is to be built and analysed later on. Several variables and external criteria are usually followed when building a corpus but little is been said about textual distributional similarity in this context and the quality that it brings to research. In an attempt to fulfil this gap, this paper aims at presenting a simple but efficient methodology capable of measuring a corpus internal degree of relatedness. To do so, this methodology takes advantage of both available natural language processing technology and statistical methods in a successful attempt to access the relatedness degree between documents. Our findings prove that using a list of common entities and a set of distributional similarity measures is enough not only to describe and assess the degree of relatedness between the documents in a comparable corpus, but also to rank them according to their degree of relatedness within the corpus.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"8 1","pages":"3-19"},"PeriodicalIF":0.6,"publicationDate":"2016-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linguamatica Pub Date: 2016-07-22 DOI: 10.21814/LM.8.1.226
Fernanda López-Escobedo, Julián Solórzano-Soto
{"title":"Propuesta de clasificación de un banco de voces con fines de identificación forense","authors":"Fernanda López-Escobedo, Julián Solórzano-Soto","doi":"10.21814/LM.8.1.226","DOIUrl":"https://doi.org/10.21814/LM.8.1.226","url":null,"abstract":"En este articulo se presenta el proyecto que se desarrolla para proponer una clasificacion de un banco de voces con fines de identificacion forense. Se expone la manera en que la informacion linguistica puede ser utilizada en una base de datos para reducir el numero de falsos positivos y falsos negativos que resultan cuando se llevan a cabo comparaciones automatizadas para la identificacion forense de voz. En particular, se abordan los fenomenos foneticos que se han propuesto para realizar una clasificacion de un banco de voces en este nivel de la lengua. A partir de esta informacion se describe como construir un modelo de base de datos y el tipo de busquedas que se espera lograr. La propuesta de generar descriptores linguisticos para la clasificacion de un banco de voces pretende ser una metodologia que permita coadyuvar en la imparticion de justicia en Mexico y otros paises de habla hispana.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"8 1","pages":"33-41"},"PeriodicalIF":0.6,"publicationDate":"2016-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linguamatica Pub Date: 2015-12-30 DOI: 10.21814/LM.7.2.215
A. Cerrudo, Á. J. Gallego, Anna Pineda, F. Roca
{"title":"ASinEs: Prolegómenos de un atlas de la variación sintáctica del español","authors":"A. Cerrudo, Á. J. Gallego, Anna Pineda, F. Roca","doi":"10.21814/LM.7.2.215","DOIUrl":"https://doi.org/10.21814/LM.7.2.215","url":null,"abstract":"espanolEn este articulo se presenta el ASinEs, una aplicacion con formato de atlas dedicada al estudio sincronico de la variacion sintactica de los geolectos del espanol. Este proyecto es innovador, ya que no existe ningun atlas dedicado exclusivamente a investigar la variacion geolectal de la sintaxis de esta lengua. La versatilidad del ASinEs permite tambien el estudio de geolectos de otros estadios del espanol, asi como los de otras lenguas con las que esta actualmente en contacto. Todo ello proporciona una potente herramienta para la investigacion en el campo de la variacion de las lenguas romanicas y no romanicas (vasco, ingles, lenguas amerindias, etc.).El desarrollo de este proyecto cuenta con la colaboracion del Centre de Linguistica Teorica (Universitat Autonoma de Barcelona), el Centro IKER con sede en Bayona (Francia) y la Real Academia Espanola. EnglishThis paper introduces the ASinEs1, an atlas-based application devoted to the study of the syntactic variation of Spanish geolects. This project is groundbreaking, as there is no other atlas exclusively devoted to study the geolectal variation of geolectal variants of Spanish. Although ASinEs was originally conceived to explore the current geolects of Spanish, its flexibility allows it to study both the geolects of previous stages and the geolects of other close-by languages. This provides us with a po-werful tool to study variation of both Romance and non-Romance languages (Basque, English, Amerindi-an languages, etc.). This project is being developed in collaboration with the Centre de Ling¨u´istica Te`orica (Universitat Aut`onoma de Barcelona), the IKER Center at Bayonne (France), and the Real Academia Espanola.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"7 1","pages":"59-69"},"PeriodicalIF":0.6,"publicationDate":"2015-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linguamatica Pub Date: 2015-12-30 DOI: 10.21814/LM.7.2.213
Fábio Santos, Hugo Gonçalo Oliveira
{"title":"Descoberta de Synsets Difusos com base na Redundância em vários Dicionários","authors":"Fábio Santos, Hugo Gonçalo Oliveira","doi":"10.21814/LM.7.2.213","DOIUrl":"https://doi.org/10.21814/LM.7.2.213","url":null,"abstract":"EnglishIn a wordnet, concepts are typically represented as groups of words, commonly known as synsets, and each membership of a word to a synset denotes a different sense of that word. However, since word senses are complex entities, without well-defined boundaries, we suggest to handle them less artificially, by representing them as fuzzy objects, where each word has its membership degree, which can be related to the confidence on using the word to denote the concept conveyed by the synset. We thus propose an approach to discover synsets from a synonymy network, ideally redundant and extracted from several broad-coverage sources. The more synonymy relations there are between two words, the higher the confidence on the semantic equivalence of at least one of their senses. The proposed approach was applied to a network extracted from three Portuguese dictionaries and resulted in a large set of fuzzy synsets. Besides describing this approach and illustrating its results, we rely on three evaluations — comparison against a handcrafted Portuguese thesaurus; comparison against the results of a previous approach with a similar goal; and manual evaluation — to believe that our outcomes are positive and that, in the future, they might my expanded by exploring additional synonymy sources portuguesNuma wordnet, conceitos sao representados atraves de grupos de palavras, vulgarmente chamados de synsets, e cada pertenca de uma palavra a um synset representa um diferente sentido dessa mesma palavra. Mas como os sentidos sao entidades complexas, sem fronteiras bem definidas, para lidar com eles de forma menos artificial, sugerimos que synsets sejam tratados como conjuntos difusos, em que cada palavra tem um grau de pertenca, associado a confianca que existe na utilizacao de cada palavra para transmitir o conceito que emerge do synset. Propomos entao uma abordagem automatica para descobrir um conjunto de synsets difusos a partir de uma rede de sinonimos, idealmente redundante, por ser extraida a partir de varias fontes, e o mais abrangentes possivel. Um dos principios e que, em quantos mais recursos duas palavras forem consideradas sinonimos, maior confianca havera na equivalencia de pelo menos um dos seus sentidos. A abordagem proposta foi aplicada a uma rede extraida a partir de tres dicionarios do portugues e resultou num novo conjunto de synsets para esta lingua, em que as palavras tem pertencas difusas, ou seja, fuzzy synsets. 
Para alem de apresentar a abordagem e a ilustrar com alguns resultados obtidos, baseamo-nos em tres avaliacoes — comparacao com um tesauro criado manualmente para o portugues; comparacao com uma abordagem anterior com o mesmo objetivo; e avaliacao manual — para confirmar que os resultados sao positivos, e poderao no futuro ser expandidos atraves da exploracao de outras fontes de sinonimos.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"7 1","pages":"3-17"},"PeriodicalIF":0.6,"publicationDate":"2015-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68372028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
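A simplified sketch of the fuzzy-membership idea, not the paper's algorithm: each synonymy link is weighted by the number of dictionaries that assert it, and a word's membership in a synset is taken as its average link strength to the other words of the group. The pair counts in the example are invented.

```python
# Sketch (a simplification of the approach described above): membership degrees
# derived from how many of the source dictionaries support each synonymy pair.
def fuzzy_memberships(synset_words, synonym_pairs, num_sources=3):
    """synonym_pairs: dict mapping frozenset({w1, w2}) -> number of dictionaries
    in which w1 and w2 are listed as synonyms."""
    memberships = {}
    for word in synset_words:
        strengths = [synonym_pairs.get(frozenset({word, other}), 0) / num_sources
                     for other in synset_words if other != word]
        memberships[word] = sum(strengths) / len(strengths) if strengths else 0.0
    return memberships

# Example with an assumed pair-count table built from three dictionaries:
pairs = {frozenset({"carro", "automóvel"}): 3, frozenset({"carro", "viatura"}): 1,
         frozenset({"automóvel", "viatura"}): 2}
print(fuzzy_memberships({"carro", "automóvel", "viatura"}, pairs))
```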
Linguamatica Pub Date: 2015-12-30 DOI: 10.21814/LM.7.2.216
Larissa Picoli, Juliana Pinheiro Campos Pirovani, E. Oliveira, Eric Guy Claude Laporte
{"title":"Uso de uma Ferramenta de Processamento de Linguagem Natural como Auxílio à Coleta de Exemplos para o Estudo de Propriedades Sintático-Semânticas de Verbos","authors":"Larissa Picoli, Juliana Pinheiro Campos Pirovani, E. Oliveira, Eric Guy Claude Laporte","doi":"10.21814/LM.7.2.216","DOIUrl":"https://doi.org/10.21814/LM.7.2.216","url":null,"abstract":"A analise e descricao de propriedades sintatico-semânticas de verbos sao importantes para a compreensao do funcionamento de uma lingua e fundamentais para o processamento automatico de linguagem natural, uma vez que a codificacao dessa descricao pode ser explorada por ferramentas que realizam esse tipo de processamento. Esse trabalho experimenta o uso do Unitex, uma ferramenta de processamento de linguagem natural, para coletar uma lista de verbos que podem ser analisados e descritos por um linguista. Isso contribui significativamente para esse tipo de estudo linguistico, diminuindo o esforco manual humano na busca de verbos. Foi realizado um estudo de caso para automatizar parcialmente a coleta de verbos de base adjetiva com sufixo -ecer em um corpus de 47 milhoes de palavras. A abordagem proposta e comparada com a coleta manual e a extracao a partir de um dicionario para o PLN.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"6 1","pages":"35-44"},"PeriodicalIF":0.6,"publicationDate":"2015-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"El Test de Turing para la evaluación de resumen automático de texto","authors":"Alejandro Molina-Villegas, Juan-Manuel Torres-Moreno","doi":"10.21814/LM.7.2.214","DOIUrl":"https://doi.org/10.21814/LM.7.2.214","url":null,"abstract":"espanolActualmente existen varios metodos para producir resumenes de texto de manera automatica, pero la evaluacion de los mismos continua siendo un tema desafiante. En este articulo estudiamos la evaluacion de la calidad de resumenes producidos de manera automatica mediante un metodo de compresion de frases. Abordamos la problematica que supone el uso de metricas automaticas, las cuales no toman en cuenta ni la gramatica ni la validez de las oraciones. Nuestra propuesta de evaluacion esta basada en el test de Turing, en el cual varios jueces humanos deben identificar el origen, humano o automatico, de una serie de resumenes. Tambien explicamos como validar las respuestas de los jueces por medio del test estadistico de Fisher. EnglishCurrently there are several methods to produce summaries of text automatically, but the evaluation of these remains a challenging issue. In this paper, we study the quality assessment of automatically generated abstracts. We deal with one of the major drawbacks of automatic metrics, which do not take into account either the grammar or the validity of sentences. Our proposal is based on the Turing test, in which a human judges must identify the source of a series of summaries. We propose how statistically validate the judgements using the Fisher's exact test.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"7 1","pages":"45-55"},"PeriodicalIF":0.6,"publicationDate":"2015-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68372077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linguamatica Pub Date: 2015-12-01 DOI: 10.21814/LM.7.2.217
O. A. López, C. Aguilar, Tomás Infante
{"title":"Reconocimiento de términos en español mediante la aplicación de un enfoque de comparación entre corpus","authors":"O. A. López, C. Aguilar, Tomás Infante","doi":"10.21814/LM.7.2.217","DOIUrl":"https://doi.org/10.21814/LM.7.2.217","url":null,"abstract":"espanolEn este articulo presentamos una metodologia para la identificacion y extraccion de terminos a partir de fuentes textuales en espanol correspondientes a dominios de conocimiento especializados mediante un enfoque de contraste entre corpus. El enfoque de contraste entre corpus hace uso de medidas para asignar relevancia a palabras que ocurren tanto en el corpus de dominio como en corpus de lengua general o de otro dominio diferente al de interes. Dado lo anterior, en este trabajo realizamos una exploracion de cuatro medidas usadas para asignar relevancia a palabras con el objetivo de incorporar la de mejor desempeno a nuestra metodologia. Los resultados obtenidos muestran un desempeno mejor de las medidas diferencia de rangos y razon de frecuencias relativas comparado con la razon log-likelihood y la medida usada en Termostat. EnglishIn this article we present a methodology for identifying and extracting terms from text sources in Spanish corresponding specialized-domain corpus by means of a contrastive approach. The contrastive approach requires a measure for assigning relevance to words occurring both in domain corpus and reference corpus. Therefore, in this work we explored four measures used for assigning relevance to words with the goal of incorporating the best measure in our methodology. Our results show a better performance of rank difference and relative frequency ratio measures compared with log-likelihood ratio and the measure used by Termostat.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"38 1","pages":"19-34"},"PeriodicalIF":0.6,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linguamatica Pub Date: 2015-07-31 DOI: 10.21814/LM.7.1.205
Óscar Alcón, E. Lloret
{"title":"Estudio de la influencia de incorporar conocimiento léxico-semántico a la técnica de Análisis de Componentes Principales para la generación de resúmenes multilingües","authors":"Óscar Alcón, E. Lloret","doi":"10.21814/LM.7.1.205","DOIUrl":"https://doi.org/10.21814/LM.7.1.205","url":null,"abstract":"The objective of automatic text summarization is to reduce the dimension of a text keeping the relevant information. In this paper we analyse and apply the language-independent Principal Component Analysis technique for generating extractive single-document multilingual summaries. This technique will be studied to evaluate its performance with and without adding lexical-semantic knowledge through language-dependent resources and tools. Experiments were conducted using two different corpora: newswire and Wikipedia articles in three languages (English, German and Spanish) to validate the use of this technique in several scenarios. The proposed approaches show very competitive results compared to multilingual available systems, indicating that, although there is still room for improvement with respect to the technique and the type of knowledge to be taken into consideration, this has great potential for being applied in other contexts and for other languages.","PeriodicalId":41819,"journal":{"name":"Linguamatica","volume":"7 1","pages":"53-63"},"PeriodicalIF":0.6,"publicationDate":"2015-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68371618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}