{"title":"Generation and Polynomial Parsing of Graph Languages with Non-Structural Reentrancies","authors":"Johanna Björklund, F. Drewes, Anna Jonsson","doi":"10.1162/coli_a_00488","DOIUrl":"https://doi.org/10.1162/coli_a_00488","url":null,"abstract":"\u0000 Graph-based semantic representations are popular in natural language processing (NLP), where it is often convenient to model linguistic concepts as nodes and relations as edges between them. Several attempts have been made to find a generative device that is sufficiently powerful to describe languages of semantic graphs, while at the same allowing efficient parsing. We contribute to this line of work by introducing graph extension grammar, a variant of the contextual hyperedge replacement grammars proposed by Hoffmann et al. Contextual hyperedge replacement can generate graphs with non-structural reentrancies, a type of node-sharing that is very common in formalisms such as abstract meaning representation, but which context-free types of graph grammars cannot model. To provide our formalism with a way to place reentrancies in a linguistically meaningful way, we endow rules with logical formulas in counting monadic second-order logic. We then present a parsing algorithm and show as our main result that this algorithm runs in polynomial time on graph languages generated by a subclass of our grammars, the so-called local graph extension grammars.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":" ","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47831566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Languages through the Looking Glass of BPE Compression","authors":"Ximena Gutierrez-Vasques, C. Bentz, T. Samardžić","doi":"10.1162/coli_a_00489","DOIUrl":"https://doi.org/10.1162/coli_a_00489","url":null,"abstract":"\u0000 Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It uncovers redundant patterns for compressing the data, and hence alleviates the sparsity problem in downstream applications. Subwords discovered during the first merge operations tend to have the most substantial impact on the compression of texts. However, the structural underpinnings of this effect have not been analyzed cross-linguistically. We conduct in-depth analyses across 47 typologically diverse languages and three parallel corpora, and thereby show that the types of recurrent patterns that have the strongest impact on compression are an indicator of morphological typology. For languages with richer inflectional morphology there is a preference for highly productive subwords on the early merges, while for languages with less inflectional morphology, idiosyncratic subwords are more prominent. Both types of patterns contribute to efficient compression. Counter the common perception that BPE subwords are not linguistically relevant, we find patterns across languages that resemble those described in traditional typology. We thus propose a novel way to characterize languages according to their BPE subword properties, inspired by the notion of morphological productivity in linguistics. This allows us to have language vectors that encode typological knowledge induced from raw text. Our approach is easily applicable to a wider range of languages and texts, as it does not require annotated data or any external linguistic knowledge. We discuss its potential contributions to quantitative typology and multilingual NLP.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":" ","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48265607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Capturing Fine-Grained Regional Differences in Language Use through Voting Precinct Embeddings","authors":"Alex Rosenfeld, L. Hinrichs","doi":"10.1162/coli_a_00487","DOIUrl":"https://doi.org/10.1162/coli_a_00487","url":null,"abstract":"\u0000 Linguistic variation across a region of interest can be captured by partitioning the region into areas and using social media data to train embeddings that represent language use in those areas. Recent work has focused on larger areas, such as cities or counties, to ensure that enough social media data is available in each area, but larger areas have a limited ability to find fine grained distinctions, such as intracity differences in language use. We demonstrate that it is possible to embed smaller areas which can provide higher resolution analyses of language variation. We embed voting precincts which are tiny, evenly sized political divisions for the administration of elections. The issue with modeling language use in small areas is that the data becomes incredibly sparse with many areas having scant social media data.We propose a novel embedding approach that alternates training with smoothing which mitigates these sparsity issues. We focus on linguistic variation across Texas as it is relatively understudied. We developed two novel quantitative evaluations that measure how well the embeddings can be used to capture linguistic variation. The first evaluation measures how well a model can map a dialect given terms specific to that dialect. The second evaluation measures how well a model can map preference of lexical variants. These evaluations show how embedding models could be used directly by sociolinguists and measure how much sociolinguistic information is contained within the embeddings. We complement this second evaluation with a methodology for using embeddings as a kind of genetic code where we identify “genes” that correspond to a sociological variable and connect those “genes” to a linguistic phenomenon thereby connecting sociological phenomena to linguistic ones. Finally, we explore approaches for inferring isoglosses using embeddings.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":" ","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48983208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Lingual Transfer with Language-Specific Subnetworks for Low-Resource Dependency Parsing","authors":"Rochelle Choenni, Dan Garrette, Ekaterina Shutova","doi":"10.1162/coli_a_00482","DOIUrl":"https://doi.org/10.1162/coli_a_00482","url":null,"abstract":"\u0000 Large multilingual language models typically share their parameters across all languages, which enables cross-lingual task transfer, but learning can also be hindered when training updates from different languages are in conflict. In this article, we propose novel methods for using language-specific subnetworks, which control cross-lingual parameter sharing, to reduce conflicts and increase positive transfer during fine-tuning. We introduce dynamic subnetworks, which are jointly updated with the model, and we combine our methods with meta-learning, an established, but complementary, technique for improving cross-lingual transfer. Finally, we provide extensive analyses of how each of our methods affects the models.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":" ","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46306161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical Methods for Annotation Analysis by Silviu Paun, Ron Artstein, and Massimo Poesio","authors":"Rodrigo Wilkens","doi":"10.1162/coli_r_00483","DOIUrl":"https://doi.org/10.1162/coli_r_00483","url":null,"abstract":"","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":" ","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44906611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine Learning for Ancient Languages: A Survey","authors":"Thea Sommerschield, Yannis Assael, John Pavlopoulos, Vanessa Stefanak, Andrew Senior, Chris Dyer, John Bodel, J. Prag, I. Androutsopoulos, Nando de Freitas","doi":"10.1162/coli_a_00481","DOIUrl":"https://doi.org/10.1162/coli_a_00481","url":null,"abstract":"\u0000 Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in Artificial Intelligence and Machine Learning have enabled analyses on a scale and in a detail that are reshaping the field of Humanities, similarly to how microscopes and telescopes have contributed to the realm of Science. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script and medium, spanning over three and a half millennia of civilisations around the ancient world. To analyse the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitisation, restoration, attribution, linguistic analysis, textual criticism, translation and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the Humanities and Machine Learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, flagging promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the Humanities and Machine Learning.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":" ","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43671296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dimensions of Explanatory Value in NLP models","authors":"Kees van Deemter","doi":"10.1162/coli_a_00480","DOIUrl":"https://doi.org/10.1162/coli_a_00480","url":null,"abstract":"\u0000 Performance on a dataset is often regarded as the key criterion for assessing NLP models. I will argue for a broader perspective, which emphasizes scientific explanation. I will draw on a long tradition in the philosophy of science, and on the Bayesian approach to assessing scientific theories, to argue for a plurality of criteria for assessing NLP models. To illustrate these ideas, I will compare some recent models of language production with each other. I conclude by asking what it would mean for institutional policies if the NLP community took these ideas onboard.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":"1 1","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43103061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing Selective Masking Methods for Depression Detection in Social Media","authors":"Chanapa Pananookooln, Jakrapop Akaranee, Chaklam Silpasuwanchai","doi":"10.1162/coli_a_00479","DOIUrl":"https://doi.org/10.1162/coli_a_00479","url":null,"abstract":"\u0000 Identifying those at risk for depression is a crucial issue where social media provides an excellent platform for examining the linguistic patterns of depressed individuals. A significant challenge in depression classification problem is ensuring that prediction models are not overly dependent on topic keywords i.e., depression keywords, such that it fails to predict when such keywords are unavailable. One promising approach is masking, i.e., by selectively masking various words and asking the model to predict the masked words, the model is forced to learn the inherent language patterns of depression. This study evaluates seven masking techniques. Moreover, to predict the masked words during pre-training or fine-tuning phase was also examined. Last, six class imbalance ratios were compared to determine the robustness of masked words selection methods. Key findings demonstrated that selective masking outperforms random masking in terms of F1-score. The most accurate and robust models were identified. Our research also indicated that reconstructing the masked words during pre-training phase is more advantageous than during the fine-tuning phase. Further discussion and implications were made. This is the first study to comprehensively compare masked words selection methods, which has broad implications for the field of depression classification and general NLP. Our code can be found in: https://github.com/chanapapan/Depression-Detection.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":" ","pages":""},"PeriodicalIF":9.3,"publicationDate":"2023-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45179378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reflection of Demographic Background on Word Usage","authors":"Aparna Garimella, Carmen Banea, Rada Mihalcea","doi":"10.1162/coli_a_00475","DOIUrl":"https://doi.org/10.1162/coli_a_00475","url":null,"abstract":"The availability of personal writings in electronic format provides researchers in the fields of linguistics, psychology, and computational linguistics with an unprecedented chance to study, on a large scale, the relationship between language use and the demographic background of writers, allowing us to better understand people across different demographics. In this article, we analyze the relation between language and demographics by developing cross-demographic word models to identify words with usage bias, or words that are used in significantly different ways by speakers of different demographics. Focusing on three demographic categories, namely, location, gender, and industry, we identify words with significant usage differences in each category and investigate various approaches of encoding a word’s usage, allowing us to identify language aspects that contribute to the differences. Our word models using topic-based features achieve at least 20% improvement in accuracy over the baseline for all demographic categories, even for scenarios with classification into 15 categories, illustrating the usefulness of topic-based features in identifying word usage differences. Further, we note that for location and industry, topics extracted from immediate context are the best predictors of word usages, hinting at the importance of word meaning and its grammatical function for these demographics, while for gender, topics obtained from longer contexts are better predictors for word usage.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":"49 1","pages":"373-394"},"PeriodicalIF":9.3,"publicationDate":"2023-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46226615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-driven Cross-lingual Syntax: An Agreement Study with Massively Multilingual Models","authors":"Andrea Gregor de Varda, M. Marelli","doi":"10.1162/coli_a_00472","DOIUrl":"https://doi.org/10.1162/coli_a_00472","url":null,"abstract":"Massively multilingual models such as mBERT and XLM-R are increasingly valued in Natural Language Processing research and applications, due to their ability to tackle the uneven distribution of resources available for different languages. The models’ ability to process multiple languages relying on a shared set of parameters raises the question of whether the grammatical knowledge they extracted during pre-training can be considered as a data-driven cross-lingual grammar. The present work studies the inner workings of mBERT and XLM-R in order to test the cross-lingual consistency of the individual neural units that respond to a precise syntactic phenomenon, that is, number agreement, in five languages (English, German, French, Hebrew, Russian). We found that there is a significant overlap in the latent dimensions that encode agreement across the languages we considered. This overlap is larger (a) for long- vis-à-vis short-distance agreement and (b) when considering XLM-R as compared to mBERT, and peaks in the intermediate layers of the network. We further show that a small set of syntax-sensitive neurons can capture agreement violations across languages; however, their contribution is not decisive in agreement processing.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":"49 1","pages":"261-299"},"PeriodicalIF":9.3,"publicationDate":"2023-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47159527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}