Computational Linguistics: Latest Publications

Generation and Polynomial Parsing of Graph Languages with Non-Structural Reentrancies
Johanna Björklund, F. Drewes, Anna Jonsson
Computational Linguistics | IF 9.3 | CAS Q2, Computer Science | Published 2023-07-06 | DOI: 10.1162/coli_a_00488
Abstract: Graph-based semantic representations are popular in natural language processing (NLP), where it is often convenient to model linguistic concepts as nodes and relations as edges between them. Several attempts have been made to find a generative device that is sufficiently powerful to describe languages of semantic graphs, while at the same time allowing efficient parsing. We contribute to this line of work by introducing graph extension grammar, a variant of the contextual hyperedge replacement grammars proposed by Hoffmann et al. Contextual hyperedge replacement can generate graphs with non-structural reentrancies, a type of node-sharing that is very common in formalisms such as abstract meaning representation, but which context-free types of graph grammars cannot model. To provide our formalism with a way to place reentrancies in a linguistically meaningful way, we endow rules with logical formulas in counting monadic second-order logic. We then present a parsing algorithm and show as our main result that this algorithm runs in polynomial time on graph languages generated by a subclass of our grammars, the so-called local graph extension grammars.
Citations: 3

Languages through the Looking Glass of BPE Compression
Ximena Gutierrez-Vasques, C. Bentz, T. Samardžić
Computational Linguistics | IF 9.3 | CAS Q2, Computer Science | Published 2023-07-06 | DOI: 10.1162/coli_a_00489
Abstract: Byte-pair encoding (BPE) is widely used in NLP for performing subword tokenization. It uncovers redundant patterns for compressing the data, and hence alleviates the sparsity problem in downstream applications. Subwords discovered during the first merge operations tend to have the most substantial impact on the compression of texts. However, the structural underpinnings of this effect have not been analyzed cross-linguistically. We conduct in-depth analyses across 47 typologically diverse languages and three parallel corpora, and thereby show that the types of recurrent patterns that have the strongest impact on compression are an indicator of morphological typology. For languages with richer inflectional morphology there is a preference for highly productive subwords on the early merges, while for languages with less inflectional morphology, idiosyncratic subwords are more prominent. Both types of patterns contribute to efficient compression. Counter to the common perception that BPE subwords are not linguistically relevant, we find patterns across languages that resemble those described in traditional typology. We thus propose a novel way to characterize languages according to their BPE subword properties, inspired by the notion of morphological productivity in linguistics. This allows us to have language vectors that encode typological knowledge induced from raw text. Our approach is easily applicable to a wider range of languages and texts, as it does not require annotated data or any external linguistic knowledge. We discuss its potential contributions to quantitative typology and multilingual NLP.
Citations: 0

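The abstract above turns on which subwords the earliest BPE merge operations produce. As a rough illustration of the merge mechanism being analyzed (not the authors' code), here is a minimal sketch of the standard BPE merge loop; the toy corpus and number of merges are invented for illustration.

```python
from collections import Counter

def bpe_merges(corpus_words, num_merges):
    """Return the first `num_merges` byte-pair-encoding merges for a toy corpus.

    Each word is a tuple of symbols; at every step the most frequent adjacent
    symbol pair is merged into a single new symbol.
    """
    # Word frequency table, with words split into characters plus an end marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy example: the earliest merges pick up the most frequent symbol pairs.
print(bpe_merges(["lower", "lowest", "newer", "newest", "widest"], 5))
```

The paper's analysis looks at which kinds of subwords these early merges yield across 47 languages, rather than at the merge loop itself.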
Capturing Fine-Grained Regional Differences in Language Use through Voting Precinct Embeddings
Alex Rosenfeld, L. Hinrichs
Computational Linguistics | IF 9.3 | CAS Q2, Computer Science | Published 2023-06-13 | DOI: 10.1162/coli_a_00487
Abstract: Linguistic variation across a region of interest can be captured by partitioning the region into areas and using social media data to train embeddings that represent language use in those areas. Recent work has focused on larger areas, such as cities or counties, to ensure that enough social media data is available in each area, but larger areas have a limited ability to find fine-grained distinctions, such as intracity differences in language use. We demonstrate that it is possible to embed smaller areas which can provide higher resolution analyses of language variation. We embed voting precincts, which are tiny, evenly sized political divisions for the administration of elections. The issue with modeling language use in small areas is that the data becomes incredibly sparse, with many areas having scant social media data. We propose a novel embedding approach that alternates training with smoothing, which mitigates these sparsity issues. We focus on linguistic variation across Texas as it is relatively understudied. We developed two novel quantitative evaluations that measure how well the embeddings can be used to capture linguistic variation. The first evaluation measures how well a model can map a dialect given terms specific to that dialect. The second evaluation measures how well a model can map preference of lexical variants. These evaluations show how embedding models could be used directly by sociolinguists and measure how much sociolinguistic information is contained within the embeddings. We complement this second evaluation with a methodology for using embeddings as a kind of genetic code where we identify “genes” that correspond to a sociological variable and connect those “genes” to a linguistic phenomenon, thereby connecting sociological phenomena to linguistic ones. Finally, we explore approaches for inferring isoglosses using embeddings.
Citations: 0

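The abstract describes alternating embedding training with a smoothing step so that data-sparse precincts borrow signal from their neighbors. Below is a minimal sketch of that alternation under stated assumptions: the placeholder training update, the neighbor-averaging rule, the adjacency structure, and the interpolation weight are all hypothetical and do not reproduce the authors' method.

```python
import numpy as np

def smooth_embeddings(emb, neighbors, alpha=0.5):
    """One smoothing pass: pull each precinct vector toward the mean of its
    spatial neighbors, sharing signal with data-sparse precincts.

    emb:       (n_precincts, dim) array of embeddings
    neighbors: dict mapping precinct index -> list of adjacent precinct indices
    alpha:     interpolation weight (hypothetical parameterization)
    """
    smoothed = emb.copy()
    for i, nbrs in neighbors.items():
        if nbrs:
            smoothed[i] = alpha * emb[i] + (1 - alpha) * emb[nbrs].mean(axis=0)
    return smoothed

# Illustrative alternation: a training-style update on whatever objective the
# embeddings use, followed by a smoothing pass over precinct neighbors.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))                       # 4 toy precincts, 8-dim vectors
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # toy adjacency chain
for step in range(10):
    update = rng.normal(scale=0.01, size=emb.shape)  # placeholder for a real gradient step
    emb = emb - update
    emb = smooth_embeddings(emb, neighbors)
```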
Cross-Lingual Transfer with Language-Specific Subnetworks for Low-Resource Dependency Parsing
Rochelle Choenni, Dan Garrette, Ekaterina Shutova
Computational Linguistics | IF 9.3 | CAS Q2, Computer Science | Published 2023-05-25 | DOI: 10.1162/coli_a_00482
Abstract: Large multilingual language models typically share their parameters across all languages, which enables cross-lingual task transfer, but learning can also be hindered when training updates from different languages are in conflict. In this article, we propose novel methods for using language-specific subnetworks, which control cross-lingual parameter sharing, to reduce conflicts and increase positive transfer during fine-tuning. We introduce dynamic subnetworks, which are jointly updated with the model, and we combine our methods with meta-learning, an established, but complementary, technique for improving cross-lingual transfer. Finally, we provide extensive analyses of how each of our methods affects the models.
Citations: 2

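The abstract's central idea is a language-specific subnetwork: a mask over model parameters that controls which parameters a given language may update during fine-tuning. Here is a minimal PyTorch sketch of that idea, not the paper's implementation; the static random masks, layer sizes, and language names are illustrative assumptions (the paper's subnetworks can be updated jointly with the model).

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer whose weight updates can be restricted to a per-language
    binary subnetwork mask (a sketch, not the paper's architecture)."""

    def __init__(self, in_dim, out_dim, languages):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        # One fixed random mask per language, for illustration only; only the
        # weight matrix is masked here, the bias is left fully shared.
        self.masks = {
            lang: (torch.rand_like(self.linear.weight) > 0.5).float()
            for lang in languages
        }

    def forward(self, x):
        return self.linear(x)

    def apply_language_mask(self, lang):
        """Zero out gradients outside the language's subnetwork after backward()."""
        if self.linear.weight.grad is not None:
            self.linear.weight.grad *= self.masks[lang]

layer = MaskedLinear(16, 8, languages=["en", "fi"])
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x, target = torch.randn(4, 16), torch.randn(4, 8)

loss = nn.functional.mse_loss(layer(x), target)
loss.backward()
layer.apply_language_mask("fi")  # only the "fi" subnetwork receives this update
opt.step()
```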
Statistical Methods for Annotation Analysis by Silviu Paun, Ron Artstein, and Massimo Poesio
Rodrigo Wilkens
Computational Linguistics | IF 9.3 | CAS Q2, Computer Science | Published 2023-05-25 | DOI: 10.1162/coli_r_00483
Citations: 0

Machine Learning for Ancient Languages: A Survey
Thea Sommerschield, Yannis Assael, John Pavlopoulos, Vanessa Stefanak, Andrew Senior, Chris Dyer, John Bodel, J. Prag, I. Androutsopoulos, Nando de Freitas
Computational Linguistics | IF 9.3 | CAS Q2, Computer Science | Published 2023-05-25 | DOI: 10.1162/coli_a_00481
Abstract: Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in Artificial Intelligence and Machine Learning have enabled analyses on a scale and in a detail that are reshaping the field of Humanities, similarly to how microscopes and telescopes have contributed to the realm of Science. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script and medium, spanning over three and a half millennia of civilisations around the ancient world. To analyse the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitisation, restoration, attribution, linguistic analysis, textual criticism, translation and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the Humanities and Machine Learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, flagging promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the Humanities and Machine Learning.
Citations: 7

Dimensions of Explanatory Value in NLP Models
Kees van Deemter
Computational Linguistics | IF 9.3 | CAS Q2, Computer Science | Published 2023-05-04 | DOI: 10.1162/coli_a_00480
Abstract: Performance on a dataset is often regarded as the key criterion for assessing NLP models. I will argue for a broader perspective, which emphasizes scientific explanation. I will draw on a long tradition in the philosophy of science, and on the Bayesian approach to assessing scientific theories, to argue for a plurality of criteria for assessing NLP models. To illustrate these ideas, I will compare some recent models of language production with each other. I conclude by asking what it would mean for institutional policies if the NLP community took these ideas onboard.
Citations: 1

Comparing Selective Masking Methods for Depression Detection in Social Media
Chanapa Pananookooln, Jakrapop Akaranee, Chaklam Silpasuwanchai
Computational Linguistics | IF 9.3 | CAS Q2, Computer Science | Published 2023-04-28 | DOI: 10.1162/coli_a_00479
Abstract: Identifying those at risk for depression is a crucial issue where social media provides an excellent platform for examining the linguistic patterns of depressed individuals. A significant challenge in the depression classification problem is ensuring that prediction models are not overly dependent on topic keywords (i.e., depression keywords), such that they fail to predict when such keywords are unavailable. One promising approach is masking: by selectively masking various words and asking the model to predict the masked words, the model is forced to learn the inherent language patterns of depression. This study evaluates seven masking techniques. Moreover, whether to predict the masked words during the pre-training or the fine-tuning phase was also examined. Last, six class imbalance ratios were compared to determine the robustness of masked word selection methods. Key findings demonstrated that selective masking outperforms random masking in terms of F1-score. The most accurate and robust models were identified. Our research also indicated that reconstructing the masked words during the pre-training phase is more advantageous than during the fine-tuning phase. Further discussion and implications are provided. This is the first study to comprehensively compare masked word selection methods, which has broad implications for the field of depression classification and general NLP. Our code can be found at: https://github.com/chanapapan/Depression-Detection.
Citations: 0

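The abstract compares strategies for choosing which words to mask so that a masked-language model learns depression-related language patterns rather than relying on topic keywords. As a minimal sketch of one such selective-masking strategy (not any of the paper's seven methods specifically), the function below masks a hypothetical keyword list plus a small random fraction of the remaining tokens; the keyword list, masking rate, and sentence are made up.

```python
import random

def selective_mask(tokens, keyword_set, mask_token="[MASK]", extra_rate=0.15, seed=0):
    """Mask every token in `keyword_set` plus a small random fraction of the rest.

    Hiding topic keywords pushes the model to learn the surrounding language
    patterns instead of depending on the keywords themselves.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if tok.lower() in keyword_set or rng.random() < extra_rate:
            masked.append(mask_token)
            labels.append(tok)    # the model is trained to reconstruct this token
        else:
            masked.append(tok)
            labels.append(None)   # not a prediction target
    return masked, labels

tokens = "I have been feeling hopeless and tired every single day".split()
keywords = {"hopeless", "tired"}  # illustrative keyword list, not from the paper
print(selective_mask(tokens, keywords))
```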
Reflection of Demographic Background on Word Usage
Aparna Garimella, Carmen Banea, Rada Mihalcea
Computational Linguistics | IF 9.3 | CAS Q2, Computer Science | Published 2023-01-17 | DOI: 10.1162/coli_a_00475
Abstract: The availability of personal writings in electronic format provides researchers in the fields of linguistics, psychology, and computational linguistics with an unprecedented chance to study, on a large scale, the relationship between language use and the demographic background of writers, allowing us to better understand people across different demographics. In this article, we analyze the relation between language and demographics by developing cross-demographic word models to identify words with usage bias, or words that are used in significantly different ways by speakers of different demographics. Focusing on three demographic categories, namely, location, gender, and industry, we identify words with significant usage differences in each category and investigate various approaches of encoding a word’s usage, allowing us to identify language aspects that contribute to the differences. Our word models using topic-based features achieve at least 20% improvement in accuracy over the baseline for all demographic categories, even for scenarios with classification into 15 categories, illustrating the usefulness of topic-based features in identifying word usage differences. Further, we note that for location and industry, topics extracted from immediate context are the best predictors of word usages, hinting at the importance of word meaning and its grammatical function for these demographics, while for gender, topics obtained from longer contexts are better predictors for word usage.
Citations: 0

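The abstract's word models predict a writer's demographic group from topic-based features of a target word's context. A rough sketch of that setup under assumptions: contexts of one word are turned into topic features and fed to a classifier over groups. The example sentences, labels, topic count, and pipeline below are invented for illustration and do not reproduce the paper's models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy contexts of one target word ("pitch"), each labeled with an invented
# demographic-style group; real data would come from user writings.
contexts = [
    "the pitch was perfect and the crowd sang along",
    "we adjusted the pitch of the antenna before the test",
    "her pitch to the investors lasted ten minutes",
    "the pitch at the stadium was muddy after the rain",
]
groups = ["uk", "engineering", "business", "uk"]

# Topic features from the immediate context, then a classifier over groups:
# a sketch of using topic-based context features to predict the writer's group.
model = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(max_iter=1000),
)
model.fit(contexts, groups)
print(model.predict(["a short pitch about our new product"]))
```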
Gradual Modifications and Abrupt Replacements: Two Stochastic Lexical Ingredients of Language Evolution
M. Pasquini, M. Serva, D. Vergni
Computational Linguistics | IF 9.3 | CAS Q2, Computer Science | Published 2023-01-13 | DOI: 10.1162/coli_a_00471
Abstract: The evolution of the vocabulary of a language is characterized by two different random processes: abrupt lexical replacements, when a complete new word emerges to represent a given concept (which was at the basis of the Swadesh foundation of glottochronology in the 1950s), and gradual lexical modifications that progressively alter words over the centuries, considered here in detail for the first time. The main discriminant between these two processes is their impact on cognacy within a family of languages or dialects, since the former modifies the subsets of cognate terms and the latter does not. The automated cognate detection, which is here performed following a new approach inspired by graph theory, is a key preliminary step that allows us to later measure the effects of the slow modification process. We test our dual approach on the family of Malagasy dialects using a cladistic analysis, which provides strong evidence that lexical replacements and gradual lexical modifications are two random processes that separately drive the evolution of languages.
Citations: 1

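The abstract mentions automated cognate detection performed with a new approach inspired by graph theory. As a generic illustration of the graph view (not the authors' algorithm), the sketch below links word forms whose normalized edit distance falls under a threshold and reads off connected components as candidate cognate sets; the distance measure, threshold, and toy dialect forms are assumptions.

```python
from itertools import combinations

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cognate_sets(forms, threshold=0.5):
    """Link forms whose normalized edit distance is below `threshold` and
    return the connected components of that graph as candidate cognate sets.
    The threshold is an illustrative choice, not a value from the paper.
    """
    adj = {f: set() for f in forms}
    for a, b in combinations(forms, 2):
        if edit_distance(a, b) / max(len(a), len(b)) < threshold:
            adj[a].add(b)
            adj[b].add(a)
    # Depth-first traversal to collect connected components.
    seen, components = set(), []
    for f in forms:
        if f in seen:
            continue
        stack, comp = [f], []
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                comp.append(node)
                stack.extend(adj[node] - seen)
        components.append(sorted(comp))
    return components

# Toy dialect forms for one concept (invented for illustration).
print(cognate_sets(["rano", "ranu", "ranou", "vary", "varin"]))
```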