Computational Linguistics: Latest Articles (Page 5)

Boring Problems Are Sometimes the Most Interesting

IF 9.3, Zone 2 (Computer Science)

Computational Linguistics Pub Date : 2022-03-07 DOI: 10.1162/coli_a_00439

R. Sproat

Citations: 5

Assessing Corpus Evidence for Formal and Psycholinguistic Constraints on Nonprojectivity

IF 9.3, Zone 2 (Computer Science)

Computational Linguistics Pub Date : 2022-03-07 DOI: 10.1162/coli_a_00437

Himanshu Yadav, Samar Husain, Richard Futrell

Abstract: Formal constraints on crossing dependencies have played a large role in research on the formal complexity of natural language grammars and parsing. Here we ask whether the apparent evidence for constraints on crossing dependencies in treebanks might arise because of independent constraints on trees, such as low arity and dependency length minimization. We address this question using two sets of experiments. In Experiment 1, we compare the distribution of formal properties of crossing dependencies, such as gap degree, between real trees and baseline trees matched for rate of crossing dependencies and various other properties. In Experiment 2, we model whether two dependencies cross, given certain psycholinguistic properties of the dependencies. We find surprisingly weak evidence for constraints originating from the mild context-sensitivity literature (gap degree and well-nestedness) beyond what can be explained by constraints on rate of crossing dependencies, topological properties of the trees, and dependency length. However, measures that have emerged from the parsing literature (e.g., edge degree, end-point crossings, and heads’ depth difference) differ strongly between real and random trees. Modeling results show that cognitive metrics relating to information locality and working-memory limitations affect whether two dependencies cross or not, but they do not fully explain the distribution of crossing dependencies in natural languages. Together these results suggest that crossing constraints are better characterized by processing pressures than by mildly context-sensitive constraints.

Vol. 48, No. 1, pp. 375-401. Journal Article.
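
As a small illustration of what it means for two dependencies to cross, the sketch below checks arc pairs in a toy dependency tree. It is not the authors' code or experimental setup: the function names `arcs_cross` and `crossing_pairs`, the head-list encoding (1-based head positions, 0 for the root), and the example tree are all assumptions made for this example.

```python
from itertools import combinations


def arcs_cross(arc_a, arc_b):
    """Return True if two dependency arcs cross in the linear word order.

    Each arc is a (head, dependent) pair of 1-based word positions.
    Two arcs cross when exactly one endpoint of one arc lies strictly
    between the endpoints of the other.
    """
    a_lo, a_hi = sorted(arc_a)
    b_lo, b_hi = sorted(arc_b)
    return a_lo < b_lo < a_hi < b_hi or b_lo < a_lo < b_hi < a_hi


def crossing_pairs(heads):
    """List all pairs of crossing arcs in a sentence.

    heads[i] is the (hypothetical) 1-based head position of word i + 1;
    a head of 0 marks the root, which contributes no arc here.
    """
    arcs = [(h, d) for d, h in enumerate(heads, start=1) if h != 0]
    return [(a, b) for a, b in combinations(arcs, 2) if arcs_cross(a, b)]


# Toy 4-word tree: word 3 is the root, word 2 attaches to word 4,
# so the arc 4 -> 2 crosses the arc 3 -> 1.
toy_heads = [3, 4, 0, 3]
print(crossing_pairs(toy_heads))
```

A tree with no crossing pairs under this check is projective in the arc-crossing sense; measures such as gap degree or edge degree mentioned in the abstract would need additional bookkeeping beyond this pairwise test.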

Citations: 3

Challenges of Neural Machine Translation for Short Texts

IF 9.3, Zone 2 (Computer Science)

Computational Linguistics Pub Date : 2022-03-07 DOI: 10.1162/coli_a_00435

Yu Wan, Baosong Yang, Derek F. Wong, Lidia S. Chao, Liang Yao, Haibo Zhang, Boxing Chen

Abstract: Short texts (STs) present in a variety of scenarios, including query, dialog, and entity names. Most of the exciting studies in neural machine translation (NMT) are focused on tackling open problems concerning long sentences rather than short ones. The intuition behind is that, with respect to human learning and processing, short sequences are generally regarded as easy examples. In this article, we first dispel this speculation via conducting preliminary experiments, showing that the conventional state-of-the-art NMT approach, namely, Transformer (Vaswani et al. 2017), still suffers from over-translation and mistranslation errors over STs. After empirically investigating the rationale behind this, we summarize two challenges in NMT for STs associated with translation error types above, respectively: (1) the imbalanced length distribution in training set intensifies model inference calibration over STs, leading to more over-translation cases on STs; and (2) the lack of contextual information forces NMT to have higher data uncertainty on short sentences, and thus NMT model is troubled by considerable mistranslation errors. Some existing approaches, like balancing data distribution for training (e.g., data upsampling) and complementing contextual information (e.g., introducing translation memory) can alleviate the translation issues in NMT for STs. We encourage researchers to investigate other challenges in NMT for STs, thus reducing ST translation errors and enhancing translation quality.

Vol. 48, No. 1, pp. 321-342. Journal Article.

Citations: 10

Hierarchical Interpretation of Neural Text Classification

IF 9.3, Zone 2 (Computer Science)

Computational Linguistics Pub Date : 2022-02-20 DOI: 10.1162/coli_a_00459

Hanqi Yan, Lin Gui, Yulan He

Citations: 9

Transformers and the Representation of Biomedical Background Knowledge

IF 9.3, Zone 2 (Computer Science)

Computational Linguistics Pub Date : 2022-02-04 DOI: 10.1162/coli_a_00462

Oskar Wysocki, Zili Zhou, Paul O'Regan, D. Ferreira, M. Wysocka, Dónal Landers, André Freitas

Affiliations: Department of Computer Science, The University of Manchester; digital Experimental Cancer Medicine Team, Cancer Biomarker Centre, CRUK Manchester Institute, University of Manchester; Idiap Research Institute

Abstract: Specialized transformers-based models (such as BioBERT and BioMegatron) are adapted for the biomedical domain based on publicly available biomedical corpora. As such, they have the potential to encode large-scale biological knowledge. We investigate the encoding and representation of biological knowledge in these models, and its potential utility to support inference in cancer precision medicine, namely, the interpretation of the clinical significance of genomic alterations. We compare the performance of different transformer baselines; we use probing to determine the consistency of encodings for distinct entities; and we use clustering methods to compare and contrast the internal properties of the embeddings for genes, variants, drugs, and diseases. We show that these models do indeed encode biological knowledge, although some of this is lost in fine-tuning for specific tasks. Finally, we analyze how the models behave with regard to biases and imbalances in the dataset.

Vol. 49, No. 1, pp. 73-115. Journal Article.

Citations: 6

The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization

IF 9.3, Zone 2 (Computer Science)

Computational Linguistics Pub Date : 2022-01-25 DOI: 10.1162/coli_a_00458

Ildikó Pilán, Pierre Lison, Lilja Øvrelid, Anthia Papadopoulou, David Sánchez, Montserrat Batet

Abstract: We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared with previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored toward measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models are available on: https://github.com/NorskRegnesentral/text-anonymization-benchmark.

Vol. 48, No. 1, pp. 1053-1101. Journal Article.

Citations: 29

Domain Adaptation with Pre-trained Transformers for Query-Focused Abstractive Text Summarization

IF 9.3, Zone 2 (Computer Science)

Computational Linguistics Pub Date : 2021-12-22 DOI: 10.1162/coli_a_00434

Md Tahmid Rahman Laskar, Enamul Hoque, J. Huang

Citations: 24

Novelty Detection: A Perspective from Natural Language Processing

IF 9.3, Zone 2 (Computer Science)

Computational Linguistics Pub Date : 2021-12-20 DOI: 10.1162/coli_a_00429

Tirthankar Ghosal, Tanik Saikh, Tameesh Biswas, Asif Ekbal, P. Bhattacharyya

Abstract: The quest for new information is an inborn human trait and has always been quintessential for human survival and progress. Novelty drives curiosity, which in turn drives innovation. In Natural Language Processing (NLP), Novelty Detection refers to finding text that has some new information to offer with respect to whatever is earlier seen or known. With the exponential growth of information all across the Web, there is an accompanying menace of redundancy. A considerable portion of the Web contents are duplicates, and we need efficient mechanisms to retain new information and filter out redundant information. However, detecting redundancy at the semantic level and identifying novel text is not straightforward because the text may have less lexical overlap yet convey the same information. On top of that, non-novel/redundant information in a document may have assimilated from multiple source documents, not just one. The problem surmounts when the subject of the discourse is documents, and numerous prior documents need to be processed to ascertain the novelty/non-novelty of the current one in concern. In this work, we build upon our earlier investigations for document-level novelty detection and present a comprehensive account of our efforts toward the problem. We explore the role of pre-trained Textual Entailment (TE) models to deal with multiple source contexts and present the outcome of our current investigations. We argue that a multipremise entailment task is one close approximation toward identifying semantic-level non-novelty. Our recent approach either performs comparably or achieves significant improvement over the latest reported results on several datasets and across several related tasks (paraphrasing, plagiarism, rewrite). We critically analyze our performance with respect to the existing state of the art and show the superiority and promise of our approach for future investigations. We also present our enhanced dataset TAP-DLND 2.0 and several baselines to the community for further research on document-level novelty detection.

Vol. 48, No. 1, pp. 77-117. Journal Article.

Citations: 5

Linguistic Parameters of Spontaneous Speech for Identifying Mild Cognitive Impairment and Alzheimer Disease

IF 9.3, Zone 2 (Computer Science)

Computational Linguistics Pub Date : 2021-12-20 DOI: 10.1162/coli_a_00428

V. Vincze, Martina Katalin Szabó, I. Hoffmann, L. Tóth, M. Pákáski, J. Kálmán, G. Gosztolya

Abstract: In this article, we seek to automatically identify Hungarian patients suffering from mild cognitive impairment (MCI) or mild Alzheimer disease (mAD) based on their speech transcripts, focusing only on linguistic features. In addition to the features examined in our earlier study, we introduce syntactic, semantic, and pragmatic features of spontaneous speech that might affect the detection of dementia. In order to ascertain the most useful features for distinguishing healthy controls, MCI patients, and mAD patients, we carry out a statistical analysis of the data and investigate the significance level of the extracted features among various speaker group pairs and for various speaking tasks. In the second part of the article, we use this rich feature set as a basis for an effective discrimination among the three speaker groups. In our machine learning experiments, we analyze the efficacy of each feature group separately. Our model that uses all the features achieves competitive scores, either with or without demographic information (3-class accuracy values: 68%–70%, 2-class accuracy values: 77.3%–80%). We also analyze how different data recording scenarios affect linguistic features and how they can be productively used when distinguishing MCI patients from healthy controls.

pp. 119-153. Journal Article.

Citations: 4

Obituary: Martin Kay

IF 9.3, Zone 2 (Computer Science)

Computational Linguistics Pub Date : 2021-12-16 DOI: 10.1162/coli_a_00424

R. Kaplan, H. Uszkoreit

Abstract: It is with great sadness that we report the passing of Martin Kay in August 2021. Martin was a pioneer and intellectual trailblazer in computational linguistics. He was also a close friend and colleague of many years. Martin was a polyglot undergraduate student of modern and medieval languages at Cambridge University, with a particular interest in translation. He was not (yet) a mathematician or engineer, but idle speculation in 1958 about the possibilities of automating the translation process led him to Margaret Masterman at the Cambridge Language Research Unit, and a shift to a long and productive career. In 1960 he was offered an internship with Dave Hays and the Linguistics Project at The RAND Corporation in California, another early center of research in our emerging discipline. He stayed at RAND for more than a decade, working on basic technologies that are needed for machine processing of natural language. Among his contributions during that period was the development of the first so-called chart parser (Kay 1967), a computationally effective mechanism for dealing systematically with linguistic dependencies that cannot be expressed in context-free grammars. The chart architecture could be deployed for language generation as well as parsing, an important property for Martin’s continuing interest in translation. It was during the years at RAND that Martin found his second calling, as a teacher of computational linguistics, initially at UCLA and then in many other settings. He was a gifted and entertaining speaker and lecturer, able to present complex material with clarity and precision. He took great pleasure in the interactions with his students and the role that he played in helping to advance their careers.

He left RAND in 1972 to become a full-time professor and chair of the Computer Science Department at the University of California at Irvine. His time at Irvine was short-lived, as he was attracted back to an open-ended research environment. In 1974 he joined with Danny Bobrow, Ron Kaplan, and Terry Winograd to form the Language Understander project at the recently created Palo Alto Research Center (PARC) of the Xerox Corporation. The group took as a first goal the construction of a mixed-initiative dialog system using state-of-the-art components for knowledge representation and reasoning, language understanding, language production, and dialog management (Bobrow et al. 1977). Martin took responsibility for [...]

Vol. 48, No. 1, pp. 1-3. Journal Article.

Citations: 104