Marcos Fernández-Pichel, Manuel Prada-Corral, D. Losada, J. C. Pichel, Pablo Gamallo
{"title":"An unsupervised perplexity-based method for boilerplate removal","authors":"Marcos Fernández-Pichel, Manuel Prada-Corral, D. Losada, J. C. Pichel, Pablo Gamallo","doi":"10.1017/s1351324923000049","DOIUrl":"https://doi.org/10.1017/s1351324923000049","url":null,"abstract":"The availability of large web-based corpora has led to significant advances in a wide range of technologies, including massive retrieval systems or deep neural networks. However, leveraging this data is challenging, since web content is plagued by the so-called boilerplate: ads, incomplete or noisy text and rests of the navigation structure, such as menus or navigation bars. In this work, we present a novel and efficient approach to extract useful and well-formed content from web-scraped data. Our approach takes advantage of Language Models and their implicit knowledge about correctly formed text, and we demonstrate here that perplexity is a valuable artefact that can contribute in terms of effectiveness and efficiency. As a matter of fact, the removal of noisy parts leads to lighter AI or search solutions that are effective and entail important reductions in resources spent. We exemplify here the usefulness of our method with two downstream tasks, search and classification, and a cleaning task. We also provide a Python package with pre-trained models and a web demo demonstrating the capabilities of our approach.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47336299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hendrik Schuff, Lindsey Vanderlyn, Heike Adel, Ngoc Thang Vu
{"title":"How to do human evaluation: A brief introduction to user studies in NLP","authors":"Hendrik Schuff, Lindsey Vanderlyn, Heike Adel, Ngoc Thang Vu","doi":"10.1017/S1351324922000535","DOIUrl":"https://doi.org/10.1017/S1351324922000535","url":null,"abstract":"Abstract Many research topics in natural language processing (NLP), such as explanation generation, dialog modeling, or machine translation, require evaluation that goes beyond standard metrics like accuracy or F1 score toward a more human-centered approach. Therefore, understanding how to design user studies becomes increasingly important. However, few comprehensive resources exist on planning, conducting, and evaluating user studies for NLP, making it hard to get started for researchers without prior experience in the field of human evaluation. In this paper, we summarize the most important aspects of user studies and their design and evaluation, providing direct links to NLP tasks and NLP-specific challenges where appropriate. We (i) outline general study design, ethical considerations, and factors to consider for crowdsourcing, (ii) discuss the particularities of user studies in NLP, and provide starting points to select questionnaires, experimental designs, and evaluation methods that are tailored to the specific NLP tasks. Additionally, we offer examples with accompanying statistical evaluation code, to bridge the gap between theoretical guidelines and practical applications.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1199 - 1222"},"PeriodicalIF":2.5,"publicationDate":"2023-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43935814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NLP startup funding in 2022","authors":"R. Dale","doi":"10.1017/S1351324923000013","DOIUrl":"https://doi.org/10.1017/S1351324923000013","url":null,"abstract":"Abstract It’s no secret that the commercial application of NLP technologies has exploded in recent years. From chatbots and virtual assistants to machine translation and sentiment analysis, NLP technologies are now being used in a wide variety of applications across a range of industries. With the increasing demand for technologies that can process human language, investors have been eager to get a piece of the action. In this article, we look at NLP startup funding over the past year, identifying the applications and domains that have received investment.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"162 - 176"},"PeriodicalIF":2.5,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43590821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SEN: A subword-based ensemble network for Chinese historical entity extraction","authors":"Cheng Yan, Ruojiang Wang, Xiaoke Fang","doi":"10.1017/S1351324922000493","DOIUrl":"https://doi.org/10.1017/S1351324922000493","url":null,"abstract":"Abstract Understanding various historical entity information (e.g., persons, locations, and time) plays a very important role in reasoning about the developments of historical events. With the increasing concern about the fields of digital humanities and natural language processing, named entity recognition (NER) provides a feasible solution for automatically extracting these entities from historical texts, especially in Chinese historical research. However, previous approaches are domain-specific, ineffective with relatively low accuracy, and non-interpretable, which hinders the development of NER in Chinese history. In this paper, we propose a new hybrid deep learning model called “subword-based ensemble network” (SEN), by incorporating subword information and a novel attention fusion mechanism. The experiments on a massive self-built Chinese historical corpus CMAG show that SEN has achieved the best with 93.87% for F1-micro and 89.70% for F1-macro, compared with other advanced models. Further investigation reveals that SEN has a strong generalization ability of NER on Chinese historical texts, which is not only relatively insensitive to the categories with fewer annotation labels (e.g., OFI) but can also accurately capture diverse local and global semantic relations. Our research demonstrates the effectiveness of the integration of subword information and attention fusion, which provides an inspiring solution for the practical use of entity extraction in the Chinese historical domain.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1043 - 1065"},"PeriodicalIF":2.5,"publicationDate":"2022-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45250172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KLAUS-Tr: Knowledge & learning-based unit focused arithmetic word problem solver for transfer cases","authors":"Suresh Kumar, P. S. Kumar","doi":"10.1017/s1351324922000511","DOIUrl":"https://doi.org/10.1017/s1351324922000511","url":null,"abstract":"\u0000 Solving the Arithmetic Word Problems (AWPs) using AI techniques has attracted much attention in recent years. We feel that the current AWP solvers are under-utilizing the relevant domain knowledge. We present a knowledge- and learning-based system that effectively solves AWPs of a specific type—those that involve transfer of objects from one agent to another (Transfer Cases (TC)). We represent the knowledge relevant to these problems as TC Ontology. The sentences in TC-AWPs contain information of essentially four types: before-transfer, transfer, after-transfer, and query. Our system (KLAUS-Tr) uses statistical classifier to recognize the types of sentences. The sentence types guide the information extraction process used to identify the agents, quantities, units, types of objects, and the direction of transfer from the AWP text. The extracted information is represented as an RDF graph that utilizes the TC Ontology terminology. To solve the given AWP, we utilize semantic web rule language (SWRL) rules that capture the knowledge about how object transfer affects the RDF graph of the AWP. Using the TC ontology, we also analyze if the given problem is consistent or otherwise. The different ways in which TC-AWPs can be inconsistent are encoded as SWRL rules. Thus, KLAUS-Tr can identify if the given AWP is invalid and accordingly notify the user. Since the existing datasets do not have inconsistent AWPs, we create AWPs of this type and augment the datasets. We have implemented KLAUS-Tr and tested it on TC-type AWPs drawn from the All-Arith and other datasets. We find that TC-AWPs constitute about 40% of the AWPs in a typical dataset like All-Arith. Our system achieves an impressive accuracy of 92%, thus improving the state-of-the-art significantly. We plan to extend the system to handle AWPs that contain multiple transfers of objects and also offer explanations of the solutions.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"1 1","pages":""},"PeriodicalIF":2.5,"publicationDate":"2022-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"57584428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kenneth Ward Church, Annika Marie Schoene, John E. Ortega, Raman Chandrasekar, Valia Kordoni
{"title":"Emerging trends: Unfair, biased, addictive, dangerous, deadly, and insanely profitable","authors":"Kenneth Ward Church, Annika Marie Schoene, John E. Ortega, Raman Chandrasekar, Valia Kordoni","doi":"10.1017/s1351324922000481","DOIUrl":"https://doi.org/10.1017/s1351324922000481","url":null,"abstract":"\u0000 There has been considerable work recently in the natural language community and elsewhere on Responsible AI. Much of this work focuses on fairness and biases (henceforth Risks 1.0), following the 2016 best seller: Weapons of Math Destruction. Two books published in 2022, The Chaos Machine and Like, Comment, Subscribe, raise additional risks to public health/safety/security such as genocide, insurrection, polarized politics, vaccinations (henceforth, Risks 2.0). These books suggest that the use of machine learning to maximize engagement in social media has created a Frankenstein Monster that is exploiting human weaknesses with persuasive technology, the illusory truth effect, Pavlovian conditioning, and Skinner’s intermittent variable reinforcement. Just as we cannot expect tobacco companies to sell fewer cigarettes and prioritize public health ahead of profits, so too, it may be asking too much of companies (and countries) to stop trafficking in misinformation given that it is so effective and so insanely profitable (at least in the short term). Eventually, we believe the current chaos will end, like the lawlessness in Wild West, because chaos is bad for business. As computer scientists, this paper will summarize criticisms from other fields and focus on implications for computer science; we will not attempt to contribute to those other fields. There is quite a bit of work in computer science on these risks, especially on Risks 1.0 (bias and fairness), but more work is needed, especially on Risks 2.0 (addictive, dangerous, and deadly).","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"02 1","pages":"483-508"},"PeriodicalIF":2.5,"publicationDate":"2022-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80068869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parameter-efficient feature-based transfer for paraphrase identification","authors":"Xiaodong Liu, Rafal Rzepka, K. Araki","doi":"10.1017/S135132492200050X","DOIUrl":"https://doi.org/10.1017/S135132492200050X","url":null,"abstract":"Abstract There are many types of approaches for Paraphrase Identification (PI), an NLP task of determining whether a sentence pair has equivalent semantics. Traditional approaches mainly consist of unsupervised learning and feature engineering, which are computationally inexpensive. However, their task performance is moderate nowadays. To seek a method that can preserve the low computational costs of traditional approaches but yield better task performance, we take an investigation into neural network-based transfer learning approaches. We discover that by improving the usage of parameters efficiently for feature-based transfer, our research goal can be accomplished. Regarding the improvement, we propose a pre-trained task-specific architecture. The fixed parameters of the pre-trained architecture can be shared by multiple classifiers with small additional parameters. As a result, the computational cost left involving parameter update is only generated from classifier-tuning: the features output from the architecture combined with lexical overlap features are fed into a single classifier for tuning. Furthermore, the pre-trained task-specific architecture can be applied to natural language inference and semantic textual similarity tasks as well. Such technical novelty leads to slight consumption of computational and memory resources for each task and is also conducive to power-efficient continual learning. The experimental results show that our proposed method is competitive with adapter-BERT (a parameter-efficient fine-tuning approach) over some tasks while consuming only 16% trainable parameters and saving 69-96% time for parameter update.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1066 - 1096"},"PeriodicalIF":2.5,"publicationDate":"2022-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41557645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NLE volume 28 issue 6 Cover and Front matter","authors":"R. Mitkov, B. Boguraev","doi":"10.1017/s1351324922000468","DOIUrl":"https://doi.org/10.1017/s1351324922000468","url":null,"abstract":"whether trans-lation, computer science or engineering. Its is to the computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing original research articles on a broad range of topics - from text analy- sis, machine translation, information retrieval, speech processing and generation to integrated systems and multi-modal interfaces - it also publishes special issues on specific natural language processing methods, tasks or applications. The journal welcomes survey papers describing the state of the art of a specific topic. The Journal of Natural Language Engineering also publishes the popular Industry Watch and Emerging Trends columns as well as book reviews.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"28 1","pages":"f1 - f2"},"PeriodicalIF":2.5,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42536366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NLE volume 28 issue 6 Cover and Back matter","authors":"","doi":"10.1017/s135132492200047x","DOIUrl":"https://doi.org/10.1017/s135132492200047x","url":null,"abstract":"","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"28 1","pages":"b1 - b2"},"PeriodicalIF":2.5,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41966861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards universal methods for fake news detection","authors":"M. Pszona, M. Janicka, Grzegorz Wojdyga, A. Wawer","doi":"10.1017/s1351324922000456","DOIUrl":"https://doi.org/10.1017/s1351324922000456","url":null,"abstract":"Abstract Fake news detection is an emerging topic that has attracted a lot of attention among researchers and in the industry. This paper focuses on fake news detection as a text classification problem: on the basis of five publicly available corpora with documents labeled as true or fake, the task was to automatically distinguish both classes without relying on fact-checking. The aim of our research was to test the feasibility of a universal model: one that produces satisfactory results on all data sets tested in our article. We attempted to do so by training a set of classification models on one collection and testing them on another. As it turned out, this resulted in a sharp performance degradation. Therefore, this paper focuses on finding the most effective approach to utilizing information in a transferable manner. We examined a variety of methods: feature selection, machine learning approaches to data set shift (instance re-weighting and projection-based), and deep learning approaches based on domain transfer. These methods were applied to various feature spaces: linguistic and psycholinguistic, embeddings obtained from the Universal Sentence Encoder, and GloVe embeddings. A detailed analysis showed that some combinations of these methods and selected feature spaces bring significant improvements. When using linguistic data, feature selection yielded the best overall mean improvement (across all train-test pairs) of 4%. Among the domain adaptation methods, the greatest improvement of 3% was achieved by subspace alignment.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1004 - 1042"},"PeriodicalIF":2.5,"publicationDate":"2022-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45906028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}