Language Resources and Evaluation: latest articles, page 2

Building a relevance feedback corpus for legal information retrieval in the real-case scenario of the Brazilian Chamber of Deputies

IF 2.7; Zone 3, Computer Science

Language Resources and Evaluation, Pub Date: 2024-08-18, DOI: 10.1007/s10579-024-09767-3

Douglas Vitório, Ellen Souza, Lucas Martins, Nádia F. F. da Silva, André Carlos Ponce de Leon de Carvalho, Adriano L. I. Oliveira, Francisco Edmundo de Andrade

Abstract: The proper functioning of judicial and legislative institutions requires the efficient retrieval of legal documents from extensive datasets. Legal Information Retrieval focuses on investigating how to efficiently handle these datasets, enabling the retrieval of pertinent information from them. Relevance Feedback, an important aspect of Information Retrieval systems, utilizes the relevance information provided by the user to enhance document retrieval for a specific request. However, there is a lack of available corpora containing this information, particularly for the legislative scenario. Thus, this paper presents Ulysses-RFCorpus, a Relevance Feedback corpus for legislative information retrieval, built in the real-case scenario of the Brazilian Chamber of Deputies. To the best of our knowledge, this corpus is the first publicly available of its kind for the Brazilian Portuguese language. It is also the only corpus that contains feedback information for legislative documents, as the other corpora found in the literature primarily focus on judicial texts. We also used the corpus to evaluate the performance of the Brazilian Chamber of Deputies’ Information Retrieval system. Thereby, we highlighted the model’s strong performance and emphasized the dataset’s significance in the field of Legal Information Retrieval.

Citations: 0

PESTS: Persian_English cross lingual corpus for semantic textual similarity

IF 2.7; Zone 3, Computer Science

Language Resources and Evaluation, Pub Date: 2024-08-03, DOI: 10.1007/s10579-024-09759-3

Mohammad Abdous, Poorya Piroozfar, Behrouz MinaeiBidgoli

Abstract: In recent years, there has been significant research interest in the subtask of natural language processing called semantic textual similarity. Measuring semantic similarity between words or terms, sentences, paragraphs and documents plays an important role in natural language processing and computational linguistics. It finds applications in question-answering systems, semantic search, fraud detection, machine translation, information retrieval, and more. Semantic similarity entails evaluating the extent of similarity in meaning between two textual documents, paragraphs, or sentences, both in the same language and across different languages. To achieve cross-lingual semantic similarity, it is essential to have corpora that consist of sentence pairs in both the source and target languages. These sentence pairs should demonstrate a certain degree of semantic similarity between them. Due to the lack of available cross-lingual semantic similarity datasets, many current models in this field rely on machine translation. However, this dependence on machine translation can result in reduced model accuracy due to the potential propagation of translation errors. In the case of Persian, which is categorized as a low-resource language, there has been a lack of efforts in developing models that can comprehend the context of two languages. The demand for such a model that can bridge the understanding gap between languages is now more crucial than ever. In this article, a corpus of semantic textual similarity between sentences in Persian and English has been produced for the first time through the collaboration of linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). This corpus contains 5375 sentence pairs. Furthermore, various Transformer-based models have been fine-tuned using this dataset. Based on the results obtained from the PESTS dataset, it is observed that the utilization of the XLM-RoBERTa model leads to an increase in the Pearson correlation from 85.87 to 95.62%.

Citations: 0
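The PESTS abstract above reports its results as a Pearson correlation between model predictions and gold similarity scores. As a rough illustration of what that metric computes, here is a minimal sketch in plain Python. The function name `pearson` and the sample scores are hypothetical and are not taken from the paper; the paper's actual evaluation code is not described in the abstract.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all without the 1/n factor (it cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold similarity labels and model predictions (made-up values).
gold = [0.0, 1.5, 3.0, 4.0, 5.0]
pred = [0.4, 1.2, 2.7, 4.3, 4.6]
print(f"Pearson r = {pearson(gold, pred):.4f}")
```

A value close to 1 means the predicted similarity scores rise and fall with the human-assigned ones; the metric is insensitive to a constant offset or positive rescaling of the predictions.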

DoSLex: automatic generation of all domain semantically rich sentiment lexicon

IF 2.7; Zone 3, Computer Science

Language Resources and Evaluation, Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09753-9

Minni Jain, Rajni Jindal, Amita Jain

Abstract: For sentiment analysis, lexicons are among the important resources. Existing sentiment lexicons have a generic polarity for each word. In fact, many words have different polarities when they are used in different domains. For the first time, in this work automation of a domain-specific sentiment lexicon named “DoSLex” has been proposed. In DoSLex, all the words are represented in a circle where the centre stands for the domain, and the x and y axes for the strength and the orientation of the sentiment, respectively. In the circle, the radius is the contextual similarity between the domain and term calculated using MuRIL embeddings, and the angle is the prior sentiment score taken from various knowledge bases. The proposed approach is language-independent and can be applied to any domain. Extensive experiments were conducted on three low-resource languages: Hindi, Tamil, and Bangla. The experimental studies discuss the performance of the combinations of different word embeddings (FastText, M-BERT and MuRIL) with several sources of prior sentiment knowledge bases on various domains. The performance of DoSLex has also been compared with three sentiment lexicons, and the results demonstrate a significant improvement in sentiment analysis.

Citations: 0
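The DoSLex abstract above describes each term as a point in a circle, with the radius given by the domain-term similarity and the angle by a prior sentiment score. The sketch below shows one way such a polar placement could be converted to x and y coordinates. It rests on assumptions that the abstract does not state: the similarity is taken to lie in [0, 1], the prior score in [-1, 1], and the score is mapped linearly to an angle in [-π/2, π/2]. The function name `place_term` is hypothetical, and the similarity is passed in directly rather than computed from MuRIL embeddings.

```python
import math


def place_term(similarity, prior_score):
    """Convert (radius, angle) to (x, y) under the assumptions stated above.

    similarity  -- assumed domain-term similarity in [0, 1], used as the radius
    prior_score -- assumed prior sentiment score in [-1, 1], mapped to an angle
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity is assumed to lie in [0, 1]")
    if not -1.0 <= prior_score <= 1.0:
        raise ValueError("prior_score is assumed to lie in [-1, 1]")
    # Assumed mapping: score -1 -> -pi/2, score 0 -> 0, score +1 -> +pi/2.
    theta = prior_score * math.pi / 2
    x = similarity * math.cos(theta)
    y = similarity * math.sin(theta)
    return x, y


# Hypothetical term: fairly close to the domain, mildly negative prior score.
x, y = place_term(0.8, -0.5)
print(f"x = {x:.3f}, y = {y:.3f}")
```

Under this assumed mapping the sign of y follows the polarity of the prior score, and a term's distance from the centre equals its similarity to the domain. The mapping actually used in the paper may differ.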

How different is different? Systematically identifying distribution shifts and their impacts in NER datasets

IF 2.7; Zone 3, Computer Science

Language Resources and Evaluation, Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09754-8

Xue Li, Paul Groth

Abstract: When processing natural language, we are frequently confronted with the problem of distribution shift. For example, a model trained on a news corpus exhibits reduced performance when subsequently used to process legal text. While this problem is well-known, to this point, there has not been a systematic study of detecting shifts and investigating the impact shifts have on model performance for NLP tasks. Therefore, in this paper, we detect and measure two types of distribution shift, across three different representations, for 12 benchmark Named Entity Recognition datasets. We show that both input shift and label shift can lead to dramatic performance degradation. For example, fine-tuning on a wide spectrum dataset (OntoNotes) and testing on an email dataset (CEREC) that shares labels leads to a 63-point drop in F1 performance. Overall, our results indicate that the measurement of distribution shift can provide guidance on the amount of data needed for fine-tuning and whether or not a model can be used “off-the-shelf” without subsequent fine-tuning. Finally, our results show that shift measurement can play an important role in NLP model pipeline definition.

Citations: 0
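The abstract above mentions measuring label shift between NER datasets but does not say which measure is used. As an illustration only, the sketch below compares the label frequency distributions of two datasets with the Jensen-Shannon divergence, a common choice that is swapped in here as an assumption and is not claimed to be the authors' method. The helper names and the toy label lists are hypothetical.

```python
import math
from collections import Counter


def label_distribution(labels, vocabulary):
    """Relative frequency of each label in `vocabulary` within `labels`."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return [counts[label] / len(labels) for label in vocabulary]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) of two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy entity labels for a source and a target dataset (made-up values).
source = ["PER", "PER", "ORG", "LOC", "LOC", "LOC"]
target = ["PER", "ORG", "ORG", "ORG", "ORG", "LOC"]
vocab = sorted(set(source) | set(target))

shift = js_divergence(label_distribution(source, vocab),
                      label_distribution(target, vocab))
print(f"label shift (JS divergence) = {shift:.3f}")
```

A value of 0 means the two datasets use their labels in identical proportions, and a value of 1 means their label sets do not overlap at all. Input shift would need a comparison over text representations instead, which this sketch does not cover.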

Ulysses Tesemõ: a new large corpus for Brazilian legal and governmental domain

IF 2.7; Zone 3, Computer Science

Language Resources and Evaluation, Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09762-8

Felipe A. Siqueira, Douglas Vitório, Ellen Souza, José A. P. Santos, Hidelberg O. Albuquerque, Márcio S. Dias, Nádia F. F. Silva, André C. P. L. F. de Carvalho, Adriano L. I. Oliveira, Carmelo Bastos-Filho

Abstract: The increasing use of artificial intelligence methods in the legal field has sparked interest in applying Natural Language Processing techniques to handle legal tasks and reduce the workload of legal professionals. However, the availability of legal corpora in Portuguese, especially for the Brazilian legal domain, is limited. Existing resources offer some legal data but lack comprehensive coverage. To address this gap, we present Ulysses Tesemõ, a large corpus specifically built for the Brazilian legal domain. The corpus consists of over 3.5 million files, totaling 30.7 GiB of raw text, collected from 159 sources encompassing judicial, legislative, academic, news, and other related data. The data was collected by scraping public information from governmental websites, emphasizing contents generated over the past two decades. We categorized the obtained files into 30 distinct categories, covering various branches of the Brazilian government and different types of texts. The corpus retains the original content with minimal data transformations, addressing the scarcity of Portuguese legal corpora and providing researchers with a valuable resource for advancing in the research area.

Citations: 0

Historical Portuguese corpora: a survey

IF 2.7; Zone 3, Computer Science

Language Resources and Evaluation, Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09757-5

Tomás Freitas Osório, Henrique Lopes Cardoso

Citations: 0

Šolar, the developmental corpus of Slovene

IF 2.7; Zone 3, Computer Science

Language Resources and Evaluation, Pub Date: 2024-07-18, DOI: 10.1007/s10579-024-09758-4

Špela Arhar Holdt, Iztok Kosem

Citations: 0

Parlamint-it: an 18-karat UD treebank of Italian parliamentary speeches

IF 2.7; Zone 3, Computer Science

Language Resources and Evaluation, Pub Date: 2024-07-06, DOI: 10.1007/s10579-024-09748-6

Chiara Alzetta, Simonetta Montemagni, Marta Sartor, Giulia Venturi

Citations: 0

BRISE-plandok: a German legal corpus of building regulations

IF 2.7; Zone 3, Computer Science

Language Resources and Evaluation, Pub Date: 2024-07-06, DOI: 10.1007/s10579-024-09747-7

Gábor Recski, Eszter Iklódi, Björn Lellmann, Ádám Kovács, Allan Hanbury

Abstract: We present the BRISE-Plandok corpus, a collection of 250 text documents with a total of over 7000 sentences from the Zoning Map of the City of Vienna, annotated manually with formal representations of the rules they convey. The generic rule format used by the corpus enables automated compliance checking of building plans, a process developed as part of the BRISE (https://smartcity.wien.gv.at/en/brise/) project. The format also allows for conversion to multiple logic formalisms, including dyadic deontic logic, enabling automated reasoning. Annotation guidelines were developed in collaboration with experts of the city’s building inspection office, describing nearly 100 domain-specific attributes with examples. Each document was annotated independently by two trained annotators and subsequently reviewed by the authors. A rule-based system for the automatic extraction of rules from text was developed and used in the annotation process to provide suggestions. The reviewed dataset was also used to train a set of baseline machine learning models for the task of attribute extraction, the main step in the rule extraction process. Both the rule-based system and the ML baselines are evaluated on the annotated dataset and released as open-source software. We also describe and release the framework used for generating and parsing the interactive xlsx spreadsheets used by annotators.

Citations: 0

Research on translation quality self-evaluation by expert translators: an empirical study

IF 2.7; Zone 3, Computer Science

Language Resources and Evaluation, Pub Date: 2024-07-06, DOI: 10.1007/s10579-024-09760-w

Yaya Zheng

Citations: 0