Language Resources and Evaluation: Latest Articles

NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese
Zone 3, Computer Science
Language Resources and Evaluation Pub Date : 2023-10-17 DOI: 10.1007/s10579-023-09693-w
Sidney Evaldo Leal, Magali Sanches Duran, Carolina Evaristo Scarton, Nathan Siegle Hartmann, Sandra Maria Aluísio
Abstract: The objective of this paper is to present and make publicly available the NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse, psycholinguistics, cognitive and computational linguistics, to assess textual complexity in Brazilian Portuguese (BP). The metrics are relevant for descriptive analysis and the creation of computational models and can be used to extract information from various linguistic levels of written and spoken language. The metrics were developed during the last 13 years, starting in the end of 2007, within the scope of the PorSimples project. Once the PorSimples finished, new metrics were added to the initial 48 metrics of the Coh-Metrix-Port tool. Coh-Metrix-Port adapted some metrics to BP from the Coh-Metrix tool that computes metrics related to cohesion and coherence of texts in English. Given the large number of metrics, we present them following an organisation similar to the metrics of Coh-Metrix v3.0 to facilitate comparisons made with metrics in Portuguese and English, in future studies using both tools.
In this paper, we illustrate the potential of the NILC-Metrix by presenting three applications: (i) a descriptive analysis of the differences between children’s film subtitles and texts written for Elementary School I (comprises classes from 1st to 5th grade) and II (Final Years) (comprises classes from 6th to 9th grade, in an age group that corresponds to the transition between childhood and adolescence); (ii) a new predictor of textual complexity for the corpus of original and simplified texts of the PorSimples project; (iii) a complexity prediction model for school grades, using transcripts of children’s story narratives told by teenagers. For each application, we evaluate which groups of metrics are more discriminative, showing their contribution for each task.
Citations: 5
A semi-supervised method to generate a Persian dataset for suggestion classification
Zone 3, Computer Science
Language Resources and Evaluation Pub Date : 2023-09-29 DOI: 10.1007/s10579-023-09688-7
Leila Safari, Zanyar Mohammady
Abstract: Suggestion mining has become a popular subject in the field of natural language processing (NLP) that is useful in areas like a service/product improvement. The purpose of this study is to provide an automated machine learning (ML) based approach to extract suggestions from Persian text. In this research, first, a novel two-step semi-supervised method has been proposed to generate a Persian dataset called ParsSugg, which is then used in the automatic classification of the user’s suggestions. The first step is manual labeling of data based on a proposed guideline, followed by a data augmentation phase. In the second step, using pre-trained Persian Bidirectional Encoder Representations from Transformers (ParsBERT) as a classifier and the data from the previous step, more data were labeled. The performance of various ML models, including Support Vector Machine (SVM), Random Forest (RF), Convolutional Neural Networks (CNN), Long Short Term Memory (LSTM), and the ParsBERT language model has been examined on the generated dataset. The F-score value of 97.27 for ParsBERT and about 94.5 for SVM and CNN classifiers were obtained for the suggestion class which is a promising result as the first research on suggestion classification on Persian texts.
Also, the proposed guideline can be used for other NLP tasks, and the generated dataset can be used in other suggestion classification tasks.
Citations: 0
NEREL: a Russian information extraction dataset with rich annotation for nested entities, relations, and wikidata entity links
Zone 3, Computer Science
Language Resources and Evaluation Pub Date : 2023-09-21 DOI: 10.1007/s10579-023-09674-z
Natalia Loukachevitch, Ekaterina Artemova, Tatiana Batura, Pavel Braslavski, Vladimir Ivanov, Suresh Manandhar, Alexander Pugachev, Igor Rozhkov, Artem Shelmanov, Elena Tutubalina, Alexey Yandutov
Abstract: This paper describes NEREL—a Russian news dataset suited for three tasks: nested named entity recognition, relation extraction, and entity linking. Compared to flat entities, nested named entities provide a richer and more complete annotation while also increasing the coverage of relations annotation and entity linking. Relations between nested named entities may cross entity boundaries to connect to shorter entities nested within longer ones, which makes it harder to detect such relations. NEREL is currently the largest Russian dataset annotated with entities and relations: it comprises 29 named entity types and 49 relation types. At the time of writing, the dataset contains 56 K named entities and 39 K relations annotated in 933 person-oriented news articles. NEREL is annotated with relations at three levels: (1) within nested named entities, (2) within sentences, and (3) with relations crossing sentence boundaries. We provide benchmark evaluation of current state-of-the-art methods in all three tasks.
The dataset is freely available at https://github.com/nerel-ds/NEREL .
Citations: 0
A survey and study impact of tweet sentiment analysis via transfer learning in low resource scenarios
Zone 3, Computer Science
Language Resources and Evaluation Pub Date : 2023-09-14 DOI: 10.1007/s10579-023-09687-8
Manoel Veríssimo dos Santos Neto, Nádia Félix F. da Silva, Anderson da Silva Soares
Abstract: Sentiment analysis (SA) is a study area focused on obtaining contextual polarity from the text. Currently, deep learning has obtained outstanding results in this task. However, much annotated data are necessary to train these algorithms, and obtaining this data is expensive and difficult. In the context of low-resource scenarios, this problem is even more significant because there are little available data. Transfer learning (TL) can be used to minimize this problem because it is possible to develop some architectures using fewer data. Language models are a way of applying TL in natural language processing (NLP), and they have achieved competitive results. Nevertheless, some models need many hours of training using many computational resources, and in some contexts, people and organizations do not have the resources to do this. In this paper, we explore the models BERT (Pretraining of Deep Bidirectional Transformers for Language Understanding), MultiFiT (Efficient Multilingual Language Model Fine-tuning), ALBERT (A Lite BERT for Self-supervised Learning of Language Representations), and RoBERTa (A Robustly Optimized BERT Pretraining Approach). In all of our experiments, these models obtain better results than CNN (convolutional neural network) and LSTM (Long Short Term Memory) models. To MultiFiT and RoBERTa models, we propose a pretrained language model (PTLM) using Twitter data. Using this approach, we obtained competitive results compared with the models trained in formal language datasets.
The main goal is to show the impacts of TL and language models comparing results with other techniques and showing the computational costs of using these approaches.
Citations: 1
An eye-tracking-with-EEG coregistration corpus of narrative sentences
IF 2.7, Zone 3, Computer Science
Language Resources and Evaluation Pub Date : 2023-08-29 DOI: 10.1007/s10579-023-09684-x
S. Frank, Anna Aumeistere
Citations: 1
Data augmentation strategies to improve text classification: a use case in smart cities
IF 2.7, Zone 3, Computer Science
Language Resources and Evaluation Pub Date : 2023-08-23 DOI: 10.1007/s10579-023-09685-w
Luciana Bencke, V. Moreira
Citations: 0
The development of a labelled te reo Māori–English bilingual database for language technology
Zone 3, Computer Science
Language Resources and Evaluation Pub Date : 2023-08-20 DOI: 10.1007/s10579-023-09680-1
Jesin James, Isabella Shields, Vithya Yogarajan, Peter J. Keegan, Catherine I. Watson, Peter-Lucas Jones, Keoni Mahelona
Abstract: Te reo Māori (referred to as Māori), New Zealand’s indigenous language, is under-resourced in language technology. Māori speakers are bilingual, where Māori is code-switched with English. Unfortunately, there are minimal resources available for Māori language technology, language detection and code-switch detection between Māori–English pair. Both English and Māori use Roman-derived orthography making rule-based systems for detecting language and code-switching restrictive. Most Māori language detection is done manually by language experts. This research builds a Māori–English bilingual database of 66,016,807 words with word-level language annotation. The New Zealand Parliament Hansard debates reports were used to build the database. The language labels are assigned automatically using language-specific rules and expert manual annotations. Words with the same spelling, but different meanings, exist for Māori and English. These words could not be categorised as Māori or English based on word-level language rules. Hence, manual annotations were necessary. An analysis reporting the various aspects of the database such as metadata, year-wise analysis, frequently occurring words, sentence length and N-grams is also reported. The database developed here is a valuable tool for future language and speech technology development for Aotearoa New Zealand.
The methodology followed to label the database can also be followed by other low-resourced language pairs.
Citations: 0
Comparative performance of ensemble machine learning for Arabic cyberbullying and offensive language detection
IF 2.7, Zone 3, Computer Science
Language Resources and Evaluation Pub Date : 2023-08-13 DOI: 10.1007/s10579-023-09683-y
M. Khairy, Tarek M. Mahmoud, Ahmed Omar, Tarek Abd El-Hafeez
Citations: 0
RUN-AS: a novel approach to annotate news reliability for disinformation detection
IF 2.7, Zone 3, Computer Science
Language Resources and Evaluation Pub Date : 2023-08-06 DOI: 10.1007/s10579-023-09678-9
Alba Bonet-Jover, Robiert Sepúlveda-Torres, E. Saquete, P. Martínez-Barco, Mario Nieto-Pérez
Citations: 0
The limitations of irony detection in Dutch social media
IF 2.7, Zone 3, Computer Science
Language Resources and Evaluation Pub Date : 2023-07-23 DOI: 10.1007/s10579-023-09656-1
Aaron Maladry, Els Lefever, Cynthia Van Hee, Veronique Hoste
Citations: 2