A sentiment corpus for the cryptocurrency financial domain: the CryptoLin corpus

IF 1.8 3区计算机科学 Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Language Resources and Evaluation Pub Date : 2024-05-25 DOI:10.1007/s10579-024-09743-x

Manoel Fernando Alonso Gadi, Miguel Ángel Sicilia

{"title":"A sentiment corpus for the cryptocurrency financial domain: the CryptoLin corpus","authors":"Manoel Fernando Alonso Gadi, Miguel Ángel Sicilia","doi":"10.1007/s10579-024-09743-x","DOIUrl":null,"url":null,"abstract":"<p>The objective of this paper is to describe Cryptocurrency Linguo (CryptoLin), a novel corpus containing 2683 cryptocurrency-related news articles covering more than a three-year period. CryptoLin was human-annotated with discrete values representing negative, neutral, and positive news respectively. Eighty-three people participated in the annotation process; each news title was randomly assigned and blindly annotated by three human annotators, one in each different cohort, followed by a consensus mechanism using simple voting. The selection of the annotators was intentionally made using three cohorts with students from a very diverse set of nationalities and educational backgrounds to minimize bias as much as possible. In case one of the annotators was in total disagreement with the other two (e.g., one negative vs two positive or one positive vs two negative), we considered this minority report and defaulted the labeling to neutral. Fleiss’s Kappa, Krippendorff’s Alpha, and Gwet’s AC1 inter-rater reliability coefficients demonstrate CryptoLin’s acceptable quality of inter-annotator agreement. The dataset also includes a text span with the three manual label annotations for further auditing of the annotation mechanism. To further assess the quality of the labeling and the usefulness of CryptoLin dataset, it incorporates four pretrained Sentiment Analysis models: Vader, Textblob, Flair, and FinBERT. Vader and FinBERT demonstrate reasonable performance in the CryptoLin dataset, indicating that the data was not annotated randomly and is therefore useful for further research1. FinBERT (negative) presents the best performance, indicating an advantage of being trained with financial news. Both the CryptoLin dataset and the Jupyter Notebook with the analysis, for reproducibility, are available at the project’s Github. Overall, CryptoLin aims to complement the current knowledge by providing a novel and publicly available Gadi and Ángel Sicilia (Cryptolin dataset and python jupyter notebooks reproducibility codes, 2022) cryptocurrency sentiment corpus and fostering research on the topic of cryptocurrency sentiment analysis and potential applications in behavioral science. This can be useful for businesses and policymakers who want to understand how cryptocurrencies are being used and how they might be regulated. Finally, the rules for selecting and assigning annotators make CryptoLin unique and interesting for new research in annotator selection, assignment, and biases.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"16 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-024-09743-x","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

Abstract

The objective of this paper is to describe Cryptocurrency Linguo (CryptoLin), a novel corpus containing 2683 cryptocurrency-related news articles covering more than a three-year period. CryptoLin was human-annotated with discrete values representing negative, neutral, and positive news respectively. Eighty-three people participated in the annotation process; each news title was randomly assigned and blindly annotated by three human annotators, one in each different cohort, followed by a consensus mechanism using simple voting. The selection of the annotators was intentionally made using three cohorts with students from a very diverse set of nationalities and educational backgrounds to minimize bias as much as possible. In case one of the annotators was in total disagreement with the other two (e.g., one negative vs two positive or one positive vs two negative), we considered this minority report and defaulted the labeling to neutral. Fleiss’s Kappa, Krippendorff’s Alpha, and Gwet’s AC1 inter-rater reliability coefficients demonstrate CryptoLin’s acceptable quality of inter-annotator agreement. The dataset also includes a text span with the three manual label annotations for further auditing of the annotation mechanism. To further assess the quality of the labeling and the usefulness of CryptoLin dataset, it incorporates four pretrained Sentiment Analysis models: Vader, Textblob, Flair, and FinBERT. Vader and FinBERT demonstrate reasonable performance in the CryptoLin dataset, indicating that the data was not annotated randomly and is therefore useful for further research1. FinBERT (negative) presents the best performance, indicating an advantage of being trained with financial news. Both the CryptoLin dataset and the Jupyter Notebook with the analysis, for reproducibility, are available at the project’s Github. Overall, CryptoLin aims to complement the current knowledge by providing a novel and publicly available Gadi and Ángel Sicilia (Cryptolin dataset and python jupyter notebooks reproducibility codes, 2022) cryptocurrency sentiment corpus and fostering research on the topic of cryptocurrency sentiment analysis and potential applications in behavioral science. This can be useful for businesses and policymakers who want to understand how cryptocurrencies are being used and how they might be regulated. Finally, the rules for selecting and assigning annotators make CryptoLin unique and interesting for new research in annotator selection, assignment, and biases.

查看原文本刊更多论文

加密货币金融领域的情感语料库：CryptoLin 语料库

CryptoLin 是一个新颖的语料库，包含 2683 篇与加密货币相关的新闻文章，时间跨度超过三年。CryptoLin 由人工标注，离散值分别代表负面、中性和正面新闻。有 83 人参与了注释过程；每个新闻标题都由三名人类注释者随机分配并盲注，每组各一名，然后通过简单投票达成共识。在选择注释员时，有意使用了来自不同国籍和教育背景的三批学生，以尽可能减少偏差。如果其中一位批注者与其他两位批注者意见完全不一致（例如，一个否定对两个肯定，或一个肯定对两个否定），我们会考虑这个少数报告，并将标签默认为中性。弗莱斯卡帕（Fleiss's Kappa）、克里彭多夫阿尔法（Krippendorff's Alpha）和格威特AC1评分者间可靠性系数表明，CryptoLin的评分者间一致性质量是可以接受的。数据集还包括一个包含三个人工标签注释的文本跨度，以便进一步审核注释机制。为了进一步评估 CryptoLin 数据集的标注质量和实用性，该数据集采用了四种预训练的情感分析模型：Vader、Textblob、Flair 和 FinBERT。Vader 和 FinBERT 在 CryptoLin 数据集中表现出了合理的性能，表明该数据并非随机标注，因此对进一步研究非常有用1。FinBERT（负值）的性能最好，这表明使用财经新闻进行训练具有优势。CryptoLin 数据集和包含分析结果的 Jupyter Notebook 均可在该项目的 Github 上获取，以实现可重复性。总之，CryptoLin 的目的是通过提供一个新颖的、可公开获取的 Gadi 和 Ángel Sicilia（Cryptolin 数据集和 python Jupyter 笔记本可重现性代码，2022 年）加密货币情感语料库来补充现有知识，并促进加密货币情感分析主题的研究和在行为科学中的潜在应用。这对于想要了解加密货币的使用情况以及如何对其进行监管的企业和政策制定者来说非常有用。最后，选择和分配注释者的规则使 CryptoLin 具有独特性，对注释者选择、分配和偏见方面的新研究很有意义。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Language Resources and Evaluation 工程技术-计算机：跨学科应用

CiteScore

6.50

自引率

3.70%

发文量

审稿时长

>12 weeks

期刊介绍： Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.