Constructing Corpus and Word Embedding for Spanish Covid-19 Data

Proces. del Leng. Natural. Pub Date: 2021-09-06. DOI: 10.26342/2021-67-3

Kyungjin Hwang


Abstract

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19 and is colloquially referred to as coronavirus, emerged in 2019 and escalated into a global pandemic with high transmission and mortality rates. Although the virus's worldwide impact grew in 2020, studies on Natural Language Processing in Spanish have largely neglected corpus construction and word embedding; corpora covering coronavirus or infectious diseases are conspicuously absent. Moreover, corpora and word embeddings built for the medical field do not perform well on text about coronavirus or infectious diseases. To address this gap, this study collects Spanish-language data, builds a relevant coronavirus corpus through appropriate preprocessing, and then obtains a word embedding from it. The corpus and word embedding are then evaluated against an existing Spanish corpus through word similarity evaluations, a cosine similarity evaluation, and a visualization evaluation. Based on this comparison, a corpus and word embedding suitable for coronavirus are proposed.
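
The cosine similarity evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is not the study's code: the three-dimensional vectors, the `toy_embedding` dictionary, and the `most_similar` helper are hypothetical and invented only to show how cosine similarity ranks words in an embedding space.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, NOT taken from the paper's embedding.
toy_embedding = {
    "virus": [0.9, 0.1, 0.0],
    "pandemia": [0.8, 0.3, 0.1],
    "guitarra": [0.0, 0.2, 0.9],
}


def most_similar(word, embedding):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = embedding[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in embedding.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in most_similar("virus", toy_embedding):
        print(f"{other}: {score:.3f}")
```

In a real evaluation the vectors would come from the trained embedding, and a domain-specific embedding would be expected to place coronavirus-related terms closer together than a general-purpose one.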


