Constructing Corpus and Word Embedding for Spanish Covid-19 Data

Kyungjin Hwang
Journal: Proces. del Leng. Natural. Publication date: 2021-09-06. DOI: 10.26342/2021-67-3 (https://doi.org/10.26342/2021-67-3)

Abstract

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and colloquially referred to as coronavirus, emerged in 2019 and escalated into a global pandemic with high transmission and mortality rates. Although the virus's worldwide impact intensified in 2020, studies on Natural Language Processing in Spanish have largely neglected corpus construction and word embedding for this domain; corpora covering coronavirus or infectious diseases are conspicuously absent. In addition, existing corpora and word embeddings built for the medical field do not perform well on text about coronavirus or infectious diseases. To address this gap, this study collects Spanish-language data, builds a coronavirus corpus through appropriate preprocessing, and then obtains a word embedding from it. The corpus and word embedding are evaluated against an existing Spanish corpus through word similarity evaluations, a cosine similarity evaluation, and a visualization evaluation. Based on this comparison, the study proposes a corpus and word embedding suited to coronavirus-related text.
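
The cosine similarity evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the embedding model, its dimensionality, or the evaluation word lists, so the three-dimensional vectors and the helper `most_similar` below are hypothetical and serve only to show how cosine similarity ranks words in an embedding space.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for illustration only.
# They are NOT taken from the paper's embedding.
toy_embedding = {
    "virus": [1.0, 0.0, 1.0],
    "pandemia": [1.0, 0.0, 0.0],
    "guitarra": [0.0, 1.0, 0.0],
}


def most_similar(word, embedding):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = embedding[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in embedding.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("virus", toy_embedding)
for other, score in ranking:
    print(f"{other}: {score:.4f}")
```

In this toy space "pandemia" ranks above "guitarra" for the query "virus". A domain-specific embedding would be expected to place coronavirus-related terms close together in the same way, which is the property a cosine similarity evaluation checks.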