Analysis of Lexical Semantic Changes in Corpora with the Diachronic Engine

Pierluigi Cassotti, Pierpaolo Basile, M. Degemmis, G. Semeraro
Published in: Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020
DOI: 10.4000/books.aaccademia.8343 (https://doi.org/10.4000/books.aaccademia.8343)

Abstract

English. With the growing availability of digitized diachronic corpora, the need for tools that take the diachronic component of corpora into account is ever more pressing. Recent work on diachronic embeddings shows that computational approaches to the diachronic analysis of language are promising, but they are not user-friendly for people without a technical background. This paper presents the Diachronic Engine, a system for the diachronic analysis of the lexical features of corpora. Diachronic Engine computes word frequencies, concordances and collocations, taking the temporal dimension into account. It can also compute temporal word embeddings and time series that can be exploited for lexical semantic change detection.

1 Motivation and Background

Synchronic corpora are widely used in linguistics to derive, through statistical approaches, a set of abstract rules that govern the language under analysis. The same methodology can be adopted to analyze the evolution of word meanings over time in diachronic corpora. However, this process can be very time-consuming. Linguists usually rely on software tools that make it easy to explore and clean the corpus while highlighting the most relevant linguistic features.

Sketch Engine [1] (Kilgarriff et al., 2004; Kilgarriff et al., 2014) is the leading tool in the corpus analysis field. Among several interesting features, Sketch Engine includes trends (Kilgarriff et al., 2015), which allow for diachronic analysis based on the frequency distribution of words. Trends rely on frequency features alone, ignoring word usage information. Moreover, the Sketch Engine interface does not provide temporal information about concordances and collocations.

NoSketchEngine [2] is an open-source version of Sketch Engine. It requires technical expertise to set up and, unlike Sketch Engine, it does not support word sketches, terminology, thesaurus, n-grams, trends or corpus building.

Another interesting system is DiaCollo [3] (Jurish, 2015), a software tool for the discovery, comparison, and interactive visualization of target word combinations. Combinations can be requested for a particular time period, or for a direct comparison between different time periods. However, DiaCollo focuses exclusively on the extraction and visualization of collocations from diachronic corpora.

In recent work on computational diachronic linguistics, techniques based on word embeddings produce promising results. In SemEval-2020 Task 1 (Schlechtweg et al., 2020), for instance, type embeddings reach high performance on both subtasks. However, these techniques are not included in any of the aforementioned linguistic tools. To bridge this gap, we set out to build a tool that includes approaches for the analysis of diachronic embeddings.

The result of our work is Diachronic Engine (DE), an engine for the management of diachronic corpora that provides tools for change detection of lexical semantics from a frequentist perspective. DE includes tools for extracting diachronic collocations and concordances in different time periods, as well as for computing semantic change time series by exploiting both word frequencies and word embedding similarity over time.

The rest of the paper is organized as follows:

[1] https://www.sketchengine.eu/
[2] https://nlp.fi.muni.cz/trac/noske
[3] https://www.clarin.eu/showcase/

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
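To make the idea of semantic change time series more concrete, the following is a minimal sketch of the two signals mentioned above: the relative frequency of a word in each time period, and the cosine similarity between a word's embeddings in consecutive periods. It is not the Diachronic Engine implementation. The function names, the toy corpus, and the toy vectors are all hypothetical, and the sketch assumes that the per-period embeddings already live in a shared (aligned) vector space.

```python
import math
from collections import Counter


def relative_frequency_series(tokens_by_period, target):
    """Relative frequency of `target` in each period, in chronological order.

    `tokens_by_period` maps a period label (e.g. a year) to a list of tokens.
    """
    series = {}
    for period in sorted(tokens_by_period):
        tokens = tokens_by_period[period]
        counts = Counter(tokens)
        # Guard against empty periods to avoid division by zero.
        series[period] = counts[target] / len(tokens) if tokens else 0.0
    return series


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_series(vectors_by_period):
    """Cosine similarity of a word's vector between consecutive periods.

    Assumes the vectors of different periods are comparable, i.e. the
    embedding spaces have been aligned beforehand (hypothetical input).
    """
    periods = sorted(vectors_by_period)
    return {
        (prev, curr): cosine(vectors_by_period[prev], vectors_by_period[curr])
        for prev, curr in zip(periods, periods[1:])
    }


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    corpus = {
        1900: ["the", "mouse", "ate", "cheese"],
        2000: ["click", "the", "mouse", "mouse"],
    }
    vectors = {1900: [1.0, 0.0], 1950: [1.0, 0.0], 2000: [0.0, 1.0]}
    print(relative_frequency_series(corpus, "mouse"))
    print(similarity_series(vectors))
```

A drop in similarity between two consecutive periods, possibly together with a change in relative frequency, is the kind of signal that could flag a word as a candidate for lexical semantic change. How such signals are thresholded or combined is left open here.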