Journal of Linguistics/Jazykovedný casopis: Latest Articles

Slovak Question Answering Dataset Based on the Machine Translation of the SQuAD v2.0
Journal of Linguistics/Jazykovedný casopis Pub Date : 2023-06-01 DOI: 10.2478/jazcas-2023-0054
J. Staš, D. Hládek, Tomás Koctúr
{"title":"Slovak Question Answering Dataset Based on the Machine Translation of the SQuAD v2.0","authors":"J. Staš, D. Hládek, Tomás Koctúr","doi":"10.2478/jazcas-2023-0054","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0054","url":null,"abstract":"Abstract This paper describes the process of building the first large-scale machine-translated question answering dataset SQuAD-sk for the Slovak language. The dataset was automatically translated from the original English SQuAD v2.0 using the Marian neural machine translation together with the Helsinki-NLP Opus English-Slovak model. Moreover, we proposed an effective approach for the approximate search of the translated answer in the translated paragraph based on measuring their similarity using their word vectors. In this way, we obtained more than 92% of the translated questions and answers from the original English dataset. We then used this machine-translated dataset to train the Slovak question answering system by fine-tuning monolingual and multilingual BERT-based language models. The scores achieved by EM = 69.48% and F1 = 78.87% for the fine-tuned mBERT model show comparable results of question answering with recently published machine-translated SQuAD datasets for other European languages.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"9 1","pages":"381 - 390"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
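The EM and F1 figures quoted in the abstract above are the usual span-level question answering metrics. As a rough illustration only, the sketch below computes both for a single prediction, assuming plain lowercasing and whitespace tokenization. The function names are hypothetical, and this is not the authors' evaluation code; the official SQuAD evaluation script applies additional normalization (punctuation and English article stripping) that is not reproduced here.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split stands in for real normalization.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    # 1.0 when the normalized token sequences are identical, else 0.0.
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred = normalize(prediction)
    ref = normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match (unanswerable question answered as such).
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For instance, `f1_score("v roku 1993", "roku 1993")` shares two of three predicted tokens with the reference and comes out at about 0.8, while `exact_match` for the same pair is 0.0. Dataset-level scores are then averages of these per-question values.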
Corroborating Corpus Data with Elicited Introspection Data: A Case Study
Journal of Linguistics/Jazykovedný casopis Pub Date : 2023-06-01 DOI: 10.2478/jazcas-2023-0024
Jakob Horsch
{"title":"Corroborating Corpus Data with Elicited Introspection Data: A Case Study","authors":"Jakob Horsch","doi":"10.2478/jazcas-2023-0024","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0024","url":null,"abstract":"Abstract The last decades have seen an exponential growth of corpus sizes. This development has been driven by a desire to investigate rare syntactic phenomena, but issues remain: Corpora are by definition finite samples, but language is by definition infinite, leading to the negative data problem (‘absence of evidence is not evidence of absence’). One solution is corroborating corpus data with elicited introspection data that is obtained in a reliable, valid, and objective way. I present a case study to show how this can be done using the Magnitude Estimation Test (MET) method (Hoffmann 2013). Analyzing elicited data from 37 L1 English speakers, I show that introspective data can complement corpus data and lead to interesting new findings.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"19 1","pages":"60 - 69"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Morphosyntactic Annotation in Universal Dependencies for Old Czech
Journal of Linguistics/Jazykovedný casopis Pub Date : 2023-06-01 DOI: 10.2478/jazcas-2023-0039
Daniel Zeman, Pavel Kosek, Martin Březina, Jiří Pergler
{"title":"Morphosyntactic Annotation in Universal Dependencies for Old Czech","authors":"Daniel Zeman, Pavel Kosek, Martin Březina, Jiří Pergler","doi":"10.2478/jazcas-2023-0039","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0039","url":null,"abstract":"Abstract We describe the first steps in preparation of a treebank of 14th-century Czech in the framework of Universal Dependencies. The Dresden and Olomouc versions of the Gospel of Matthew have been selected for this pilot study, which also involves modification of the annotation guidelines for phenomena that occur in Old Czech but not in Modern Czech. We describe some of these modifications in the paper. In addition, we provide some interesting observations about applicability of a Modern Czech parser to the Old Czech data.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"61 1","pages":"214 - 222"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
God Knows How It Turns Out: On Three Constructions Including Bog ‘God’, Čert ‘Devil’ and Some Taboo Words in the Russian Language Over the Last Three Centuries
Journal of Linguistics/Jazykovedný casopis Pub Date : 2023-06-01 DOI: 10.2478/jazcas-2023-0020
Evgeniya Budennaya, Kristina Litvintseva, Anastasia Yakovleva
{"title":"God Knows How It Turns Out: On Three Constructions Including Bog ‘God’, Čert ‘Devil’ and Some Taboo Words in the Russian Language Over the Last Three Centuries","authors":"Evgeniya Budennaya, Kristina Litvintseva, Anastasia Yakovleva","doi":"10.2478/jazcas-2023-0020","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0020","url":null,"abstract":"Abstract The constructions with the anchor [Noun-Nom Verb (meaning ‘to know’)] are very productive in Russian. In this article we show that variables such as Bog ‘God’, čert ‘devil’ and xer/xren ‘X/horseradish’ have some common patterns, as well as some shifts with exclusive patterns in semantics and constructionalization.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"32 1","pages":"19 - 31"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Distractor Generation for Lexical Questions Using Learner Corpus Data
Journal of Linguistics/Jazykovedný casopis Pub Date : 2023-06-01 DOI: 10.2478/jazcas-2023-0051
Nikita Login
{"title":"Distractor Generation for Lexical Questions Using Learner Corpus Data","authors":"Nikita Login","doi":"10.2478/jazcas-2023-0051","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0051","url":null,"abstract":"Abstract Learner corpora with error annotation can serve as a source of data for automated question generation (QG) for language testing. In the case of multiple choice gap-fill lexical questions, this process involves two steps. The first step is to extract sentences with lexical corrections from the learner corpus. The second step, which is the focus of this paper, is to generate distractors for the retrieved questions. The presented approach (called DisSelector) is based on supervised learning on specially annotated learner corpus data. For each sentence a list of distractor candidates was retrieved. Then, each candidate was manually labelled as a plausible or implausible distractor. The derived set of examples was additionally filtered by a set of lexical and grammatical rules and then split into training and testing subsets in a 4:1 ratio. Several classification models, including classical machine learning algorithms and gradient boosting implementations, were trained on the data. Word and sentence vectors from language models together with corpus word frequencies were used as input features for the classifiers. The highest F1-score (0.72) was attained by an XGBoost model. Various configurations of DisSelector showed improvements over the unsupervised baseline in both automatic and expert evaluation. DisSelector was integrated into an open-source language testing platform LangExBank as a microservice with a REST API.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"9 1","pages":"345 - 356"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139372042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Lemmatization of the DIA1900 Diachronic Corpus
Journal of Linguistics/Jazykovedný casopis Pub Date : 2023-06-01 DOI: 10.2478/jazcas-2023-0045
Lucie Benešová, Klára Pivoňková, Martin Stluka
{"title":"Lemmatization of the DIA1900 Diachronic Corpus","authors":"Lucie Benešová, Klára Pivoňková, Martin Stluka","doi":"10.2478/jazcas-2023-0045","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0045","url":null,"abstract":"Abstract This paper focuses on the process of lemmatization of the upcoming Czech diachronic corpus of the second half of the 19th century, DIA1900. The article describes different approaches to the corpus lemmatization of synchronic written, spoken and diachronic corpora within the Czech National Corpus project, including single- and multilevel lemmatization and available tools used to link the variants.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"5 1","pages":"275 - 284"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Annotation of Analytic Verb Forms in Czech – Complex Cases
Journal of Linguistics/Jazykovedný casopis Pub Date : 2023-06-01 DOI: 10.2478/jazcas-2023-0041
Vladimír Petkevic, Hana Skoumalová
{"title":"Annotation of Analytic Verb Forms in Czech – Complex Cases","authors":"Vladimír Petkevic, Hana Skoumalová","doi":"10.2478/jazcas-2023-0041","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0041","url":null,"abstract":"Abstract The article deals with complex cases of determining the attribute verbtag, which contains the values of morphosyntactic categories of analytic verb forms. The latest corpora of contemporary written Czech from the SYN series are tagged with this attribute. In this paper, we focus on cases where it is difficult to identify values of verbtag categories. These include, e.g. the identification of the auxiliary verb být ‘to be’, recognition of the mood and tense of coordinated participles, or determining the number in compound forms in which the individual parts have a different morphological number. Some of the problems are of a theoretical nature, since it is not clear what the correct solution should be. Here we have arbitrarily opted for one option that was offered. Other problems are due to imperfections in the algorithms we use for annotation. The solution here is to improve these algorithms.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"21 1","pages":"234 - 243"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Epistemic Marker Určitě in the Light of Corpus Data
Journal of Linguistics/Jazykovedný casopis Pub Date : 2023-06-01 DOI: 10.2478/jazcas-2023-0031
B. Štěpánková, J. Šindlerová, Lucie Poláková
{"title":"The Epistemic Marker Určitě in the Light of Corpus Data","authors":"B. Štěpánková, J. Šindlerová, Lucie Poláková","doi":"10.2478/jazcas-2023-0031","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0031","url":null,"abstract":"Abstract The paper presents a pilot study for a research project on epistemic modality and/or evidentiality markers in Czech. The study focuses on the expression určitě. Although this marker is typically considered to signal high certainty, the dictionary of standard Czech (Slovník spisovné češtiny, SSČ) also offers an alternative meaning of probability, indicating a lower degree of certainty. We use parallel data from the InterCorp v15 corpus to determine whether the probability meaning can be identified unequivocally in real language data and whether it correlates with specific translation equivalents, linguistic features, or lexical context. Based on our findings, we propose an alternative method for distinguishing between different shades of meaning based on the communicative functions of the utterances, and we draw conclusions regarding the relevance of individual grammatical and lexical clues in context for future annotations.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"210 1","pages":"130 - 139"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139372036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Adverbs Derived from Adjectival Present Participles in Polish, Slovak and Czech: A Comparative Corpus-Based Study
Journal of Linguistics/Jazykovedný casopis Pub Date : 2023-06-01 DOI: 10.2478/jazcas-2023-0030
Aksana Schillová
{"title":"Adverbs Derived from Adjectival Present Participles in Polish, Slovak and Czech: A Comparative Corpus-Based Study","authors":"Aksana Schillová","doi":"10.2478/jazcas-2023-0030","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0030","url":null,"abstract":"Abstract The paper investigates and compares the inventory of adverbs derived from adjectival present participles in the Polish, Slovak and Czech languages. The particularity of these adverbs is that they do not occur in every language, even if it concerns closely related languages. While in Polish and Slovak this type of adverbs is represented by hundreds of lemmas, in Czech it is almost not represented. The comparative analysis is carried out on the data retrieved from the comparable web corpora Aranea. The sets of adverbs extracted from the comparable corpora of the languages examined are analysed by the following criteria: the total number of the adverb lemmas in the corpus, their relative frequency (ipm), morphemic structure features, collocability preferences. The similarities and differences between the adverb sets are established. According to the corpus data, the adverbs derived from adjectival present participles are more widely used in Polish than in Slovak, and in Czech they are a rare phenomenon represented by a limited number of lemmas with a negligible frequency.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"216 1","pages":"119 - 129"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Linear Dependency Segments in Foreign Language Acquisition: Syntactic Complexity Analysis in Czech Learners’ Texts
Journal of Linguistics/Jazykovedný casopis Pub Date : 2023-06-01 DOI: 10.2478/jazcas-2023-0037
Michaela Nogolová, Michaela Hanušková, Miroslav Kubát, Radek Čech
{"title":"Linear Dependency Segments in Foreign Language Acquisition: Syntactic Complexity Analysis in Czech Learners’ Texts","authors":"Michaela Nogolová, Michaela Hanušková, Miroslav Kubát, Radek Čech","doi":"10.2478/jazcas-2023-0037","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0037","url":null,"abstract":"Abstract The paper discusses a new way to measure syntactic complexity in foreign language acquisition. It is based on a recently proposed syntactic unit called linear dependency segment (LDS), the longest possible sequence of words belonging to the same clause where all linear neighbours are also syntactic neighbours. The dataset comprises 5,721 Czech texts from the CzeSL-SGT learner corpus covering five CEFR proficiency levels (A1–C1). The study covers two analyses. First, the development of the average clause length in terms of LDS and the average LDS length in the number of words across the latter language proficiency levels. Second, we consider the differences between Slavic and non-Slavic speakers. The results show an increasing tendency of the average clause length measured in LDS while the average clause length measured in words is decreasing. Results also show statistically significant differences between Slavic and non-Slavic speakers in most cases. Our results indicate that using LDS may be a useful unit of syntactic complexity measure in foreign language acquisition research.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"109 1","pages":"193 - 203"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
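The definition of a linear dependency segment in the abstract above (a maximal run of words within one clause in which every pair of linear neighbours is also linked by a dependency) can be read as a simple splitting rule. The sketch below is one possible reading, not the authors' implementation. It assumes a clause is given as a list of head indices, with `-1` marking a word whose head is the root or lies outside the clause; the function name and the toy English clause are hypothetical.

```python
def linear_dependency_segments(heads: list[int]) -> list[list[int]]:
    """Split one clause into runs where adjacent words share a dependency edge.

    heads[i] is the index (within the clause) of the head of word i,
    or -1 if the head is the root or outside the clause.
    """
    if not heads:
        return []
    segments = [[0]]
    for i in range(1, len(heads)):
        # Neighbours i-1 and i are syntactic neighbours if either heads the other.
        linked = heads[i] == i - 1 or heads[i - 1] == i
        if linked:
            segments[-1].append(i)
        else:
            segments.append([i])
    return segments


# Toy clause: "the big dog barked loudly"
# the -> dog, big -> dog, dog -> barked, barked -> root, loudly -> barked
toy_heads = [2, 2, 3, -1, 3]
toy_segments = linear_dependency_segments(toy_heads)
clause_length_in_lds = len(toy_segments)
mean_lds_length_in_words = len(toy_heads) / len(toy_segments)
```

Under this reading the toy clause splits into two segments, "the" and "big dog barked loudly", because "the" and "big" are adjacent but not directly linked. The two quantities at the end correspond to the measures named in the abstract: clause length counted in LDS and LDS length counted in words.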