{"title":"Slovak Question Answering Dataset Based on the Machine Translation of the Squad V2.0","authors":"J. Staš, D. Hládek, Tomás Koctúr","doi":"10.2478/jazcas-2023-0054","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0054","url":null,"abstract":"Abstract This paper describes the process of building the first large-scale machine-translated question answering dataset SQuAD-sk for the Slovak language. The dataset was automatically translated from the original English SQuAD v2.0 using the Marian neural machine translation together with the Helsinki-NLP Opus English-Slovak model. Moreover, we proposed an effective approach for the approximate search of the translated answer in the translated paragraph based on measuring their similarity using their word vectors. In this way, we obtained more than 92% of the translated questions and answers from the original English dataset. We then used this machine-translated dataset to train the Slovak question answering system by fine-tuning monolingual and multilingual BERT-based language models. The scores achieved by EM = 69.48% and F1 = 78.87% for the fine-tuned mBERT model show comparable results of question answering with recently published machine-translated SQuAD datasets for other European languages.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"9 1","pages":"381 - 390"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Corroborating Corpus Data with Elicited Introspection Data: A Case Study","authors":"Jakob Horsch","doi":"10.2478/jazcas-2023-0024","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0024","url":null,"abstract":"Abstract The last decades have seen an exponential growth of corpus sizes. This development has been driven by a desire to investigate rare syntactic phenomena, but issues remain: Corpora are by definition finite samples, but language is by definition infinite, leading to the negative data problem (‘absence of evidence is not evidence of absence’). One solution is corroborating corpus data with elicited introspection data that is obtained in a reliable, valid, and objective way. I present a case study to show how this can be done using the Magnitude Estimation Test (MET) method (Hoffmann 2013). Analyzing elicited data from 37 L1 English speakers, I show that introspective data can complement corpus data and lead to interesting new findings.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"19 1","pages":"60 - 69"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Morphosyntactic Annotation in Universal Dependencies for Old Czech","authors":"Daniel Zeman, Pavel Kosek, Martin Březina, Jiří Pergler","doi":"10.2478/jazcas-2023-0039","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0039","url":null,"abstract":"Abstract We describe the first steps in preparation of a treebank of 14th-century Czech in the framework of Universal Dependencies. The Dresden and Olomouc versions of the Gospel of Matthew have been selected for this pilot study, which also involves modification of the annotation guidelines for phenomena that occur in Old Czech but not in Modern Czech. We describe some of these modifications in the paper. In addition, we provide some interesting observations about applicability of a Modern Czech parser to the Old Czech data.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"61 1","pages":"214 - 222"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"God Knows How It Turns Out: On Three Constructions Including Bog ‘God’, Čert ‘Devil’ and Some Taboo Words in the Russian Language Over the Last Three Centuries","authors":"Evgeniya Budennaya, Kristina Litvintseva, Anastasia Yakovleva","doi":"10.2478/jazcas-2023-0020","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0020","url":null,"abstract":"Abstract The constructions with the anchor [Noun-Nom Verb (meaning ‘to know’)] are very productive in Russian. In this article we show that variables such as Bog ‘God’, čert ‘devil’ and xer/xren ‘X/horseradish’ have some common patterns, as well as some shifts with exclusive patterns in semantics and constructionalization.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"32 1","pages":"19 - 31"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distractor Generation for Lexical Questions Using Learner Corpus Data","authors":"Nikita Login","doi":"10.2478/jazcas-2023-0051","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0051","url":null,"abstract":"Abstract Learner corpora with error annotation can serve as a source of data for automated question generation (QG) for language testing. In the case of multiple-choice gap-fill lexical questions, this process involves two steps. The first step is to extract sentences with lexical corrections from the learner corpus. The second step, which is the focus of this paper, is to generate distractors for the retrieved questions. The presented approach (called DisSelector) is based on supervised learning on specially annotated learner corpus data. For each sentence a list of distractor candidates was retrieved. Then, each candidate was manually labelled as a plausible or implausible distractor. The derived set of examples was additionally filtered by a set of lexical and grammatical rules and then split into training and testing subsets in a 4:1 ratio. Several classification models, including classical machine learning algorithms and gradient boosting implementations, were trained on the data. Word and sentence vectors from language models together with corpus word frequencies were used as input features for the classifiers. The highest F1-score (0.72) was attained by an XGBoost model. Various configurations of DisSelector showed improvements over the unsupervised baseline in both automatic and expert evaluation. DisSelector was integrated into an open-source language testing platform LangExBank as a microservice with a REST API.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"9 1","pages":"345 - 356"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139372042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lemmatization of the DIA1900 Diachronic Corpus","authors":"Lucie Benešová, Klára Pivoňková, Martin Stluka","doi":"10.2478/jazcas-2023-0045","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0045","url":null,"abstract":"Abstract This paper focuses on the process of lemmatization of the upcoming Czech diachronic corpus of the second half of the 19th century, DIA1900. The article describes different approaches to the corpus lemmatization of synchronic written, spoken and diachronic corpora within the Czech National Corpus project, including single- and multilevel lemmatization and available tools used to link the variants.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"5 1","pages":"275 - 284"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Annotation of Analytic Verb Forms in Czech – Complex Cases","authors":"Vladimír Petkevic, Hana Skoumalová","doi":"10.2478/jazcas-2023-0041","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0041","url":null,"abstract":"Abstract The article deals with complex cases of determining the attribute verbtag, which contains the values of morphosyntactic categories of analytic verb forms. The latest corpora of contemporary written Czech from the SYN series are tagged with this attribute. In this paper, we focus on cases where it is difficult to identify values of verbtag categories. These include, e.g. the identification of the auxiliary verb být ‘to be’, recognition of the mood and tense of coordinated participles, or determining the number in compound forms in which the individual parts have a different morphological number. Some of the problems are of a theoretical nature, since it is not clear what the correct solution should be. Here we have arbitrarily opted for one option that was offered. Other problems are due to imperfections in the algorithms we use for annotation. The solution here is to improve these algorithms.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"21 1","pages":"234 - 243"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Epistemic Marker Určitě in the Light of Corpus Data","authors":"B. Štěpánková, J. Šindlerová, Lucie Poláková","doi":"10.2478/jazcas-2023-0031","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0031","url":null,"abstract":"Abstract The paper presents a pilot study for a research project on epistemic modality and/or evidentiality markers in Czech. The study focuses on the expression určitě. Although this marker is typically considered to signal high certainty, the dictionary of standard Czech (Slovník spisovné češtiny, SSČ) also offers an alternative meaning of probability, indicating a lower degree of certainty. We use parallel data from the InterCorp v15 corpus to determine whether the probability meaning can be identified unequivocally in real language data and whether it correlates with specific translation equivalents, linguistic features, or lexical context. Based on our findings, we propose an alternative method for distinguishing between different shades of meaning based on the communicative functions of the utterances, and we draw conclusions regarding the relevance of individual grammatical and lexical clues in context for future annotations.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"210 1","pages":"130 - 139"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139372036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adverbs Derived from Adjectival Present Participles in Polish, Slovak and Czech: A Comparative Corpus-Based Study","authors":"Aksana Schillová","doi":"10.2478/jazcas-2023-0030","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0030","url":null,"abstract":"Abstract The paper investigates and compares the inventory of adverbs derived from adjectival present participles in the Polish, Slovak and Czech languages. The particularity of these adverbs is that they do not occur in every language, even if it concerns closely related languages. While in Polish and Slovak this type of adverbs is represented by hundreds of lemmas, in Czech it is almost not represented. The comparative analysis is carried out on the data retrieved from the comparable web corpora Aranea. The sets of adverbs extracted from the comparable corpora of the languages examined are analysed by the following criteria: the total number of the adverb lemmas in the corpus, their relative frequency (ipm), morphemic structure features, collocability preferences. The similarities and differences between the adverb sets are established. According to the corpus data, the adverbs derived from adjectival present participles are more widely used in Polish than in Slovak, and in Czech they are a rare phenomenon represented by a limited number of lemmas with a negligible frequency.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"216 1","pages":"119 - 129"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linear Dependency Segments in Foreign Language Acquisition: Syntactic Complexity Analysis in Czech Learners’ Texts","authors":"Michaela Nogolová, Michaela Hanušková, Miroslav Kubát, Radek Čech","doi":"10.2478/jazcas-2023-0037","DOIUrl":"https://doi.org/10.2478/jazcas-2023-0037","url":null,"abstract":"Abstract The paper discusses a new way to measure syntactic complexity in foreign language acquisition. It is based on a recently proposed syntactic unit called linear dependency segment (LDS), the longest possible sequence of words belonging to the same clause where all linear neighbours are also syntactic neighbours. The dataset comprises 5,721 Czech texts from the CzeSL-SGT learner corpus covering five CEFR proficiency levels (A1–C1). The study covers two analyses. First, the development of the average clause length in terms of LDS and the average LDS length in the number of words across the latter language proficiency levels. Second, we consider the differences between Slavic and non-Slavic speakers. The results show an increasing tendency of the average clause length measured in LDS while the average clause length measured in words is decreasing. Results also show statistically significant differences between Slavic and non-Slavic speakers in most cases. Our results indicate that using LDS may be a useful unit of syntactic complexity measure in foreign language acquisition research.","PeriodicalId":262732,"journal":{"name":"Journal of Linguistics/Jazykovedný casopis","volume":"109 1","pages":"193 - 203"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139371980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}