Tapio Nojonen, Kiia Korsu, Filip Ginter, Veronika Laippala, Jenna Kanerva
Behavior Research Methods, 57(11), 312. Published 2025-10-15. DOI: https://doi.org/10.3758/s13428-025-02832-x
TCBLex - A lexical database of Finnish literary texts for children.
This work introduces TCBLex, a lexical database of Finnish literary works read by children between the ages of 7 and 15. We explain in detail the work done to build the corpus TCBLex is based on, including how books were sampled and collected, turned into text files, and finally processed. We also touch on legal considerations and how it is possible to build such a corpus in the EU. TCBLex contains over 11 million tokens that are annotated with part-of-speech tags and lemmatized. We provide 14 different sub-lexicons in total, covering individual intended reading ages, age groups, and different genres. We also provide versions with additional morphological information, such as the cases and tenses of words. TCBLex provides various psycholinguistically interesting lexical statistics for both word types and lemmas, such as different frequency metrics, distributions, word lengths, numbers of syllables, morphological paradigm sizes, and, for the first time in a Finnish lexicon, the ages at which words and lemmas are first encountered in books. TCBLex is freely available at https://doi.org/10.5281/zenodo.15655580.
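To make two of the statistics named in the abstract concrete, the following is a minimal sketch of how a per-million frequency and an age of first encounter could be derived from lemmatized tokens tagged with an intended reading age. The function name `lexical_stats`, the `(lemma, age)` input shape, and the toy data are all hypothetical; they are not the TCBLex file format, and the exact metric definitions used by the authors may differ.

```python
from collections import Counter


def lexical_stats(tokens):
    """Compute simple per-lemma statistics.

    tokens: iterable of (lemma, intended_reading_age) pairs.
    This input shape is an assumption for illustration only.
    """
    tokens = list(tokens)
    total = len(tokens)
    counts = Counter(lemma for lemma, _ in tokens)

    # Lowest intended reading age at which each lemma appears.
    first_age = {}
    for lemma, age in tokens:
        first_age[lemma] = min(age, first_age.get(lemma, age))

    return {
        lemma: {
            "count": n,
            "per_million": n / total * 1_000_000,
            "first_age": first_age[lemma],
        }
        for lemma, n in counts.items()
    }


# Toy data (invented): Finnish lemmas paired with a book's intended reading age.
toy = [("kissa", 7), ("koira", 9), ("kissa", 12), ("talo", 15)]
stats = lexical_stats(toy)
```

On the toy data, `stats["kissa"]` holds a raw count, a frequency normalized to one million tokens, and the youngest reading age at which the lemma occurs. Normalizing to a per-million scale is a common way to compare frequencies across sub-lexicons of different sizes.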
About the journal:
Behavior Research Methods publishes articles concerned with the methods, techniques, and instrumentation of research in experimental psychology. The journal focuses particularly on the use of computer technology in psychological research. An annual special issue is devoted to this field.