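Corpus compilation for digital humanities in lower-resourced languages: A practical look at compiling thematic digital media corpora in Serbian, Croatian and Slovenian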

IF 0.2 0 LANGUAGE & LINGUISTICS
Ksenija Bogetić, Vuk Batanović, Nikola Ljubešić
Journal: Suvremena Lingvistika. Published: 2022-12-30. DOI: 10.22210/suvlin.2022.094.01 (https://doi.org/10.22210/suvlin.2022.094.01)
Citations: 0

Abstract

Corpus compilation for digital humanities in lower-resourced languages: A practical look at compiling thematic digital media corpora in Serbian, Croatian and Slovenian
The digital era has unlocked unprecedented possibilities for compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even researchers who do not use any specific techniques of corpus linguistics increasingly draw on some sort of corpus for empirically grounded social-scientific analysis (sometimes dubbed 'corpus-assisted discourse analysis' or 'corpus-based critical discourse analysis', cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist, partly because the background of these issues changes fast, but also because there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond anglophone contexts. In this paper we aim to discuss some possible solutions to these difficulties by presenting a step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.
Source journal
Suvremena Lingvistika (LANGUAGE & LINGUISTICS)
CiteScore: 0.30
Self-citation rate: 0.00%
Articles published: 8
Review time: 17 weeks