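Creating a corpus of sensitive and hard-to-access texts: Methodological challenges and ethical concerns in the building of the WiSP Corpus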

Applied Corpus Linguistics. Pub Date: 2021-12-01. DOI: 10.1016/j.acorp.2021.100011

Maria Leedham, Theresa Lillis, Alison Twiner


Citations: 1


Creating a corpus of sensitive and hard-to-access texts: Methodological challenges and ethical concerns in the building of the WiSP Corpus

Corpus linguistics is increasingly employed to explore large, publicly-available datasets such as newspaper texts, government speeches and online fora. However, comparatively few corpora exist where the subject matter concerns sensitive topics about living individuals since, due to their highly personal and confidential nature, these texts are hard to access and raise difficult ethical questions around secondary data analysis. One exception is the Writing in professional social work practice (WiSP) corpus, comprising texts written by UK-based professional social workers in the course of their daily work and now available to other researchers through the ReShare archive. This paper focuses on the challenges involved in building the WiSP corpus and the epistemological and ethical issues raised. Two key aspects of research practice are discussed: data anonymisation and dataset archiving. Specifically, the paper explores decision-making around anonymisation and an ethically-informed rationale for treating some texts as ‘not for sharing’, leading to the decision to create two corpora: one for the research team and a further anonymised and slightly reduced version for archiving. The paper explores what the WiSP corpora (Corpus 1 and Corpus 2) contribute to understandings about social work writing, the extent to which the two corpora enable different analyses and whether the existence of two corpora is problematic from a corpus linguistic perspective. The paper concludes by considering how the ethical decisions around corpus creation of sensitive texts raise questions about key principles in corpus linguistics.


Source journal: Applied Corpus Linguistics (Linguistics and Language)

CiteScore: 1.30

Self-citation rate: 0.00%

Review time: 70 days