访谈资料库中社会科学相关主题的识别:自然语言处理实验

IF 1.7 3区 管理学 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE
Judit Gárdos, Julia Egyed-Gergely, Anna Horváth, Balázs Pataki, Roza Vajda, András Micsik
{"title":"访谈资料库中社会科学相关主题的识别:自然语言处理实验","authors":"Judit Gárdos, Julia Egyed-Gergely, Anna Horváth, Balázs Pataki, Roza Vajda, András Micsik","doi":"10.1108/jd-12-2022-0269","DOIUrl":null,"url":null,"abstract":"Purpose The present study is about generating metadata to enhance thematic transparency and facilitate research on interview collections at the Research Documentation Centre, Centre for Social Sciences (TK KDK) in Budapest. It explores the use of artificial intelligence (AI) in producing, managing and processing social science data and its potential to generate useful metadata to describe the contents of such archives on a large scale. Design/methodology/approach The authors combined manual and automated/semi-automated methods of metadata development and curation. The authors developed a suitable domain-oriented taxonomy to classify a large text corpus of semi-structured interviews. To this end, the authors adapted the European Language Social Science Thesaurus (ELSST) to produce a concise, hierarchical structure of topics relevant in social sciences. The authors identified and tested the most promising natural language processing (NLP) tools supporting the Hungarian language. The results of manual and machine coding will be presented in a user interface. Findings The study describes how an international social scientific taxonomy can be adapted to a specific local setting and tailored to be used by automated NLP tools. The authors show the potential and limitations of existing and new NLP methods for thematic assignment. The current possibilities of multi-label classification in social scientific metadata assignment are discussed, i.e. the problem of automated selection of relevant labels from a large pool. Originality/value Interview materials have not yet been used for building manually annotated training datasets for automated indexing of scientifically relevant topics in a data repository. Comparing various automated-indexing methods, this study shows a possible implementation of a researcher tool supporting custom visualizations and the faceted search of interview collections.","PeriodicalId":47969,"journal":{"name":"Journal of Documentation","volume":"17 1","pages":"0"},"PeriodicalIF":1.7000,"publicationDate":"2023-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Identification of social scientifically relevant topics in an interview repository: a natural language processing experiment\",\"authors\":\"Judit Gárdos, Julia Egyed-Gergely, Anna Horváth, Balázs Pataki, Roza Vajda, András Micsik\",\"doi\":\"10.1108/jd-12-2022-0269\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Purpose The present study is about generating metadata to enhance thematic transparency and facilitate research on interview collections at the Research Documentation Centre, Centre for Social Sciences (TK KDK) in Budapest. It explores the use of artificial intelligence (AI) in producing, managing and processing social science data and its potential to generate useful metadata to describe the contents of such archives on a large scale. Design/methodology/approach The authors combined manual and automated/semi-automated methods of metadata development and curation. The authors developed a suitable domain-oriented taxonomy to classify a large text corpus of semi-structured interviews. To this end, the authors adapted the European Language Social Science Thesaurus (ELSST) to produce a concise, hierarchical structure of topics relevant in social sciences. The authors identified and tested the most promising natural language processing (NLP) tools supporting the Hungarian language. The results of manual and machine coding will be presented in a user interface. Findings The study describes how an international social scientific taxonomy can be adapted to a specific local setting and tailored to be used by automated NLP tools. The authors show the potential and limitations of existing and new NLP methods for thematic assignment. The current possibilities of multi-label classification in social scientific metadata assignment are discussed, i.e. the problem of automated selection of relevant labels from a large pool. Originality/value Interview materials have not yet been used for building manually annotated training datasets for automated indexing of scientifically relevant topics in a data repository. Comparing various automated-indexing methods, this study shows a possible implementation of a researcher tool supporting custom visualizations and the faceted search of interview collections.\",\"PeriodicalId\":47969,\"journal\":{\"name\":\"Journal of Documentation\",\"volume\":\"17 1\",\"pages\":\"0\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2023-10-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Documentation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1108/jd-12-2022-0269\",\"RegionNum\":3,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"INFORMATION SCIENCE & LIBRARY SCIENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Documentation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1108/jd-12-2022-0269","RegionNum":3,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"INFORMATION SCIENCE & LIBRARY SCIENCE","Score":null,"Total":0}
引用次数: 0

摘要

本研究旨在生成元数据,以提高专题透明度,并促进布达佩斯社会科学中心(TK KDK)研究文献中心访谈集的研究。它探讨了人工智能(AI)在生产、管理和处理社会科学数据方面的应用,以及它产生有用的元数据以大规模描述此类档案内容的潜力。设计/方法论/方法作者结合了手工和自动化/半自动化的元数据开发和管理方法。作者开发了一个合适的面向领域的分类法来对半结构化访谈的大型文本语料库进行分类。为此,作者改编了欧洲语言社会科学同义词库(ELSST),以产生一个简洁的,层次结构的主题相关的社会科学。作者确定并测试了支持匈牙利语的最有前途的自然语言处理(NLP)工具。手工和机器编码的结果将在用户界面中呈现。该研究描述了国际社会科学分类法如何适应特定的当地环境,并为自动化NLP工具量身定制。作者展示了用于主题分配的现有和新的NLP方法的潜力和局限性。讨论了当前社会科学元数据分配中多标签分类的可能性,即从大池中自动选择相关标签的问题。独创性/价值访谈材料尚未用于构建人工注释的训练数据集,以便在数据存储库中自动索引科学相关主题。比较各种自动索引方法,本研究展示了一种可能实现的研究人员工具,支持自定义可视化和采访集合的分面搜索。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Identification of social scientifically relevant topics in an interview repository: a natural language processing experiment
Purpose The present study is about generating metadata to enhance thematic transparency and facilitate research on interview collections at the Research Documentation Centre, Centre for Social Sciences (TK KDK) in Budapest. It explores the use of artificial intelligence (AI) in producing, managing and processing social science data and its potential to generate useful metadata to describe the contents of such archives on a large scale. Design/methodology/approach The authors combined manual and automated/semi-automated methods of metadata development and curation. The authors developed a suitable domain-oriented taxonomy to classify a large text corpus of semi-structured interviews. To this end, the authors adapted the European Language Social Science Thesaurus (ELSST) to produce a concise, hierarchical structure of topics relevant in social sciences. The authors identified and tested the most promising natural language processing (NLP) tools supporting the Hungarian language. The results of manual and machine coding will be presented in a user interface. Findings The study describes how an international social scientific taxonomy can be adapted to a specific local setting and tailored to be used by automated NLP tools. The authors show the potential and limitations of existing and new NLP methods for thematic assignment. The current possibilities of multi-label classification in social scientific metadata assignment are discussed, i.e. the problem of automated selection of relevant labels from a large pool. Originality/value Interview materials have not yet been used for building manually annotated training datasets for automated indexing of scientifically relevant topics in a data repository. Comparing various automated-indexing methods, this study shows a possible implementation of a researcher tool supporting custom visualizations and the faceted search of interview collections.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Journal of Documentation
Journal of Documentation INFORMATION SCIENCE & LIBRARY SCIENCE-
CiteScore
4.20
自引率
14.30%
发文量
72
期刊介绍: The scope of the Journal of Documentation is broadly information sciences, encompassing all of the academic and professional disciplines which deal with recorded information. These include, but are certainly not limited to: ■Information science, librarianship and related disciplines ■Information and knowledge management ■Information and knowledge organisation ■Information seeking and retrieval, and human information behaviour ■Information and digital literacies
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信