Text Preprocessing for Text Mining in Organizational Research: Review and Recommendations

Impact factor: 8.9 | Zone 2 (Management) | JCR Q1 (Management)
Louis Hickman, Stuti Thapa, L. Tay, Mengyang Cao, P. Srinivasan
Organizational Research Methods, 25(1), 114-146. Published 2020-11-23.
DOI: 10.1177/1094428120971683 (https://doi.org/10.1177/1094428120971683)
Citations: 76

Abstract

Recent advances in text mining have provided new methods for capitalizing on the voluminous natural language text data created by organizations, their employees, and their customers. Although often overlooked, decisions made during text preprocessing affect whether the content and/or style of language are captured, the statistical power of subsequent analyses, and the validity of insights derived from text mining. Past methodological articles have described the general process of obtaining and analyzing text data, but recommendations for preprocessing text data were inconsistent. Furthermore, primary studies use and report different preprocessing techniques. To address this, we conduct two complementary reviews of computational linguistics and organizational text mining research to provide empirically grounded text preprocessing decision-making recommendations that account for the type of text mining conducted (i.e., open or closed vocabulary), the research question under investigation, and the data set’s characteristics (i.e., corpus size and average document length). Notably, deviations from these recommendations will be appropriate and, at times, necessary due to the unique characteristics of one’s text data. We also provide recommendations for reporting text mining to promote transparency and reproducibility.
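
The abstract's claim that preprocessing decisions affect whether content or style is captured can be illustrated with a minimal sketch. This is not the authors' procedure: the `tokenize` helper, its parameters, the sample sentence, and the tiny stop word list are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, deliberately tiny stop word list (an assumption for illustration,
# not a list recommended by the article).
STOP_WORDS = {"i", "my", "the", "a", "is", "and", "but", "not"}


def tokenize(text, lowercase=True, remove_stop_words=False):
    """Split text into word tokens, optionally lowercasing and dropping stop words."""
    if lowercase:
        text = text.lower()
    # Keep runs of letters and apostrophes; punctuation is discarded.
    tokens = re.findall(r"[A-Za-z']+", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


doc = "I love my team, but the workload is not sustainable."

# Keeping function words preserves stylistic signals such as pronouns and negation.
style_tokens = tokenize(doc)

# Removing them leaves mostly content words, and here it also drops the negation.
content_tokens = tokenize(doc, remove_stop_words=True)

print(style_tokens)
print(content_tokens)
```

In this toy case, stop word removal discards "i", "my", and "not", so a later analysis of pronoun use or negation would have nothing to work with. That is one way a single preprocessing choice can shift an analysis from style toward content.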
Source journal
CiteScore: 23.20
Self-citation rate: 3.20%
Articles published: 17
About the journal: Organizational Research Methods (ORM) was founded with the aim of introducing pertinent methodological advancements to researchers in organizational sciences. The objective of ORM is to promote the application of current and emerging methodologies to advance both theory and research practices. Articles are expected to be comprehensible to readers with a background consistent with the methodological and statistical training provided in contemporary organizational sciences doctoral programs. The text should be presented in a manner that facilitates accessibility. For instance, highly technical content should be placed in appendices, and authors are encouraged to include example data and computer code when relevant. Additionally, authors should explicitly outline how their contribution has the potential to advance organizational theory and research practice.