Improving LIWC Using Soft Word Matching

Yuan Gong, Kevin Shin, C. Poellabauer
DOI: 10.1145/3233547.3233632
Published in: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, August 15, 2018
Citations: 6

Abstract

The widely deployed and easy-to-use Linguistic Inquiry and Word Count (LIWC) tool is the gold standard for computerized text analysis in many medical applications, such as patient sentiment analysis, depression detection, and ADHD detection. Compared to most other natural language processing (NLP) tasks, large-scale data sets are often very difficult to obtain in the medical field, making effective automatic representation learning from complex text patterns (e.g., with a deep auto-encoder) challenging. LIWC sidesteps this problem by using a human-designed dictionary in place of a machine learning model to convert text into a concise and effective vector representation. However, while LIWC's dictionary is large, some potentially informative words may still be missed due to the knowledge constraints of the dictionary editors. This problem is particularly conspicuous when the analyzed text is not formal language (e.g., dialect, slang, or internet words). To address this problem, we propose a new matching scheme that does not require an exact word match but instead counts all words that are similar to a key in the LIWC dictionary. The scheme is implemented using WordNet, a large lexical database, and Word2Vec, a machine-learning-based word embedding technique. The output of the proposed method has exactly the same format as LIWC's output, thereby preserving usability. As in previous work, the proposed method can be viewed as a combination of human domain knowledge and machine learning for text representation encoding.
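The core idea of soft matching can be illustrated with a minimal sketch: instead of incrementing a category count only on an exact string match against a LIWC key, a token also counts when its embedding is sufficiently similar to some key's embedding. This is not the authors' implementation; the embedding values, the tiny stand-in category, the threshold, and the function names below are all hypothetical, and a real system would use pretrained Word2Vec vectors (and/or WordNet relations) rather than hand-made three-dimensional vectors.

```python
from math import sqrt

# Toy embedding table standing in for pretrained Word2Vec vectors.
# All vector values here are made up for illustration only.
EMBEDDINGS = {
    "sad":     (0.90, 0.10, 0.00),
    "unhappy": (0.85, 0.15, 0.05),
    "bummed":  (0.80, 0.20, 0.10),  # slang: absent from the dictionary
    "table":   (0.00, 0.10, 0.95),  # unrelated word
}

# Tiny stand-in for one LIWC category (e.g., negative-emotion keys).
LIWC_NEGEMO = {"sad", "unhappy"}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def soft_count(tokens, category, threshold=0.95):
    """Count tokens that exactly match a category key (classic LIWC)
    or whose embedding is close enough to some key (soft matching)."""
    count = 0
    for tok in tokens:
        if tok in category:          # exact match, as in standard LIWC
            count += 1
            continue
        vec = EMBEDDINGS.get(tok)
        if vec is not None and any(
            cosine(vec, EMBEDDINGS[key]) >= threshold
            for key in category if key in EMBEDDINGS
        ):
            count += 1               # soft match via embedding similarity
    return count
```

With this sketch, the slang token "bummed" never matches the dictionary exactly, yet it contributes to the category count because its (made-up) vector lies close to "sad"; setting the threshold above 1.0 disables soft matching and recovers exact-match behavior.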