Short-Text Lexical Normalisation on Industrial Log Data

Michael Stewart, Wei Liu, R. Cardell-Oliver, Rui Wang
{"title":"工业日志数据的短文本词法规范化","authors":"Michael Stewart, Wei Liu, R. Cardell-Oliver, Rui Wang","doi":"10.1109/ICBK.2018.00023","DOIUrl":null,"url":null,"abstract":"Lexical normalisation aims to computationally correct errors in text so that the data may be more successfully analysed. Noisy, unstructured short-text data presents unique challenges as it contains multiple types of Out Of Vocabulary (OOV) words. Some are spelling mistakes, which should be normalised to in-dictionary words; some are acronyms or abbreviations, which should be expanded to the corresponding phrases; and some are domain specific terms which should remain in their original form not to be mis-corrected to conform with the dictionary used. Despite its critical significance in assuring data quality, text normalisation is an area with a less cohesive and focused research effort, evidenced by the diverse set of keywords used and scattered publication venues. Integrated approaches that address all three types of OOV terms are scarce. Here we introduce a two-stage, modular classification-based framework that specifically targets the various types of Out Of Vocabulary terms prevalent in short-text data. To avoid laborious feature engineering, our system utilises a Bi-Directional Long Short-Term Memory + CRF model to classify each erroneous token into a particular class. The system then selects an appropriate normalisation technique based on the predicted class of each token. For spell-checking, we introduce two learning models that predict the correct spelling of a word given its context: one that utilises word embeddings, and another that uses a quasi-recurrent neural network. We compare our system to two existing state of the art lexical normalisation systems and find that our system achieves greater performance on the log data domain.","PeriodicalId":144958,"journal":{"name":"2018 IEEE International Conference on Big Knowledge (ICBK)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Short-Text Lexical Normalisation on Industrial Log Data\",\"authors\":\"Michael Stewart, Wei Liu, R. Cardell-Oliver, Rui Wang\",\"doi\":\"10.1109/ICBK.2018.00023\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Lexical normalisation aims to computationally correct errors in text so that the data may be more successfully analysed. Noisy, unstructured short-text data presents unique challenges as it contains multiple types of Out Of Vocabulary (OOV) words. Some are spelling mistakes, which should be normalised to in-dictionary words; some are acronyms or abbreviations, which should be expanded to the corresponding phrases; and some are domain specific terms which should remain in their original form not to be mis-corrected to conform with the dictionary used. Despite its critical significance in assuring data quality, text normalisation is an area with a less cohesive and focused research effort, evidenced by the diverse set of keywords used and scattered publication venues. Integrated approaches that address all three types of OOV terms are scarce. Here we introduce a two-stage, modular classification-based framework that specifically targets the various types of Out Of Vocabulary terms prevalent in short-text data. 
To avoid laborious feature engineering, our system utilises a Bi-Directional Long Short-Term Memory + CRF model to classify each erroneous token into a particular class. The system then selects an appropriate normalisation technique based on the predicted class of each token. For spell-checking, we introduce two learning models that predict the correct spelling of a word given its context: one that utilises word embeddings, and another that uses a quasi-recurrent neural network. We compare our system to two existing state of the art lexical normalisation systems and find that our system achieves greater performance on the log data domain.\",\"PeriodicalId\":144958,\"journal\":{\"name\":\"2018 IEEE International Conference on Big Knowledge (ICBK)\",\"volume\":\"92 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE International Conference on Big Knowledge (ICBK)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICBK.2018.00023\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE International Conference on Big Knowledge (ICBK)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICBK.2018.00023","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Lexical normalisation aims to computationally correct errors in text so that the data may be more successfully analysed. Noisy, unstructured short-text data presents unique challenges as it contains multiple types of Out-Of-Vocabulary (OOV) words. Some are spelling mistakes, which should be normalised to in-dictionary words; some are acronyms or abbreviations, which should be expanded to the corresponding phrases; and some are domain-specific terms, which should remain in their original form rather than be mis-corrected to conform to the dictionary used. Despite its critical significance in assuring data quality, text normalisation is an area with a less cohesive and focused research effort, evidenced by the diverse set of keywords used and the scattered publication venues. Integrated approaches that address all three types of OOV terms are scarce. Here we introduce a two-stage, modular, classification-based framework that specifically targets the various types of Out-Of-Vocabulary terms prevalent in short-text data. To avoid laborious feature engineering, our system utilises a Bi-Directional Long Short-Term Memory (BiLSTM) + CRF model to classify each erroneous token into a particular class. The system then selects an appropriate normalisation technique based on the predicted class of each token. For spell-checking, we introduce two learning models that predict the correct spelling of a word given its context: one that utilises word embeddings, and another that uses a quasi-recurrent neural network. We compare our system to two existing state-of-the-art lexical normalisation systems and find that our system achieves greater performance on the log data domain.
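The two-stage design described in the abstract (classify each OOV token, then dispatch it to a class-specific normaliser) can be pictured with a short sketch. The Python below is a minimal illustration under our own assumptions, not the paper's implementation: the BiLSTM+CRF classifier is reduced to a lookup stub, and the class names, lookup tables, and normalisers are all hypothetical.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
# Stage 1 classifies each OOV token; stage 2 routes it to a normaliser.
# Class names and normalisers here are illustrative, not the paper's.

from typing import Callable, Dict, List

# Toy lookup tables standing in for learned resources.
ABBREVIATIONS = {"chgd": "changed", "insp": "inspection"}
DICTIONARY = {"pumpp": "pump", "replce": "replace"}

def expand_abbreviation(token: str) -> str:
    return ABBREVIATIONS.get(token, token)

def correct_spelling(token: str) -> str:
    return DICTIONARY.get(token, token)

def keep_as_is(token: str) -> str:
    # Domain-specific terms must not be "corrected" away.
    return token

# Stage-2 dispatch: one normalisation technique per predicted class.
NORMALISERS: Dict[str, Callable[[str], str]] = {
    "spelling_error": correct_spelling,
    "abbreviation": expand_abbreviation,
    "domain_term": keep_as_is,
}

def classify_token(token: str) -> str:
    # Stand-in for the BiLSTM+CRF classifier: the real system would
    # predict the class from the token and its sentence context.
    if token in ABBREVIATIONS:
        return "abbreviation"
    if token in DICTIONARY:
        return "spelling_error"
    return "domain_term"

def normalise(tokens: List[str]) -> List[str]:
    return [NORMALISERS[classify_token(t)](t) for t in tokens]

print(normalise(["pumpp", "chgd", "o-ring", "insp"]))
# -> ['pump', 'changed', 'o-ring', 'inspection']
```

The point of the dispatch table is the modularity the abstract claims: each OOV class gets its own technique, and the domain-term class deliberately leaves tokens untouched.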
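For the embedding-based spell-checking model, the abstract gives only the idea: predict the correct spelling of a word from its context. One common way to realise this, shown below as a hedged sketch rather than the authors' method, is to rank in-dictionary candidates by a mix of edit distance to the OOV token and similarity between the candidate's embedding and the mean embedding of the context words. The toy vectors, weighting, and distance threshold are all assumptions.

```python
# Hypothetical embedding-based contextual spell checker: rank candidates
# by edit distance to the OOV token and cosine similarity between the
# candidate's embedding and the mean context embedding.

from typing import List
import numpy as np

# Toy vectors standing in for trained word embeddings.
EMBEDDINGS = {
    "pump": np.array([0.9, 0.1, 0.0]),
    "jump": np.array([0.0, 0.2, 0.9]),
    "replace": np.array([0.8, 0.3, 0.1]),
    "seal": np.array([0.7, 0.2, 0.1]),
}

def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def correct(oov: str, context: List[str], max_dist: int = 2) -> str:
    ctx = [EMBEDDINGS[w] for w in context if w in EMBEDDINGS]
    ctx_mean = np.mean(ctx, axis=0) if ctx else None
    best, best_score = oov, float("-inf")
    for cand, vec in EMBEDDINGS.items():
        dist = edit_distance(oov, cand)
        if dist > max_dist:
            continue  # only consider plausible surface forms
        sim = cosine(vec, ctx_mean) if ctx_mean is not None else 0.0
        score = sim - 0.5 * dist  # illustrative weighting
        if score > best_score:
            best, best_score = cand, score
    return best

# "pmup" in a maintenance-log context resolves to "pump", not "jump".
print(correct("pmup", ["replace", "seal"]))
```

The context term keeps the correction domain-appropriate: without it, any candidate at the same edit distance would be an equally good guess.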