Semi-Supervised Learning and Domain Adaptation in Natural Language Processing

Anders Søgaard
{"title":"自然语言处理中的半监督学习和领域自适应","authors":"Anders Søgaard","doi":"10.2200/s00497ed1v01y201304hlt021","DOIUrl":null,"url":null,"abstract":"This book introduces basic supervised learning algorithms applicable to natural language processing (NLP) and shows how the performance of these algorithms can often be improved by exploiting the marginal distribution of large amounts of unlabeled data. One reason for that is data sparsity, i.e., the limited amounts of data we have available in NLP. However, in most real-world NLP applications our labeled data is also heavily biased. This book introduces extensions of supervised learning algorithms to cope with data sparsity and different kinds of sampling bias. This book is intended to be both readable by first-year students and interesting to the expert audience. My intention was to introduce what is necessary to appreciate the major challenges we face in contemporary NLP related to data sparsity and sampling bias, without wasting too much time on details about supervised learning algorithms or particular NLP applications. I use text classification, part-of-speech tagging, and dependency parsing as running examples, and limit myself to a small set of cardinal learning algorithms. I have worried less about theoretical guarantees (\"this algorithm never does too badly\") than about useful rules of thumb (\"in this case this algorithm may perform really well\"). In NLP, data is so noisy, biased, and non-stationary that few theoretical guarantees can be established and we are typically left with our gut feelings and a catalogue of crazy ideas. I hope this book will provide its readers with both. Throughout the book we include snippets of Python code and empirical evaluations, when relevant. Table of Contents: Introduction / Supervised and Unsupervised Prediction / Semi-Supervised Learning / Learning under Bias / Learning under Unknown Bias / Evaluating under Bias","PeriodicalId":22125,"journal":{"name":"Synthesis Lectures on Human Language Technologies","volume":"57 1","pages":"1-103"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Semi-Supervised Learning and Domain Adaptation in Natural Language Processing\",\"authors\":\"Anders Søgaard\",\"doi\":\"10.2200/s00497ed1v01y201304hlt021\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This book introduces basic supervised learning algorithms applicable to natural language processing (NLP) and shows how the performance of these algorithms can often be improved by exploiting the marginal distribution of large amounts of unlabeled data. One reason for that is data sparsity, i.e., the limited amounts of data we have available in NLP. However, in most real-world NLP applications our labeled data is also heavily biased. This book introduces extensions of supervised learning algorithms to cope with data sparsity and different kinds of sampling bias. This book is intended to be both readable by first-year students and interesting to the expert audience. My intention was to introduce what is necessary to appreciate the major challenges we face in contemporary NLP related to data sparsity and sampling bias, without wasting too much time on details about supervised learning algorithms or particular NLP applications. I use text classification, part-of-speech tagging, and dependency parsing as running examples, and limit myself to a small set of cardinal learning algorithms. 
I have worried less about theoretical guarantees (\\\"this algorithm never does too badly\\\") than about useful rules of thumb (\\\"in this case this algorithm may perform really well\\\"). In NLP, data is so noisy, biased, and non-stationary that few theoretical guarantees can be established and we are typically left with our gut feelings and a catalogue of crazy ideas. I hope this book will provide its readers with both. Throughout the book we include snippets of Python code and empirical evaluations, when relevant. Table of Contents: Introduction / Supervised and Unsupervised Prediction / Semi-Supervised Learning / Learning under Bias / Learning under Unknown Bias / Evaluating under Bias\",\"PeriodicalId\":22125,\"journal\":{\"name\":\"Synthesis Lectures on Human Language Technologies\",\"volume\":\"57 1\",\"pages\":\"1-103\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-05-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Synthesis Lectures on Human Language Technologies\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2200/s00497ed1v01y201304hlt021\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Synthesis Lectures on Human Language Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2200/s00497ed1v01y201304hlt021","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 13

Abstract

This book introduces basic supervised learning algorithms applicable to natural language processing (NLP) and shows how the performance of these algorithms can often be improved by exploiting the marginal distribution of large amounts of unlabeled data. One reason for that is data sparsity, i.e., the limited amounts of data we have available in NLP. However, in most real-world NLP applications our labeled data is also heavily biased. This book introduces extensions of supervised learning algorithms to cope with data sparsity and different kinds of sampling bias.

This book is intended to be both readable by first-year students and interesting to the expert audience. My intention was to introduce what is necessary to appreciate the major challenges we face in contemporary NLP related to data sparsity and sampling bias, without wasting too much time on details about supervised learning algorithms or particular NLP applications. I use text classification, part-of-speech tagging, and dependency parsing as running examples, and limit myself to a small set of cardinal learning algorithms. I have worried less about theoretical guarantees ("this algorithm never does too badly") than about useful rules of thumb ("in this case this algorithm may perform really well"). In NLP, data is so noisy, biased, and non-stationary that few theoretical guarantees can be established and we are typically left with our gut feelings and a catalogue of crazy ideas. I hope this book will provide its readers with both. Throughout the book we include snippets of Python code and empirical evaluations, when relevant.

Table of Contents: Introduction / Supervised and Unsupervised Prediction / Semi-Supervised Learning / Learning under Bias / Learning under Unknown Bias / Evaluating under Bias
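The abstract's two central themes lend themselves to small worked examples. First, exploiting the marginal distribution of unlabeled data: self-training is one canonical semi-supervised algorithm in this family. The sketch below is a hypothetical illustration in Python (the language the book itself uses for snippets), not code from the book; the toy texts, the confidence threshold, and the round count are all assumptions.

```python
# Self-training for text classification: fit on labeled data, pseudo-label
# the unlabeled pool where the classifier is confident, retrain, repeat.
# Hypothetical sketch; the data and hyperparameters are invented.
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = ["great movie", "terrible plot", "loved it", "awful acting"]
y = np.array([1, 0, 1, 0])  # 1 = positive, 0 = negative
unlabeled_texts = ["really great film", "the plot was awful", "loved the acting"]

vec = TfidfVectorizer()
X_all = vec.fit_transform(labeled_texts + unlabeled_texts)
X_lab, X_unlab = X_all[: len(labeled_texts)], X_all[len(labeled_texts):]

for _ in range(5):  # a few self-training rounds
    clf = LogisticRegression().fit(X_lab, y)
    if X_unlab.shape[0] == 0:
        break
    proba = clf.predict_proba(X_unlab)
    confident = proba.max(axis=1) >= 0.7  # confidence threshold (an assumption)
    if not confident.any():
        break  # nothing the model is sure about; stop early
    # Promote confidently pseudo-labeled examples into the training set.
    X_lab = sparse.vstack([X_lab, X_unlab[confident]])
    y = np.concatenate([y, proba[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]
```

Second, learning under sampling bias: one standard recipe for covariate shift (sketched here on its own terms, not as the book's method) is importance weighting, where a domain classifier estimates how target-like each source example is, and the task learner is trained with those weights. The Gaussian toy data and the labeling rule below are assumptions for illustration.

```python
# Importance weighting under covariate shift: weight each source example by
# an estimate of p_target(x) / p_source(x), obtained from a domain classifier.
# Hypothetical sketch on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, size=(200, 5))  # biased (source) sample
y_src = (X_src[:, 0] > 0).astype(int)        # toy labeling rule (assumption)
X_tgt = rng.normal(0.5, 1.0, size=(200, 5))  # shifted (target) sample

# Domain classifier: 0 = source, 1 = target.
X_dom = np.vstack([X_src, X_tgt])
d = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
dom = LogisticRegression().fit(X_dom, d)

# w(x) = P(target | x) / P(source | x), proportional to the density ratio.
p = dom.predict_proba(X_src)
weights = p[:, 1] / p[:, 0]

# Train the task classifier on the reweighted source sample.
clf = LogisticRegression().fit(X_src, y_src, sample_weight=weights)
```

In both sketches the scikit-learn estimators are standard; everything else is invented for the example.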