Semi-Supervised Learning and Domain Adaptation in Natural Language Processing

Anders Søgaard
{"title":"自然语言处理中的半监督学习和领域自适应","authors":"Anders Søgaard","doi":"10.2200/s00497ed1v01y201304hlt021","DOIUrl":null,"url":null,"abstract":"This book introduces basic supervised learning algorithms applicable to natural language processing (NLP) and shows how the performance of these algorithms can often be improved by exploiting the marginal distribution of large amounts of unlabeled data. One reason for that is data sparsity, i.e., the limited amounts of data we have available in NLP. However, in most real-world NLP applications our labeled data is also heavily biased. This book introduces extensions of supervised learning algorithms to cope with data sparsity and different kinds of sampling bias. This book is intended to be both readable by first-year students and interesting to the expert audience. My intention was to introduce what is necessary to appreciate the major challenges we face in contemporary NLP related to data sparsity and sampling bias, without wasting too much time on details about supervised learning algorithms or particular NLP applications. I use text classification, part-of-speech tagging, and dependency parsing as running examples, and limit myself to a small set of cardinal learning algorithms. I have worried less about theoretical guarantees (\"this algorithm never does too badly\") than about useful rules of thumb (\"in this case this algorithm may perform really well\"). In NLP, data is so noisy, biased, and non-stationary that few theoretical guarantees can be established and we are typically left with our gut feelings and a catalogue of crazy ideas. I hope this book will provide its readers with both. Throughout the book we include snippets of Python code and empirical evaluations, when relevant. Table of Contents: Introduction / Supervised and Unsupervised Prediction / Semi-Supervised Learning / Learning under Bias / Learning under Unknown Bias / Evaluating under Bias","PeriodicalId":22125,"journal":{"name":"Synthesis Lectures on Human Language Technologies","volume":"57 1","pages":"1-103"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Semi-Supervised Learning and Domain Adaptation in Natural Language Processing\",\"authors\":\"Anders Søgaard\",\"doi\":\"10.2200/s00497ed1v01y201304hlt021\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This book introduces basic supervised learning algorithms applicable to natural language processing (NLP) and shows how the performance of these algorithms can often be improved by exploiting the marginal distribution of large amounts of unlabeled data. One reason for that is data sparsity, i.e., the limited amounts of data we have available in NLP. However, in most real-world NLP applications our labeled data is also heavily biased. This book introduces extensions of supervised learning algorithms to cope with data sparsity and different kinds of sampling bias. This book is intended to be both readable by first-year students and interesting to the expert audience. My intention was to introduce what is necessary to appreciate the major challenges we face in contemporary NLP related to data sparsity and sampling bias, without wasting too much time on details about supervised learning algorithms or particular NLP applications. I use text classification, part-of-speech tagging, and dependency parsing as running examples, and limit myself to a small set of cardinal learning algorithms. 
I have worried less about theoretical guarantees (\\\"this algorithm never does too badly\\\") than about useful rules of thumb (\\\"in this case this algorithm may perform really well\\\"). In NLP, data is so noisy, biased, and non-stationary that few theoretical guarantees can be established and we are typically left with our gut feelings and a catalogue of crazy ideas. I hope this book will provide its readers with both. Throughout the book we include snippets of Python code and empirical evaluations, when relevant. Table of Contents: Introduction / Supervised and Unsupervised Prediction / Semi-Supervised Learning / Learning under Bias / Learning under Unknown Bias / Evaluating under Bias\",\"PeriodicalId\":22125,\"journal\":{\"name\":\"Synthesis Lectures on Human Language Technologies\",\"volume\":\"57 1\",\"pages\":\"1-103\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-05-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Synthesis Lectures on Human Language Technologies\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2200/s00497ed1v01y201304hlt021\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Synthesis Lectures on Human Language Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2200/s00497ed1v01y201304hlt021","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 13

Abstract

This book introduces basic supervised learning algorithms applicable to natural language processing (NLP) and shows how the performance of these algorithms can often be improved by exploiting the marginal distribution of large amounts of unlabeled data. One reason for that is data sparsity, i.e., the limited amounts of data we have available in NLP. However, in most real-world NLP applications our labeled data is also heavily biased. This book introduces extensions of supervised learning algorithms to cope with data sparsity and different kinds of sampling bias.

This book is intended to be both readable by first-year students and interesting to the expert audience. My intention was to introduce what is necessary to appreciate the major challenges we face in contemporary NLP related to data sparsity and sampling bias, without wasting too much time on details about supervised learning algorithms or particular NLP applications. I use text classification, part-of-speech tagging, and dependency parsing as running examples, and limit myself to a small set of cardinal learning algorithms. I have worried less about theoretical guarantees ("this algorithm never does too badly") than about useful rules of thumb ("in this case this algorithm may perform really well"). In NLP, data is so noisy, biased, and non-stationary that few theoretical guarantees can be established and we are typically left with our gut feelings and a catalogue of crazy ideas. I hope this book will provide its readers with both. Throughout the book we include snippets of Python code and empirical evaluations, when relevant.

Table of Contents: Introduction / Supervised and Unsupervised Prediction / Semi-Supervised Learning / Learning under Bias / Learning under Unknown Bias / Evaluating under Bias
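The abstract's two central themes lend themselves to small worked examples. First, exploiting the marginal distribution of unlabeled data: self-training is one canonical semi-supervised algorithm in this family. The sketch below is a hypothetical illustration in Python (the language the book itself uses for snippets), not code from the book; the toy texts, the confidence threshold, and the round count are all assumptions.

```python
# Self-training for text classification: fit on labeled data, pseudo-label
# the unlabeled pool where the classifier is confident, retrain, repeat.
# Hypothetical sketch; the data and hyperparameters are invented.
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = ["great movie", "terrible plot", "loved it", "awful acting"]
y = np.array([1, 0, 1, 0])  # 1 = positive, 0 = negative
unlabeled_texts = ["really great film", "the plot was awful", "loved the acting"]

vec = TfidfVectorizer()
X_all = vec.fit_transform(labeled_texts + unlabeled_texts)
X_lab, X_unlab = X_all[: len(labeled_texts)], X_all[len(labeled_texts):]

for _ in range(5):  # a few self-training rounds
    clf = LogisticRegression().fit(X_lab, y)
    if X_unlab.shape[0] == 0:
        break
    proba = clf.predict_proba(X_unlab)
    confident = proba.max(axis=1) >= 0.7  # confidence threshold (an assumption)
    if not confident.any():
        break  # nothing the model is sure about; stop early
    # Promote confidently pseudo-labeled examples into the training set.
    X_lab = sparse.vstack([X_lab, X_unlab[confident]])
    y = np.concatenate([y, proba[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]
```

Second, learning under sampling bias: one standard recipe for covariate shift (sketched here on its own terms, not as the book's method) is importance weighting, where a domain classifier estimates how target-like each source example is, and the task learner is trained with those weights. The Gaussian toy data and the labeling rule below are assumptions for illustration.

```python
# Importance weighting under covariate shift: weight each source example by
# an estimate of p_target(x) / p_source(x), obtained from a domain classifier.
# Hypothetical sketch on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, size=(200, 5))  # biased (source) sample
y_src = (X_src[:, 0] > 0).astype(int)        # toy labeling rule (assumption)
X_tgt = rng.normal(0.5, 1.0, size=(200, 5))  # shifted (target) sample

# Domain classifier: 0 = source, 1 = target.
X_dom = np.vstack([X_src, X_tgt])
d = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
dom = LogisticRegression().fit(X_dom, d)

# w(x) = P(target | x) / P(source | x), proportional to the density ratio.
p = dom.predict_proba(X_src)
weights = p[:, 1] / p[:, 0]

# Train the task classifier on the reweighted source sample.
clf = LogisticRegression().fit(X_src, y_src, sample_weight=weights)
```

In both sketches the scikit-learn estimators are standard; everything else is invented for the example.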