Automatic Labeling of Training Data for Collecting Tweets for Ambiguous TV Program Titles

M. Erdmann, Erik Ward, K. Ikeda, Gen Hattori, C. Ono, Y. Takishima
Published in: 2013 International Conference on Social Computing
Publication date: 2013-09-08
DOI: 10.1109/SocialCom.2013.119
URL: https://doi.org/10.1109/SocialCom.2013.119
Citations: 5

Abstract

Twitter is a popular medium for sharing opinions on TV programs, and the analysis of TV-related tweets is attracting considerable interest. However, collecting all tweets that contain a given TV program title yields a large number of unrelated tweets, because many TV program titles are ambiguous. Supervised learning can collect TV-related tweets with high accuracy, but it requires labeled training data. The goal of our proposed method is to automate the labeling process, eliminating the cost of data labeling without sacrificing classification accuracy. When creating the training data, we use only tweets of unambiguous TV program titles. To decide whether a TV program title is ambiguous, we automatically determine whether it can be used as a common expression or named entity. In two experiments, in which we collected tweets for 32 ambiguous TV program titles, automatically labeled training data achieved the same (78.2%) or even higher (79.1%) classification accuracy compared with manually labeled data, while effectively eliminating labeling costs.
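The idea of labeling training data automatically can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the ambiguity check, the source of negative examples, or the classifier, so everything below is an assumption. A small hand-written lexicon (`COMMON_OR_ENTITY`) stands in for the automatic common-expression and named-entity check, tweets that mention no title at all are assumed to be unrelated, and a bag-of-words naive Bayes model is swapped in as the supervised learner. All function names are hypothetical.

```python
import math
from collections import Counter

# Assumption: a tiny hand-made lexicon replaces the automatic check of whether
# a title can also be used as a common expression or named entity.
COMMON_OR_ENTITY = {"lost", "friends", "house"}

def is_ambiguous(title):
    return title.lower() in COMMON_OR_ENTITY

def auto_label(tweets, titles):
    """Label tweets without human input.

    A tweet naming an unambiguous title is labeled TV-related (1).
    Assumption: a tweet naming no title at all is labeled unrelated (0).
    Tweets that match only ambiguous titles are left out of the training set.
    """
    all_titles = [t.lower() for t in titles]
    unambiguous = [t for t in all_titles if not is_ambiguous(t)]
    labeled = []
    for tweet in tweets:
        text = tweet.lower()
        if any(t in text for t in unambiguous):
            labeled.append((text, 1))
        elif not any(t in text for t in all_titles):
            labeled.append((text, 0))
    return labeled

def train(labeled):
    """Collect per-class word counts for a multinomial naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def predict(model, tweet):
    """Return 1 (TV-related) or 0 (unrelated) using add-one smoothing."""
    word_counts, class_counts = model
    vocab = set(word_counts[0]) | set(word_counts[1])
    n_docs = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        total = sum(word_counts[label].values())
        score = math.log((class_counts[label] + 1) / (n_docs + 2))
        for word in tweet.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

titles = ["Lost", "Breaking Bad", "Doctor Who"]
tweets = [
    "watching breaking bad tonight new episode",
    "doctor who episode tonight was great",
    "i lost my keys on the train",  # matches only an ambiguous title: skipped
    "traffic on the train was awful today",
    "my keys are in the car",
]
model = train(auto_label(tweets, titles))
print(predict(model, "lost episode tonight"))
print(predict(model, "lost my keys in the car"))
```

The classifier never sees a labeled tweet for the ambiguous title "Lost"; it has to judge such tweets from the surrounding words it learned from the unambiguous titles, which is the property the abstract relies on.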