Elicit spoken-style data from social media through a style classifier

A. Chotimongkol, Vataya Chunwijitra, Sumonmas Thatphithakkul, Nattapong Kurpukdee, C. Wutiwiwatchai
Published in: 2015 International Conference Oriental COCOSDA held jointly with 2015 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE)
Publication date: 2015-12-17
DOI: 10.1109/ICSDA.2015.7357856 (https://doi.org/10.1109/ICSDA.2015.7357856)
Citations: 5

Abstract

We explore the use of social media data to reduce the effort of developing a conversational speech corpus. The LOTUS-SOC corpus was created by recording Twitter messages through a mobile application. In the first phase, which took around one month, 172 hours of speech from 208 speakers were recorded and were ready for use without the need for speech segmentation or transcription. In terms of similarity to spoken language, the perplexity of LOTUS-SOC with respect to known spoken utterances is lower than that of the broadcast news corpus and almost as low as that of the telephone conversation corpus. We also applied a style classifier, trained on words and parts of speech using two machine learning approaches, SVM and CRF, to identify spoken-style utterances in LOTUS-SOC. By training a language model on only the utterances classified as “spoken”, the perplexity of LOTUS-SOC was further reduced, as evaluated on three different sets of spoken utterances.
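The perplexity comparison described above can be illustrated with a minimal sketch. The abstract does not state the language model order, the smoothing method, or the toolkit used, so everything below is an assumption made for illustration: an add-one-smoothed unigram model, English toy sentences standing in for the corpora, and the hypothetical helper names `train_unigram` and `perplexity`. A lower perplexity on held-out spoken utterances means the training text is closer in style to spoken language.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count word frequencies over tokenized training sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return counts, sum(counts.values())


def perplexity(model, sentences, vocab_size):
    """Perplexity of an add-one-smoothed unigram model on test sentences."""
    counts, total = model
    log_prob = 0.0
    n_tokens = 0
    for sentence in sentences:
        for word in sentence:
            # Add-one smoothing (an assumption, not the paper's method)
            p = (counts[word] + 1) / (total + vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


# Toy stand-ins for a spoken-style corpus and a broadcast news corpus
spoken_train = [
    ["i", "am", "so", "hungry", "lol"],
    ["are", "you", "coming", "or", "what"],
]
news_train = [
    ["the", "minister", "announced", "a", "new", "policy"],
    ["the", "market", "closed", "higher", "today"],
]
# Held-out "known spoken utterances" used for evaluation
spoken_test = [["i", "am", "coming", "lol"]]

# Shared vocabulary, plus one slot reserved for unseen words
vocab = {w for s in spoken_train + news_train + spoken_test for w in s}
vocab_size = len(vocab) + 1

ppl_spoken = perplexity(train_unigram(spoken_train), spoken_test, vocab_size)
ppl_news = perplexity(train_unigram(news_train), spoken_test, vocab_size)

print(f"spoken-style model perplexity: {ppl_spoken:.2f}")
print(f"news-style model perplexity:   {ppl_news:.2f}")
```

In this toy setting the model trained on spoken-style text gives the lower perplexity on the spoken test set. The same comparison, with a model trained only on utterances that a style classifier labels as “spoken”, is the kind of evaluation the abstract reports.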