Efficient sentence segmentation using syntactic features

Benoit Favre, Dilek Z. Hakkani-Tür, Slav Petrov, D. Klein
2008 IEEE Spoken Language Technology Workshop. Published 2008-12-01. DOI: 10.1109/SLT.2008.4777844 (https://doi.org/10.1109/SLT.2008.4777844)
Citations: 25

Abstract

To enable downstream language processing, automatic speech recognition output must be segmented into its individual sentences. Previous sentence segmentation systems have typically been very local, using low-level prosodic and lexical features to independently decide whether or not to segment at each word boundary position. In this work, we leverage global syntactic information from a syntactic parser, which is better able to capture long-distance dependencies. While some previous work has included syntactic features, ours is the first to do so in a tractable, lattice-based way, which is crucial for scaling up to long-sentence contexts. Specifically, an initial hypothesis lattice is constructed using local features. Candidate sentences are then assigned syntactic language model scores. These global syntactic scores are combined with local low-level scores in a log-linear model. The resulting system significantly outperforms the most popular long-span model for sentence segmentation (the hidden event language model) on both reference text and automatic speech recognizer output from news broadcasts.
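
The pipeline described in the abstract (a lattice pruned with local scores, per-sentence global scores, and a log-linear combination) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `segment` and `toy_syntactic_score`, the weights `w_local` and `w_global`, the pruning threshold `prune`, and the example probabilities are all assumptions made for illustration. In particular, `toy_syntactic_score` is a hand-written stand-in for a parser-based syntactic language model score.

```python
import math
from typing import Callable, List, Sequence


def _log(x: float) -> float:
    # Clamp to avoid log(0) for boundary probabilities of exactly 0 or 1.
    return math.log(max(x, 1e-12))


def segment(
    words: Sequence[str],
    boundary_prob: Sequence[float],
    sentence_score: Callable[[Sequence[str]], float],
    w_local: float = 1.0,
    w_global: float = 1.0,
    prune: float = 0.1,
) -> List[List[str]]:
    """Pick the best segmentation by dynamic programming over a pruned lattice.

    boundary_prob[k] is a local estimate of the probability that a sentence
    boundary falls between words[k] and words[k + 1] (length: len(words) - 1).
    sentence_score returns a log-domain score for one candidate sentence.
    """
    n = len(words)
    if n == 0:
        return []
    if len(boundary_prob) != n - 1:
        raise ValueError("need one boundary probability per internal word boundary")

    # Lattice nodes: start, end, and every position the local model finds plausible.
    cuts = [0] + [j for j in range(1, n) if boundary_prob[j - 1] >= prune] + [n]

    best = {0: 0.0}  # best log-linear score of any path reaching each node
    back = {}        # back-pointers for recovering the best path
    for b, j in enumerate(cuts[1:], start=1):
        for i in cuts[:b]:
            # Local score of the edge i -> j: no boundary inside the sentence...
            local = sum(_log(1.0 - boundary_prob[k]) for k in range(i, j - 1))
            # ...and a boundary at its end (the end of the input is a forced boundary).
            if j < n:
                local += _log(boundary_prob[j - 1])
            total = best[i] + w_local * local + w_global * sentence_score(words[i:j])
            if j not in best or total > best[j]:
                best[j] = total
                back[j] = i

    # Follow back-pointers from the end of the input to the start.
    sentences = []
    j = n
    while j > 0:
        i = back[j]
        sentences.append(list(words[i:j]))
        j = i
    return sentences[::-1]


def toy_syntactic_score(sentence: Sequence[str]) -> float:
    # Hypothetical stand-in for a syntactic LM: prefer exactly one verb per
    # sentence and penalize sentences that end on a determiner.
    verbs = {"barked", "ran"}
    n_verbs = sum(w in verbs for w in sentence)
    score = -2.0 * abs(n_verbs - 1)
    if sentence[-1] == "the":
        score -= 2.0
    return score


if __name__ == "__main__":
    words = "the dog barked the cat ran away".split()
    probs = [0.05, 0.3, 0.4, 0.3, 0.05, 0.2]  # made-up local boundary probabilities
    print(segment(words, probs, toy_syntactic_score))
    print(segment(words, probs, toy_syntactic_score, w_global=0.0))
```

Setting `w_global=0.0` reduces the sketch to a purely local decision, which makes it easy to see what the sentence-level score contributes. Because scoring happens only on lattice edges, the global scorer is called once per candidate sentence rather than once per full segmentation.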