Efficient sentence segmentation using syntactic features

2008 IEEE Spoken Language Technology Workshop. Pub Date: 2008-12-01. DOI: 10.1109/SLT.2008.4777844

Benoit Favre, Dilek Z. Hakkani-Tür, Slav Petrov, D. Klein

Citations: 25

Abstract

To enable downstream language processing, automatic speech recognition output must be segmented into individual sentences. Previous sentence segmentation systems have typically been very local, using low-level prosodic and lexical features to decide independently whether to segment at each word boundary. In this work, we leverage global syntactic information from a syntactic parser, which is better able to capture long-distance dependencies. While some previous work has included syntactic features, ours is the first to do so in a tractable, lattice-based way, which is crucial for scaling up to long-sentence contexts. Specifically, an initial hypothesis lattice is constructed using local features. Candidate sentences are then assigned syntactic language model scores. These global syntactic scores are combined with local low-level scores in a log-linear model. The resulting system significantly outperforms the most popular long-span model for sentence segmentation (the hidden event language model) on both reference text and automatic speech recognizer output from news broadcasts.
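The pipeline described in the abstract (a lattice pruned by local scores, per-sentence syntactic scores, and a log-linear combination) can be illustrated with a small sketch. This is not the authors' implementation: the function `segment`, the pruning threshold `min_p`, the weights `w_local` and `w_syntax`, and the toy scorer `toy_syntax_logscore` are all hypothetical names introduced here, and the toy scorer merely stands in for a real parser-based syntactic language model.

```python
import math
from typing import Callable, List, Sequence, Tuple


def segment(
    words: Sequence[str],
    p_boundary: Sequence[float],
    syntax_logscore: Callable[[Sequence[str]], float],
    w_local: float = 1.0,
    w_syntax: float = 1.0,
    min_p: float = 0.05,
) -> List[Tuple[int, int]]:
    """Return sentence spans (start, end) maximizing a log-linear score.

    p_boundary[i] is an assumed local-model probability of a sentence
    boundary after words[i]. syntax_logscore is a stand-in for a
    syntactic language model that scores a whole candidate sentence.
    """
    n = len(words)
    eps = 1e-9  # avoids log(0)

    # Lattice nodes: the start, every boundary that survives local pruning,
    # and the end of the input (always a boundary).
    nodes = [0] + [i + 1 for i in range(n - 1) if p_boundary[i] >= min_p] + [n]

    # Prefix sums of log P(no boundary) so each edge's local score is O(1).
    no_b = [0.0]
    for p in p_boundary:
        no_b.append(no_b[-1] + math.log(max(1.0 - p, eps)))

    best = {0: 0.0}  # best path score reaching each node
    back = {}        # back-pointer to the previous boundary

    for j in nodes[1:]:
        for i in nodes:
            if i >= j:
                break
            # Local score: no boundary inside words[i:j], boundary after it.
            # The final boundary term is constant across paths, so harmless.
            local = (no_b[j - 1] - no_b[i]) + math.log(max(p_boundary[j - 1], eps))
            # Global score: one syntactic score for the candidate sentence.
            score = best[i] + w_local * local + w_syntax * syntax_logscore(words[i:j])
            if j not in best or score > best[j]:
                best[j] = score
                back[j] = i

    # Trace the best path back from the end of the input.
    spans = []
    j = n
    while j > 0:
        i = back[j]
        spans.append((i, j))
        j = i
    return spans[::-1]


def toy_syntax_logscore(sentence: Sequence[str]) -> float:
    """Toy stand-in: prefer sentences with exactly one verb, in final position."""
    verbs = {"barked", "ran"}
    n_verbs = sum(w in verbs for w in sentence)
    return 0.0 if n_verbs == 1 and sentence[-1] in verbs else -5.0


words = "the dog barked the cat ran".split()
# The local model is undecided (p = 0.5) about the boundary after "barked".
p_boundary = [0.1, 0.1, 0.5, 0.1, 0.1, 0.9]
print(segment(words, p_boundary, toy_syntax_logscore))  # [(0, 3), (3, 6)]
```

In this toy input the local scores alone cannot separate one long sentence from two short ones, so the per-sentence score decides. Restricting candidate sentences to spans between surviving lattice nodes is what keeps the number of calls to the (expensive) sentence scorer manageable.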


