Creating a corpus for automatic punctuation prediction in Persian texts

2017 Iranian Conference on Electrical Engineering (ICEE) Pub Date : 2017-05-01 DOI:10.1109/IRANIANCEE.2017.7985288

Seyyed MohammadSaleh Hosseini, H. Sameti


Citations: 4

Abstract



We present a novel corpus for automatic punctuation prediction in Persian texts. Punctuation prediction is an important task in automatic speech recognition (ASR). The output of ASR systems is typically a raw sequence of words with no punctuation marks, which makes the text difficult or even impossible to make sense of, both for humans and for any text processing unit. In this work, we have assembled a state-of-the-art Persian corpus to train and test a punctuation prediction model. To the best of our knowledge, this is the first corpus specifically designed for punctuation prediction in Persian texts. The corpus is a modification of an existing, manually part-of-speech (POS) tagged Persian corpus of almost 2.6 million words, including punctuation marks. We have made a number of improvements to the existing corpus so that it better supports experimental studies on Persian punctuation prediction: (1) replacing 3175 word types with their correct form, (2) normalizing the words (e.g. replacing kashida with hyphen), (3) correcting 451 and 192 words with incorrect DELM and DEFAULT tags, respectively, (4) investigating 17 word types to correct the punctuation around them, and (5) making numerous corrections to the punctuation marks. The final corpus contains nearly 2.3 million words and 221 thousand punctuation marks. Finally, we have trained and tested a CRF (conditional random field) model that achieves a micro-averaged F1-score of 60.69% in our preliminary experiments.
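The micro-averaged F1-score reported above pools true positives, false positives, and false negatives across all punctuation classes before computing precision and recall. The following is a minimal sketch of that metric, not the authors' evaluation code. It assumes a per-word labeling scheme in which each word carries the punctuation mark that follows it. The label names (`O`, `COMMA`, `PERIOD`) and the choice to exclude the "no punctuation" class from the counts are illustrative assumptions; the abstract does not specify them.

```python
# Hypothetical label meaning "no punctuation mark follows this word".
NO_PUNCT = "O"


def micro_f1(gold, pred, ignore=NO_PUNCT):
    """Micro-averaged F1 over punctuation classes.

    Counts are pooled over all classes except `ignore` (an assumption:
    the source does not say how the no-punctuation class is treated).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore:
            if p == g:
                tp += 1  # predicted a mark and it matches the gold mark
            else:
                fp += 1  # predicted a mark that is not the gold mark
        if g != ignore and p != g:
            fn += 1      # a gold mark was missed or mislabeled

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    gold = ["O", "COMMA", "O", "PERIOD", "O", "COMMA"]
    pred = ["O", "COMMA", "COMMA", "PERIOD", "O", "O"]
    print(f"{micro_f1(gold, pred):.4f}")  # prints 0.6667
```

In the toy example there are two correct marks, one spurious comma, and one missed comma, so precision and recall are both 2/3. Because counts are pooled, frequent marks dominate the score, which is worth keeping in mind when comparing against macro-averaged figures.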
