Quantitative analysis of treebanks using frequent subtree mining methods

Workshop on Graph-based Methods for Natural Language Processing Pub Date : 2009-08-07 DOI:10.3115/1708124.1708140

S. Martens

Citations: 6

Abstract

The first task of statistical computational linguistics, or any other type of data-driven processing of language, is the extraction of counts and distributions of phenomena. This is much more difficult for the type of complex structured data found in treebanks and in corpora with sophisticated annotation than for tokenized texts. Recent developments in data mining, particularly in the extraction of frequent subtrees from treebanks, offer some solutions. We have applied a modified version of the TreeMiner algorithm to a small treebank and present some promising results.
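To make the idea of extracting frequent subtrees concrete, here is a minimal sketch in Python. It is not the modified TreeMiner algorithm the abstract refers to: it counts only complete bottom-up subtrees (a node together with all of its descendants), whereas TreeMiner mines a more general class of embedded subtrees. The toy treebank, the helper `t`, and the function names are all invented for illustration.

```python
from collections import Counter


def t(label, *children):
    """Hypothetical helper: build a tree as a (label, children) pair."""
    return (label, children)


# Invented three-sentence treebank with phrase-structure labels only.
TREEBANK = [
    t("S", t("NP", t("D"), t("N")), t("VP", t("V"), t("NP", t("D"), t("N")))),
    t("S", t("NP", t("N")), t("VP", t("V"), t("NP", t("D"), t("N")))),
    t("S", t("NP", t("D"), t("N")), t("VP", t("V"))),
]


def encode(tree):
    """Serialize a tree as a bracketed string, e.g. (NP (D) (N))."""
    label, children = tree
    if not children:
        return f"({label})"
    return f"({label} " + " ".join(encode(c) for c in children) + ")"


def bottom_up_subtrees(tree):
    """Yield the encoding of the complete subtree rooted at every node."""
    yield encode(tree)
    for child in tree[1]:
        yield from bottom_up_subtrees(child)


def frequent_subtrees(treebank, min_support):
    """Return patterns that occur in at least min_support distinct trees."""
    support = Counter()
    for tree in treebank:
        # set() counts each pattern once per tree, so support is the
        # number of trees containing the pattern, not the raw occurrences.
        support.update(set(bottom_up_subtrees(tree)))
    return {pat: n for pat, n in support.items() if n >= min_support}


RESULT = frequent_subtrees(TREEBANK, min_support=2)
for pattern, count in sorted(RESULT.items(), key=lambda kv: -kv[1]):
    print(count, pattern)
```

Support here is counted per tree rather than per occurrence, which is one common convention in frequent pattern mining; the paper's own counting scheme may differ.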
