Short Text Categorization via Coherence Constraints

2011 13th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. Publication date: 2011-09-26. DOI: 10.1109/SYNASC.2011.33

Anca Dinu


Citations: 4

Abstract



In this article we propose a quantitative approach to a relatively new problem: categorizing text as pragmatically correct or pragmatically incorrect (or, stretching the notion, coherent or incoherent). Typical text categorization criteria include categorization by topic, by style (genre classification, authorship identification), and by expressed opinion (opinion mining, sentiment classification). Very few approaches consider the problem of categorizing text by degree of coherence. One application of coherence-based text categorization is a spam filter for personal e-mail accounts that can cope with one of the newer strategies adopted by spammers. This strategy consists of encoding the real message as a picture (which classical text-oriented filters cannot directly analyze and reject) and accompanying it with text specially designed to get past the filter. An important question for automatically categorizing texts as coherent or incoherent is whether there are features that can be extracted from these texts and used successfully to categorize them. We propose a quantitative approach that uses ratios between morphological categories in the texts as discriminant features. We apply supervised machine learning techniques to a small corpus of English e-mail messages and let the algorithms extract the important features from all the part-of-speech (POS) ratios. The results are encouraging.
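The abstract does not specify which morphological categories or which ratios are used, so the following is only a minimal sketch of what part-of-speech ratio features could look like. The tag set `CATEGORIES`, the function name `pos_ratio_features`, and the add-one smoothing are assumptions introduced here for illustration, not details taken from the paper. The input is assumed to be text that has already been tagged by some POS tagger.

```python
from collections import Counter
from itertools import permutations

# Hypothetical coarse tag set; the categories actually used in the paper
# are not listed in the abstract.
CATEGORIES = ("NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP")


def pos_ratio_features(tagged_tokens, categories=CATEGORIES, smoothing=1.0):
    """Map a POS-tagged text to ratios between category counts.

    tagged_tokens: iterable of (word, tag) pairs from any POS tagger.
    smoothing: added to both counts so an absent category never causes
               a division by zero (an assumption of this sketch).
    """
    counts = Counter(tag for _, tag in tagged_tokens if tag in categories)
    return {
        f"{a}/{b}": (counts[a] + smoothing) / (counts[b] + smoothing)
        for a, b in permutations(categories, 2)
    }


# Toy message, tagged by hand rather than by a real tagger.
message = [
    ("we", "PRON"), ("sent", "VERB"), ("the", "DET"), ("report", "NOUN"),
    ("to", "ADP"), ("the", "DET"), ("new", "ADJ"), ("office", "NOUN"),
]

features = pos_ratio_features(message)
print(features["NOUN/VERB"])  # (2 + 1) / (1 + 1) = 1.5
print(len(features))          # 7 * 6 = 42 ordered category pairs
```

Each message becomes a fixed-length vector of 42 ratios, which could then be passed to any supervised classifier. Which ratios turn out to be discriminative is left to the learning algorithm, in line with the abstract's description.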
