{"title":"Short Text Categorization via Coherence Constraints","authors":"Anca Dinu","doi":"10.1109/SYNASC.2011.33","DOIUrl":null,"url":null,"abstract":"In this article we propose a quantitative approach to a relatively new problem: categorizing text as pragmatically correct or pragmatically incorrect (forcing the notion, coherent/incoherent). The typical text categorization criterions comprise categorization by topic, by style (genre classification, authorship identification), by expressed opinion (opinion mining, sentiment classification), etc. Very few approaches consider the problem of categorizing text by degree of coherence. One example of application of text categorization by its coherence is creating a spam filter for personal e-mail accounts able to cope with one of the new strategies adopted by spamers. This strategy consists of encoding the real message as picture (impossible to directly analyze and reject by the text oriented classical filters) and accompanying it by a text especially designed to surpass the filter. An important question for automatically categorizing texts into coherent and incoherent is: are there features that can be extracted from these texts and be successfully used to categorize them? We propose a quantitative approach that relies on the use of ratios between morphological categories from the texts as discriminant features. We use supervised machine learning techniques on a small corpus of English e-mail messages and let the algorithms extract important features from all the pos ratios. The results are encouraging.","PeriodicalId":184344,"journal":{"name":"2011 13th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing","volume":"56 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 13th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SYNASC.2011.33","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4
Abstract
In this article we propose a quantitative approach to a relatively new problem: categorizing text as pragmatically correct or pragmatically incorrect (forcing the notion, coherent/incoherent). The typical text categorization criterions comprise categorization by topic, by style (genre classification, authorship identification), by expressed opinion (opinion mining, sentiment classification), etc. Very few approaches consider the problem of categorizing text by degree of coherence. One example of application of text categorization by its coherence is creating a spam filter for personal e-mail accounts able to cope with one of the new strategies adopted by spamers. This strategy consists of encoding the real message as picture (impossible to directly analyze and reject by the text oriented classical filters) and accompanying it by a text especially designed to surpass the filter. An important question for automatically categorizing texts into coherent and incoherent is: are there features that can be extracted from these texts and be successfully used to categorize them? We propose a quantitative approach that relies on the use of ratios between morphological categories from the texts as discriminant features. We use supervised machine learning techniques on a small corpus of English e-mail messages and let the algorithms extract important features from all the pos ratios. The results are encouraging.