Maximal frequent sequences for document classification
Authors: Hai Nguyen Thi Tuyet, Tan Hanh
Published in: 2016 International Conference on Advanced Technologies for Communications (ATC), 2016-10-01
DOI: 10.1109/ATC.2016.7764764 (https://doi.org/10.1109/ATC.2016.7764764)
Document classification has attracted considerable attention from researchers, owing to the growing volume of digital documents and the need to organize them. One of the most popular approaches to this problem is based on machine learning techniques [1]. However, classification results depend heavily on linguistic preprocessing and on the document representation. This dependence is more pronounced for languages in which blanks separate not only words but also the syllables that make up words, such as Vietnamese and Chinese. In this paper, we propose a language-independent classifier that relies on a flexible feature called Maximal Frequent Sequences (MFSs) [2]. In addition, we design and implement a novel algorithm for finding MFSs. Our algorithm follows the MFS definition of H. Ahonen-Myka [2] and skips the expensive pruning phase. Experiments show that our approach achieves average F-measures of 85.16% and 89.27% on 7 classes of the widely used Reuters-21578 dataset and 5 classes of Vietnamese documents, respectively.
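To make the notion of an MFS concrete, the following is a minimal brute-force sketch, not the algorithm proposed in the paper. It assumes the usual reading of the definition: a word sequence is frequent if it occurs, in order and with gaps allowed, in at least `min_support` documents, and it is maximal if it is not a subsequence of any other frequent sequence. The function names, the `max_len` cap, and the toy documents are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def is_subsequence(pattern, doc):
    """True if the words of `pattern` occur in `doc` in the same order (gaps allowed)."""
    it = iter(doc)
    # `word in it` consumes the iterator, so each match must come after the previous one.
    return all(word in it for word in pattern)


def maximal_frequent_sequences(docs, min_support, max_len=5):
    """Naive enumeration of maximal frequent sequences (illustrative only).

    Assumption: candidates are capped at `max_len` words, so maximality is
    relative to that cap. The cost is exponential; this is not a practical
    mining algorithm.
    """
    # Every ordered word subsequence (up to max_len) of any document is a candidate.
    candidates = set()
    for doc in docs:
        for n in range(1, max_len + 1):
            for idx in combinations(range(len(doc)), n):
                candidates.add(tuple(doc[i] for i in idx))

    # Keep candidates supported by at least `min_support` documents.
    frequent = {
        c for c in candidates
        if sum(is_subsequence(c, d) for d in docs) >= min_support
    }

    # Keep only sequences not contained in a different frequent sequence.
    return {
        p for p in frequent
        if not any(p != q and is_subsequence(p, q) for q in frequent)
    }


# Toy corpus (hypothetical): three tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the mat".split(),
    "a dog sat here".split(),
]
mfs = maximal_frequent_sequences(docs, min_support=2)
print(sorted(mfs))
```

On this toy corpus, the first two documents share the sequence `the cat on the mat`, and the first and third share `sat`, so those two sequences are the maximal frequent ones. Because the sketch works on whitespace-separated tokens only, it needs no word segmentation, which is in the spirit of the language-independent feature the abstract describes.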