Document Classification Based on Metadata and Keywords Extraction
Eman Y. Rezqa, R. Baraka
2021 Palestinian International Conference on Information and Communication Technology (PICICT), published 2021-09-01
DOI: https://doi.org/10.1109/PICICT53635.2021.00016
We present a model for the automatic extraction of metadata and keywords for use in classifying scientific documents. The model consists of three stages: metadata extraction, keyword extraction, and document classification. In the metadata extraction stage, several metadata items are extracted from research documents in the domain of commerce, such as the title of the thesis or research article, the author(s), the advisor(s), the year, the publisher, the type, and the abstract. In the keyword extraction stage, Latent Semantic Indexing (LSI) is used to extract the underlying topics from these documents. The classification stage, which depends on the output of the two extraction stages, uses three classification algorithms: Stochastic Gradient Descent (SGD), Linear Support Vector Classification (LSVC), and K-Nearest Neighbors (KNN). On the Arabic document corpus, SGD achieved the highest classification accuracy (80.5%), ahead of LSVC and KNN. On the English document corpus, LSVC achieved the highest classification accuracy (81.5%), ahead of SGD and KNN.
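As a rough illustration of the final classification stage only, the Python sketch below runs a K-Nearest Neighbors vote over term-count vectors built from a document's title and keywords. It is not the authors' implementation: the record layout, the toy commerce records, the labels, the cosine-similarity measure, and every function name are assumptions made for this example, and the LSI keyword extraction step and the SGD and LSVC classifiers are not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(record):
    """Build a sparse term-count vector from a record's title and keywords."""
    text = record["title"] + " " + " ".join(record["keywords"])
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(train, query, k=3):
    """Return the majority label among the k training records most similar to the query."""
    q = vectorize(query)
    ranked = sorted(train, key=lambda item: cosine(vectorize(item[0]), q), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled records; a real corpus would supply extracted metadata and keywords.
TRAIN = [
    ({"title": "Audit quality and financial reporting",
      "keywords": ["audit", "financial", "reporting"]}, "accounting"),
    ({"title": "Tax accounting in small firms",
      "keywords": ["tax", "accounting", "firms"]}, "accounting"),
    ({"title": "Cost accounting and budget control",
      "keywords": ["cost", "accounting", "budget"]}, "accounting"),
    ({"title": "Social media advertising and brand loyalty",
      "keywords": ["advertising", "brand", "loyalty"]}, "marketing"),
    ({"title": "Consumer behavior in online marketing",
      "keywords": ["consumer", "marketing", "online"]}, "marketing"),
    ({"title": "Brand image and customer satisfaction",
      "keywords": ["brand", "customer", "satisfaction"]}, "marketing"),
]

query = {"title": "Brand advertising and consumer loyalty",
         "keywords": ["brand", "advertising", "consumer"]}
print(knn_predict(TRAIN, query))
```

Because keywords are appended to the title before counting, a term that appears in both is weighted twice. That is a deliberate simplification here; a fuller pipeline would more likely weight terms with TF-IDF and project them into an LSI topic space before classifying.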