Hashrin C P, Amal Jossy, Sudhakaran K, Thushara A, Ansamma John
2019 1st International Conference on Innovations in Information and Communication Technology (ICIICT), 2019-04-01. DOI: 10.1109/ICIICT1.2019.8741416 (https://doi.org/10.1109/ICIICT1.2019.8741416)
Segmenting Characters from Malayalam Handwritten Documents
Constructing an Optical Character Recognition (OCR) model for handwritten documents poses many challenges, the most prominent being dataset collection, character segmentation, and classification. This paper focuses on segmentation and presents a novel approach to segmenting individual characters from Malayalam handwritten documents. The approach has three stages: morphological operations, contour analysis, and bounding box detection are used to extract individual lines from the document, then words from each line, and finally characters from each word. An additional masking step handles bounding boxes that overlap because of skewed lines and diacritics. The segmented characters can be used either to create datasets or as input to OCR models.
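The three-stage idea can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not give kernel sizes, thresholds, or the details of the masking step, so everything below is an assumption made for illustration. In particular, the function names (`dilate`, `components`, `segment`, `segment_document`), the kernel sizes, and the toy page are hypothetical. Connected-component labelling stands in for contour analysis, and "masking" is read here as blanking any ink inside a bounding box that does not belong to the region being cut out. A practical version would more likely use an image-processing library on scanned, binarized pages.

```python
from collections import deque


def dilate(img, kh, kw):
    """Binary dilation with a kh x kw rectangular kernel (ink = 1)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for yy in range(max(0, y - kh // 2), min(h, y + kh // 2 + 1)):
                    for xx in range(max(0, x - kw // 2), min(w, x + kw // 2 + 1)):
                        out[yy][xx] = 1
    return out


def components(img):
    """Pixel lists of 8-connected ink regions (stand-in for contour analysis)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue, pixels = deque([(y, x)]), []
                while queue:
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for ny in range(max(0, cy - 1), min(h, cy + 2)):
                        for nx in range(max(0, cx - 1), min(w, cx + 2)):
                            if img[ny][nx] and not seen[ny][nx]:
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                regions.append(pixels)
    return regions


def segment(img, kernel, axis):
    """Merge nearby ink by dilation, then cut one masked crop per merged region.

    axis=0 orders the crops top to bottom, axis=1 left to right.
    """
    merged = dilate(img, *kernel)
    pieces = []
    for pixels in components(merged):
        ys = [p[0] for p in pixels]
        xs = [p[1] for p in pixels]
        y0, y1, x0, x1 = min(ys), max(ys), min(xs), max(xs)
        # Assumed masking step: copy only ink that lies inside this region, so a
        # neighbour that intrudes into the bounding box is left blank.
        crop = [[0] * (x1 - x0 + 1) for _ in range(y1 - y0 + 1)]
        for y, x in pixels:
            crop[y - y0][x - x0] = img[y][x]
        pieces.append(((y0, x0)[axis], crop))
    pieces.sort(key=lambda piece: piece[0])
    return [crop for _, crop in pieces]


def segment_document(page, line_kernel, word_kernel):
    """Lines from the page, words from each line, characters from each word."""
    return [
        [segment(word, (1, 1), axis=1) for word in segment(line, word_kernel, axis=1)]
        for line in segment(page, line_kernel, axis=0)
    ]


# Toy page: '#' is ink. Line 1 holds a two-character word and a one-character
# word; line 2 holds a single character.
rows = [
    "#.#.....#",
    "#.#.....#",
    ".........",
    ".........",
    "#........",
    "#........",
]
page = [[1 if ch == "#" else 0 for ch in row] for row in rows]
result = segment_document(page, line_kernel=(1, 9), word_kernel=(1, 3))
print([[len(word) for word in line] for line in result])
```

The only thing that changes between stages is the dilation kernel: a wide kernel fuses everything on a text line into one region, a narrower one fuses the characters of a word, and a 1 x 1 kernel leaves characters separate. Suitable sizes depend on handwriting size and scan resolution and would need tuning on real data.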