{"title":"CESMP: Chinese-English Segment-aligned Multi-field Patent Data","authors":"Wuying Liu, Lin Wang, Fumao Hu","doi":"10.1109/CCIS53392.2021.9754662","journal":{"name":"2021 IEEE 7th International Conference on Cloud Computing and Intelligent Systems (CCIS)"},"publicationDate":"2021-11-07","platform":"Semanticscholar","ListUrlMain":"https://doi.org/10.1109/CCIS53392.2021.9754662"}
CESMP: Chinese-English Segment-aligned Multi-field Patent Data
Patent data from around the world captures the essence of human scientific discovery and technological innovation, but language differences remain a major obstacle to patent retrieval and communication. We aim to build a bridge from Chinese to English in the patent domain so that English speakers can make better use of Chinese patent data. Using natural language processing technologies such as optical character recognition, Chinese text processing, machine translation, and English text processing, we construct digital Chinese-English segment-aligned multi-field patent (CESMP) data from scanned Chinese patents. The current CESMP data consists of 610,310 patent documents in XML format. Each patent document contains six required fields (date, publication, ipc, title, abstract, and claim) and four optional fields (cpc, wipo, originalapplicant, and currentowner); of these, the wipo, title, abstract, and claim fields are aligned as Chinese and English segments. This well-structured bilingual patent data supports two kinds of work. First, resource construction algorithms can efficiently build a bilingual patent dictionary and a parallel patent segment bank. Second, deep natural language processing algorithms can be applied in practical intelligent applications such as cross-language patent retrieval, patent spam filtering, patent network analysis, and patent machine translation.
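
The field layout described above can be illustrated with a minimal Python sketch. The ten field names come from the abstract, but the abstract does not give the actual XML schema, so everything else here is an assumption: the `patent` root element, the `segment`, `zh`, and `en` element names, the helper functions `missing_required` and `aligned_pairs`, and the sample values are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Field names as listed in the abstract.
REQUIRED = ("date", "publication", "ipc", "title", "abstract", "claim")
OPTIONAL = ("cpc", "wipo", "originalapplicant", "currentowner")
ALIGNED = ("wipo", "title", "abstract", "claim")

# Hypothetical document layout with made-up placeholder values.
# The real CESMP schema may nest or name these elements differently.
SAMPLE = """<patent>
  <date>2020-01-01</date>
  <publication>CN000000000A</publication>
  <ipc>G06F</ipc>
  <title><segment><zh>一种装置</zh><en>A device</en></segment></title>
  <abstract>
    <segment><zh>本发明公开了一种装置。</zh><en>The invention discloses a device.</en></segment>
  </abstract>
  <claim>
    <segment><zh>一种装置，</zh><en>A device,</en></segment>
    <segment><zh>其特征在于。</zh><en>characterized in that.</en></segment>
  </claim>
</patent>"""


def missing_required(root):
    """Return the required field names that are absent from the document."""
    return [name for name in REQUIRED if root.find(name) is None]


def aligned_pairs(root):
    """Collect (field, Chinese segment, English segment) triples.

    Optional aligned fields such as wipo are skipped when absent, and
    segments lacking either side are dropped.
    """
    pairs = []
    for name in ALIGNED:
        field = root.find(name)
        if field is None:
            continue
        for seg in field.findall("segment"):
            zh = seg.findtext("zh", default="").strip()
            en = seg.findtext("en", default="").strip()
            if zh and en:
                pairs.append((name, zh, en))
    return pairs


root = ET.fromstring(SAMPLE)
print("missing required fields:", missing_required(root))
for field, zh, en in aligned_pairs(root):
    print(field, "|", zh, "|", en)
```

Under these assumptions, triples collected this way across many documents would be one possible starting point for the parallel patent segment bank mentioned in the abstract.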