CESMP: Chinese-English Segment-aligned Multi-field Patent Data

Wuying Liu, Lin Wang, Fumao Hu
Published in: 2021 IEEE 7th International Conference on Cloud Computing and Intelligent Systems (CCIS), 2021-11-07
DOI: 10.1109/CCIS53392.2021.9754662 (https://doi.org/10.1109/CCIS53392.2021.9754662)
Citations: 0

Abstract
Patent data from countries around the world embodies the essence of humanity's scientific discovery and technological innovation, but language differences remain a major obstacle to patent data retrieval and communication. We aim to build a bridge from Chinese to English in the patent domain so that English speakers can make better use of Chinese patent data. Using natural language processing technologies such as optical character recognition, Chinese text processing, machine translation, and English text processing, we construct digital Chinese-English segment-aligned multi-field patent (CESMP) data from scanned Chinese patents. The current CESMP data consists of 610,310 patent documents in XML format. Each patent document contains six required fields (date, publication, ipc, title, abstract, and claim) and four optional fields (cpc, wipo, originalapplicant, and currentowner); the wipo, title, abstract, and claim fields are aligned as Chinese and English segments. This well-structured bilingual patent data supports two kinds of work. Resource construction algorithms can efficiently build a bilingual patent dictionary and a parallel patent segment bank, and deep natural language processing algorithms can be applied in practical intelligent applications such as cross-language patent retrieval, patent spam filtering, patent network analysis, and patent machine translation.
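The field layout described in the abstract can be illustrated with a minimal Python sketch. Only the ten field names and the choice of aligned fields come from the abstract. The XML layout itself is an assumption made for illustration: the `patent` root element, the `segment`, `zh`, and `en` element names, and the sample values are all hypothetical and may not match the real CESMP files. The sketch checks that the six required fields are present and collects Chinese-English segment pairs, the kind of material a parallel patent segment bank would hold.

```python
import xml.etree.ElementTree as ET

# Field names are taken from the abstract.
REQUIRED = ("date", "publication", "ipc", "title", "abstract", "claim")
OPTIONAL = ("cpc", "wipo", "originalapplicant", "currentowner")
ALIGNED = ("wipo", "title", "abstract", "claim")

# Hypothetical document: the <patent> root, the <segment>/<zh>/<en> layout,
# and every value below are assumptions, not the published CESMP schema.
SAMPLE = """<patent>
  <date>2020-01-01</date>
  <publication>CN000000000A</publication>
  <ipc>G06F</ipc>
  <title><segment><zh>一种装置</zh><en>A device</en></segment></title>
  <abstract>
    <segment><zh>本发明公开了一种装置。</zh><en>The invention discloses a device.</en></segment>
  </abstract>
  <claim>
    <segment><zh>一种装置,</zh><en>A device,</en></segment>
    <segment><zh>其特征在于。</zh><en>characterized in that.</en></segment>
  </claim>
</patent>"""


def missing_required(root):
    """Return the required field names that have no element under root."""
    return [name for name in REQUIRED if root.find(name) is None]


def aligned_pairs(root):
    """Collect (field, zh, en) tuples from the aligned fields that are present."""
    pairs = []
    for name in ALIGNED:
        for field in root.findall(name):
            for seg in field.findall("segment"):
                zh = (seg.findtext("zh") or "").strip()
                en = (seg.findtext("en") or "").strip()
                # Keep only segments where both language sides are non-empty.
                if zh and en:
                    pairs.append((name, zh, en))
    return pairs


root = ET.fromstring(SAMPLE)
print("missing required fields:", missing_required(root))
for field, zh, en in aligned_pairs(root):
    print(field, zh, en, sep="\t")
```

Because wipo is an optional field, the sketch skips it silently when it is absent instead of reporting it as missing.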