CENTAURUS: A Dynamic Parser Generator for Parallel Ad Hoc Data Extraction
Shigeyuki Sato, Hiroka Ihara, K. Taura
J. Inf. Process., published 2020-10-23. DOI: 10.2197/ipsjjip.28.724 (https://doi.org/10.2197/ipsjjip.28.724)
Handling large-scale data in text formats such as XML, JSON, and CSV is important because such data appear very often in data exchange. For these data, ad hoc data extraction is highly desirable as an alternative to ingesting them into a database. The main challenge in ad hoc data extraction is to provide both programmability, so that various types of data can be handled intuitively, and performance on large-scale data. To this end, we develop Centaurus, a dynamic parser generator library for parallel ad hoc data extraction. This paper presents the design and implementation of Centaurus. Experimental results on ad hoc data extraction demonstrate that Centaurus outperformed fast dedicated parser libraries in C++ for XML and JSON, and achieved excellent scalability with actions implemented in Python.
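To make the idea of ad hoc extraction concrete, the following is a minimal sketch in plain Python. It does not use Centaurus or reflect its API; every name in it (`extract_titles`, `chunked`, `parallel_extract`, the `title` field) is hypothetical. It assumes newline-delimited JSON, which makes splitting the input trivial, whereas a parser generator is needed precisely when the format cannot be split this simply. Threads are used only to keep the sketch short and portable.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def extract_titles(lines):
    """Action applied to one chunk: keep only the 'title' field of each record."""
    return [json.loads(line)["title"] for line in lines if line.strip()]


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_extract(text, chunk_size=2, workers=2):
    """Run the extraction action over chunks concurrently, preserving input order."""
    chunks = chunked(text.splitlines(), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the submitted chunks.
        results = list(pool.map(extract_titles, chunks))
    return [title for part in results for title in part]


# Hypothetical sample input: five JSON records, one per line.
sample = "\n".join(json.dumps({"id": i, "title": f"doc{i}"}) for i in range(5))
titles = parallel_extract(sample)
```

The extraction logic lives in an ordinary Python function applied directly to the text, with no schema or database load step, which is the programmability side of the trade-off the abstract describes.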