自动生成硬件组件知识库

Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems Pub Date : 2019-06-23 DOI:10.1145/3316482.3326344

Luke Hsiao, Sen Wu, Nicholas Chiang, C. Ré, P. Levis

{"title":"自动生成硬件组件知识库","authors":"Luke Hsiao, Sen Wu, Nicholas Chiang, C. Ré, P. Levis","doi":"10.1145/3316482.3326344","DOIUrl":null,"url":null,"abstract":"Hardware component databases are critical resources in designing embedded systems. Since generating these databases requires hundreds of thousands of hours of manual data entry, they are proprietary, limited in the data they provide, and have many random data entry errors. We present a machine-learning based approach for automating the generation of component databases directly from datasheets. Extracting data directly from datasheets is challenging because: (1) the data is relational in nature and relies on non-local context, (2) the documents are filled with technical jargon, and (3) the datasheets are PDFs, a format that decouples visual locality from locality in the document. The proposed approach uses a rich data model and weak supervision to address these challenges. We evaluate the approach on datasheets of three classes of hardware components and achieve an average quality of 75 F1 points which is comparable to existing human-curated knowledge bases. We perform two applications studies that demonstrate the extraction of multiple data modalities such as numerical properties and images. We show how different sources of supervision such as heuristics and human labels have distinct advantages which can be utilized together within a single methodology to automatically generate hardware component knowledge bases.","PeriodicalId":256029,"journal":{"name":"Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Automating the generation of hardware component knowledge bases\",\"authors\":\"Luke Hsiao, Sen Wu, Nicholas Chiang, C. Ré, P. Levis\",\"doi\":\"10.1145/3316482.3326344\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hardware component databases are critical resources in designing embedded systems. Since generating these databases requires hundreds of thousands of hours of manual data entry, they are proprietary, limited in the data they provide, and have many random data entry errors. We present a machine-learning based approach for automating the generation of component databases directly from datasheets. Extracting data directly from datasheets is challenging because: (1) the data is relational in nature and relies on non-local context, (2) the documents are filled with technical jargon, and (3) the datasheets are PDFs, a format that decouples visual locality from locality in the document. The proposed approach uses a rich data model and weak supervision to address these challenges. We evaluate the approach on datasheets of three classes of hardware components and achieve an average quality of 75 F1 points which is comparable to existing human-curated knowledge bases. We perform two applications studies that demonstrate the extraction of multiple data modalities such as numerical properties and images. We show how different sources of supervision such as heuristics and human labels have distinct advantages which can be utilized together within a single methodology to automatically generate hardware component knowledge bases.\",\"PeriodicalId\":256029,\"journal\":{\"name\":\"Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems\",\"volume\":\"9 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-06-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3316482.3326344\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3316482.3326344","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

硬件组件数据库是嵌入式系统设计的重要资源。由于生成这些数据库需要数十万小时的手工数据输入，因此它们是专有的，所提供的数据有限，并且有许多随机数据输入错误。我们提出了一种基于机器学习的方法，可以直接从数据表自动生成组件数据库。直接从数据表中提取数据具有挑战性，因为:(1)数据本质上是关系的，依赖于非本地上下文;(2)文档中充满了技术术语;(3)数据表是pdf格式，这种格式将文档中的可视化局部性与局部性分离开来。所提出的方法使用丰富的数据模型和弱监督来解决这些挑战。我们在三类硬件组件的数据表上评估了这种方法，并获得了75个F1点的平均质量，这与现有的人工管理知识库相当。我们进行了两个应用研究，展示了多种数据模式的提取，如数值属性和图像。我们展示了不同的监督来源(如启发式和人工标签)如何具有不同的优势，这些优势可以在单一方法中一起使用，以自动生成硬件组件知识库。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Automating the generation of hardware component knowledge bases

Hardware component databases are critical resources in designing embedded systems. Since generating these databases requires hundreds of thousands of hours of manual data entry, they are proprietary, limited in the data they provide, and have many random data entry errors. We present a machine-learning based approach for automating the generation of component databases directly from datasheets. Extracting data directly from datasheets is challenging because: (1) the data is relational in nature and relies on non-local context, (2) the documents are filled with technical jargon, and (3) the datasheets are PDFs, a format that decouples visual locality from locality in the document. The proposed approach uses a rich data model and weak supervision to address these challenges. We evaluate the approach on datasheets of three classes of hardware components and achieve an average quality of 75 F1 points which is comparable to existing human-curated knowledge bases. We perform two applications studies that demonstrate the extraction of multiple data modalities such as numerical properties and images. We show how different sources of supervision such as heuristics and human labels have distinct advantages which can be utilized together within a single methodology to automatically generate hardware component knowledge bases.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems

自引率

0.00%

发文量