Ferrets and topic maps: knowledge engineering for an analytical engine

Markup Languages. Pub Date: 2001-04-01. DOI: 10.1162/109966201317356371

J. Mason

Citations: 2

Abstract

The “Ferret” analytical engine, developed originally by the Y-12 National Security Complex of the U.S. Department of Energy to seek classified data and associations in documents and present its findings in the light of formal rules, requires a structured information base that represents not just individual facts but a set of implications and a collection of rules. The fundamental knowledge base is evolving towards forms that enhance flexibility and portability. The developers realized early that the knowledge base can be captured in XML by a series of trees that represent taxonomies, analytical structures, and specific indicative facts, but that a topic map is needed over these trees to express links across them. Above this, the classification rules could form another topic map that points into the lower layers. In its latest form, however, the knowledge base has come to be represented entirely in a topic map.

The “Ferret” engine combines sophisticated searching with rule-driven analysis and reporting. In its original application, the Ferret engine performs the equivalent of 5,000 simultaneous searches while reading documents at several thousand words per second. The analysis traces implications of concepts discovered in searching and applies the rules for interpreting implications and the actions to be taken when a significant piece of information is found. Because the topic maps that represent this knowledge can be switched easily, Ferret can be reprogrammed for many tasks, including selection and categorization, scanning of e-mail and newsfeeds, diagnostics, and query expansion, in addition to the original classification application.

Information Classification and the Origins of the Ferret System

When the Y-12 National Security Complex (Y-12), a manufacturing facility of the U.S. Department of Energy (DOE) in Oak Ridge, Tennessee, started developing tools to support its management of classified documents, it was faced with the task of capturing the knowledge of how to identify classified information. Once captured, such knowledge would have to be stored in a maintainable fashion that was also accessible to Ferret, the automated analytical tool that we had developed. The Ferret project team initially developed a knowledge base as part of the program development. Since this hand-built base was difficult for anyone other than the original developer to maintain, the team soon settled on a knowledge base in XML that depends on some familiar techniques, like tables and hierarchical trees, and adds to them an adaptation of the new techniques of topic maps (ISO/IEC 13250:2000). The knowledge base is now in transition to a topic map representation based on the XTM (XML Topic Map, www.topicmaps.org) specification. Since the original classification project, the applications for both the Ferret engine and the knowledge-engineering techniques have expanded.

Although Y-12 is no longer involved in the original function for which it was created as part of the Manhattan Project during World War II (the final enrichment of weapons-grade uranium), it has retained a major role in the making and maintaining of components for the U.S. thermonuclear stockpile. Accordingly, much of the information handled at the plant is classified and must be protected. Decisions about what is actually classified are made by DOE on a national basis and adapted to specific local situations by facilities like Y-12. Day-to-day classification decisions are made on the basis of this approved guidance by authorized derivative classifiers (ADCs), who form the front line of defense for classified information.

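
The layered knowledge base described in the abstract (XML trees for taxonomies, with a topic map over them to express links across the trees) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the element names, the node ids, the `made-of` link, and the automotive content are invented, and the markup is neither the Ferret schema nor XTM syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxonomy trees. Element names, attribute names, and node ids
# are invented for illustration; this is not the Ferret schema or XTM.
TREES_XML = """
<knowledge>
  <tree name="parts">
    <node id="engine">
      <node id="piston"/>
      <node id="crankshaft"/>
    </node>
  </tree>
  <tree name="materials">
    <node id="metal">
      <node id="aluminum-alloy"/>
    </node>
  </tree>
</knowledge>
"""

# Cross-tree links, standing in for the topic map layer over the trees.
# Each entry is (link type, source node id, target node id).
ASSOCIATIONS = [
    ("made-of", "piston", "aluminum-alloy"),
]


def build_parent_index(root):
    """Map each node id to its parent's id (None for the top of a tree)."""
    parents = {}
    for tree in root.findall("tree"):
        for parent in tree.iter("node"):
            for child in parent.findall("node"):
                parents[child.get("id")] = parent.get("id")
        for top in tree.findall("node"):
            parents.setdefault(top.get("id"), None)
    return parents


def ancestors(node_id, parents):
    """Return the chain of broader concepts above a node, nearest first."""
    chain = []
    current = parents.get(node_id)
    while current is not None:
        chain.append(current)
        current = parents.get(current)
    return chain


def implications(node_id, parents, associations):
    """Concepts implied by a hit: its ancestors, linked nodes, and their ancestors."""
    implied = set(ancestors(node_id, parents))
    for _kind, source, target in associations:
        if source == node_id:
            implied.add(target)
            implied.update(ancestors(target, parents))
    return implied


root = ET.fromstring(TREES_XML)
parents = build_parent_index(root)
print(sorted(implications("piston", parents, ASSOCIATIONS)))
```

In this toy model a hit on `piston` implies `engine` through its own tree and `aluminum-alloy` and `metal` through the cross-tree link, which is the kind of connection a single hierarchy cannot express on its own.
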
The first application of the Ferret engine, developed as a tool to support the ADCs in their work, reads electronic documents and highlights potentially classified passages, displaying along with each portion of text the proposed classification and the rules from the guidance that support the classification.

Although the work of the ADCs is grounded in the formal classification rules for identifying classified information, the practical application of those rules depends on much more detailed knowledge than is contained in the published guidance. Recognition of significant information depends on knowledge of the manufacturing process, the design of the products, and the properties of the materials of which the products are made. It also depends on an awareness of what decisions have been made in the past and what information is available to the general public at the unclassified level. Finally, the ADC must be able to draw inferences from the combined collection of information.

In addition to the details of product designs and manufacturing, the classification process must recognize numerous pieces of indirect information. Many parts and materials have been given codenames so that they can be discussed without revealing classified data. Elaborating on one of these codenames might constitute a breach of security. There are many specific facts, such as the inventories of certain materials and the rates at which they are used in manufacturing, that may be classified. Some facts are not themselves classified, but in combination they can add up to classified data. For example, mentioning a particular product in conjunction with certain buildings might reveal something of the product’s components if those buildings are known to process only certain materials. Mentioning a geometric attribute might imply the overall shape or configuration of a part.

General properties of materials, such as metals and plastics, constitute a large part of the knowledge.
Individually, most of the facts about materials (things that might be learned from any chemistry or physics text) are not classified. But in the particular context of Y-12’s products, these unclassified facts may suggest sensitive information. Part of the role of the ADCs, and thus of the Ferret system that supports them, is to recognize when such combinations have occurred in our context.

How Ferret Works: The Classified Automobile

Classification analysis is generally done by comparing the information in question to formal guidance that has been developed by the appropriate authorities. Guidance is usually written in terms of general concepts, such as the high-level design of our products and the materials used in them, that we need to protect. While some broad guidance is written in narrative form, most of the specific guidance is presented in tabular form. Each rule in a table states a condition to be evaluated and associates with it a resulting classification to be applied if the document under evaluation meets the condition in question. Frequently these rules form a series of conditions reflecting increasing detail to be sought in candidate documents and thus increasing levels of sensitivity and need for protection. If we were in the automotive industry, we might have classification rules that look something like the following table:
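
The table itself is not reproduced in this excerpt, and the sketch below does not attempt to reconstruct it. It only illustrates, in Python, the rule structure the text describes: each rule pairs a condition with a resulting classification, and a series of rules asks for increasing detail and yields increasing sensitivity. The rule numbers, level names, search terms, and the substring matching are all hypothetical and are not taken from the guidance or from the Ferret engine.

```python
from dataclasses import dataclass

# Hypothetical sensitivity levels, lowest to highest. These are invented
# labels for the automotive analogy, not actual classification markings.
LEVELS = ["public", "internal", "restricted"]

# Hypothetical search terms mapped to the concept each one indicates.
TERMS = {
    "new model": "new-model",
    "v8": "engine-type",
    "horsepower": "horsepower",
}


@dataclass(frozen=True)
class Rule:
    rule_id: str
    required: frozenset  # concepts that must all be present for the rule to apply
    level: str           # classification applied when the condition is met


# A series of rules with increasing detail and increasing sensitivity.
RULES = [
    Rule("1.0", frozenset({"new-model"}), "public"),
    Rule("1.1", frozenset({"new-model", "engine-type"}), "internal"),
    Rule("1.2", frozenset({"new-model", "engine-type", "horsepower"}), "restricted"),
]


def find_concepts(text):
    """Return the concepts whose terms occur in the text (naive substring match)."""
    lowered = text.lower()
    return {concept for term, concept in TERMS.items() if term in lowered}


def evaluate(concepts, rules=RULES):
    """Return (highest matching level, ids of all matching rules)."""
    matched = [rule for rule in rules if rule.required <= concepts]
    if not matched:
        return None, []
    top = max(matched, key=lambda rule: LEVELS.index(rule.level))
    return top.level, [rule.rule_id for rule in matched]


print(evaluate(find_concepts("The new model ships with a V8.")))
print(evaluate(find_concepts("A V8 is an engine.")))
```

The second call matches no rule because the engine type on its own is not sensitive in this toy rule set. Only the combination with the new model raises the level, which mirrors the point made earlier that individually unclassified facts can add up to classified data. Returning the matching rule ids alongside the level reflects the description of Ferret displaying the supporting rules with each proposed classification.
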
