设计一个机器学习管道,用于从纵向调查问卷中自动提取元数据

IASSIST quarterly Pub Date : 2022-03-28 DOI:10.29173/iq1023
Suparna De, Harry Moss, Jon Johnson, Jenny Li, Haeron Pereira, Sanaz Jabbari
{"title":"设计一个机器学习管道,用于从纵向调查问卷中自动提取元数据","authors":"Suparna De, Harry Moss, Jon Johnson, Jenny Li, Haeron Pereira, Sanaz Jabbari","doi":"10.29173/iq1023","DOIUrl":null,"url":null,"abstract":"Data Documentation Initiative-Lifecycle (DDI-L) introduced a robust metadata model to support the capture of questionnaire content and flow, and encouraged through support for versioning and provenancing, objects such as BasedOn for the reuse of existing question items. However, the dearth of questionnaire banks including both question text and response domains has meant that an ecosystem to support the development of DDI ready Computer Assisted Interviewing (CAI) tools has been limited. Archives hold the information in PDFs associated with surveys but extracting that in an efficient manner into DDI-Lifecycle is a significant challenge. \nWhile CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift. \nThis paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires as PDFs. Using CLOSER Discovery as a ‘training and test dataset’, a number of machine learning approaches have been explored to classify parsed text from questionnaires to be output as valid DDI items for inclusion in a DDI-L compliant repository. \nThe developed ML pipeline adopts a continuous build and integrate approach, with processes in place to keep track of various combinations of the structured DDI-L input metadata, ML models and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments.  Tangible outputs include a map of the various metadata and model parameters with the corresponding evaluation metrics’ values, which enable model tuning as well as transparent management of data and experiments.","PeriodicalId":84870,"journal":{"name":"IASSIST quarterly","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Engineering a machine learning pipeline for automating metadata extraction from longitudinal survey questionnaires\",\"authors\":\"Suparna De, Harry Moss, Jon Johnson, Jenny Li, Haeron Pereira, Sanaz Jabbari\",\"doi\":\"10.29173/iq1023\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data Documentation Initiative-Lifecycle (DDI-L) introduced a robust metadata model to support the capture of questionnaire content and flow, and encouraged through support for versioning and provenancing, objects such as BasedOn for the reuse of existing question items. However, the dearth of questionnaire banks including both question text and response domains has meant that an ecosystem to support the development of DDI ready Computer Assisted Interviewing (CAI) tools has been limited. Archives hold the information in PDFs associated with surveys but extracting that in an efficient manner into DDI-Lifecycle is a significant challenge. \\nWhile CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift. \\nThis paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires as PDFs. Using CLOSER Discovery as a ‘training and test dataset’, a number of machine learning approaches have been explored to classify parsed text from questionnaires to be output as valid DDI items for inclusion in a DDI-L compliant repository. \\nThe developed ML pipeline adopts a continuous build and integrate approach, with processes in place to keep track of various combinations of the structured DDI-L input metadata, ML models and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments.  Tangible outputs include a map of the various metadata and model parameters with the corresponding evaluation metrics’ values, which enable model tuning as well as transparent management of data and experiments.\",\"PeriodicalId\":84870,\"journal\":{\"name\":\"IASSIST quarterly\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IASSIST quarterly\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.29173/iq1023\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IASSIST quarterly","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.29173/iq1023","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

摘要

数据文档倡议生命周期(DDI-L)引入了一个强大的元数据模型来支持问卷内容和流的捕获,并通过支持版本控制和provenancing来鼓励使用BasedOn等对象来重用现有问题项。然而,缺乏包括问题文本和回答领域的问卷库,这意味着支持DDI计算机辅助面试(CAI)工具开发的生态系统受到了限制。档案保存与调查相关的PDF中的信息,但以有效的方式将其提取到DDI生命周期中是一个重大挑战。虽然CLOSER Discovery一直支持在DDI生命周期中提供高质量的问卷元数据,但这主要是手动完成的。需要探索更自动化的方法,以确保可扩展的元数据注释和提升。本文介绍了设计机器学习(ML)管道以自动从调查问卷中提取问题作为PDF的初步结果。使用CLOSER Discovery作为“训练和测试数据集”,已经探索了许多机器学习方法来将问卷中解析的文本分类为有效的DDI项目,以包含在符合DDI-L的存储库中。开发的ML管道采用了连续构建和集成的方法,有适当的流程来跟踪结构化DDI-L输入元数据、ML模型和模型参数的各种组合与定义的评估指标,从而实现实验的再现性和比较分析。有形输出包括各种元数据和模型参数的映射以及相应的评估度量值,这使得模型调整以及数据和实验的透明管理成为可能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Engineering a machine learning pipeline for automating metadata extraction from longitudinal survey questionnaires
Data Documentation Initiative-Lifecycle (DDI-L) introduced a robust metadata model to support the capture of questionnaire content and flow, and encouraged through support for versioning and provenancing, objects such as BasedOn for the reuse of existing question items. However, the dearth of questionnaire banks including both question text and response domains has meant that an ecosystem to support the development of DDI ready Computer Assisted Interviewing (CAI) tools has been limited. Archives hold the information in PDFs associated with surveys but extracting that in an efficient manner into DDI-Lifecycle is a significant challenge. While CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift. This paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires as PDFs. Using CLOSER Discovery as a ‘training and test dataset’, a number of machine learning approaches have been explored to classify parsed text from questionnaires to be output as valid DDI items for inclusion in a DDI-L compliant repository. The developed ML pipeline adopts a continuous build and integrate approach, with processes in place to keep track of various combinations of the structured DDI-L input metadata, ML models and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments.  Tangible outputs include a map of the various metadata and model parameters with the corresponding evaluation metrics’ values, which enable model tuning as well as transparent management of data and experiments.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信