Identifying business information through deep learning: analyzing the tender documents of an Internet-based logistics bidding platform

IF 1.5 | Zone 4, Computer Science | JCR Q3, COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Technologies and Applications | Pub Date: 2023-05-04 | DOI: 10.1108/dta-08-2022-0308

Yingwen Yu, Jing Ma

Citations: 0

Abstract

Purpose: Tender documents, an essential data source for Internet-based logistics tendering platforms, contain large amounts of fine-grained data, ranging from information on the tenderee to shipping locations and shipping items. Automated information extraction in this area is, however, under-researched, which leaves extraction a time-consuming and labor-intensive process. For Chinese logistics tender entities in particular, existing named entity recognition (NER) solutions are mostly unsuitable because these entities involve domain-specific terminology and have different semantic features.

Design/methodology/approach: To tackle this problem, this paper proposes a novel lattice long short-term memory (LSTM) model, combining a variant contextual feature representation with a conditional random field (CRF) layer, for identifying valuable entities in logistics tender documents. Instead of traditional word embeddings, the proposed model uses representations from the pretrained Bidirectional Encoder Representations from Transformers (BERT) model as input to augment the contextual feature representation. The Lattice-LSTM model then uses both character and word information to avoid segmentation errors.

Findings: The proposed model is verified on a Chinese logistics tender named entity corpus. The results suggest that it outperforms other mainstream NER models on this corpus. The model underpins the automatic extraction of logistics tender information, enabling logistics companies to perceive ever-changing market trends and make far-sighted logistics decisions.

Originality/value: (1) A practical model for logistics tender NER is proposed. By fine-tuning BERT on the downstream task with a small amount of data, the model performs better than other existing models in the experiments. To the best of the authors' knowledge, this is the first study to extract named entities from Chinese logistics tender documents. (2) A real logistics tender corpus for practical use is constructed, and a program that applies the model to process real logistics tender documents online is developed. The authors believe the model will help logistics companies convert unstructured documents into structured data and, in turn, perceive ever-changing market trends and make far-sighted logistics decisions.
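The abstract names two ideas that a small example can make concrete: a lattice that exposes every lexicon word overlapping a character sequence, and a CRF layer that selects the best tag sequence as a whole. The following is a minimal Python sketch of those two ideas only, not the authors' implementation. The BERT encoder and the LSTM are omitted, and the lexicon, the tag set, the scores, and the function names `lattice_spans` and `viterbi` are all hypothetical values chosen for illustration.

```python
from typing import Dict, List, Set, Tuple


def lattice_spans(chars: str, lexicon: Set[str], max_len: int = 4) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for every lexicon word of 2..max_len characters in chars."""
    spans = []
    for start in range(len(chars)):
        for end in range(start + 2, min(len(chars), start + max_len) + 1):
            word = chars[start:end]
            if word in lexicon:
                spans.append((start, end, word))
    return spans


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag path given per-character emission
    scores and tag-to-tag transition scores (missing transitions score 0)."""
    best = {tag: emissions[0][tag] for tag in tags}
    backpointers = []
    for scores in emissions[1:]:
        step, pointers = {}, {}
        for cur in tags:
            # Pick the previous tag that maximizes the score of reaching `cur`.
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            step[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best = step
        backpointers.append(pointers)
    # Walk the backpointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical lexicon and sentence: overlapping matches show why a single
# up-front word segmentation can be wrong.
LEXICON = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
SPANS = lattice_spans("南京市长江大桥", LEXICON)

# Hypothetical scores for a three-character input and a tiny BIO tag set.
TAGS = ["O", "B-LOC", "I-LOC"]
EMISSIONS = [
    {"O": 0.1, "B-LOC": 1.0, "I-LOC": 0.2},
    {"O": 0.3, "B-LOC": 0.1, "I-LOC": 0.9},
    {"O": 1.0, "B-LOC": 0.0, "I-LOC": 0.2},
]
TRANSITIONS = {("O", "I-LOC"): -10.0, ("B-LOC", "I-LOC"): 0.5}
PATH = viterbi(EMISSIONS, TRANSITIONS, TAGS)

print(SPANS)
print(PATH)
```

In this toy input the lattice contains both "市长" and "长江大桥", two readings that a single segmentation could not keep together; a lattice model sees both and lets the learned weights decide. The large negative score on the hypothetical O to I-LOC transition plays the role of a CRF constraint that discourages an entity from starting with an inside tag.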


Source journal

Data Technologies and Applications | Social Sciences: Library and Information Sciences

CiteScore

3.80

Self-citation rate

6.20%

Articles published

Journal description: Previously published as: Program. Online from: 2018. Subject area: Information & Knowledge Management, Library Studies.