Title: A Web Information Extraction Framework with Adaptive and Failure Prediction Feature
Authors: Sudhir Kumar Patnaik, C. Babu
Journal: ACM Journal of Data and Information Quality (JDIQ)
Publication date: 2022-02-18
DOI: 10.1145/3495008 (https://doi.org/10.1145/3495008)
Citation count: 0
Abstract
The amount of information available on the internet today requires effective information extraction and processing to offer hyper-personalized user experiences. Dynamic changes in website layout can defeat both traditional and machine learning-based extraction techniques, which makes it difficult for the technical community to keep up with such changes. Existing machine learning-based information extraction frameworks focus only on core extraction logic that is susceptible to website changes, and therefore lack core features such as proactive failure prediction and intelligent information extraction. The aim of this article is to build a robust and intelligent information extraction framework that can not only proactively predict website failure but also automatically extract information using deep-learning techniques, namely You Only Look Once (YOLO) and Long Short-Term Memory (LSTM) networks. Proactive detection with the LSTM detects the new location of the web page caused by layout changes and enables automatic extraction of information from the new web page. A real-world case using a retail website for intelligent information extraction and an offline experimentation environment are set up to demonstrate proactive failure prediction and automatic extraction, achieving high failure-prediction performance and high precision and recall for object detection and information extraction.
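The abstract reports precision and recall for information extraction but does not describe the evaluation protocol. As a rough illustration only, the sketch below applies the standard definitions at the field level, assuming each extracted record is compared with a hand-labeled ground-truth record by exact match. The function name `precision_recall` and the sample product fields are hypothetical and are not taken from the article.

```python
from typing import Dict, Tuple


def precision_recall(extracted: Dict[str, str], truth: Dict[str, str]) -> Tuple[float, float]:
    """Field-level precision and recall for one extracted record.

    Assumption: a field counts as correct only if its extracted value
    matches the ground-truth value exactly.
    """
    # True positives: extracted fields whose value equals the ground truth.
    true_positives = sum(1 for key, value in extracted.items() if truth.get(key) == value)
    # Precision: share of extracted fields that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of ground-truth fields that were correctly extracted.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical retail product record: the price was misread and the rating was missed.
truth = {"name": "Blue Kettle", "price": "19.99", "rating": "4.5"}
extracted = {"name": "Blue Kettle", "price": "18.99"}

precision, recall = precision_recall(extracted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here one of the two extracted fields is correct, and one of the three ground-truth fields is recovered, so a layout change that moves or hides a field would show up as a drop in recall before precision is affected.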