Christina Antoniou, Kalliopi Kravari, Nick Bassiliades
{"title":"Semantic requirements construction using ontologies and boilerplates","authors":"Christina Antoniou, Kalliopi Kravari, Nick Bassiliades","doi":"10.1016/j.datak.2024.102323","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102323","url":null,"abstract":"<div><p>This paper presents a combination of an ontology and boilerplates, which are requirements templates for the syntactic structure of individual requirements that try to alleviate the problem of ambiguity caused by the use of natural language and make it easier for inexperienced engineers to create requirements. However, the use of boilerplates still restricts the use of natural language only syntactically and not semantically. Boilerplates consist of fixed elements and attributes. Using ontologies restricts the vocabulary of the words used in the requirements boilerplates to entities, their properties and entity relationships that are semantically meaningful to the application domain, thus leading to fewer errors. In this work we combine the advantages of boilerplates and ontologies. Usually, the attributes of boilerplates are completed with the help of the ontology. The contribution of this paper is that entire boilerplates are stored in the ontology, based on the fact that RDF triples have similar syntax to the boilerplate syntax, so that attributes and fixed elements are part of the ontology. This combination helps to construct semantically and syntactically correct requirements. The contribution and novelty of our method is that we exploit the natural language syntax of boilerplates by mapping them to Resource Description Framework triples, which also have a linguistic nature. 
In this paper we present the development of a domain-specific ontology as well as a minimal set of boilerplates for a specific application domain, namely that of engineering software for an ATM, while maintaining flexibility on the one hand and generality on the other.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"152 ","pages":"Article 102323"},"PeriodicalIF":2.5,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141289760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robert Buchmann, Johann Eder, Hans-Georg Fill, Ulrich Frank, Dimitris Karagiannis, Emanuele Laurenzi, John Mylopoulos, Dimitris Plexousakis, Maribel Yasmina Santos
{"title":"Large language models: Expectations for semantics-driven systems engineering","authors":"Robert Buchmann, Johann Eder, Hans-Georg Fill, Ulrich Frank, Dimitris Karagiannis, Emanuele Laurenzi, John Mylopoulos, Dimitris Plexousakis, Maribel Yasmina Santos","doi":"10.1016/j.datak.2024.102324","DOIUrl":"10.1016/j.datak.2024.102324","url":null,"abstract":"<div><p>The hype of Large Language Models manifests in disruptions, expectations or concerns in scientific communities that have focused for a long time on design-oriented research. The current experiences with Large Language Models and associated products (e.g. ChatGPT) lead to diverse positions regarding the foreseeable evolution of such products from the point of view of scholars who have been working with designed abstractions for most of their careers - typically relying on deterministic design decisions to ensure systems and automation reliability. Such expectations are collected in this paper in relation to a flavor of systems engineering that relies on explicit knowledge structures, introduced here as “semantics-driven systems engineering”.</p><p>The paper was motivated by the panel discussion that took place at CAiSE 2023 in Zaragoza, Spain, during the workshop on Knowledge Graphs for Semantics-driven Systems Engineering (KG4SDSE). The workshop brought together Conceptual Modeling researchers with an interest in specific applications of Knowledge Graphs and the semantic enrichment benefits they can bring to systems engineering. 
The panel context and consensus are summarized at the end of the paper, preceded by a proposed research agenda considering the expressed positions.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"152 ","pages":"Article 102324"},"PeriodicalIF":2.5,"publicationDate":"2024-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141134203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gabriele Scaffidi Militone, Daniele Apiletti, Giovanni Malnati
{"title":"Hermes, a low-latency transactional storage for binary data streams from remote devices","authors":"Gabriele Scaffidi Militone, Daniele Apiletti, Giovanni Malnati","doi":"10.1016/j.datak.2024.102315","DOIUrl":"10.1016/j.datak.2024.102315","url":null,"abstract":"<div><p>In many contexts where data is streamed on a large scale, such as video surveillance systems, there is a dual requirement: secure data storage and continuous access to audio and video content by third parties, such as human operators or specific business logic, even while the media files are still being collected. However, using transactions to ensure data persistence often limits system throughput and latency. This paper presents a solution that enables both high ingestion rates with transactional data persistence and near real-time, low-latency access to the stream during collection. This immediate access enables the prompt application of specialized data engineering algorithms during data acquisition. The proposed solution is particularly suitable for binary data sources such as audio and video recordings in surveillance systems, and it can be extended to various big data scenarios via well-defined general interfaces. The scalability of the approach is based on the microservice architecture. 
Preliminary results obtained with Apache Kafka and MongoDB replica sets show that the proposed solution provides up to 3 times higher throughput and 2.2 times lower latency compared to standard multi-document transactions.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102315"},"PeriodicalIF":2.5,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141042236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing fuzzy semantics of reviews for multi-criteria recommendations","authors":"Navreen Kaur Boparai, Himanshu Aggarwal, Rinkle Rani","doi":"10.1016/j.datak.2024.102314","DOIUrl":"10.1016/j.datak.2024.102314","url":null,"abstract":"<div><p>Hotel reviews play a vital role in tourism recommender systems. They should be analyzed effectively to enhance the accuracy of recommendations, which can be generated either from crisp ratings on a fixed scale or from the real sentiments of reviews. But crisp ratings cannot represent the actual feelings of reviewers. Existing tourism recommender systems mostly recommend hotels on the basis of vague and sparse ratings, resulting in inaccurate recommendations or preferences for online users. This paper presents a semantic approach to analyze online reviews crawled from tripadvisor.in. It discovers the underlying fuzzy semantics of reviews with respect to the multiple criteria of hotels rather than using the crisp ratings. The crawled reviews are preprocessed via data cleaning steps such as stopword and punctuation removal, tokenization, lemmatization, and POS tagging to understand the semantics efficiently. Nouns representing frequent features of hotels are extracted from the pre-processed reviews and are further used to identify opinion phrases. Fuzzy weights are derived from the normalized frequency of frequent nouns and combined with the sentiment score of all the synonyms of adjectives in the identified opinion phrases. This results in fuzzy semantics which form an ideal representation of reviews for a multi-criteria tourism recommender system. The proposed work is implemented in Python by crawling the recent reviews of Jaipur hotels from TripAdvisor and analyzing their semantics. The resultant fuzzy semantics form a dataset of reviews manually tagged with the sentiments of the identified aspects. Experimental results show an improved sentiment score when all the synonyms of adjectives are considered. 
The results are further used to fine-tune BERT models to form encodings for a query-based recommender system. The proposed approach can help tourism and hospitality service providers to take advantage of such sentiment analysis to examine the negative comments or unpleasant experiences of tourists and make appropriate improvements. Moreover, it will help online users to get better recommendations while planning their trips.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"152 ","pages":"Article 102314"},"PeriodicalIF":2.5,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141034319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Business intelligence and cognitive loads: Proposition of a dashboard adoption model","authors":"Corentin Burnay, Mathieu Lega, Sarah Bouraga","doi":"10.1016/j.datak.2024.102310","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102310","url":null,"abstract":"<div><p>Decision makers in organizations strive to improve the quality of their decisions. One way to improve that process is to objectify the decisions with facts. Data-driven Decision Support Systems (data-driven DSS), and more specifically business intelligence (BI), intend to achieve this. Organizations invest massively in the development of BI data-driven DSS and expect them to be adopted and to effectively support decision makers. This raises many technical and methodological challenges, especially regarding the design of BI dashboards, which can be seen as the visible tip of the BI data-driven DSS iceberg and which play a major role in the adoption of the entire system. In this paper, the dashboard content is investigated as one possible root cause for BI data-driven DSS dashboard adoption or rejection through early empirical research. More precisely, this work is composed of three parts. In the first part, the concept of cognitive loads is studied in the context of BI dashboards and the informational, the representational and the non-informational loads are introduced. In the second part, the effects of these loads on the adoption of BI dashboards are studied through an experiment with 167 respondents and a Structural Equation Modeling (SEM) analysis. The result is a Dashboard Adoption Model, enriching the seminal Technology Acceptance Model with new content-oriented variables to support the design of more supportive BI data-driven DSS dashboards. 
Finally, in the third part, a set of indicators is proposed to help dashboard designers monitor the loads of their dashboards in practice.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"152 ","pages":"Article 102310"},"PeriodicalIF":2.5,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140951807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine learning for predicting off-block delays: A case study at Paris — Charles de Gaulle International Airport","authors":"Thibault Falque, Bertrand Mazure, Karim Tabia","doi":"10.1016/j.datak.2024.102303","DOIUrl":"10.1016/j.datak.2024.102303","url":null,"abstract":"<div><p>Punctuality is a sensitive issue in large airports and hubs for passenger experience and for controlling operational costs. This paper presents a real and challenging problem of predicting and explaining flight off-block delays. We study the case of Paris Charles de Gaulle International Airport (Paris-CDG), starting from the specificities of this problem at Paris-CDG, continuing with the proposed models and solutions, and ending with the analysis of the results on real data covering an entire year of activity. The proof of concept provided in this paper allows us to believe that the proposed approach could help improve the management of delays and reduce the impact of the resulting consequences.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"152 ","pages":"Article 102303"},"PeriodicalIF":2.5,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000272/pdfft?md5=ff8c7468240914b3ce61469a0954468c&pid=1-s2.0-S0169023X24000272-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141043841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"To prompt or not to prompt: Navigating the use of Large Language Models for integrating and modeling heterogeneous data","authors":"Adel Remadi, Karim El Hage, Yasmina Hobeika, Francesca Bugiotti","doi":"10.1016/j.datak.2024.102313","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102313","url":null,"abstract":"<div><p>Manually integrating data of diverse formats and languages is vital to many artificial intelligence applications. However, the task itself remains challenging and time-consuming. This paper highlights the potential of Large Language Models (LLMs) to streamline data extraction and resolution processes. Our approach aims to address the ongoing challenge of integrating heterogeneous data sources, encouraging advancements in the field of data engineering. Applied to the specific use case of learning disorders in higher education, our research demonstrates LLMs’ capability to effectively extract data from unstructured sources. It is then further highlighted that LLMs can enhance data integration by providing the ability to resolve entities originating from multiple data sources. Crucially, the paper underscores the necessity of preliminary data modeling decisions to ensure the success of such technological applications. 
By merging human expertise with LLM-driven automation, this study advocates for the further exploration of semi-autonomous data engineering pipelines.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"152 ","pages":"Article 102313"},"PeriodicalIF":2.5,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000375/pdfft?md5=11ee9c76542d55fac49075892a9a8c7d&pid=1-s2.0-S0169023X24000375-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140918204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Berlin SPARQL benchmark to evaluate virtual SPARQL endpoints over relational databases","authors":"Milos Chaloupka, Martin Necasky","doi":"10.1016/j.datak.2024.102309","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102309","url":null,"abstract":"<div><p>The RDF is a popular and well-documented format for publishing structured data on the web. It enables data to be consumed without the knowledge of how the data is internally stored. There are already several native RDF storage solutions that provide a SPARQL endpoint. However, native RDF stores are not widely adopted. It is still more common to store data in a relational database. One of the useful features of native RDF storage solutions is providing a SPARQL endpoint, a web service to query RDF data with SPARQL. To provide this feature also on top of prevalent relational databases, solutions for virtual SPARQL endpoints on top of a relational database have appeared. To benchmark these solutions, a state-of-the-art tool, the Berlin SPARQL Benchmark (BSBM), is used. However, BSBM was designed primarily to benchmark native RDF stores. It can also be used to benchmark solutions for virtual SPARQL endpoints. However, since BSBM was not designed for virtual SPARQL endpoints, each implementation uses that tool differently for evaluation. As a result, the evaluation is not consistent and therefore hardly comparable. 
In this paper, we demonstrate how this well-defined benchmarking tool for SPARQL endpoints can be used to evaluate virtual endpoints over relational databases, perform the evaluation on the available implementations, and provide instructions on how to repeat the same evaluation in the future.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"152 ","pages":"Article 102309"},"PeriodicalIF":2.5,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140905621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saman Jamshidi, Mahin Mohammadi, Saeed Bagheri, Hamid Esmaeili Najafabadi, Alireza Rezvanian, Mehdi Gheisari, Mustafa Ghaderzadeh, Amir Shahab Shahabi, Zongda Wu
{"title":"Effective text classification using BERT, MTM LSTM, and DT","authors":"Saman Jamshidi, Mahin Mohammadi, Saeed Bagheri, Hamid Esmaeili Najafabadi, Alireza Rezvanian, Mehdi Gheisari, Mustafa Ghaderzadeh, Amir Shahab Shahabi, Zongda Wu","doi":"10.1016/j.datak.2024.102306","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102306","url":null,"abstract":"<div><p>Text classification plays a critical role in managing large volumes of electronically produced texts. As the number of such texts increases, manual analysis becomes impractical, necessitating an intelligent approach for processing information. Deep learning models have witnessed widespread application in text classification, including the use of recurrent neural networks like Many to One Long Short-Term Memory (MTO LSTM). Nonetheless, this model is limited by its reliance on only the last token for text labelling. To overcome this limitation, this study introduces a novel hybrid model that combines Bidirectional Encoder Representations from Transformers (BERT), Many To Many Long Short-Term Memory (MTM LSTM), and Decision Templates (DT) for text classification. In this new model, the text is first embedded using the BERT model and then trained using MTM LSTM to approximate the target at each token. Finally, the approximations are fused using DT. The proposed model is evaluated using the well-known IMDB movie review dataset for binary classification and Drug Review Dataset for multiclass classification. The results demonstrate superior performance in terms of accuracy, recall, precision, and F1 score compared to previous models. 
The hybrid model presented in this study holds significant potential for a wide range of text classification tasks and stands as a valuable contribution to the field.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"151 ","pages":"Article 102306"},"PeriodicalIF":2.5,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140825257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
José Antonio García-Díaz, Ghassan Beydoun, Rafel Valencia-García
{"title":"Evaluating Transformers and Linguistic Features integration for Author Profiling tasks in Spanish","authors":"José Antonio García-Díaz, Ghassan Beydoun, Rafel Valencia-García","doi":"10.1016/j.datak.2024.102307","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102307","url":null,"abstract":"<div><p>Author profiling consists of extracting authors' demographic and psychographic information by examining their writings. This information can then be used to improve the reader experience and to detect bots or propagators of hoaxes and/or hate speech. Therefore, author profiling can be applied to build more robust and efficient Knowledge-Based Systems for tasks such as content moderation, user profiling, and information retrieval. Author profiling is typically performed automatically as a document classification task. Recently, language models based on transformers have also proven to be quite effective in this task. However, the size and heterogeneity of novel language models make it necessary to evaluate them in context. The contributions we make in this paper are four-fold: First, we evaluate which language models are best suited to perform author profiling in Spanish. These experiments include basic, distilled, and multilingual models. Second, we evaluate how feature integration can improve performance for this task. We evaluate two distinct strategies: knowledge integration and ensemble learning. Third, we evaluate the ability of linguistic features to improve the interpretability of the results. Fourth, we evaluate the performance of each language model in terms of memory, training, and inference times. Our results indicate that the use of lightweight models can indeed achieve similar performance to heavy models and that multilingual models are actually less effective than models trained with one language. 
Finally, we confirm that the best models and strategies for integrating features ultimately depend on the context of the task.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"151 ","pages":"Article 102307"},"PeriodicalIF":2.5,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000314/pdfft?md5=42a482dbed2e2a640c46e89a6f3a69c8&pid=1-s2.0-S0169023X24000314-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140825258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}