{"title":"Fast Radius Outlier Filter Variant for Large Point Clouds","authors":"Péter Szutor, Marianna Zichar","doi":"10.3390/data8100149","DOIUrl":"https://doi.org/10.3390/data8100149","url":null,"abstract":"Currently, several devices (such as laser scanners, Kinect, time of flight cameras, medical imaging equipment (CT, MRI, intraoral scanners)), and technologies (e.g., photogrammetry) are capable of generating 3D point clouds. Each point cloud type has its unique structure or characteristics, but they have a common point: they may be loaded with errors. Before further data processing, these unwanted portions of the data must be removed with filtering and outlier detection. There are several algorithms for detecting outliers, but their performances decrease when the size of the point cloud increases. The industry has a high demand for efficient algorithms to deal with large point clouds. The most commonly used algorithm is the radius outlier filter (ROL or ROR), which has several improvements (e.g., statistical outlier removal, SOR). Unfortunately, this algorithm is also limited since it is slow on a large number of points. This paper introduces a novel algorithm, based on the idea of the ROL filter, that finds outliers in huge point clouds while its time complexity is not exponential. 
As a result of the linear complexity, the algorithm can handle extra large point clouds, and the effectiveness of this is demonstrated in several tests.","PeriodicalId":36824,"journal":{"name":"Data","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135895712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Data Storage, Scalability, and Availability in Blockchain Systems: A Bibliometric Analysis","authors":"Meenakshi Kandpal, Veena Goswami, Rojalina Priyadarshini, Rabindra Kumar Barik","doi":"10.3390/data8100148","DOIUrl":"https://doi.org/10.3390/data8100148","url":null,"abstract":"In recent years, blockchain research has drawn attention from all across the world. It is a decentralized competence that is spread out and uncertain. Several nations and scholars have already successfully applied blockchain in numerous arenas. Blockchain is essential in delicate situations because it secures data and keeps it from being altered or forged. In addition, the market’s increased demand for data is driving demand for data scaling across all industries. Researchers from many nations have used blockchain in various sectors over time, thus bringing extreme focus to this newly escalating blockchain domain. Every research project begins with in-depth knowledge about the working domain, and new interest information about blockchain is quite scattered. This study analyzes academic literature on blockchain technology, emphasizing three key aspects: blockchain storage, scalability, and availability. These are critical areas within the broader field of blockchain technology. This study employs CiteSpace and VOSviewer to understand the current state of research in these areas comprehensively. These are bibliometric analysis tools commonly used in academic research to examine patterns and relationships within scientific literature. Thus, to visualize a way to store data with scalability and availability while keeping the security of the blockchain in sync, the required research has been performed on the storage, scalability, and availability of data in the blockchain environment. 
The ultimate goal is to contribute to developing secure and efficient data storage solutions within blockchain technology.","PeriodicalId":36824,"journal":{"name":"Data","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135896045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Retinal Oct-Angiography and Cardiovascular STAtus (RASTA) Dataset of Swept-Source Microvascular Imaging for Cardiovascular Risk Assessment","authors":"Clément Germanèse, Fabrice Meriaudeau, Pétra Eid, Ramin Tadayoni, Dominique Ginhac, Atif Anwer, Steinberg Laure-Anne, Charles Guenancia, Catherine Creuzot-Garcher, Pierre-Henry Gabrielle, Louis Arnould","doi":"10.3390/data8100147","DOIUrl":"https://doi.org/10.3390/data8100147","url":null,"abstract":"In the context of exponential demographic growth, the imbalance between human resources and public health problems impels us to envision other solutions to the difficulties faced in the diagnosis, prevention, and large-scale management of the most common diseases. Cardiovascular diseases represent the leading cause of morbidity and mortality worldwide. A large-scale screening program would make it possible to promptly identify patients with high cardiovascular risk in order to manage them adequately. Optical coherence tomography angiography (OCT-A), as a window into the state of the cardiovascular system, is a rapid, reliable, and reproducible imaging examination that enables the prompt identification of at-risk patients through the use of automated classification models. One challenge that limits the development of computer-aided diagnostic programs is the small number of open-source OCT-A acquisitions available. To facilitate the development of such models, we have assembled a set of images of the retinal microvascular system from 499 patients. It consists of 814 angiocubes as well as 2005 en face images. Angiocubes were captured with a swept-source OCT-A device of patients with varying overall cardiovascular risk. To the best of our knowledge, our dataset, Retinal oct-Angiography and cardiovascular STAtus (RASTA), is the only publicly available dataset comprising such a variety of images from healthy and at-risk patients. 
This dataset will enable the development of generalizable models for screening cardiovascular diseases from OCT-A retinal images.","PeriodicalId":36824,"journal":{"name":"Data","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135425069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synthetic Data Generation for Data Envelopment Analysis","authors":"Andrey V. Lychev","doi":"10.3390/data8100146","DOIUrl":"https://doi.org/10.3390/data8100146","url":null,"abstract":"The paper is devoted to the problem of generating artificial datasets for data envelopment analysis (DEA), which can be used for testing DEA models and methods. In particular, the papers that applied DEA to big data often used synthetic data generation to obtain large-scale datasets because real datasets of large size, available in the public domain, are extremely rare. This paper proposes the algorithm which takes as input some real dataset and complements it by artificial efficient and inefficient units. The generation process extends the efficient part of the frontier by inserting artificial efficient units, keeping the original efficient frontier unchanged. For this purpose, the algorithm uses the assurance region method and consistently relaxes weight restrictions during the iterations. This approach produces synthetic datasets that are closer to real ones, compared to other algorithms that generate data from scratch. The proposed algorithm is applied to a pair of small real-life datasets. As a result, the datasets were expanded to 50K units. Computational experiments show that artificially generated DMUs preserve isotonicity and do not increase the collinearity of the original data as a whole.","PeriodicalId":36824,"journal":{"name":"Data","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135538398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention-Based Human Age Estimation from Face Images to Enhance Public Security","authors":"Md. Ashiqur Rahman, Shuhena Salam Aonty, Kaushik Deb, Iqbal H. Sarker","doi":"10.3390/data8100145","DOIUrl":"https://doi.org/10.3390/data8100145","url":null,"abstract":"Age estimation from facial images has gained significant attention due to its practical applications such as public security. However, one of the major challenges faced in this field is the limited availability of comprehensive training data. Moreover, due to the gradual nature of aging, similar-aged faces tend to share similarities despite their race, gender, or location. Recent studies on age estimation utilize convolutional neural networks (CNN), treating every facial region equally and disregarding potentially informative patches that contain age-specific details. Therefore, an attention module can be used to focus extra attention on important patches in the image. In this study, tests are conducted on different attention modules, namely CBAM, SENet, and Self-attention, implemented with a convolutional neural network. The focus is on developing a lightweight model that requires a low number of parameters. A merged dataset and other cutting-edge datasets are used to test the proposed model’s performance. In addition, transfer learning is used alongside the scratch CNN model to achieve optimal performance more efficiently. 
Experimental results on different aging face databases show the remarkable advantages of the proposed attention-based CNN model over the conventional CNN model by attaining the lowest mean absolute error and the lowest number of parameters with a better cumulative score.","PeriodicalId":36824,"journal":{"name":"Data","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135816165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Potential Range Map Dataset of Indian Birds","authors":"Arpit Deomurari, Ajay Sharma, Dipankar Ghose, Randeep Singh","doi":"10.3390/data8090144","DOIUrl":"https://doi.org/10.3390/data8090144","url":null,"abstract":"Conservation management heavily relies on accurate species distribution data. However, distributional information for most species is limited to distributional range maps, which may lack the resolution needed to take conservation action or assess current distribution status. In many cases, distribution maps are difficult to access in proper data formats for analysis and conservation planning of species. In this study, we addressed this issue by developing Species Distribution Models (SDMs) that integrate species presence data from various citizen science initiatives. This allowed us to systematically construct current distribution maps for 1091 bird species across India. To create these SDMs, we used MaxEnt 3.4.4 (Maximum Entropy) as the base for species distribution modelling and combined it with multiple citizen science datasets containing information on species occurrence and 29 environmental variables. Using this method, we were able to estimate species distribution maps at both a national scale and a high spatial resolution of 1 km². Thus, the results of our study provide current species distribution maps for 968 bird species found in India. These maps significantly improve our knowledge of the geographic distribution of about 75% of India’s bird species and are essential for addressing spatial knowledge gaps for conservation issues. 
Additionally, by superimposing the distribution maps of different species, we can locate hotspots for bird diversity and align conservation action.","PeriodicalId":36824,"journal":{"name":"Data","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136153226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Odd Beta Prime-Burr X Distribution with Applications to Petroleum Rock Sample Data and COVID-19 Mortality Rate","authors":"Ahmad Abubakar Suleiman, Hanita Daud, Narinderjit Singh Sawaran Singh, Aliyu Ismail Ishaq, Mahmod Othman","doi":"10.3390/data8090143","DOIUrl":"https://doi.org/10.3390/data8090143","url":null,"abstract":"In this article, we pioneer a new Burr X distribution using the odd beta prime generalized (OBP-G) family of distributions called the OBP-Burr X (OBPBX) distribution. The density function of this model is symmetric, left-skewed, right-skewed, and reversed-J, while the hazard function is monotonically increasing, decreasing, bathtub, and N-shaped, making it suitable for modeling skewed data and failure rates. Various statistical properties of the new model are obtained, such as moments, moment-generating function, entropies, quantile function, and limit behavior. The maximum-likelihood-estimation procedure is utilized to determine the parameters of the model. A Monte Carlo simulation study is implemented to ascertain the efficiency of maximum-likelihood estimators. The findings demonstrate the empirical application and flexibility of the OBPBX distribution, as showcased through its analysis of petroleum rock samples and COVID-19 mortality data, along with its superior performance compared to well-known extended versions of the Burr X distribution. 
We anticipate that the new distribution will attract a wider readership and provide a vital tool for modeling various phenomena in different domains.","PeriodicalId":36824,"journal":{"name":"Data","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135063250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Update of Dietary Supplement Label Database Addressing on Coding in Italy","authors":"Giorgia Perelli, Roberta Bernini, Massimo Lucarini, Alessandra Durazzo","doi":"10.3390/data8090142","DOIUrl":"https://doi.org/10.3390/data8090142","url":null,"abstract":"Harmonized composition data for foods and dietary supplements are needed for research and for policy decision making. For a correct assessment of dietary intake, the categorization and the classification of food products and dietary supplements are necessary. In recent decades, the marketing of dietary supplements has increased. A food supplements-based database has, as a principal feature, an intrinsic dynamism related to the continuous changes in formulations, which consequently leads to the need for constant monitoring of the market and for regular updates of the database. This study presents an update to the Dietary Supplement Label Database in Italy focused on dietary supplements coding. The updated dataset, presented here for the first time, consists of the codes of 216 dietary supplements currently on the market in Italy that have functional foods as their characterizing ingredients, according to the two most commonly used description and classification systems: LanguaL™ and FoodEx2. This update represents a unique tool and guideline for other compilers and users for applying classification coding systems to dietary supplements. 
Moreover, this updated dataset represents a valuable resource for several applications such as epidemiological investigations, exposure studies, and dietary assessment.","PeriodicalId":36824,"journal":{"name":"Data","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135740300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dataset of Multi-Aspect Integrated Migration Indicators","authors":"Diletta Goglia, Laura Pollacci, Alina Sîrbu","doi":"10.3390/data8090139","DOIUrl":"https://doi.org/10.3390/data8090139","url":null,"abstract":"Nowadays, new branches of research are proposing the use of non-traditional data sources for the study of migration trends in order to find an original methodology to answer open questions about cross-border human mobility. New knowledge extracted from these data must be validated using traditional data, which are however distributed across different sources and difficult to integrate. In this context we present the Multi-aspect Integrated Migration Indicators (MIMI) dataset, a new dataset of migration indicators (flows and stocks) and possible migration drivers (cultural, economic, demographic and geographic indicators). This was obtained through acquisition, transformation and integration of disparate traditional datasets together with social network data from Facebook (Social Connectedness Index). This article describes the process of gathering, embedding and merging traditional and novel variables, resulting in this new multidisciplinary dataset that we believe could significantly contribute to nowcast/forecast bilateral migration trends and migration drivers.","PeriodicalId":36824,"journal":{"name":"Data","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135890427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GUIDO: A Hybrid Approach to Guideline Discovery & Ordering from Natural Language Texts","authors":"Nils Freyer, Dustin Thewes, Matthias Meinecke","doi":"10.5220/0012084400003541","DOIUrl":"https://doi.org/10.5220/0012084400003541","url":null,"abstract":"Extracting workflow nets from textual descriptions can be used to simplify guidelines or formalize textual descriptions of formal processes like business processes and algorithms. The task of manually extracting processes, however, requires domain expertise and effort. While automatic process model extraction is desirable, annotating texts with formalized process models is expensive. Therefore, there are only a few machine-learning-based extraction approaches. Rule-based approaches, in turn, require domain specificity to work well and can rarely distinguish relevant and irrelevant information in textual descriptions. In this paper, we present GUIDO, a hybrid approach to the process model extraction task that first classifies sentences regarding their relevance to the process model, using a BERT-based sentence classifier, and second extracts a process model from the sentences classified as relevant, using dependency parsing. The presented approach achieves significantly better results than a pure rule-based approach. GUIDO achieves an average behavioral similarity score of 0.93. 
Still, in comparison to purely machine-learning-based approaches, the annotation costs stay low.","PeriodicalId":36824,"journal":{"name":"Data","volume":"1 1","pages":"335-342"},"PeriodicalIF":2.6,"publicationDate":"2023-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46354325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}