Big Data Research; Pub Date: 2024-06-13; DOI: 10.1016/j.bdr.2024.100480
Peng Wang, Jiang Li, Yingli Wang, Youchun Liu, Yu Zhang
{"title":"Assessment of soil fertility in Xinjiang oasis cotton field based on big data techniques","authors":"Peng Wang , Jiang Li , Yingli Wang , Youchun liu , Yu Zhang","doi":"10.1016/j.bdr.2024.100480","DOIUrl":"10.1016/j.bdr.2024.100480","url":null,"abstract":"<div><p>Assessing soil fertility through traditional methods has faced challenges due to the vast amount of meteorological data and the complexity of heterogeneous data. In this study, we address these challenges by employing the K-means algorithm for cluster analysis on soil fertility data and developing a novel K-means algorithm within the Hadoop framework. Our research aims to provide a comprehensive analysis of soil fertility in the Shihezi region, particularly in the context of oasis cotton fields, leveraging big data techniques. The methodology involves utilizing soil nutrient data from 29 sampling points across six round fields in 2022. Through K-means clustering with varying K values, we determine that setting K to 3 yields optimal cluster effects, aligning closely with the actual soil fertility distribution. Furthermore, we compare the performance of our proposed K-means algorithm under the MapReduce framework with traditional serial K-means algorithms, demonstrating significant improvements in operational speed and successful completion of large-scale data computations. Our findings reveal that soil fertility in the Shihezi region can be classified into four distinct grades, providing valuable insights for agricultural practices and land management strategies. 
This classification contributes to a better understanding of soil resources in oasis cotton fields and facilitates informed decision-making processes for farmers and policymakers alike.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37","pages":"Article 100480"},"PeriodicalIF":3.3,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141393452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
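The abstract above describes K-means clustering run under the MapReduce framework. As a rough illustration only, the sketch below mimics one common way to split a K-means iteration into a map step (assign each point to its nearest centroid) and a reduce step (average the points per cluster). It runs in a single process rather than on Hadoop, and the function names, sample values, and initial centroids are all hypothetical; none of them come from the paper.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_phase(points, centroids):
    """Map step: emit one (cluster index, point) pair per input point."""
    return [(nearest(p, centroids), p) for p in points]


def reduce_phase(pairs, centroids):
    """Reduce step: average the points emitted for each cluster index."""
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    updated = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            updated.append(old)  # an empty cluster keeps its previous centroid
        else:
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return updated


def kmeans(points, centroids, max_iter=100):
    """Repeat map and reduce until the centroids stop changing."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


# Hypothetical two-feature samples (assumed values, not data from the study).
samples = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (9.0, 9.0), (9.2, 8.8)]
centroids, labels = kmeans(samples, [samples[0], samples[2], samples[4]])
print(labels)
```

In a real MapReduce job the map step would run in parallel over data partitions and the reduce step would receive the pairs grouped by cluster index; the loop over iterations would be driven by a job controller.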
Big Data Research; Pub Date: 2024-06-13; DOI: 10.1016/j.bdr.2024.100475
Shuo Wang, Xiang Yu, Dan Zhao, Guocai Ma, Wei Ren, Shuxin Duan
{"title":"Intelligent geological interpretation of AMT data based on machine learning","authors":"Shuo Wang , Xiang Yu , Dan Zhao , Guocai Ma , Wei Ren , Shuxin Duan","doi":"10.1016/j.bdr.2024.100475","DOIUrl":"10.1016/j.bdr.2024.100475","url":null,"abstract":"<div><p>Audio magnetotelluric (AMT) surveys are widely used to obtain geological settings related to sandstone-type uranium deposits, such as the extent of buried sand bodies and the top boundary of the baserock. However, these settings are hard to read directly from survey sections without geological interpretation, which relies heavily on experience and cognition. Moreover, with the development of 3D technology, manual geological interpretation shows low efficiency and reliability. In this paper, a machine learning model built on U-net was used for the geological interpretation of AMT data in the Naren-Yihegaole area. The training dataset was built from simulated data generated from random models, which addresses the shortage of data samples. In the prediction stage, sand bodies and baserock were delineated from the inversion resistivity images. The machine learning interpretation was highly consistent with the manual one while taking less time. 
This indicates that the technique is more individualized and effective than the traditional approach.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37","pages":"Article 100475"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141408443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research; Pub Date: 2024-06-13; DOI: 10.1016/j.bdr.2024.100474
Marco Ortu, Maurizio Romano, Andrea Carta
{"title":"Semi-supervised topic representation through sentiment analysis and semantic networks","authors":"Marco Ortu, Maurizio Romano, Andrea Carta","doi":"10.1016/j.bdr.2024.100474","DOIUrl":"10.1016/j.bdr.2024.100474","url":null,"abstract":"<div><p>This paper proposes a novel approach to topic detection aimed at improving the semi-supervised clustering of customer reviews in the context of customers' services. The proposed methodology, named SeMi-supervised clustering for Assessment of Reviews using Topic and Sentiment (SMARTS) for Topic-Community Representation with Semantic Networks, combines semantic and sentiment analysis of words to derive topics related to positive and negative reviews of specific services. To achieve this, a semantic network of words is constructed based on word embedding semantic similarity to identify relationships between words used in the reviews. The resulting network is then used to derive the topics present in users' reviews, which are grouped by positive and negative sentiment based on words related to specific services. Clusters of words, obtained from the network's communities, are used to extract topics related to particular services and to improve the interpretation of users' assessments of those services. The proposed methodology is applied to tourism review data from Booking.com, and the results demonstrate the efficacy of the approach in enhancing the interpretability of the topics obtained by semi-supervised clustering. 
The methodology has the potential to provide valuable insights into the sentiment of customers toward tourism services, which could be utilized by service providers and decision-makers to enhance the quality of their services.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37","pages":"Article 100474"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2214579624000509/pdfft?md5=46a689f4478007ad8db7233af95c8c2e&pid=1-s2.0-S2214579624000509-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141401445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research; Pub Date: 2024-06-13; DOI: 10.1016/j.bdr.2024.100477
Huimin Han, Bouba Oumarou Aboubakar, Mughair Bhatti, Bandeh Ali Talpur, Yasser A. Ali, Muna Al-Razgan, Yazeed Yasid Ghadi
{"title":"Optimizing image captioning: The effectiveness of vision transformers and VGG networks for remote sensing","authors":"Huimin Han , Bouba oumarou Aboubakar , Mughair Bhatti , Bandeh Ali Talpur , Yasser A. Ali , Muna Al-Razgan , Yazeed Yasid Ghadi","doi":"10.1016/j.bdr.2024.100477","DOIUrl":"10.1016/j.bdr.2024.100477","url":null,"abstract":"<div><p>This study presents a comprehensive evaluation of two prominent deep learning models, Vision Transformer (ViT) and VGG16, within the domain of image captioning for remote sensing data. By leveraging the BLEU score, a widely accepted metric for assessing the quality of text generated by machine learning models against a set of reference captions, this research aims to dissect and understand the capabilities and performance nuances of these models across various sample sizes: 25, 50, 75, and 100 samples. Our findings reveal that the Vision Transformer model generally outperforms the VGG16 model across all evaluated sample sizes, achieving its peak performance at 50 samples with a BLEU score of 0.5507. This performance shows that ViT benefits from its ability to capture global dependencies within the data, providing a more nuanced understanding of the images. However, the performance slightly decreases as the sample size increases beyond 50, indicating potential challenges in scalability or overfitting to the training data. Conversely, the VGG16 model shows a different performance trajectory, starting with a lower BLEU score for smaller sample sizes but demonstrating a consistent improvement as the sample size increases, culminating in its highest BLEU score of 0.4783 for 100 samples. This pattern suggests that VGG16 may require a larger dataset to adequately learn and generalize from the data, although it achieves a more modest performance ceiling compared to ViT. Through a detailed analysis of these findings, the study underscores the strengths and limitations of each model in the context of image captioning. 
The Vision Transformer's superior performance highlights its potential for applications requiring high accuracy in text generation from images. In contrast, the gradual improvement exhibited by VGG16 suggests its utility in scenarios where large datasets are available and scalability is a priority. This study contributes to the ongoing discourse in the AI community regarding the selection and optimization of deep learning models for complex tasks such as image captioning, offering insights that could guide future research and application development in this field.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37","pages":"Article 100477"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141415449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
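The abstract above compares the two captioning models by BLEU score. For readers unfamiliar with the metric, here is a minimal sketch of sentence-level BLEU: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. It assumes whitespace tokenization, uniform weights, and no smoothing, so it is not necessarily the configuration the authors used; the function names and example captions are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights over 1..max_n grams."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    closest = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    penalty = 1.0 if len(cand) > closest else math.exp(1 - closest / len(cand))
    return penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical captions (not taken from the study's dataset).
reference = "a river runs through a dense forest"
print(bleu(reference, [reference]))
print(bleu("many planes parked near the terminal", [reference]))
```

Because no smoothing is applied, short captions that share no 4-gram with a reference score zero; library implementations usually offer smoothing options for that case.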
Big Data Research; Pub Date: 2024-06-13; DOI: 10.1016/j.bdr.2024.100476
{"title":"Research on the legal system of economic-ecological synergistic compensation in carbon neutral marine cities with a background in big data","authors":"","doi":"10.1016/j.bdr.2024.100476","DOIUrl":"10.1016/j.bdr.2024.100476","url":null,"abstract":"<div><p>With the increasingly severe global carbon emissions problem and the serious threat ecosystems face, carbon neutrality has gradually attracted widespread attention. This study provides an in-depth analysis of practical cases of international carbon neutrality initiatives and relevant experiences of marine cities, focusing on the construction and implementation of a legal system for economic, ecologically coordinated compensation. To evaluate the actual effectiveness of the legal system in marine cities, this study used a multiple linear regression model, considering factors such as the strictness of the legal system, enforcement efforts, and the level of participation of local enterprises and residents. The research results indicate that carbon emissions have significantly decreased in cities where legal systems are effectively enforced, from an average of 1.5 million tons per year to 1 million tons. At the same time, the economic growth rate of these cities has also significantly improved, increasing by about 2.5 percentage points from the original annual average of 4 % to 6.5 %. The study also found that the biodiversity index of these cities increased by 15 %, far higher than the average increase of 5 % in other cities, indicating the positive role of legal systems in protecting biodiversity. The public's participation rate in environmental protection activities has also increased from 25 % to 45 %, and the growth rate of green investment has reached an average of 8 % per year, far exceeding the 3 % growth rate of other cities. In terms of the ecosystem, data shows that the distribution of the ecosystem is stable, with an average ecological index of 508, which is in a relatively ideal state. 
The annual average growth rate of ecosystem restoration is about 3.5 %, further proving the effectiveness of ecological protection measures. Comprehensive empirical analysis shows that implementing the new legal system effectively reduces carbon emissions, enhances biodiversity, and promotes sustainable economic development. The economic growth rate increased from an average of 4.2 % to 5.1 % per year after implementing the new legal system, fully demonstrating the important role of the economic, ecologically coordinated compensation legal system in promoting carbon neutrality goals in marine cities.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37","pages":"Article 100476"},"PeriodicalIF":3.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141412546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Diversified Top-k Pattern Matching in Big Graphs","authors":"Aissam Aouar , Saïd Yahiaoui , Lamia Sadeg , Kadda Beghdad Bey","doi":"10.1016/j.bdr.2024.100464","DOIUrl":"10.1016/j.bdr.2024.100464","url":null,"abstract":"<div><p>Typically, graph pattern matching is expressed in terms of subgraph isomorphism. Graph simulation and its variants were introduced to reduce the time complexity and obtain more meaningful results in big graphs. Among these models, the matching subgraphs obtained by tight simulation are more compact and topologically closer to the pattern graph than results produced by other approaches. However, the number of resulting subgraphs can be huge, overlapping each other and unequally relaxed from the pattern graph. Hence, we introduce a ranking and diversification method for tight simulation results, which allows the user to obtain the most diversified and relevant matching subgraphs. This approach exploits the weights on edges of the big graph to express the interest of the matching subgraph by tight simulation. Furthermore, we provide distributed scalable algorithms to evaluate the proposed methods based on distributed programming paradigms. The experiments on real data graphs succeed in demonstrating the effectiveness of the proposed models and the efficiency of the associated algorithms. 
The result diversification reached 123% within a time frame that does not exceed 40%, on average, of the duration required for tight simulation graph pattern matching.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"36","pages":"Article 100464"},"PeriodicalIF":3.3,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141043195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research; Pub Date: 2024-05-14; DOI: 10.1016/j.bdr.2024.100456
Paolo Mignone, Gianvito Pio, Michelangelo Ceci
{"title":"Distributed Heterogeneous Transfer Learning","authors":"Paolo Mignone , Gianvito Pio , Michelangelo Ceci","doi":"10.1016/j.bdr.2024.100456","DOIUrl":"10.1016/j.bdr.2024.100456","url":null,"abstract":"<div><p>Transfer learning has proved to be effective for building predictive models even in complex conditions with a low amount of available labeled data, by constructing a predictive model for a target domain also using the knowledge coming from a separate domain, called source domain. However, several existing transfer learning methods assume identical feature spaces between the source and the target domains. This assumption limits the possible real-world applications of such methods, since two separate, although related, domains could be described by totally different feature spaces. Heterogeneous transfer learning methods aim to overcome this limitation, but they usually <em>i)</em> make other assumptions on the features, such as requiring the same number of features, <em>ii)</em> are not generally able to distribute the workload over multiple computational nodes, <em>iii)</em> cannot work in the Positive-Unlabeled (PU) learning setting, which we also considered in this study, or <em>iv)</em> their applicability is limited to specific application domains, i.e., they are not general-purpose methods.</p><p>In this manuscript, we present a novel distributed heterogeneous transfer learning method, implemented in Apache Spark, that overcomes all the above-mentioned limitations. Specifically, it is able to work also in the PU learning setting by resorting to a clustering-based approach, and can align totally heterogeneous feature spaces, without exploiting peculiarities of specific application domains. 
Moreover, our distributed approach allows us to process large source and target datasets.</p><p>Our experimental evaluation was performed in three different application domains that can benefit from transfer learning approaches, namely the reconstruction of the human gene regulatory network, the prediction of cerebral stroke in hospital patients, and the prediction of customer energy consumption in power grids. The results show that the proposed approach is able to outperform 4 state-of-the-art heterogeneous transfer learning approaches and 3 baselines, and exhibits ideal performance in terms of scalability.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37","pages":"Article 100456"},"PeriodicalIF":3.3,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2214579624000327/pdfft?md5=33cf99e10874514291bfc635b26d260f&pid=1-s2.0-S2214579624000327-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141025163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research; Pub Date: 2024-05-08; DOI: 10.1016/j.bdr.2024.100463
Feiya Li, Chunyun Fu, Dongye Sun, Jian Li, Jianwen Wang
{"title":"SD-SLAM: A semantic SLAM approach for dynamic scenes based on LiDAR point clouds","authors":"Feiya Li , Chunyun Fu , Dongye Sun , Jian Li , Jianwen Wang","doi":"10.1016/j.bdr.2024.100463","DOIUrl":"https://doi.org/10.1016/j.bdr.2024.100463","url":null,"abstract":"<div><p>Point cloud maps generated via LiDAR sensors using extensive remotely sensed data are commonly used by autonomous vehicles and robots for localization and navigation. However, dynamic objects contained in point cloud maps not only downgrade localization accuracy and navigation performance but also jeopardize the map quality. In response to this challenge, we propose in this paper a novel semantic SLAM approach for dynamic scenes based on LiDAR point clouds, referred to as SD-SLAM hereafter. The main contributions of this work are in three aspects: 1) introducing a semantic SLAM framework dedicatedly for dynamic scenes based on LiDAR point clouds, 2) employing semantics and Kalman filtering to effectively differentiate between dynamic and semi-static landmarks, and 3) making full use of semi-static and pure static landmarks with semantic information in the SD-SLAM process to improve localization and mapping performance. To evaluate the proposed SD-SLAM, tests were conducted using the widely adopted KITTI odometry dataset. 
Results demonstrate that the proposed SD-SLAM effectively mitigates the adverse effects of dynamic objects on SLAM, improving vehicle localization and mapping performance in dynamic scenes, and simultaneously constructing a static semantic map with multiple semantic classes for enhanced environment understanding.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"36","pages":"Article 100463"},"PeriodicalIF":3.3,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141083349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Big Data Research; Pub Date: 2024-05-08; DOI: 10.1016/j.bdr.2024.100465
Mughair Aslam Bhatti, M.S. Syam, Huafeng Chen, Yurong Hu, Li Wai Keung, Zeeshan Zeeshan, Yasser A. Ali, Nadia Sarhan
{"title":"Utilizing convolutional neural networks (CNN) and U-Net architecture for precise crop and weed segmentation in agricultural imagery: A deep learning approach","authors":"Mughair Aslam Bhatti , M.S. Syam , Huafeng Chen , Yurong Hu , Li Wai Keung , Zeeshan Zeeshan , Yasser A. Ali , Nadia Sarhan","doi":"10.1016/j.bdr.2024.100465","DOIUrl":"10.1016/j.bdr.2024.100465","url":null,"abstract":"<div><p>This study presents the implementation and evaluation of a convolutional neural network (CNN) based image segmentation model using the U-Net architecture for forest image segmentation. The proposed algorithm starts by preprocessing the datasets of satellite images and corresponding masks from a repository source. Data preprocessing involves resizing, normalizing, and splitting the images and masks into training and testing datasets. The U-Net model architecture, comprising encoder and decoder parts with skip connections, is defined and compiled with binary cross-entropy loss and Adam optimizer. Training includes early stopping and checkpoint saving mechanisms to prevent overfitting and retain the best model weights. Evaluation metrics such as Intersection over Union (IoU), Dice coefficient, pixel accuracy, precision, recall, specificity, and F1-score are computed to assess the model's performance. Visualization of results includes comparing predicted segmentation masks with ground truth masks for qualitative analysis. 
The study emphasizes the importance of training data size in achieving accurate segmentation models and highlights the potential of the U-Net architecture for forest image segmentation tasks.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"36","pages":"Article 100465"},"PeriodicalIF":3.3,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141026200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
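The abstract above lists several overlap metrics for evaluating predicted segmentation masks. The sketch below shows how these are conventionally computed from the pixel-level confusion counts of a binary mask. It is an illustration under assumptions, not the authors' evaluation code: masks are plain nested lists of 0/1 values, the function names are hypothetical, and a ratio with a zero denominator is reported as 0.0 by choice.

```python
def confusion_counts(pred, truth):
    """Count true/false positives and negatives over two equally sized binary masks."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, truth):
    """Return IoU, Dice, pixel accuracy, precision, recall, and specificity."""
    tp, fp, fn, tn = confusion_counts(pred, truth)

    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "pixel_accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical 2x3 masks: 1 marks a foreground pixel, 0 marks background.
predicted = [[1, 1, 0], [0, 1, 0]]
ground_truth = [[1, 0, 0], [0, 1, 1]]
metrics = segmentation_metrics(predicted, ground_truth)
print(metrics)
```

For a single binary mask the Dice coefficient and the F1-score are the same quantity, 2TP / (2TP + FP + FN), so the sketch reports it once under the name `dice`.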
Big Data Research; Pub Date: 2024-05-01; DOI: 10.1016/j.bdr.2024.100461
Yanan Wu, Rong Mei, Jie Xu
{"title":"Non pilot data-aided carrier and sampling frequency offsets estimation in fast time-varying channel","authors":"Yanan Wu, Rong Mei, Jie Xu","doi":"10.1016/j.bdr.2024.100461","DOIUrl":"https://doi.org/10.1016/j.bdr.2024.100461","url":null,"abstract":"<div><p>This paper considers the non pilot data-aided estimation of the carrier frequency offset (CFO) and sampling frequency offset (SFO) of orthogonal frequency division multiplexing (OFDM) signals in fast time-varying channels. The main obstacle is the time-variant channel response, which degrades the validity of the estimates. A practical way to mitigate this impact is to reduce the time consumed by a one-shot estimation. To this end, we propose a method that reduces the time consumption to within one OFDM symbol duration. The maximum likelihood (ML) estimator is derived from observations of the frequency-domain constellation outputs of two FFTs applied to one symbol; its closed-form approximation is then derived to reduce the computational burden. Remarkably, our method does not require any training symbol or pilot tone embedded in the signal spectrum, and therefore achieves the highest spectral efficiency. Theoretical analysis and simulation results are used to assess the performance of the proposed method in comparison with existing alternatives.</p></div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"36","pages":"Article 100461"},"PeriodicalIF":3.3,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140901432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}