{"title":"Arabic Documents Layout Analysis (ADLA) using Fine-tuned Faster RCN","authors":"Latifa Aljiffry, Hassanin M. Al-Barhamtoshy, A. Jamal, Felwa A. Abukhodair","doi":"10.1109/ESOLEC54569.2022.10009375","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009375","url":null,"abstract":"At present, there is massive interest in document digitization, image searching, and natural language processing models. The first step in applying any such processing, for example image-to-text conversion, is layout analysis, which is the focus of this paper. Layout analysis is particularly challenging for the Arabic language, where there is a noticeable research gap. The main limitation common to existing research is dataset size, which in turn yields less accurate results. In this paper, we use two distinct types of Arabic-language datasets. We propose a tuned model for layout analysis of Arabic printed and early printed documents using Faster RCNN (ADLA). The proposed model is based on tuning the Faster Region-based Convolutional Neural Network (RCNN) model to match our two datasets, with different regions of interest (RoI). For evaluation, we compared the proposed model with two distinct existing models (LABA & FFRA). The F1 score of our proposed model, 99.4%, exceeds that of the LABA model, which scores 90.5%.
Compared with the FFRA model, our model achieves 99.59% accuracy, whereas the FFRA model achieves 99.83%.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115297593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
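The F1 comparison in the record above can be reproduced with a short helper. This is an illustrative sketch only, not the authors' evaluation code, and the detection counts passed in are hypothetical:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, the metric used to
    compare the ADLA model against LABA and FFRA."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts chosen so the score matches the reported 99.4%.
print(round(f1_score(tp=994, fp=6, fn=6), 3))  # → 0.994
```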
{"title":"Smart Customer Care: Scraping Social Media to Predict Customer Satisfaction in Egypt Using Machine Learning Models","authors":"M. Anwar, Karim Omar, A. Abbas, Fakhreldin Abdelmonim, Mohammad Refaie, Walaa Medhat, Aly Abdelrazek, Yomna Eid, Eman Gawish","doi":"10.1109/ESOLEC54569.2022.10009194","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009194","url":null,"abstract":"This paper proposes using posts from social media to extract and analyze customer opinions and sentiments towards any specified topic in Egypt. Summarized statistics and sentiment values are then displayed to the consumer (companies such as Vodafone, WE, etc.) through an attractive and functional user interface. The text, location, and time of thousands of posts are scraped, stored, and preprocessed, then passed through topic modelling to infer the hidden themes, and finally delivered to a Recurrent Neural Network (RNN) that outputs whether the topic is positive or negative. Topic modelling was implemented using the BERT architecture and AraBert word embeddings. Sentiment analysis model training was conducted on approximately 4000 rows of processed data and used Arabic GloVe embeddings to speed up sentiment and word-pattern recognition. Five models were evaluated: LSTM, GRU, CNN, LSTM + CNN, and GRU + CNN.
Overall, the GRU model achieved the best results, with an accuracy of 86.19%, a loss of 0.3349, and an F1-score of 0.858 on the test data.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134552099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
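The GRU that wins the comparison above is defined by its update/reset gating equations. The minimal single-feature cell below is a sketch of those equations with toy weights and no biases or training; it is not the authors' actual model:

```python
import math

def gru_step(x, h, w):
    """One GRU update for scalar input x and scalar hidden state h.
    w holds the input/recurrent weights for the update (z), reset (r),
    and candidate (c) gates; biases are omitted for brevity."""
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    z = sig(w["wz"] * x + w["uz"] * h)               # update gate
    r = sig(w["wr"] * x + w["ur"] * h)               # reset gate
    c = math.tanh(w["wc"] * x + w["uc"] * (r * h))   # candidate state
    return (1.0 - z) * h + z * c                     # blend old and new state

# Toy weights, for illustration only.
w = {"wz": 1.0, "uz": 0.5, "wr": 1.0, "ur": 0.5, "wc": 1.0, "uc": 0.5}
h = 0.0
for x in [0.5, -0.2, 0.9]:   # a tiny "sequence" of sentiment features
    h = gru_step(x, h, w)
```

Because the candidate passes through tanh and the output is a convex combination, the hidden state stays in (-1, 1), which is why GRUs remain stable over long sequences.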
{"title":"Automatic Detection of Various Types of Lung Cancer Based on Histopathological Images Using a Lightweight End-to-End CNN Approach","authors":"Ahmed S. Sakr","doi":"10.1109/ESOLEC54569.2022.10009108","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009108","url":null,"abstract":"Lung cancer is one of the main causes of death and illness, and malignant lung tumours are its leading form. According to reports, lung cancer incidence is on the rise. Lung cancer histopathology is an important element of patient care, and artificial intelligence methods for identifying lung cancer can become highly valuable. In this article, we offer a modified lightweight end-to-end deep learning strategy based on convolutional neural networks (CNN) to accurately identify lung cancer. In this method, the input histopathology images are normalized before being fed into the CNN model, which is then used to detect lung cancer. The effectiveness of our approach is assessed on a publicly accessible database of histopathological images and compared to the most advanced cancer detection methods in use. The results indicate that the proposed deep model for lung cancer diagnosis achieves an accuracy of 99.5%, better than the other approaches.
In addition to this excellent accuracy, our method is computationally efficient.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134290024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
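The abstract above states that the histopathology images are normalized before entering the CNN but does not specify the scheme; a common assumption, sketched here, is min-max scaling of pixel intensities to [0, 1]:

```python
def minmax_normalize(pixels):
    """Scale a flat list of pixel intensities to the [0, 1] range,
    a typical preprocessing step before feeding images to a CNN."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:                      # constant image: avoid division by zero
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]

# 8-bit intensities 0..255 map onto 0.0..1.0.
print(minmax_normalize([0, 128, 255]))
```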
{"title":"A Critical Survey on Arabic Named Entity Recognition and Diacritization Systems","authors":"Muhammad Nabil Rateb, S. Alansary","doi":"10.1109/ESOLEC54569.2022.10009095","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009095","url":null,"abstract":"Language technologies are a subdivision of the Artificial Intelligence (AI) field that sheds light on how toolkits are programmed to simulate natural human language. Over the last decade, there has been remarkable advancement in the Natural Language Processing (NLP) field, notably regarding the Arabic language. Arabic is the language spoken by almost two billion Muslims worldwide and is one of the six official languages of the United Nations. This paper is dedicated to a survey of three cutting-edge toolkits used to process and analyze the Arabic language: Cameltools, Farasa, and Madamira. The paper presents a background on the challenges that have confronted Arabic Natural Language Processing (ANLP), predominantly concerning diacritization and Named Entity Recognition (NER) systems. Next, it describes the main components of Cameltools, Farasa, and Madamira. After that, the evaluation processes of the three toolkits are presented and their results compared. Finally, the paper presents observations based on this comparison. The survey reveals that Cameltools performs best, as it was inspired by the designs of the best toolkits in the field.
Farasa outpaces Madamira in all comparisons regarding ANER and Arabic diacritization.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124048287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of Different Deep Learning Approaches to Arabic Sarcasm Detection","authors":"M. Galal, Ahmed Hassan, Hala H. Zayed, Walaa Medhat","doi":"10.1109/ESOLEC54569.2022.10009500","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009500","url":null,"abstract":"Irony and Sarcasm Detection (ISD) is a crucial task for many NLP applications, especially sentiment and opinion mining, and is considered challenging even for humans. Several studies have employed Deep Learning (DL) approaches, including Deep Neural Networks (DNN), to detect ironic and sarcastic content. However, most of them concentrated on detecting sarcasm in English rather than Arabic content, especially studies concerning deep neural networks such as convolutional neural network (CNN) and recurrent neural network (RNN) architectures. This paper investigates several deep learning approaches, including DNNs and fine-tuned pretrained transformer-based language models, for identifying Arabic sarcastic tweets. In addition, it presents a comprehensive evaluation of the impact of data preprocessing techniques and several pretrained word embedding models on the performance of the proposed deep models. Two shared tasks' datasets on Arabic sarcasm detection are used to develop, fine-tune, and evaluate the different techniques and methods presented in this paper. Results on the first dataset showed that the fine-tuned pretrained transformer-based language models outperformed the developed DNNs. On the second dataset, the proposed DNN models obtained performance comparable to the fine-tuned models.
Results also showed that applying preprocessing techniques with the various deep learning approaches is necessary for better detection performance.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130661238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sentiment Analysis: Amazon Electronics Reviews Using BERT and Textblob","authors":"Abdulrahman Mahgoub, Hesham Atef, Abdulrahman Nasser, Mohamed Yasser, Walaa Medhat, M. Darweesh, Passent El-Kafrawy","doi":"10.1109/ESOLEC54569.2022.10009176","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009176","url":null,"abstract":"The market needs a deeper and more comprehensive grasp of its insights, which is where analytics methodologies such as “Sentiment Analysis” come in. These methods can assist people, especially business owners, in gaining live insights into their businesses and determining whether customers are satisfied. This paper aims to provide such indicators by gathering real-world Amazon reviews from Egyptian customers and applying both the Bidirectional Encoder Representations from Transformers (“BERT”) and “TextBlob” sentiment analysis methods. The process determines the overall satisfaction of Egyptian customers in the electronics department, in order to focus on a specific domain. The two methods are compared for both the Arabic and English languages. The results show that customers on Amazon.eg are mostly satisfied, at a rate of 47%. In terms of performance, BERT outperformed TextBlob, indicating that the word-embedding-based BERT model is superior to the rule-based TextBlob model, with a difference of 15%-25%.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114301669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving The Performance of Semantic Text Similarity Tasks on Short Text Pairs","authors":"Mohamed Taher Gamal, Passent El-Kafrawy","doi":"10.1109/ESOLEC54569.2022.10009072","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009072","url":null,"abstract":"Training a semantic similarity model to detect duplicate text pairs is challenging because almost all such datasets are imbalanced: by nature, positive samples are fewer than negative samples, which can easily lead to model bias. Using traditional pairwise loss functions such as pairwise binary cross entropy or contrastive loss on imbalanced data may lead to model bias, whereas triplet loss showed improved performance compared to the other loss functions. In triplet-loss-based models, data is fed to the model as follows: an anchor sentence, a positive sentence, and a negative sentence. The original data is permuted to follow this input structure. The original training set contains 363,861 samples (90% of the data), distributed as 134,336 positive samples and 229,524 negative samples. The triplet-structured data helped generate a much larger balanced set of 456,219 training samples. Testing showed higher accuracy and F1 scores. We fine-tuned a pretrained RoBERTa model using the triplet loss approach, and testing showed better results.
The best model scored an F1 of 89.51 and an accuracy of 91.45, compared to an F1 of 86.74 and an accuracy of 87.45 for the second-best, contrastive-loss-based BERT model.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124972669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
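The triplet objective described in the record above fits in a few lines. The sketch below uses Euclidean distance and a hypothetical margin of 1.0 on made-up 2-D embeddings; the paper's actual setup fine-tunes RoBERTa sentence embeddings, which are not reproduced here:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pulling the anchor toward the positive and pushing it
    from the negative by at least `margin`: max(0, d(a,p) - d(a,n) + m)."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

# Toy "sentence embeddings", for illustration only.
a, p, n = [0.0, 0.0], [0.1, 0.0], [2.0, 0.0]
print(triplet_loss(a, p, n))  # 0.0: this triplet already satisfies the margin
```

Because every training example pairs one positive with one negative, the triplet structure sidesteps the class imbalance that biases pairwise losses.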
{"title":"The Guidelines of Building a Treebank for Modern Standard Arabic","authors":"Amena Dheif, Ahmed Abd El Ghany, Sameh Al Ansary","doi":"10.1109/ESOLEC54569.2022.10009330","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009330","url":null,"abstract":"Treebanks are among the most needed and used linguistic resources in the fields of Natural Language Processing (NLP) and Natural Language Understanding (NLU). Arabic has only two constituency-based treebanks and a number of dependency treebanks. The current research presents guidelines for building a parsed treebank for Modern Standard Arabic (MSA). The guidelines cover, first, the choice of grammar formalism; then the genre and size of the treebank; and finally its annotation layers. The study also shows that using the traditional Arabic grammar syntactic theory to describe Arabic syntax has proven more suitable than using any of the modern syntactic theories. Working with traditional Arabic grammar also helps avoid the errors found in the available treebanks, which resulted from guidelines that do not suit Arabic grammar. The study adopts three layers of annotation: the morphological layer, the syntactic layer, and the grammatical function layer.
The resulting trees are detailed and rich syntactic trees, which the researchers prefer over a huge amount of poorly and shallowly annotated data.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116442242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identifying Equivalent Words from Different Arabic Dialects Using Deep Learning Techniques","authors":"Hamed Ramadan, Mohammad M. Alqahtani, Abdullah Algoson","doi":"10.1109/ESOLEC54569.2022.10009555","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009555","url":null,"abstract":"The Arabic language comprises many spoken dialects. These dialects vary from written Modern Standard Arabic (MSA) in syntactic, lexical, phonological, and morphological terms. Arabic dialects differ not only along a geographical continuum but also with other sociolinguistic factors, such as the urban, rural, and Bedouin dimensions. Currently, Dialectal Arabic (DA) is the essential written language of unofficial communication in the Arab world, found on social media platforms, in emails, on Twitter, and elsewhere. There has been strong interest in research on computational models of Arabic dialects in the last decade. Most of these studies focus on Arabic dialect identification (classification) and on building Arabic dialect corpora. However, finding synonyms of an Arabic dialect word in other Arabic dialects has received limited attention. To bridge this gap, this study develops a model to identify equivalent words across the dialects of the Arab world using deep learning techniques such as word2vec. This research merged and extended the existing Arabic dialect corpora and then applied deep learning techniques to achieve the best results for dialectal word synonyms.
The outcomes of this research are a new dataset of Arabic dialectal word synonyms and a model with an acceptable accuracy of 81%.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130825228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
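In a word2vec-style model like the one the record above describes, finding a cross-dialect synonym reduces to nearest-neighbour search over embeddings by cosine similarity. The vectors and words below are invented for illustration (real vectors would come from the trained model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def nearest_synonym(word, embeddings):
    """Return the other vocabulary word whose vector is most similar."""
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(embeddings[word], embeddings[w]))

# Hypothetical 2-D embeddings: Egyptian "ezzayak" and Gulf "shlonak"
# (both roughly "how are you") should land near each other, while an
# unrelated word ("ketab", "book") lands elsewhere.
emb = {"ezzayak": [0.9, 0.1], "shlonak": [0.85, 0.15], "ketab": [0.1, 0.9]}
print(nearest_synonym("ezzayak", emb))  # → shlonak
```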
{"title":"Sentiment Analysis From Subjectivity to (Im)Politeness Detection: Hate Speech From a Socio-Pragmatic Perspective","authors":"Samar Assem, S. Alansary","doi":"10.1109/ESOLEC54569.2022.10009298","DOIUrl":"https://doi.org/10.1109/ESOLEC54569.2022.10009298","url":null,"abstract":"Although sentiment analysis is, by definition, the field of Natural Language Processing that focuses on evaluating, analyzing, and detecting the state of mind of human beings towards a range of domains, most studies limit it to opinion mining. Yet opinion mining is only one of three sub-fields under the umbrella of sentiment analysis: opinion mining, emotion mining, and ambiguity detection. Notably, ambiguity detection can be considered a combination of the other two sub-fields, since its linguistic nature implies that statistical and/or syntactic-semantic levels of analysis are not adequate to reach a satisfying level of disambiguation of human language. Hence, the current paper proposes digging deeper, to the pragmatic and socio-pragmatic levels of analysis, in order to eliminate ambiguity and avoid misjudging texts and social media posts, specifically in the sub-task of hate speech detection.
Accordingly, it suggests utilizing an eclectic linguistic model of analysis that includes speech act theory and the theory of (im)politeness.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131624176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}