Social Media-Based Surveillance Systems for Health Informatics Using Machine and Deep Learning Techniques: A Comprehensive Review and Open Challenges

2024-03-02 01:30

Samina Amin, Muhammad Ali Zeb, Hani Alshahrani, Mohammed Hamdi, Mohammad Alsulami and Asadullah Shaikh

1 Institute of Computing, Kohat University of Science and Technology, Kohat, 26000, Pakistan

2 Department of Computer Science, College of Computer Science and Information Systems, Najran University, Najran, 61441, Saudi Arabia

3 Department of Information System, College of Computer Science and Information Systems, Najran University, Najran, 61441, Saudi Arabia

ABSTRACT Social media (SM) based surveillance systems, combined with machine learning (ML) and deep learning (DL) techniques, have shown potential for early detection of epidemic outbreaks. This review discusses the current state of SM-based surveillance methods for early epidemic outbreaks and the role of ML and DL in enhancing their performance. Every year, SM generates a large amount of data related to epidemic outbreaks, particularly Twitter data. This paper outlines the theme of SM analysis for tracking health-related issues and detecting epidemic outbreaks in SM, along with the ML and DL techniques that have been configured for the detection of epidemic outbreaks. DL has emerged as a promising ML technique that learns multiple layers of representations or features of the data and yields state-of-the-art prediction results. In recent years, along with their success in many other application domains, both ML and DL have also become popular in SM analysis. This paper aims to provide an overview of epidemic outbreaks in SM and then presents a comprehensive analysis of ML and DL approaches and their existing applications in SM analysis. Finally, this review offers suggestions, ideas, and proposals, and highlights the ongoing challenges in the field of early outbreak detection that still need to be addressed.

KEYWORDS Social media; epidemic; machine learning; deep learning; health informatics; pandemic

Nomenclature

SM      Social media
CDC     Centers for Disease Control
DL      Deep learning
SVM     Support vector machine
GRU     Gated recurrent unit
LR      Logistic regression
RF      Random forest
SARS    Severe acute respiratory syndrome
ILI     Influenza-like illness
NLP     Natural language processing
MAE     Mean absolute error
LiR     Linear regression
CNN     Convolutional neural networks
CBOW    Continuous bag of words
LSTM    Long short-term memory
MSE     Mean square error
TF-IDF  Term frequency-inverse document frequency
ANN     Artificial neural networks
NB      Naïve Bayes
MLP     Multi-layer perceptron
DT      Decision tree
WHO     World Health Organization
ML      Machine learning
GFT     Google Flu Trends
RMSE    Root mean square error
KNN     K-nearest neighbors
OCR     Optical character recognition
NER     Named entity recognition
LLM     Large language models

1 Introduction

In recent years, Web 2.0, SM, and the news media have been extensively utilized to clarify trends in epidemic outbreak initiation and prevalence. Fig. 1 presents widely used SM platforms that have attracted growing user engagement over the past few decades, drawing inspiration from a similar figure designed in [1]. The immense popularity and proliferation of SM have increased social interaction among the public, thus generating a huge volume of information on topics such as political campaigns, sports, education, and products. Similarly, SM delivers a massive amount of information regarding epidemic outbreaks when outbreaks rise rapidly in a region [2–4]. SM provides a unique way to explore and understand social collaboration and interaction between the public and healthcare responders, now more than ever before. In recent years, SM has attracted extensive interest as a possible mechanism to detect and monitor epidemic outbreaks in a region because it can deliver real-time monitoring at a lower cost than traditional monitoring systems [5–7].

An epidemic occurs when a viral disease emerges from natural reservoirs to infect people and spreads widely [8,9]. The term is also used to describe active crises that are out of control; for instance, the severe acute respiratory syndrome (SARS), which infected about 8000 people and killed nearly 800 around the world in 2003. Similarly, a flu epidemic occurred in 2011, and a COVID-19 (Coronavirus Disease 2019) epidemic outbreak is currently underway. An outbreak is a sudden spike in infectious disease that happens in a community or geographic region; it may also affect many countries and can last for a couple of months or even several years [8]. Each year, some outbreaks are expected, like dengue or influenza/flu. A single case of a viral or infectious disease can also be considered an outbreak if the infectious agent is uncommon (such as COVID-19) or if it has significant public health implications (e.g., bioterrorism agents such as smallpox or Ebola virus).

Figure 1: Common SM applications

Now, in the era of data, SM is commonly used for sharing news, events, and daily life activities, or even for expressing emotions. SM has also played a significant role in real-time analysis and rapid forecasting in many areas. These include disaster prediction [10–13], fake information detection [14], sports activities [15], political campaigns [16,17], sentiment analysis [18], communication [19,20], sarcasm detection [21,22], stock market fluctuations [23], and health surveillance systems [24–26].

In health surveillance systems, SM offers effective resources for epidemic outbreak detection and an active way of coping with outbreaks [3,27,28]. Numerous studies [29–31] have used search engines or search queries to develop methods for tracking an epidemic outbreak in a region. Bhattacharya et al. [32] proposed a model for disease surveillance that utilizes SM and developed a belief surveillance mechanism for health promotion. In health care, this type of monitoring is deployed to evaluate a user’s level of confidence in the health-related information disseminated in SM.

1.1 Social Media-Based Surveillance Systems in Health Informatics during Epidemic Outbreaks

SM-based surveillance methods rely on the analysis of publicly available SM data to identify signals of disease outbreaks. These methods can provide real-time information on the spread of diseases and help public health officials respond quickly and effectively to outbreaks [33–35]. Early detection of seasonal epidemic outbreaks can decrease their influence on daily lives. Therefore, early detection and surveillance systems are important for tracking outbreaks and, through a prompt reaction, reducing the impact of those that would otherwise become uncontrollable. Epidemic outbreaks have been among the deadliest infectious disease events of the 21st century. They can circulate across a nation as well as across the world, and an epidemic that grows to the extent of a pandemic can devastate a whole nation [36,37]. The SM sources most often explored for data gathering, as reported in [35], are depicted in Fig. 2. It can be observed that Twitter is the most popular SM source for health-related data gathering: Twitter (64%), internet search queries/Google Trends/Wikipedia (15%), crowdsourcing (4%), Instagram (3%), YouTube (2%), news articles (1%), SM search (1%), and other microblogs (6%) [35].

Figure 2: Types of SM platforms explored for health-related data collection

The sudden uptick in world travel and the interconnected nature of contemporary civilization have drawn growing attention to both existing and newly emerging outbreak threats. Public health officials require timely and reliable reports about epidemic outbreaks in order to detect early warnings and take action [38]. Traditional systems of epidemic outbreak surveillance are primarily based on manually compiled virology and medical studies. Traditional outbreak tracking through notifications from physicians may take days or weeks to compile and deliver, so identifying more rapidly available sources of information is a real priority. Some prominent outbreaks that have occurred in the world include dengue, influenza/flu, yellow fever, cholera, COVID-19, and several others [39–41].

Several review studies have explored the applications of ML and DL methods in the area of SM-based surveillance systems from different perspectives. For instance, Lamba et al. [42] conducted a review of ML for medical informatics. Al-Garadi et al. [43] explored the prediction of cyberbullying on SM using ML approaches. Riswantini et al. [44] and Gupta et al. [35] conducted comprehensive reviews of handling disease outbreaks on SM using ML techniques. In addition, some other reviews focused on DL approaches [45–47].

SM-based surveillance techniques have shown great potential for the health sector, especially when combined with ML and DL methods. To our knowledge, the existing approaches for detecting or predicting epidemic outbreaks in SM were mainly designed for outbreaks such as seasonal dengue fever, chikungunya, Ebola virus, and influenza or swine flu. This article focuses on reviewing the current methods, strategies, architectures, and frameworks for the prediction, detection, or classification of epidemic outbreak trends in SM information. The investigated approaches analyze SM, and most of them use Twitter data containing keywords related to a specific epidemic outbreak for quicker identification, with the aim of attaining and promoting public health. In this review, we also discuss the current state of SM-based surveillance techniques, their applications, and future research directions.

To the best of our knowledge, the novelty of this review lies in its comprehensive exploration of the utilization of both ML and DL techniques in SM-based surveillance systems for health informatics. While existing reviews focus on either ML or DL separately [42,43,45,46], this review [48] provides a comprehensive analysis of the combined use of these techniques in the context of health-related surveillance on SM platforms.

In addition, this review addresses the specific challenges and open issues that are unique to SM-based health surveillance systems and still need to be addressed for health informatics. It highlights the ethical considerations related to data privacy and the difficulty of distinguishing between reliable health information and misinformation on SM. These discussions provide valuable insights for researchers and practitioners seeking to implement such surveillance systems effectively.

1.2 Research Motivation

The prime motivation behind this analysis is based on the following perspectives:

1. The use of SM has enabled faster monitoring of outbreak patterns compared to traditional data collection methods, facilitating real-time analysis. SM data is a valuable source of information that can aid in tracking epidemic outbreaks.

2. Detection of epidemic outbreaks from SM is a challenging problem that is still in an early phase of development and needs further exploration. Research methods therefore need to be investigated to improve current ML and DL techniques for the detection of epidemic outbreaks in SM.

3. Recent innovations in research on the detection of epidemic outbreaks have encouraged us to conduct a systematic review to investigate, outline, summarize, and assess appropriate research studies.

4. The review aims to help health organizations detect the spread of epidemic outbreaks by extracting information from SM in real-time.

1.3 Research Contribution

The primary contributions of this study are as follows:

1. Classifying epidemic outbreaks into dengue, flu/influenza, Ebola virus, Zika, and COVID-19 for SM-based surveillance systems.

2. Analyzing the position of common types of epidemic outbreaks in SM (dengue, flu/influenza, Ebola virus, Zika, and COVID-19) in decision making.

3. Providing an overview of current ML techniques developed for the detection of epidemic outbreaks in SM surveillance systems.

4. Presenting a summary of various DL techniques that can be considered for epidemic outbreak detection in SM surveillance systems.

5. Providing a summary of various feature extraction techniques used on SM text data for better disease classification and detection.

6. Discussing the various techniques used for SM-based surveillance, their applications, and the open challenges that still need to be addressed.

7. Exploring new learning models for SM analysis and identifying potential applications of NLP and DL.

The paper is structured into different sections as shown in Fig. 3. Section 2 presents a background study of ML and DL. Section 3 provides a discussion on review methodology consisting of the most relevant works related to epidemic outbreaks and surveillance systems for health informatics in SM. Section 4 provides an overview of the assessment and discussion of research questions that utilize ML and DL approaches using SM platforms and concludes with a summary of the research gaps identified in the literature review and a comparison of the existing proposed solutions. Section 5 presents research implications and Section 6 illustrates open research challenges and future research directions. Finally, Section 7 concludes the review.

Figure 3: Structure of the paper

2 Background

2.1 Machine Learning

ML is a broad domain that involves the study of algorithms and statistical models that enable computer systems to learn from data and make predictions or decisions based on it. It is a software technique that utilizes knowledge (experience) to identify or predict patterns within a given dataset [42]. Moreover, ML models discover patterns in the data and gain experience, which helps them perform better over time. In supervised ML, the primary field of study is classification, which aims to determine the appropriate category or class for a particular entity based on its features or parameters. Supervised classifiers use methods that capture the relationships within the data, which poses several ML challenges. Supervised learning entails using large datasets to train algorithms, which then use the learned relationships and patterns to predict or decide on fresh data. Some common ML algorithms include random forest (RF), decision tree (DT), support vector machine (SVM), and logistic regression (LR) [42].
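
As a minimal sketch of the supervised classification setting described above, the following pure-Python example trains a logistic regression (LR) classifier on bag-of-words counts. The six labelled posts, the function names, and the hyperparameters (number of epochs, learning rate) are illustrative assumptions and are not taken from any of the reviewed studies.

```python
import math

# Hypothetical labelled posts (assumption): 1 = illness-related, 0 = unrelated.
TRAIN = [
    ("i have a fever and a sore throat", 1),
    ("down with the flu and coughing all night", 1),
    ("fever and headache again today", 1),
    ("great football match tonight", 0),
    ("new phone arrived today", 0),
    ("watching a movie with friends tonight", 0),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately simplistic)."""
    return text.lower().split()


def build_vocab(samples):
    """Map every training token to a feature index."""
    tokens = sorted({tok for text, _ in samples for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(text, vocab):
    """Bag-of-words count vector; out-of-vocabulary tokens are ignored."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(samples, vocab, epochs=200, lr=0.1):
    """Fit weights and bias with plain stochastic gradient descent."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = vectorize(text, vocab)
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - label  # gradient of the log-loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict_proba(text, vocab, weights, bias):
    """Probability that a post is illness-related under the fitted model."""
    x = vectorize(text, vocab)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


VOCAB = build_vocab(TRAIN)
WEIGHTS, BIAS = train_logreg(TRAIN, VOCAB)
print(round(predict_proba("fever and coughing", VOCAB, WEIGHTS, BIAS), 3))
print(round(predict_proba("football tonight", VOCAB, WEIGHTS, BIAS), 3))
```

In practice, the reviewed studies work with far larger corpora and library implementations; the sketch only shows how features and labels are turned into a decision rule.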

Fig. 4 shows the structure of ML-based models, drawing inspiration from a similar figure published in [49]. There are numerous domains and fields where ML can be applied, such as speech recognition [50–52], NLP [53,54], sentiment analysis [55,56], and health informatics [57–60].

Figure 4: Structure of ML-based techniques

2.2 Deep Learning

DL is a subset of ML that employs multi-layer neural networks to enable more complex computations [45]. DL models possess a greater capacity to identify relevant information, but their effectiveness also depends on the quality of the data. If the data is well-structured, the DL model will have an easier time analyzing it. Some common types of neural networks in DL include convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory networks (LSTMs), gated recurrent units (GRUs), and Transformers [61].

DL is inspired by the structure and operation of the human brain and is effective at many different tasks, including computer vision [62–65], health informatics [66–69], speech recognition [70–72], and NLP [73–75]. Fig. 5 shows a structure of DL-based techniques for NLP, drawing inspiration from a similar figure developed in [49].

Figure 5: Structure of DL-based techniques

3 Review Methodology

The methodology adopted in this review is presented as follows.

3.1 Review Protocol

In this review, various electronic repositories are explored to search for relevant articles. A proper selection and exclusion strategy is applied to filter the retrieved articles. The final selection is based on specific research questions designed for this study, and after a comprehensive analysis, research gaps are reported.

3.2 Research Questions

The following are the primary research questions that will be addressed in this review:

RQ1: What are the common types of epidemic outbreaks reported in SM and their role in information gathering as recognized in the published articles?

RQ2: What are the various ML and DL techniques utilized for the detection and classification of epidemic outbreaks in SM acknowledged in the literature review?

RQ3: Why is DL essential for epidemic outbreak detection and classification in SM, and what are the existing approaches for epidemic outbreak detection and classification from a DL perspective?

RQ4: What are the different feature extraction techniques used to preserve the syntactic and semantic relationships among words in SM texts for better detection, which will help to discover the research gaps?

RQ5: Are the SM platforms efficient in raising awareness about outbreaks and promoting public health by providing early warnings?

3.3 Scientific Data Sources

This study explores works published in recent years that use SM data to detect and classify epidemic outbreaks. To identify relevant articles, the following scientific libraries were explored: Springer Link (https://link.springer.com/), ACM Digital Library (http://www.acm.org/dl), MDPI (https://www.mdpi.com/), Google Scholar (https://scholar.google.com.pk/), PubMed (https://pubmed.ncbi.nlm.nih.gov/), IEEE Xplore (https://ieeexplore.ieee.org/), ScienceDirect (http://www.sciencedirect.com/), Elsevier (https://www.elsevier.com/), PLOS ONE (https://journals.plos.org/plosone/), and others. Fig. 6 shows the distribution of the scientific data sources across conference proceedings, journal articles, and arXiv preprints.

Figure 6: The percentage of publications from various categories (journal articles, conference proceedings, and arXiv)

In the following step, a selection and exclusion strategy is applied to retain the most relevant articles.

3.4 Screening Procedure for Article Selection and Exclusion

This section presents the procedure for article selection and exclusion. A systematic search approach based on targeted keywords has been used to collect the most relevant articles. This involves formulating specific search queries, as outlined in Table 1, in order to identify and retrieve relevant literature. The scope of this review covers research works on SM-based epidemic outbreak detection and classification in real-time using ML and DL techniques. The selection and exclusion strategy, as shown in Fig. 7 and inspired by a similar flowchart published in [76], is used to determine whether an article should be included or excluded. Based on the search strategy, Fig. 8 depicts the number of potentially relevant articles.

Table 1: List of keywords adopted for querying relevant articles

The following key factors form the basis of the selection strategy:

• The respective search queries are related to the work

• Articles relevant to epidemic outbreak detection and classification in SM

• Analyzing SM data for the detection and classification of epidemic outbreaks

• Articles are written in the English language

• Exploring an association between the title of the research article and the keywords designed for this study

Figure 7: Flowchart: Search strategy for article selection and exclusion

Figure 8: Scientific data sources with numbers of potential articles

A total of 712 articles are found in the scientific database searches. Of these, 213 duplicates are excluded, 218 articles are removed based on the abstract, and 141 are removed after reading the complete article using the specifications depicted in Fig. 9, leaving 140 articles for inclusion in the analysis. The selection procedure comprises several phases: the pre-phase searches for relevant articles in the relevant data sources. Phase-1 uses a title-based search to pick appropriate articles. Phase-2 uses an abstract-based search to exclude articles from the primary set. Phase-3 relies on a keyword-based search, while Phase-4 employs a comprehensive full-text search strategy, as depicted in Fig. 9, inspired by the PRISMA flow diagram. In addition, Fig. 10 depicts the distribution of publications related to ML and DL over the years, spanning from 2009 to May 2023, within the context of SM-based surveillance systems for health informatics. This visual representation showcases the trends in research output and the evolution of ML and DL applications in the field of healthcare-related surveillance.
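
The screening arithmetic reported above can be checked with a short sketch. The counts come from the text; the function name, the stage labels, and the order in which the exclusions are applied are assumptions made for illustration.

```python
def screening_funnel(identified, exclusions):
    """Return (stage, remaining) pairs after applying each exclusion in order."""
    remaining = identified
    steps = [("identified in database searches", remaining)]
    for stage, removed in exclusions:
        if removed > remaining:
            raise ValueError(f"cannot remove {removed} of {remaining} at '{stage}'")
        remaining -= removed
        steps.append((stage, remaining))
    return steps


# Counts as reported in the text; stage labels are paraphrased.
EXCLUSIONS = [
    ("after removing duplicates", 213),
    ("after abstract screening", 218),
    ("after full-text reading", 141),
]

STEPS = screening_funnel(712, EXCLUSIONS)
for stage, remaining in STEPS:
    print(f"{stage}: {remaining}")
```

The final count matches the 140 articles retained for analysis, so the reported figures are internally consistent.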

Figure 9: Methodological flow diagram: number of articles reviewed at each phase from primary search to the final number of the selected articles

Figure 10: Distribution of ML and DL-based publications for surveillance systems in SM per year (until May 2023)

4 Assessment and Discussion on Research Questions

The research questions posed in this study are thoroughly examined within the context of analyzing ML and DL approaches for the detection of epidemic outbreaks and the provision of early warnings. The answers to these research questions are as follows:

RQ1: What are the common types of epidemic outbreaks reported in SM and their role in information gathering as recognized in the published articles?

4.1 Surveillance Systems for Epidemic Outbreaks in Social Media

SM-based surveillance methods, combined with ML and DL techniques, have shown potential for early detection of epidemic outbreaks. This section discusses the current state of SM-based surveillance methods for early epidemic outbreaks and the role of ML and DL in enhancing their performance. ML and DL techniques can enhance the performance of SM-based surveillance methods by enabling the automated detection of patterns and anomalies in large volumes of data. ML and DL algorithms can be trained on historical data to detect early signals of epidemic outbreaks and to predict the spread of diseases in real-time.

During spontaneous epidemic outbreaks, the public requires access to reliable and timely information on the incidence of the epidemic and its prevention [40]. Many studies have been conducted in the field of SM analysis for detecting public sentiment about epidemic outbreaks [77]. Epidemic outbreaks are characterized by dynamic and constantly changing temporal and spatial aspects, which can be identified effectively using SM data. This approach can be used to detect diseases such as flu, dengue, and COVID-19, and can aid in promoting public health by detecting patterns and providing early warnings about disease outbreaks [4,78]. It is much faster than traditional reporting, in which physicians and healthcare professionals report cases of disease to local health centers, and which can take days or even weeks before the relevant healthcare professionals or organizations can react and provide resources to control the epidemic [79]. Unfortunately, this delay can result in precious lives being lost before the necessary action can be taken. SM analytics aims to decrease the time between the onset and detection of disease outbreaks, enabling a faster response to control the epidemic.

The main concern is public health, and healthcare professionals must stay informed about epidemic outbreaks and diseases that affect their communities in order to make timely and appropriate decisions [80]. In recent years, SM data, particularly tweets, have been used for disease identification, for example to estimate the present condition of a disease from traditional medically reported data together with SM data recently reported by the public [81,82]; for epidemic outbreak detection [25,83]; and for estimating the probability of people falling sick [38]. In addition, news media and web blogs have been used to deliver early warnings of increased disease progression ahead of officially reported data [84,85], as well as to assess major transitions in the generative and fertility rate of the epidemic outbreak [85,86].

4.2 Common Types of Epidemic Outbreaks in SM and Their Role in Information Gathering

This section presents the common types of epidemic outbreaks in SM. The review encompasses the following disease outbreaks, as presented in Fig. 11: Influenza/Flu, Dengue fever, COVID-19, and Ebola virus. Each of these diseases is explored to provide a comprehensive understanding of how SM-based approaches with ML and DL techniques have been developed to monitor and detect outbreaks associated with these specific diseases.

Figure 11: Overview of ML and DL classification for epidemic outbreaks in SM

4.2.1 Flu/Influenza Virus

Flu, also known as influenza, is a viral infection that affects the nose, lungs, and the rest of the respiratory system. It is not the same as the so-called stomach flu viruses, which cause nausea and vomiting. Most people recover from influenza on their own, yet influenza and its symptoms can also be devastating. Influenza, a respiratory disease, causes a significant number of deaths worldwide each year. The symptoms of flu are usually mild, such as headache, sneezing, fever, sore throat, and coughing [87]. Influenza shots are typically administered during the winter season, and infected individuals are advised to seek medical attention from specialists rather than general practitioners. Failure to treat the flu can lead to serious complications and worsen the patient’s condition [88].

It is worth noting that both Wang et al. [89] and Chen et al. [90] utilized DL approaches in their proposed models for epidemic prediction using SM data. Wang et al. employed a partial differential equation (PDE)-based model for influenza prediction, while Chen et al. used two temporal topic models (supervised and unsupervised) to estimate trends in flu outbreaks in South American countries. These studies highlight the potential of DL approaches in accurately predicting and tracking epidemic outbreaks using SM data and suggest that these approaches could be extended to other types of epidemics in the future.

Ginsberg et al. [29] developed an automated approach for exploring search queries associated with influenza. By analyzing the public’s search queries from five years of Google web search logs, the approach provides more robust influenza-like-illness surveillance (ILIS), with regional and state-level estimates of ILI occurrence in the US. The growing global use of online search could allow the model to be extended to international contexts. To assess the state-level predictions, the authors compared them with the weekly ILI levels reported by the state of Utah. Because the model was trained with traditional techniques, there is a need to explore outbreaks in greater depth using advanced approaches.

The increased usage of online social networking sites such as Twitter, with its demographically diverse user population, makes SM data an effective resource for real-time analysis [91]. The main contribution of the study was to propose a framework that utilizes SM data to track an epidemic outbreak in a region and reliably deliver early warnings, even for new outbreaks. The model was evaluated with regression approaches using tweet data. The linear regression (LiR) performed better compared with the Pearson correlation. As a limitation and a direction for future work, the authors noted that manual annotation is needed to train the model on the entire SM data.
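
To make the two measures mentioned above concrete, the following sketch computes the Pearson correlation between weekly flu-related tweet counts and ILI rates and fits a least-squares line (LiR) that maps tweet counts to ILI rates. The weekly values and function names are hypothetical and serve only as an illustration; they are not data from [91].

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical weekly series (assumption): flu-related tweets and ILI rate (%).
WEEKLY_TWEETS = [120, 150, 200, 260, 310, 280, 190]
WEEKLY_ILI = [1.1, 1.4, 1.9, 2.6, 3.0, 2.7, 1.8]

R = pearson_r(WEEKLY_TWEETS, WEEKLY_ILI)
SLOPE, INTERCEPT = fit_line(WEEKLY_TWEETS, WEEKLY_ILI)
print(f"Pearson r = {R:.3f}")
print(f"Estimated ILI for 230 tweets = {SLOPE * 230 + INTERCEPT:.2f}%")
```

The correlation only measures how strongly the two series move together, whereas the fitted line yields an actual estimate of the ILI rate for a new tweet count.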

Carneiro et al. [2] suggested that Google Flu Trends (GFT) can be utilized to detect influenza outbreaks earlier than traditional monitoring mechanisms such as the Centers for Disease Control and Prevention (CDC). Their work describes the Google Trends tools and data processing and provides a detailed example of how to detect influenza outbreaks in SM. A dataset based on Google search queries was utilized for influenza detection with GFT. They suggested that GFT can be used as a blueprint for an early monitoring system for infectious epidemic outbreaks. As a future direction, however, there is still room to enhance the proposed system for infectious disease outbreaks by utilizing DL approaches.

In [92], ML techniques were used to detect disease outbreaks based on the frequency of mentions of two diseases, flu and cancer, on Twitter. The study aimed to monitor the spread of epidemics in different US regions by analyzing the number of mentions in each state. The location information was extracted from user timelines to perform the geographic analysis. The novelty of the study was to provide real-time surveillance of disease outbreaks in the US states, but it did not consider important features such as the number of infected people, sentiment about the disease, and alarming situations.
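
The per-state mention counting described above can be sketched as follows. This is a simplified illustration rather than the pipeline of [92]: the keyword lists, the sample tweets, and the assumption that each tweet already carries a resolved US state code are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical keyword lists (assumption); a real study would curate these.
DISEASE_TERMS = {
    "flu": {"flu", "influenza"},
    "cancer": {"cancer"},
}


def count_mentions_by_state(tweets):
    """Count tweets mentioning each disease, grouped by state code.

    `tweets` is an iterable of (state, text) pairs; a tweet counts at most
    once per disease, however many matching keywords it contains.
    """
    counts = defaultdict(Counter)
    for state, text in tweets:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        for disease, terms in DISEASE_TERMS.items():
            if tokens & terms:
                counts[state][disease] += 1
    return {state: dict(per_disease) for state, per_disease in counts.items()}


# Hypothetical geotagged tweets (assumption).
SAMPLE_TWEETS = [
    ("UT", "Got the flu, staying home"),
    ("UT", "Influenza season is here again"),
    ("CA", "Cancer research fundraiser today"),
    ("CA", "Flu shot done"),
    ("NY", "Lovely weather this morning"),
]

MENTIONS = count_mentions_by_state(SAMPLE_TWEETS)
for state, per_disease in sorted(MENTIONS.items()):
    print(state, per_disease)
```

Raw mention counts of this kind say nothing about whether the author is actually ill, which is one reason the study's omission of infection and sentiment features matters.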

4.2.2 Ebola Virus

Several studies have demonstrated the effectiveness of SM-based surveillance methods for early epidemic outbreak detection. For example, studies conducted during the 2014 Ebola outbreak in West Africa showed that Twitter data could be used to detect and track the spread of the disease in real-time [93–95]. For instance, Odlum et al. [96] showed that public health information needs can be accurately identified through content analysis of SM data, such as tweets, and then met with the right health information. Through longitudinal tracking, their study aimed to evaluate the demand for Ebola health information at various stages of the epidemic. Guidry et al. [97] investigated the information and context of posts made on Twitter and Instagram about the Ebola epidemic by the CDC, the WHO, and Médecins Sans Frontières, with a focus on the communication methods that were employed during the epidemic. Even though all three health organizations used both platforms, the findings point to Instagram as being especially useful for developing relevant interactions with the public during times of global health emergencies, as shown by the substantially greater levels of engagement on the part of both health agencies and citizens.

Furthermore, using Twitter and news data, Kim et al. [98] proposed sentiment analysis and topic-based content analysis of the Ebola pandemic with the help of n-gram Latent Dirichlet Allocation (LDA) topic modeling techniques. Lazard et al. [99] developed a model for detecting the narrative of public concern regarding the Ebola virus from Twitter posts, and Van et al. [100] utilized tweet content to identify public concern regarding the Ebola virus for early warning during pandemic outbreaks.

4.2.3 Zika Virus

Other studies have shown that ML/DL algorithms can improve the accuracy of SM-based surveillance methods for epidemic outbreak detection. For example, studies conducted during the 2016 Zika virus outbreak in Brazil showed that ML algorithms could accurately identify tweets related to the disease, even when they used non-specific terms [101,102].

In order to comprehend how a public health emergency of global importance manifests itself in SM, particularly Twitter, Pruss et al. [103] examined the relevance of three different sorts of events: those that are location-related, actor-related, and concept-related. The work thereby adds to the body of knowledge about the processes that underlie participation, contributions, and engagement on this SM platform during a disease outbreak. They collected over 6 million tweets referring to the Zika pandemic from December 2015 to March 2016. Using ML techniques, Daughton et al. [104] looked for concerns and self-disclosures of a particular behavior change linked to disease transmission, namely travel cancellation, using tweets about the 2015–2016 Zika virus outbreak. If this kind of behavior can be identified on Twitter, the method might offer a new source of data for disease modelling. To test the viability of using Twitter data for monitoring the ZIKV pandemic at the national and state (Florida) levels, Masri et al. [105] used a recently created tool called Cloudberry to filter a random sample of the data. To predict weekly ZIKV infections one week in advance, two auto-regressive models were calibrated using weekly ZIKV case numbers and Zika tweets.
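
As a hedged illustration of one-week-ahead forecasting from case counts and tweet volumes, the sketch below fits a generic first-order auto-regressive model with an exogenous tweet term, cases[t+1] = a + b * cases[t] + c * tweets[t], by ordinary least squares. This is not the model calibrated in [105]; the model form, the function names, and the synthetic weekly series are assumptions.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_arx(cases, tweets):
    """Least-squares fit of cases[t+1] = a + b*cases[t] + c*tweets[t]."""
    rows = [[1.0, cases[t], tweets[t]] for t in range(len(cases) - 1)]
    targets = cases[1:]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(3)]
    return solve_linear(xtx, xty)


def forecast_next(cases, tweets, coef):
    """Predict next week's cases from the latest week's cases and tweets."""
    a, b, c = coef
    return a + b * cases[-1] + c * tweets[-1]


def simulate(first_cases, tweets, coef):
    """Generate a noise-free synthetic case series (for illustration only)."""
    a, b, c = coef
    cases = [first_cases]
    for weekly_tweets in tweets[:-1]:
        cases.append(a + b * cases[-1] + c * weekly_tweets)
    return cases


# Synthetic weekly data (assumption): tweet counts and simulated case counts.
TRUE_COEF = (2.0, 0.5, 0.1)
TWEETS = [100, 150, 120, 200, 260, 180, 140, 90, 110]
CASES = simulate(10.0, TWEETS, TRUE_COEF)

COEF = fit_arx(CASES, TWEETS)
print("fitted (a, b, c):", [round(v, 4) for v in COEF])
print("next-week forecast:", round(forecast_next(CASES, TWEETS, COEF), 2))
```

Because the synthetic series contains no noise, the fit recovers the generating coefficients; with real surveillance data, the fitted values and forecast errors would need to be validated on held-out weeks.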

From October 1st, 2015, to February 25th, 2016, Abouzahra et al. [106] gathered 67,000 tweets in both English and Spanish with the hashtags #Zikavirus and #Zika. They used text analytics methods to analyze the tweets and identify the key ideas, and examined how the use of these ideas varied from one month to the next. Wood [107] developed an ML-based method for studying how conspiracy theories were propagated and debunked on Twitter during the 2015–2016 Zika pandemic. The author collected around 25,162 tweets on the Zika virus from Twitter to disprove statements and hoaxes that circulate through SM.

4.2.4 Dengue Outbreak

Dengue is a mosquito-borne viral infection that is spreading globally [108]. Accurate and timely data monitoring is essential to effectively detect dengue outbreaks and assess the impact of preventive interventions [109]. The characteristics of a dengue infection vary from fever to feared consequences such as severe viral infection and trauma. Although dengue infection is typically self-limiting, dengue outbreaks have emerged as a challenge for public health promotion in tropical and subtropical regions [110].

Amin et al. [111] designed an automated approach for the detection and classification of dengue outbreaks in tweet data. In their work, they deployed the RNN-based LSTM method to efficiently process sequential data and classify tweet messages into dengue-positive and dengue-negative classes in order to detect dengue-infected people. They also compared ML techniques, including SVM, NB, and LR, with DL techniques such as ANN, DNN, and LSTM to find the best approach for outbreak detection from SM data, and the performance of each model was evaluated on test data. They found that LSTM is the best approach for detecting disease-infected people and analyzing the epidemic outbreak in SM compared to the other state-of-the-art techniques. For feature extraction, the term frequency-inverse document frequency (TF-IDF) technique was utilized. In this work, a novel benchmark dataset for outbreak detection was built from data collected from 2017 to 2019. The future research direction mentioned in this work is to configure word embedding techniques for better disease classification and detection.
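To make the TF-IDF feature extraction step concrete, the following minimal sketch computes TF-IDF weights for a few invented tweets. It assumes the plain definitions tf = term count / document length and idf = ln(N / df) with whitespace tokenization; these are assumptions for illustration, not necessarily the variant used in [111], and library implementations often apply smoothing that changes the absolute values.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumes tf = term count / document length and idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented example tweets (not from any dataset in the reviewed studies).
tweets = [
    "i have dengue fever",
    "dengue cases rising in the city",
    "i have a meeting today",
]
vectors = tf_idf(tweets)
print(vectors[0])
```

In the first tweet, "fever" occurs in only one document and therefore receives a higher weight than "dengue", which occurs in two. The resulting sparse vectors are the kind of input a classifier such as SVM, NB, or LR would consume.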

Jain et al. [112] introduced an SM-based dengue surveillance and epidemic outbreak detection mechanism incorporating temporal and spatial patterns that help to recognize, classify, and model consumer behavioral trends on SM. Their proposed approach was based on geo-tagging; such predictive modeling has a major role in the deterrence and monitoring of mosquito-borne disease with limited resources in a specific region. Tracking public opinions in real-time offers intuitive interfaces or early warnings related to outcomes. In this work, LDA-based topic modeling approaches were developed to filter out similar topics about deterrence, symptoms, and panic. For classification, the ML techniques SVM and Naïve Bayes (NB) were used. Future research is required to utilize other data resources such as news articles and web blogs. Data that contains emoticons and text in images may be considered in the future as well.

The increasing number of dengue cases in China has raised significant public health concerns, with the disease spreading to larger regions [113]. The main contribution of the work in [113] was to design a timely and accurate approach for dengue prediction in China using state-of-the-art ML techniques, Baidu search queries, and environmental data (relative humidity, mean temperature, and precipitation) collected from 2011 to 2014 in Guangdong. To implement the model, the authors compared and evaluated support vector regression (SVR), gradient boosted regression tree (GBRT), step-down linear regression, least absolute shrinkage and selection operator (LASSO) linear regression, generalized additive, and negative binomial regression models to predict dengue cases, with root mean square error (RMSE) as an evaluation measure. In this work, the proposed SVR approach achieved better performance in forecasting and tracking dengue outbreaks than the other baseline methods. The findings of this work will be helpful for healthcare organizations to identify the initiatives needed to improve dengue surveillance. As a future direction, the proposed model can be improved using DL methods as well.
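Forecasting models of this kind are commonly compared by their root mean square error on held-out weeks. The sketch below shows that comparison in its simplest form; the observed counts, the two forecast series, and the model names are invented for illustration and do not reproduce any result from [113].

```python
import math


def rmse(actual, predicted):
    """Root mean square error between observed and predicted counts."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up weekly dengue counts and forecasts from two hypothetical models.
observed = [12, 30, 45, 28]
forecasts = {
    "model_a": [10, 33, 40, 30],
    "model_b": [20, 20, 60, 15],
}
scores = {name: rmse(observed, pred) for name, pred in forecasts.items()}
# The model with the lowest RMSE is preferred.
best = min(scores, key=scores.get)
print(best, scores)
```

Because RMSE squares the errors before averaging, a model that misses an outbreak peak badly is penalized more heavily than one that is slightly off every week.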

4.2.5 COVID-19 Outbreak

COVID-19 is a viral infectious disease caused by a newly discovered coronavirus [114–116]. It is extremely contagious, and many patients arrive at hospitals for testing at the same time, which has significantly affected public healthcare systems. The priority of treatment is also dictated by the severity of the symptoms based on a diagnosis. Clinical experiments indicate that suspected cases (people with mild symptoms) can deteriorate rapidly [115,117]. Hence, it is essential to detect patients’ deterioration early in order to improve the treatment plan.

Information and news headlines on COVID-19 were quickly posted and shared on SM during the first months of 2020. Although information patterns on SM and the web have been analyzed for around 18 years in the field of infodemiology, the COVID-19 outbreak has been referred to as the first SM infodemic. However, there is insufficient evidence about whether and how the SM infodemic has driven information on COVID-19 and provided early warnings. The explosive growth of COVID-19 has been reported in several regions of the world [86,118].

Detecting informative tweets related to the COVID-19 outbreak is an important task for providing real-time updates to the public, identifying misinformation, and tracking the spread of the disease [119,120]. For instance, Samuel et al. [121] performed sentiment analysis on COVID-19 tweet data to identify public sentiment related to the outbreak. In their work, they used the ML techniques NB and LR and compared the effectiveness of the two in classifying COVID-19 tweets in the United States. For shorter tweets, the NB approach achieved 91% accuracy while the LR approach achieved 74% accuracy. However, both approaches showed comparatively poor performance for longer tweets. There is therefore a need to improve the performance of ML techniques on long tweets, which may be achieved with the help of DL techniques. Kabir et al. [122] presented a method that utilizes ML and topic modelling techniques to analyze user sentiment and public posts related to COVID-19 on SM.

Ardabili et al. [123] conducted an interesting study on predicting the COVID-19 outbreak by comparing the performance of ML and soft computing techniques. The authors used MLP and ANFIS models to predict the pandemic outbreak in five countries (Italy, Germany, Iran, USA, and China) by training the models on a dataset obtained from https://www.worldometers.info/coronavirus/#countries. The study found that ML techniques, particularly MLP, were effective in modeling the pandemic outbreak. However, future research should also focus on modeling the fatality rate to aid in planning new facilities in affected countries. Additionally, the use of DL techniques can aid in the detection of infected individuals.

Khanday et al. [124] built an effective approach for textual clinical data classification by utilizing ML approaches. The clinical textual data are classified into four classes with the help of classical and ensemble ML techniques. For feature extraction, they used TF-IDF and Bag of Words (BOW), and for classification, they used LR, Multinomial Naïve Bayes (MNB), SVM, and Decision Tree (DT). The four classes were COVID, SARS, ARDS (Acute Respiratory Distress Syndrome), and both COVID and ARDS. In the end, the comparative analysis among ML methods showed that the MNB technique outperformed the other techniques with 96% accuracy. For future work, there is a need to increase the accuracy using RNNs, as traditional ML techniques are not able to work efficiently on sequential data. Prabhakar et al. [125] proposed topic modelling techniques for the detection of COVID-19 information from Twitter content.
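As an illustration of the bag-of-words plus Multinomial Naïve Bayes combination, the sketch below implements a small MNB classifier with add-one smoothing. It is a teaching example under stated assumptions, not the pipeline of Khanday et al.: the class name, the whitespace tokenization, the training snippets, and their labels are all invented.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Bag-of-words multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy snippets; real clinical text would be far longer.
train_texts = [
    "patient with coronavirus fever cough",
    "sars outbreak coronavirus 2003",
    "acute respiratory distress ventilator",
    "covid fever loss of smell",
]
train_labels = ["COVID", "SARS", "ARDS", "COVID"]
model = MultinomialNB().fit(train_texts, train_labels)
print(model.predict("ventilator for respiratory distress"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the add-one smoothing keeps a single unseen word from zeroing out a class.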

Nowadays, with the help of SM content, a lot of analysis and statistics can be produced during epidemic outbreaks. Nemes et al. [126] presented an automated approach for predicting and presenting public sentiment to check the correlation between labels and words of positive and negative sentiment in tweet messages. The analysis was performed with the help of NLP techniques and sentiment classification using the RNN approach. For further processing, they also analyzed, visualized, and compiled their findings. The approach developed in this work performs accurately on small data, even with vague tweet messages, in assessing the sentiment polarity of the public regarding COVID-19.

4.3 Summary of Key Performance from Recent Literature

Reviewing the literature makes it possible to evaluate the performance of ML and DL approaches and the kinds of public tweet messages regarding various epidemic outbreaks on SM. A detailed summary of the key performance indicators for the selected articles using ML techniques is presented in Table 2. Furthermore, Table 3 presents the key performance indicators for the selected articles using DL techniques.

After reviewing the literature outlined in Tables 2 and 3, it can be concluded that the existing studies did not successfully identify disease-infected individuals from SM texts. There is currently no established methodology or procedure for identifying individuals with a disease from SM information. The literature mainly focuses on detecting the frequency of SM posts regarding a specific disease, rather than distinguishing between disease-infected people and non-disease-infected people in SM posts. To bridge the research gaps, a new approach utilizing DL approaches with word embedding techniques needs to be proposed.

Fig. 12 shows the most discussed epidemic outbreaks in SM along with the corresponding research articles. The outbreaks most commonly analyzed in research are flu/influenza and dengue, and now COVID-19.

RQ2: What are the various ML and DL techniques utilized for the detection and classification of epidemic outbreaks in SM acknowledged in the literature review?

This section distinguishes the research articles from ML and DL perspectives that are utilized for the detection and classification of epidemic outbreaks in SM acknowledged in the literature review. Fig. 13 reveals that the primary research studies using ML have employed SVM (10%), MNB (4%), NB (9%), LR (14%), Linear Regression (10%), DT (10%), LDA (9%), RF (7%), and SVR (8%). Similarly, Fig. 14 shows the DL techniques deployed: ANNs (29%), DNN (21%), RNN (21%), LSTM (20%), GRU (12%), CNN (3%), and LSTM+CNN (<2%).

Figure 13: Breakdown of the articles using ML techniques

Figure 14: Breakdown of the articles using DL techniques

RQ3: Why is DL essential for epidemic outbreak detection and classification in SM, and what are the existing approaches for epidemic outbreak detection and classification with a DL perspective?

Answer 3: Based on our analysis, it is evident that there are significant research gaps in the area of SM analytics for detecting epidemic outbreaks, particularly in terms of real-time disease surveillance for early warning purposes. DL approaches have shown promising results in addressing these challenges. To bridge these gaps, this review proposes a new approach that leverages DL techniques such as RNN/LSTM, CNN, and CNN+LSTM with word embedding techniques. These approaches can be explored further in future research to address the identified research gaps.

RQ4: What are the different feature extraction techniques used to keep the syntactic and semantic relationships among words in SM texts for better detection, which will help to discover the research gaps?

Answer 4: After examining the early disease detection approach that utilizes SM information, it can be concluded that existing research models primarily focus on using SM platforms to detect various epidemics, such as seasonal dengue virus, depression, cancer, and flu outbreaks. While the literature uses the term “prediction and detection,” it primarily refers to identifying instances of influenza or swine flu that have already been observed. However, due to the limitations of the current SM monitoring system, new approaches are necessary to effectively detect and monitor epidemic outbreaks in SM.

Based on the existing literature on DL, it is evident that using word embedding techniques to analyze Twitter texts can help capture the semantic and syntactic relationships between words, thus improving classification accuracy. This review aims to address the shortcomings of previous studies rather than replace existing disease detection and monitoring systems. However, challenges still exist, such as the limited character count of tweet messages, the prevalence of abbreviations and informal words, grammatical and spelling errors, as well as instances of mixed language and inappropriate sentence structure.
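The way word embeddings encode relatedness between words can be shown with cosine similarity. The three-dimensional vectors below are hand-made for illustration and are not taken from any trained embedding model; real embeddings typically have hundreds of dimensions and are learned from large corpora.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors, NOT output of a trained embedding model.
toy_embeddings = {
    "fever": [0.9, 0.1, 0.0],
    "flu": [0.8, 0.2, 0.1],
    "football": [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_embeddings["fever"], toy_embeddings["flu"])
unrelated = cosine_similarity(toy_embeddings["fever"],
                              toy_embeddings["football"])
print(related, unrelated)
```

In this toy setting the health-related pair scores close to 1 while the unrelated pair scores close to 0, which is the property that lets a downstream classifier treat "flu" and "fever" as similar even when they never co-occur in a training tweet.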

RQ5: Are SM platforms effective at raising awareness about outbreaks and promoting public health by providing early warnings?

Answer 5: Using SM platforms to spread awareness of outbreaks and provide early warnings is a successful strategy. SM can link people to resources and health specialists in real-time, as well as provide updates and information regarding outbreaks, making it an invaluable source of information during epidemic outbreaks. SM users act as human sensors whose posts can be collected and analyzed to track and monitor an outbreak. This real-time analysis is faster than traditional methods of data collection and can be deployed to monitor various disease patterns. Additionally, the cost of collecting data through SM is lower than that of traditional methods [140]. The combination of NLP, ML, DL, healthcare analysts, and SM text analysis is proposed as an effective approach for detecting epidemic outbreaks in a region. The review highlights the importance of SM analysis as a valuable surveillance tool during epidemic outbreaks, including flu, dengue, Zika, Ebola, and COVID-19. It emphasizes that real-time information from tweets can alert healthcare professionals and emergency responders to take necessary actions in order to control or monitor an epidemic outbreak.
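One simple way to turn a stream of disease-related post counts into an early-warning signal is to flag days that deviate strongly from the recent baseline. The sketch below is an assumption-laden illustration, not a method from the reviewed studies: the function name, the 7-day window, the threshold of 3 standard deviations, and the daily counts are all hypothetical.

```python
import statistics


def detect_spikes(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count exceeds the trailing-window mean
    by more than `threshold` population standard deviations."""
    alerts = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        mean = statistics.mean(history)
        spread = statistics.pstdev(history)
        # Guard against a flat history, where the deviation is zero.
        if spread == 0:
            spread = 1.0
        if (daily_counts[day] - mean) / spread > threshold:
            alerts.append(day)
    return alerts


# Invented daily counts of tweets mentioning a disease keyword.
counts = [20, 22, 19, 21, 20, 23, 21, 22, 20, 95, 24]
print(detect_spikes(counts))
```

A flagged day only indicates unusual posting volume; as the discussion of misinformation makes clear, such a signal would still need verification against other sources before being treated as evidence of an outbreak.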

According to the reviewed studies, social media platforms can be very important for spreading knowledge of outbreaks and promoting public health. Many studies have shown that SM can be an effective and timely approach for identifying and tracking disease outbreaks. Likewise, SM platforms can help disseminate health information, encourage healthy habits, and offer a venue for community involvement and public health campaign participation. The overall results indicate that SM can be a successful tool for spreading health awareness and offering early warnings of outbreaks, despite certain restrictions and difficulties related to its use in public health.

Overall, online SM platforms have the potential to effectively disseminate early alerts and raise public awareness of health issues during epidemics. However, it is important to note that SM can also spread misinformation and rumors, which can be detrimental to public health efforts. As a result, during outbreak scenarios, it is critical to closely monitor and verify information published on SM.

5 Implications

The research holds significant scientific impact and is of great interest to researchers in the fields of health informatics, data science, and public health, as follows:

1. It contributes to the advancement of health surveillance methodologies by exploring the potential of SM-based systems.

2. It bridges the gap between ML and DL methodologies, presenting how they can complement each other in the context of health informatics.

3. It addresses the ethical challenges surrounding data privacy and misinformation on SM.

4. By highlighting the open challenges and limitations, it offers valuable direction for future research developments. Researchers can contribute to addressing these issues by pointing out potential research gaps.

In summary, the research’s interdisciplinary approach, relevance to public health issues, and contribution to the field of SM-based health surveillance all contribute to its scientific impact. It makes a substantial and interesting addition to the scientific community and scholars by providing researchers with insightful information, ethical issues, and a roadmap for future study.

6 Open Research Challenges and Future Research Directions

SM-based surveillance techniques combined with ML/DL methods have the potential to revolutionize the health sector by providing real-time information on disease outbreaks and enabling more effective public health responses. However, there are still several challenges that need to be addressed to fully realize the potential of these techniques.

1. Noisy data: SM data is often noisy, incomplete, and unstructured, which makes it difficult to extract meaningful information. Future work should improve the quality of SM data by reducing noise, filling data gaps, and standardizing data formats. DL models, especially those that include RNNs, LSTM, or transformers, can recognize rapid spikes or hidden patterns in SM data that are related to epidemic outbreaks and crisis occurrences.

2. Privacy issues: There are concerns about the use of personal information on SM and the potential misuse of this information for surveillance purposes. Privacy-preserving methods for data collection and analysis should be developed to address these concerns. In addition, incorporating knowledge from similar tasks or diseases through transfer learning and multi-task learning in DL and NLP can help methods generalize more broadly.

3. Data validation: The data acquired from SM-based surveillance methods require effective validation and verification. Bias in SM data can be reduced by incorporating data from multiple sources and integrating demographic and geographic data into analyses. The semantic understanding of SM posts can be improved by DL techniques, particularly transformer-based models like BERT, which can comprehend and interpret the context of non-standard language.

4. Data bias: The accuracy of surveillance methods may be impacted by SM data that is biased toward certain demographics or geographical regions. SM-based surveillance systems should be integrated with other data sources, including electronic health records and traditional surveillance methods, to provide a more comprehensive model of disease outbreaks.

5. Misinformation and rumor detection: SM can also spread misinformation and rumors, which can be detrimental to public health efforts. Therefore, it is important to carefully monitor and verify information shared on SM during outbreak situations. Early detection of misinformation, rumors, and public concern about a disease is needed, as the sharing of fake news and rumors has increased with the widespread use of SM. By using advanced DL and NLP techniques, the issue of misinformation or rumors can potentially be addressed.

6. Mental health detection from SM: Using ML and DL techniques, depression and anxiety can be detected from SM, which can be valuable for psychiatrists and mental health professionals.

7. Pretrained large language models (LLMs) with contextualized information can be used to improve the performance of traditional ML and DL models for disease surveillance.

8. Additionally, optical character recognition (OCR) can be applied to extract textual data from screenshots of social media posts shared across different platforms.

9. Furthermore, named entity recognition (NER) can be used to automatically extract disease-related information from SM texts. These methods can be used to extract names of diseases, medicines, vaccinations, and other related information to develop contextually aware models for disease surveillance through SM.
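The kind of disease-related entity extraction mentioned in the last item can be approximated, in its simplest form, by dictionary lookup. The sketch below is a plain lexicon matcher, not a trained NER model; the lexicon contents, labels, and function name are hypothetical, and a practical system would rely on a learned model or a curated medical vocabulary.

```python
import re

# Hypothetical mini-lexicon for illustration only.
LEXICON = {
    "DISEASE": ["dengue", "zika", "ebola", "covid-19", "flu"],
    "SYMPTOM": ["fever", "cough", "headache"],
    "VACCINE": ["vaccine", "vaccination"],
}


def extract_entities(text):
    """Return (surface form, label, start offset) tuples found in `text`."""
    found = []
    for label, terms in LEXICON.items():
        for term in terms:
            # Lookarounds require a non-word character (or the string
            # edge) on both sides, so "flu" is not matched in "influenza".
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.group(0), label, match.start()))
    # Report entities in the order they appear in the text.
    return sorted(found, key=lambda item: item[2])


post = "Got the COVID-19 vaccine, now I have a fever and cough."
for surface, label, start in extract_entities(post):
    print(start, label, surface)
```

A lexicon matcher cannot handle misspellings, abbreviations, or mentions absent from the list, which is exactly where trained NER models and contextual language models are expected to help.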

In conclusion, future research developments should focus on improving the quality of SM data, addressing privacy concerns, reducing bias, developing explainable AI, NLP, pre-trained LLM, ML, and DL methods, and integrating SM-based surveillance with other data sources. The above research directions suggest that the challenges can be effectively tackled and that innovative techniques can be proposed and integrated into health informatics systems.

7 Conclusions

The use of online SM platforms is effective in providing early warnings and raising awareness of outbreaks. SM can provide real-time updates and information on outbreaks, as well as connect individuals to health professionals and resources. SM-based surveillance systems, combined with ML and DL approaches, have demonstrated great potential for health organizations. In this paper, the discussions regarding epidemic outbreaks on SM have been highlighted. Applying DL to SM analysis for epidemic outbreaks has become a popular research topic recently. In this study, various ML and DL approaches and their applications in SM analysis have been outlined. For various SM analysis tasks, many of these ML and DL techniques have achieved state-of-the-art results. In the near future, with the advances in SM analysis and DL applications, more exciting research in DL for epidemic outbreaks in SM is expected.

Despite the potential benefits of SM-based surveillance methods and ML/DL techniques for epidemic outbreak detection, there are also some limitations and challenges that need to be addressed (discussed above). These include issues related to data quality, privacy concerns, and the need for effective validation and verification of results.

In conclusion, SM-based surveillance methods combined with ML/DL techniques have shown promise for the early detection of epidemic outbreaks. While there are still some challenges to overcome, these methods have the potential to improve public health responses to disease outbreaks and save lives.

Acknowledgement: The authors are thankful to the Deanship of Scientific Research at Najran University for funding this work, under the Research Groups Funding Program.

Funding Statement: The authors are thankful to the Deanship of Scientific Research at Najran University for funding this work, under the Research Groups Funding Program Grant Code (NU/RG/SERC/12/27).

Author Contributions: The authors confirm their contribution to the paper as follows: study conception and design: Samina Amin, Muhammad Ali Zeb, Hani Alshahrani; data collection: Mohammed Hamdi, Mohammad Alsulami, Asadullah Shaikh; analysis and interpretation of results: Samina Amin, Muhammad Ali Zeb, Hani Alshahrani; draft manuscript preparation: Mohammed Hamdi, Mohammad Alsulami, Asadullah Shaikh. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: The datasets used to support the analysis of this study are available within the paper.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.