COVID-19 Epidemic Models: A Study from Georgia State in the USA

The purpose of this study was to identify gender-specific differences and the best-fit coronavirus (covid-19) model for infected people in Georgia. Chi-squared tests, ANOVA, logistic regression, and Poisson and negative binomial regression models were used to analyze covid-19 data obtained from the Georgia Department of Public Health. The differences among the mean ages of deaths were significant for overall underlying conditions (P = 0.0248) and for ‘no’ and ‘unknown’ medical conditions (P = 0.0196). The covariates regions, minimum age, maximum age, and average age had a significant effect (P < 0.0001). The negative binomial regression model fit the death curve better than the Poisson regression model obtained by the GLM method. The findings will help to determine gender-specific future virus models for effective interventions, and they can be generalized to populations with similar geographic and racial/ethnic characteristics.


Introduction
Covid-19 is the novel coronavirus, which was first found in the city of Wuhan in China in November of 2019. The first source of interaction with the disease was found in a seafood market.
Wuhan is an important hub in China and contains an international airport, which may have rapidly increased the spread of the virus [1]. As with Middle East Respiratory Syndrome and Severe Acute Respiratory Syndrome, the original host of the novel 2019 coronavirus was a bat. The bat was said to have infected a pangolin, and humans were then infected from the pangolin through the process of zoonosis [2]. Because the novel covid-19 has the potential to spread rapidly, super-spreading can exponentially increase the number of individuals infected by the virus [3]. There are mainly two ways in which the disease can spread, one of them being the fecal-oral route. Fecal-oral transmission occurs when proper hygiene is not practiced after an individual discards feces. Contamination of common bathroom areas can increase the spread of the disease through possible self-inoculation. Droplets from the mucous membranes can also transmit the virus [4]. Having good hygiene and washing hands after bathroom usage can help decrease the risk of infection.
Covid-19 has a high rate of spreading, causing roughly three people to become infected from each infected person. Symptoms of covid-19 include respiratory issues such as difficulty in breathing.
Covid-19 is known to attack the alveoli and cause injury to the surfactants in the alveoli, causing alveolar collapse through the dramatic increase in surface tension [5]. Surfactants are secreted by the type 2 cells in the alveoli and contain proteins and lipids.
The majority of surfactant content is lipid [6]. The phospholipids present in the surfactants greatly improve the ability to reduce surface tension in the alveoli. Covid-19 can enter the body through the mucous membranes. Receptors in the body bind to the spikes on the virus, allowing it to enter the cell. Once it enters the cell, the virus can replicate itself, creating more viral particles, which can impair the body.
Elderly individuals are at the highest risk from the virus, with men infected more often than women. Having other health conditions such as diabetes or cardiovascular disease significantly increases the chance of contracting covid-19 [7]. Having a strong immune system seems to be of utmost importance in fighting covid-19.
As mentioned in [8], individuals who are 80 years of age or older have a 12.15% higher chance of contracting covid-19 than middle-aged individuals in the age group of 50 years or older, who have a 1.25% chance of getting covid-19.
This figure continues to decrease as age decreases. Having pre-existing health conditions and being immunocompromised can increase the likelihood of getting covid-19 [8]. Older people with other medical conditions such as asthma, diabetes, or heart disease may be more vulnerable to becoming severely ill [9].
There are numerous ways to decrease the rate of spread of the virus. One effective way is to gradually reduce and ultimately cancel mass gatherings. Studies have shown that a prominent route of virus transmission is through mass gatherings where individuals are close together in an area, whether at an indoor or an outdoor event. Efforts to prevent high transmission of the virus resulted in the closure of sporting events and music events such as concerts, and even small group gatherings are highly discouraged. Temporarily closing places of worship such as churches has also been implemented to help prevent the spread of the virus [10].
As mentioned by Liu et al. [7], lung infections are a prominent symptom of covid-19. Other symptoms include nasal congestion, dry cough, fever, diarrhea, shortness of breath, and runny nose, among others [11]. Symptoms usually appear within two to fourteen days after contact with the virus [12]. There is currently no vaccine to prevent covid-19, and the best way to prevent illness is to avoid being exposed to this virus [13].
During the rapid progress of the covid-19 pandemic, a huge amount of data has been collected from thousands of subjects, which raises the question of how to analyse such data and communicate the science to the public. Statistical and computational techniques are very useful for understanding such data and drawing scientific conclusions. There is an urgent need to apply statistical methods to covid-19 epidemic data to learn the patterns of disease progression, intervention, and prevention. Although no direct preventive method exists at this initial stage, recorded data can reveal how the variability and predictability of the disease depend on demographics, and thus inform interventions to reduce its impact. Covid-19 data are collected by many agencies, including hospitals, clinics, public health labs, and health organizations. The immediate release of such data for public access will accelerate research to find interventions, such as a vaccine or preventive medicine, to stop the spread of this pandemic.
Fortunately, we obtained limited publicly accessible demographic data from the Georgia Department of Public Health (GDPH) [14].
Using these data, we investigated the appropriate statistical methods and algorithms to visualize disease occurrences/confirmed cases, recovery cases/alive, and total deaths. This study aimed to explore whether gender-specific differences exist in GDPH covid-19 data and to obtain the best-fit statistical model for death occurrences. To this end, the study will (i) identify infected confirmed cases among males and females with covid-19 through descriptive analysis of accessible sociodemographic variables, (ii) conduct a test of hypothesis for ages of deaths and a multiple comparison test against covariates such as gender and underlying medical conditions, (iii) perform a Pearson chi-square test to check the independence of gender and underlying medical conditions, (iv) carry out a logistic regression with certain covariates for individual-level data, and (v) use a generalized linear model to build a best-fit model from aggregate-level data on the number of deaths.

Data source/variables and study population
The data for 156 counties were collected from the GDPH. The age variable was continuous and was grouped into three subgroups: age1 (20-40 years), age2 (41-60 years), and age3 (61 years and above). The counties were grouped at the aggregate level by their total number of confirmed cases into four subgroups (0 < region1 < 100; 100 ≤ region2 < 200; 200 ≤ region3 < 300; 300 ≤ region4) to find the distributional differences of covid-19 occurrences among the subgroups. Logistic regression was used to illustrate the odds of an event (deaths/alive), given some demographic covariates. Poisson and negative binomial regression models [15,16] were used to find the best-fit death model based on aggregate-level data.
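The grouping rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `age_group` and `region_group` are hypothetical, and the treatment of fractional ages (for example, 40.5 years falling into age1) and of ages below 20 is an assumption, since the text does not specify them.

```python
def age_group(age):
    """Map an age in years to the study's subgroup label.

    Assumption: fractional ages are assigned by half-open intervals,
    and ages below 20 (not defined in the text) return None.
    """
    if 20 <= age < 41:
        return "age1"   # 20-40 years
    if 41 <= age < 61:
        return "age2"   # 41-60 years
    if age >= 61:
        return "age3"   # 61 years and above
    return None


def region_group(confirmed_cases):
    """Map a county's total confirmed cases to region1..region4."""
    if confirmed_cases <= 0:
        return None     # the text defines region1 as strictly above 0
    if confirmed_cases < 100:
        return "region1"
    if confirmed_cases < 200:
        return "region2"
    if confirmed_cases < 300:
        return "region3"
    return "region4"
```

For instance, a county with exactly 100 confirmed cases falls in region2, because the lower bound of each region is inclusive.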

Sample size calculations
The sample size was calculated using the G*Power software package [17]. It was determined that 159 participants, 53 in each independent group, were sufficient to compare mean differences of continuous measurements when running a one-way analysis of variance (ANOVA) test. It was estimated that a total of 108 subjects would be sufficient to detect a statistically significant relationship between two discrete variables when running a Pearson's chi-squared test for a 2 x 3 contingency table.
The data (n1 = 12,159 total confirmed cases, n2 = 2,479 hospitalized, and n3 = 428 deaths) came from 156 counties, with deaths observed in 81 of them. It was calculated that 428 subjects would be large enough for logistic regression to compare the gender-specific differences. All the calculations were based on alpha (α) = 0.05, power = 80%, and a two-sided testing procedure.

Statistical analysis
The R-GUI software package [18] was used to perform analysis.
Since the dependent variable was categorical (covid-19 positive and negative), logistic regression was performed to calculate odds ratios and their 95% confidence intervals and thereby determine the association between risk factors and the occurrence of covid-19.
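As a minimal illustration of how an odds ratio and its 95% confidence interval are obtained, consider the simplest case of one binary risk factor and one binary outcome. In that case the logistic regression odds ratio equals the cross-product ratio of the 2 x 2 table, and a Wald interval can be computed on the log scale. The function name `odds_ratio_ci` and the counts in the usage note are hypothetical and are not taken from the GDPH data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2 x 2 table.

    a: exposed with event      b: exposed without event
    c: unexposed with event    d: unexposed without event
    z = 1.96 gives an approximate 95% interval.
    All four cells must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

With hypothetical counts, `odds_ratio_ci(20, 80, 10, 90)` returns an odds ratio of 2.25 together with its interval. An interval that excludes 1.0 indicates a significant association at the 5% level. Models with several covariates, as used in this study, require an iterative fit rather than this closed form.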
For count data, the appropriate regression models are Poisson regression and negative binomial regression. Generalized Linear Models (GLM) can incorporate binary data, count data, and skewed data, modeling the response variable as a function of covariates under exponential-family assumptions such as binomial, Poisson, negative binomial, and others. Khan et al. [19] used the GLM to obtain inferences about parameters under three sampling plans. The Poisson and negative binomial regression models were used to analyze the covid-19 deaths data at the aggregate level. The ages of deaths for both males and females were grouped by underlying medical condition. ANOVA was performed on each gender separately, and also on both genders combined, to detect whether the mean ages of deaths differed significantly across underlying medical conditions. A Pearson chi-square test was used to detect whether there is a significant relationship between gender and underlying medical conditions.
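The Pearson chi-square test of independence for a 2 x 3 table (gender by medical-condition category) can be sketched as follows. This is a stdlib-only illustration, not the R code used in the study. The function name `pearson_chi_square` is hypothetical, and the closed-form p-value `exp(-x/2)` is valid only when the table has 2 degrees of freedom, as a 2 x 3 table does.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic for an r x c table of counts.

    Returns (statistic, degrees_of_freedom, p_value). The p-value is
    computed only for df == 2, where the chi-square survival function
    reduces to exp(-x / 2); otherwise None is returned for it.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    p_value = math.exp(-statistic / 2) if df == 2 else None
    return statistic, df, p_value
```

A table whose rows are exactly proportional, such as the hypothetical `[[10, 20, 30], [20, 40, 60]]`, yields a statistic of 0 and a p-value of 1, meaning no evidence against independence.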

Results
This study investigated gender differences associated with covid-19. No significant regression estimates were found for either males or females when running the logistic regression on the individual-level data. In a logistic regression for male deaths with covariates age, MC1, and MC2, and with MC0 as the referent group, both MC1 and MC2 were protective, and MC1 showed a very low likelihood of death at the individual level.
The age variable was grouped into age1, age2, and age3 to detect whether any subgroup had a higher likelihood of covid-19 death. With age1 as the referent, age2 had a very low likelihood of death, and MC1 retained a higher chance of survival than MC2 with MC0 as the referent.
When the aggregate-level data were partitioned into four regions (region1, region2, region3, region4) and analyzed with logistic regression, region2 and region3 were significant for the minimum and maximum age distributions. Region4 was significant for the average age distribution in the aggregate analysis.
The negative binomial regression model is appropriate when the outcome variable is a count and the dispersion parameter is much higher than the location parameter. Running the negative binomial regression model with the covariates (regions, minimum, maximum, and average age) on the aggregate-level data, we found that region2, region3, and region4 were highly significant with region1 as the referent group. The odds ratios and confidence intervals indicated a high likelihood of deaths in region2, region3, and region4 compared with the referent group region1 in the aggregate-level analysis. When running the logistic regression for deaths with covariates region2, region3, and region4 at the aggregate level, the odds ratios were below 1.0, which indicates that the odds of coronavirus exposure among patients were lower and protective against the disease. Region4 was found to be more protective against the disease than region2 and region3 [Tables 1 & 2].
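A quick way to see why a negative binomial model may be preferred over a Poisson model is to compare the sample variance of the counts with their mean, since the Poisson distribution assumes the two are equal. The sketch below is a crude, unconditional check and not the model-based comparison reported in this study. The function name `dispersion_ratio` and the death counts in the usage note are hypothetical.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Ratio of sample variance to sample mean for count data.

    A ratio near 1 is consistent with a Poisson model; a ratio well
    above 1 suggests overdispersion, for which a negative binomial
    model is usually a better choice.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean of counts is zero; ratio is undefined")
    return variance(counts) / m
```

For hypothetical county death counts such as `[0, 0, 1, 2, 30]`, the ratio is far above 1, so a Poisson model would understate the variability. A formal comparison would fit both models with the covariates and compare their fit.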

The stacked bar chart shows one tall bar with no deaths or few deaths. This bar represents patients who were unable to report the county in which they lived. The chart indicates that counties with tall bars also had high numbers of covid-19 cases and deaths. Deaths and alive cases are marked in red and white, respectively.
A higher proportion of red indicates a higher number of deaths in the county.

Discussion and Future Directions
The covid-19 pandemic is a global public health threat and is the greatest challenge we have faced since World War Two [20]. It has the potential to create devastating social, economic, and political crises that leave deep scars. The World Health Organization defines public health surveillance as "the continuous, systematic collection, analysis, and interpretation of health-related data needed for the planning, implementation, and evaluation of public health practice" and calls it the "bedrock of outbreak and epidemic response" [21]. As the covid-19 pandemic has progressed, the effectiveness of national efforts to combat the virus has hinged on the ability of governments to measure its spread and use that information to target their public health efforts.
Due to a lack of testing, monitoring and the resulting uncertainty about where covid-19 is spreading, national governments have deemed it necessary to put their entire populations on lockdown.
The use of large public health data, especially biological specimens, will be extremely valuable for developing biomarkers for outcomes research, quality assurance, public health surveillance, and other beneficial purposes. When a large volume of recorded data becomes available to the public, these analytical methods will help researchers and public health practitioners understand the nature of the data and analyze it rigorously.
Combining statistical methods and computer-based algorithms can play a significant part in generating statistical probabilistic models. The modelling approaches will provide an understanding of existing covid-19 data and help measure the risk of future pandemics in rural and urban communities.
Breaking the aggregate data down into regions and applying negative binomial regression to the number of deaths would be an appropriate method for modelling future pandemic risk in rural or urban areas. The findings of this study can be extended to identify infected individuals for interventions and to develop policy briefs for future pandemics.