library(tidyverse)
library(RSocrata)
# Read in data and clean up some stuff
= read.socrata("https://data.cdc.gov/api/odata/v4/6jg4-xsqq") |>
covid_net_hosps as_tibble() |>
rename(weekenddate = X_weekenddate, age = agecategory_legend, sex = sex_label, race = race_label) |>
mutate(weekenddate = ymd(weekenddate)) |>
arrange(weekenddate) |>
mutate(month = factor(months(weekenddate),
levels = c("January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December")))
More ggplot2
: plot types
Announcements
- Continue (start?!) brainstorming the dataset you want to use for your figures and presentation.
- I’m always available to help with picking a dataset and analyzing it.
Outline for today
- Smoothed line plots
- Plotting distributions
- Bar plots
- Histograms
- Box plots
The power of a package like ggplot2
is not in the fact that it can do many different kinds of plots but rather that creating these plots is a logical processes. Changing from one plot type to another often simply requires just changing the “geometry” object and potentially the “aesthetics”, or how variables map to different parts of the geometry.
Nevertheless, ggplot2
does allow one to create many different kinds of plots and understanding how to use these plots and when they are useful is instructive for using other visualization packages in R and other languages.
Reading in some COVID-19 data from the CDC
We’ll work with COVID-19 weekly rates (per 100k people) of laboratory-confirmed COVID-19 hospitalizations from the CDC COVID-NET surveillance project, which gathers data from 13 states.
Let’s work on a slice of these data with just the cases in Colorado, Georgia, and Tennessee for 2021:
=
co_ga_tn_cases_2021 |>
covid_net_hosps filter(state %in% c("Colorado", "Georgia", "Tennessee")) |>
filter(year(weekenddate) == 2021)
co_ga_tn_cases_2021
# A tibble: 6,864 × 10
state season weekenddate age sex race type weeklyrate cumulativerate
<chr> <chr> <date> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Tenness… 2020-… 2021-01-02 All All All Crud… 30.1 271.
2 Tenness… 2020-… 2021-01-02 All All Hisp… Crud… 29.4 196.
3 Tenness… 2020-… 2021-01-02 All All A/PI… Crud… 10.1 123.
4 Tenness… 2020-… 2021-01-02 All All AI/A… Crud… 0 67.4
5 Tenness… 2020-… 2021-01-02 All All Blac… Crud… 31.5 354.
6 Tenness… 2020-… 2021-01-02 All All Whit… Crud… 30.7 263.
7 Tenness… 2020-… 2021-01-02 All Fema… All Crud… 27.4 255.
8 Tenness… 2020-… 2021-01-02 All Male All Crud… 32.8 288.
9 Tenness… 2020-… 2021-01-02 ≥18 … All All Crud… 38.3 347.
10 Tenness… 2020-… 2021-01-02 0-17… All All Crud… 2 10.8
# ℹ 6,854 more rows
# ℹ 1 more variable: month <fct>
Note the use of the match operator %in%
. It’s a nice way to say that an element should exactly match one of a list of values.
Smoothed line plots
In previous weeks, you have produced simple scatter and line plots. With these the KY, WV, and TN case data, we can produce a scatter plot with geom_point
.
= co_ga_tn_cases_2021 |>
co_ga_tn_plot filter(age == "All", race == "All", sex == "All") |>
ggplot(aes(x=weekenddate, y=weeklyrate, color=state))
+ geom_point() co_ga_tn_plot
We can produce a line through the points by adding geom_line
.
+ geom_point() + geom_line() co_ga_tn_plot
To get a better sense of the trend, you might want a trend line like a linear regression. R has a very sophisticated regression framework but accessing it in ggplot2
only requires using the geom_smooth()
function.
+
co_ga_tn_plot geom_point(alpha = 0.3) + # alpha controls transparency: 0 = clear; 1 = opaque
geom_smooth(lwd = 0.8) # lwd = line width
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Obviously, geom_smooth()
produced a nice smooth line, but it’s not linear. Taking a closer look at the geom_smooth()
function help, you can see that the “statistic” the function uses is stat_smooth()
. Generally in ggplot2
, each geometry has a default statistic associated. Histograms have a “count” statistic associated, scatter plots (geom_point
) have an “identity” statistic (i.e., just return the coordinates of the point), etc. The statistic essentially tells ggplot
how to map the underlying data to the variable that you specify in the aesthetic using aes
.
Looking more closely at the help, geom_smooth()
uses the method = loess
, which fits a polynomial (default is a quadratic) to each point of the data using least squares. To fit instead a simple straight line, change method = "lm"
, which just uses the linear model lm()
function:
+
co_ga_tn_plot geom_point(alpha = 0.3) +
geom_smooth(lwd = 0.8, method = 'lm')
`geom_smooth()` using formula = 'y ~ x'
Returning to the default loess
regression line, you can make the line smoother or more “wiggly” by changing the span
argument, which is the fraction of points used to fit each local regression. For example, changing the span
from the default of 0.75 to 0.2 produces
+
co_ga_tn_plot geom_point(alpha = 0.3) +
geom_smooth(lwd = 0.8, span = 0.2)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Another default of the geom_smooth()
function is to produce a 95% confidence interval, which is represented in the gray bands. To get rid of this, just set se = FALSE
:
+
co_ga_tn_plot geom_point(alpha = 0.3) +
geom_smooth(lwd = 0.8, span = 0.2, se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Plotting distributions
Almost certainly, your data will contain multiple samples or replicates of a measurement. These samples produce a distribution, and analyzing these data often involves asking questions about this distribution, such as what is its mean, median, standard deviation, etc. Visualizing the distribution is also crucial and there are a number of plot type that are used for this task.
Bar plots
If your data has discrete categories (i.e., “categorical” data), then you may want a simple bar plot. As an example, we can create a bar plot with how many times hospitalizations per 100k exceeded 50 for each racial/ethnic category.
=
covid_net_hosps_exceed_50_per_100k |>
covid_net_hosps filter(age == "All", race != "All", sex == "All") |>
mutate(exceed_50_per_100k = weeklyrate >= 50) |>
filter(exceed_50_per_100k == TRUE)
covid_net_hosps_exceed_50_per_100k
# A tibble: 202 × 11
state season weekenddate age sex race type weeklyrate cumulativerate
<chr> <chr> <date> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Connect… 2019-… 2020-03-28 All All Blac… Crud… 51.5 61.6
2 Michigan 2019-… 2020-03-28 All All Blac… Crud… 65.2 82
3 Connect… 2019-… 2020-04-04 All All Blac… Crud… 78 140.
4 Michigan 2019-… 2020-04-04 All All Blac… Crud… 91.8 174.
5 Connect… 2019-… 2020-04-11 All All Blac… Crud… 83.4 223
6 Iowa 2019-… 2020-04-11 All All A/PI… Crud… 70.4 70.4
7 Iowa 2019-… 2020-04-11 All All Hisp… Crud… 65.1 65.1
8 Connect… 2019-… 2020-04-18 All All Blac… Crud… 75.6 299.
9 Iowa 2019-… 2020-04-18 All All A/PI… Crud… 141. 211.
10 Iowa 2019-… 2020-04-18 All All Hisp… Crud… 97.6 163.
# ℹ 192 more rows
# ℹ 2 more variables: month <fct>, exceed_50_per_100k <lgl>
Each row of this table is a different state. We can plot these data in a bar plot using geom_bar()
, which uses the count
statistic on the data and thus plots the number of time where a state exceeds 50 hospitalizations per 100k for each racial/ethnic category.
|>
covid_net_hosps_exceed_50_per_100k ggplot() +
geom_bar(aes(x = exceed_50_per_100k))
To see which racial/ethnic category account for those weeks where the cases exceed 50 per 100k, we can can color the portion of the bar according to the category by just adding the fill
aesthetic. This produces a so-called “stacked” bar plot.
|>
covid_net_hosps_exceed_50_per_100k ggplot() +
geom_bar(aes(x = exceed_50_per_100k, fill = race))
Hospitalizations clearly hit racial/ethnic minorities the hardest.
If you do not want the bars stacked by rather placed side by side, use position = "dodge"
in geom_bar()
:
|>
covid_net_hosps_exceed_50_per_100k ggplot() +
geom_bar(aes(x = exceed_50_per_100k, fill = race), position = "dodge")
Other options for position
are identity
, which just places the bars on top of each other (use the alpha
option to make the bars easier to see in this case), and fill
, which makes the bars of equal height and the y-axis measure the fraction in each category.
|>
covid_net_hosps_exceed_50_per_100k ggplot() +
geom_bar(aes(x = exceed_50_per_100k, fill = race), position = "fill")
Finally, you might notice the label on the tick mark on the above plots, TRUE
. This is because the bar plot is counting the number of each discrete value it finds in the exceed_50_per_100k
column. Since we only have rows in this table where exceed_50_per_100k == TRUE
, we could equally just set x=state
in the aesthetic and get a bar plot separated by state.
|>
covid_net_hosps_exceed_50_per_100k ggplot() +
geom_bar(aes(x = state, fill = race)) +
coord_flip() # coordinate flip makes reading the state names easier
Histograms
To get a histogram, we use the geom_histogram
function. We can for example look at the distribution of hospitalization rates across all the states.
|>
covid_net_hosps filter(age == "All", race != "All", sex == "All") |>
ggplot() +
geom_histogram(aes(x = weeklyrate))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
With a distribution that falls off quickly after a big spike, it can useful to put the y-axis on a log10 scale.
|>
covid_net_hosps filter(age == "All", race != "All", sex == "All") |>
ggplot() +
geom_histogram(aes(x = log10(1 + weeklyrate)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You can add additional information, like the race/ethnic category.
|>
covid_net_hosps filter(age == "All", race != "All", sex == "All") |>
ggplot() +
geom_histogram(aes(x = log10(1 + weeklyrate), fill=race))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This defaults to a stacked histogram, so switch to identity
to better see how the two sample size categories compare (using alpha
to get some transparency).
|>
covid_net_hosps filter(age == "All", race != "All", sex == "All") |>
ggplot() +
geom_histogram(aes(x = log10(1 + weeklyrate), fill=race), position = "identity", alpha = 0.5)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
But maybe thats too much data to stack together, so let’s just compare “White, non-Hispanic” and “Black, non-Hispanic”
|>
covid_net_hosps filter(age == "All", race == "Black, non-Hispanic" | race == "White, non-Hispanic", sex == "All") |>
ggplot() +
geom_histogram(aes(x = log10(1 + weeklyrate), fill=race), position = "identity", alpha = 0.5)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now we can see how Black folks have been hit harder in these regions but also how some of these regions have fewer Black folks, which likely causes the spike at zero.
You can just draw lines instead of filled bars using geom_freqpoly()
|>
covid_net_hosps filter(age == "All", race == "Black, non-Hispanic" | race == "White, non-Hispanic", sex == "All") |>
ggplot() +
geom_freqpoly(aes(x = log10(1 + weeklyrate), color=race))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You might have noticed that R keeps reminding you to adjust the binwidth
. This is simply the width of the bins on the x-axis. For example, a binwidth of 0.2 makes a smooth histogram.
|>
covid_net_hosps filter(age == "All", race == "Black, non-Hispanic" | race == "White, non-Hispanic", sex == "All") |>
ggplot() +
geom_freqpoly(aes(x = log10(1 + weeklyrate), color=race), binwidth = 0.2)
Suppose that you want to smooth the above plot so that it looks like some continuous distribution. The statistical method used to do this is called a “kernel density estimate” or KDE. Essentially, a basic KDE tries to combine Gaussian distributions together in a way to approximate empirical distribution. Its basically the continuous version of a historgram (which has discrete bins). The geometry function to use for a KDE is geom_density
.
|>
covid_net_hosps filter(age == "All", race == "Black, non-Hispanic" | race == "White, non-Hispanic", sex == "All") |>
ggplot() +
geom_density(aes(x = log10(1 + weeklyrate), fill = race), alpha = 0.3)
Notice how the y-axis has a different scale now. This is because the area under the curve of the KDE must sum to one and the y-axis measures the density, or approximately the probability, of a specific log10 number of cases per 100k. The histogram can be normalized too by dividing each curve by the total number of days in the dataset.
Box plots
Suppose now that you want to plot the distribution of the hospitalization rates per 100k for each state.
|>
covid_net_hosps filter(age == "All", race == "All", sex == "All") |>
ggplot() +
geom_histogram(aes(x = log10(1 + weeklyrate), fill = state), alpha = 0.3)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This many states stacked on one another is probably too many to make the graph very readable. You need a more condensed way to plot the distribution for each state. First, you can just plot the points directly.
|>
covid_net_hosps filter(age == "All", race == "All", sex == "All") |>
ggplot() +
geom_point(aes(x = state, y = log10(1 + weeklyrate))) +
coord_flip()
This is better, but many of the points overlap. To get around this, you can use geom_jitter()
, which spreads the points out (and some alpha
).
|>
covid_net_hosps filter(age == "All", race == "All", sex == "All") |>
ggplot(aes(x = state, y = log10(1 + weeklyrate))) +
geom_jitter(alpha = 0.5) +
coord_flip()
That is sort of an improvement, but it’s still hard to see patterns because the points are still rather dense One plot that is particularly good at summarizing distributions is a box plot. Typically, a box plot show the median, the interquartile interval as a box (middle 50% of the data) and “whiskers” that extend to data points within 1.5 times the interquartile interval of the box. The only points plotted are outliers.
For the data on hospitalization rate per 100k for each state, the box plots are:
|>
covid_net_hosps filter(age == "All", race == "All", sex == "All") |>
ggplot(aes(x = state, y = log10(1 + weeklyrate))) +
geom_boxplot() +
coord_flip()
The box plots make it much easier to see the central tendency (median) as well as how dispersed each distribution is. For example, the above plot reveals that states like Iowa have some of the highest median hospitalization rates. You can modify boxplot properties including the color of the outliers using outlier.color
.
|>
covid_net_hosps filter(age == "All", race == "All", sex == "All") |>
ggplot(aes(x = state, y = log10(1 + weeklyrate))) +
geom_boxplot(outlier.color = "red") +
coord_flip()
It’s easier to read the box plots if they are organized by the median value. To do this, you the reorder
function for the x-axis in the aesthetic argument and replace x = state
with x = reorder(state, cases_per_100k, FUN = median)
.
|>
covid_net_hosps filter(age == "All", race == "All", sex == "All") |>
ggplot(aes(x = reorder(state, weeklyrate, FUN = median), y = log10(1 + weeklyrate))) +
geom_boxplot(outlier.color = "red") +
coord_flip()
This reorder
function helps with our bar plot from above with the number of times there were more than 50 hospitalizations per 100k. This example is a little more complicated because we need to tell reorder
which variable we want to sort the states by but the variable we want is actually the count of the number of times there are more than 50 hospitalizations per 100k, which the bar plot calculates. So we have to create that column first by first using group_by
then mutate(count = n())
to count the number of rows for each state. Now we can reorder.
|>
covid_net_hosps_exceed_50_per_100k group_by(state) |>
mutate(count = n()) |>
ggplot() +
geom_bar(aes(x = reorder(state, count), fill = race)) +
coord_flip() # coordinate flip makes reading the state names easier
Finally, a variant of the box plot is called a “violin plot”. Violin plots show the probability density using KDE. Let’s plot them for a subset of states
|>
covid_net_hosps filter(age == "All", race == "All", sex == "All") |>
ggplot(aes(x = reorder(state, log10(1 + weeklyrate), FUN = median), y = log10(1 + weeklyrate))) +
geom_violin() +
coord_flip()
Lab
Problems
Use the
gapminder
data (library(gapminder)
):- Create a box plot where the x-axis is
year
and each box plot shows the distribution of life expectancy for each year. - Add points with jitter on top of the box plot with the color according to continent.
- Create a box plot where the x-axis is
Use the
gapminder
data:- Plot life expectancy on the x-axis and GDP per capita on the y-axis (plot the points).
- Add a regression line (without confidence interval) for each continent where each line is colored by continent.
- You may use any regression method you prefer.
Use the imprinting data from Babak et al. (2015) (
babak-etal-2015_imprinted-mouse_tidy.csv
):- Plot the distribution of gene expression values that are positive (expression from the paternal chromosome) and the distribution of expression values that are negative (expression from the maternal chromosome) on the same plot.
- Choose your favorite plot type for this question (histogram or KDE).
Use the following CDC COVD-19 death data for this problem. It comes from the CDC WONDER database (https://wonder.cdc.gov/).
= read_tsv("cdc_covid19_mort_rates_state-age-sex_2020-2023.tsv") cdc_data
Rows: 234498 Columns: 7 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "\t" chr (3): State, Gender, Age dbl (3): Year, Deaths, Deaths_per_100k date (1): Date ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Create boxplots of the deaths per 100k for each age group in the data with the following features
- Use
cood_flip
to have the age groups on the y-axis and the deaths on the x-axis - Order the boxplots by the ages from youngest to oldest.
- Use
facet_grid
to make a grid of these plots withGender
across the columns and the year down the rows (hint: use theyear
function on theDate
)
- Use
Use CDC data from above.
- Plot the
Deaths_per_100k
over time for each different age group usingfacet_wrap
(first sum over both genders and all states to get a single rate for each age group and date). - Order the plots by age group from youngest to oldest.
- What is most notable in the 2020 and 2021 years compared to seasonal flu, which tends to have a single peak in intensity in the winter?
- Plot the