More ggplot2: plot types

Author

Jeremy Van Cleve

Published

October 22, 2024

Announcements

  • Continue (start?!) brainstorming the dataset you want to use for your figures and presentation.
  • I’m always available to help with picking a dataset and analyzing it.

Outline for today

  • Smoothed line plots
  • Plotting distributions
    • Bar plots
    • Histograms
    • Box plots

The power of a package like ggplot2 is not in the fact that it can do many different kinds of plots but rather that creating these plots is a logical processes. Changing from one plot type to another often simply requires just changing the “geometry” object and potentially the “aesthetics”, or how variables map to different parts of the geometry.

Nevertheless, ggplot2 does allow one to create many different kinds of plots and understanding how to use these plots and when they are useful is instructive for using other visualization packages in R and other languages.

Reading in some COVID-19 data from the CDC

We’ll work with COVID-19 weekly rates (per 100k people) of laboratory-confirmed COVID-19 hospitalizations from the CDC COVID-NET surveillance project, which gathers data from 13 states.

library(tidyverse)
library(RSocrata) 

# Read in data and clean up some stuff
covid_net_hosps = read.socrata("https://data.cdc.gov/api/odata/v4/6jg4-xsqq") |> 
  as_tibble() |>
  rename(weekenddate = X_weekenddate, age = agecategory_legend, sex = sex_label, race = race_label) |>
  mutate(weekenddate = ymd(weekenddate)) |>
  arrange(weekenddate) |>
  mutate(month = factor(months(weekenddate), 
                        levels = c("January", "February", "March", "April", "May", "June", 
                                   "July", "August", "September", "October", "November", "December")))

Let’s work on a slice of these data with just the cases in Colorado, Georgia, and Tennessee for 2021:

co_ga_tn_cases_2021 = 
  covid_net_hosps |>
  filter(state %in% c("Colorado", "Georgia", "Tennessee")) |>
  filter(year(weekenddate) == 2021)

co_ga_tn_cases_2021
# A tibble: 6,864 × 10
   state    season weekenddate age   sex   race  type  weeklyrate cumulativerate
   <chr>    <chr>  <date>      <chr> <chr> <chr> <chr>      <dbl>          <dbl>
 1 Tenness… 2020-… 2021-01-02  All   All   All   Crud…       30.1          271. 
 2 Tenness… 2020-… 2021-01-02  All   All   Hisp… Crud…       29.4          196. 
 3 Tenness… 2020-… 2021-01-02  All   All   A/PI… Crud…       10.1          123. 
 4 Tenness… 2020-… 2021-01-02  All   All   AI/A… Crud…        0             67.4
 5 Tenness… 2020-… 2021-01-02  All   All   Blac… Crud…       31.5          354. 
 6 Tenness… 2020-… 2021-01-02  All   All   Whit… Crud…       30.7          263. 
 7 Tenness… 2020-… 2021-01-02  All   Fema… All   Crud…       27.4          255. 
 8 Tenness… 2020-… 2021-01-02  All   Male  All   Crud…       32.8          288. 
 9 Tenness… 2020-… 2021-01-02  ≥18 … All   All   Crud…       38.3          347. 
10 Tenness… 2020-… 2021-01-02  0-17… All   All   Crud…        2             10.8
# ℹ 6,854 more rows
# ℹ 1 more variable: month <fct>

Note the use of the match operator %in%. It’s a nice way to say that an element should exactly match one of a list of values.

Smoothed line plots

In previous weeks, you have produced simple scatter and line plots. With these the KY, WV, and TN case data, we can produce a scatter plot with geom_point.

co_ga_tn_plot = co_ga_tn_cases_2021 |> 
  filter(age == "All", race == "All", sex == "All") |>
  ggplot(aes(x=weekenddate, y=weeklyrate, color=state))
co_ga_tn_plot + geom_point()

We can produce a line through the points by adding geom_line.

co_ga_tn_plot + geom_point() + geom_line()

To get a better sense of the trend, you might want a trend line like a linear regression. R has a very sophisticated regression framework but accessing it in ggplot2 only requires using the geom_smooth() function.

co_ga_tn_plot + 
  geom_point(alpha = 0.3) + # alpha controls transparency: 0 = clear; 1 = opaque
  geom_smooth(lwd = 0.8) # lwd = line width
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Obviously, geom_smooth() produced a nice smooth line, but it’s not linear. Taking a closer look at the geom_smooth() function help, you can see that the “statistic” the function uses is stat_smooth(). Generally in ggplot2, each geometry has a default statistic associated. Histograms have a “count” statistic associated, scatter plots (geom_point) have an “identity” statistic (i.e., just return the coordinates of the point), etc. The statistic essentially tells ggplot how to map the underlying data to the variable that you specify in the aesthetic using aes.

Looking more closely at the help, geom_smooth() uses the method = loess, which fits a polynomial (default is a quadratic) to each point of the data using least squares. To fit instead a simple straight line, change method = "lm", which just uses the linear model lm() function:

co_ga_tn_plot + 
  geom_point(alpha = 0.3) +
  geom_smooth(lwd = 0.8, method = 'lm')
`geom_smooth()` using formula = 'y ~ x'

Returning to the default loess regression line, you can make the line smoother or more “wiggly” by changing the span argument, which is the fraction of points used to fit each local regression. For example, changing the span from the default of 0.75 to 0.2 produces

co_ga_tn_plot + 
  geom_point(alpha = 0.3) + 
  geom_smooth(lwd = 0.8, span = 0.2)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Another default of the geom_smooth() function is to produce a 95% confidence interval, which is represented in the gray bands. To get rid of this, just set se = FALSE:

co_ga_tn_plot + 
  geom_point(alpha = 0.3) +
  geom_smooth(lwd = 0.8, span = 0.2, se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Plotting distributions

Almost certainly, your data will contain multiple samples or replicates of a measurement. These samples produce a distribution, and analyzing these data often involves asking questions about this distribution, such as what is its mean, median, standard deviation, etc. Visualizing the distribution is also crucial and there are a number of plot type that are used for this task.

Bar plots

If your data has discrete categories (i.e., “categorical” data), then you may want a simple bar plot. As an example, we can create a bar plot with how many times hospitalizations per 100k exceeded 50 for each racial/ethnic category.

covid_net_hosps_exceed_50_per_100k = 
  covid_net_hosps |>
  filter(age == "All", race != "All", sex == "All") |>
  mutate(exceed_50_per_100k = weeklyrate >= 50) |>
  filter(exceed_50_per_100k == TRUE)

covid_net_hosps_exceed_50_per_100k
# A tibble: 202 × 11
   state    season weekenddate age   sex   race  type  weeklyrate cumulativerate
   <chr>    <chr>  <date>      <chr> <chr> <chr> <chr>      <dbl>          <dbl>
 1 Connect… 2019-… 2020-03-28  All   All   Blac… Crud…       51.5           61.6
 2 Michigan 2019-… 2020-03-28  All   All   Blac… Crud…       65.2           82  
 3 Connect… 2019-… 2020-04-04  All   All   Blac… Crud…       78            140. 
 4 Michigan 2019-… 2020-04-04  All   All   Blac… Crud…       91.8          174. 
 5 Connect… 2019-… 2020-04-11  All   All   Blac… Crud…       83.4          223  
 6 Iowa     2019-… 2020-04-11  All   All   A/PI… Crud…       70.4           70.4
 7 Iowa     2019-… 2020-04-11  All   All   Hisp… Crud…       65.1           65.1
 8 Connect… 2019-… 2020-04-18  All   All   Blac… Crud…       75.6          299. 
 9 Iowa     2019-… 2020-04-18  All   All   A/PI… Crud…      141.           211. 
10 Iowa     2019-… 2020-04-18  All   All   Hisp… Crud…       97.6          163. 
# ℹ 192 more rows
# ℹ 2 more variables: month <fct>, exceed_50_per_100k <lgl>

Each row of this table is a different state. We can plot these data in a bar plot using geom_bar(), which uses the count statistic on the data and thus plots the number of time where a state exceeds 50 hospitalizations per 100k for each racial/ethnic category.

covid_net_hosps_exceed_50_per_100k |> 
  ggplot() +
  geom_bar(aes(x = exceed_50_per_100k))

To see which racial/ethnic category account for those weeks where the cases exceed 50 per 100k, we can can color the portion of the bar according to the category by just adding the fill aesthetic. This produces a so-called “stacked” bar plot.

covid_net_hosps_exceed_50_per_100k |> 
  ggplot() +
  geom_bar(aes(x = exceed_50_per_100k, fill = race))

Hospitalizations clearly hit racial/ethnic minorities the hardest.

If you do not want the bars stacked by rather placed side by side, use position = "dodge" in geom_bar():

covid_net_hosps_exceed_50_per_100k |> 
  ggplot() +
  geom_bar(aes(x = exceed_50_per_100k, fill = race), position = "dodge")

Other options for position are identity, which just places the bars on top of each other (use the alpha option to make the bars easier to see in this case), and fill, which makes the bars of equal height and the y-axis measure the fraction in each category.

covid_net_hosps_exceed_50_per_100k |> 
  ggplot() +
  geom_bar(aes(x = exceed_50_per_100k, fill = race), position = "fill")

Finally, you might notice the label on the tick mark on the above plots, TRUE. This is because the bar plot is counting the number of each discrete value it finds in the exceed_50_per_100k column. Since we only have rows in this table where exceed_50_per_100k == TRUE, we could equally just set x=state in the aesthetic and get a bar plot separated by state.

covid_net_hosps_exceed_50_per_100k |> 
  ggplot() +
  geom_bar(aes(x = state, fill = race)) +
  coord_flip() # coordinate flip makes reading the state names easier

Histograms

To get a histogram, we use the geom_histogram function. We can for example look at the distribution of hospitalization rates across all the states.

covid_net_hosps |> 
  filter(age == "All", race != "All", sex == "All") |>
  ggplot() +
  geom_histogram(aes(x = weeklyrate))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

With a distribution that falls off quickly after a big spike, it can useful to put the y-axis on a log10 scale.

covid_net_hosps |> 
  filter(age == "All", race != "All", sex == "All") |>
  ggplot() +
  geom_histogram(aes(x = log10(1 + weeklyrate)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can add additional information, like the race/ethnic category.

covid_net_hosps |> 
  filter(age == "All", race != "All", sex == "All") |>
  ggplot() +
  geom_histogram(aes(x = log10(1 + weeklyrate), fill=race))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This defaults to a stacked histogram, so switch to identity to better see how the two sample size categories compare (using alpha to get some transparency).

covid_net_hosps |> 
  filter(age == "All", race != "All", sex == "All") |>
  ggplot() +
  geom_histogram(aes(x = log10(1 + weeklyrate), fill=race), position = "identity", alpha = 0.5)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

But maybe thats too much data to stack together, so let’s just compare “White, non-Hispanic” and “Black, non-Hispanic”

covid_net_hosps |> 
  filter(age == "All", race == "Black, non-Hispanic" | race == "White, non-Hispanic", sex == "All") |>
  ggplot() +
  geom_histogram(aes(x = log10(1 + weeklyrate), fill=race), position = "identity", alpha = 0.5)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now we can see how Black folks have been hit harder in these regions but also how some of these regions have fewer Black folks, which likely causes the spike at zero.

You can just draw lines instead of filled bars using geom_freqpoly()

covid_net_hosps |> 
  filter(age == "All", race == "Black, non-Hispanic" | race == "White, non-Hispanic", sex == "All") |>
  ggplot() +
  geom_freqpoly(aes(x = log10(1 + weeklyrate), color=race))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You might have noticed that R keeps reminding you to adjust the binwidth. This is simply the width of the bins on the x-axis. For example, a binwidth of 0.2 makes a smooth histogram.

covid_net_hosps |> 
  filter(age == "All", race == "Black, non-Hispanic" | race == "White, non-Hispanic", sex == "All") |>
  ggplot() +
  geom_freqpoly(aes(x = log10(1 + weeklyrate), color=race), binwidth = 0.2)

Suppose that you want to smooth the above plot so that it looks like some continuous distribution. The statistical method used to do this is called a “kernel density estimate” or KDE. Essentially, a basic KDE tries to combine Gaussian distributions together in a way to approximate empirical distribution. Its basically the continuous version of a historgram (which has discrete bins). The geometry function to use for a KDE is geom_density.

covid_net_hosps |> 
  filter(age == "All", race == "Black, non-Hispanic" | race == "White, non-Hispanic", sex == "All") |>
  ggplot() +
  geom_density(aes(x = log10(1 + weeklyrate), fill = race), alpha = 0.3)

Notice how the y-axis has a different scale now. This is because the area under the curve of the KDE must sum to one and the y-axis measures the density, or approximately the probability, of a specific log10 number of cases per 100k. The histogram can be normalized too by dividing each curve by the total number of days in the dataset.

Box plots

Suppose now that you want to plot the distribution of the hospitalization rates per 100k for each state.

covid_net_hosps |> 
  filter(age == "All", race == "All", sex == "All") |>
  ggplot() +
  geom_histogram(aes(x = log10(1 + weeklyrate), fill = state), alpha = 0.3)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This many states stacked on one another is probably too many to make the graph very readable. You need a more condensed way to plot the distribution for each state. First, you can just plot the points directly.

covid_net_hosps |> 
  filter(age == "All", race == "All", sex == "All") |>
  ggplot() +
  geom_point(aes(x = state, y = log10(1 + weeklyrate))) + 
  coord_flip()

This is better, but many of the points overlap. To get around this, you can use geom_jitter(), which spreads the points out (and some alpha).

covid_net_hosps |> 
  filter(age == "All", race == "All", sex == "All") |>
  ggplot(aes(x = state, y = log10(1 + weeklyrate))) +
  geom_jitter(alpha = 0.5) +
  coord_flip()

That is sort of an improvement, but it’s still hard to see patterns because the points are still rather dense One plot that is particularly good at summarizing distributions is a box plot. Typically, a box plot show the median, the interquartile interval as a box (middle 50% of the data) and “whiskers” that extend to data points within 1.5 times the interquartile interval of the box. The only points plotted are outliers.

For the data on hospitalization rate per 100k for each state, the box plots are:

covid_net_hosps |> 
  filter(age == "All", race == "All", sex == "All") |>
  ggplot(aes(x = state, y = log10(1 + weeklyrate))) +
  geom_boxplot() +
  coord_flip()

The box plots make it much easier to see the central tendency (median) as well as how dispersed each distribution is. For example, the above plot reveals that states like Iowa have some of the highest median hospitalization rates. You can modify boxplot properties including the color of the outliers using outlier.color.

covid_net_hosps |> 
  filter(age == "All", race == "All", sex == "All") |>
  ggplot(aes(x = state, y = log10(1 + weeklyrate))) +
  geom_boxplot(outlier.color = "red") +
  coord_flip()

It’s easier to read the box plots if they are organized by the median value. To do this, you the reorder function for the x-axis in the aesthetic argument and replace x = state with x = reorder(state, cases_per_100k, FUN = median).

covid_net_hosps |> 
  filter(age == "All", race == "All", sex == "All") |>
  ggplot(aes(x = reorder(state, weeklyrate, FUN = median), y = log10(1 + weeklyrate))) +
  geom_boxplot(outlier.color = "red") +
  coord_flip()

This reorder function helps with our bar plot from above with the number of times there were more than 50 hospitalizations per 100k. This example is a little more complicated because we need to tell reorder which variable we want to sort the states by but the variable we want is actually the count of the number of times there are more than 50 hospitalizations per 100k, which the bar plot calculates. So we have to create that column first by first using group_by then mutate(count = n()) to count the number of rows for each state. Now we can reorder.

covid_net_hosps_exceed_50_per_100k |> 
  group_by(state) |>
  mutate(count = n()) |>
  ggplot() +
  geom_bar(aes(x = reorder(state, count), fill = race)) +
  coord_flip() # coordinate flip makes reading the state names easier

Finally, a variant of the box plot is called a “violin plot”. Violin plots show the probability density using KDE. Let’s plot them for a subset of states

covid_net_hosps |> 
  filter(age == "All", race == "All", sex == "All") |>
  ggplot(aes(x = reorder(state, log10(1 + weeklyrate), FUN = median), y = log10(1 + weeklyrate))) +
  geom_violin() +
  coord_flip()

Lab

Problems

  1. Use the gapminder data (library(gapminder)):

    • Create a box plot where the x-axis is year and each box plot shows the distribution of life expectancy for each year.
    • Add points with jitter on top of the box plot with the color according to continent.
  2. Use the gapminder data:

    • Plot life expectancy on the x-axis and GDP per capita on the y-axis (plot the points).
    • Add a regression line (without confidence interval) for each continent where each line is colored by continent.
    • You may use any regression method you prefer.
  3. Use the imprinting data from Babak et al. (2015) (babak-etal-2015_imprinted-mouse_tidy.csv):

    • Plot the distribution of gene expression values that are positive (expression from the paternal chromosome) and the distribution of expression values that are negative (expression from the maternal chromosome) on the same plot.
    • Choose your favorite plot type for this question (histogram or KDE).
  4. Use the following CDC COVD-19 death data for this problem. It comes from the CDC WONDER database (https://wonder.cdc.gov/).

    cdc_data = read_tsv("cdc_covid19_mort_rates_state-age-sex_2020-2023.tsv")
    Rows: 234498 Columns: 7
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: "\t"
    chr  (3): State, Gender, Age
    dbl  (3): Year, Deaths, Deaths_per_100k
    date (1): Date
    
    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

    Create boxplots of the deaths per 100k for each age group in the data with the following features

    • Use cood_flip to have the age groups on the y-axis and the deaths on the x-axis
    • Order the boxplots by the ages from youngest to oldest.
    • Use facet_grid to make a grid of these plots with Gender across the columns and the year down the rows (hint: use the year function on the Date)
  5. Use CDC data from above.

    • Plot the Deaths_per_100k over time for each different age group using facet_wrap
      (first sum over both genders and all states to get a single rate for each age group and date).
    • Order the plots by age group from youngest to oldest.
    • What is most notable in the 2020 and 2021 years compared to seasonal flu, which tends to have a single peak in intensity in the winter?