library(tidyverse)
= read_csv("https://github.com/nytimes/covid-19-data/raw/master/rolling-averages/us-states.csv")
us = read_csv("https://github.com/vancleve/BIO621-DWVR/raw/main/us_census_population_totals_est_2020-2021.csv")
pop = us %>% left_join(pop %>% mutate(state = State, population = `2021`) %>% select(state, population)) %>%
us select(!contains("avg")) %>%
mutate(cases_per_100k = signif(cases/population*1e5, digits = 3)) %>%
mutate(deaths_per_100k = signif(deaths/population*1e5, digits = 3)) %>%
mutate(day = factor(weekdays(date),
levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))) %>%
mutate(month = factor(months(date),
levels = c("January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"))) %>%
filter(!is.na(population), cases >= 0, deaths >= 0) %>%
select(-population)
More ggplot2
: plot types
Announcements
- Sign up for project presentations on Canvas!
- Start brainstorming the dataset you want to use for your figures and presentation.
- I’m always available to help with picking a dataset and analyzing it.
Outline for today
- Smoothed line plots
- Plotting distributions
- Bar plots
- Histograms
- Strip plots
The power of a package like ggplot2
is not in the fact that it can do many different kinds of plots but rather that creating these plots is a logical processes. Changing from one plot type to another often simply requires just changing the “geometry” object and potentially the “aesthetics”, or how variables map to different parts of the geometry.
Nevertheless, ggplot2
does allow one to create many different kinds of plots and understanding how to use these plots and when they are useful is instructive for using other visualization packages in R and other languages.
Reading in some COVID-19 data from the New York Times
We’ll continue with the New York Times COVID-19 daily case and death data for the US.
The above code also loads in estimated population sizes for each state in 2021 from US Census data and gets rid of rows for which there aren’t population data. We also add a few columns for the day and month that will be helpful later and remove days where cases were negative (which was a way for states to correct prior mistakes in reporting).
Let’s work on a slice of these data with just the cases in Kentucky, West Virginia, and Tennessee for the months of August and September 2022:
=
ky_wv_tn_cases_jul_aug_2022 %>%
us select(-contains("deaths")) %>%
select(-geoid) %>%
filter(state %in% c("Kentucky", "West Virginia", "Tennessee")) %>%
filter(date >= "2022-07-01" & date <= "2022-08-31")
ky_wv_tn_cases_jul_aug_2022
# A tibble: 186 × 6
date state cases cases_per_100k day month
<date> <chr> <dbl> <dbl> <fct> <fct>
1 2022-07-01 West Virginia 710 39.8 Friday July
2 2022-07-01 Tennessee 3158 45.3 Friday July
3 2022-07-01 Kentucky 349 7.74 Friday July
4 2022-07-02 West Virginia 0 0 Saturday July
5 2022-07-02 Tennessee 0 0 Saturday July
6 2022-07-02 Kentucky 354 7.85 Saturday July
7 2022-07-03 West Virginia 0 0 Sunday July
8 2022-07-03 Tennessee 0 0 Sunday July
9 2022-07-03 Kentucky 209 4.63 Sunday July
10 2022-07-04 West Virginia 0 0 Monday July
# … with 176 more rows
Note the use of the match operator %in%
. It’s a nice way to say that an element should exactly match one of a list of values.
Smoothed line plots
In previous weeks, you have produced simple scatter and line plots. With these the KY, WV, and TN case data, we can produce a scatter plot with geom_point
.
= ky_wv_tn_cases_jul_aug_2022 %>%
ky_wv_tn_plot ggplot(aes(x=date, y=cases_per_100k, color=state))
+ geom_point() ky_wv_tn_plot
We can produce a line through the points by adding geom_line
.
+ geom_point() + geom_line() ky_wv_tn_plot
Producing a line through these points isn’t that helpful in showing a trend because the data is very spikey. This has to do with states often not reporting data on the weekends and then adding weekend data to the next business day. Some states only report data once a week!
To get a better sense of the trend, you might want a trend line like a linear regression. R has a very sophisticated regression framework but accessing it in ggplot2
only requires using the geom_smooth()
function.
+
ky_wv_tn_plot geom_point(alpha = 0.3) + # alpha controls transparency: 0 = clear; 1 = opaque
geom_smooth(lwd = 0.8) # lwd = line width
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
Obviously, geom_smooth()
produced a nice smooth line, but it’s not linear. Taking a closer look at the geom_smooth()
function help, you can see that the “statistic” the function uses is stat_smooth()
. Generally in ggplot2
, each geometry has a default statistic associated. Histograms have a “count” statistic associated, scatter plots (geom_point
) have an “identity” statistic (i.e., just return the coordinates of the point), etc. The statistic essentially tells ggplot
how to map the underlying data to the variable that you specify in the aesthetic using aes
.
Looking more closely at the help, geom_smooth()
uses the method = loess
, which fits a polynomial (default is a quadratic) to each point of the data using least squares. To fit instead a simple straight line, change method = "lm"
, which just uses the linear model lm()
function:
+
ky_wv_tn_plot geom_point(alpha = 0.3) +
geom_smooth(lwd = 0.8, method = 'lm')
`geom_smooth()` using formula 'y ~ x'
Returning to the default loess
regression line, you can make the line smoother or more “wiggly” by changing the span
argument, which is the fraction of points used to fit each local regression. For example, changing the span
from the default of 0.75 to 0.1 produces
+
ky_wv_tn_plot geom_point(alpha = 0.3) +
geom_smooth(lwd = 0.8, span = 0.1)
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
Another default of the geom_smooth()
function is to produce a 95% confidence interval, which is represented in the gray bands. To get rid of this, just set se = FALSE
:
+
ky_wv_tn_plot geom_point(alpha = 0.3) +
geom_smooth(lwd = 0.8, method = 'lm', se = FALSE)
`geom_smooth()` using formula 'y ~ x'
Plotting distributions
Almost certainly, your data will contain multiple samples or replicates of a measurement. These samples produce a distribution, and analyzing these data often involves asking questions about this distribution, such as what is its mean, median, standard deviation, etc. Visualizing the distribution is also crucial and there are a number of plot type that are used for this task.
Bar plots
If your data has discrete categories (i.e., “categorical” data), then you may want a simple bar plot. As an example, we can create a bar plot with how many days cases per 100k exceeded 100.
=
cases_exceed_100_per_100k %>%
us mutate(exceed_100_per_100k = cases_per_100k >= 100) %>%
filter(exceed_100_per_100k == TRUE)
%>% select(date, state, exceed_100_per_100k) cases_exceed_100_per_100k
# A tibble: 2,910 × 3
date state exceed_100_per_100k
<date> <chr> <lgl>
1 2020-10-05 Georgia TRUE
2 2020-10-07 South Dakota TRUE
3 2020-10-16 North Dakota TRUE
4 2020-10-19 Wisconsin TRUE
5 2020-10-20 North Dakota TRUE
6 2020-10-22 South Dakota TRUE
7 2020-10-22 North Dakota TRUE
8 2020-10-23 South Dakota TRUE
9 2020-10-23 North Dakota TRUE
10 2020-10-24 South Dakota TRUE
# … with 2,900 more rows
Each row of this table is now a different state. Plot these data in a bar plot using geom_bar()
, which uses the count
statistic on the data and thus plots the number of days where a state exceeds 100 cases per 100k.
%>%
cases_exceed_100_per_100k ggplot() +
geom_bar(aes(x = exceed_100_per_100k))
To see which days of the week account for those days where the cases exceed 100 per 100k, we can can color the portion of the bar according to the day of the week by just adding the fill
aesthetic where day = weekdays(date)
uses a function that converts a date to the date of the week. This produces a so-called “stacked” bar plot.
%>%
cases_exceed_100_per_100k ggplot() +
geom_bar(aes(x = exceed_100_per_100k, fill = day))
Not surprisingly given reporting biases, Mondays and Sundays account for a larger and smaller fraction, respectively, of the high case count days. Also note that when we created a day of the week variable, day
, we made it as a factor and specified the levels
in the order of the days of the week. This is what ensured the days were plotted in a nice order.
If you do not want the bars stacked by rather placed side by side, use position = "dodge"
in geom_bar()
:
%>%
cases_exceed_100_per_100k ggplot() +
geom_bar(aes(x = exceed_100_per_100k, fill = day), position = "dodge")
Other options for position
are identity
, which just places the bars on top of each other (use the alpha
option to make the bars easier to see in this case), and fill
, which makes the bars of equal height and the y-axis measure the fraction in each category.
%>%
cases_exceed_100_per_100k ggplot() +
geom_bar(aes(x = exceed_100_per_100k, fill = day), position = "fill")
Finally, you might notice the label on the tick mark on the above plots, TRUE
. This is because the bar plot is counting the number of each discrete value it finds in the exceed_100_per_100k
column. Since we only have rows in this table where exceed_100_per_100k == TRUE
, we could equally just set x=state
in the aesthetic and get a bar plot separated by state.
%>%
cases_exceed_100_per_100k ggplot() +
geom_bar(aes(x = state, fill = day)) +
coord_flip() # coordinate flip makes reading the state names easier
Histograms
To get a histogram, we use the geom_histogram
function. We can for example look at how many days across all the states had a certain number of cases per 100k.
%>% ggplot() +
us geom_histogram(aes(x = cases_per_100k))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
With a distribution that falls off quickly after a big spike, it can useful to put the y-axis on a log10 scale.
%>% ggplot() +
us geom_histogram(aes(x = log10(1 + cases_per_100k)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You can add additional information, like days of the week have which number of cases per 100k
%>% ggplot() +
us geom_histogram(aes(x = log10(1 + cases_per_100k), fill=day))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This defaults to a stacked histogram, so switch to identity
to better see how the two sample size categories compare (using alpha
to get some transparency).
%>% ggplot() +
us geom_histogram(aes(x = log10(1 + cases_per_100k), fill=day), position = "identity", alpha = 0.5)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
But maybe thats too many days to stack together, so let’s just compare Sunday and Monday.
%>% filter(day %in% c("Sunday", "Monday")) %>% ggplot() +
us geom_histogram(aes(x = log10(1 + cases_per_100k), fill=day), position = "identity", alpha = 0.5)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now we can see how Monday is very similar to Sunday but just shifted over to indicate the higher overall number of cases due to a lack of reporting on the weekends.
You can just draw lines instead of filled bars using geom_freqpoly()
%>% filter(day %in% c("Sunday", "Monday")) %>% ggplot() +
us geom_freqpoly(aes(x = log10(1 + cases_per_100k), color=day))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You might have noticed that R keeps reminding you to adjust the binwidth
. This is simply the width of the bins on the x-axis. For example, a binwidth of 0.2 yields
%>% filter(day %in% c("Sunday", "Monday")) %>% ggplot() +
us geom_freqpoly(aes(x = log10(1 + cases_per_100k), color=day), binwidth = 0.2)
Suppose that you want to smooth the above plot so that it looks like some continuous distribution. The statistical method used to do this is called a “kernel density estimate” or KDE. Essentially, a basic KDE tries to combine Gaussian distributions together in a way to approximate empirical distribution. Its basically the continuous version of a historgram (which has discrete bins). The geometry function to use for a KDE is geom_density
.
%>% filter(day %in% c("Sunday", "Monday")) %>% ggplot() +
us geom_density(aes(x = log10(1 + cases_per_100k), fill = day), alpha = 0.3)
Notice how the y-axis has a different scale now. This is because the area under the curve of the KDE must sum to one and the y-axis measures the density, or approximately the probability, of a specific log10 number of cases per 100k. The histogram can be normalized too by dividing each curve by the total number of days in the dataset.
Strip plots
Suppose now that you want to plot the distribution of the number of cases per day per 100k for each state.
%>% ggplot() +
us geom_histogram(aes(x = log10(1 + cases_per_100k), fill = state), alpha = 0.3)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Yikes. Too many state. You need a more condensed way to plot the distribution of cases per 100k for each state. First, you can just plot the points directly.
%>% ggplot() +
us geom_point(aes(x = state, y = log10(1 + cases_per_100k))) +
coord_flip()
This is better, but many of the points overlap. To get around this, you can use geom_jitter()
, which spreads the points out (and some alpha
).
%>% ggplot(aes(x = state, y = log10(1 + cases_per_100k))) +
us geom_jitter(alpha = 0.5) +
coord_flip()
That is sort of an improvement, but it’s still hard to see patterns because the points are still rather dense One plot that is particularly good at summarizing distributions is a box plot. Typically, a box plot show the median, the interquartile interval as a box (middle 50% of the data) and “whiskers” that extend to data points within 1.5 times the interquartile interval of the box. The only points plotted are outliers.
For the data on cases per 100k per day for each state, the box plots are:
%>% ggplot(aes(x = state, y = log10(1 + cases_per_100k))) +
us geom_boxplot() +
coord_flip()
The box plots make it much easier to see the central tendency (median) as well as how dispersed each distribution is. For example, the above plot reveals that states like Arkansas have some of the highest median case rates. You can modify boxplot properties including the color of the outliers using outlier.color
.
%>% ggplot(aes(x = state, y = log10(1 + cases_per_100k))) +
us geom_boxplot(outlier.color = "red") +
coord_flip()
It’s easier to read the box plots if they are organized by the median value. To do this, you the reorder
function for the x-axis in the aesthetic argument and replace x = state
with x = reorder(state, cases_per_100k, FUN = median)
.
%>% ggplot(aes(x = reorder(state, cases_per_100k, FUN = median), y = log10(1 + cases_per_100k))) +
us geom_boxplot(outlier.color = "red") +
coord_flip()
This reorder
function helps with our bar plot from above with the number of days above with more than 100 cases per 100k. This example is a little more complicated because we need to tell reorder
which variable we want to sort the states by but the variable we want is actually the count of the number of days with more 100 cases per 100k, which the bar plot calculates. So we have to create that column first by first using group_by
then mutate(count = n())
to count the number of rows for each state. Now we can reorder.
%>%
cases_exceed_100_per_100k group_by(state) %>%
mutate(count = n()) %>%
ggplot() +
geom_bar(aes(x = reorder(state, count), fill = day)) +
coord_flip() # coordinate flip makes reading the state names easier
Finally, a variant of the box plot is called a “violin plot”. Violin plots show the probability density using KDE. Let’s plot them for a subset of states
%>%
us filter(state %in% c("Vermont", "California", "Kentucky", "New York", "North Dakota", "Texas", "Tennessee", "Florida")) %>%
ggplot(aes(x = reorder(state, log10(1 + cases_per_100k), FUN = median), y = log10(1 + cases_per_100k))) +
geom_violin() +
coord_flip()
Lab
Problems
- Use the
gapminder
data (library(gapminder)
):- Create a box plot where the x-axis is
year
and each box plot shows the distribution of life expectancy for each year. - Add points with jitter on top of the box plot with the color according to continent.
- Create a box plot where the x-axis is
- Use the
gapminder
data (library(gapminder)
):- Plot life expectancy on the x-axis and GDP per capita on the y-axis (plot the points).
- Add a regression line (without confidence interval) for each continent where each line is colored by continent.
- You may use any regression method you prefer.
- Use the imprinting data from Babak et al. (2015) (
babak-etal-2015_imprinted-mouse_tidy.csv
):- Plot the distribution of gene expression values that are positive (expression from the paternal chromosome) and the distribution of expression values that are negative (expression from the maternal chromosome) on the same plot.
- Choose your favorite plot type for this question (histogram or KDE).
- Plot boxplots of the deaths per 100k for each day of the week of 2021 using the
us
data table we created at the beginning of the lecture (you’ll need to include the whole block of code that createsus
in your submitted.qmd
so that it works when I run it).- Use
cood_flip
to have the days on the y-axis and the deaths on the x-axis - It may be useful to use the
log10(1 + x)
transformation we used in earlier examples - Order the boxplots by their median values.
- On which day is the median number of deaths per 100k the highest?
- Bonus! Create a new plot that uses
facet_wrap
to get boxplots for each day for each state. Looking at the boxplots, which state do you think reports deaths mostly on just one day of the week?
- Use