library(gapminder)
library(tidyverse)
Introduction to plotting and ggplot2
Outline for today
- Plotting with R
base
graphics (sucks!) - Basic plotting with
ggplot2
Plotting with R base
graphics
Why do you even want to plot things?
- See patterns / do comparisons
- Data presentation
The base package (basic functions included with R and automatically loaded) includes plotting functions that can be useful. You’ve already used it a few times in the problems but below, you will see that the limits of these functions become evident quickly and more advanced plotting functions are required. The package of functions we will use in place of the base functions is Hadley Wickham’s ggplot2
.
Before plotting, you need some data. A common dataset used in data exploration courses in R is the “Gapminder” data, which is a collection of time series data for countries.
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
The data covers the statistics listed above for
levels(gapminder$country)
[1] "Afghanistan" "Albania"
[3] "Algeria" "Angola"
[5] "Argentina" "Australia"
[7] "Austria" "Bahrain"
[9] "Bangladesh" "Belgium"
[11] "Benin" "Bolivia"
[13] "Bosnia and Herzegovina" "Botswana"
[15] "Brazil" "Bulgaria"
[17] "Burkina Faso" "Burundi"
[19] "Cambodia" "Cameroon"
[21] "Canada" "Central African Republic"
[23] "Chad" "Chile"
[25] "China" "Colombia"
[27] "Comoros" "Congo, Dem. Rep."
[29] "Congo, Rep." "Costa Rica"
[31] "Cote d'Ivoire" "Croatia"
[33] "Cuba" "Czech Republic"
[35] "Denmark" "Djibouti"
[37] "Dominican Republic" "Ecuador"
[39] "Egypt" "El Salvador"
[41] "Equatorial Guinea" "Eritrea"
[43] "Ethiopia" "Finland"
[45] "France" "Gabon"
[47] "Gambia" "Germany"
[49] "Ghana" "Greece"
[51] "Guatemala" "Guinea"
[53] "Guinea-Bissau" "Haiti"
[55] "Honduras" "Hong Kong, China"
[57] "Hungary" "Iceland"
[59] "India" "Indonesia"
[61] "Iran" "Iraq"
[63] "Ireland" "Israel"
[65] "Italy" "Jamaica"
[67] "Japan" "Jordan"
[69] "Kenya" "Korea, Dem. Rep."
[71] "Korea, Rep." "Kuwait"
[73] "Lebanon" "Lesotho"
[75] "Liberia" "Libya"
[77] "Madagascar" "Malawi"
[79] "Malaysia" "Mali"
[81] "Mauritania" "Mauritius"
[83] "Mexico" "Mongolia"
[85] "Montenegro" "Morocco"
[87] "Mozambique" "Myanmar"
[89] "Namibia" "Nepal"
[91] "Netherlands" "New Zealand"
[93] "Nicaragua" "Niger"
[95] "Nigeria" "Norway"
[97] "Oman" "Pakistan"
[99] "Panama" "Paraguay"
[101] "Peru" "Philippines"
[103] "Poland" "Portugal"
[105] "Puerto Rico" "Reunion"
[107] "Romania" "Rwanda"
[109] "Sao Tome and Principe" "Saudi Arabia"
[111] "Senegal" "Serbia"
[113] "Sierra Leone" "Singapore"
[115] "Slovak Republic" "Slovenia"
[117] "Somalia" "South Africa"
[119] "Spain" "Sri Lanka"
[121] "Sudan" "Swaziland"
[123] "Sweden" "Switzerland"
[125] "Syria" "Taiwan"
[127] "Tanzania" "Thailand"
[129] "Togo" "Trinidad and Tobago"
[131] "Tunisia" "Turkey"
[133] "Uganda" "United Kingdom"
[135] "United States" "Uruguay"
[137] "Venezuela" "Vietnam"
[139] "West Bank and Gaza" "Yemen, Rep."
[141] "Zambia" "Zimbabwe"
142 countries.
We’ve already seen some basic plots using base::plot
. For example, let’s plot how life expectancy has changed over the years
plot(gapminder$year, gapminder$lifeExp)
You can modify the axis labels by adding the xlab
and ylab
arguments and give the plot a title with the main
argument.
= "Life expectancy over time for all countries in the Gapminder data" # save space in the code
plottitle plot(gapminder$year, gapminder$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle)
Though there is a clear trend observable even at this scale, increasing life expectancy with time, but does this hold for individual countries? To see this, you could slice just a single country out of the data. Using the United States, for example,
= gapminder %>% filter(country == "United States")
us plot(us$year, us$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle)
When plotting a couple of countries together, you can color the points by country by using the col
argument to plot
:
= gapminder %>%
usukjpcn filter(country == "United States" | country == "United Kingdom" | country == "Japan" | country == "China" )
usukjpcn
# A tibble: 48 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 China Asia 1952 44 556263527 400.
2 China Asia 1957 50.5 637408000 576.
3 China Asia 1962 44.5 665770000 488.
4 China Asia 1967 58.4 754550000 613.
5 China Asia 1972 63.1 862030000 677.
6 China Asia 1977 64.0 943455000 741.
7 China Asia 1982 65.5 1000281000 962.
8 China Asia 1987 67.3 1084035000 1379.
9 China Asia 1992 68.7 1164970000 1656.
10 China Asia 1997 70.4 1230075000 2289.
# … with 38 more rows
plot(usukjpcn$year, usukjpcn$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
col = usukjpcn$country)
legend('bottomright', legend = levels(usukjpcn$country), col = 1:4, pch = 1)
levels(usukjpcn$country)
[1] "Afghanistan" "Albania"
[3] "Algeria" "Angola"
[5] "Argentina" "Australia"
[7] "Austria" "Bahrain"
[9] "Bangladesh" "Belgium"
[11] "Benin" "Bolivia"
[13] "Bosnia and Herzegovina" "Botswana"
[15] "Brazil" "Bulgaria"
[17] "Burkina Faso" "Burundi"
[19] "Cambodia" "Cameroon"
[21] "Canada" "Central African Republic"
[23] "Chad" "Chile"
[25] "China" "Colombia"
[27] "Comoros" "Congo, Dem. Rep."
[29] "Congo, Rep." "Costa Rica"
[31] "Cote d'Ivoire" "Croatia"
[33] "Cuba" "Czech Republic"
[35] "Denmark" "Djibouti"
[37] "Dominican Republic" "Ecuador"
[39] "Egypt" "El Salvador"
[41] "Equatorial Guinea" "Eritrea"
[43] "Ethiopia" "Finland"
[45] "France" "Gabon"
[47] "Gambia" "Germany"
[49] "Ghana" "Greece"
[51] "Guatemala" "Guinea"
[53] "Guinea-Bissau" "Haiti"
[55] "Honduras" "Hong Kong, China"
[57] "Hungary" "Iceland"
[59] "India" "Indonesia"
[61] "Iran" "Iraq"
[63] "Ireland" "Israel"
[65] "Italy" "Jamaica"
[67] "Japan" "Jordan"
[69] "Kenya" "Korea, Dem. Rep."
[71] "Korea, Rep." "Kuwait"
[73] "Lebanon" "Lesotho"
[75] "Liberia" "Libya"
[77] "Madagascar" "Malawi"
[79] "Malaysia" "Mali"
[81] "Mauritania" "Mauritius"
[83] "Mexico" "Mongolia"
[85] "Montenegro" "Morocco"
[87] "Mozambique" "Myanmar"
[89] "Namibia" "Nepal"
[91] "Netherlands" "New Zealand"
[93] "Nicaragua" "Niger"
[95] "Nigeria" "Norway"
[97] "Oman" "Pakistan"
[99] "Panama" "Paraguay"
[101] "Peru" "Philippines"
[103] "Poland" "Portugal"
[105] "Puerto Rico" "Reunion"
[107] "Romania" "Rwanda"
[109] "Sao Tome and Principe" "Saudi Arabia"
[111] "Senegal" "Serbia"
[113] "Sierra Leone" "Singapore"
[115] "Slovak Republic" "Slovenia"
[117] "Somalia" "South Africa"
[119] "Spain" "Sri Lanka"
[121] "Sudan" "Swaziland"
[123] "Sweden" "Switzerland"
[125] "Syria" "Taiwan"
[127] "Tanzania" "Thailand"
[129] "Togo" "Trinidad and Tobago"
[131] "Tunisia" "Turkey"
[133] "Uganda" "United Kingdom"
[135] "United States" "Uruguay"
[137] "Venezuela" "Vietnam"
[139] "West Bank and Gaza" "Yemen, Rep."
[141] "Zambia" "Zimbabwe"
Uh oh, the legend includes all the countries! This is because slicing the data doesn’t dropped unused factor levels. Dang. You can fix this using the droplevels()
function
= droplevels(usukjpcn)
usukjpcn plot(usukjpcn$year, usukjpcn$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
col = usukjpcn$country)
levels(usukjpcn$country)
[1] "China" "Japan" "United Kingdom" "United States"
legend('bottomright', legend = levels(usukjpcn$country), col = 1:4, pch = 1)
So far, the basic plot()
function is pretty useful. To see its limits, you can try something that shouldn’t be that much harder: plotting these data as points connected by a line. At first pass, you could try giving the plot()
function the type = "l"
argument to indicated you want lines.
plot(usukjpcn$year, usukjpcn$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
col = usukjpcn$country,
type = 'l')
However, while plot()
could color points by country, it doesn’t make individual lines for each country and simply makes one connected line. Getting around this in the basic plot()
function is tedious. One way involves using the lines()
function to indicate specifically each line you want.
# make an empty plot. lame.
plot(0:0, xlim = c(1952, 2007), ylim = c(20, 85),
xlab = "Year", ylab = "Life expectancy", main = plottitle, type = "n")
# make each line individually. again, lame.
lines(filter(usukjpcn, country == "United States")$year,
filter(usukjpcn, country == "United States")$lifeExp)
lines(filter(usukjpcn, country == "United Kingdom")$year,
filter(usukjpcn, country == "United Kingdom")$lifeExp)
lines(filter(usukjpcn, country == "Japan")$year,
filter(usukjpcn, country == "Japan")$lifeExp)
lines(filter(usukjpcn, country == "China")$year,
filter(usukjpcn, country == "China")$lifeExp)
That was quite tedious. Another way would be to have individual plots for each line.
par(mfrow=c(1,4))
plot(filter(usukjpcn, country == "United States")$year,
filter(usukjpcn, country == "United States")$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
type = 'l')
plot(filter(usukjpcn, country == "United Kingdom")$year,
filter(usukjpcn, country == "United Kingdom")$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
type = 'l')
plot(filter(usukjpcn, country == "Japan")$year,
filter(usukjpcn, country == "Japan")$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
type = 'l')
plot(filter(usukjpcn, country == "China")$year,
filter(usukjpcn, country == "China")$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
type = 'l')
That was also very tedious. Not only that, notice that the y-axes have different ranges, which makes comparison across countries difficult. Correcting for this and other issues adds to the tedium. Using ggplot2
to create these kind of plots is much easier.
Basic plotting with ggplot2
The ggplot2
package is built on the idea that graphics have can have a “grammar”, or set or rules, that specifies how they can and should be constructed. Implementing these rules not only makes creating graphics easier, but it makes such graphics consistent and clear. Wickham borrows this idea from the book, “The Grammar of Graphics”” by Wilkinson, Anand, and Grossman (2005)1. While this structure may seem a bit artificial at first, it makes creating graphics very modular and building up complex graphics much easier.
To understand how ggplot2
works, first consider how the following example replicates your life expectancy by year plot above:
library(ggplot2)
ggplot(data = gapminder) +
geom_point(aes(x = year, y = lifeExp))
The first line has the function ggplot
, which creates a coordinate system for the plot that you can add “layers” to. The argument to ggplot
indicates which dataset to use.
Adding layers is accomplished by “adding” to the base plot with +
and additional features via additional functions. The first feature added above is geom_point()
, which adds a layer of points and creates a scatter plot. ggplot2
comes with many other “geometries”, which are really just other kinds of visual objects, like lines and boxes.
Each geom()
function takes a mapping
argument that defines how variables map to the visual objects in the plot. The mapping
argument is always paired with the aes()
function, which stands for “aesthetics”, and the x
and y
arguments of aes()
specify which variables correspond to which axis.
Aesthetic mapping with aes()
In addition to the x
and y
axes, you can map variables to the color of the point. Above, you did this using the country
variable. Now, repeat this with ggplot()
.
%>% ggplot() +
usukjpcn geom_point(mapping = aes(x = year, y = lifeExp, color = country)) +
geom_line(mapping = aes(x = year, y = lifeExp, color = country))
The above code with ggplot2
is both simpler and easier to read than the code creating the similar plot with the basic plot()
function. For example, ggplot
adds the legend automatically when you add an aesthetic like color.
Color is just one additional aesthetic. Others include the size and shape of the point.
For example, you can make the size of the point proportional to the population:
%>% ggplot() +
usukjpcn geom_point(mapping = aes(x = year, y = lifeExp, color = country, size = pop))
Adding another aesthetic like size caused ggplot
to automatically add another legend. Great! We could even try packing in another aesthetic, shape.
%>% ggplot() +
usukjpcn geom_point(mapping = aes(x = year, y = lifeExp, color = gdpPercap, size = pop, shape = country))
It’s starting to get a bit cluttered, but playing around a bit can help you figure out which aesthetic will visually help you best see patterns or differentiate different variable values.
You are now ready to recreate the multiline plot from above.
ggplot(data = usukjpcn) +
geom_line(mapping = aes(x = year, y = lifeExp, color = country))
The only change you had to make was to change geom_point
to geom_line
, for a line plot. Easy peasy.
Facets
Finally, ggplot2
makes creating multiple plots very easy with “facets”, which are just subplots that each correspond to a specific value of a variable or more generally to a specific subset of the data. To add facets in using ggplot()
, you need the facet_wrap()
or facet_grid()
function. In the multiple plots created above with the basic graphics, the variable was country and each plot was the subset of data for each of the four countries. For example, with ggplot()
,
%>% ggplot() +
usukjpcn geom_line(mapping = aes(x = year, y = lifeExp, color = country)) +
facet_wrap(~ country)
The first argument to facet_wrap()
is a “formula”, which refers to how one writes regression formulas using R’s regression function lm()
. Essentially, the “~” is like an equals sign for a regression formula and the variables on the right hand side are dependent variables. The variables passed to facet_wrap
should be categorical or discrete (like factors); in other words, you can make multiple plots for different values of a numeric variable.
Adding an additional facet is possible to make a grid of plots. To see an example, load the genomic imprinting data from the Babak et al. (2015) study (this time in tidy format),
= read_csv("babak-etal-2015_imprinted-mouse_tidy.csv", na = "NA") imprint
and plot a histogram of the expression values for a subset of the data (four genes and four tissues)
= imprint %>%
fourgenes filter((Genes == "IGF2" | Genes == "GRB10" | Genes == "MEG3" | Genes == "PEG3") &
== "Preoptic Area (ref)" | tissue == "Hypothalamus" |
(tissue == "e17.5 Brain" | tissue == "e9.5 Yolk Sac" ))
tissue
%>% ggplot() +
fourgenes geom_histogram(aes(expression)) +
facet_grid(Genes ~ tissue)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here, facet_grid
arranges the plots nicely in a grid. Positive values of expression indicate expression from the paternal chromosome only and negative values from the maternal chromosome. Thus, this plot shows how some genes are “imprinting” differently depending on the tissue type.
Lab
Problems
Use GWAS data from last week (
gwas_catalog_v1.0.2-associations_e104_r2021-09-23_no-waist_hip_body_blood_education_math_top100.tsv
):- Create a plot with
ggplot()
that shows thePVALUE_MLOG
as a function ofCHR_POS
for two different values ofDISEASE/TRAIT
(pick your favorites!) - Make the points for each disease/trait a different color
- Create a plot with
Use the
gapminder
data:- Create a line plot with
ggplot()
that shows life expectancy on the y-axis and time on the x-axis. - Make each line represent a country
(hint: look here: https://ggplot2.tidyverse.org/reference/aes_group_order.html to learn how to group the data appropriately) - Color each line by the continent
- Make points whose size is proportional to GDP per capita (variable
gdpPercap
)
- Create a line plot with
Create a plot like the one in Problem 2
- Except, the new plot has multiple subplots where each subplot is a different continent.
- No need to color by continent though.
- Instead, color the points red where countries at that year had greater than 50 million people.
(hint: try to use Google for help here. there are multiple ways to do this too :-) )
Use the COVID-19 data below from the New York Times. It has daily case and death data.
= read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us-states.csv") us
- Choose three states and plot the cases per 100k versus time.
- Make the line color vary by the deaths per 100k.
- Each state should have its own panel by using
facet_wrap
. Make surefacet_wrap
creates a single column of plots with three rows. - What do you notice about the relationship between peak cases and peak deaths?
Footnotes
see the Canvas course site for a copy of the book↩︎