library(gapminder)
library(tidyverse)
Introduction to plotting and ggplot2
Outline for today
- Plotting with R
base
graphics (sucks!) - Basic plotting with
ggplot2
Plotting with R base
graphics
Why do you even want to plot things?
- See patterns / do comparisons
- Data presentation
The base package (basic functions included with R and automatically loaded) includes plotting functions that can be useful. You’ve already used it a few times in the problems but below, you will see that the limits of these functions become evident quickly and more advanced plotting functions are required. The package of functions we will use in place of the base functions is Hadley Wickham’s ggplot2
.
Before plotting, you need some data. A common dataset used in data exploration courses in R is the “Gapminder” data, which is a collection of time series data for countries.
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
The data covers the statistics listed above for
levels(gapminder$country)
[1] "Afghanistan" "Albania"
[3] "Algeria" "Angola"
[5] "Argentina" "Australia"
[7] "Austria" "Bahrain"
[9] "Bangladesh" "Belgium"
[11] "Benin" "Bolivia"
[13] "Bosnia and Herzegovina" "Botswana"
[15] "Brazil" "Bulgaria"
[17] "Burkina Faso" "Burundi"
[19] "Cambodia" "Cameroon"
[21] "Canada" "Central African Republic"
[23] "Chad" "Chile"
[25] "China" "Colombia"
[27] "Comoros" "Congo, Dem. Rep."
[29] "Congo, Rep." "Costa Rica"
[31] "Cote d'Ivoire" "Croatia"
[33] "Cuba" "Czech Republic"
[35] "Denmark" "Djibouti"
[37] "Dominican Republic" "Ecuador"
[39] "Egypt" "El Salvador"
[41] "Equatorial Guinea" "Eritrea"
[43] "Ethiopia" "Finland"
[45] "France" "Gabon"
[47] "Gambia" "Germany"
[49] "Ghana" "Greece"
[51] "Guatemala" "Guinea"
[53] "Guinea-Bissau" "Haiti"
[55] "Honduras" "Hong Kong, China"
[57] "Hungary" "Iceland"
[59] "India" "Indonesia"
[61] "Iran" "Iraq"
[63] "Ireland" "Israel"
[65] "Italy" "Jamaica"
[67] "Japan" "Jordan"
[69] "Kenya" "Korea, Dem. Rep."
[71] "Korea, Rep." "Kuwait"
[73] "Lebanon" "Lesotho"
[75] "Liberia" "Libya"
[77] "Madagascar" "Malawi"
[79] "Malaysia" "Mali"
[81] "Mauritania" "Mauritius"
[83] "Mexico" "Mongolia"
[85] "Montenegro" "Morocco"
[87] "Mozambique" "Myanmar"
[89] "Namibia" "Nepal"
[91] "Netherlands" "New Zealand"
[93] "Nicaragua" "Niger"
[95] "Nigeria" "Norway"
[97] "Oman" "Pakistan"
[99] "Panama" "Paraguay"
[101] "Peru" "Philippines"
[103] "Poland" "Portugal"
[105] "Puerto Rico" "Reunion"
[107] "Romania" "Rwanda"
[109] "Sao Tome and Principe" "Saudi Arabia"
[111] "Senegal" "Serbia"
[113] "Sierra Leone" "Singapore"
[115] "Slovak Republic" "Slovenia"
[117] "Somalia" "South Africa"
[119] "Spain" "Sri Lanka"
[121] "Sudan" "Swaziland"
[123] "Sweden" "Switzerland"
[125] "Syria" "Taiwan"
[127] "Tanzania" "Thailand"
[129] "Togo" "Trinidad and Tobago"
[131] "Tunisia" "Turkey"
[133] "Uganda" "United Kingdom"
[135] "United States" "Uruguay"
[137] "Venezuela" "Vietnam"
[139] "West Bank and Gaza" "Yemen, Rep."
[141] "Zambia" "Zimbabwe"
142 countries.
We’ve already seen some basic plots using base::plot
. For example, let’s plot how life expectancy has changed over the years
plot(gapminder$year, gapminder$lifeExp)
You can modify the axis labels by adding the xlab
and ylab
arguments and give the plot a title with the main
argument.
= "Life expectancy over time for all countries in the Gapminder data" # save space in the code
plottitle plot(gapminder$year, gapminder$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle)
Though there is a clear trend observable even at this scale, increasing life expectancy with time, but does this hold for individual countries? To see this, you could slice just a single country out of the data. Using the United States, for example,
= gapminder |> filter(country == "United States")
us plot(us$year, us$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle)
When plotting a couple of countries together, you can color the points by country by using the col
argument to plot
:
= gapminder |>
usukjpcn filter(country == "United States" | country == "United Kingdom" | country == "Japan" | country == "China" )
usukjpcn
# A tibble: 48 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 China Asia 1952 44 556263527 400.
2 China Asia 1957 50.5 637408000 576.
3 China Asia 1962 44.5 665770000 488.
4 China Asia 1967 58.4 754550000 613.
5 China Asia 1972 63.1 862030000 677.
6 China Asia 1977 64.0 943455000 741.
7 China Asia 1982 65.5 1000281000 962.
8 China Asia 1987 67.3 1084035000 1379.
9 China Asia 1992 68.7 1164970000 1656.
10 China Asia 1997 70.4 1230075000 2289.
# ℹ 38 more rows
plot(usukjpcn$year, usukjpcn$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
col = usukjpcn$country)
legend('bottomright', legend = levels(usukjpcn$country), col = 1:4, pch = 1)
Uh oh, the legend includes all the countries! This is because slicing the data doesn’t dropped unused factor levels. Dang. You can fix this using the droplevels()
function
= droplevels(usukjpcn)
usukjpcn plot(usukjpcn$year, usukjpcn$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
col = usukjpcn$country)
levels(usukjpcn$country)
[1] "China" "Japan" "United Kingdom" "United States"
legend('bottomright', legend = levels(usukjpcn$country), col = 1:4, pch = 1)
So far, the basic plot()
function is pretty useful. To see its limits, you can try something that shouldn’t be that much harder: plotting these data as points connected by a line. At first pass, you could try giving the plot()
function the type = "l"
argument to indicated you want lines.
plot(usukjpcn$year, usukjpcn$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
col = usukjpcn$country,
type = 'l')
However, while plot()
could color points by country, it doesn’t make individual lines for each country and simply makes one connected line. Getting around this in the basic plot()
function is tedious. One way involves using the lines()
function to indicate specifically each line you want.
# make an empty plot. lame.
plot(0:0, xlim = c(1952, 2007), ylim = c(20, 85),
xlab = "Year", ylab = "Life expectancy", main = plottitle, type = "n")
# make each line individually. again, lame.
lines(filter(usukjpcn, country == "United States")$year,
filter(usukjpcn, country == "United States")$lifeExp)
lines(filter(usukjpcn, country == "United Kingdom")$year,
filter(usukjpcn, country == "United Kingdom")$lifeExp)
lines(filter(usukjpcn, country == "Japan")$year,
filter(usukjpcn, country == "Japan")$lifeExp)
lines(filter(usukjpcn, country == "China")$year,
filter(usukjpcn, country == "China")$lifeExp)
That was quite tedious. Another way would be to have individual plots for each line.
par(mfrow=c(1,4))
plot(filter(usukjpcn, country == "United States")$year,
filter(usukjpcn, country == "United States")$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
type = 'l')
plot(filter(usukjpcn, country == "United Kingdom")$year,
filter(usukjpcn, country == "United Kingdom")$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
type = 'l')
plot(filter(usukjpcn, country == "Japan")$year,
filter(usukjpcn, country == "Japan")$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
type = 'l')
plot(filter(usukjpcn, country == "China")$year,
filter(usukjpcn, country == "China")$lifeExp,
xlab = "Year", ylab = "Life expectancy", main = plottitle,
type = 'l')
That was also very tedious. Not only that, notice that the y-axes have different ranges, which makes comparison across countries difficult. Correcting for this and other issues adds to the tedium. Using ggplot2
to create these kind of plots is much easier.
Basic plotting with ggplot2
The ggplot2
package is built on the idea that graphics have can have a “grammar”, or set or rules, that specifies how they can and should be constructed. Implementing these rules not only makes creating graphics easier, but it makes such graphics consistent and clear. Wickham borrows this idea from the book, “The Grammar of Graphics”” by Wilkinson, Anand, and Grossman (2005)1. While this structure may seem a bit artificial at first, it makes creating graphics very modular and building up complex graphics much easier.
To understand how ggplot2
works, first consider how the following example replicates the life expectancy by year plot above:
library(ggplot2)
ggplot(data = gapminder) +
geom_point(aes(x = year, y = lifeExp))
The first line has the function ggplot
, which creates a coordinate system for the plot that you can add “layers” to. The argument to ggplot
indicates which dataset to use.
Adding layers is accomplished by “adding” to the base plot with +
and additional features via additional functions. The first feature added above is geom_point()
, which adds a layer of points and creates a scatter plot. ggplot2
comes with many other “geometries”, which are really just other kinds of visual objects, like lines and boxes.
Each geom()
function takes a mapping
argument that defines how variables map to the visual objects in the plot. The mapping
argument is always paired with the aes()
function, which stands for “aesthetics”, and the x
and y
arguments of aes()
specify which variables correspond to which axis.
Aesthetic mapping with aes()
In addition to the x
and y
axes, you can map variables to the color of the point. Above, you did this using the country
variable. Now, repeat this with ggplot()
.
|> ggplot() +
usukjpcn geom_point(mapping = aes(x = year, y = lifeExp, color = country))
The above code with ggplot2
is both simpler and easier to read than the code creating the similar plot with the basic plot()
function. For example, ggplot
adds the legend automatically when you add an aesthetic like color.
Color is just one additional aesthetic. Others include the size and shape of the point.
For example, you can make the size of the point proportional to the population:
|> ggplot() +
usukjpcn geom_point(mapping = aes(x = year, y = lifeExp, color = country, size = pop))
Adding another aesthetic like size caused ggplot
to automatically add another legend. Great! We could even try packing in another aesthetic, shape.
|> ggplot() +
usukjpcn geom_point(mapping = aes(x = year, y = lifeExp, color = gdpPercap, size = pop, shape = country))
It’s starting to get a bit cluttered, but playing around a bit can help you figure out which aesthetic will visually help you best see patterns or differentiate different variable values.
The shape of the points has many options and can be specified with a number from below
where the hollow shapes (0–14) have a border determined by color
, the solid shapes (15–20) are filled with color
, the filled shapes (21–24) have a border of color and are filled with fill
. For example,
|> ggplot() +
usukjpcn geom_point(mapping = aes(x = year, y = lifeExp, shape = 23, color = country, fill = pop)) +
scale_shape_identity()
where scale_shape_identity()
tells ggplot
to use the number we give to shape
directly instead of just assigning it something by default as it would do for any other variable. We will talk more about scales when we talk about how to modify different plot elements.
With knowledge of aes
, we can now ready recreate the multiline plot from above.
ggplot(data = usukjpcn) +
geom_line(mapping = aes(x = year, y = lifeExp, color = country))
The only change you had to make was to change geom_point
to geom_line
, for a line plot. Easy peasy.
Facets
Finally, ggplot2
makes creating multiple plots very easy with “facets”, which are just subplots that each correspond to a specific value of a variable or more generally to a specific subset of the data. To add facets in using ggplot()
, you need the facet_wrap()
or facet_grid()
function. In the multiple plots created above with the basic graphics, the variable was country and each plot was the subset of data for each of the four countries. For example, with ggplot()
,
|> ggplot() +
usukjpcn geom_line(mapping = aes(x = year, y = lifeExp, color = country)) +
facet_wrap(vars(country))
The first argument to facet_wrap()
uses the vars
function, which like aes
, is a function that can take unquoted variable or column names that you want to use. Variables for facet_wrap
should be categorical or discrete (like factors). For numerical variables, it is likely that are many possible values, so care is needed when trying to use these for facet_wrap
plots are many plots could be created.
With facet_wrap
, a vector of plots is created where each plot is a specific combination of the variables given to vars
. For example, we could look at the relationship between lifeExp
and log10(gdpPercap)
and see if it is different before or after the fall of the Soviet Union.
|> ggplot(aes(x = lifeExp, y = log10(gdpPercap))) +
gapminder geom_point() +
facet_wrap(vars(continent, year > 1990))
Now, facet_wrap
plots all the possible combinations of continent
and year > 1990
.
For two different variables in the facets, you can make a grid of plots with facet_grid
where the values of one variable change along the rows of the grid and the values of the other variable change along the columns. For example, changing the above plot of lifeExp
vs log10(gdpPercap)
with continent
and year > 1990
as facets to facet_grid
yields
|> ggplot(aes(x = lifeExp, y = log10(gdpPercap))) +
gapminder geom_point() +
facet_grid(vars(continent), vars(year > 1990))
Note that now we identify the row variable first and the column variable second and that each is identified in a separate vars
function. This allows us to specify multiple variables for both the rows and columns so that the facet_grid
can cover many possible combinations of variables.
Let’s use facet_grid
on the genomic imprinting data from the Babak et al. (2015) study (this time in tidy format),
= read_csv("babak-etal-2015_imprinted-mouse_tidy.csv", na = "NA") imprint
and plot a histogram of the expression values for a subset of the data (four genes and four tissues)
= imprint |>
fourgenes filter((Genes == "IGF2" | Genes == "GRB10" | Genes == "MEG3" | Genes == "PEG3") &
== "Preoptic Area (ref)" | tissue == "Hypothalamus" |
(tissue == "e17.5 Brain" | tissue == "e9.5 Yolk Sac" ))
tissue
|> ggplot() +
fourgenes geom_histogram(aes(expression)) +
facet_grid(vars(Genes), vars(tissue))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We’ve used a histogram now to count the fraction of times each expression value shows up in the data and used facet_grid
to separate out expression by tissue type and gene name. Positive values of expression indicate expression from the paternal chromosome only and negative values from the maternal chromosome. Thus, this plot shows how some genes are “imprinting” differently depending on the tissue type.
Lab
Use the GWAS data (
gwas_catalog_v1.0.2-associations_e104_r2021-09-23_no-waist_hip_body_blood_education_math_top100.tsv
):- Create a plot with
ggplot()
that shows thePVALUE_MLOG
as a function ofCHR_POS
for two different values ofDISEASE/TRAIT
(pick your favorites!) - Make the points for each disease/trait a different color
- Create a plot with
Use the
gapminder
data:- Create a line plot with
ggplot()
that shows life expectancy on the y-axis and time on the x-axis. - Make each line represent a country
(hint: look here: https://ggplot2.tidyverse.org/reference/aes_group_order.html to learn how to group the data appropriately) - Color each line by the continent
- Make points whose size is proportional to GDP per capita (variable
gdpPercap
)
- Create a line plot with
Create a plot like the one in Problem 2
- Except, the new plot has multiple subplots where each subplot is a different continent.
- No need to color by continent though.
- Instead, color the points red where countries at that year had greater than 50 million people.
(hint: try to use Google for help here. there are multiple ways to do this too :-) )
We’ll use a CDC dataset on COVID-19 mortality from here:
https://data.cdc.gov/Public-Health-Surveillance/Monthly-COVID-19-Death-Rates-per-100-000-Populatio/exs3-hbne/about_data
It has mortality rates per 100k people in thecrude_rate
column and demographic groups in thesubgroup1
andsubgroup2
columns.
library(RSocrata) = read.socrata("https://data.cdc.gov/api/odata/v4/exs3-hbne") covid
The kind of demographic information for each row of the table is in the
group
column. You should see what the distinct values ofgroup
are and what the values are forsubgroup1
andsubgroup2
to understand better the demographic category of each row. Be careful to filter using an appropriategroup
value for each part of the problem below.- Make a point plot of the covid death rate in
crude_rate
vs the date for the years 2020-2021.
Usefacet_wrap
to make this a set of plots for each age group in the data.
Hint: use theyear
function to more easily filter the rows.
Hint: filter bygroup
to get the right subset of rows. - Make a point plot of the covid death rate in
crude_rate
vs the date for 2020-2021.
Usefacet_wrap
to make a set of plots for each race/ethnicity value in the data.
Think carefully about what might not be included in these data that might explain any differences you see between these groups. - Make a point plot of the covid death rate in
crude_rate
vs the date for 2020-2021.
Usefacet_grid
to make a grid of plots for each combination of age and race/ethnicity value.
Hint: filter bygroup
first to get the right subset of data (there will be two different groups that yield the same data). - +3 points: make a version of the
facet_grid
plot above where there is a separate colored points for eachjurisdiction_residence
.
See here for what states belong to which region: https://www.hhs.gov/about/agencies/iea/regional-offices/index.html
- Make a point plot of the covid death rate in
Footnotes
see the Canvas course site for a copy of the book↩︎