Introduction to plotting and ggplot2

Author

Jeremy Van Cleve

Published

October 3, 2022

Outline for today

  • Plotting with R base graphics (sucks!)
  • Basic plotting with ggplot2

Plotting with R base graphics

Why do you even want to plot things?

  • See patterns / do comparisons
  • Data presentation

The base package (basic functions included with R and automatically loaded) includes plotting functions that can be useful. You’ve already used it a few times in the problems but below, you will see that the limits of these functions become evident quickly and more advanced plotting functions are required. The package of functions we will use in place of the base functions is Hadley Wickham’s ggplot2.

Before plotting, you need some data. A common dataset used in data exploration courses in R is the “Gapminder” data, which is a collection of time series data for countries.

library(gapminder)
library(tidyverse)
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

The data covers the statistics listed above for

levels(gapminder$country)
  [1] "Afghanistan"              "Albania"                 
  [3] "Algeria"                  "Angola"                  
  [5] "Argentina"                "Australia"               
  [7] "Austria"                  "Bahrain"                 
  [9] "Bangladesh"               "Belgium"                 
 [11] "Benin"                    "Bolivia"                 
 [13] "Bosnia and Herzegovina"   "Botswana"                
 [15] "Brazil"                   "Bulgaria"                
 [17] "Burkina Faso"             "Burundi"                 
 [19] "Cambodia"                 "Cameroon"                
 [21] "Canada"                   "Central African Republic"
 [23] "Chad"                     "Chile"                   
 [25] "China"                    "Colombia"                
 [27] "Comoros"                  "Congo, Dem. Rep."        
 [29] "Congo, Rep."              "Costa Rica"              
 [31] "Cote d'Ivoire"            "Croatia"                 
 [33] "Cuba"                     "Czech Republic"          
 [35] "Denmark"                  "Djibouti"                
 [37] "Dominican Republic"       "Ecuador"                 
 [39] "Egypt"                    "El Salvador"             
 [41] "Equatorial Guinea"        "Eritrea"                 
 [43] "Ethiopia"                 "Finland"                 
 [45] "France"                   "Gabon"                   
 [47] "Gambia"                   "Germany"                 
 [49] "Ghana"                    "Greece"                  
 [51] "Guatemala"                "Guinea"                  
 [53] "Guinea-Bissau"            "Haiti"                   
 [55] "Honduras"                 "Hong Kong, China"        
 [57] "Hungary"                  "Iceland"                 
 [59] "India"                    "Indonesia"               
 [61] "Iran"                     "Iraq"                    
 [63] "Ireland"                  "Israel"                  
 [65] "Italy"                    "Jamaica"                 
 [67] "Japan"                    "Jordan"                  
 [69] "Kenya"                    "Korea, Dem. Rep."        
 [71] "Korea, Rep."              "Kuwait"                  
 [73] "Lebanon"                  "Lesotho"                 
 [75] "Liberia"                  "Libya"                   
 [77] "Madagascar"               "Malawi"                  
 [79] "Malaysia"                 "Mali"                    
 [81] "Mauritania"               "Mauritius"               
 [83] "Mexico"                   "Mongolia"                
 [85] "Montenegro"               "Morocco"                 
 [87] "Mozambique"               "Myanmar"                 
 [89] "Namibia"                  "Nepal"                   
 [91] "Netherlands"              "New Zealand"             
 [93] "Nicaragua"                "Niger"                   
 [95] "Nigeria"                  "Norway"                  
 [97] "Oman"                     "Pakistan"                
 [99] "Panama"                   "Paraguay"                
[101] "Peru"                     "Philippines"             
[103] "Poland"                   "Portugal"                
[105] "Puerto Rico"              "Reunion"                 
[107] "Romania"                  "Rwanda"                  
[109] "Sao Tome and Principe"    "Saudi Arabia"            
[111] "Senegal"                  "Serbia"                  
[113] "Sierra Leone"             "Singapore"               
[115] "Slovak Republic"          "Slovenia"                
[117] "Somalia"                  "South Africa"            
[119] "Spain"                    "Sri Lanka"               
[121] "Sudan"                    "Swaziland"               
[123] "Sweden"                   "Switzerland"             
[125] "Syria"                    "Taiwan"                  
[127] "Tanzania"                 "Thailand"                
[129] "Togo"                     "Trinidad and Tobago"     
[131] "Tunisia"                  "Turkey"                  
[133] "Uganda"                   "United Kingdom"          
[135] "United States"            "Uruguay"                 
[137] "Venezuela"                "Vietnam"                 
[139] "West Bank and Gaza"       "Yemen, Rep."             
[141] "Zambia"                   "Zimbabwe"                

142 countries.

We’ve already seen some basic plots using base::plot. For example, let’s plot how life expectancy has changed over the years

plot(gapminder$year, gapminder$lifeExp)

You can modify the axis labels by adding the xlab and ylab arguments and give the plot a title with the main argument.

plottitle = "Life expectancy over time for all countries in the Gapminder data" # save space in the code
plot(gapminder$year, gapminder$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle)

Though there is a clear trend observable even at this scale, increasing life expectancy with time, but does this hold for individual countries? To see this, you could slice just a single country out of the data. Using the United States, for example,

us = gapminder %>% filter(country == "United States")
plot(us$year, us$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle)

When plotting a couple of countries together, you can color the points by country by using the col argument to plot:

usukjpcn = gapminder %>% 
  filter(country == "United States" | country == "United Kingdom" | country == "Japan" | country == "China" )
usukjpcn
# A tibble: 48 × 6
   country continent  year lifeExp        pop gdpPercap
   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
 1 China   Asia       1952    44    556263527      400.
 2 China   Asia       1957    50.5  637408000      576.
 3 China   Asia       1962    44.5  665770000      488.
 4 China   Asia       1967    58.4  754550000      613.
 5 China   Asia       1972    63.1  862030000      677.
 6 China   Asia       1977    64.0  943455000      741.
 7 China   Asia       1982    65.5 1000281000      962.
 8 China   Asia       1987    67.3 1084035000     1379.
 9 China   Asia       1992    68.7 1164970000     1656.
10 China   Asia       1997    70.4 1230075000     2289.
# … with 38 more rows
plot(usukjpcn$year, usukjpcn$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     col = usukjpcn$country)

legend('bottomright', legend = levels(usukjpcn$country), col = 1:4, pch = 1)

levels(usukjpcn$country)
  [1] "Afghanistan"              "Albania"                 
  [3] "Algeria"                  "Angola"                  
  [5] "Argentina"                "Australia"               
  [7] "Austria"                  "Bahrain"                 
  [9] "Bangladesh"               "Belgium"                 
 [11] "Benin"                    "Bolivia"                 
 [13] "Bosnia and Herzegovina"   "Botswana"                
 [15] "Brazil"                   "Bulgaria"                
 [17] "Burkina Faso"             "Burundi"                 
 [19] "Cambodia"                 "Cameroon"                
 [21] "Canada"                   "Central African Republic"
 [23] "Chad"                     "Chile"                   
 [25] "China"                    "Colombia"                
 [27] "Comoros"                  "Congo, Dem. Rep."        
 [29] "Congo, Rep."              "Costa Rica"              
 [31] "Cote d'Ivoire"            "Croatia"                 
 [33] "Cuba"                     "Czech Republic"          
 [35] "Denmark"                  "Djibouti"                
 [37] "Dominican Republic"       "Ecuador"                 
 [39] "Egypt"                    "El Salvador"             
 [41] "Equatorial Guinea"        "Eritrea"                 
 [43] "Ethiopia"                 "Finland"                 
 [45] "France"                   "Gabon"                   
 [47] "Gambia"                   "Germany"                 
 [49] "Ghana"                    "Greece"                  
 [51] "Guatemala"                "Guinea"                  
 [53] "Guinea-Bissau"            "Haiti"                   
 [55] "Honduras"                 "Hong Kong, China"        
 [57] "Hungary"                  "Iceland"                 
 [59] "India"                    "Indonesia"               
 [61] "Iran"                     "Iraq"                    
 [63] "Ireland"                  "Israel"                  
 [65] "Italy"                    "Jamaica"                 
 [67] "Japan"                    "Jordan"                  
 [69] "Kenya"                    "Korea, Dem. Rep."        
 [71] "Korea, Rep."              "Kuwait"                  
 [73] "Lebanon"                  "Lesotho"                 
 [75] "Liberia"                  "Libya"                   
 [77] "Madagascar"               "Malawi"                  
 [79] "Malaysia"                 "Mali"                    
 [81] "Mauritania"               "Mauritius"               
 [83] "Mexico"                   "Mongolia"                
 [85] "Montenegro"               "Morocco"                 
 [87] "Mozambique"               "Myanmar"                 
 [89] "Namibia"                  "Nepal"                   
 [91] "Netherlands"              "New Zealand"             
 [93] "Nicaragua"                "Niger"                   
 [95] "Nigeria"                  "Norway"                  
 [97] "Oman"                     "Pakistan"                
 [99] "Panama"                   "Paraguay"                
[101] "Peru"                     "Philippines"             
[103] "Poland"                   "Portugal"                
[105] "Puerto Rico"              "Reunion"                 
[107] "Romania"                  "Rwanda"                  
[109] "Sao Tome and Principe"    "Saudi Arabia"            
[111] "Senegal"                  "Serbia"                  
[113] "Sierra Leone"             "Singapore"               
[115] "Slovak Republic"          "Slovenia"                
[117] "Somalia"                  "South Africa"            
[119] "Spain"                    "Sri Lanka"               
[121] "Sudan"                    "Swaziland"               
[123] "Sweden"                   "Switzerland"             
[125] "Syria"                    "Taiwan"                  
[127] "Tanzania"                 "Thailand"                
[129] "Togo"                     "Trinidad and Tobago"     
[131] "Tunisia"                  "Turkey"                  
[133] "Uganda"                   "United Kingdom"          
[135] "United States"            "Uruguay"                 
[137] "Venezuela"                "Vietnam"                 
[139] "West Bank and Gaza"       "Yemen, Rep."             
[141] "Zambia"                   "Zimbabwe"                

Uh oh, the legend includes all the countries! This is because slicing the data doesn’t dropped unused factor levels. Dang. You can fix this using the droplevels() function

usukjpcn = droplevels(usukjpcn)
plot(usukjpcn$year, usukjpcn$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     col = usukjpcn$country)
levels(usukjpcn$country)
[1] "China"          "Japan"          "United Kingdom" "United States" 
legend('bottomright', legend = levels(usukjpcn$country), col = 1:4, pch = 1)

So far, the basic plot() function is pretty useful. To see its limits, you can try something that shouldn’t be that much harder: plotting these data as points connected by a line. At first pass, you could try giving the plot() function the type = "l" argument to indicated you want lines.

plot(usukjpcn$year, usukjpcn$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     col = usukjpcn$country,
     type = 'l')

However, while plot() could color points by country, it doesn’t make individual lines for each country and simply makes one connected line. Getting around this in the basic plot() function is tedious. One way involves using the lines() function to indicate specifically each line you want.

# make an empty plot. lame.
plot(0:0, xlim = c(1952, 2007), ylim = c(20, 85), 
     xlab = "Year", ylab = "Life expectancy", main = plottitle, type = "n") 

# make each line individually. again, lame.
lines(filter(usukjpcn, country == "United States")$year, 
     filter(usukjpcn, country == "United States")$lifeExp) 
lines(filter(usukjpcn, country == "United Kingdom")$year, 
     filter(usukjpcn, country == "United Kingdom")$lifeExp)
lines(filter(usukjpcn, country == "Japan")$year, 
     filter(usukjpcn, country == "Japan")$lifeExp) 
lines(filter(usukjpcn, country == "China")$year, 
     filter(usukjpcn, country == "China")$lifeExp)

That was quite tedious. Another way would be to have individual plots for each line.

par(mfrow=c(1,4))
plot(filter(usukjpcn, country == "United States")$year, 
     filter(usukjpcn, country == "United States")$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     type = 'l')
plot(filter(usukjpcn, country == "United Kingdom")$year, 
     filter(usukjpcn, country == "United Kingdom")$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     type = 'l')
plot(filter(usukjpcn, country == "Japan")$year, 
     filter(usukjpcn, country == "Japan")$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     type = 'l')
plot(filter(usukjpcn, country == "China")$year, 
     filter(usukjpcn, country == "China")$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     type = 'l')

That was also very tedious. Not only that, notice that the y-axes have different ranges, which makes comparison across countries difficult. Correcting for this and other issues adds to the tedium. Using ggplot2 to create these kind of plots is much easier.

Basic plotting with ggplot2

The ggplot2 package is built on the idea that graphics have can have a “grammar”, or set or rules, that specifies how they can and should be constructed. Implementing these rules not only makes creating graphics easier, but it makes such graphics consistent and clear. Wickham borrows this idea from the book, “The Grammar of Graphics”” by Wilkinson, Anand, and Grossman (2005)1. While this structure may seem a bit artificial at first, it makes creating graphics very modular and building up complex graphics much easier.

To understand how ggplot2 works, first consider how the following example replicates your life expectancy by year plot above:

library(ggplot2)
ggplot(data = gapminder) + 
  geom_point(aes(x = year, y = lifeExp))

The first line has the function ggplot, which creates a coordinate system for the plot that you can add “layers” to. The argument to ggplot indicates which dataset to use.

Adding layers is accomplished by “adding” to the base plot with + and additional features via additional functions. The first feature added above is geom_point(), which adds a layer of points and creates a scatter plot. ggplot2 comes with many other “geometries”, which are really just other kinds of visual objects, like lines and boxes.

Each geom() function takes a mapping argument that defines how variables map to the visual objects in the plot. The mapping argument is always paired with the aes() function, which stands for “aesthetics”, and the x and y arguments of aes() specify which variables correspond to which axis.

Aesthetic mapping with aes()

In addition to the x and y axes, you can map variables to the color of the point. Above, you did this using the country variable. Now, repeat this with ggplot().

usukjpcn %>% ggplot() + 
  geom_point(mapping = aes(x = year, y = lifeExp, color = country)) +
  geom_line(mapping = aes(x = year, y = lifeExp, color = country))

The above code with ggplot2 is both simpler and easier to read than the code creating the similar plot with the basic plot() function. For example, ggplot adds the legend automatically when you add an aesthetic like color.

Color is just one additional aesthetic. Others include the size and shape of the point.

For example, you can make the size of the point proportional to the population:

usukjpcn %>% ggplot() + 
  geom_point(mapping = aes(x = year, y = lifeExp, color = country, size = pop))

Adding another aesthetic like size caused ggplot to automatically add another legend. Great! We could even try packing in another aesthetic, shape.

usukjpcn %>% ggplot() + 
  geom_point(mapping = aes(x = year, y = lifeExp, color = gdpPercap, size = pop, shape = country))

It’s starting to get a bit cluttered, but playing around a bit can help you figure out which aesthetic will visually help you best see patterns or differentiate different variable values.

You are now ready to recreate the multiline plot from above.

ggplot(data = usukjpcn) + 
  geom_line(mapping = aes(x = year, y = lifeExp, color = country))

The only change you had to make was to change geom_point to geom_line, for a line plot. Easy peasy.

Facets

Finally, ggplot2 makes creating multiple plots very easy with “facets”, which are just subplots that each correspond to a specific value of a variable or more generally to a specific subset of the data. To add facets in using ggplot(), you need the facet_wrap() or facet_grid() function. In the multiple plots created above with the basic graphics, the variable was country and each plot was the subset of data for each of the four countries. For example, with ggplot(),

usukjpcn %>% ggplot() + 
  geom_line(mapping = aes(x = year, y = lifeExp, color = country)) +
  facet_wrap(~ country)

The first argument to facet_wrap() is a “formula”, which refers to how one writes regression formulas using R’s regression function lm(). Essentially, the “~” is like an equals sign for a regression formula and the variables on the right hand side are dependent variables. The variables passed to facet_wrap should be categorical or discrete (like factors); in other words, you can make multiple plots for different values of a numeric variable.

Adding an additional facet is possible to make a grid of plots. To see an example, load the genomic imprinting data from the Babak et al. (2015) study (this time in tidy format),

imprint = read_csv("babak-etal-2015_imprinted-mouse_tidy.csv", na = "NA")

and plot a histogram of the expression values for a subset of the data (four genes and four tissues)

fourgenes = imprint %>% 
  filter((Genes == "IGF2" | Genes == "GRB10" | Genes == "MEG3" | Genes == "PEG3") &
         (tissue == "Preoptic Area (ref)" | tissue == "Hypothalamus" | 
          tissue == "e17.5 Brain" | tissue == "e9.5 Yolk Sac" ))

fourgenes %>% ggplot() +
  geom_histogram(aes(expression)) +
  facet_grid(Genes ~ tissue)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here, facet_grid arranges the plots nicely in a grid. Positive values of expression indicate expression from the paternal chromosome only and negative values from the maternal chromosome. Thus, this plot shows how some genes are “imprinting” differently depending on the tissue type.

Lab

Problems

  1. Use GWAS data from last week (gwas_catalog_v1.0.2-associations_e104_r2021-09-23_no-waist_hip_body_blood_education_math_top100.tsv):

    • Create a plot with ggplot() that shows the PVALUE_MLOG as a function of CHR_POS for two different values of DISEASE/TRAIT (pick your favorites!)
    • Make the points for each disease/trait a different color
  2. Use the gapminder data:

    • Create a line plot with ggplot() that shows life expectancy on the y-axis and time on the x-axis.
    • Make each line represent a country
      (hint: look here: https://ggplot2.tidyverse.org/reference/aes_group_order.html to learn how to group the data appropriately)
    • Color each line by the continent
    • Make points whose size is proportional to GDP per capita (variable gdpPercap)
  3. Create a plot like the one in Problem 2

    • Except, the new plot has multiple subplots where each subplot is a different continent.
    • No need to color by continent though.
    • Instead, color the points red where countries at that year had greater than 50 million people.
      (hint: try to use Google for help here. there are multiple ways to do this too :-) )
  4. Use the COVID-19 data below from the New York Times. It has daily case and death data.

    us = read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us-states.csv")
    • Choose three states and plot the cases per 100k versus time.
    • Make the line color vary by the deaths per 100k.
    • Each state should have its own panel by using facet_wrap. Make sure facet_wrap creates a single column of plots with three rows.
    • What do you notice about the relationship between peak cases and peak deaths?

Footnotes

  1. see the Canvas course site for a copy of the book↩︎