Introduction to plotting and ggplot2

Author

Jeremy Van Cleve

Published

October 15, 2024

Outline for today

  • Plotting with R base graphics (sucks!)
  • Basic plotting with ggplot2

Plotting with R base graphics

Why do you even want to plot things?

  • See patterns / do comparisons
  • Data presentation

The base package (basic functions included with R and automatically loaded) includes plotting functions that can be useful. You’ve already used it a few times in the problems but below, you will see that the limits of these functions become evident quickly and more advanced plotting functions are required. The package of functions we will use in place of the base functions is Hadley Wickham’s ggplot2.

Before plotting, you need some data. A common dataset used in data exploration courses in R is the “Gapminder” data, which is a collection of time series data for countries.

library(gapminder)
library(tidyverse)
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

The data covers the statistics listed above for

levels(gapminder$country)
  [1] "Afghanistan"              "Albania"                 
  [3] "Algeria"                  "Angola"                  
  [5] "Argentina"                "Australia"               
  [7] "Austria"                  "Bahrain"                 
  [9] "Bangladesh"               "Belgium"                 
 [11] "Benin"                    "Bolivia"                 
 [13] "Bosnia and Herzegovina"   "Botswana"                
 [15] "Brazil"                   "Bulgaria"                
 [17] "Burkina Faso"             "Burundi"                 
 [19] "Cambodia"                 "Cameroon"                
 [21] "Canada"                   "Central African Republic"
 [23] "Chad"                     "Chile"                   
 [25] "China"                    "Colombia"                
 [27] "Comoros"                  "Congo, Dem. Rep."        
 [29] "Congo, Rep."              "Costa Rica"              
 [31] "Cote d'Ivoire"            "Croatia"                 
 [33] "Cuba"                     "Czech Republic"          
 [35] "Denmark"                  "Djibouti"                
 [37] "Dominican Republic"       "Ecuador"                 
 [39] "Egypt"                    "El Salvador"             
 [41] "Equatorial Guinea"        "Eritrea"                 
 [43] "Ethiopia"                 "Finland"                 
 [45] "France"                   "Gabon"                   
 [47] "Gambia"                   "Germany"                 
 [49] "Ghana"                    "Greece"                  
 [51] "Guatemala"                "Guinea"                  
 [53] "Guinea-Bissau"            "Haiti"                   
 [55] "Honduras"                 "Hong Kong, China"        
 [57] "Hungary"                  "Iceland"                 
 [59] "India"                    "Indonesia"               
 [61] "Iran"                     "Iraq"                    
 [63] "Ireland"                  "Israel"                  
 [65] "Italy"                    "Jamaica"                 
 [67] "Japan"                    "Jordan"                  
 [69] "Kenya"                    "Korea, Dem. Rep."        
 [71] "Korea, Rep."              "Kuwait"                  
 [73] "Lebanon"                  "Lesotho"                 
 [75] "Liberia"                  "Libya"                   
 [77] "Madagascar"               "Malawi"                  
 [79] "Malaysia"                 "Mali"                    
 [81] "Mauritania"               "Mauritius"               
 [83] "Mexico"                   "Mongolia"                
 [85] "Montenegro"               "Morocco"                 
 [87] "Mozambique"               "Myanmar"                 
 [89] "Namibia"                  "Nepal"                   
 [91] "Netherlands"              "New Zealand"             
 [93] "Nicaragua"                "Niger"                   
 [95] "Nigeria"                  "Norway"                  
 [97] "Oman"                     "Pakistan"                
 [99] "Panama"                   "Paraguay"                
[101] "Peru"                     "Philippines"             
[103] "Poland"                   "Portugal"                
[105] "Puerto Rico"              "Reunion"                 
[107] "Romania"                  "Rwanda"                  
[109] "Sao Tome and Principe"    "Saudi Arabia"            
[111] "Senegal"                  "Serbia"                  
[113] "Sierra Leone"             "Singapore"               
[115] "Slovak Republic"          "Slovenia"                
[117] "Somalia"                  "South Africa"            
[119] "Spain"                    "Sri Lanka"               
[121] "Sudan"                    "Swaziland"               
[123] "Sweden"                   "Switzerland"             
[125] "Syria"                    "Taiwan"                  
[127] "Tanzania"                 "Thailand"                
[129] "Togo"                     "Trinidad and Tobago"     
[131] "Tunisia"                  "Turkey"                  
[133] "Uganda"                   "United Kingdom"          
[135] "United States"            "Uruguay"                 
[137] "Venezuela"                "Vietnam"                 
[139] "West Bank and Gaza"       "Yemen, Rep."             
[141] "Zambia"                   "Zimbabwe"                

142 countries.

We’ve already seen some basic plots using base::plot. For example, let’s plot how life expectancy has changed over the years

plot(gapminder$year, gapminder$lifeExp)

You can modify the axis labels by adding the xlab and ylab arguments and give the plot a title with the main argument.

plottitle = "Life expectancy over time for all countries in the Gapminder data" # save space in the code
plot(gapminder$year, gapminder$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle)

Though there is a clear trend observable even at this scale, increasing life expectancy with time, but does this hold for individual countries? To see this, you could slice just a single country out of the data. Using the United States, for example,

us = gapminder |> filter(country == "United States")
plot(us$year, us$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle)

When plotting a couple of countries together, you can color the points by country by using the col argument to plot:

usukjpcn = gapminder |> 
  filter(country == "United States" | country == "United Kingdom" | country == "Japan" | country == "China" )
usukjpcn
# A tibble: 48 × 6
   country continent  year lifeExp        pop gdpPercap
   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
 1 China   Asia       1952    44    556263527      400.
 2 China   Asia       1957    50.5  637408000      576.
 3 China   Asia       1962    44.5  665770000      488.
 4 China   Asia       1967    58.4  754550000      613.
 5 China   Asia       1972    63.1  862030000      677.
 6 China   Asia       1977    64.0  943455000      741.
 7 China   Asia       1982    65.5 1000281000      962.
 8 China   Asia       1987    67.3 1084035000     1379.
 9 China   Asia       1992    68.7 1164970000     1656.
10 China   Asia       1997    70.4 1230075000     2289.
# ℹ 38 more rows
plot(usukjpcn$year, usukjpcn$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     col = usukjpcn$country)

legend('bottomright', legend = levels(usukjpcn$country), col = 1:4, pch = 1)

Uh oh, the legend includes all the countries! This is because slicing the data doesn’t dropped unused factor levels. Dang. You can fix this using the droplevels() function

usukjpcn = droplevels(usukjpcn)
plot(usukjpcn$year, usukjpcn$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     col = usukjpcn$country)
levels(usukjpcn$country)
[1] "China"          "Japan"          "United Kingdom" "United States" 
legend('bottomright', legend = levels(usukjpcn$country), col = 1:4, pch = 1)

So far, the basic plot() function is pretty useful. To see its limits, you can try something that shouldn’t be that much harder: plotting these data as points connected by a line. At first pass, you could try giving the plot() function the type = "l" argument to indicated you want lines.

plot(usukjpcn$year, usukjpcn$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     col = usukjpcn$country,
     type = 'l')

However, while plot() could color points by country, it doesn’t make individual lines for each country and simply makes one connected line. Getting around this in the basic plot() function is tedious. One way involves using the lines() function to indicate specifically each line you want.

# make an empty plot. lame.
plot(0:0, xlim = c(1952, 2007), ylim = c(20, 85), 
     xlab = "Year", ylab = "Life expectancy", main = plottitle, type = "n") 

# make each line individually. again, lame.
lines(filter(usukjpcn, country == "United States")$year, 
     filter(usukjpcn, country == "United States")$lifeExp) 
lines(filter(usukjpcn, country == "United Kingdom")$year, 
     filter(usukjpcn, country == "United Kingdom")$lifeExp)
lines(filter(usukjpcn, country == "Japan")$year, 
     filter(usukjpcn, country == "Japan")$lifeExp) 
lines(filter(usukjpcn, country == "China")$year, 
     filter(usukjpcn, country == "China")$lifeExp)

That was quite tedious. Another way would be to have individual plots for each line.

par(mfrow=c(1,4))
plot(filter(usukjpcn, country == "United States")$year, 
     filter(usukjpcn, country == "United States")$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     type = 'l')
plot(filter(usukjpcn, country == "United Kingdom")$year, 
     filter(usukjpcn, country == "United Kingdom")$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     type = 'l')
plot(filter(usukjpcn, country == "Japan")$year, 
     filter(usukjpcn, country == "Japan")$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     type = 'l')
plot(filter(usukjpcn, country == "China")$year, 
     filter(usukjpcn, country == "China")$lifeExp, 
     xlab = "Year", ylab = "Life expectancy", main = plottitle,
     type = 'l')

That was also very tedious. Not only that, notice that the y-axes have different ranges, which makes comparison across countries difficult. Correcting for this and other issues adds to the tedium. Using ggplot2 to create these kind of plots is much easier.

Basic plotting with ggplot2

The ggplot2 package is built on the idea that graphics have can have a “grammar”, or set or rules, that specifies how they can and should be constructed. Implementing these rules not only makes creating graphics easier, but it makes such graphics consistent and clear. Wickham borrows this idea from the book, “The Grammar of Graphics”” by Wilkinson, Anand, and Grossman (2005)1. While this structure may seem a bit artificial at first, it makes creating graphics very modular and building up complex graphics much easier.

To understand how ggplot2 works, first consider how the following example replicates the life expectancy by year plot above:

library(ggplot2)
ggplot(data = gapminder) + 
  geom_point(aes(x = year, y = lifeExp))

The first line has the function ggplot, which creates a coordinate system for the plot that you can add “layers” to. The argument to ggplot indicates which dataset to use.

Adding layers is accomplished by “adding” to the base plot with + and additional features via additional functions. The first feature added above is geom_point(), which adds a layer of points and creates a scatter plot. ggplot2 comes with many other “geometries”, which are really just other kinds of visual objects, like lines and boxes.

Each geom() function takes a mapping argument that defines how variables map to the visual objects in the plot. The mapping argument is always paired with the aes() function, which stands for “aesthetics”, and the x and y arguments of aes() specify which variables correspond to which axis.

Aesthetic mapping with aes()

In addition to the x and y axes, you can map variables to the color of the point. Above, you did this using the country variable. Now, repeat this with ggplot().

usukjpcn |> ggplot() + 
  geom_point(mapping = aes(x = year, y = lifeExp, color = country))

The above code with ggplot2 is both simpler and easier to read than the code creating the similar plot with the basic plot() function. For example, ggplot adds the legend automatically when you add an aesthetic like color.

Color is just one additional aesthetic. Others include the size and shape of the point.

For example, you can make the size of the point proportional to the population:

usukjpcn |> ggplot() + 
  geom_point(mapping = aes(x = year, y = lifeExp, color = country, size = pop))

Adding another aesthetic like size caused ggplot to automatically add another legend. Great! We could even try packing in another aesthetic, shape.

usukjpcn |> ggplot() + 
  geom_point(mapping = aes(x = year, y = lifeExp, color = gdpPercap, size = pop, shape = country))

It’s starting to get a bit cluttered, but playing around a bit can help you figure out which aesthetic will visually help you best see patterns or differentiate different variable values.

The shape of the points has many options and can be specified with a number from below

where the hollow shapes (0–14) have a border determined by color, the solid shapes (15–20) are filled with color, the filled shapes (21–24) have a border of color and are filled with fill. For example,

usukjpcn |> ggplot() + 
  geom_point(mapping = aes(x = year, y = lifeExp, shape = 23, color = country, fill = pop)) +
  scale_shape_identity()

where scale_shape_identity() tells ggplot to use the number we give to shape directly instead of just assigning it something by default as it would do for any other variable. We will talk more about scales when we talk about how to modify different plot elements.

With knowledge of aes, we can now ready recreate the multiline plot from above.

ggplot(data = usukjpcn) + 
  geom_line(mapping = aes(x = year, y = lifeExp, color = country))

The only change you had to make was to change geom_point to geom_line, for a line plot. Easy peasy.

Facets

Finally, ggplot2 makes creating multiple plots very easy with “facets”, which are just subplots that each correspond to a specific value of a variable or more generally to a specific subset of the data. To add facets in using ggplot(), you need the facet_wrap() or facet_grid() function. In the multiple plots created above with the basic graphics, the variable was country and each plot was the subset of data for each of the four countries. For example, with ggplot(),

usukjpcn |> ggplot() + 
  geom_line(mapping = aes(x = year, y = lifeExp, color = country)) +
  facet_wrap(vars(country))

The first argument to facet_wrap() uses the vars function, which like aes, is a function that can take unquoted variable or column names that you want to use. Variables for facet_wrap should be categorical or discrete (like factors). For numerical variables, it is likely that are many possible values, so care is needed when trying to use these for facet_wrap plots are many plots could be created.

With facet_wrap, a vector of plots is created where each plot is a specific combination of the variables given to vars. For example, we could look at the relationship between lifeExp and log10(gdpPercap) and see if it is different before or after the fall of the Soviet Union.

gapminder |> ggplot(aes(x = lifeExp, y = log10(gdpPercap))) +
  geom_point() +
  facet_wrap(vars(continent, year > 1990))

Now, facet_wrap plots all the possible combinations of continent and year > 1990.

For two different variables in the facets, you can make a grid of plots with facet_grid where the values of one variable change along the rows of the grid and the values of the other variable change along the columns. For example, changing the above plot of lifeExp vs log10(gdpPercap) with continent and year > 1990 as facets to facet_grid yields

gapminder |> ggplot(aes(x = lifeExp, y = log10(gdpPercap))) +
  geom_point() +
  facet_grid(vars(continent), vars(year > 1990))

Note that now we identify the row variable first and the column variable second and that each is identified in a separate vars function. This allows us to specify multiple variables for both the rows and columns so that the facet_grid can cover many possible combinations of variables.

Let’s use facet_grid on the genomic imprinting data from the Babak et al. (2015) study (this time in tidy format),

imprint = read_csv("babak-etal-2015_imprinted-mouse_tidy.csv", na = "NA")

and plot a histogram of the expression values for a subset of the data (four genes and four tissues)

fourgenes = imprint |> 
  filter((Genes == "IGF2" | Genes == "GRB10" | Genes == "MEG3" | Genes == "PEG3") &
         (tissue == "Preoptic Area (ref)" | tissue == "Hypothalamus" | 
          tissue == "e17.5 Brain" | tissue == "e9.5 Yolk Sac" ))

fourgenes |> ggplot() +
  geom_histogram(aes(expression)) +
  facet_grid(vars(Genes), vars(tissue))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We’ve used a histogram now to count the fraction of times each expression value shows up in the data and used facet_grid to separate out expression by tissue type and gene name. Positive values of expression indicate expression from the paternal chromosome only and negative values from the maternal chromosome. Thus, this plot shows how some genes are “imprinting” differently depending on the tissue type.

Lab

  1. Use the GWAS data (gwas_catalog_v1.0.2-associations_e104_r2021-09-23_no-waist_hip_body_blood_education_math_top100.tsv):

    • Create a plot with ggplot() that shows the PVALUE_MLOG as a function of CHR_POS for two different values of DISEASE/TRAIT (pick your favorites!)
    • Make the points for each disease/trait a different color
  2. Use the gapminder data:

    • Create a line plot with ggplot() that shows life expectancy on the y-axis and time on the x-axis.
    • Make each line represent a country
      (hint: look here: https://ggplot2.tidyverse.org/reference/aes_group_order.html to learn how to group the data appropriately)
    • Color each line by the continent
    • Make points whose size is proportional to GDP per capita (variable gdpPercap)
  3. Create a plot like the one in Problem 2

    • Except, the new plot has multiple subplots where each subplot is a different continent.
    • No need to color by continent though.
    • Instead, color the points red where countries at that year had greater than 50 million people.
      (hint: try to use Google for help here. there are multiple ways to do this too :-) )
  4. We’ll use a CDC dataset on COVID-19 mortality from here:
    https://data.cdc.gov/Public-Health-Surveillance/Monthly-COVID-19-Death-Rates-per-100-000-Populatio/exs3-hbne/about_data
    It has mortality rates per 100k people in the crude_rate column and demographic groups in the subgroup1 and subgroup2 columns.

    library(RSocrata)
    covid = read.socrata("https://data.cdc.gov/api/odata/v4/exs3-hbne")

    The kind of demographic information for each row of the table is in the group column. You should see what the distinct values of group are and what the values are for subgroup1 and subgroup2 to understand better the demographic category of each row. Be careful to filter using an appropriate group value for each part of the problem below.

    • Make a point plot of the covid death rate in crude_rate vs the date for the years 2020-2021.
      Use facet_wrap to make this a set of plots for each age group in the data.
      Hint: use the year function to more easily filter the rows.
      Hint: filter by group to get the right subset of rows.
    • Make a point plot of the covid death rate in crude_rate vs the date for 2020-2021.
      Use facet_wrap to make a set of plots for each race/ethnicity value in the data.
      Think carefully about what might not be included in these data that might explain any differences you see between these groups.
    • Make a point plot of the covid death rate in crude_rate vs the date for 2020-2021.
      Use facet_grid to make a grid of plots for each combination of age and race/ethnicity value.
      Hint: filter by group first to get the right subset of data (there will be two different groups that yield the same data).
    • +3 points: make a version of the facet_grid plot above where there is a separate colored points for each jurisdiction_residence.
      See here for what states belong to which region: https://www.hhs.gov/about/agencies/iea/regional-offices/index.html

Footnotes

  1. see the Canvas course site for a copy of the book↩︎