Visualizing lots of data

Author

Jeremy Van Cleve

Published

November 21, 2022

Outline for today

  • Grids of plots
  • Principal component analysis (PCA) plots

Reminder

  • Schedule for lightning talks

Plotting data with many variables

In many of the datasets that we’ve seen so far, we’ve already been faced with the problem that the data could be visualized in many different ways because they have many different variables. Some datasets were time series and it was clear that scatter plots or line plots were useful with time on the x-axis and multiple variables on the y-axis. For example, these global COVID-19 data from Our World in Data (https://github.com/owid/covid-19-data/tree/master/public/data) for the year 2021

library(tidyverse)
theme_set(theme_classic())

owid = read_csv("https://github.com/vancleve/BIO621-DWVR/raw/main/owid-covid-data_2021.csv")
owid
# A tibble: 83,697 × 67
   iso_code continent locat…¹ date       total…² new_c…³ new_c…⁴ total…⁵ new_d…⁶
   <chr>    <chr>     <chr>   <date>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 AFG      Asia      Afghan… 2021-01-01   52513     183    131.    2201      12
 2 AFG      Asia      Afghan… 2021-01-02   52586      73    117.    2211      10
 3 AFG      Asia      Afghan… 2021-01-03   52709     123    123     2221      10
 4 AFG      Asia      Afghan… 2021-01-04   52909     200    129.    2230       9
 5 AFG      Asia      Afghan… 2021-01-05   53011     102    123.    2237       7
 6 AFG      Asia      Afghan… 2021-01-06   53105      94    111.    2244       7
 7 AFG      Asia      Afghan… 2021-01-07   53207     102    125.    2253       9
 8 AFG      Asia      Afghan… 2021-01-08   53332     125    117     2257       4
 9 AFG      Asia      Afghan… 2021-01-09   53400      68    116.    2264       7
10 AFG      Asia      Afghan… 2021-01-10   53489      89    111.    2277      13
# … with 83,687 more rows, 58 more variables: new_deaths_smoothed <dbl>,
#   total_cases_per_million <dbl>, new_cases_per_million <dbl>,
#   new_cases_smoothed_per_million <dbl>, total_deaths_per_million <dbl>,
#   new_deaths_per_million <dbl>, new_deaths_smoothed_per_million <dbl>,
#   reproduction_rate <dbl>, icu_patients <dbl>,
#   icu_patients_per_million <dbl>, hosp_patients <dbl>,
#   hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, …
owid %>% 
  filter(location %in% c("United States", "Brazil", "India", "United Kingdom", "Sweden")) %>%
  ggplot(aes(x=date, y=new_deaths_smoothed_per_million)) +
  geom_line(aes(color=location)) + 
  labs(x = "Date", y = "Deaths per million persons (smoothed)", color = "Country")

are presented nicely as a time series since the time course of the pandemic is a crucial feature that we might be interested in studying.

However, our data may have many variables where no single one of them, like time, stands out as the one against which we would like to compare all the others. In this case, how the variables correlate with one another may be important. For example, let’s load a dataset from Keith Tarvin of Oberlin College on morphological measurements of blue jays.

load("blue_jays.rda")

glimpse(blue_jays)
Rows: 123
Columns: 9
$ BirdID     <fct> 0000-00000, 1142-05901, 1142-05905, 1142-05907, 1142-05909,…
$ KnownSex   <fct> M, M, M, F, M, F, M, M, F, F, M, F, M, F, M, F, F, M, M, M,…
$ BillDepth  <dbl> 8.26, 8.54, 8.39, 7.78, 8.71, 7.28, 8.74, 8.72, 8.20, 7.67,…
$ BillWidth  <dbl> 9.21, 8.76, 8.78, 9.30, 9.84, 9.30, 9.28, 9.94, 9.01, 9.31,…
$ BillLength <dbl> 25.92, 24.99, 26.07, 23.48, 25.47, 22.25, 25.35, 30.00, 22.…
$ Head       <dbl> 56.58, 56.36, 57.32, 53.77, 57.32, 52.25, 57.12, 60.67, 52.…
$ Mass       <dbl> 73.30, 75.10, 70.25, 65.50, 74.90, 63.90, 75.10, 78.10, 64.…
$ Skull      <dbl> 30.66, 31.38, 31.25, 30.29, 31.85, 30.00, 31.77, 30.67, 30.…
$ Sex        <int> 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,…

There are 123 individuals in this dataset and the variables include bill depth, width, length, head size, mass, sex, etc. If we wanted to see how head size and body mass correlate, we could simply do a scatter plot.

blue_jays %>% ggplot(aes(x=Mass, y=Head)) + geom_point()

It looks like head size and mass correlate positively, which makes sense. We know there are males and females, so we can add another variable by coloring the points by sex.

blue_jays %>% ggplot(aes(x=Mass, y=Head)) + geom_point(aes(color=KnownSex))

Grids of plots

Ok, so male birds are larger and heavier in this population. But there are a bunch of other variables too; how do they correlate with mass and head size and with each other? How do they differ with sex? In other words, we would like to make a scatter plot for each pair of variables in the dataset. To do this, we can use the facet functions like we did earlier in the semester but now in a more sophisticated way. First, a little reminder about how faceting works in ggplot. Let’s look back at the time that we used facet_wrap in Week 6.

library(gapminder)

usukjpcn = filter(gapminder, 
            country == "United States" 
            | country == "United Kingdom" 
            | country == "Japan" 
            | country == "China" )

ggplot(data = usukjpcn) + 
  geom_line(mapping = aes(x = year, y = lifeExp, color = country)) +
  facet_wrap(~ country)

We first wrote the ggplot commands as if we wanted to plot year vs lifeExp. Then, we used the facet_wrap function and told it we wanted a plot for each distinct value of the country variable. The output was then four plots, one for each country, with the same x- and y-axes. This kind of setup actually only plots three variables, year, lifeExp, and country, and does not make a scatter plot for each pair of variables. However, with a little data wrangling, we can create a new data table that will work with facet_grid.

Looking at the example above, facet_wrap (and facet_grid) generate multiple plots with the same x- and y-axis so we need to collapse our data for the different morphological measurements into two columns. Recall from our lecture on tidying data that we can do this with pivot_longer.

bgx = blue_jays %>% select(BirdID, KnownSex, Head, Mass, Skull) %>%
  pivot_longer(Head:Skull, names_to = "var_x", values_to = "val_x")

bgy = blue_jays %>% select(BirdID, KnownSex, Head, Mass, Skull) %>%
  pivot_longer(Head:Skull, names_to = "var_y", values_to = "val_y")

In the table above, we’ve selected only four variables: sex, head size, mass, and skull size. In order to create scatter plots of head size, mass, and skull size, we gathered them together so that var_x tells us which variable we have and val_x the value of that variable. Likewise, we created an identical table with var_y and val_y since we’re going to plot each variable on both the x- and y-axis. Next, we just join the two tables together.

bg = left_join(bgx, bgy) 
Joining, by = c("BirdID", "KnownSex")

Because the join matches rows only on BirdID and KnownSex, each bird ends up with every combination of its x- and y-variables (3 × 3 = 9 rows per bird), which is exactly what we need for a grid of pairwise scatter plots. Now, we can plot them with facet_grid(var_y ~ var_x).

bg %>% 
  ggplot(aes(val_x, val_y, color = KnownSex)) + 
  geom_point() + 
  facet_grid(var_y ~ var_x, scales = "free") + 
  labs(x = NULL, y = NULL)

Voilà! We can see now that males are generally larger across all these variables, and that the variables all positively correlate with each other. In ecology, this kind of correlation among different morphological measurements is studied in the field of “allometry”.
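
For instance, a small variation on the grid above (a sketch, not part of the original code) adds per-sex linear fits to make those positive correlations easier to see; the diagonal panels are trivially perfect fits since they plot each variable against itself.

bg %>%
  ggplot(aes(val_x, val_y, color = KnownSex)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # one straight-line fit per sex in each panel
  facet_grid(var_y ~ var_x, scales = "free") +
  labs(x = NULL, y = NULL)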

Of course, we’re not the first ones to want to do this kind of plot. There are a few packages which perform this kind of plot automatically, such as GGally whose function ggpairs adds some nice things.

library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
blue_jays %>% 
  select(KnownSex, Head, Mass, Skull) %>% 
  ggpairs(aes(color=KnownSex))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

However, even with standard ggplot, we can still get more than halfway to what this fancy package does!
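
As an illustration, here is a minimal sketch (not part of the original grid above, and assuming the bg table we built) that overlays the Pearson correlation for each pair of variables, one of the extras that ggpairs provides automatically.

cor_labels = bg %>%
  group_by(var_x, var_y) %>%
  summarize(r = cor(val_x, val_y), .groups = "drop")  # one correlation per panel

bg %>%
  ggplot(aes(val_x, val_y)) +
  geom_point(aes(color = KnownSex)) +
  geom_text(data = cor_labels, aes(label = paste("r =", round(r, 2))),
            x = -Inf, y = Inf, hjust = -0.1, vjust = 1.5, inherit.aes = FALSE) +
  facet_grid(var_y ~ var_x, scales = "free") +
  labs(x = NULL, y = NULL)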

Principal component analysis (PCA) plots

Recall that the blue jay dataset has more than just the three variables above. What if it had 10 more variables? Or 100 more? We couldn’t easily interpret a 100×100 grid of scatter plots to understand how the variables relate to one another or whether there are different patterns for males and females. A very useful statistical tool in these circumstances is principal component analysis, which is a type of dimensionality reduction. Dimensionality reduction uses the fact that there are often lots of correlations in high-dimensional data, and thus there are “effectively” many fewer important dimensions that capture the independent variation of the whole dataset. For example, in the blue jay morphology data, we can see that many body measures are positively correlated. Thus, if we know just body mass, we can make a good guess about what head size or skull size might be. We could even do a linear regression of head size against body mass and of skull size against body mass and then use body mass to predict both of those variables. If we do this kind of linear analysis, adding together the variables weighted by coefficients and choosing the weights so that the weighted sum explains as much variation in the data as possible, we end up with a PCA. A PCA not only gives you a new variable (the first principal component) that explains a lot of the variation, it also gives you as many new variables as you had old ones, where each new variable after the first explains less variation in the data than the one before it.
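
Under the hood, these optimal weighted sums are the eigenvectors of the covariance matrix of the data. Here is a minimal sketch of that equivalence using the blue jay measurements; it mirrors what the prcomp function we use below does with its default settings (centering each variable but not rescaling it).

# PCA "by hand": eigendecomposition of the covariance matrix (a sketch)
X   = blue_jays %>% select(BillDepth, BillWidth, BillLength, Head, Mass, Skull) %>% as.matrix()
Xc  = scale(X, center = TRUE, scale = FALSE)  # center each column, as prcomp() does by default
eig = eigen(cov(Xc))                          # eigenvectors are the PC loadings (the "rotation")
sqrt(eig$values)                              # PC standard deviations (prcomp()'s sdev)
scores = Xc %*% eig$vectors                   # data in PC coordinates (prcomp()'s x, up to sign)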

Let’s see an example of PCA. We can use the prcomp function in R to generate the principal components.

bpca = blue_jays %>% select(-BirdID, -KnownSex, -Sex) %>% prcomp()

cbind(blue_jays, bpca$x) %>% ggplot(aes(x=PC1, y=PC2)) + 
  geom_point(aes(color=KnownSex))

This plot shows all the bird samples plotted on the first two principal components. You can immediately see that males and females form roughly two different clouds of points, which accords with our prior knowledge that the sexes differ in body morphology. To see how much of the variance each PC explains, we look at the summary.

summary(bpca)
Importance of components:
                          PC1     PC2     PC3     PC4     PC5      PC6
Standard deviation     4.9395 1.51406 0.90287 0.49411 0.30416 0.001672
Proportion of Variance 0.8763 0.08233 0.02928 0.00877 0.00332 0.000000
Cumulative Proportion  0.8763 0.95863 0.98791 0.99668 1.00000 1.000000
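
Since prcomp returns the standard deviations of the components in sdev, we can also turn these proportions of variance into a quick scree plot; a minimal sketch:

tibble(PC = seq_along(bpca$sdev),
       prop_var = bpca$sdev^2 / sum(bpca$sdev^2)) %>%  # variance of each PC / total variance
  ggplot(aes(x = PC, y = prop_var)) +
  geom_col() +
  labs(x = "Principal component", y = "Proportion of variance explained")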

The summary gives the proportion of variance explained by each of the principal components: the first component explains almost 90% of the variance and the second only about 8%. To see how each PC is composed of the underlying variables, we look at the rotation matrix:

bpca$rotation
                   PC1         PC2         PC3         PC4         PC5
BillDepth  -0.03961552  0.08989552 -0.06763782 -0.05770882 -0.99118308
BillWidth  -0.03362384  0.08147503  0.01955653 -0.99377559  0.06525873
BillLength -0.11159588  0.61166243 -0.51854443  0.05004609  0.09273888
Head       -0.21992974  0.72883067  0.28050368  0.07582643  0.05100298
Mass       -0.96162735 -0.25741373 -0.09198285  0.01098280  0.02072545
Skull      -0.10839813  0.11702017  0.79938175  0.02654960 -0.04081731
                     PC6
BillDepth  -5.705440e-04
BillWidth  -4.087104e-04
BillLength -5.774270e-01
Head        5.774110e-01
Mass        5.583775e-05
Skull      -5.772123e-01

This shows that the first PC is mostly body mass and that the other variables are positively correlated with body mass (a PC’s loadings can all be multiplied by −1 without changing anything, so what matters is their relative signs). We can also plot how the variables load on the first two PCs.

as.data.frame(bpca$rotation) %>% 
  mutate(feature=rownames(bpca$rotation)) %>% 
  ggplot(aes(x=PC1, y=PC2)) + 
  geom_point(aes(color=feature), show.legend = FALSE) + 
  geom_text(aes(label=feature), size=3, position=position_jitter(width=0.05,height=0.05))

This just shows that mass dominates PC1, whereas PC2 captures how the rest of the variables vary in opposition to mass once PC1 is taken into account.
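
One caveat worth noting as an aside (this is not part of the original analysis): mass dominates PC1 partly because it is measured in grams and so has a much larger raw variance than the millimeter-scale bill and skull measurements. prcomp can standardize the variables first via its scale. argument; a sketch:

bpca_scaled = blue_jays %>%
  select(-BirdID, -KnownSex, -Sex) %>%
  prcomp(scale. = TRUE)  # standardize each variable to unit variance before the PCA

summary(bpca_scaled)     # compare the proportions of variance to the unscaled PCA above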

Finally, let’s do a PCA for data that don’t have such a clear grouping variable as sex. One example is the gene expression data from the genomic imprinting dataset. Here, we’ll do a PCA using the Genes as variables. This entails using pivot_wider to put the Genes as variables, i.e., columns (which of course is how the original .xlsx file was arranged, but we’re using the tidy data as practice).

imprint = read_csv("babak-etal-2015_imprinted-mouse_tidy.csv")
Rows: 4125 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Genes, tissue
dbl (1): expression

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
imprint_t = imprint %>% pivot_wider(names_from = Genes, values_from = expression) %>% na.omit()
ipca = imprint_t %>% select(-tissue) %>% prcomp()

data.frame(tissue=imprint_t$tissue, ipca$x) %>% ggplot(aes(x=PC1, y=PC2)) + 
  geom_point() + 
  geom_text(aes(label=tissue), size=2.5, position=position_jitter(width=2,height=5))

We can see from the PCA that some tissues group together naturally (the brain tissues) and others not as much. This could be the beginning of an investigation into whether there are biological reasons these other tissues group the way they do. We can use summary to see how strong the first few PCs are.

summary(ipca)
Importance of components:
                           PC1     PC2      PC3     PC4      PC5      PC6
Standard deviation     37.5944 28.5723 20.49342 19.5804 18.51889 18.08067
Proportion of Variance  0.2138  0.1235  0.06353  0.0580  0.05188  0.04945
Cumulative Proportion   0.2138  0.3373  0.40084  0.4588  0.51072  0.56018
                            PC7      PC8      PC9     PC10     PC11     PC12
Standard deviation     17.12899 15.80112 15.58329 15.01881 14.04080 13.21549
Proportion of Variance  0.04439  0.03777  0.03674  0.03412  0.02982  0.02642
Cumulative Proportion   0.60456  0.64233  0.67907  0.71319  0.74302  0.76944
                          PC13     PC14     PC15     PC16     PC17     PC18
Standard deviation     12.9327 12.63925 11.55027 11.43424 11.08405 10.96401
Proportion of Variance  0.0253  0.02417  0.02018  0.01978  0.01859  0.01819
Cumulative Proportion   0.7947  0.81890  0.83909  0.85887  0.87745  0.89564
                           PC19    PC20    PC21    PC22    PC23    PC24    PC25
Standard deviation     10.40468 9.55383 9.33377 8.84628 8.18059 7.77386 7.58014
Proportion of Variance  0.01638 0.01381 0.01318 0.01184 0.01012 0.00914 0.00869
Cumulative Proportion   0.91201 0.92582 0.93900 0.95084 0.96096 0.97010 0.97880
                          PC26    PC27    PC28     PC29
Standard deviation     7.03762 6.83679 6.62491 1.67e-14
Proportion of Variance 0.00749 0.00707 0.00664 0.00e+00
Cumulative Proportion  0.98629 0.99336 1.00000 1.00e+00
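
As a quick check on how many components we need, here is a minimal sketch that counts the PCs required to reach a given share of the variance (80% is an arbitrary threshold chosen for illustration):

prop_var = ipca$sdev^2 / sum(ipca$sdev^2)  # proportion of variance per PC
which(cumsum(prop_var) >= 0.80)[1]         # first PC at which the cumulative share passes 80%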

The first PC only captures 21% of the variance and the second captures 12%. Thus, we need more of the PCs in this dataset to describe a substantial amount of the variance. Finally, we can plot how the genes make up the first two PCs. Then, we could use this information to see if the biological function of these genes explains how the tissues group along PC1 and PC2.

data.frame(gene=rownames(ipca$rotation), ipca$rotation) %>%
  ggplot(aes(x=PC1, y=PC2)) + 
  geom_point() + 
  geom_text(aes(label=gene), size=2.5, position=position_jitter(width=0.015,height=0.03))

Lab

Problems

  1. Using the gapminder data:

    • Perform a principal components analysis using the variables lifeExp, year, pop, and gdpPercap.
    • Plot all the points on a scatter plot with the x- and y-axis corresponding to the first two principal components (PC1 and PC2).
    • Color the points according to year. Set the shape of the point (square, circle, etc) according to continent.
    • What patterns do you observe if any?
  2. The Our World in Data COVID-19 dataset is perfect for performing a PCA. Use the code below that generates the owid_total data table. This contains a set of variables from the OWID dataset and the total deaths per million by the end of the year 2021.

    owid_total = 
    read_csv("https://github.com/vancleve/BIO621-DWVR/raw/main/owid-covid-data_2021.csv") %>% 
    filter(date == "2021-12-31") %>% 
    select(location, total_deaths_per_million, total_cases_per_million, people_vaccinated_per_hundred, 
           population_density, median_age, male_smokers, cardiovasc_death_rate, diabetes_prevalence, life_expectancy, gdp_per_capita) %>% 
    filter(if_all(everything(), ~ !is.na(.x)))
    • Run a PCA including all the variables except location and total_deaths_per_million. What variables “weigh” the strongest (have the largest absolute value) in the first three principal components?
    • Use cbind(owid_total, your_PCA$x) to generate a new data frame
    • Use the above data frame to plot countries by their location in the coordinates of PC1 and PC2. Color the points by whether they have greater than 1000 total deaths per million. What do you notice about where the points with the higher numbers of total deaths are located? Interpret this in terms of the most important variables in PC1 and PC2.
    • Adding geom_text to label the points by their country (location) name is a nice addition to the plot, but it’s optional.