Visualizing lots of data

Author

Jeremy Van Cleve

Published

November 19, 2024

Outline for today

  • Grids of plots
  • Principal component analysis (PCA) plots

Reminder

  • Schedule for lightning talks

Plotting data with many variables

In many of the datasets that we’ve seen so far, we’ve already been faced with the problem that the data could be visualized in many different ways because they have many different variables. Some datasets were time series and it was clear that scatter plots or line plots were useful with time on the x-axis and multiple variables on the y-axis. For example, these global COVID-19 data from Our World in Data (https://github.com/owid/covid-19-data/tree/master/public/data) for the year 2021

library(tidyverse)
theme_set(theme_classic())
owid = read_csv("https://catalog.ourworldindata.org/garden/covid/latest/compact/compact.csv")
Rows: 453386 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): country, code, continent
dbl  (57): total_cases, new_cases, new_cases_smoothed, total_cases_per_milli...
date  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
owid
# A tibble: 453,386 × 61
   country     date       total_cases new_cases new_cases_smoothed
   <chr>       <date>           <dbl>     <dbl>              <dbl>
 1 Afghanistan 2020-01-01          NA        NA                 NA
 2 Afghanistan 2020-01-02          NA        NA                 NA
 3 Afghanistan 2020-01-03          NA        NA                 NA
 4 Afghanistan 2020-01-04          NA        NA                 NA
 5 Afghanistan 2020-01-05           0         0                 NA
 6 Afghanistan 2020-01-06           0         0                 NA
 7 Afghanistan 2020-01-07           0         0                 NA
 8 Afghanistan 2020-01-08           0         0                 NA
 9 Afghanistan 2020-01-09           0         0                 NA
10 Afghanistan 2020-01-10           0         0                  0
# ℹ 453,376 more rows
# ℹ 56 more variables: total_cases_per_million <dbl>,
#   new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
#   total_deaths <dbl>, new_deaths <dbl>, new_deaths_smoothed <dbl>,
#   total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
#   new_deaths_smoothed_per_million <dbl>, excess_mortality <dbl>,
#   excess_mortality_cumulative <dbl>, …
owid |> 
  filter(country %in% c("United States", "Brazil", "India", "United Kingdom", "Sweden")) |>
  filter(year(date) == 2021, !is.na(new_deaths_smoothed_per_million)) |>
  ggplot(aes(x=date, y=new_deaths_smoothed_per_million)) +
  geom_line(aes(color=country)) + 
  labs(x = "Date", y = "Deaths per million persons (smoothed)", color = "Country")

are presented naturally as a time series since the time course of the pandemic is a crucial feature that we might be interested in studying.

However, our data may have many variables with no single one, like time, standing out as the variable against which we would like to compare the others. In this case, how the variables correlate with one another may be what matters. For example, let’s load a dataset from Keith Tarvin of Oberlin College containing morphological measurements of blue jays.

load("blue_jays.rda")

glimpse(blue_jays)
Rows: 123
Columns: 9
$ BirdID     <fct> 0000-00000, 1142-05901, 1142-05905, 1142-05907, 1142-05909,…
$ KnownSex   <fct> M, M, M, F, M, F, M, M, F, F, M, F, M, F, M, F, F, M, M, M,…
$ BillDepth  <dbl> 8.26, 8.54, 8.39, 7.78, 8.71, 7.28, 8.74, 8.72, 8.20, 7.67,…
$ BillWidth  <dbl> 9.21, 8.76, 8.78, 9.30, 9.84, 9.30, 9.28, 9.94, 9.01, 9.31,…
$ BillLength <dbl> 25.92, 24.99, 26.07, 23.48, 25.47, 22.25, 25.35, 30.00, 22.…
$ Head       <dbl> 56.58, 56.36, 57.32, 53.77, 57.32, 52.25, 57.12, 60.67, 52.…
$ Mass       <dbl> 73.30, 75.10, 70.25, 65.50, 74.90, 63.90, 75.10, 78.10, 64.…
$ Skull      <dbl> 30.66, 31.38, 31.25, 30.29, 31.85, 30.00, 31.77, 30.67, 30.…
$ Sex        <int> 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,…

There are 123 individuals in this dataset and the variables include bill depth, width, length, head size, mass, sex, etc. If we wanted to see how head size and body mass correlate, we could simply do a scatter plot.

blue_jays |> ggplot(aes(x=Mass, y=Head)) + geom_point()

It looks like head size and mass correlate positively, which makes sense. We know there are males and females, so we can add another variable by coloring the points by sex.

blue_jays |> ggplot(aes(x=Mass, y=Head)) + geom_point(aes(color=KnownSex))

Grids of plots

Ok, so male birds are larger and heavier in this population. But there are a bunch of other variables too; how do they correlate with mass and head size and with each other? How do they differ with sex? In other words, we would like to make a scatter plot for each pair of variables in the dataset. To do this, we can use the facet functions like we did earlier in the semester, but now in a more sophisticated way. First, a little reminder about how faceting works in ggplot. Let’s look back at how we used facet_wrap in Week 8.

library(gapminder)

usukjpcn = filter(gapminder, 
            country == "United States" 
            | country == "United Kingdom" 
            | country == "Japan" 
            | country == "China" )

ggplot(data = usukjpcn) + 
  geom_line(mapping = aes(x = year, y = lifeExp, color = country)) +
  facet_wrap(~ country)

We first wrote the ggplot commands as if we wanted to plot lifeExp as a function of year. Then, we used the facet_wrap function and told it we wanted a plot for each distinct value of the country variable. The output was four plots, one for each country, with the same x- and y-axes. This kind of setup only plots three variables, year, lifeExp, and country, and does not produce a scatter plot for each pair of variables. However, with a little data wrangling, we can create a new data table that will work with facet_grid.

Looking at the example above, we can surmise that facet_wrap and facet_grid generate multiple plots with the same x- and y-axes. Thus, if we want a grid of plots where each plot has a different combination of variables on the x- and y-axes, we need to collapse the morphological measurements into two pairs of columns: one pair for the variables that will go on the x-axis and one pair for the variables that will go on the y-axis. Recall from our lecture on tidying data that we can do this with pivot_longer.

bgx = blue_jays |> select(BirdID, KnownSex, Head, Mass, Skull) |>
  pivot_longer(Head:Skull, names_to = "var_x", values_to = "val_x")

bgy = blue_jays |> select(BirdID, KnownSex, Head, Mass, Skull) |>
  pivot_longer(Head:Skull, names_to = "var_y", values_to = "val_y")

bgx
# A tibble: 369 × 4
   BirdID     KnownSex var_x val_x
   <fct>      <fct>    <chr> <dbl>
 1 0000-00000 M        Head   56.6
 2 0000-00000 M        Mass   73.3
 3 0000-00000 M        Skull  30.7
 4 1142-05901 M        Head   56.4
 5 1142-05901 M        Mass   75.1
 6 1142-05901 M        Skull  31.4
 7 1142-05905 M        Head   57.3
 8 1142-05905 M        Mass   70.2
 9 1142-05905 M        Skull  31.2
10 1142-05907 F        Head   53.8
# ℹ 359 more rows
bgy
# A tibble: 369 × 4
   BirdID     KnownSex var_y val_y
   <fct>      <fct>    <chr> <dbl>
 1 0000-00000 M        Head   56.6
 2 0000-00000 M        Mass   73.3
 3 0000-00000 M        Skull  30.7
 4 1142-05901 M        Head   56.4
 5 1142-05901 M        Mass   75.1
 6 1142-05901 M        Skull  31.4
 7 1142-05905 M        Head   57.3
 8 1142-05905 M        Mass   70.2
 9 1142-05905 M        Skull  31.2
10 1142-05907 F        Head   53.8
# ℹ 359 more rows

In the table above, we’ve kept only the bird ID, sex, and three measurements: head size, mass, and skull size. In order to create scatter plots for head size, mass, and skull size, we gathered the measurements together so that var_x tells us which variable we have and val_x the value of that variable. Likewise, we created an identical table with var_y and val_y since we’re going to plot each variable on both the x- and y-axis. Next, we just join the two tables together.

bg = inner_join(bgx, bgy) # we use `inner_join` but `full`, `left`, or `right` will do the same thing
Joining with `by = join_by(BirdID, KnownSex)`
Warning in inner_join(bgx, bgy): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
bg
# A tibble: 1,107 × 6
   BirdID     KnownSex var_x val_x var_y val_y
   <fct>      <fct>    <chr> <dbl> <chr> <dbl>
 1 0000-00000 M        Head   56.6 Head   56.6
 2 0000-00000 M        Head   56.6 Mass   73.3
 3 0000-00000 M        Head   56.6 Skull  30.7
 4 0000-00000 M        Mass   73.3 Head   56.6
 5 0000-00000 M        Mass   73.3 Mass   73.3
 6 0000-00000 M        Mass   73.3 Skull  30.7
 7 0000-00000 M        Skull  30.7 Head   56.6
 8 0000-00000 M        Skull  30.7 Mass   73.3
 9 0000-00000 M        Skull  30.7 Skull  30.7
10 1142-05901 M        Head   56.4 Head   56.4
# ℹ 1,097 more rows

The join is deceptively simple, so let’s break it down a little bit. It’s a “natural” join since we don’t give it a by = join_by() and it joins on the common columns, which are BirdID and KnownSex. Since each BirdID is unique in the blue_jays table (you can check this with count), we can really just think of the join as being done on BirdID. In both bgx and bgy, each BirdID has three rows, one for each of the three variables head, mass, and skull. The join looks at the key variable, BirdID, and finds rows in both tables that share values of BirdID. It then adds to the output table one row for each combination of the remaining variables in the two tables, which are the var_x, val_x, var_y, and val_y variables. Thus, each BirdID value now has 3 x 3 = 9 rows in the output table.
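
We can check both of these claims with count: each BirdID appears only once in blue_jays, and after the join each bird has nine rows in bg.

blue_jays |> count(BirdID) |> filter(n > 1)  # empty result: no duplicated bird IDs
bg |> count(BirdID) |> head()                # each bird now has 3 x 3 = 9 rows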

With this joined table, we can now plot the grid of plots with facet_grid(rows = vars(var_y), cols = vars(var_x)).

bg |> 
  ggplot(aes(val_x, val_y, color = KnownSex)) + 
  geom_point() + 
  # we set rows = vars(var_y) and cols = vars(var_x) so that the free scales are computed correctly;
  # the variable fixed within a column, var_x, should vary along the x-axis, and likewise for the rows and the y-axis.
  facet_grid(rows = vars(var_y), cols = vars(var_x), scales = "free") +
  labs(x = NULL, y = NULL)

Voilà! We can now see that males are generally larger across all these variables and that the variables all positively correlate with each other. In ecology, this kind of correlation among different morphological measurements is studied in the field of “allometry”.
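
We can back up the visual impression with a quick numerical check (not needed for the plot, just a sanity check): the pairwise correlations among the three measurements should all be positive.

blue_jays |> select(Head, Mass, Skull) |> cor()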

Of course, we’re not the first ones to want to make this kind of plot. There are a few packages that do it automatically, such as GGally, whose ggpairs function adds some nice extras.

library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
blue_jays |> 
  select(KnownSex, Head, Mass, Skull) |> 
  ggpairs(aes(color=KnownSex))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

However, even with standard ggplot, we can get more than halfway to what this fancy package does!

Principal component analysis (PCA) plots

Recall that the blue jay dataset has more than just the three variables above. What if it had 10 more variables? Or 100 more? We couldn’t easily interpret a 100x100 grid of scatter plots to understand how the variables relate to one another or whether there are different patterns for males and females. A very useful statistical tool in these circumstances is principal component analysis (PCA), which is a type of dimensionality reduction. Dimensionality reduction exploits the fact that high-dimensional data often contain many correlations, so there are “effectively” many fewer important dimensions that capture the independent variation in the whole dataset. For example, in the blue jay morphology data, many body measurements are positively correlated. Thus, if we know just body mass, we can make a good guess about what head size or skull size might be. We could even do a linear regression of head size against body mass and of skull size against body mass and then use body mass to predict both of those variables. If we do this kind of linear analysis, adding together the variables weighted by coefficients and choosing the weights so that the weighted sum explains as much variation in the data as possible, we end up with a PCA. A PCA not only gives you one new variable that explains a lot of variation; it gives you as many new variables as you had old variables, where each new variable after the first explains less variation in the data than the one before it.
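
As a small illustration of that regression intuition (this is just to build intuition and is not part of the PCA itself), we can regress head size and skull size on body mass and see how much of their variation mass alone explains:

# regress two of the measurements on body mass
head_fit  = lm(Head ~ Mass, data = blue_jays)
skull_fit = lm(Skull ~ Mass, data = blue_jays)
# fraction of the variation in each measurement explained by mass alone
summary(head_fit)$r.squared
summary(skull_fit)$r.squared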

Let’s see an example of PCA. We can use the prcomp function in R to generate the principal components.

bpca = blue_jays |> select(-BirdID, -KnownSex, -Sex) |> prcomp()

cbind(blue_jays, bpca$x) |> ggplot(aes(x=PC1, y=PC2)) + 
  geom_point(aes(color=KnownSex))

This plot shows all the birds plotted on the first two principal components. You can immediately see that males and females form roughly two different clouds of points, which accords with our prior knowledge that the sexes differ in body morphology. To see how much of the variance each PC explains, we look at the summary.

summary(bpca)
Importance of components:
                          PC1     PC2     PC3     PC4     PC5      PC6
Standard deviation     4.9395 1.51406 0.90287 0.49411 0.30416 0.001672
Proportion of Variance 0.8763 0.08233 0.02928 0.00877 0.00332 0.000000
Cumulative Proportion  0.8763 0.95863 0.98791 0.99668 1.00000 1.000000
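
The “Proportion of Variance” row is just each PC’s variance (the squared standard deviation) divided by the total variance, which we can reproduce by hand:

# variance of each PC divided by the total variance across all PCs
bpca$sdev^2 / sum(bpca$sdev^2)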

The first component explains almost 90% of the variance and the second component only 8%. To see how each PC is composed of the underlying variables, we look at the rotation value:

bpca$rotation
                   PC1         PC2         PC3         PC4         PC5
BillDepth  -0.03961552  0.08989552 -0.06763782 -0.05770882 -0.99118308
BillWidth  -0.03362384  0.08147503  0.01955653 -0.99377559  0.06525873
BillLength -0.11159588  0.61166243 -0.51854443  0.05004609  0.09273888
Head       -0.21992974  0.72883067  0.28050368  0.07582643  0.05100298
Mass       -0.96162735 -0.25741373 -0.09198285  0.01098280  0.02072545
Skull      -0.10839813  0.11702017  0.79938175  0.02654960 -0.04081731
                     PC6
BillDepth  -5.705440e-04
BillWidth  -4.087104e-04
BillLength -5.774270e-01
Head        5.774110e-01
Mass        5.583775e-05
Skull      -5.772123e-01

This shows that the first PC is mostly body mass and that the other variables load with the same sign, i.e., they are positively correlated with body mass (the loadings of a PC can all be multiplied by −1, so what matters is the relative sign). We can also plot how each variable loads on the first two PCs.

as_tibble(bpca$rotation) |> 
  mutate(feature=rownames(bpca$rotation)) |> 
  ggplot(aes(x=PC1, y=PC2)) + 
  geom_point(aes(color=feature), show.legend = FALSE) + 
  geom_text(aes(label=feature), size=3, position=position_jitter(width=0.05,height=0.05))

This just shows that mass dominates PC1, whereas PC2 contrasts the remaining variables with mass, capturing how they vary against mass once PC1 is taken into account.
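
Since each PC score is just a weighted sum of the centered measurements, we can verify that bpca$x is the centered data matrix multiplied by the rotation matrix. This is a quick check that assumes the default prcomp() settings we used above (centering but no scaling).

# PC scores = centered data %*% loading (rotation) matrix
X = blue_jays |> select(-BirdID, -KnownSex, -Sex) |> as.matrix()
scores = scale(X, center = TRUE, scale = FALSE) %*% bpca$rotation
all.equal(scores, bpca$x, check.attributes = FALSE)  # should be TRUE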

Finally, let’s do a PCA for data that don’t have such a clear grouping variable as sex. One example is the gene expression data from the genomic imprinting dataset. Here, we’ll do a PCA using the genes as variables. This entails using pivot_wider to put the Genes as columns (which of course is how the original .xlsx file was laid out, but we’re using the tidy data as practice).

imprint = read_csv("babak-etal-2015_imprinted-mouse_tidy.csv")
Rows: 4125 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Genes, tissue
dbl (1): expression

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
imprint_t = imprint |> 
  pivot_wider(names_from = Genes, values_from = expression) |> 
  na.omit()
ipca = imprint_t |> 
  select(-tissue) |> 
  prcomp()

as_tibble(ipca$x) |> 
  add_column(tissue = imprint_t$tissue, .before = 1) |>
  ggplot(aes(x=PC1, y=PC2)) + 
  geom_point() + 
  geom_text(aes(label=tissue), size=2.5, position=position_jitter(width=2,height=5))

We can see from the PCA that some tissues group together naturally (the brain tissues) and others not as much. This could be the beginning of an investigation into whether there are biological reasons these other tissues group the way they do. We can use summary to see how strong the first few PCs are.

summary(ipca)
Importance of components:
                           PC1     PC2      PC3     PC4      PC5      PC6
Standard deviation     37.5944 28.5723 20.49342 19.5804 18.51889 18.08067
Proportion of Variance  0.2138  0.1235  0.06353  0.0580  0.05188  0.04945
Cumulative Proportion   0.2138  0.3373  0.40084  0.4588  0.51072  0.56018
                            PC7      PC8      PC9     PC10     PC11     PC12
Standard deviation     17.12899 15.80112 15.58329 15.01881 14.04080 13.21549
Proportion of Variance  0.04439  0.03777  0.03674  0.03412  0.02982  0.02642
Cumulative Proportion   0.60456  0.64233  0.67907  0.71319  0.74302  0.76944
                          PC13     PC14     PC15     PC16     PC17     PC18
Standard deviation     12.9327 12.63925 11.55027 11.43424 11.08405 10.96401
Proportion of Variance  0.0253  0.02417  0.02018  0.01978  0.01859  0.01819
Cumulative Proportion   0.7947  0.81890  0.83909  0.85887  0.87745  0.89564
                           PC19    PC20    PC21    PC22    PC23    PC24    PC25
Standard deviation     10.40468 9.55383 9.33377 8.84628 8.18059 7.77386 7.58014
Proportion of Variance  0.01638 0.01381 0.01318 0.01184 0.01012 0.00914 0.00869
Cumulative Proportion   0.91201 0.92582 0.93900 0.95084 0.96096 0.97010 0.97880
                          PC26    PC27    PC28     PC29
Standard deviation     7.03762 6.83679 6.62491 1.67e-14
Proportion of Variance 0.00749 0.00707 0.00664 0.00e+00
Cumulative Proportion  0.98629 0.99336 1.00000 1.00e+00

The first PC captures only 21% of the variance and the second captures 12%. Thus, we need more of the PCs in this dataset to describe a substantial amount of the variance. Finally, we can plot how the genes make up the first two PCs. We could then use this information to see whether the biological function of these genes explains how the tissues group along PC1 and PC2.

as_tibble(ipca$rotation) |> 
  add_column(gene = rownames(ipca$rotation), .before = 1) |>
  ggplot(aes(x=PC1, y=PC2)) + 
  geom_point() + 
  geom_text(aes(label=gene), size=2.5, position=position_jitter(width=0.015,height=0.03))

Non-linear dimensionality reduction

PCA is a “linear” dimensionality reduction since each component is just a weighted sum of the original variables. Other dimensionality reduction methods combine the variables in a non-linear way and have become popular in molecular biology as a way of condensing gene expression data. Two particularly common ones are UMAP and t-SNE. Here is an example of PCA vs t-SNE:

Fig 2 from Kobak and Berens 20191

These have some advantages in terms of their visual appeal, but they should be used carefully2 and can produce spurious results if used incorrectly3.
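
As a rough sketch of what running t-SNE looks like in R, here is one way to embed the imprinting tissues using the Rtsne package (not used elsewhere in this lesson, so treat it as an illustration; you would need to install Rtsne first). t-SNE is stochastic, so we set a seed, and its perplexity must be smaller than (n − 1) / 3, which forces a small value with only ~29 tissues.

library(Rtsne)

set.seed(42)  # t-SNE is stochastic; fix the seed for reproducibility
itsne = imprint_t |> 
  select(-tissue) |> 
  as.matrix() |> 
  Rtsne(perplexity = 5, check_duplicates = FALSE)

tibble(tSNE1 = itsne$Y[, 1], tSNE2 = itsne$Y[, 2], tissue = imprint_t$tissue) |> 
  ggplot(aes(x = tSNE1, y = tSNE2)) + 
  geom_point() + 
  geom_text(aes(label = tissue), size = 2.5, vjust = -1)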

Lab

Problems

  1. Using the gapminder data:

    • Perform a principal components analysis using the variables lifeExp, year, pop, and gdpPercap.
    • Plot all the points on a scatter plot with the x- and y-axis corresponding to the first two principal components (PC1 and PC2).
    • Color the points according to year. Set the shape of the point (square, circle, etc) according to continent.
    • What patterns do you observe if any?
  2. For this question, we will use genetic data from a few European populations gathered as part of the CEPH-Human Genome Diversity Panel. A little more information on the data can be found here: https://github.com/NovembreLab/HGDP_PopStruct_Exercise/blob/master/PopGenWorkshop.pdf. The file can be loaded using the code below. The ID column is an identifying string for the individual, the Group column gives the individual’s geographic ancestry, and the remaining columns are SNPs.

    hgdp = read_csv("H938_Euro.LDprune.csv")
    Rows: 156 Columns: 61431
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    chr     (2): ID, Group
    dbl (61429): rs3094315, rs12562034, rs9442372, rs6687776, rs12145826, rs7553...
    
    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    • Perform a PCA on the SNPs.
    • Plot all the individuals using the first two principal components.
    • Color the individuals according to Group.
    • Which PC corresponds to a more north-south axis and which one to a more east-west axis?
  3. The scatter plot grid we created for the blue jay data had many extraneous plots (the diagonal plots and two copies of each plot, x vs y and then y vs x). There were only three scatter plots that we really needed from that 3x3 grid. Using facet_wrap, we can plot only these three plots.

    • Begin with the bg data table that we used that contains the var_x, var_y, val_x, and val_y columns.
    • Create one new column that contains the x,y variable combo for each scatter plot.
    • Use filter to get only the rows that correspond to the three variable combinations you need.
    • Create scatter plots using facet_wrap on the new column you created.
    • +2 points: use labeller in facet_wrap to add the correct x and y axis labels for each plot.

Footnotes

  1. Kobak, D., and P. Berens. 2019. The art of using t-SNE for single-cell transcriptomics. Nature Communications 10:5416.↩︎

  2. Kobak, D., and P. Berens. 2019. The art of using t-SNE for single-cell transcriptomics. Nature Communications 10:5416.↩︎

  3. Chari, T., and L. Pachter. 2023. The specious art of single-cell genomics. PLOS Computational Biology 19:e1011288.↩︎