library(tidyverse)
theme_set(theme_classic())
Visualizing lots of data
Outline for today
- Grids of plots
- Principal component analysis (PCA) plots
Reminder
- Schedule for lightning talks
Plotting data with many variables
In many of the datasets that we’ve seen so far, we’ve already been faced with the problem that the data could be visualized in many different ways because they have many different variables. Some datasets were time series and it was clear that scatter plots or line plots were useful with time on the x-axis and multiple variables on the y-axis. For example, these global COVID-19 data from Our World in Data (https://github.com/owid/covid-19-data/tree/master/public/data) for the year 2021
= read_csv("https://catalog.ourworldindata.org/garden/covid/latest/compact/compact.csv") owid
Rows: 453386 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, code, continent
dbl (57): total_cases, new_cases, new_cases_smoothed, total_cases_per_milli...
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
owid
# A tibble: 453,386 × 61
country date total_cases new_cases new_cases_smoothed
<chr> <date> <dbl> <dbl> <dbl>
1 Afghanistan 2020-01-01 NA NA NA
2 Afghanistan 2020-01-02 NA NA NA
3 Afghanistan 2020-01-03 NA NA NA
4 Afghanistan 2020-01-04 NA NA NA
5 Afghanistan 2020-01-05 0 0 NA
6 Afghanistan 2020-01-06 0 0 NA
7 Afghanistan 2020-01-07 0 0 NA
8 Afghanistan 2020-01-08 0 0 NA
9 Afghanistan 2020-01-09 0 0 NA
10 Afghanistan 2020-01-10 0 0 0
# ℹ 453,376 more rows
# ℹ 56 more variables: total_cases_per_million <dbl>,
# new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
# total_deaths <dbl>, new_deaths <dbl>, new_deaths_smoothed <dbl>,
# total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
# new_deaths_smoothed_per_million <dbl>, excess_mortality <dbl>,
# excess_mortality_cumulative <dbl>, …
|>
owid filter(country %in% c("United States", "Brazil", "India", "United Kingdom", "Sweden")) |>
filter(year(date) == 2021, !is.na(new_deaths_smoothed_per_million)) |>
ggplot(aes(x=date, y=new_deaths_smoothed_per_million)) +
geom_line(aes(color=country)) +
labs(x = "Date", y = "Deaths per million persons (smoothed)", color = "Country")
are presented nicely as a nice time series since the time course of the pandemic is a crucial feature that we might be interested in studying.
However, our plots may have many variables and no single one of them, like time, stands out as one against which we would like to compare the others. In this case, how the variables correlate with one another may be important. For example, let’s load a dataset from Keith Tarvin of Oberlin College on morphological measurements of blue jays.
load("blue_jays.rda")
glimpse(blue_jays)
Rows: 123
Columns: 9
$ BirdID <fct> 0000-00000, 1142-05901, 1142-05905, 1142-05907, 1142-05909,…
$ KnownSex <fct> M, M, M, F, M, F, M, M, F, F, M, F, M, F, M, F, F, M, M, M,…
$ BillDepth <dbl> 8.26, 8.54, 8.39, 7.78, 8.71, 7.28, 8.74, 8.72, 8.20, 7.67,…
$ BillWidth <dbl> 9.21, 8.76, 8.78, 9.30, 9.84, 9.30, 9.28, 9.94, 9.01, 9.31,…
$ BillLength <dbl> 25.92, 24.99, 26.07, 23.48, 25.47, 22.25, 25.35, 30.00, 22.…
$ Head <dbl> 56.58, 56.36, 57.32, 53.77, 57.32, 52.25, 57.12, 60.67, 52.…
$ Mass <dbl> 73.30, 75.10, 70.25, 65.50, 74.90, 63.90, 75.10, 78.10, 64.…
$ Skull <dbl> 30.66, 31.38, 31.25, 30.29, 31.85, 30.00, 31.77, 30.67, 30.…
$ Sex <int> 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,…
There are 123 individuals in this dataset and the variables include bill depth, width, length, head size, mass, sex, etc. If we wanted to see how head size and body mass correlate, we could simply do a scatter plot.
|> ggplot(aes(x=Mass, y=Head)) + geom_point() blue_jays
It looks like head size and mass correlate positively, which makes sense. We know there are males and females, so we can add another variable by coloring the points by sex.
|> ggplot(aes(x=Mass, y=Head)) + geom_point(aes(color=KnownSex)) blue_jays
Grids of plots
Ok, so male birds are larger and heavier in this population. But there are a bunch of other variables too; how do they correlate with mass and head size and with each other? How do they differ with sex? In other words, we would like to make a scatter for each pair of variables in the dataset. To do this, we can use the facet
function like we did earlier in the semester but now in a more sophisticated way. First, a little reminder about how the facet
function works in ggplot
. Let’s look back at the time that we used facet_wrap
in Week 8.
library(gapminder)
= filter(gapminder,
usukjpcn == "United States"
country | country == "United Kingdom"
| country == "Japan"
| country == "China" )
ggplot(data = usukjpcn) +
geom_line(mapping = aes(x = year, y = lifeExp, color = country)) +
facet_wrap(~ country)
We first wrote ggplot
commands like we wanted to plot year
vs lifeExp
. Then, we used the facet_wrap
function and told it we wanted a plot for each distinct value of the country
variable. The output was then four plots, one for each country, with the same x- and y-axis. This kind of setup actually only plots three variables, year
, lifeExp
, and country
, and does not do a scatter plot for each pair of variable. However, with a little data wrangling, we can create a new data table that will work with facet_grid
.
Looking at the example above, we can surmise that facet_wrap
and facet_grid
generate multiple plots with the same x- and y-axis. Thus, if we would a grid of plots where each plot has a different combination of variables for the x- and y-axis, then we need to collapse our data for the different morphological measurements into two sets of two columns, one set for the variables that will go on the x-axis and one set for the variables that will go on the y-axis. Recall from our lecture on tidying data that we can do this with pivot_longer
.
= blue_jays |> select(BirdID, KnownSex, Head, Mass, Skull) |>
bgx pivot_longer(Head:Skull, names_to = "var_x", values_to = "val_x")
= blue_jays |> select(BirdID, KnownSex, Head, Mass, Skull) |>
bgy pivot_longer(Head:Skull, names_to = "var_y", values_to = "val_y")
bgx
# A tibble: 369 × 4
BirdID KnownSex var_x val_x
<fct> <fct> <chr> <dbl>
1 0000-00000 M Head 56.6
2 0000-00000 M Mass 73.3
3 0000-00000 M Skull 30.7
4 1142-05901 M Head 56.4
5 1142-05901 M Mass 75.1
6 1142-05901 M Skull 31.4
7 1142-05905 M Head 57.3
8 1142-05905 M Mass 70.2
9 1142-05905 M Skull 31.2
10 1142-05907 F Head 53.8
# ℹ 359 more rows
bgy
# A tibble: 369 × 4
BirdID KnownSex var_y val_y
<fct> <fct> <chr> <dbl>
1 0000-00000 M Head 56.6
2 0000-00000 M Mass 73.3
3 0000-00000 M Skull 30.7
4 1142-05901 M Head 56.4
5 1142-05901 M Mass 75.1
6 1142-05901 M Skull 31.4
7 1142-05905 M Head 57.3
8 1142-05905 M Mass 70.2
9 1142-05905 M Skull 31.2
10 1142-05907 F Head 53.8
# ℹ 359 more rows
In the table above, we’ve selected only four variables, sex, head size, mass, and skull size. In order to create scatter plots for head size, mass, and skull size, we gathered them together where var_x
tells us which variable we have and val_x
the value of that variable. Likewise, we created an identical table with var_y
and val_y` since we’re going to plot each variable on the x- and y-axis. Next, we just join the two tables together.
= inner_join(bgx, bgy) # we use `inner_join` but `full`, `left`, or `right` will do the same thing bg
Joining with `by = join_by(BirdID, KnownSex)`
Warning in inner_join(bgx, bgy): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
bg
# A tibble: 1,107 × 6
BirdID KnownSex var_x val_x var_y val_y
<fct> <fct> <chr> <dbl> <chr> <dbl>
1 0000-00000 M Head 56.6 Head 56.6
2 0000-00000 M Head 56.6 Mass 73.3
3 0000-00000 M Head 56.6 Skull 30.7
4 0000-00000 M Mass 73.3 Head 56.6
5 0000-00000 M Mass 73.3 Mass 73.3
6 0000-00000 M Mass 73.3 Skull 30.7
7 0000-00000 M Skull 30.7 Head 56.6
8 0000-00000 M Skull 30.7 Mass 73.3
9 0000-00000 M Skull 30.7 Skull 30.7
10 1142-05901 M Head 56.4 Head 56.4
# ℹ 1,097 more rows
The join is deceptively simply so let’s break it down a little bit. Its a “natural” join since we don’t give it a by = join_by
and it selects the common columns, which are BirdID
and KnownSex
. Since each BirdID
is unique in the blue_jays
table (you can check this with count
), we can really just think the join is done on BirdID
. In both bgx
and bgy
, each BirdID
has has three rows, one for each of the three variables head, mass, and skull. The join function looks at the key variable, BirdID
and finds rows in both tables that share values of BirdID
. It then adds to the output table one row for each unique combination of the remaining variables in the two tables, which are the var_x
, val_x
, var_y
, and val_y
variables. Thus, each BirdID
value now has 3 x 3 = 9 rows in the output table.
With this joined table, we now can plot the grid of plots with facet_grid(vars(var_y), vars(var_x))
|>
bg ggplot(aes(val_x, val_y, color = KnownSex)) +
geom_point() +
# we put rows = rows = vars(var_y) and cols = vars(var_x) so that it calculates the scales correctly;
# the fixed variable in a column, var_x, should vary on the x-axis and likewise for the rows and the y-axis.
facet_grid(rows = vars(var_y), cols = vars(var_x), scales = "free") +
labs(x = NULL, y = NULL)
Viola! We can see now that males are generally larger across all these variables, and that they all positively correlate with each other. In ecology, these kind of correlations with different morphological measurements is studied in the field of “allometry”.
Of course, we’re not the first ones to want to do this kind of plot. There are a few packages which perform this kind of plot automatically, such as GGally
whose function ggpairs
adds some nice things.
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
|>
blue_jays select(KnownSex, Head, Mass, Skull) |>
ggpairs(aes(color=KnownSex))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
However, even with standard ggplot
, we can still get more than 1/2 to what this fancy package does!
Principal component analysis (PCA) plots
Recall that the blue jay dataset has more than just the three variables above. What if it had 10 more variables? Or 100 more? We couldn’t easily interpret a 100x100 grid of scatter plots to understand how the variables relate to one another or whether there are different patterns for males and females. A very useful statistical tools used in these circumstances is called a principal component analysis, which is a type of dimensionality reduction. Dimensionality reduction uses the fact that there are often lots of correlations in high-dimensional data and thus there are “effectively” many fewer important dimensions that capture the independent variation of the whole dataset. For example, in the blue jay morphology data, we can see that many body measures are positively correlated. Thus, if we know just body mass, we can make a good guess about what head size or skull size might be. We could even do a linear regression of head size against body mass and skull size against body mass and then use body mass to predict both of those variables. If we do this kind of linear analysis, adding together variables and weighting them with coefficient, and make sure the weight sum of variables explains as much variation in the data as possible, we end up with a PCA. A PCA not only gives you a new variable that explains a lot of variation, it also gives you as many new variables as you had old variables where each new one after the first explains less variation in the data than the new variable before it.
Let’s see an example of PCA. We can use the prcomp
function in R to generate the principal components.
= blue_jays |> select(-BirdID, -KnownSex, -Sex) |> prcomp()
bpca
cbind(blue_jays, bpca$x) |> ggplot(aes(x=PC1, y=PC2)) +
geom_point(aes(color=KnownSex))
This plot shows all the bird samples plotted on the two first principal components. You can immediately see that males and females are roughly two different clouds of points, which accords with our prior knowledge that the sexes differ in body morphology. To see how much of the variance each PC explains, we look at the summary
.
summary(bpca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 4.9395 1.51406 0.90287 0.49411 0.30416 0.001672
Proportion of Variance 0.8763 0.08233 0.02928 0.00877 0.00332 0.000000
Cumulative Proportion 0.8763 0.95863 0.98791 0.99668 1.00000 1.000000
The first component explains almost 90% of the variance and the second component only 8%. To see how each PC is composed of the underlying variables, we look at the rotation
value:
$rotation bpca
PC1 PC2 PC3 PC4 PC5
BillDepth -0.03961552 0.08989552 -0.06763782 -0.05770882 -0.99118308
BillWidth -0.03362384 0.08147503 0.01955653 -0.99377559 0.06525873
BillLength -0.11159588 0.61166243 -0.51854443 0.05004609 0.09273888
Head -0.21992974 0.72883067 0.28050368 0.07582643 0.05100298
Mass -0.96162735 -0.25741373 -0.09198285 0.01098280 0.02072545
Skull -0.10839813 0.11702017 0.79938175 0.02654960 -0.04081731
PC6
BillDepth -5.705440e-04
BillWidth -4.087104e-04
BillLength -5.774270e-01
Head 5.774110e-01
Mass 5.583775e-05
Skull -5.772123e-01
This shows that the first PC is mostly body mass and that other variables are also positively correlated with body mass (PCs can be all multiplied by a negative, so what matters is the relative sign). We can also plot what the variables are as a function of the first two PCs.
as_tibble(bpca$rotation) |>
mutate(feature=rownames(bpca$rotation)) |>
ggplot(aes(x=PC1, y=PC2)) +
geom_point(aes(color=feature), show.legend = FALSE) +
geom_text(aes(label=feature), size=3, position=position_jitter(width=0.05,height=0.05))
This just shows that mass dominates PC1 whereas PC2 measures how the rest of the variables all negatively correlate with mass once you take into account PC1.
Finally, let’s do a PCA for data that doesn’t have such a clear group as sex. One example are the gene expression data from the genomic imprinting dataset. Here, we’ll do a PCA using the Genes
as variables. This entails using pivot_wider
to put the Genes
as variables or columns (which of course is how the original .xlsx
file was, but we’re just using the tidy data as practice)
= read_csv("babak-etal-2015_imprinted-mouse_tidy.csv") imprint
Rows: 4125 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Genes, tissue
dbl (1): expression
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
= imprint |>
imprint_t pivot_wider(names_from = Genes, values_from = expression) |>
na.omit()
= imprint_t |>
ipca select(-tissue) |>
prcomp()
as_tibble(ipca$x) |>
add_column(tissue = imprint_t$tissue, .before = 1) |>
ggplot(aes(x=PC1, y=PC2)) +
geom_point() +
geom_text(aes(label=tissue), size=2.5, position=position_jitter(width=2,height=5))
We can see from the PCA that some tissues group together naturally (brain tissues) and others not as much. This could be the beginning of an investigation about whether there are biological reasons these other tissues groups the way they do. We can use summary to see how strong the first few PCs are.
summary(ipca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 37.5944 28.5723 20.49342 19.5804 18.51889 18.08067
Proportion of Variance 0.2138 0.1235 0.06353 0.0580 0.05188 0.04945
Cumulative Proportion 0.2138 0.3373 0.40084 0.4588 0.51072 0.56018
PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 17.12899 15.80112 15.58329 15.01881 14.04080 13.21549
Proportion of Variance 0.04439 0.03777 0.03674 0.03412 0.02982 0.02642
Cumulative Proportion 0.60456 0.64233 0.67907 0.71319 0.74302 0.76944
PC13 PC14 PC15 PC16 PC17 PC18
Standard deviation 12.9327 12.63925 11.55027 11.43424 11.08405 10.96401
Proportion of Variance 0.0253 0.02417 0.02018 0.01978 0.01859 0.01819
Cumulative Proportion 0.7947 0.81890 0.83909 0.85887 0.87745 0.89564
PC19 PC20 PC21 PC22 PC23 PC24 PC25
Standard deviation 10.40468 9.55383 9.33377 8.84628 8.18059 7.77386 7.58014
Proportion of Variance 0.01638 0.01381 0.01318 0.01184 0.01012 0.00914 0.00869
Cumulative Proportion 0.91201 0.92582 0.93900 0.95084 0.96096 0.97010 0.97880
PC26 PC27 PC28 PC29
Standard deviation 7.03762 6.83679 6.62491 1.67e-14
Proportion of Variance 0.00749 0.00707 0.00664 0.00e+00
Cumulative Proportion 0.98629 0.99336 1.00000 1.00e+00
The first PC only captures 21% of the variance and the second captures 12%. Thus, we need more the PCs in this dataset to describe a significant amount of the variance. Finally, we can plot how the genes make up the first two PCs. Then, we could use this information to see if the biological function of these genes explains how the tissues group along PC1 and PC2.
as_tibble(ipca$rotation) |>
add_column(gene = rownames(ipca$rotation), .before = 1) |>
ggplot(aes(x=PC1, y=PC2)) +
geom_point() +
geom_text(aes(label=gene), size=2.5, position=position_jitter(width=0.015,height=0.03))
Non-linear dimensionality reduction
PCA is a “linear” dimensionality reduction since it is just a weighted sum of the original variables. Other dimensionality reduction methods combine the variables together in a non-linear way and have become popular in molecular biology as a way of condensing gene expression data. Two particularly common ones are UMAP and t-SNE. Here is an example of PCA vs t-SNE:
These have some advantages in terms of their visual appeal, but they should be used carefully2 and can produce spurious results if used incorrectly3.
Lab
Problems
Using the
gapminder
data:- Perform a principal components analysis using the variables
lifeExp
,year
,pop
, andgdpPercap
. - Plot all the points on a scatter plot with the the x- and y-axis corresponding to the first two principal components (PC1 and PC2).
- Color the points according to
year
. Set the shape of the point (square, circle, etc) according tocontinent
. - What patterns do you observe if any?
- Perform a principal components analysis using the variables
For this question, we will use genetic data from a few European populations gathered as part of the CEPH-Human Genome Diversity Panel. A little more information on the data can be found here: https://github.com/NovembreLab/HGDP_PopStruct_Exercise/blob/master/PopGenWorkshop.pdf. The file can be loaded using the code below. The
ID
column is an identifying string for the individual, theGroup
column gives the individual’s geographic ancestry, and the remaining columns are SNPs.= read_csv("H938_Euro.LDprune.csv") hgdp
Rows: 156 Columns: 61431 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (2): ID, Group dbl (61429): rs3094315, rs12562034, rs9442372, rs6687776, rs12145826, rs7553... ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- Perform a PCA on the SNPs.
- Plot all the individuals using the first two principle components.
- Color the individuals according to
Group
. - Which PC corresponds to a more north-south axis and which one to a more east-west axis?
The scatter plot grid we created for the blue jay data had many extraneous plots (the diagonal plots and two copies of each plot, x vs y and then y vs x). There were only three scatter plots that we really needed from that 3x3 grid. Using
facet_wrap
, we can plot only these three plots.- Begin with the
bg
data table that we used that contains thevar_x
,var_y
,val_x
, andval_y
columns. - Create one new column that contains the x,y variable combo for each scatter plot.
- Use
filter
to get only the rows that correspond to the three variable combinations you need. - Create scatter plots using
facet_wrap
on the new column you created. - +2 points: use
labeller
infacet_wrap
to add the correct x and y axis labels for each plot.
- Begin with the
Footnotes
Kobak, D., and P. Berens. 2019. The art of using t-SNE for single-cell transcriptomics. Nature Communications 10:5416.↩︎
Kobak, D., and P. Berens. 2019. The art of using t-SNE for single-cell transcriptomics. Nature Communications 10:5416.↩︎
Chari, T., and L. Pachter. 2023. The specious art of single-cell genomics. PLOS Computational Biology 19:e1011288.↩︎