library(tidyverse)
theme_set(theme_classic())
= read_csv("https://github.com/vancleve/BIO621-DWVR/raw/main/owid-covid-data_2021.csv") owid
Visualizing lots of data
Outline for today
- Grids of plots
- Principal component analysis (PCA) plots
Reminder
- Schedule for lightning talks
Plotting data with many variables
In many of the datasets that we’ve seen so far, we’ve already been faced with the problem that the data could be visualized in many different ways because they have many different variables. Some datasets were time series and it was clear that scatter plots or line plots were useful with time on the x-axis and multiple variables on the y-axis. For example, these global COVID-19 data from Our World in Data (https://github.com/owid/covid-19-data/tree/master/public/data) for the year 2021
owid
# A tibble: 83,697 × 67
iso_code continent locat…¹ date total…² new_c…³ new_c…⁴ total…⁵ new_d…⁶
<chr> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AFG Asia Afghan… 2021-01-01 52513 183 131. 2201 12
2 AFG Asia Afghan… 2021-01-02 52586 73 117. 2211 10
3 AFG Asia Afghan… 2021-01-03 52709 123 123 2221 10
4 AFG Asia Afghan… 2021-01-04 52909 200 129. 2230 9
5 AFG Asia Afghan… 2021-01-05 53011 102 123. 2237 7
6 AFG Asia Afghan… 2021-01-06 53105 94 111. 2244 7
7 AFG Asia Afghan… 2021-01-07 53207 102 125. 2253 9
8 AFG Asia Afghan… 2021-01-08 53332 125 117 2257 4
9 AFG Asia Afghan… 2021-01-09 53400 68 116. 2264 7
10 AFG Asia Afghan… 2021-01-10 53489 89 111. 2277 13
# … with 83,687 more rows, 58 more variables: new_deaths_smoothed <dbl>,
# total_cases_per_million <dbl>, new_cases_per_million <dbl>,
# new_cases_smoothed_per_million <dbl>, total_deaths_per_million <dbl>,
# new_deaths_per_million <dbl>, new_deaths_smoothed_per_million <dbl>,
# reproduction_rate <dbl>, icu_patients <dbl>,
# icu_patients_per_million <dbl>, hosp_patients <dbl>,
# hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, …
%>%
owid filter(location %in% c("United States", "Brazil", "India", "United Kingdom", "Sweden")) %>%
ggplot(aes(x=date, y=new_deaths_smoothed_per_million)) +
geom_line(aes(color=location)) +
labs(x = "Date", y = "Deaths per million persons (smoothed)", color = "State")
are presented nicely as a nice time series since the time course of the pandemic is a crucial feature that we might be interested in studying.
However, our plots may have many variables and no single one of them, like time, stands out as one against which we would like to compare the others. In this case, how the variables correlate with one another may be important. For example, let’s load a dataset from Keith Tarvin of Oberlin College on morphological measurements of blue jays.
load("blue_jays.rda")
glimpse(blue_jays)
Rows: 123
Columns: 9
$ BirdID <fct> 0000-00000, 1142-05901, 1142-05905, 1142-05907, 1142-05909,…
$ KnownSex <fct> M, M, M, F, M, F, M, M, F, F, M, F, M, F, M, F, F, M, M, M,…
$ BillDepth <dbl> 8.26, 8.54, 8.39, 7.78, 8.71, 7.28, 8.74, 8.72, 8.20, 7.67,…
$ BillWidth <dbl> 9.21, 8.76, 8.78, 9.30, 9.84, 9.30, 9.28, 9.94, 9.01, 9.31,…
$ BillLength <dbl> 25.92, 24.99, 26.07, 23.48, 25.47, 22.25, 25.35, 30.00, 22.…
$ Head <dbl> 56.58, 56.36, 57.32, 53.77, 57.32, 52.25, 57.12, 60.67, 52.…
$ Mass <dbl> 73.30, 75.10, 70.25, 65.50, 74.90, 63.90, 75.10, 78.10, 64.…
$ Skull <dbl> 30.66, 31.38, 31.25, 30.29, 31.85, 30.00, 31.77, 30.67, 30.…
$ Sex <int> 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,…
There are 123 individuals in this dataset and the variables include bill depth, width, length, head size, mass, sex, etc. If we wanted to see how head size and body mass correlate, we could simply do a scatter plot.
%>% ggplot(aes(x=Mass, y=Head)) + geom_point() blue_jays
It looks like head size and mass correlate positively, which makes sense. We know there are males and females, so we can add another variable by coloring the points by sex.
%>% ggplot(aes(x=Mass, y=Head)) + geom_point(aes(color=KnownSex)) blue_jays
Grids of plots
Ok, so male birds are larger and heavier in this population. But there are a bunch of other variables too; how do they correlate with mass and head size and with each other? How do they differ with sex? In other words, we would like to make a scatter for each pair of variables in the dataset. To do this, we can use the facet
function like we did earlier in the semester but now in a more sophisticated way. First, a little reminder about how the facet
function works in ggplot
. Let’s look back at the time that we used facet_wrap
in Week 6.
library(gapminder)
= filter(gapminder,
usukjpcn == "United States"
country | country == "United Kingdom"
| country == "Japan"
| country == "China" )
ggplot(data = usukjpcn) +
geom_line(mapping = aes(x = year, y = lifeExp, color = country)) +
facet_wrap(~ country)
We first wrote ggplot
commands like we wanted to plot year
vs lifeExp
. Then, we used the facet_wrap
function and told it we wanted a plot for each distinct value of the country
variable. The output was then four plots, one for each country, with the same x- and y-axis. This kind of setup actually only plots three variables, year
, lifeExp
, and country
, and does not do a scatter plot for each pair of variable. However, with a little data wrangling, we can create a new data table that will work with facet_grid
.
Looking at the example above, facet_wrap
(and facet_grid
) generate multiple plots with the same x- and y-axis so we need to collapse our data for the different morphological measurements into two columns. Recall from our lecture on tidying data that we can do this with pivot_longer
.
= blue_jays %>% select(BirdID, KnownSex, Head, Mass, Skull) %>%
bgx pivot_longer(Head:Skull, names_to = "var_x", values_to = "val_x")
= blue_jays %>% select(BirdID, KnownSex, Head, Mass, Skull) %>%
bgy pivot_longer(Head:Skull, names_to = "var_y", values_to = "val_y")
In the table above, we’ve selected only four variables, sex, head size, mass, and skull size. In order to create scatter plots head size, mass, and skull size, we gathered them together where var_x
tells us which variable we have and val_x
the value of that variable. Likewise, we created an identical table with var_y
and val_y` since we’re going to plot each variable on the x- and y-axis. Next, we just join the two tables together.
= left_join(bgx, bgy) bg
Joining, by = c("BirdID", "KnownSex")
Now, we can plot them with facet_grid(var_y ~ var_x)
.
%>%
bg ggplot(aes(val_x, val_y, color = KnownSex)) +
geom_point() +
facet_grid(var_y ~ var_x, scales = "free") +
labs(x = NULL, y = NULL)
Viola! We can see now that males are generally larger across all these variables, and that they all positively correlate with each other. In ecology, these kind of correlations with different morphological measurements is studied in the field of “allometry”.
Of course, we’re not the first ones to want to do this kind of plot. There are a few packages which perform this kind of plot automatically, such as GGally
whose function ggpairs
adds some nice things.
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
%>%
blue_jays select(KnownSex, Head, Mass, Skull) %>%
ggpairs(aes(color=KnownSex))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
However, even with standard ggplot
, we can still get more than 1/2 to what this fancy package does!
Principal component analysis (PCA) plots
Recall that the blue jay dataset has more than just the three variables above. What if it had 10 more variables? Or 100 more? We couldn’t easily interpret a 100x100 grid of scatter plots to understand how the variables relate to one another or whether there are different patterns for males and females. A very useful statistical tools used in these circumstances is called a principal component analysis, which is a type of dimensionality reduction. Dimensionality reduction uses the fact that there are often lots of correlations in high-dimensional data and thus there are “effectively” many fewer important dimensions that capture the independent variation of the whole dataset. For example, in the blue jay morphology data, we can see that many body measures are positively correlated. Thus, if we know just body mass, we can make a good guess about what head size or skull size might be. We could even do a linear regression of head size against body mass and skull size against body mass and then use body mass to predict both of those variables. If we do this kind of linear analysis, adding together variables and weighting them with coefficient, and make sure the weight sum of variables explains as much variation in the data as possible, we end up with a PCA. A PCA not only gives you a new variable that explains a lot of variation, it also gives you as many new variables as you had old variables where each new one after the first explains less variation in the data than the new variable before it.
Let’s see an example of PCA. We can use the prcomp
function is R to generate the principal components.
= blue_jays %>% select(-BirdID, -KnownSex, -Sex) %>% prcomp()
bpca
cbind(blue_jays, bpca$x) %>% ggplot(aes(x=PC1, y=PC2)) +
geom_point(aes(color=KnownSex))
This plot shows all the bird samples plotted on the two first principal components. You can immediately see that males and females are roughly two different clouds of points, which accords with our prior knowledge that the sexes differ in body morphology. To see how much of the variance each PC explains, we look at the summary
.
summary(bpca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 4.9395 1.51406 0.90287 0.49411 0.30416 0.001672
Proportion of Variance 0.8763 0.08233 0.02928 0.00877 0.00332 0.000000
Cumulative Proportion 0.8763 0.95863 0.98791 0.99668 1.00000 1.000000
The first component explains almost 90% of the variance and the second component only 8%. To see
This gives the percent variance explained by each of the principal components. To see how each PC is composed of the underlying variables, we look at the rotation
value:
$rotation bpca
PC1 PC2 PC3 PC4 PC5
BillDepth -0.03961552 0.08989552 -0.06763782 -0.05770882 -0.99118308
BillWidth -0.03362384 0.08147503 0.01955653 -0.99377559 0.06525873
BillLength -0.11159588 0.61166243 -0.51854443 0.05004609 0.09273888
Head -0.21992974 0.72883067 0.28050368 0.07582643 0.05100298
Mass -0.96162735 -0.25741373 -0.09198285 0.01098280 0.02072545
Skull -0.10839813 0.11702017 0.79938175 0.02654960 -0.04081731
PC6
BillDepth -5.705440e-04
BillWidth -4.087104e-04
BillLength -5.774270e-01
Head 5.774110e-01
Mass 5.583775e-05
Skull -5.772123e-01
This shows that the first PC is mostly body mass and that other variables are also positively correlated with body mass (PCs can be all multiplied by a negative, so what matters is the relative sign). We can also plot what the variables are as a function of the first two PCs.
as.data.frame(bpca$rotation) %>%
mutate(feature=rownames(bpca$rotation)) %>%
ggplot(aes(x=PC1, y=PC2)) +
geom_point(aes(color=feature), show.legend = FALSE) +
geom_text(aes(label=feature), size=3, position=position_jitter(width=0.05,height=0.05))
This just shows that mass dominates PC1 whereas PC2 measures how the rest of the variables all negatively correlate with mass once you take into account PC1.
Finally, let’s do a PCA for data that doesn’t have such a clear group as sex. One example are the gene expression data from the genomic imprinting dataset. Here, we’ll do a PCA using the Genes
as variables. This entails using pivot_wider
to put the Genes
as variables or columns (which of course is how the original .xlsx
file was, but we’re just using the tidy data as practice)
= read_csv("babak-etal-2015_imprinted-mouse_tidy.csv") imprint
Rows: 4125 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Genes, tissue
dbl (1): expression
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
= imprint %>% pivot_wider(names_from = Genes, values_from = expression) %>% na.omit()
imprint_t = imprint_t %>% select(-tissue) %>% prcomp()
ipca
data.frame(tissue=imprint_t$tissue, ipca$x) %>% ggplot(aes(x=PC1, y=PC2)) +
geom_point() +
geom_text(aes(label=tissue), size=2.5, position=position_jitter(width=2,height=5))
We can see from the PCA that some tissues group together naturally (brain tissues) and others not as much. This could be the beginning of an investigation about whether there are biological reasons these other tissues groups the way they do. We can use summary to see how strong the first few PCs are.
summary(ipca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 37.5944 28.5723 20.49342 19.5804 18.51889 18.08067
Proportion of Variance 0.2138 0.1235 0.06353 0.0580 0.05188 0.04945
Cumulative Proportion 0.2138 0.3373 0.40084 0.4588 0.51072 0.56018
PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 17.12899 15.80112 15.58329 15.01881 14.04080 13.21549
Proportion of Variance 0.04439 0.03777 0.03674 0.03412 0.02982 0.02642
Cumulative Proportion 0.60456 0.64233 0.67907 0.71319 0.74302 0.76944
PC13 PC14 PC15 PC16 PC17 PC18
Standard deviation 12.9327 12.63925 11.55027 11.43424 11.08405 10.96401
Proportion of Variance 0.0253 0.02417 0.02018 0.01978 0.01859 0.01819
Cumulative Proportion 0.7947 0.81890 0.83909 0.85887 0.87745 0.89564
PC19 PC20 PC21 PC22 PC23 PC24 PC25
Standard deviation 10.40468 9.55383 9.33377 8.84628 8.18059 7.77386 7.58014
Proportion of Variance 0.01638 0.01381 0.01318 0.01184 0.01012 0.00914 0.00869
Cumulative Proportion 0.91201 0.92582 0.93900 0.95084 0.96096 0.97010 0.97880
PC26 PC27 PC28 PC29
Standard deviation 7.03762 6.83679 6.62491 1.67e-14
Proportion of Variance 0.00749 0.00707 0.00664 0.00e+00
Cumulative Proportion 0.98629 0.99336 1.00000 1.00e+00
The first PC only captures 21% of the variance and the second captures 12%. Thus, we need more the PCs in this dataset to describe a significant amount of the variance. Finally, we can plot how the genes make up the first two PCs. Then, we could use this information to see if the biological function of these genes explains how the tissues group along PC1 and PC2.
data.frame(gene=rownames(ipca$rotation), ipca$rotation) %>%
ggplot(aes(x=PC1, y=PC2)) +
geom_point() +
geom_text(aes(label=gene), size=2.5, position=position_jitter(width=0.015,height=0.03))
Lab
Problems
Using the
gapminder
data:- Perform a principal components analysis using the variables
lifeExp
,year
,pop
, andgdpPercap
. - Plot all the points on a scatter plot with the the x- and y-axis corresponding to the first two principal components (PC1 and PC2).
- Color the points according to
year
. Set the shape of the point (square, circle, etc) according tocontinent
. - What patterns do you observe if any?
- Perform a principal components analysis using the variables
The Our World in Data COVID-19 dataset is perfect for performing a PCA. Use the code below that generates the
owid_total
data table. This contains a set of variables from the OWID dataset and the total deaths per million by the end of the year 2021.= owid_total read_csv("https://github.com/vancleve/BIO621-DWVR/raw/main/owid-covid-data_2021.csv") %>% filter(date == "2021-12-31") %>% select(location, total_deaths_per_million, total_cases_per_million, people_vaccinated_per_hundred, %>% population_density, median_age, male_smokers, cardiovasc_death_rate, diabetes_prevalence, life_expectancy, gdp_per_capita) filter(if_all(everything(), ~ !is.na(.x)))
- Run a PCA including all the variables except
location
andtotal_deaths_per_million
. What variables “weigh” the strongest (have the largest absolute value) in the first three principal components? - Use
cbind(owid_total, your_PCA$x)
to generate a new data frame - Use the above data frame to plot countries by their location in the coordinates of PC1 and PC2. Color the points by whether they have great than 1000 total deaths per million. What do you notice about the where the points are with the higher number of total cases? Interpret this in terms of the most important variables in PC1 and PC2.
- Adding
geom_text
to label the points by their country (location) name is a nice addition to the plot, but its optional.
- Run a PCA including all the variables except