Tidy Data

Author

Jeremy Van Cleve

Published

September 26, 2022

Outline for today

  • So many data formats, so little time
  • One format to rule them all: tidy data
  • Making tidy data
  • Slicing tidy data with dplyr

So many data formats, so little time

Happy families are all alike;
every unhappy family is unhappy in its own way.

Leo Tolstoy (first line of “Anna Karenina”)

Hadley Wickham points out the nice connection between the above quote and data formats 1. While data formatting might seem esoteric and boring, paying attention to it early on can pay off a great deal in the long term, which hopefully leads to a happy scientist. If one were to break data formatting into two phases, they might be these:

Phase 1. The production of the data itself from your experiment, since you must choose a format and medium in which to record the data.

Phase 2. The loading and formatting of the data in your analysis tool (e.g., R). This might involve minimal reformatting or a great deal of reshaping of the data.

Working backward, you do the least work on data formatting if you save your data initially in the format that is easiest to analyze with R. Since we will use Hadley Wickham's graphing package ggplot2, we will use the data format he advocates: tidy data. Lest you worry that this is too Hadley specific, or even R specific, this kind of format is now quite popular as a way to “layer” or “compose” plots that use different graphical elements to represent different variables in the data (e.g., plotly, or seaborn in Python). To get a sense for what this format is and why it might be useful, consider some different ways to organize data on tuberculosis (TB) cases where you have three variables, country, year, and population, for each measurement of cases. First, each variable, including cases, could have its own column:

library(tidyverse)
table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Here, each row represents a single measurement of TB cases. Alternatively, you could have each row represent a country, with columns for different years. This means you need one table for the cases and another for the population:

table4a # cases
# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table4b # population
# A tibble: 3 × 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

To see why this format might get cumbersome, suppose you want to add more variables: each new variable requires a new table. Moreover, if you are interested in just one variable and how it changes across country and year, you are fine. But if you want to look across variables, say, correlate population and caseload over time, then you have to slice two different tables; with three variables, you have to manipulate three different tables, etc. With the first format, we can include all the variables in a single table. The first format is the tidy format.
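
To make the contrast concrete, here is a minimal sketch using table1, which ships with the tidyverse (the tidyr package). Relating cases to population takes a single step in the tidy format, whereas with table4a and table4b you would first have to line up two tables by hand:

```r
library(tidyverse)

# With tidy data, combining two variables is a one-liner: each row already
# pairs a country-year's cases with its population.
mutate(table1, rate_per_10k = cases / population * 10000)
```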

One format to rule them all: tidy data

The tidy format has the following three rules:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Visually, each variable forms a column, each observation forms a row, and each value sits in its own cell.

The primary benefit of tidy data is that every variable is a vector (a column), which means that slicing data is just slicing columns. Though this can become complicated when the number of variables is large, there are helper functions that make slicing tidy data much easier. The slicing functions come from the package dplyr, and the plotting package ggplot2 assumes tidy data.

To give a flavor of the power of tidy data and ggplot2, we’ll look at an example. The “GWAS Catalog” (https://www.ebi.ac.uk/gwas/home) keeps track of the SNPs associated with traits in GWAS studies. We can download these data, take a reasonable subset to play with (the 100 diseases with the most SNPs), and plot some things. For example, here are SNPs associated with colorectal, prostate, and breast cancer:

# https://www.ebi.ac.uk/gwas/docs/file-downloads
# we'll load a subset of these data; namely, the 100 diseases with the most SNPs in the database once we remove those associated with some common body/weight and education phenotypes
# to clean things up a little, we filter out rows where the CHR_POS isn't a single number; this way, the x-axis label is nice automatically.

gwas = read_tsv("gwas_catalog_v1.0.2-associations_e104_r2021-09-23_no-waist_hip_body_blood_education_math_top100.tsv") %>% 
  filter(!grepl("x|;", CHR_POS)) %>%
  filter(!is.na(CHR_POS)) %>%
  mutate(CHR_POS = as.numeric(CHR_POS))

filter(gwas, `DISEASE/TRAIT` == 'Prostate cancer' | `DISEASE/TRAIT` == "Colorectal cancer" | `DISEASE/TRAIT` == "Breast cancer") %>% 
  ggplot(aes(x=CHR_POS)) + geom_point(aes(y=PVALUE_MLOG, color=`DISEASE/TRAIT`))

Making tidy data

Often, data will not be in the tidy format by default, so it will be necessary to format it. The first step is to figure out what the “variables” and “observations” are. Sometimes this may require carefully thinking about the experimental design used to create the data. Two common problems with data that are untidy are:

  1. One variable might be spread across multiple columns.
  2. One observation might be scattered across multiple rows.

Typically, a dataset will only suffer one of the above problems. The first problem is dealt with using the pivot_longer function and the second with the pivot_wider function.

Making data “longer”

Recall that this table is tidy

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

whereas these two are not

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table4b
# A tibble: 3 × 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

The problem with table4a and table4b is that the variable year is spread across multiple columns. Thus, you need to gather it together into a single column, which makes the data frame longer. The column names 1999 and 2000 become values of a new variable year (whose name we pass to the names_to argument), and the cell values are collected into a new column named by the values_to argument. The first table contains the number of cases:

# note the backticks `` here: we need them since these column names start with a number 
# (i.e., the backticks let us break the normal R naming rules!).

tidy4a = pivot_longer(table4a, c(`1999`, `2000`), names_to = "year", values_to = "cases") 
tidy4a
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

and the second table contains the population size

tidy4b = pivot_longer(table4b, c(`1999`, `2000`), names_to = "year", values_to = "population") 
tidy4b
# A tibble: 6 × 3
  country     year  population
  <chr>       <chr>      <int>
1 Afghanistan 1999    19987071
2 Afghanistan 2000    20595360
3 Brazil      1999   172006362
4 Brazil      2000   174504898
5 China       1999  1272915272
6 China       2000  1280428583

To join these two results together, you use the function full_join:

full_join(tidy4a, tidy4b)
Joining, by = c("country", "year")
# A tibble: 6 × 4
  country     year   cases population
  <chr>       <chr>  <int>      <int>
1 Afghanistan 1999     745   19987071
2 Afghanistan 2000    2666   20595360
3 Brazil      1999   37737  172006362
4 Brazil      2000   80488  174504898
5 China       1999  212258 1272915272
6 China       2000  213766 1280428583

which is the same as the first table

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

These “joins” are related to joins you do with databases if you know SQL. For more on this connection and on joins, see Chapter 13 in “R for Data Science” (http://r4ds.had.co.nz/).

Making data “wider”

The following table is untidy

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

because each observation (a country in a given year) is spread across multiple rows. To tidy it, identify the column that names the variables, which in this case is type, and the column that holds the values, which is count:

pivot_wider(table2, names_from = type, values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Separating and uniting

There are other ways your data might be untidy

  1. A single column actually contains two or more variables (e.g., a rate stored as the single string "cases/population"). In this case, the separate function is used.
  2. Multiple columns actually contain a single variable and need to be combined. In this case, the unite function is used.

To read more about these, see Chapter 12 in “R for Data Science” (http://r4ds.had.co.nz/)
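
To give a flavor of each, tidyr ships two small example tables that suffer exactly these problems: table3 stores the rate cases/population as a single character column, and table5 splits the year across century and year columns.

```r
library(tidyverse)

# table3: one column ("rate") holds two variables as the string "cases/population";
# separate() splits it apart (convert = TRUE turns the pieces into integers)
separate(table3, rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# table5: one variable (year) is split across two columns ("century" and "year");
# unite() pastes them back together into a single year column
unite(table5, year, century, year, sep = "")
```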

Slicing tidy data with dplyr

Now that you have made your data tidy, you probably want to slice and dice it. The dplyr package has handy functions for doing just this. These functions will make slicing MUCH EASIER than the base R way we’ve been doing it so far. Generally, dplyr is useful for doing the following five things:

  1. Pick observations (i.e., rows) by their values for specific variables: filter().
  2. Reorder or sort the rows: arrange().
  3. Pick variables (i.e., columns) by their names: select().
  4. Create new variables with functions of existing variables (or modifying existing variables): mutate().
  5. Collapse many values down to a single summary: summarise().

Each of the functions above works in a similar way.

  • The first argument is the data frame.
  • The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly without using $.
  • The result is a new data frame.

We’ll use the GWAS data you loaded above to demonstrate each of the five tasks.

Filtering rows with filter()

Let’s look at the GWAS data with glimpse (which comes from the tidyverse).

glimpse(gwas)
Rows: 119,874
Columns: 38
$ `DATE ADDED TO CATALOG`      <date> 2019-07-05, 2019-07-05, 2019-07-05, 2019…
$ PUBMEDID                     <dbl> 30698716, 30698716, 30698716, 30698716, 3…
$ `FIRST AUTHOR`               <chr> "de Vries PS", "de Vries PS", "de Vries P…
$ DATE                         <date> 2019-01-29, 2019-01-29, 2019-01-29, 2019…
$ JOURNAL                      <chr> "Am J Epidemiol", "Am J Epidemiol", "Am J…
$ LINK                         <chr> "www.ncbi.nlm.nih.gov/pubmed/30698716", "…
$ STUDY                        <chr> "Multi-Ancestry Genome-Wide Association S…
$ `DISEASE/TRAIT`              <chr> "Triglyceride levels", "Triglyceride leve…
$ `INITIAL SAMPLE SIZE`        <chr> "89,893 European ancestry individuals, 20…
$ `REPLICATION SAMPLE SIZE`    <chr> "136,986 European ancestry individuals, 4…
$ REGION                       <chr> "15q15.1", "4p16.3", "8p22", "10q21.3", "…
$ CHR_ID                       <chr> "15", "4", "8", "10", "4", "16", "20", "1…
$ CHR_POS                      <dbl> 42387225, 3442204, 18415790, 63122540, 87…
$ `REPORTED GENE(S)`           <chr> "CAPN3", "HGFAC", "NAT2, PSD3", "EGR2, NR…
$ MAPPED_GENE                  <chr> "CAPN3", "HGFAC", "NAT2 - NA", "RNU6-543P…
$ UPSTREAM_GENE_ID             <chr> NA, NA, "ENSG00000156006", "ENSG000001994…
$ DOWNSTREAM_GENE_ID           <chr> NA, NA, NA, "ENSG00000148572", NA, NA, "E…
$ SNP_GENE_IDS                 <chr> "ENSG00000092529", "ENSG00000109758", NA,…
$ UPSTREAM_GENE_DISTANCE       <dbl> NA, NA, 14572, 12294, NA, NA, 11061, NA, …
$ DOWNSTREAM_GENE_DISTANCE     <dbl> NA, NA, NA, 10707, NA, NA, 11467, NA, NA,…
$ `STRONGEST SNP-RISK ALLELE`  <chr> "rs28364406-T", "rs13108218-G", "rs149574…
$ SNPS                         <chr> "rs28364406", "rs13108218", "rs1495743", …
$ MERGED                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ SNP_ID_CURRENT               <dbl> 28364406, 13108218, 1495743, 10761716, 60…
$ CONTEXT                      <chr> "intron_variant", "intron_variant", "inte…
$ INTERGENIC                   <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,…
$ `RISK ALLELE FREQUENCY`      <chr> "NR", "NR", "NR", "NR", "NR", "NR", "NR",…
$ `P-VALUE`                    <dbl> 3e-08, 5e-08, 2e-07, 2e-07, 3e-07, 7e-07,…
$ PVALUE_MLOG                  <dbl> 7.522879, 7.301030, 6.698970, 6.698970, 6…
$ `P-VALUE (TEXT)`             <chr> "(EA)", "(EA)", "(EA)", "(EA)", "(EA)", "…
$ `OR or BETA`                 <dbl> 0.081, 0.021, 0.021, 0.018, 0.018, 0.019,…
$ `95% CI (TEXT)`              <chr> "NR unit increase", "NR unit increase", "…
$ `PLATFORM [SNPS PASSING QC]` <chr> "Affymetrix, Illumina [NR] (imputed)", "A…
$ CNV                          <chr> "N", "N", "N", "N", "N", "N", "N", "N", "…
$ MAPPED_TRAIT                 <chr> "triglyceride measurement", "triglyceride…
$ MAPPED_TRAIT_URI             <chr> "http://www.ebi.ac.uk/efo/EFO_0004530", "…
$ `STUDY ACCESSION`            <chr> "GCST008076", "GCST008076", "GCST008076",…
$ `GENOTYPING TECHNOLOGY`      <chr> "Genome-wide genotyping array", "Genome-w…

Each observation (row) is a SNP from a GWAS study, and we have information on its location, the associated disease/trait, etc. Let’s filter to just the SNPs for colorectal cancer:

filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer')
# A tibble: 559 × 38
   DATE ADDED T…¹ PUBME…² FIRST…³ DATE       JOURNAL LINK  STUDY DISEA…⁴ INITI…⁵
   <date>           <dbl> <chr>   <date>     <chr>   <chr> <chr> <chr>   <chr>  
 1 2016-12-21      2.71e7 Wang M  2016-05-05 Nat Co… www.… Comm… Colore… 1,023 …
 2 2016-12-21      2.71e7 Wang M  2016-05-05 Nat Co… www.… Comm… Colore… 1,023 …
 3 2016-12-21      2.71e7 Wang M  2016-05-05 Nat Co… www.… Comm… Colore… 1,023 …
 4 2016-12-21      2.71e7 Wang M  2016-05-05 Nat Co… www.… Comm… Colore… 1,023 …
 5 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 6 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 7 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 8 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 9 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
10 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
# … with 549 more rows, 29 more variables: `REPLICATION SAMPLE SIZE` <chr>,
#   REGION <chr>, CHR_ID <chr>, CHR_POS <dbl>, `REPORTED GENE(S)` <chr>,
#   MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>, DOWNSTREAM_GENE_ID <chr>,
#   SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
#   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>,
#   SNPS <chr>, MERGED <dbl>, SNP_ID_CURRENT <dbl>, CONTEXT <chr>,
#   INTERGENIC <dbl>, `RISK ALLELE FREQUENCY` <chr>, `P-VALUE` <dbl>, …

We could apply another filter by just adding an argument, for instance requiring the -log10(p-value) (higher values mean more significant!) to be greater than 10:

filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer', PVALUE_MLOG > 10)
# A tibble: 141 × 38
   DATE ADDED T…¹ PUBME…² FIRST…³ DATE       JOURNAL LINK  STUDY DISEA…⁴ INITI…⁵
   <date>           <dbl> <chr>   <date>     <chr>   <chr> <chr> <chr>   <chr>  
 1 2016-12-21      2.71e7 Wang M  2016-05-05 Nat Co… www.… Comm… Colore… 1,023 …
 2 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 3 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 4 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 5 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 6 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 7 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 8 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 9 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
10 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
# … with 131 more rows, 29 more variables: `REPLICATION SAMPLE SIZE` <chr>,
#   REGION <chr>, CHR_ID <chr>, CHR_POS <dbl>, `REPORTED GENE(S)` <chr>,
#   MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>, DOWNSTREAM_GENE_ID <chr>,
#   SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
#   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>,
#   SNPS <chr>, MERGED <dbl>, SNP_ID_CURRENT <dbl>, CONTEXT <chr>,
#   INTERGENIC <dbl>, `RISK ALLELE FREQUENCY` <chr>, `P-VALUE` <dbl>, …

Note that filter combines consecutive arguments with “&” by default. You could equivalently give a single argument containing the “&” to get the same slice:

filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer' & PVALUE_MLOG > 10)
# A tibble: 141 × 38
   DATE ADDED T…¹ PUBME…² FIRST…³ DATE       JOURNAL LINK  STUDY DISEA…⁴ INITI…⁵
   <date>           <dbl> <chr>   <date>     <chr>   <chr> <chr> <chr>   <chr>  
 1 2016-12-21      2.71e7 Wang M  2016-05-05 Nat Co… www.… Comm… Colore… 1,023 …
 2 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 3 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 4 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 5 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 6 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 7 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 8 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
 9 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
10 2018-05-18      2.95e7 Tanika… 2018-02-17 Carcin… www.… GWAS… Colore… 6,692 …
# … with 131 more rows, 29 more variables: `REPLICATION SAMPLE SIZE` <chr>,
#   REGION <chr>, CHR_ID <chr>, CHR_POS <dbl>, `REPORTED GENE(S)` <chr>,
#   MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>, DOWNSTREAM_GENE_ID <chr>,
#   SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
#   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>,
#   SNPS <chr>, MERGED <dbl>, SNP_ID_CURRENT <dbl>, CONTEXT <chr>,
#   INTERGENIC <dbl>, `RISK ALLELE FREQUENCY` <chr>, `P-VALUE` <dbl>, …

Arrange rows with arrange()

The function arrange just sorts the rows based on the columns you specify. For example, to sort the colorectal cancer SNPs by DATE of the study,

colocancer = filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer')
arrange(colocancer, DATE)
# A tibble: 559 × 38
   DATE ADDED T…¹ PUBME…² FIRST…³ DATE       JOURNAL LINK  STUDY DISEA…⁴ INITI…⁵
   <date>           <dbl> <chr>   <date>     <chr>   <chr> <chr> <chr>   <chr>  
 1 2008-06-16      1.76e7 Zanke … 2007-07-08 Nat Ge… www.… Geno… Colore… 1,257 …
 2 2008-06-16      1.76e7 Tomlin… 2007-07-08 Nat Ge… www.… A ge… Colore… 930 Eu…
 3 2008-06-16      1.79e7 Broder… 2007-10-14 Nat Ge… www.… A ge… Colore… 930 Eu…
 4 2008-06-16      1.84e7 Tenesa… 2008-03-30 Nat Ge… www.… Geno… Colore… 981 Eu…
 5 2008-06-16      1.84e7 Tenesa… 2008-03-30 Nat Ge… www.… Geno… Colore… 981 Eu…
 6 2008-06-16      1.84e7 Tenesa… 2008-03-30 Nat Ge… www.… Geno… Colore… 981 Eu…
 7 2008-06-16      1.84e7 Tomlin… 2008-03-30 Nat Ge… www.… A ge… Colore… 922 Eu…
 8 2008-06-16      1.84e7 Tomlin… 2008-03-30 Nat Ge… www.… A ge… Colore… 922 Eu…
 9 2008-06-16      1.84e7 Tomlin… 2008-03-30 Nat Ge… www.… A ge… Colore… 922 Eu…
10 2008-06-16      1.84e7 Tomlin… 2008-03-30 Nat Ge… www.… A ge… Colore… 922 Eu…
# … with 549 more rows, 29 more variables: `REPLICATION SAMPLE SIZE` <chr>,
#   REGION <chr>, CHR_ID <chr>, CHR_POS <dbl>, `REPORTED GENE(S)` <chr>,
#   MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>, DOWNSTREAM_GENE_ID <chr>,
#   SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
#   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>,
#   SNPS <chr>, MERGED <dbl>, SNP_ID_CURRENT <dbl>, CONTEXT <chr>,
#   INTERGENIC <dbl>, `RISK ALLELE FREQUENCY` <chr>, `P-VALUE` <dbl>, …

and use desc to make the sort a descending one,

arrange(colocancer, desc(DATE))
# A tibble: 559 × 38
   DATE ADDED T…¹ PUBME…² FIRST…³ DATE       JOURNAL LINK  STUDY DISEA…⁴ INITI…⁵
   <date>           <dbl> <chr>   <date>     <chr>   <chr> <chr> <chr>   <chr>  
 1 2021-06-24      3.41e7 Nazari… 2021-05-01 Genes … www.… Geno… Colore… 211 Eu…
 2 2021-06-24      3.41e7 Nazari… 2021-05-01 Genes … www.… Geno… Colore… 211 Eu…
 3 2021-06-24      3.41e7 Nazari… 2021-05-01 Genes … www.… Geno… Colore… 211 Eu…
 4 2021-06-24      3.41e7 Nazari… 2021-05-01 Genes … www.… Geno… Colore… 211 Eu…
 5 2021-06-24      3.41e7 Nazari… 2021-05-01 Genes … www.… Geno… Colore… 211 Eu…
 6 2021-06-24      3.41e7 Nazari… 2021-05-01 Genes … www.… Geno… Colore… 211 Eu…
 7 2021-06-24      3.41e7 Nazari… 2021-05-01 Genes … www.… Geno… Colore… 211 Eu…
 8 2021-08-10      3.36e7 Matejc… 2021-02-24 Cancer… www.… Rare… Colore… 2,327 …
 9 2021-08-10      3.36e7 Matejc… 2021-02-24 Cancer… www.… Rare… Colore… 2,327 …
10 2021-02-26      3.25e7 Ishiga… 2020-06-08 Nat Ge… www.… Larg… Colore… 7,062 …
# … with 549 more rows, 29 more variables: `REPLICATION SAMPLE SIZE` <chr>,
#   REGION <chr>, CHR_ID <chr>, CHR_POS <dbl>, `REPORTED GENE(S)` <chr>,
#   MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>, DOWNSTREAM_GENE_ID <chr>,
#   SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
#   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>,
#   SNPS <chr>, MERGED <dbl>, SNP_ID_CURRENT <dbl>, CONTEXT <chr>,
#   INTERGENIC <dbl>, `RISK ALLELE FREQUENCY` <chr>, `P-VALUE` <dbl>, …

Select columns with select()

The select function simply selects specific columns (i.e., variables). To get only the disease, chromosome position, and -log10(p-value) from the full data,

select(gwas, `DISEASE/TRAIT`, CHR_POS, PVALUE_MLOG)
# A tibble: 119,874 × 3
   `DISEASE/TRAIT`       CHR_POS PVALUE_MLOG
   <chr>                   <dbl>       <dbl>
 1 Triglyceride levels  42387225        7.52
 2 Triglyceride levels   3442204        7.30
 3 Triglyceride levels  18415790        6.70
 4 Triglyceride levels  63122540        6.70
 5 Triglyceride levels  87101557        6.52
 6 Triglyceride levels  81501185        6.15
 7 Triglyceride levels  45923216        5.70
 8 Triglyceride levels  44492177        5.52
 9 Triglyceride levels 230160042        5.30
10 Triglyceride levels 164672114        5.22
# … with 119,864 more rows

Add new variables with mutate()

You may want to add new variables that are functions of other variables. For example, you could create a column for the p-value from the PVALUE_MLOG (there is already one in the table, but we’ll make another)

sm_gwas = select(gwas, `DISEASE/TRAIT`, CHR_POS, PVALUE_MLOG, `P-VALUE`)
mutate(sm_gwas, new_p_value = 10^(-PVALUE_MLOG))
# A tibble: 119,874 × 5
   `DISEASE/TRAIT`       CHR_POS PVALUE_MLOG  `P-VALUE` new_p_value
   <chr>                   <dbl>       <dbl>      <dbl>       <dbl>
 1 Triglyceride levels  42387225        7.52 0.00000003  0.00000003
 2 Triglyceride levels   3442204        7.30 0.00000005  0.00000005
 3 Triglyceride levels  18415790        6.70 0.0000002   0.0000002 
 4 Triglyceride levels  63122540        6.70 0.0000002   0.0000002 
 5 Triglyceride levels  87101557        6.52 0.0000003   0.0000003 
 6 Triglyceride levels  81501185        6.15 0.0000007   0.0000007 
 7 Triglyceride levels  45923216        5.70 0.000002    0.000002  
 8 Triglyceride levels  44492177        5.52 0.000003    0.000003  
 9 Triglyceride levels 230160042        5.30 0.000005    0.000005  
10 Triglyceride levels 164672114        5.22 0.000006    0.000006  
# … with 119,864 more rows

Summaries with summarise()

The function summarise (damn British spelling, though the American one works too apparently) collapses the data frame to a single row:

summarize(gwas, mean_mlog_p_value = mean(PVALUE_MLOG)) 
# A tibble: 1 × 1
  mean_mlog_p_value
              <dbl>
1              21.4

This isn’t terribly useful until you pair it with the group_by function, which makes summarize act on the data by “group”, where the groups are specified by the values of variables. To see the average for each disease, group by DISEASE/TRAIT,

grouped_gwas = group_by(gwas, `DISEASE/TRAIT`)
summarize(grouped_gwas, mean_mlog_p_value = mean(PVALUE_MLOG))
# A tibble: 100 × 2
   `DISEASE/TRAIT`                   mean_mlog_p_value
   <chr>                                         <dbl>
 1 Adolescent idiopathic scoliosis                14.3
 2 Alanine aminotransferase levels                19.8
 3 Apolipoprotein A1 levels                       48.8
 4 Apolipoprotein B levels                        62.2
 5 Appendicular lean mass                         19.0
 6 Aspartate aminotransferase levels              18.5
 7 Asthma                                         17.4
 8 Basophil count                                 22.8
 9 Brain region volumes                           11.0
10 Breast cancer                                  12.8
# … with 90 more rows

or add the journal the study was published in

grouped_gwas = group_by(gwas, `DISEASE/TRAIT`, JOURNAL)
meanp = summarize(grouped_gwas, mean_mlog_p_value = mean(PVALUE_MLOG))
`summarise()` has grouped output by 'DISEASE/TRAIT'. You can override using the
`.groups` argument.
meanp
# A tibble: 579 × 3
# Groups:   DISEASE/TRAIT [100]
   `DISEASE/TRAIT`                 JOURNAL                       mean_mlog_p_v…¹
   <chr>                           <chr>                                   <dbl>
 1 Adolescent idiopathic scoliosis Am J Hum Genet                          16.7 
 2 Adolescent idiopathic scoliosis Hum Genet                               14.4 
 3 Adolescent idiopathic scoliosis Hum Mol Genet                           12.1 
 4 Adolescent idiopathic scoliosis Nat Commun                              12.8 
 5 Alanine aminotransferase levels Nat Commun                              20.7 
 6 Alanine aminotransferase levels Nat Genet                               19.3 
 7 Alanine aminotransferase levels Obesity (Silver Spring)                  5.67
 8 Alanine aminotransferase levels Sci Rep                                  7.15
 9 Apolipoprotein A1 levels        Arterioscler Thromb Vasc Biol          527.  
10 Apolipoprotein A1 levels        Nat Genet                               29.2 
# … with 569 more rows, and abbreviated variable name ¹​mean_mlog_p_value

Finally, let’s take the mean across all diseases for each journal and then sort by p-value to see the pattern with the journal a little better

arrange(summarize(group_by(gwas, JOURNAL), mean_mlog_p_value = mean(PVALUE_MLOG)), desc(mean_mlog_p_value))
# A tibble: 132 × 2
   JOURNAL                       mean_mlog_p_value
   <chr>                                     <dbl>
 1 Arterioscler Thromb Vasc Biol             317. 
 2 J Acad Nutr Diet                          149. 
 3 Sci Adv                                    53.5
 4 NPJ Genom Med                              52.6
 5 J Cell Mol Med                             40.5
 6 Circulation                                35.4
 7 Lancet                                     31.8
 8 Nat Med                                    30.5
 9 Circ Cardiovasc Genet                      29.9
10 Cell                                       29.8
# … with 122 more rows

Another nice function to use with summarize is the “counting” function n(), which can give the number of rows in each group. Since each row is a SNP, we can get the number of SNPs for each disease/trait this way:

arrange(summarize(group_by(gwas, `DISEASE/TRAIT`), n_SNPS = n()), desc(n_SNPS))
# A tibble: 100 × 2
   `DISEASE/TRAIT`                                                 n_SNPS
   <chr>                                                            <int>
 1 Electrocardiogram morphology (amplitude at temporal datapoints)   5272
 2 Height                                                            5014
 3 Heel bone mineral density                                         4508
 4 Type 2 diabetes                                                   3079
 5 Mean corpuscular hemoglobin                                       2920
 6 Protein quantitative trait loci (liver)                           2891
 7 Platelet count                                                    2506
 8 Red cell distribution width                                       2441
 9 Mean corpuscular volume                                           2384
10 Metabolite levels                                                 2377
# … with 90 more rows

Using the %>% operator to chain together functions

The slicing and summarizing operations we’ve been doing can quickly get messy as we add more operations. Since each operation is applied to the result of the previous one, we end up with deeply nested parentheses, which can be hard to read. The pipe operator %>% lets you chain operations together so that it’s easier to read them from left to right. For example, the previous summarize and arrange operations could be put together like

gwas %>%
  group_by(`DISEASE/TRAIT`) %>%
  summarize(n_SNPS = n()) %>%
  arrange(desc(n_SNPS))
# A tibble: 100 × 2
   `DISEASE/TRAIT`                                                 n_SNPS
   <chr>                                                            <int>
 1 Electrocardiogram morphology (amplitude at temporal datapoints)   5272
 2 Height                                                            5014
 3 Heel bone mineral density                                         4508
 4 Type 2 diabetes                                                   3079
 5 Mean corpuscular hemoglobin                                       2920
 6 Protein quantitative trait loci (liver)                           2891
 7 Platelet count                                                    2506
 8 Red cell distribution width                                       2441
 9 Mean corpuscular volume                                           2384
10 Metabolite levels                                                 2377
# … with 90 more rows

Effectively, the pipe takes what is on its left side and uses it as the first argument of the function on its right side. In the example above, gwas, our initial data table, is on the left and gets used as the first argument of group_by. The group_by function returns a data frame, which is then used as the first argument of the function to the right of the next pipe, summarize. The rule of thumb, then, is that to convert a command without a pipe to one with a pipe, take the first argument out of the function and put it on the left of the function, separated by the pipe. Likewise, to get rid of the pipe, take the input on the left, make it the first argument of the function on the right of the pipe, and delete the pipe.
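
To see the conversion rule in action on something self-contained, here is the same computation written both ways, using the built-in table1 rather than gwas:

```r
library(tidyverse)

# Nested form: read inside-out, with the data frame buried as the first argument
arrange(summarize(group_by(table1, country), total_cases = sum(cases)),
        desc(total_cases))

# Piped form: the same computation, read top to bottom
table1 %>%
  group_by(country) %>%
  summarize(total_cases = sum(cases)) %>%
  arrange(desc(total_cases))
```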

A very significant advantage of structuring your slicing / wrangling commands this way when you use RStudio is that RStudio will help you auto-complete column names. As you’ve probably seen already, typing these names can get cumbersome, so this is a big help. RStudio doesn’t do this when you put the data.frame object in the function directly (for some reason unknown to me).

Lab

Problems

  1. Using the GWAS data, produce a data table that shows the mean, maximum, minimum, and variance of PVALUE_MLOG for each combination of disease and journal.

  2. Recall the gene expression data that were briefly introduced last week:

    library(readxl)
    imprint = read_excel("babak-etal-2015_imprinted-mouse.xlsx", na = "NaN")

    Each row is a gene, each column is a tissue type, and each cell contains a gene expression measurement.

    Make these data tidy (in one way) by collapsing the tissue columns and making the data table longer

    You will need the pivot_longer function. The first trick here will be first to identify the “observations” and then the “names_to”, which is the variable that changes across each observation.

    The second trick is specifying the columns across which you need to gather together into a single column. A hint for the second trick is that you can specify the columns you don’t want with the “-” sign. That is, with the GWAS data, for example, every column except PVALUE_MLOG would be -PVALUE_MLOG.

    The answer will be deceptively simple! This is the elegance / frustration of R.

  3. Using the tidy data from Problem 2, find the number of genes (across all tissue types) that have an expression value <= 10 and > 2. These genes are “paternally imprinted” in the Babak et al. dataset, which means they are only expressed from the maternal copy of the gene.

    Hint: you will need to count each gene once even if it appears in multiple rows, which can be done with the distinct function.

  4. Use pivot_wider to return the tidy data from Problem 2 to their original wide format.

  5. Use the New York Times COVID-19 Data:

    us = read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv", show_col_types = FALSE)

    These data are tidy. Make them wider by making columns for each state and using the cases as the values. You should have only one row for each date between now and when the pandemic began. If you’re getting many thousands of rows, you need to give pivot_wider some id_cols (see the pivot_wider documentation…). OR give pivot_wider a simpler data table…Either way works.

Footnotes

  1. Wickham, Hadley. 2014. “Tidy Data.” J Stat Softw, 59(10):1–23. DOI: 10.18637/jss.v059.i10↩︎