Tidy Data

Author

Jeremy Van Cleve

Published

September 24, 2024

Outline for today

  • So many data formats, so little time
  • One format to rule them all: tidy data
  • Making tidy data
  • Slicing tidy data with dplyr

So many data formats, so little time

Happy families are all alike;
every unhappy family is unhappy in its own way.

Leo Tolstoy (first line of “Anna Karenina”)

Hadley Wickham points out a nice connection between the quote above and data formats¹. While data formatting might seem esoteric and boring, paying attention to it early on can pay off a great deal in the long term, which hopefully leads to a happy scientist. If one were to break data formatting into two phases, they might be these:

Phase 1. The literal production of the data from your experiment, since you must choose a format and medium in which to record the data.

Phase 2. The input and formatting of data into your analysis tool (e.g., R). This might involve minimal changes to the format or a great deal of reshaping of the data.

Working backward, you do the least work on data formatting if you initially save your data in the format that is easiest to analyze with R. Since we will use Hadley Wickham's graphing package, ggplot2, we will adopt the data format he advocates: tidy data. Lest you worry that this is too Hadley-specific, or even R-specific, this kind of format is now quite popular as a way to “layer” or “compose” plots that use different graphical elements to represent different variables in the data (e.g., plotly and seaborn in Python and AlgebraOfGraphics.jl and TidierPlots.jl in Julia). To get a sense for what this format is and why it might be useful, consider some different ways to organize data on tuberculosis (TB) cases where you have three variables, country, year, and population, for each measurement of cases. First, each variable, including cases, could have its own column:

library(tidyverse)
table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Here, each row represents one time you measured the TB cases. Alternatively, you could have each row represent a country, with columns for different years. This requires one table for the cases and another for the population:

table4a # cases
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table4b # population
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

To see why this second format might get cumbersome, suppose you want to add more variables: each new variable requires a new table. If you are interested in just one variable and how it changes across country and year, you are fine. But if you want to look across variables, say to correlate population and caseload over time, then you have to slice two different tables; to manipulate three variables, you have to manipulate three different tables, and so on. With the first format, all the variables fit in a single table. The first format is the tidy format.
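To make that concrete, here is a small sketch of how a cross-variable quantity like a per-capita rate becomes a single vectorized expression when both variables live in the same row (the tidy TB table is rebuilt by hand so the snippet stands alone); with the two-table format you would first have to align the two tables by country and year:

```r
library(tibble)

# the tidy TB table from above, rebuilt by hand
table1 <- tribble(
  ~country,      ~year, ~cases, ~population,
  "Afghanistan",  1999,    745,    19987071,
  "Afghanistan",  2000,   2666,    20595360,
  "Brazil",       1999,  37737,   172006362,
  "Brazil",       2000,  80488,   174504898,
  "China",        1999, 212258,  1272915272,
  "China",        2000, 213766,  1280428583
)

# cases and population sit in the same row, so a per-10,000-person rate
# is one vector operation -- no table alignment needed
rate_per_10k <- with(table1, cases / population * 10000)
round(rate_per_10k, 3)
```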

One format to rule them all: tidy data

The tidy format has the following three rules²:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Visually, this corresponds to a table where variables run across as columns, observations run down as rows, and each cell holds a single value.

The primary benefit of tidy data is that every variable is a vector (a column), which means that slicing data is just slicing columns. Though this may become complicated when the number of variables is large, there are helper functions that make slicing tidy data much easier. The slicing functions come from the dplyr package, and the plotting package ggplot2 assumes tidy data.

To give a flavor of the power of tidy data and ggplot2, we’ll look at an example. The “GWAS Catalog” (https://www.ebi.ac.uk/gwas/home) keeps track of the SNPs associated with traits in genome-wide association studies (GWAS). We can download these data, take a reasonable subset to play with (the 100 diseases with the most SNPs), and plot some things. For example, here are the SNPs associated with breast, colorectal, and prostate cancer:

# https://www.ebi.ac.uk/gwas/docs/file-downloads
# we'll load a subset of these data; namely, the 100 diseases with the most SNPs in the database once we remove those associated with some common body/weight and education phenotypes
# to clean things up a little, we filter out rows where the CHR_POS isn't a single number; this way, the x-axis label is nice automatically.

gwas = read_tsv("gwas_catalog_v1.0.2-associations_e104_r2021-09-23_no-waist_hip_body_blood_education_math_top100.tsv", col_types = cols(CHR_POS = col_number()))

filter(gwas, `DISEASE/TRAIT` == 'Prostate cancer' | `DISEASE/TRAIT` == "Colorectal cancer" | `DISEASE/TRAIT` == "Breast cancer") |>
  ggplot(aes(x=CHR_POS)) + 
  geom_point(aes(y=PVALUE_MLOG, color=`DISEASE/TRAIT`))

Making tidy data

Often, data will not be in the tidy format by default, so it will be necessary to format it. The first step is to figure out what the “variables” and “observations” are. Sometimes this may require carefully thinking about the experimental design used to create the data. Two common problems with data that are untidy are:

  1. One variable might be spread across multiple columns.
  2. One observation might be scattered across multiple rows.

Typically, a dataset will only suffer one of the above problems. The first problem is dealt with using the pivot_longer function and the second with the pivot_wider function.

Making data “longer”

Recall that this table is tidy

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

whereas these two are not

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table4b
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

The problem with table4a and table4b is that the variable year is spread across multiple columns. Thus, you need to gather it together into a single column, which makes the data frame longer. The columns 1999 and 2000 will be collected as values of the variable year, which we pass to the argument names_to. The first table contains the number of cases (which is our values_to argument):

# note the backticks `` here: we need them since the column names start with a number
# (i.e., this allows us to break the normal R naming rules!).

tidy4a = pivot_longer(table4a, cols = c(`1999`, `2000`), names_to = "year", values_to = "cases") 
tidy4a
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

and the second table contains the population size

tidy4b = pivot_longer(table4b, cols = c(`1999`, `2000`), names_to = "year", values_to = "population") 
tidy4b
# A tibble: 6 × 3
  country     year  population
  <chr>       <chr>      <dbl>
1 Afghanistan 1999    19987071
2 Afghanistan 2000    20595360
3 Brazil      1999   172006362
4 Brazil      2000   174504898
5 China       1999  1272915272
6 China       2000  1280428583

To join these two results together, you use the function full_join

full_join(tidy4a, tidy4b)
Joining with `by = join_by(country, year)`
# A tibble: 6 × 4
  country     year   cases population
  <chr>       <chr>  <dbl>      <dbl>
1 Afghanistan 1999     745   19987071
2 Afghanistan 2000    2666   20595360
3 Brazil      1999   37737  172006362
4 Brazil      2000   80488  174504898
5 China       1999  212258 1272915272
6 China       2000  213766 1280428583

which is the same as the first table

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

These “joins” are related to joins you do with databases if you know SQL. We’ll talk more about this next week.

How does pivoting longer work?

To get a closer look at how this works conceptually, we’ll borrow an example from Hadley’s R for Data Science book³. Suppose we have the following blood pressure data from three individuals

bpdf = tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)

where id refers to the individual and bp1 and bp2 are measurements of blood pressure at two different times. By the way, the tribble function is a handy way of creating a tibble data frame by hand.

We want a tidy table with three columns: id, which we already have; measurement, which will be either bp1 or bp2; and value, the value of the measurement. To do this, we use the code

pivot_longer(bpdf, cols = c(bp1, bp2), names_to = "measurement", values_to = "value")
# A tibble: 6 × 3
  id    measurement value
  <chr> <chr>       <dbl>
1 A     bp1           100
2 A     bp2           120
3 B     bp1           140
4 B     bp2           115
5 C     bp1           120
6 C     bp2           125

To see better how this reshaping works, think first about the id column. We keep this column, and each of its values is repeated once for every column being pivoted (twice here: once for bp1 and once for bp2). The old column names become a new column, measurement, whose values repeat once for each row of the original dataset. The second new column, value, captures the values, which are unwound row by row from the original table into the new one.
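The repetition pattern can be sketched with plain vectors (an illustration of the bookkeeping, not of how pivot_longer is actually implemented):

```r
# the wide blood pressure table as plain vectors
id  <- c("A", "B", "C")
bp1 <- c(100, 140, 120)
bp2 <- c(120, 115, 125)

# each id is repeated once per pivoted column (2 columns here)
long_id <- rep(id, each = 2)
# the old column names repeat once per original row (3 rows here)
long_measurement <- rep(c("bp1", "bp2"), times = 3)
# the values are unwound row by row: A's bp1 and bp2, then B's, then C's
long_value <- c(rbind(bp1, bp2))

data.frame(id = long_id, measurement = long_measurement, value = long_value)
```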

Making data “wider”

The following table is untidy

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

because each observation (a country in a given year) is spread across two rows. To tidy it, identify the column that names the variables, which in this case is type, and then the column with the values, which is count:

pivot_wider(table2, names_from = type, values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

How does pivoting wider work?

Let’s look at the long alternate version of our blood pressure example.

long_bpdf = tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115, 
  "A",        "bp2",    120,
  "A",        "bp3",    105
)
long_bpdf
# A tibble: 5 × 3
  id    measurement value
  <chr> <chr>       <dbl>
1 A     bp1           100
2 B     bp1           140
3 B     bp2           115
4 A     bp2           120
5 A     bp3           105

To pivot this wider, we could do

pivot_wider(long_bpdf, names_from = measurement, values_from = value)
# A tibble: 2 × 4
  id      bp1   bp2   bp3
  <chr> <dbl> <dbl> <dbl>
1 A       100   120   105
2 B       140   115    NA

The pivot_wider function accomplishes this by looking for unique values in the measurement column since these will be the new columns. These values are

unique(long_bpdf$measurement)
[1] "bp1" "bp2" "bp3"

By default, the rows in the output are determined by all the variables that aren’t going into the new names or values. These are called the id_cols. Here there is only one such column, id, but in general there can be any number. The new columns, bp1, bp2, and bp3, are then filled from the data in the original table, and NA is inserted when there isn’t a corresponding row for that column and row combination.
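To see the id_cols at work with more than one identifying column, here is a sketch with invented follow-up data (hypothetical, for illustration only), where id and week together determine the output rows:

```r
library(tidyr)
library(tibble)

# hypothetical data: measurements for each individual in each week
long2 <- tribble(
  ~id, ~week, ~measurement, ~value,
  "A",     1,        "bp1",    100,
  "A",     1,        "bp2",    120,
  "A",     2,        "bp1",     98,
  "B",     1,        "bp1",    140
)

# id and week are left over after names_from/values_from are claimed,
# so together they act as the id_cols that define the output rows
wide2 <- pivot_wider(long2, names_from = measurement, values_from = value)
wide2
```

There are three unique (id, week) pairs, so the output has three rows, with NA filled in wherever a pair has no bp2 measurement.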

More pivoting wider and longer

There are other ways your data might be untidy

  1. A single column actually contains two or more variables (like a ratio of two variables). In this case, the separate or separate_wider_delim functions may be used.
  2. Multiple columns actually contain a single variable and need to be combined. In this case, the unite function is used.

To read more about these, see the “vignette” (or code tutorial) in R by typing vignette("pivot")
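As a quick sketch of both functions (the column contents are invented to mirror the TB tables; separate_wider_delim needs tidyr 1.3 or later):

```r
library(tidyr)
library(tibble)

# one column holding two variables, joined by "/"
df <- tribble(
  ~country,      ~rate,
  "Afghanistan", "745/19987071",
  "Brazil",      "37737/172006362"
)

# split the rate column into two new (character) columns
split_df <- separate_wider_delim(df, rate, delim = "/",
                                 names = c("cases", "population"))
split_df

# unite() does the reverse: glue the two columns back into one
rejoined <- unite(split_df, "rate", cases, population, sep = "/")
rejoined
```

Note that the split columns come out as character; you would convert them with, e.g., as.numeric() before doing arithmetic.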

Slicing tidy data with dplyr

Now that you have made your data tidy, you probably want to slice and dice it. The dplyr package has handy functions for doing just this. These functions will make slicing MUCH EASIER than the base R way we’ve been doing it so far. Generally, dplyr is useful for doing the following five things

  1. Pick observations (i.e., rows) by their values for specific variables: filter().
  2. Reorder or sort the rows: arrange().
  3. Pick variables (i.e., columns) by their names: select().
  4. Create new variables with functions of existing variables (or modifying existing variables): mutate().
  5. Collapse many values down to a single summary: summarize().

Each of the functions above works in a similar way.

  • The first argument is the data frame.
  • The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly without using $.
  • The result is a new data frame.

We’ll use the GWAS data you loaded above to demonstrate each of the five tasks.

Filtering rows with filter()

Let’s look at the GWAS data with glimpse (which is from tidyverse).

glimpse(gwas)
Rows: 123,569
Columns: 38
$ `DATE ADDED TO CATALOG`      <date> 2019-07-05, 2019-07-05, 2019-07-05, 2019…
$ PUBMEDID                     <dbl> 30698716, 30698716, 30698716, 30698716, 3…
$ `FIRST AUTHOR`               <chr> "de Vries PS", "de Vries PS", "de Vries P…
$ DATE                         <date> 2019-01-29, 2019-01-29, 2019-01-29, 2019…
$ JOURNAL                      <chr> "Am J Epidemiol", "Am J Epidemiol", "Am J…
$ LINK                         <chr> "www.ncbi.nlm.nih.gov/pubmed/30698716", "…
$ STUDY                        <chr> "Multi-Ancestry Genome-Wide Association S…
$ `DISEASE/TRAIT`              <chr> "Triglyceride levels", "Triglyceride leve…
$ `INITIAL SAMPLE SIZE`        <chr> "89,893 European ancestry individuals, 20…
$ `REPLICATION SAMPLE SIZE`    <chr> "136,986 European ancestry individuals, 4…
$ REGION                       <chr> "15q15.1", "4p16.3", "8p22", "10q21.3", "…
$ CHR_ID                       <chr> "15", "4", "8", "10", "4", "16", "20", "1…
$ CHR_POS                      <dbl> 42387225, 3442204, 18415790, 63122540, 87…
$ `REPORTED GENE(S)`           <chr> "CAPN3", "HGFAC", "NAT2, PSD3", "EGR2, NR…
$ MAPPED_GENE                  <chr> "CAPN3", "HGFAC", "NAT2 - NA", "RNU6-543P…
$ UPSTREAM_GENE_ID             <chr> NA, NA, "ENSG00000156006", "ENSG000001994…
$ DOWNSTREAM_GENE_ID           <chr> NA, NA, NA, "ENSG00000148572", NA, NA, "E…
$ SNP_GENE_IDS                 <chr> "ENSG00000092529", "ENSG00000109758", NA,…
$ UPSTREAM_GENE_DISTANCE       <dbl> NA, NA, 14572, 12294, NA, NA, 11061, NA, …
$ DOWNSTREAM_GENE_DISTANCE     <dbl> NA, NA, NA, 10707, NA, NA, 11467, NA, NA,…
$ `STRONGEST SNP-RISK ALLELE`  <chr> "rs28364406-T", "rs13108218-G", "rs149574…
$ SNPS                         <chr> "rs28364406", "rs13108218", "rs1495743", …
$ MERGED                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ SNP_ID_CURRENT               <dbl> 28364406, 13108218, 1495743, 10761716, 60…
$ CONTEXT                      <chr> "intron_variant", "intron_variant", "inte…
$ INTERGENIC                   <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,…
$ `RISK ALLELE FREQUENCY`      <chr> "NR", "NR", "NR", "NR", "NR", "NR", "NR",…
$ `P-VALUE`                    <dbl> 3e-08, 5e-08, 2e-07, 2e-07, 3e-07, 7e-07,…
$ PVALUE_MLOG                  <dbl> 7.522879, 7.301030, 6.698970, 6.698970, 6…
$ `P-VALUE (TEXT)`             <chr> "(EA)", "(EA)", "(EA)", "(EA)", "(EA)", "…
$ `OR or BETA`                 <dbl> 0.081, 0.021, 0.021, 0.018, 0.018, 0.019,…
$ `95% CI (TEXT)`              <chr> "NR unit increase", "NR unit increase", "…
$ `PLATFORM [SNPS PASSING QC]` <chr> "Affymetrix, Illumina [NR] (imputed)", "A…
$ CNV                          <chr> "N", "N", "N", "N", "N", "N", "N", "N", "…
$ MAPPED_TRAIT                 <chr> "triglyceride measurement", "triglyceride…
$ MAPPED_TRAIT_URI             <chr> "http://www.ebi.ac.uk/efo/EFO_0004530", "…
$ `STUDY ACCESSION`            <chr> "GCST008076", "GCST008076", "GCST008076",…
$ `GENOTYPING TECHNOLOGY`      <chr> "Genome-wide genotyping array", "Genome-w…

Each observation (row) is a SNP from a GWAS study, and we have information on its location, the disease/trait it is associated with, etc. Let’s filter to just the SNPs for colorectal cancer

filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer')
# A tibble: 561 × 38
   DATE ADDED TO CATALO…¹ PUBMEDID `FIRST AUTHOR` DATE       JOURNAL LINK  STUDY
   <date>                    <dbl> <chr>          <date>     <chr>   <chr> <chr>
 1 2016-12-21             27145994 Wang M         2016-05-05 Nat Co… www.… Comm…
 2 2016-12-21             27145994 Wang M         2016-05-05 Nat Co… www.… Comm…
 3 2016-12-21             27145994 Wang M         2016-05-05 Nat Co… www.… Comm…
 4 2016-12-21             27145994 Wang M         2016-05-05 Nat Co… www.… Comm…
 5 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 6 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 7 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 8 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 9 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
10 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
# ℹ 551 more rows
# ℹ abbreviated name: ¹​`DATE ADDED TO CATALOG`
# ℹ 31 more variables: `DISEASE/TRAIT` <chr>, `INITIAL SAMPLE SIZE` <chr>,
#   `REPLICATION SAMPLE SIZE` <chr>, REGION <chr>, CHR_ID <chr>, CHR_POS <dbl>,
#   `REPORTED GENE(S)` <chr>, MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>,
#   DOWNSTREAM_GENE_ID <chr>, SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
#   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>, …

We could apply another filter by just adding an argument, such as requiring a -log(p-value) (higher values mean more significant!) greater than 10

filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer', PVALUE_MLOG > 10)
# A tibble: 142 × 38
   DATE ADDED TO CATALO…¹ PUBMEDID `FIRST AUTHOR` DATE       JOURNAL LINK  STUDY
   <date>                    <dbl> <chr>          <date>     <chr>   <chr> <chr>
 1 2016-12-21             27145994 Wang M         2016-05-05 Nat Co… www.… Comm…
 2 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 3 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 4 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 5 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 6 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 7 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 8 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 9 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
10 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
# ℹ 132 more rows
# ℹ abbreviated name: ¹​`DATE ADDED TO CATALOG`
# ℹ 31 more variables: `DISEASE/TRAIT` <chr>, `INITIAL SAMPLE SIZE` <chr>,
#   `REPLICATION SAMPLE SIZE` <chr>, REGION <chr>, CHR_ID <chr>, CHR_POS <dbl>,
#   `REPORTED GENE(S)` <chr>, MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>,
#   DOWNSTREAM_GENE_ID <chr>, SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
#   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>, …

Note that filter combines consecutive arguments with “&” by default. You could equivalently give a single argument containing the “&” to get the same slice

filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer' & PVALUE_MLOG > 10)
# A tibble: 142 × 38
   DATE ADDED TO CATALO…¹ PUBMEDID `FIRST AUTHOR` DATE       JOURNAL LINK  STUDY
   <date>                    <dbl> <chr>          <date>     <chr>   <chr> <chr>
 1 2016-12-21             27145994 Wang M         2016-05-05 Nat Co… www.… Comm…
 2 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 3 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 4 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 5 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 6 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 7 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 8 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
 9 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
10 2018-05-18             29471430 Tanikawa C     2018-02-17 Carcin… www.… GWAS…
# ℹ 132 more rows
# ℹ abbreviated name: ¹​`DATE ADDED TO CATALOG`
# ℹ 31 more variables: `DISEASE/TRAIT` <chr>, `INITIAL SAMPLE SIZE` <chr>,
#   `REPLICATION SAMPLE SIZE` <chr>, REGION <chr>, CHR_ID <chr>, CHR_POS <dbl>,
#   `REPORTED GENE(S)` <chr>, MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>,
#   DOWNSTREAM_GENE_ID <chr>, SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
#   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>, …

Arrange rows with arrange()

The function arrange just sorts the rows based on the columns you specify. For example, to sort the colorectal cancer SNPs by the DATE of the study,

colocancer = filter(gwas, `DISEASE/TRAIT` == 'Colorectal cancer')
arrange(colocancer, DATE)
# A tibble: 561 × 38
   DATE ADDED TO CATALO…¹ PUBMEDID `FIRST AUTHOR` DATE       JOURNAL LINK  STUDY
   <date>                    <dbl> <chr>          <date>     <chr>   <chr> <chr>
 1 2008-06-16             17618283 Zanke BW       2007-07-08 Nat Ge… www.… Geno…
 2 2008-06-16             17618284 Tomlinson I    2007-07-08 Nat Ge… www.… A ge…
 3 2008-06-16             17934461 Broderick P    2007-10-14 Nat Ge… www.… A ge…
 4 2008-06-16             18372901 Tenesa A       2008-03-30 Nat Ge… www.… Geno…
 5 2008-06-16             18372901 Tenesa A       2008-03-30 Nat Ge… www.… Geno…
 6 2008-06-16             18372901 Tenesa A       2008-03-30 Nat Ge… www.… Geno…
 7 2008-06-16             18372905 Tomlinson IP   2008-03-30 Nat Ge… www.… A ge…
 8 2008-06-16             18372905 Tomlinson IP   2008-03-30 Nat Ge… www.… A ge…
 9 2008-06-16             18372905 Tomlinson IP   2008-03-30 Nat Ge… www.… A ge…
10 2008-06-16             18372905 Tomlinson IP   2008-03-30 Nat Ge… www.… A ge…
# ℹ 551 more rows
# ℹ abbreviated name: ¹​`DATE ADDED TO CATALOG`
# ℹ 31 more variables: `DISEASE/TRAIT` <chr>, `INITIAL SAMPLE SIZE` <chr>,
#   `REPLICATION SAMPLE SIZE` <chr>, REGION <chr>, CHR_ID <chr>, CHR_POS <dbl>,
#   `REPORTED GENE(S)` <chr>, MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>,
#   DOWNSTREAM_GENE_ID <chr>, SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
#   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>, …

and use desc to make the sort a descending one,

arrange(colocancer, desc(DATE))
# A tibble: 561 × 38
   DATE ADDED TO CATALO…¹ PUBMEDID `FIRST AUTHOR` DATE       JOURNAL LINK  STUDY
   <date>                    <dbl> <chr>          <date>     <chr>   <chr> <chr>
 1 2021-06-24             34062886 Nazarian A     2021-05-01 Genes … www.… Geno…
 2 2021-06-24             34062886 Nazarian A     2021-05-01 Genes … www.… Geno…
 3 2021-06-24             34062886 Nazarian A     2021-05-01 Genes … www.… Geno…
 4 2021-06-24             34062886 Nazarian A     2021-05-01 Genes … www.… Geno…
 5 2021-06-24             34062886 Nazarian A     2021-05-01 Genes … www.… Geno…
 6 2021-06-24             34062886 Nazarian A     2021-05-01 Genes … www.… Geno…
 7 2021-06-24             34062886 Nazarian A     2021-05-01 Genes … www.… Geno…
 8 2021-08-10             33627384 Matejcic M     2021-02-24 Cancer… www.… Rare…
 9 2021-08-10             33627384 Matejcic M     2021-02-24 Cancer… www.… Rare…
10 2021-02-26             32514122 Ishigaki K     2020-06-08 Nat Ge… www.… Larg…
# ℹ 551 more rows
# ℹ abbreviated name: ¹​`DATE ADDED TO CATALOG`
# ℹ 31 more variables: `DISEASE/TRAIT` <chr>, `INITIAL SAMPLE SIZE` <chr>,
#   `REPLICATION SAMPLE SIZE` <chr>, REGION <chr>, CHR_ID <chr>, CHR_POS <dbl>,
#   `REPORTED GENE(S)` <chr>, MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>,
#   DOWNSTREAM_GENE_ID <chr>, SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
#   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>, …

Select columns with select()

The select function simply selects specific columns (i.e., variables). To get only the disease, chromosome position, and -log(p-value) from the full data,

select(gwas, `DISEASE/TRAIT`, CHR_POS, PVALUE_MLOG)
# A tibble: 123,569 × 3
   `DISEASE/TRAIT`       CHR_POS PVALUE_MLOG
   <chr>                   <dbl>       <dbl>
 1 Triglyceride levels  42387225        7.52
 2 Triglyceride levels   3442204        7.30
 3 Triglyceride levels  18415790        6.70
 4 Triglyceride levels  63122540        6.70
 5 Triglyceride levels  87101557        6.52
 6 Triglyceride levels  81501185        6.15
 7 Triglyceride levels  45923216        5.70
 8 Triglyceride levels  44492177        5.52
 9 Triglyceride levels 230160042        5.30
10 Triglyceride levels 164672114        5.22
# ℹ 123,559 more rows

Add new variables with mutate()

You may want to add new variables that are functions of other variables. For example, you could create a column for the p-value from the PVALUE_MLOG (there is already one in the table, but we’ll make another)

sm_gwas = select(gwas, `DISEASE/TRAIT`, CHR_POS, PVALUE_MLOG, `P-VALUE`)
mutate(sm_gwas, new_p_value = 10^(-PVALUE_MLOG))
# A tibble: 123,569 × 5
   `DISEASE/TRAIT`       CHR_POS PVALUE_MLOG  `P-VALUE` new_p_value
   <chr>                   <dbl>       <dbl>      <dbl>       <dbl>
 1 Triglyceride levels  42387225        7.52 0.00000003  0.00000003
 2 Triglyceride levels   3442204        7.30 0.00000005  0.00000005
 3 Triglyceride levels  18415790        6.70 0.0000002   0.0000002 
 4 Triglyceride levels  63122540        6.70 0.0000002   0.0000002 
 5 Triglyceride levels  87101557        6.52 0.0000003   0.0000003 
 6 Triglyceride levels  81501185        6.15 0.0000007   0.0000007 
 7 Triglyceride levels  45923216        5.70 0.000002    0.000002  
 8 Triglyceride levels  44492177        5.52 0.000003    0.000003  
 9 Triglyceride levels 230160042        5.30 0.000005    0.000005  
10 Triglyceride levels 164672114        5.22 0.000006    0.000006  
# ℹ 123,559 more rows

Summaries with summarize()

The function summarize (Hadley prefers the British spelling, summarise, which I refuse to acknowledge as an American 😤) collapses the data frame to a single row:

summarize(gwas, mean_mlog_p_value = mean(PVALUE_MLOG)) 
# A tibble: 1 × 1
  mean_mlog_p_value
              <dbl>
1              21.3

This isn’t terribly useful until you use the group_by function, which makes summarize operate separately on each “group” of rows, where the groups are defined by the values of one or more variables. To see the average for each disease, group by DISEASE/TRAIT,

grouped_gwas = group_by(gwas, `DISEASE/TRAIT`)
summarize(grouped_gwas, mean_mlog_p_value = mean(PVALUE_MLOG))
# A tibble: 100 × 2
   `DISEASE/TRAIT`                   mean_mlog_p_value
   <chr>                                         <dbl>
 1 Adolescent idiopathic scoliosis                14.3
 2 Alanine aminotransferase levels                20.0
 3 Apolipoprotein A1 levels                       48.2
 4 Apolipoprotein B levels                        61.8
 5 Appendicular lean mass                         18.4
 6 Aspartate aminotransferase levels              19.7
 7 Asthma                                         17.3
 8 Basophil count                                 22.8
 9 Brain region volumes                           10.8
10 Breast cancer                                  12.5
# ℹ 90 more rows

or add the journal the study was published in

grouped_gwas = group_by(gwas, `DISEASE/TRAIT`, JOURNAL)
meanp = summarize(grouped_gwas, mean_mlog_p_value = mean(PVALUE_MLOG))
`summarise()` has grouped output by 'DISEASE/TRAIT'. You can override using the
`.groups` argument.
meanp
# A tibble: 589 × 3
# Groups:   DISEASE/TRAIT [100]
   `DISEASE/TRAIT`                 JOURNAL                     mean_mlog_p_value
   <chr>                           <chr>                                   <dbl>
 1 Adolescent idiopathic scoliosis Am J Hum Genet                          16.7 
 2 Adolescent idiopathic scoliosis Hum Genet                               14.4 
 3 Adolescent idiopathic scoliosis Hum Mol Genet                           12.1 
 4 Adolescent idiopathic scoliosis Nat Commun                              12.5 
 5 Alanine aminotransferase levels Cell                                    37.2 
 6 Alanine aminotransferase levels Nat Commun                              20.5 
 7 Alanine aminotransferase levels Nat Genet                               20.0 
 8 Alanine aminotransferase levels Obesity (Silver Spring)                  5.67
 9 Alanine aminotransferase levels Sci Rep                                  7.15
10 Apolipoprotein A1 levels        Arterioscler Thromb Vasc B…            527.  
# ℹ 579 more rows

Finally, let’s take the mean across all diseases for each journal and then sort by the mean -log(p-value) to see the pattern across journals a little better

arrange(summarize(group_by(gwas, JOURNAL), mean_mlog_p_value = mean(PVALUE_MLOG)), desc(mean_mlog_p_value))
# A tibble: 133 × 2
   JOURNAL                       mean_mlog_p_value
   <chr>                                     <dbl>
 1 Arterioscler Thromb Vasc Biol             317. 
 2 J Acad Nutr Diet                          149. 
 3 Sci Adv                                    53.5
 4 NPJ Genom Med                              52.6
 5 J Cell Mol Med                             40.5
 6 Circulation                                35.4
 7 Lancet                                     31.8
 8 Cell                                       29.9
 9 Nat Med                                    29.5
10 Commun Biol                                28.8
# ℹ 123 more rows

Another nice function to use with summarize is the “counting” function n(), which can give the number of rows in each group. Since each row is a SNP, we can get the number of SNPs for each disease/trait this way:

arrange(summarize(group_by(gwas, `DISEASE/TRAIT`), n_SNPS = n()), desc(n_SNPS))
# A tibble: 100 × 2
   `DISEASE/TRAIT`                                                 n_SNPS
   <chr>                                                            <int>
 1 Electrocardiogram morphology (amplitude at temporal datapoints)   5501
 2 Height                                                            5033
 3 Heel bone mineral density                                         4638
 4 Type 2 diabetes                                                   3360
 5 Mean corpuscular hemoglobin                                       2960
 6 Protein quantitative trait loci (liver)                           2897
 7 Platelet count                                                    2544
 8 Red cell distribution width                                       2478
 9 Mean corpuscular volume                                           2417
10 Metabolite levels                                                 2386
# ℹ 90 more rows
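Since counting rows per group is so common, dplyr also bundles this pattern into the count function, which groups, counts, and (with sort = TRUE) sorts in one call. A quick sketch with a toy table:

```r
library(tidyverse)

toy = tibble(trait = c("A", "A", "A", "B", "B", "C"))

# count() is shorthand for group_by() + summarize(n = n()),
# and sort = TRUE orders the result by the count, largest first
count(toy, trait, sort = TRUE)
```

The resulting count column is named n by default.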

Using the |> operator to chain together functions

The slicing and summarizing operations we’ve been doing can quickly get messy as we add more operations. Since each operation is applied to the result of the previous one, we end up with deeply nested parentheses, which can be hard to read. The pipe operator |> lets you chain operations together so that it’s easier to read them from left to right. For example, the previous summarize and arrange operations could be put together like

gwas |>
  group_by(`DISEASE/TRAIT`) |>
  summarize(n_SNPS = n()) |>
  arrange(desc(n_SNPS))
# A tibble: 100 × 2
   `DISEASE/TRAIT`                                                 n_SNPS
   <chr>                                                            <int>
 1 Electrocardiogram morphology (amplitude at temporal datapoints)   5501
 2 Height                                                            5033
 3 Heel bone mineral density                                         4638
 4 Type 2 diabetes                                                   3360
 5 Mean corpuscular hemoglobin                                       2960
 6 Protein quantitative trait loci (liver)                           2897
 7 Platelet count                                                    2544
 8 Red cell distribution width                                       2478
 9 Mean corpuscular volume                                           2417
10 Metabolite levels                                                 2386
# ℹ 90 more rows

Effectively, the pipe takes what is on its left side and uses it as the first argument of the function on its right side. In the above example, gwas, our initial data table, is on the left and gets used as the first argument of group_by. The group_by function returns a data.frame, which is then used as the first argument of the function on the right of the next pipe, summarize. The rule of thumb, then, is that to convert a command without a pipe to one with a pipe, take the first argument out of the function and put it on the left of the function, separated by the pipe. Likewise, to get rid of a pipe, take the input on its left, make it the first argument of the function on its right, and delete the pipe.
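As a minimal check of this rule of thumb, the two forms below produce identical results (toy is a made-up table just for illustration):

```r
library(tidyverse)

toy = tibble(x = c(3, 1, 2))

# without the pipe: the data frame is the first argument
a = arrange(toy, x)

# with the pipe: the left-hand side is fed in as the first argument
b = toy |> arrange(x)

identical(a, b)  # TRUE
```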

A very significant advantage of structuring your slicing / wrangling commands this way when you use RStudio is that RStudio will help you auto-complete column names. As you’ve probably seen already, typing these names can get cumbersome, so this is a big help. RStudio doesn’t do this when you put the data.frame object in the function directly (for some reason unknown to me).

Lab

For these problems, use the functions introduced above from the tidyverse packages like dplyr as much as possible.

Problems

  1. Using the GWAS data,

    library(tidyverse)
    gwas = read_tsv("gwas_catalog_v1.0.2-associations_e104_r2021-09-23_no-waist_hip_body_blood_education_math_top100.tsv", col_types = cols(CHR_POS = col_number()))
    Warning: One or more parsing issues, call `problems()` on your data frame for details,
    e.g.:
      dat <- vroom(...)
      problems(dat)

    produce a data table that shows the mean, maximum, minimum, and variance of PVALUE_MLOG for each combination of disease and journal. Sort the table by the mean value of PVALUE_MLOG.

  2. Recall the gene expression data that were briefly introduced last week:

    library(readxl)
    imprint = read_excel("babak-etal-2015_imprinted-mouse.xlsx", na = "NaN")

    Each row is a gene, each column is a tissue type, and each cell contains a gene expression measurement.

    Make these data tidy (in one way) by collapsing the tissue columns and making the data table longer.

    You will need the pivot_longer function. The first trick here is to identify the “observations” and then the “names_to”, which is the variable that changes across each observation.

    The second trick is specifying the columns across which you need to gather together into a single column. A hint for the second trick is that you can specify the columns you don’t want with the - operator or ! (NOT) operator.

    The answer will be deceptively simple! This is the elegance / frustration of R.

  3. Using the tidy data from Problem 2, find the number of genes (across all tissue types) that have an expression value <= 10 and > 2. These genes are “paternally imprinted” in the Babak et al. dataset, which means they are only expressed from the maternal copy of the gene.

    Hint: you will need to count each gene once even if it appears in multiple rows, which can be done with the distinct function.

  4. Use pivot_wider to return the genomic imprinting tidy data from Problem 2 to their original wide format.

  5. We’ll use some of the CDC data on COVID-19 deaths for this problem

    library(RSocrata) 
    covid19deaths = read.socrata("https://data.cdc.gov/api/odata/v4/r8kw-7aab") |> as_tibble() |> filter(group == "By Week")

    These data are tidy. Make a wider version of this table by making columns for each state and using the covid_19_deaths column as the values. You can discard all the other columns except week_ending_date, covid_19_deaths, and the columns for each state. You should have only one row for each date.
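For the pivoting problems above, the general pattern looks like the sketch below, which uses a small made-up wide table (not the lab data; the gene and tissue names are hypothetical). Note that pivot_longer and pivot_wider are inverses of each other:

```r
library(tidyverse)

# made-up wide table: one row per gene, one column per tissue
wide = tibble(
  gene  = c("g1", "g2"),
  brain = c(1.2, 3.4),
  liver = c(5.6, 7.8)
)

# collapse all columns except gene into tissue/expression pairs
long = pivot_longer(wide, !gene, names_to = "tissue", values_to = "expression")

# pivot_wider reverses the operation, recovering the wide table
pivot_wider(long, names_from = "tissue", values_from = "expression")
```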

Footnotes

  1. Wickham, Hadley. 2014. “Tidy Data.” J Stat Softw, 59:1–23. DOI: 10.18637/jss.v059.i10↩︎

  2. Wickham, Hadley. 2014. “Tidy Data.” J Stat Softw, 59:1–23. DOI: 10.18637/jss.v059.i10↩︎

  3. Wickham, Hadley. 2023. R for Data Science (2e). O’Reilly Media. website↩︎