Getting data into R with data.frames

Author

Jeremy Van Cleve

Published

September 19, 2022

Outline for today

  • Another slice of slicing
  • Names and attributes
  • Factors
  • Data frames: a special kind of list
  • Reading data tables

Another slice of slicing

Last time, we covered much of the basics of slicing matrices but there are still some topics and some helper functions that will be useful to know when trying to accomplish certain tasks.

Assigning to a slice

Not only can you extract a slice of a matrix to analyze or plot but you can also assign values to that slice. First, create a matrix of all zeros to manipulate:

allz = matrix(0, nrow = 6, ncol = 6)
allz
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    0    0    0    0
[2,]    0    0    0    0    0    0
[3,]    0    0    0    0    0    0
[4,]    0    0    0    0    0    0
[5,]    0    0    0    0    0    0
[6,]    0    0    0    0    0    0

As before, you slice the first row.

allz[1,]
[1] 0 0 0 0 0 0

However, you can also assign values to it.

allz[1,] = 1:6
allz
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    0    0    0    0    0    0
[3,]    0    0    0    0    0    0
[4,]    0    0    0    0    0    0
[5,]    0    0    0    0    0    0
[6,]    0    0    0    0    0    0

Note that when assigning to a slice, the right-hand side must be of the same dimensionality as the left-hand side. For example, the following will not work:

allz[1,] = 1:4

The one exception to this rule is when the number of items on the right hand side is a multiple of the number of elements in the slice. The simplest example is

allz[1,] = 1
allz
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1    1    1    1    1
[2,]    0    0    0    0    0    0
[3,]    0    0    0    0    0    0
[4,]    0    0    0    0    0    0
[5,]    0    0    0    0    0    0
[6,]    0    0    0    0    0    0

but you can also do

allz[1,] = 1:3
allz
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    1    2    3
[2,]    0    0    0    0    0    0
[3,]    0    0    0    0    0    0
[4,]    0    0    0    0    0    0
[5,]    0    0    0    0    0    0
[6,]    0    0    0    0    0    0

where the right hand side is use as many times as necessary to fill the slice.

Sorting

Sorting numeric and character values is an important task that comes up in many applications. The sort function has reasonable defaults where it produces increasing numeric values

set.seed(100)
rvec = sample(1:100, 20, replace = TRUE)
rvec
 [1] 74 89 78 23 86 70  4 55 70 98  7  7 55 43 82 61 12 99 51 72
sort(rvec)
 [1]  4  7  7 12 23 43 51 55 55 61 70 70 72 74 78 82 86 89 98 99

or character values

svec = c("hello", "world", "goodbye", "grand", "planet")
sort(svec, decreasing=TRUE)
[1] "world"   "planet"  "hello"   "grand"   "goodbye"

You can reverse the sort order by setting the argument decreasing = TRUE.

Getting the indices from slices

Sorting

Often, you will want to sort not only a vector by the rows of a data matrix based on some column of the matrix. Thus, you need the list of positions each row will go to (e.g., row 1 to row 10 because its 10th in the sorted order, etc). To obtain this, you can use the order function

svec
[1] "hello"   "world"   "goodbye" "grand"   "planet" 
order(svec)
[1] 3 4 1 5 2

which output precisely that list of indices. If you stick these indices back into the vector, you will obtain the original sort operation

svec[ order(svec) ]
[1] "goodbye" "grand"   "hello"   "planet"  "world"  
sort(svec)
[1] "goodbye" "grand"   "hello"   "planet"  "world"  

You can also use the “sort order” of one column to order the rows of a whole matrix or data table. For example, using a matrix of random values,

set.seed(42)
rmatx = matrix(sample(1:20, 36, replace = TRUE), nrow = 6, ncol = 6)
rmatx
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   17   17   20    5    3    4
[2,]    5   15   18   13    1    4
[3,]    1    7   15    5   10   18
[4,]   10    4    3   20   11   13
[5,]    4    5    9    2   15    5
[6,]   18   14    4    8    8    4

you could then sort the rows based on elements in the first column by first obtaining the indices used to sort that column

order(rmatx[,1])
[1] 3 5 2 4 1 6

and using the indices to order the rows

rmatx
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   17   17   20    5    3    4
[2,]    5   15   18   13    1    4
[3,]    1    7   15    5   10   18
[4,]   10    4    3   20   11   13
[5,]    4    5    9    2   15    5
[6,]   18   14    4    8    8    4
rmatx[ order(rmatx[,1]), ]
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    7   15    5   10   18
[2,]    4    5    9    2   15    5
[3,]    5   15   18   13    1    4
[4,]   10    4    3   20   11   13
[5,]   17   17   20    5    3    4
[6,]   18   14    4    8    8    4

Boolean (logical) slicing

Recall that you can slice by creating a logical condition (generating TRUE and FALSE values) and use that in the index of a matrix. Sometimes, you want the actual indices of the elements of that matrix that are sliced; i.e., you want the indices of the elements where the conditions is TRUE. To get these indices, you use the which function. For example, the logical vector and slice are

rmatx[,1] > 10
[1]  TRUE FALSE FALSE FALSE FALSE  TRUE
rmatx[ rmatx[,1] > 10, ]
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   17   17   20    5    3    4
[2,]   18   14    4    8    8    4

You can slice the same way with which:

which( rmatx[,1] > 10 )
[1] 1 6
rmatx[ which( rmatx[,1] > 10 ), ]
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   17   17   20    5    3    4
[2,]   18   14    4    8    8    4

Finally, there some special versions of the which function that give you the first index of the max or min element of a vector, which.max and which.min.

Names and attributes

We’ve talked about attributes and names before but there are some helpful functions for getting and setting the names associated with arrays and lists. You have already seen with lists how each element can be given a name.

l = list(a = 1, b = "one hundred")
named_svec = c(s1 = "hello", s2 = "world", s3 = "goodbye", s4 = "grand", s5 = "planet")
named_svec
       s1        s2        s3        s4        s5 
  "hello"   "world" "goodbye"   "grand"  "planet" 
str(named_svec)
 Named chr [1:5] "hello" "world" "goodbye" "grand" "planet"
 - attr(*, "names")= chr [1:5] "s1" "s2" "s3" "s4" ...

You can recover those names with the names function:

names(named_svec)
[1] "s1" "s2" "s3" "s4" "s5"

You can also set the names afterwards by assigning to names:

svec
[1] "hello"   "world"   "goodbye" "grand"   "planet" 
names(svec) = c("s1", "s2", "s3", "s4", "s5")
svec
       s1        s2        s3        s4        s5 
  "hello"   "world" "goodbye"   "grand"  "planet" 

Finally, you can return a version of the vector with the names stripped using the function uname

unnamed_svec = unname(named_svec)
unnamed_svec
[1] "hello"   "world"   "goodbye" "grand"   "planet" 

though note that this hasn’t changed the original vector:

named_svec
       s1        s2        s3        s4        s5 
  "hello"   "world" "goodbye"   "grand"  "planet" 

Finally, you can get rid of the names entirely by assigning names to NULL

names(named_svec) = NULL
named_svec
[1] "hello"   "world"   "goodbye" "grand"   "planet" 

Just as reminder, while we can name elements of vectors, they still have to hold the same data type, unlike lists that can hold anything.

str(list(a=1, b="two"))
List of 2
 $ a: num 1
 $ b: chr "two"
str(c(a=1, b="two"))
 Named chr [1:2] "1" "two"
 - attr(*, "names")= chr [1:2] "a" "b"

Factors

A special object that you will see when dealing with data frames is called a “factor”. A factor is a vector that can contain only predefined values and essentially stores categorical data (e.g., “tall”, “medium”, and “short” for plant height). Factors have a “levels” attribute that lists the allowable values. For example

fac_factor = factor(c("Famulski", "Burger", "Seifert", "Santollo", "Duncan", "Schneider"))
fac_factor
[1] Famulski  Burger    Seifert   Santollo  Duncan    Schneider
Levels: Burger Duncan Famulski Santollo Schneider Seifert

If you try to set an element of the factor object to a value outside of levels, you will receive a warning

fac_factor[1] = "Van Cleve"
Warning in `[<-.factor`(`*tmp*`, 1, value = "Van Cleve"): invalid factor level,
NA generated
fac_factor
[1] <NA>      Burger    Seifert   Santollo  Duncan    Schneider
Levels: Burger Duncan Famulski Santollo Schneider Seifert

and the element will be converted to the NA value, which is used for missing data.

Many R functions that read data tables take advantage of this behavior of factors so that columns may only contain certain values and the other values are missing data. This occurs when the function runs into a column with string data and the R function will often convert that column to a factor. Some of the functions that read data tables have nice arguments that let you tell them that specific strings, say “-”, represent missing data and should be be converted to NA.

While useful, factors are extremely annoying when your data are converted to them when you don’t expect it as further changes to the data table may result in NA values when you really wanted to add a new string value. This paper gives a good history of why factors are useful in R. It mostly comes down to factors being useful for categorical variables in regression models.

Data frames

Finally we have reached data frames. Data frames are the most common way of storing data in R. Essentially, a data frame is a list object containing vectors of equal length (i.e., the number of rows of the table). Put another way, a data frame is a list version of a matrix. Thus, data frames have properties such as length(), rnow(), ncol() colnames(), and rownames().

Creating a data frame is like creating a list where you name your elements, which here are columns (data not guaranteed to be accurate…):

dframe = data.frame(height_rank = 1:4, last_name = c("Van Cleve", "Linnen", "Seifert", "Pendergast"), first_name = c("Jeremy", "Catherine", "Ashley", "Julie"))
dframe
  height_rank  last_name first_name
1           1  Van Cleve     Jeremy
2           2     Linnen  Catherine
3           3    Seifert     Ashley
4           4 Pendergast      Julie

Slicing a data frame works like slicing a matrix or a list. Often, we will use the list convention where columns can be obtained with $. For example,

dframe$first_name
[1] "Jeremy"    "Catherine" "Ashley"    "Julie"    
dframe$last_name
[1] "Van Cleve"  "Linnen"     "Seifert"    "Pendergast"

Adding columns to a data frame is done with cbind (“column bind”), which glues together columns,

cbind(dframe, building = c("THM", "THM", "THM", "THM"), floor = c(2,2,2,3))
  height_rank  last_name first_name building floor
1           1  Van Cleve     Jeremy      THM     2
2           2     Linnen  Catherine      THM     2
3           3    Seifert     Ashley      THM     2
4           4 Pendergast      Julie      THM     3

and adding rows with rbind (“row bind”), which glues together rows,

rbind(dframe, data.frame(height_rank = 0, last_name = "Smith", first_name = "Jeramiah"))
  height_rank  last_name first_name
1           1  Van Cleve     Jeremy
2           2     Linnen  Catherine
3           3    Seifert     Ashley
4           4 Pendergast      Julie
5           0      Smith   Jeramiah

Again, note that each of these commands returned a new data.frame and the original is unchanged until we explicitly save back to that variable name:

dframe
  height_rank  last_name first_name
1           1  Van Cleve     Jeremy
2           2     Linnen  Catherine
3           3    Seifert     Ashley
4           4 Pendergast      Julie

Functions like cbind, rbind, and others that do operations on arrays and data frames usually create a copy of the data and return the modified copy. This is usually what you want since you’re not modifying your original variable/data until you explicitly assign the old variable to the new data. One case where you might not want to do this is when your data are so big (e.g., whole genomes, billions of tweets) that they take up a large fraction of the computer’s RAM, in which case you have to be very careful about creating copies of your data.

Reading data tables

Now that you know about data frames, you can start using some nice R functions to read in data. We have already seen this when loading data for the homeworks. As in those examples, we load the a few packages before loading the data since they are nice for reading csv and excel files. You probably don’t have the readxl package yet, which you can install with:

install.packages("readxl")

Then load:

library(tidyverse) # loads the `readr` package that loads things like csv files
library(readxl) # package for reading Excel files

Now, you can use the read_csv function to get the COVID-19 data table from the NY Times.

us = read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
Rows: 51526 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): state, fips
dbl  (2): cases, deaths
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Notice that read_csv gives you some nice output telling us about the table you just read. This function and others like it (i.e., from the readr and readxl packages) do a lot for you automatically and have many nice features, some of which we’ll cover later.

Loading excel files in no harder. We’ll load some data from a RNA-seq paper on genomic imprinting (Babak et al. 2015. Nat Gen, http://dx.doi.org/10.1038/ng.3274), babak-etal-2015_imprinted-mouse.xlsx (located on course GitHub repo), with read_excel

imprint = read_excel("babak-etal-2015_imprinted-mouse.xlsx", na = "NaN")

Note that you have to tell the function what strings in the Excel spreadsheet correspond to NA or missing data (“NaN” in this case). The first column are the gene names for each row

imprint$Genes
  [1] "PEC3"              "UBE2NL"            "TRAPPC9"          
  [4] "EIF2C2"            "MIR344D-3"         "MIR344G"          
  [7] "A930009L07RIK"     "KCNK9"             "RASGRF1"          
 [10] "INPP5F"            "NAP1L5"            "USP29"            
 [13] "HERC3-new"         "GM9801-new"        "BEGAIN"           
 [16] "PX00010K13"        "PEG13"             "1110006E14RIK"    
 [19] "MBII-343"          "MIRG"              "ZDBF2"            
 [22] "IMPACT"            "NDN"               "NIBP"             
 [25] "AK139287"          "CCDC40-AS"         "PX00113D24"       
 [28] "PEC2"              "ZIM3"              "AK142849"         
 [31] "MKRN3"             "IPW"               "B830012L14RIK"    
 [34] "ADAM23"            "COMMD1"            "UBE3A"            
 [37] "WARS"              "BCL2L1"            "BLCAP"            
 [40] "CALCR"             "MAGEL2"            "CDH15"            
 [43] "COPG2"             "U80893"            "NR_015479-new"    
 [46] "GRB10"             "PEG10"             "SGCE"             
 [49] "PEG1"              "MCTS2"             "H13"              
 [52] "SNRPN"             "ZRSR1"             "MEG3"             
 [55] "RIAN"              "PEG3"              "PLAGL1"           
 [58] "1110014L15RIK-new" "ASB4"              "ZIM1"             
 [61] "KLHDC10"           "DLK1"              "GNAS"             
 [64] "PPP1R9A"           "PDE4D"             "A19"              
 [67] "ZFP264"            "KCNQ1OT1"          "AK050713"         
 [70] "RTL1"              "CDKN1C"            "PEG12"            
 [73] "KCNQ1"             "NESPAS"            "SLC38A4"          
 [76] "H19"               "IGF2"              "INS1"             
 [79] "TFPI2"             "AK043599"          "EDN3-new"         
 [82] "TMEM106A-new"      "DDC"               "RDM1"             
 [85] "TNK1-new"          "TREM1-new"         "2400006E01RIK-new"
 [88] "IGF2AS"            "UC008IHS.1-new"    "5133400J02RIK-new"
 [91] "PHLDA2"            "AF357359"          "GAB1"             
 [94] "SFMBT2"            "CD81"              "PHF17"            
 [97] "TSSC4"             "DACT2-new"         "SLC22A2"          
[100] "TSSC5"             "SLC22A3"           "KLF14"            
[103] "ATP10A"            "DCN"               "PON2"             
[106] "PON3"              "AMPD3"             "GABRB3"           
[109] "GATM"              "TBC1D12"           "DIO3"             
[112] "NAP1L4"            "MSUIT"             "ASCL2"            
[115] "OSBPL5"            "TNFRSF23"          "AIRN"             
[118] "HTR2A"             "MAPT"              "MKRN1-PS1"        
[121] "DLX5"              "IGF2R"             "TRY4-new"         
[124] "TSPAN32"           "ZIM2"             

and the column names are the tissue type that RNA expression was measured in

colnames(imprint)
 [1] "Genes"                   "Preoptic Area (ref)"    
 [3] "e17.5 Brain"             "Hypothalamus"           
 [5] "e15 Brain (ref)"         "e9.5 Yolk Sac"          
 [7] "Prefrontal Cortex (ref)" "e9.5 Placenta"          
 [9] "Whole Brain"             "Adrenal Gland"          
[11] "Olfactory Bulb"          "Cortex"                 
[13] "e9.5 Embryo (ref)"       "Hippocampus"            
[15] "TSCs (ref)"              "Cerebellum"             
[17] "Striatum"                "e17.5 Placenta (ref)"   
[19] "Pancreatic Islets"       "MEFs (ref)"             
[21] "Bladder"                 "Lung"                   
[23] "Duodenum"                "White Adipose"          
[25] "Skeletal Muscle"         "Skin"                   
[27] "Heart"                   "Stomach"                
[29] "Thymus"                  "Kidney"                 
[31] "Liver"                   "Whole Bone Marrow"      
[33] "Spleen"                  "Testes"                 

where the first element is the column name of the “Genes” column. You will manipulate these data later when we talk about tidy data and dplyr.

Lab

Now that you have all the essential elements of slicing, let’s do some more things with COVID-19 data, but this time with world wide data from “Our World in Data”: https://github.com/owid/covid-19-data/tree/master/public/data. This is a big data set, so it might take a few moments to download.

library(tidyverse)

owid = read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")

Before starting on the problems, take a look at which columns are provided in the table. This will help for solving the problems. Hint: use an R command to do this and you get a nice list of all the columns.

Problems

  1. Plot the new cases per day per million people for the pandemic for the United States. Use a line plot. Hint: use the help for plot, ?base::plot, to set the plot type.

  2. What day had the highest number of deaths in the United States? Hint: use the order function.

  3. What day and in what country was the worst (i.e., highest) for per capita death due to COVID-19?

  4. Create a new data frame called new_cases_per_100k consisting of the following columns, location, date, and cases, where cases is the number of new cases per 100,000 people. In two separate plots, plot the number of new cases per 100k people over time for the United Kingdom (plot 1) and Canada (plot 2). If you wanted to present these plots side by side so as to compare the severity of the pandemic in the UK vs Canada, what might you have to do to make them more comparable?

  5. In the owid data set, how many countries have data on ICU patients (column icu_patients)? Hint: you might need the is.na function and the unique function.

  6. In 2021, on how many days did the United States have fewer than 2 deaths per 10 million people due to COVID-19? What is answer the United Kingdom? You may run into issues where you’re comparing an NA value to a number, like 2. In this case, you should exclude the NA values by using !is.na(owid$new_cases_per_million) as the first in your series of logical conditions for your slice.

  • Challenge problem (+3 extra credit)

    Plot a heatmap of the imprinting data using the heatmap function. The rows and columns of the heatmap should be labeled properly with the gene names (rows) and tissue names (columns). The Babak et al. (2015) paper has a similar heatmap in Fig 1. Hint: read carefully the help for the heatmap function and know that you can convert data frames to matrices with as.matrix.