Getting data into R with data.frames and spreadsheets

Author

Jeremy Van Cleve

Published

February 3, 2026

Outline for today

  • Another slice of slicing
  • Names and attributes
  • Factors
  • Data frames: a special kind of list
  • Reading data tables

Another slice of slicing

Last time, we covered much of the basics of slicing matrices but there are still some topics and some helper functions that will be useful to know when trying to accomplish certain tasks.

Assigning to a slice

Not only can you extract a slice of a matrix to analyze or plot but you can also assign values to that slice. First, create a matrix of all zeros to manipulate:

allz = matrix(0, nrow = 6, ncol = 6)
allz
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    0    0    0    0
[2,]    0    0    0    0    0    0
[3,]    0    0    0    0    0    0
[4,]    0    0    0    0    0    0
[5,]    0    0    0    0    0    0
[6,]    0    0    0    0    0    0

As before, you slice the first row.

allz[1,]
[1] 0 0 0 0 0 0

However, you can also assign values to it.

allz[1,] = 1:6
allz
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    0    0    0    0    0    0
[3,]    0    0    0    0    0    0
[4,]    0    0    0    0    0    0
[5,]    0    0    0    0    0    0
[6,]    0    0    0    0    0    0

Note that when assigning to a slice, the right-hand side must be of the same dimensionality as the left-hand side. For example, the following will not work:

allz[1,] = 1:4

The one exception to this rule is when the number of items on the right hand side is a multiple of the number of elements in the slice. The simplest example is

allz[1,] = 1
allz
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1    1    1    1    1
[2,]    0    0    0    0    0    0
[3,]    0    0    0    0    0    0
[4,]    0    0    0    0    0    0
[5,]    0    0    0    0    0    0
[6,]    0    0    0    0    0    0

but you can also do

allz[1,] = 1:3
allz
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    1    2    3
[2,]    0    0    0    0    0    0
[3,]    0    0    0    0    0    0
[4,]    0    0    0    0    0    0
[5,]    0    0    0    0    0    0
[6,]    0    0    0    0    0    0

where the right hand side is use as many times as necessary to fill the slice.

Sorting

Sorting numeric and character values is an important task that comes up in many applications. The sort function has reasonable defaults where it produces increasing numeric values

set.seed(100)
rvec = sample(1:100, 20, replace = TRUE)
rvec
 [1] 74 89 78 23 86 70  4 55 70 98  7  7 55 43 82 61 12 99 51 72
sort(rvec)
 [1]  4  7  7 12 23 43 51 55 55 61 70 70 72 74 78 82 86 89 98 99

or character values

svec = c("hello", "world", "goodbye", "grand", "planet")
sort(svec, decreasing=TRUE)
[1] "world"   "planet"  "hello"   "grand"   "goodbye"

You can reverse the sort order by setting the argument decreasing = TRUE.

Getting the indices from slices

Sorting

Often, you will want to sort not only a vector by the rows of a data matrix based on some column of the matrix. Thus, you need the list of positions each row will go to (e.g., row 1 to row 10 because its 10th in the sorted order, etc). To obtain this, you can use the order function

svec
[1] "hello"   "world"   "goodbye" "grand"   "planet" 
order(svec)
[1] 3 4 1 5 2

which output precisely that list of indices. If you stick these indices back into the vector, you will obtain the original sort operation

svec[ order(svec) ]
[1] "goodbye" "grand"   "hello"   "planet"  "world"  
sort(svec)
[1] "goodbye" "grand"   "hello"   "planet"  "world"  

You can also use the “sort order” of one column to order the rows of a whole matrix or data table. For example, using a matrix of random values,

set.seed(42)
rmatx = matrix(sample(1:20, 36, replace = TRUE), nrow = 6, ncol = 6)
rmatx
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   17   17   20    5    3    4
[2,]    5   15   18   13    1    4
[3,]    1    7   15    5   10   18
[4,]   10    4    3   20   11   13
[5,]    4    5    9    2   15    5
[6,]   18   14    4    8    8    4

you could then sort the rows based on elements in the first column by first obtaining the indices used to sort that column

order(rmatx[,1])
[1] 3 5 2 4 1 6

and using the indices to order the rows

rmatx
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   17   17   20    5    3    4
[2,]    5   15   18   13    1    4
[3,]    1    7   15    5   10   18
[4,]   10    4    3   20   11   13
[5,]    4    5    9    2   15    5
[6,]   18   14    4    8    8    4
rmatx[ order(rmatx[,1]), ]
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    7   15    5   10   18
[2,]    4    5    9    2   15    5
[3,]    5   15   18   13    1    4
[4,]   10    4    3   20   11   13
[5,]   17   17   20    5    3    4
[6,]   18   14    4    8    8    4

Boolean (logical) slicing

Recall that you can slice by creating a logical condition (generating TRUE and FALSE values) and use that in the index of a matrix. Sometimes, you want the actual indices of the elements of that matrix that are sliced; i.e., you want the indices of the elements where the conditions is TRUE. To get these indices, you use the which function. For example, the logical vector and slice are

rmatx[,1] > 10
[1]  TRUE FALSE FALSE FALSE FALSE  TRUE
rmatx[ rmatx[,1] > 10, ]
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   17   17   20    5    3    4
[2,]   18   14    4    8    8    4

You can slice the same way with which:

which( rmatx[,1] > 10 )
[1] 1 6
rmatx[ which( rmatx[,1] > 10 ), ]
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   17   17   20    5    3    4
[2,]   18   14    4    8    8    4

Finally, there some special versions of the which function that give you the first index of the max or min element of a vector, which.max and which.min.

Names and attributes

We’ve talked about attributes and names before but there are some helpful functions for getting and setting the names associated with arrays and lists. You have already seen with lists how each element can be given a name.

l = list(a = 1, b = "one hundred")
named_svec = c(s1 = "hello", s2 = "world", s3 = "goodbye", s4 = "grand", s5 = "planet")
named_svec
       s1        s2        s3        s4        s5 
  "hello"   "world" "goodbye"   "grand"  "planet" 
str(named_svec)
 Named chr [1:5] "hello" "world" "goodbye" "grand" "planet"
 - attr(*, "names")= chr [1:5] "s1" "s2" "s3" "s4" ...

You can recover those names with the names function:

names(named_svec)
[1] "s1" "s2" "s3" "s4" "s5"

You can also set the names afterwards by assigning to names:

svec
[1] "hello"   "world"   "goodbye" "grand"   "planet" 
names(svec) = c("s1", "s2", "s3", "s4", "s5")
svec
       s1        s2        s3        s4        s5 
  "hello"   "world" "goodbye"   "grand"  "planet" 

Finally, you can return a version of the vector with the names stripped using the function uname

unnamed_svec = unname(named_svec)
unnamed_svec
[1] "hello"   "world"   "goodbye" "grand"   "planet" 

though note that this hasn’t changed the original vector:

named_svec
       s1        s2        s3        s4        s5 
  "hello"   "world" "goodbye"   "grand"  "planet" 

Finally, you can get rid of the names entirely by assigning names to NULL

names(named_svec) = NULL
named_svec
[1] "hello"   "world"   "goodbye" "grand"   "planet" 

Just as reminder, while we can name elements of vectors, they still have to hold the same data type, unlike lists that can hold anything.

str(list(a=1, b="two"))
List of 2
 $ a: num 1
 $ b: chr "two"
str(c(a=1, b="two"))
 Named chr [1:2] "1" "two"
 - attr(*, "names")= chr [1:2] "a" "b"

Factors

A special object that you will see when dealing with data frames is called a “factor”. A factor is a vector that can contain only predefined values and essentially stores categorical data (e.g., “tall”, “medium”, and “short” for plant height). Factors have a “levels” attribute that lists the allowable values. For example

fac_factor = factor(c("Famulski", "Burger", "Seifert", "Santollo", "Duncan", "Long"))
fac_factor
[1] Famulski Burger   Seifert  Santollo Duncan   Long    
Levels: Burger Duncan Famulski Long Santollo Seifert

You can get the levels of a factor with

levels(fac_factor)
[1] "Burger"   "Duncan"   "Famulski" "Long"     "Santollo" "Seifert" 

If you try to set an element of the factor object to a value outside of levels, you will receive a warning

fac_factor[1] = "Van Cleve"
Warning in `[<-.factor`(`*tmp*`, 1, value = "Van Cleve"): invalid factor level,
NA generated
fac_factor
[1] <NA>     Burger   Seifert  Santollo Duncan   Long    
Levels: Burger Duncan Famulski Long Santollo Seifert

and the element will be converted to the NA value, which is used for missing data.

Many R functions that read data tables take advantage of this behavior of factors so that columns may only contain certain values and the other values are missing data. This occurs when the function runs into a column with string data and the R function will often convert that column to a factor. Some of the functions that read data tables have nice arguments that let you tell them that specific strings, say “-”, represent missing data and should be be converted to NA.

While useful, factors are extremely annoying when your data are converted to them when you don’t expect it as further changes to the data table may result in NA values when you really wanted to add a new string value. This paper gives a good history of why factors are useful in R. It mostly comes down to factors being useful for categorical variables in regression models.

The main place factors are used that we’ll encounter in this course is when plotting categorical variables. In those cases, the order the variables are plotted in will be determined the order of the levels in levels. In those cases, you may want to reorder the factors so that the variables are plotted in a specific order (say in descending order of frequency in the data). For this, there is a nice package called forcats that is included in the tidyverse that has the function fct_reorder that can help. Another thing we’ll run into is changing factor levels so that they have more descriptive labels. For this forcats has fct_recode. We’ll see examples of these kinds of scenarios later on when we’re plotting using ggplot2

Data frames

Finally, we have reached data frames. Data frames are the most common way of storing data in R. Essentially, a data frame is a list object containing vectors of equal length (i.e., the number of rows of the table). Put another way, a data frame is a list version of a matrix. Thus, data frames have properties such as length(), rnow(), ncol() colnames(), and rownames().

Creating a data frame is like creating a list where you name your elements, which here are columns (data not guaranteed to be accurate…):

dframe = data.frame(height_rank = 1:4, last_name = c("Van Cleve", "Linnen", "Seifert", "Pendergast"), first_name = c("Jeremy", "Catherine", "Ashley", "Julie"))
dframe
  height_rank  last_name first_name
1           1  Van Cleve     Jeremy
2           2     Linnen  Catherine
3           3    Seifert     Ashley
4           4 Pendergast      Julie

Slicing a data frame works like slicing a matrix and also like slicing a list. Often, we will use the list convention where columns can be obtained with $. For example,

dframe$first_name
[1] "Jeremy"    "Catherine" "Ashley"    "Julie"    
dframe$last_name
[1] "Van Cleve"  "Linnen"     "Seifert"    "Pendergast"

This is convient because the column names make clear what kind of data we’re slicing from the data frame.

Adding columns to a data frame is done with cbind (“column bind”), which glues together columns. To add the column building, we do the following

cbind(dframe, building = c("THM", "THM", "THM", "THM"), floor = c(2,2,2,3))
  height_rank  last_name first_name building floor
1           1  Van Cleve     Jeremy      THM     2
2           2     Linnen  Catherine      THM     2
3           3    Seifert     Ashley      THM     2
4           4 Pendergast      Julie      THM     3

Adding rows can be done with rbind (“row bind”), which glues together rows,

rbind(dframe, data.frame(height_rank = 0, last_name = "Smith", first_name = "Jeramiah"))
  height_rank  last_name first_name
1           1  Van Cleve     Jeremy
2           2     Linnen  Catherine
3           3    Seifert     Ashley
4           4 Pendergast      Julie
5           0      Smith   Jeramiah

Note here that he have to create a new data data.frame for our new row since dframe is a data.frame and so binding rows together here means binding together two data frames each with the same columns.

Note that each of these commands returned a new data.frame and the original is unchanged until we explicitly save back to that variable name:

dframe
  height_rank  last_name first_name
1           1  Van Cleve     Jeremy
2           2     Linnen  Catherine
3           3    Seifert     Ashley
4           4 Pendergast      Julie

Functions like cbind, rbind, and others that do operations on arrays and data frames usually create a copy of the data and return the modified copy. This is usually what you want since you’re not modifying your original variable/data until you explicitly assign the old variable to the new data. One case where you might not want to do this is when your data are so big (e.g., whole genomes, billions of tweets) that they take up a large fraction of the computer’s RAM, in which case you have to be very careful about creating copies of your data.

Reading data tables

Now that you know about data frames, you can start using some nice R functions to read in data. We have already seen this when loading data for the homeworks. As in those examples, we load the a few packages before loading the data since they are nice for reading csv and excel files. For reading excel files, you’ll need to install the readxl package if you don’t have it, which you can do with:

install.packages("readxl")

Then load:

library(tidyverse) # loads the `readr` package that loads things like csv files
library(readxl) # package for reading Excel files

Now, you can use the read_csv function to load csv or “comma separated value” files. For example to load COVID-19 and respiratory virus data from the CDC that was saved as a csv file, we load us_hosps_deaths_cdc_2020-01_2024-09.csv, which is in the project folder and course GitHub repo.

us_deaths = read_csv("us_hosps_deaths_cdc_2020-01_2024-09.csv")
Rows: 11128 Columns: 72
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): state, state_abbrv
dbl  (69): covid_19_deaths, total_deaths, percent_of_expected_deaths, pneumo...
dttm  (1): week_end_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Notice that read_csv gives you some nice output telling us about the table you just read. This function and others like it (i.e., from the readr and readxl packages) do a lot for you automatically and have many nice features. For example, read_csv has the argument col_names = TRUE by default, which means it uses the first row of the table as the column names. Some tables may simple just straight into data without column names in which case you can set col_names = FALSE and it will give automatic names or give col_names a vector of column names manually. Sometimes data tables will have the first few lines with text describing the data and you can skip them by giving the argument skip the number of lines to skip. There are many other options so looking at the help with ?read_csv is recommended when you’re having trouble getting the data loaded correctly.

Loading excel files in no harder. We’ll load some data from a RNA-seq paper on genomic imprinting (Babak et al. 2015. Nat Gen, http://dx.doi.org/10.1038/ng.3274), babak-etal-2015_imprinted-mouse.xlsx (located in project folder and course GitHub repo), with read_excel

imprint = read_excel("babak-etal-2015_imprinted-mouse.xlsx", na = "NaN")

Note that you have to tell the function what strings in the Excel spreadsheet correspond to NA or missing data (“NaN” in this case). The first column are the gene names for each row

imprint$Genes
  [1] "PEC3"              "UBE2NL"            "TRAPPC9"          
  [4] "EIF2C2"            "MIR344D-3"         "MIR344G"          
  [7] "A930009L07RIK"     "KCNK9"             "RASGRF1"          
 [10] "INPP5F"            "NAP1L5"            "USP29"            
 [13] "HERC3-new"         "GM9801-new"        "BEGAIN"           
 [16] "PX00010K13"        "PEG13"             "1110006E14RIK"    
 [19] "MBII-343"          "MIRG"              "ZDBF2"            
 [22] "IMPACT"            "NDN"               "NIBP"             
 [25] "AK139287"          "CCDC40-AS"         "PX00113D24"       
 [28] "PEC2"              "ZIM3"              "AK142849"         
 [31] "MKRN3"             "IPW"               "B830012L14RIK"    
 [34] "ADAM23"            "COMMD1"            "UBE3A"            
 [37] "WARS"              "BCL2L1"            "BLCAP"            
 [40] "CALCR"             "MAGEL2"            "CDH15"            
 [43] "COPG2"             "U80893"            "NR_015479-new"    
 [46] "GRB10"             "PEG10"             "SGCE"             
 [49] "PEG1"              "MCTS2"             "H13"              
 [52] "SNRPN"             "ZRSR1"             "MEG3"             
 [55] "RIAN"              "PEG3"              "PLAGL1"           
 [58] "1110014L15RIK-new" "ASB4"              "ZIM1"             
 [61] "KLHDC10"           "DLK1"              "GNAS"             
 [64] "PPP1R9A"           "PDE4D"             "A19"              
 [67] "ZFP264"            "KCNQ1OT1"          "AK050713"         
 [70] "RTL1"              "CDKN1C"            "PEG12"            
 [73] "KCNQ1"             "NESPAS"            "SLC38A4"          
 [76] "H19"               "IGF2"              "INS1"             
 [79] "TFPI2"             "AK043599"          "EDN3-new"         
 [82] "TMEM106A-new"      "DDC"               "RDM1"             
 [85] "TNK1-new"          "TREM1-new"         "2400006E01RIK-new"
 [88] "IGF2AS"            "UC008IHS.1-new"    "5133400J02RIK-new"
 [91] "PHLDA2"            "AF357359"          "GAB1"             
 [94] "SFMBT2"            "CD81"              "PHF17"            
 [97] "TSSC4"             "DACT2-new"         "SLC22A2"          
[100] "TSSC5"             "SLC22A3"           "KLF14"            
[103] "ATP10A"            "DCN"               "PON2"             
[106] "PON3"              "AMPD3"             "GABRB3"           
[109] "GATM"              "TBC1D12"           "DIO3"             
[112] "NAP1L4"            "MSUIT"             "ASCL2"            
[115] "OSBPL5"            "TNFRSF23"          "AIRN"             
[118] "HTR2A"             "MAPT"              "MKRN1-PS1"        
[121] "DLX5"              "IGF2R"             "TRY4-new"         
[124] "TSPAN32"           "ZIM2"             

and the column names are the tissue type that RNA expression was measured in

colnames(imprint)
 [1] "Genes"                   "Preoptic Area (ref)"    
 [3] "e17.5 Brain"             "Hypothalamus"           
 [5] "e15 Brain (ref)"         "e9.5 Yolk Sac"          
 [7] "Prefrontal Cortex (ref)" "e9.5 Placenta"          
 [9] "Whole Brain"             "Adrenal Gland"          
[11] "Olfactory Bulb"          "Cortex"                 
[13] "e9.5 Embryo (ref)"       "Hippocampus"            
[15] "TSCs (ref)"              "Cerebellum"             
[17] "Striatum"                "e17.5 Placenta (ref)"   
[19] "Pancreatic Islets"       "MEFs (ref)"             
[21] "Bladder"                 "Lung"                   
[23] "Duodenum"                "White Adipose"          
[25] "Skeletal Muscle"         "Skin"                   
[27] "Heart"                   "Stomach"                
[29] "Thymus"                  "Kidney"                 
[31] "Liver"                   "Whole Bone Marrow"      
[33] "Spleen"                  "Testes"                 

where the first element is the column name of the “Genes” column. You will manipulate these data later when we talk about tidy data and dplyr.

Finally, if you look at both the COVID-19 data and the imprinting data

us_deaths
# A tibble: 11,128 × 72
   week_end_date       state covid_19_deaths total_deaths percent_of_expected_…¹
   <dttm>              <chr>           <dbl>        <dbl>                  <dbl>
 1 2020-08-08 04:00:00 Alab…             264         1379                    143
 2 2020-08-15 04:00:00 Alab…             230         1305                    137
 3 2020-08-22 04:00:00 Alab…             209         1303                    140
 4 2020-08-29 04:00:00 Alab…             185         1216                    127
 5 2020-09-05 04:00:00 Alab…             156         1216                    125
 6 2020-09-12 04:00:00 Alab…             138         1232                    128
 7 2020-09-19 04:00:00 Alab…             139         1200                    125
 8 2020-09-26 04:00:00 Alab…             105         1183                    119
 9 2020-10-03 04:00:00 Alab…             111         1103                    116
10 2020-10-10 04:00:00 Alab…             137         1204                    127
# ℹ 11,118 more rows
# ℹ abbreviated name: ¹​percent_of_expected_deaths
# ℹ 67 more variables: pneumonia_deaths <dbl>,
#   pneumonia_and_covid_19_deaths <dbl>, influenza_deaths <dbl>,
#   pneumonia_influenza_or_covid_19_deaths <dbl>, state_abbrv <chr>,
#   weekly_actual_days_reporting_any_data <dbl>,
#   weekly_percent_days_reporting_any_data <dbl>, …
imprint
# A tibble: 125 × 34
   Genes      `Preoptic Area (ref)` `e17.5 Brain` Hypothalamus `e15 Brain (ref)`
   <chr>                      <dbl>         <dbl>        <dbl>             <dbl>
 1 PEC3                         -10           -10       -10               -10   
 2 UBE2NL                       -10           -10       -10               -10   
 3 TRAPPC9                       10            10         9.37             10   
 4 EIF2C2                        10            10        10                10   
 5 MIR344D-3                    -10           -10       -10               -10   
 6 MIR344G                      -10           -10       -10               -10   
 7 A930009L0…                    10            10        10                10   
 8 KCNK9                         10            10         7.53              6.02
 9 RASGRF1                      -10           -10       -10               -10   
10 INPP5F                       -10           -10       -10               -10   
# ℹ 115 more rows
# ℹ 29 more variables: `e9.5 Yolk Sac` <dbl>, `Prefrontal Cortex (ref)` <dbl>,
#   `e9.5 Placenta` <dbl>, `Whole Brain` <dbl>, `Adrenal Gland` <dbl>,
#   `Olfactory Bulb` <dbl>, Cortex <dbl>, `e9.5 Embryo (ref)` <dbl>,
#   Hippocampus <dbl>, `TSCs (ref)` <dbl>, Cerebellum <dbl>, Striatum <dbl>,
#   `e17.5 Placenta (ref)` <dbl>, `Pancreatic Islets` <dbl>,
#   `MEFs (ref)` <dbl>, Bladder <dbl>, Lung <dbl>, Duodenum <dbl>, …

you should notice that both are of the tibble type. A tibble is a data.frame but with enhancements. First and maybe most importantly, it prints nicely when you evaluate it at the command line and in Quarto notebooks. Second, it leaves the column names alone on conversion to a data frame. Thus, we get columns like Preoptic Area (ref) in the imprinting data instead of

make.names("Preoptic Area (ref)")
[1] "Preoptic.Area..ref."

So a “normal” data.frame would do this to the data:

head(data.frame(imprint))
      Genes Preoptic.Area..ref. e17.5.Brain Hypothalamus e15.Brain..ref.
1      PEC3                 -10         -10     -10.0000             -10
2    UBE2NL                 -10         -10     -10.0000             -10
3   TRAPPC9                  10          10       9.3686              10
4    EIF2C2                  10          10      10.0000              10
5 MIR344D-3                 -10         -10     -10.0000             -10
6   MIR344G                 -10         -10     -10.0000             -10
  e9.5.Yolk.Sac Prefrontal.Cortex..ref. e9.5.Placenta Whole.Brain Adrenal.Gland
1      -12.0000                  -10.00             0    -10.0000     -12.00000
2      -12.0000                  -10.00           -12    -10.0000     -12.00000
3      -11.0000                   10.00           -11      9.2453     -12.00000
4       13.0447                    5.86           -11     10.0000       1.14430
5      -12.0000                  -10.00           -12     -7.5395      -0.60206
6      -12.0000                  -10.00           -12     -4.6984     -12.00000
  Olfactory.Bulb Cortex e9.5.Embryo..ref. Hippocampus TSCs..ref. Cerebellum
1       -10.0000    -10            -12.00    -10.0000   -12.0000    -2.4082
2       -10.0000    -10            -12.00    -10.0000   -12.0000    -1.8062
3        10.0000     10              0.84     10.0000    11.0000     5.1017
4         8.1534     10              1.05      8.5300    12.1818     6.5755
5       -10.0000    -10             -1.90     -3.6124   -12.0000   -10.0000
6       -10.0000    -10            -12.00     -2.7673   -12.0000   -10.0000
  Striatum e17.5.Placenta..ref. Pancreatic.Islets MEFs..ref.   Bladder
1  -5.1175               -12.00         -12.00000   -12.0000 -12.00000
2  -9.2689               -12.00         -12.00000   -12.0000 -12.00000
3 -12.0000                 0.80           1.05310   -12.0000 -12.00000
4   2.7378                 1.03           1.22330    13.5886 -11.00000
5  -9.2689               -12.00          -4.95140   -12.0000  -0.60206
6  -6.9237               -12.00          -0.69782   -12.0000  -0.90309
       Lung  Duodenum White.Adipose Skeletal.Muscle      Skin     Heart
1   0.00000 -12.00000     -12.00000         -11.000 -12.00000 -12.00000
2 -12.00000 -12.00000     -12.00000         -12.000 -12.00000 -12.00000
3   1.75490   0.56672     -12.00000         -11.000   0.40265 -12.00000
4  11.82630   0.30103       0.30103          12.407  -0.30103  -0.30103
5  -0.30103  -0.30103     -12.00000         -12.000 -12.00000 -12.00000
6  -0.90309  -0.90309     -12.00000         -12.000 -12.00000 -12.00000
   Stomach    Thymus   Kidney Liver Whole.Bone.Marrow    Spleen    Testes
1 -12.0000 -12.00000   0.0000   -11         -12.00000  11.22570 -12.00000
2 -12.0000 -12.00000 -12.0000   -12         -12.00000 -12.00000 -12.00000
3 -11.0000   0.69371   0.8319   -11         -11.00000 -11.00000 -11.00000
4  11.5286  12.00680  11.5538   -11          -0.50515  -0.77949   0.30103
5 -12.0000 -12.00000 -12.0000   -12         -12.00000 -12.00000 -11.00000
6 -12.0000 -12.00000 -12.0000   -12         -12.00000 -12.00000 -11.00000

In the html, the full data frame would be printed (here, head was used above to limit the output to the first few rows), which would means tons of scrolling, whereas only a preview of the tibble is printed, which is usually more convenient. The tibble type also doesn’t automatically convert character columns to factors. In old versions of R (pre 4.0.0), data.frame automatically did this to the consternation of many.

Lab

Now that you have all the essential elements of slicing, let’s do some more things with COVID-19 data, but this time with world wide data from “Our World in Data”: https://docs.owid.io/projects/etl/api/covid/. This is a big data set, so it might take a few moments to download.

library(tidyverse)

owid = read_csv("https://catalog.ourworldindata.org/garden/covid/latest/compact/compact.csv")

Before starting on the problems, take a look at which columns are provided in the table. This will help for solving the problems. As a reminder, include the code chunk above in your own qmd file to load the data so that I can see your code run on the loaded data.

Problems

  1. Plot the new cases per week per million people for the pandemic for the United States. Use a line plot.
    (hint: use the help for plot, ?base::plot, to figure out how to set the plot type.)
    (hint: try to make plot not too jagged by removing rows where new_cases == 0)

  2. What date had the highest number of new deaths in the United States?
    (hint: use the order (remember to sort descending) or which.max functions.)

  3. What date and in what country was the worst (i.e., highest) for per capita death due to COVID-19?

  4. What date and in what country was the worst (i.e., highest) for positivity rate COVID-19?

  5. Create a new data frame using the data from owid called new_cases_per_100k consisting of the following columns, location, date, and cases, where cases is the number of new cases per 100,000 people. In two separate plots, plot the number of new cases per 100k people over time for the United Kingdom (plot 1) and Canada (plot 2). If you wanted to present these plots side by side so as to compare the severity of the pandemic in the UK vs Canada, what might you have to do to make them more comparable?

  6. In 2021, on how many days did the United States have fewer than 0.4 deaths per million people due to COVID-19? What is answer the United Kingdom? Use the column new_deaths_per_million to answer this question.

  • Challenge problem (+3 extra credit)

    Plot a heatmap of the imprinting data (“babak-etal-2015_imprinted-mouse.xlsx”) using the heatmap function. The rows and columns of the heatmap should be labeled properly with the gene names (rows) and tissue names (columns). The Babak et al. (2015) paper has a similar heatmap in Fig 1. Hint: read carefully the help for the heatmap function and know that you can convert data frames to matrices with as.matrix.