Getting data into R with data.frames and spreadsheets
Author
Jeremy Van Cleve
Published
February 3, 2026
Outline for today
Another slice of slicing
Names and attributes
Factors
Data frames: a special kind of list
Reading data tables
Another slice of slicing
Last time, we covered much of the basics of slicing matrices but there are still some topics and some helper functions that will be useful to know when trying to accomplish certain tasks.
Assigning to a slice
Not only can you extract a slice of a matrix to analyze or plot but you can also assign values to that slice. First, create a matrix of all zeros to manipulate:
Note that when assigning to a slice, the right-hand side must be of the same dimensionality as the left-hand side. For example, the following will not work:
allz[1,] =1:4
The one exception to this rule is when the number of items on the right hand side is a multiple of the number of elements in the slice. The simplest example is
where the right hand side is use as many times as necessary to fill the slice.
Sorting
Sorting numeric and character values is an important task that comes up in many applications. The sort function has reasonable defaults where it produces increasing numeric values
You can reverse the sort order by setting the argument decreasing = TRUE.
Getting the indices from slices
Sorting
Often, you will want to sort not only a vector by the rows of a data matrix based on some column of the matrix. Thus, you need the list of positions each row will go to (e.g., row 1 to row 10 because its 10th in the sorted order, etc). To obtain this, you can use the order function
svec
[1] "hello" "world" "goodbye" "grand" "planet"
order(svec)
[1] 3 4 1 5 2
which output precisely that list of indices. If you stick these indices back into the vector, you will obtain the original sort operation
svec[ order(svec) ]
[1] "goodbye" "grand" "hello" "planet" "world"
sort(svec)
[1] "goodbye" "grand" "hello" "planet" "world"
You can also use the “sort order” of one column to order the rows of a whole matrix or data table. For example, using a matrix of random values,
Recall that you can slice by creating a logical condition (generating TRUE and FALSE values) and use that in the index of a matrix. Sometimes, you want the actual indices of the elements of that matrix that are sliced; i.e., you want the indices of the elements where the conditions is TRUE. To get these indices, you use the which function. For example, the logical vector and slice are
Finally, there some special versions of the which function that give you the first index of the max or min element of a vector, which.max and which.min.
Names and attributes
We’ve talked about attributes and names before but there are some helpful functions for getting and setting the names associated with arrays and lists. You have already seen with lists how each element can be given a name.
l =list(a =1, b ="one hundred")named_svec =c(s1 ="hello", s2 ="world", s3 ="goodbye", s4 ="grand", s5 ="planet")named_svec
Finally, you can get rid of the names entirely by assigning names to NULL
names(named_svec) =NULLnamed_svec
[1] "hello" "world" "goodbye" "grand" "planet"
Just as reminder, while we can name elements of vectors, they still have to hold the same data type, unlike lists that can hold anything.
str(list(a=1, b="two"))
List of 2
$ a: num 1
$ b: chr "two"
str(c(a=1, b="two"))
Named chr [1:2] "1" "two"
- attr(*, "names")= chr [1:2] "a" "b"
Factors
A special object that you will see when dealing with data frames is called a “factor”. A factor is a vector that can contain only predefined values and essentially stores categorical data (e.g., “tall”, “medium”, and “short” for plant height). Factors have a “levels” attribute that lists the allowable values. For example
If you try to set an element of the factor object to a value outside of levels, you will receive a warning
fac_factor[1] ="Van Cleve"
Warning in `[<-.factor`(`*tmp*`, 1, value = "Van Cleve"): invalid factor level,
NA generated
fac_factor
[1] <NA> Burger Seifert Santollo Duncan Long
Levels: Burger Duncan Famulski Long Santollo Seifert
and the element will be converted to the NA value, which is used for missing data.
Many R functions that read data tables take advantage of this behavior of factors so that columns may only contain certain values and the other values are missing data. This occurs when the function runs into a column with string data and the R function will often convert that column to a factor. Some of the functions that read data tables have nice arguments that let you tell them that specific strings, say “-”, represent missing data and should be be converted to NA.
While useful, factors are extremely annoying when your data are converted to them when you don’t expect it as further changes to the data table may result in NA values when you really wanted to add a new string value. This paper gives a good history of why factors are useful in R. It mostly comes down to factors being useful for categorical variables in regression models.
The main place factors are used that we’ll encounter in this course is when plotting categorical variables. In those cases, the order the variables are plotted in will be determined the order of the levels in levels. In those cases, you may want to reorder the factors so that the variables are plotted in a specific order (say in descending order of frequency in the data). For this, there is a nice package called forcats that is included in the tidyverse that has the function fct_reorder that can help. Another thing we’ll run into is changing factor levels so that they have more descriptive labels. For this forcats has fct_recode. We’ll see examples of these kinds of scenarios later on when we’re plotting using ggplot2
Data frames
Finally, we have reached data frames. Data frames are the most common way of storing data in R. Essentially, a data frame is a list object containing vectors of equal length (i.e., the number of rows of the table). Put another way, a data frame is a list version of a matrix. Thus, data frames have properties such as length(), rnow(), ncol()colnames(), and rownames().
Creating a data frame is like creating a list where you name your elements, which here are columns (data not guaranteed to be accurate…):
height_rank last_name first_name
1 1 Van Cleve Jeremy
2 2 Linnen Catherine
3 3 Seifert Ashley
4 4 Pendergast Julie
Slicing a data frame works like slicing a matrix and also like slicing a list. Often, we will use the list convention where columns can be obtained with $. For example,
dframe$first_name
[1] "Jeremy" "Catherine" "Ashley" "Julie"
dframe$last_name
[1] "Van Cleve" "Linnen" "Seifert" "Pendergast"
This is convient because the column names make clear what kind of data we’re slicing from the data frame.
Adding columns to a data frame is done with cbind (“column bind”), which glues together columns. To add the column building, we do the following
cbind(dframe, building =c("THM", "THM", "THM", "THM"), floor =c(2,2,2,3))
height_rank last_name first_name building floor
1 1 Van Cleve Jeremy THM 2
2 2 Linnen Catherine THM 2
3 3 Seifert Ashley THM 2
4 4 Pendergast Julie THM 3
Adding rows can be done with rbind (“row bind”), which glues together rows,
height_rank last_name first_name
1 1 Van Cleve Jeremy
2 2 Linnen Catherine
3 3 Seifert Ashley
4 4 Pendergast Julie
5 0 Smith Jeramiah
Note here that he have to create a new data data.frame for our new row since dframe is a data.frame and so binding rows together here means binding together two data frames each with the same columns.
Note that each of these commands returned a newdata.frame and the original is unchanged until we explicitly save back to that variable name:
dframe
height_rank last_name first_name
1 1 Van Cleve Jeremy
2 2 Linnen Catherine
3 3 Seifert Ashley
4 4 Pendergast Julie
Functions like cbind, rbind, and others that do operations on arrays and data frames usually create a copy of the data and return the modified copy. This is usually what you want since you’re not modifying your original variable/data until you explicitly assign the old variable to the new data. One case where you might not want to do this is when your data are so big (e.g., whole genomes, billions of tweets) that they take up a large fraction of the computer’s RAM, in which case you have to be very careful about creating copies of your data.
Reading data tables
Now that you know about data frames, you can start using some nice R functions to read in data. We have already seen this when loading data for the homeworks. As in those examples, we load the a few packages before loading the data since they are nice for reading csv and excel files. For reading excel files, you’ll need to install the readxl package if you don’t have it, which you can do with:
install.packages("readxl")
Then load:
library(tidyverse) # loads the `readr` package that loads things like csv fileslibrary(readxl) # package for reading Excel files
Now, you can use the read_csv function to load csv or “comma separated value” files. For example to load COVID-19 and respiratory virus data from the CDC that was saved as a csv file, we load us_hosps_deaths_cdc_2020-01_2024-09.csv, which is in the project folder and course GitHub repo.
Rows: 11128 Columns: 72
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, state_abbrv
dbl (69): covid_19_deaths, total_deaths, percent_of_expected_deaths, pneumo...
dttm (1): week_end_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice that read_csv gives you some nice output telling us about the table you just read. This function and others like it (i.e., from the readr and readxl packages) do a lot for you automatically and have many nice features. For example, read_csv has the argument col_names = TRUE by default, which means it uses the first row of the table as the column names. Some tables may simple just straight into data without column names in which case you can set col_names = FALSE and it will give automatic names or give col_names a vector of column names manually. Sometimes data tables will have the first few lines with text describing the data and you can skip them by giving the argument skip the number of lines to skip. There are many other options so looking at the help with ?read_csv is recommended when you’re having trouble getting the data loaded correctly.
Loading excel files in no harder. We’ll load some data from a RNA-seq paper on genomic imprinting (Babak et al. 2015. Nat Gen, http://dx.doi.org/10.1038/ng.3274), babak-etal-2015_imprinted-mouse.xlsx (located in project folder and course GitHub repo), with read_excel
imprint =read_excel("babak-etal-2015_imprinted-mouse.xlsx", na ="NaN")
Note that you have to tell the function what strings in the Excel spreadsheet correspond to NA or missing data (“NaN” in this case). The first column are the gene names for each row
you should notice that both are of the tibble type. A tibble is a data.frame but with enhancements. First and maybe most importantly, it prints nicely when you evaluate it at the command line and in Quarto notebooks. Second, it leaves the column names alone on conversion to a data frame. Thus, we get columns like Preoptic Area (ref) in the imprinting data instead of
make.names("Preoptic Area (ref)")
[1] "Preoptic.Area..ref."
So a “normal” data.frame would do this to the data:
In the html, the full data frame would be printed (here, head was used above to limit the output to the first few rows), which would means tons of scrolling, whereas only a preview of the tibble is printed, which is usually more convenient. The tibble type also doesn’t automatically convert character columns to factors. In old versions of R (pre 4.0.0), data.frame automatically did this to the consternation of many.
Lab
Now that you have all the essential elements of slicing, let’s do some more things with COVID-19 data, but this time with world wide data from “Our World in Data”: https://docs.owid.io/projects/etl/api/covid/. This is a big data set, so it might take a few moments to download.
Before starting on the problems, take a look at which columns are provided in the table. This will help for solving the problems. As a reminder, include the code chunk above in your own qmd file to load the data so that I can see your code run on the loaded data.
Problems
Plot the new cases per week per million people for the pandemic for the United States. Use a line plot.
(hint: use the help for plot, ?base::plot, to figure out how to set the plot type.)
(hint: try to make plot not too jagged by removing rows where new_cases == 0)
What date had the highest number of new deaths in the United States?
(hint: use the order (remember to sort descending) or which.max functions.)
What date and in what country was the worst (i.e., highest) for per capita death due to COVID-19?
What date and in what country was the worst (i.e., highest) for positivity rate COVID-19?
Create a new data frame using the data from owid called new_cases_per_100k consisting of the following columns, location, date, and cases, where cases is the number of new cases per 100,000 people. In two separate plots, plot the number of new cases per 100k people over time for the United Kingdom (plot 1) and Canada (plot 2). If you wanted to present these plots side by side so as to compare the severity of the pandemic in the UK vs Canada, what might you have to do to make them more comparable?
In 2021, on how many days did the United States have fewer than 0.4 deaths per million people due to COVID-19? What is answer the United Kingdom? Use the column new_deaths_per_million to answer this question.
Challenge problem (+3 extra credit)
Plot a heatmap of the imprinting data (“babak-etal-2015_imprinted-mouse.xlsx”) using the heatmap function. The rows and columns of the heatmap should be labeled properly with the gene names (rows) and tissue names (columns). The Babak et al. (2015) paper has a similar heatmap in Fig 1. Hint: read carefully the help for the heatmap function and know that you can convert data frames to matrices with as.matrix.