Last time, we covered much of the basics of slicing matrices but there are still some topics and some helper functions that will be useful to know when trying to accomplish certain tasks.
Assigning to a slice
Not only can you extract a slice of a matrix to analyze or plot but you can also assign values to that slice. First, create a matrix of all zeros to manipulate:
Note that when assigning to a slice, the right-hand side must be of the same dimensionality as the left-hand side. For example, the following will not work:
allz[1,] =1:4
The one exception to this rule is when the number of items on the right hand side is a multiple of the number of elements in the slice. The simplest example is
where the right hand side is use as many times as necessary to fill the slice.
Sorting
Sorting numeric and character values is an important task that comes up in many applications. The sort function has reasonable defaults where it produces increasing numeric values
You can reverse the sort order by setting the argument decreasing = TRUE.
Getting the indices from slices
Sorting
Often, you will want to sort not only a vector by the rows of a data matrix based on some column of the matrix. Thus, you need the list of positions each row will go to (e.g., row 1 to row 10 because its 10th in the sorted order, etc). To obtain this, you can use the order function
svec
[1] "hello" "world" "goodbye" "grand" "planet"
order(svec)
[1] 3 4 1 5 2
which output precisely that list of indices. If you stick these indices back into the vector, you will obtain the original sort operation
svec[ order(svec) ]
[1] "goodbye" "grand" "hello" "planet" "world"
sort(svec)
[1] "goodbye" "grand" "hello" "planet" "world"
You can also use the “sort order” of one column to order the rows of a whole matrix or data table. For example, using a matrix of random values,
Recall that you can slice by creating a logical condition (generating TRUE and FALSE values) and use that in the index of a matrix. Sometimes, you want the actual indices of the elements of that matrix that are sliced; i.e., you want the indices of the elements where the conditions is TRUE. To get these indices, you use the which function. For example, the logical vector and slice are
Finally, there some special versions of the which function that give you the first index of the max or min element of a vector, which.max and which.min.
Names and attributes
We’ve talked about attributes and names before but there are some helpful functions for getting and setting the names associated with arrays and lists. You have already seen with lists how each element can be given a name.
l =list(a =1, b ="one hundred")named_svec =c(s1 ="hello", s2 ="world", s3 ="goodbye", s4 ="grand", s5 ="planet")named_svec
Finally, you can get rid of the names entirely by assigning names to NULL
names(named_svec) =NULLnamed_svec
[1] "hello" "world" "goodbye" "grand" "planet"
Just as reminder, while we can name elements of vectors, they still have to hold the same data type, unlike lists that can hold anything.
str(list(a=1, b="two"))
List of 2
$ a: num 1
$ b: chr "two"
str(c(a=1, b="two"))
Named chr [1:2] "1" "two"
- attr(*, "names")= chr [1:2] "a" "b"
Factors
A special object that you will see when dealing with data frames is called a “factor”. A factor is a vector that can contain only predefined values and essentially stores categorical data (e.g., “tall”, “medium”, and “short” for plant height). Factors have a “levels” attribute that lists the allowable values. For example
and the element will be converted to the NA value, which is used for missing data.
Many R functions that read data tables take advantage of this behavior of factors so that columns may only contain certain values and the other values are missing data. This occurs when the function runs into a column with string data and the R function will often convert that column to a factor. Some of the functions that read data tables have nice arguments that let you tell them that specific strings, say “-”, represent missing data and should be be converted to NA.
While useful, factors are extremely annoying when your data are converted to them when you don’t expect it as further changes to the data table may result in NA values when you really wanted to add a new string value. This paper gives a good history of why factors are useful in R. It mostly comes down to factors being useful for categorical variables in regression models.
Data frames
Finally we have reached data frames. Data frames are the most common way of storing data in R. Essentially, a data frame is a list object containing vectors of equal length (i.e., the number of rows of the table). Put another way, a data frame is a list version of a matrix. Thus, data frames have properties such as length(), rnow(), ncol()colnames(), and rownames().
Creating a data frame is like creating a list where you name your elements, which here are columns (data not guaranteed to be accurate…):
height_rank last_name first_name
1 1 Van Cleve Jeremy
2 2 Linnen Catherine
3 3 Seifert Ashley
4 4 Pendergast Julie
5 0 Smith Jeramiah
Again, note that each of these commands returned a newdata.frame and the original is unchanged until we explicitly save back to that variable name:
dframe
height_rank last_name first_name
1 1 Van Cleve Jeremy
2 2 Linnen Catherine
3 3 Seifert Ashley
4 4 Pendergast Julie
Functions like cbind, rbind, and others that do operations on arrays and data frames usually create a copy of the data and return the modified copy. This is usually what you want since you’re not modifying your original variable/data until you explicitly assign the old variable to the new data. One case where you might not want to do this is when your data are so big (e.g., whole genomes, billions of tweets) that they take up a large fraction of the computer’s RAM, in which case you have to be very careful about creating copies of your data.
Reading data tables
Now that you know about data frames, you can start using some nice R functions to read in data. We have already seen this when loading data for the homeworks. As in those examples, we load the a few packages before loading the data since they are nice for reading csv and excel files. You probably don’t have the readxl package yet, which you can install with:
install.packages("readxl")
Then load:
library(tidyverse) # loads the `readr` package that loads things like csv fileslibrary(readxl) # package for reading Excel files
Now, you can use the read_csv function to get the COVID-19 data table from the NY Times.
us =read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
Rows: 51526 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, fips
dbl (2): cases, deaths
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice that read_csv gives you some nice output telling us about the table you just read. This function and others like it (i.e., from the readr and readxl packages) do a lot for you automatically and have many nice features, some of which we’ll cover later.
imprint =read_excel("babak-etal-2015_imprinted-mouse.xlsx", na ="NaN")
Note that you have to tell the function what strings in the Excel spreadsheet correspond to NA or missing data (“NaN” in this case). The first column are the gene names for each row
Before starting on the problems, take a look at which columns are provided in the table. This will help for solving the problems. Hint: use an R command to do this and you get a nice list of all the columns.
Problems
Plot the new cases per day per million people for the pandemic for the United States. Use a line plot. Hint: use the help for plot, ?base::plot, to set the plot type.
What day had the highest number of deaths in the United States? Hint: use the order function.
What day and in what country was the worst (i.e., highest) for per capita death due to COVID-19?
Create a new data frame called new_cases_per_100k consisting of the following columns, location, date, and cases, where cases is the number of new cases per 100,000 people. In two separate plots, plot the number of new cases per 100k people over time for the United Kingdom (plot 1) and Canada (plot 2). If you wanted to present these plots side by side so as to compare the severity of the pandemic in the UK vs Canada, what might you have to do to make them more comparable?
In the owid data set, how many countries have data on ICU patients (column icu_patients)? Hint: you might need the is.na function and the unique function.
In 2021, on how many days did the United States have fewer than 2 deaths per 10 million people due to COVID-19? What is answer the United Kingdom? You may run into issues where you’re comparing an NA value to a number, like 2. In this case, you should exclude the NA values by using !is.na(owid$new_cases_per_million) as the first in your series of logical conditions for your slice.
Challenge problem (+3 extra credit)
Plot a heatmap of the imprinting data using the heatmap function. The rows and columns of the heatmap should be labeled properly with the gene names (rows) and tissue names (columns). The Babak et al. (2015) paper has a similar heatmap in Fig 1. Hint: read carefully the help for the heatmap function and know that you can convert data frames to matrices with as.matrix.