Previously, you have seen vectors, which are just lists of objects of a single type. However, you often want a matrix of objects of a single type (or multiple types!) or even a higher-dimensional group of objects. The two-dimensional version of a vector is called a matrix and the n-dimensional version is called an array.
You can think of a matrix as a vector,
vec = 1:16
vec
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
except you’ve specified that there are a certain number of rows and columns:
matx = matrix(vec, nrow = 4, ncol = 4)
matx
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
Notice how the vector “filled” each column of the matrix. This is because R is a “column-major” language (Fortran, MATLAB, and Julia are other column-major languages). Some languages, such as C and Python, use row-major order.
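To make the column-major behavior concrete, here is a minimal sketch that fills the same vector both ways. The names by_col and by_row are made up for this illustration; byrow is a standard argument of matrix().

```r
vec = 1:16

# Default: fill down each column first (column-major)
by_col = matrix(vec, nrow = 4, ncol = 4)

# byrow = TRUE: fill across each row first instead
by_row = matrix(vec, nrow = 4, ncol = 4, byrow = TRUE)

by_col[1, ]  # first row is 1 5 9 13
by_row[1, ]  # first row is 1 2 3 4

# Filling by row gives the transpose of filling by column
same_as_transpose = identical(by_row, t(by_col))
same_as_transpose
```

Only the fill order changes; the values are still stored column by column internally.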
Since you often deal with matrices and matrix-like objects, R has two functions to give you the number of rows and columns. Also, the length of the matrix is just the rows times the columns.
nrow(matx)
[1] 4
ncol(matx)
[1] 4
length(matx)
[1] 16
Likewise, you can convert the vector to a 4x4 array:
arr = array(vec, dim = c(4,4))
arr
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
Two-dimensional arrays are exactly the same as matrices:
str(matx)
int [1:4, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
str(arr)
int [1:4, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
In the above, the dim argument specifies the dimensionality of the array. Thus, you can convert the vector to a multidimensional array of 2x2x2x2 as well:
arr = array(vec, dim = c(2,2,2,2))
arr
, , 1, 1
[,1] [,2]
[1,] 1 3
[2,] 2 4
, , 2, 1
[,1] [,2]
[1,] 5 7
[2,] 6 8
, , 1, 2
[,1] [,2]
[1,] 9 11
[2,] 10 12
, , 2, 2
[,1] [,2]
[1,] 13 15
[2,] 14 16
Note that this is a four-dimensional object, so you can’t print it without having to “flatten” it in some way.
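As a small sketch of how the printed slices relate to the underlying object, the following rebuilds the same 2x2x2x2 array under a hypothetical name (arr4) and pulls out one value and one 2x2 slice, giving one index per dimension.

```r
arr4 = array(1:16, dim = c(2, 2, 2, 2))

dim(arr4)     # 2 2 2 2
length(arr4)  # 16, the product of the dimensions

# One index per dimension picks out a single element
one_value = arr4[1, 2, 1, 2]

# Leaving the first two dimensions blank returns the ", , 1, 2" slice
one_slice = arr4[, , 1, 2]
one_value
one_slice
```

Here one_value is the entry in row 1, column 2 of the ", , 1, 2" slice printed above.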
Indexing matrices works similarly to indexing vectors except that you give a list of the elements you want from each dimension. First, suppose that you roll a twenty-sided die 100 times (because you are a super nerd) and you collect the results in a 10x10 matrix:
set.seed(100) # this gives us the same "random" matrix each time
rmatx = matrix(sample(1:20, 100, replace = TRUE), nrow = 10, ncol = 10)
rmatx
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 10 7 19 6 6 19 16 5 16 8
[2,] 6 7 4 13 20 14 18 10 5 15
[3,] 16 11 4 1 14 20 3 7 18 17
[4,] 19 18 20 9 11 14 5 12 8 12
[5,] 14 12 16 9 13 5 12 9 8 8
[6,] 12 3 7 9 14 15 20 11 2 10
[7,] 6 19 16 11 3 15 10 11 19 3
[8,] 4 8 11 7 14 4 15 3 12 6
[9,] 6 18 20 1 4 12 10 9 5 16
[10,] 2 2 2 9 19 19 12 16 14 6
To obtain the element in the second row, eighth column,
rmatx[2,8]
[1] 10
You can also get a “slice” of the matrix by using the colon operator:
rmatx[2,1:5]
[1] 6 7 4 13 20
yields the first five elements of the second row. Note that within the entry for the column element, you actually used a vector for the index. Thus, you can give any list of indices in any order you choose. For example,
rmatx[c(1,3,5,7,9),c(2,4,6,8,10)]
[,1] [,2] [,3] [,4] [,5]
[1,] 7 6 19 5 8
[2,] 11 1 20 7 17
[3,] 12 9 5 9 8
[4,] 19 11 15 11 3
[5,] 18 1 12 9 16
returns a slice of the matrix with only the odd rows and even columns. You can keep all the elements in a specific dimension by just leaving that spot blank. For example, to get the fourth row,
rmatx[4,]
[1] 19 18 20 9 11 14 5 12 8 12
In general, slicing an array involves giving a list of indices for each dimension of the array.
The magic of data wrangling and analysis with complex data comes in the many creative ways one can create these lists of indices and thus the slices that contain exactly the subset of the data that you want.
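One detail worth seeing in code: when a slice ends up with a single row or column, R drops it to a plain vector by default. The sketch below uses a small made-up matrix m; drop = FALSE is a standard argument of the bracket operator.

```r
m = matrix(1:12, nrow = 3, ncol = 4)

# Default: a single row comes back as a plain vector
row2 = m[2, ]

# drop = FALSE keeps the result as a 1 x 4 matrix
row2_mat = m[2, , drop = FALSE]

dim(row2)      # NULL, because vectors have no dim attribute
dim(row2_mat)  # 1 4
```

Keeping the matrix shape can matter when later code expects to call nrow or ncol on the result.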
Recall that lists are like vectors but with potentially multiple types of objects. Thus, they are a little bit more complicated to slice. Take the following list,
x = list(1:3, 4:6, 7:9)
x
[[1]]
[1] 1 2 3
[[2]]
[1] 4 5 6
[[3]]
[1] 7 8 9
which is a list of three vectors, each three elements long. To get the first two elements of the list, you would normally do
str(x[c(1,2)])
List of 2
$ : int [1:3] 1 2 3
$ : int [1:3] 4 5 6
and (using the str function) you can see that this returned a list of length two containing the first two vectors. Thus, the single brackets simply return another list that contains the elements requested. You can give the single brackets a vector of indices, so it’s natural that you should get a list back. To get a component of the list itself, you must use double brackets:
str(x[[1]])
int [1:3] 1 2 3
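The difference between single and double brackets can be checked directly. This is a minimal sketch using the same list; the names sub_list, component, total, and sixth are made up for the illustration.

```r
x = list(1:3, 4:6, 7:9)

sub_list  = x[1]    # single brackets: a list of length one
component = x[[1]]  # double brackets: the integer vector itself

class(sub_list)   # "list"
class(component)  # "integer"

# Functions that expect a vector need the component, not the sub-list
total = sum(component)

# Double brackets can be chained with ordinary vector indexing
sixth = x[[2]][3]
total
sixth
```

Calling sum(x[1]) instead would fail, because sum does not accept a list.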
This will be useful too when we’re dealing with data.frames, which are basically just special lists. Another useful thing with lists (and other objects, as we’ll see next week) is that you can name the elements with strings (that satisfy R’s object naming rules!). This allows you to access elements of the list using the name instead of the index. For example, if
x = list(a = 1:3, b = 4:6, c = 7:9)
x
$a
[1] 1 2 3
$b
[1] 4 5 6
$c
[1] 7 8 9
then we can access the first element using its name “a” and the “$” operator:
x$a
[1] 1 2 3
which is equivalent to
x[[1]]
[1] 1 2 3
x[["a"]]
[1] 1 2 3
x$"a"
[1] 1 2 3
x$`a`
[1] 1 2 3
Any valid R name doesn’t need quotes when used with $, but you need backticks or quotes for fancy names that are not valid R names:
x = list(`1 fancy name`=1:3, b = 4:6, c = 7:9)
x$`1 fancy name`
[1] 1 2 3
x[["1 fancy name"]]
[1] 1 2 3
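Names also work inside single brackets, and double brackets accept a name stored in a variable, which $ does not. The sketch below uses a hypothetical list y and variable key to show this.

```r
y = list(a = 1:3, b = 4:6, c = 7:9)

names(y)  # "a" "b" "c"

# Single brackets with names return a sub-list
picked = y[c("a", "c")]
str(picked)

# Double brackets evaluate the variable, so this looks up element "b"
key = "b"
by_variable = y[[key]]

# $ takes the name literally, so this looks for an element called "key"
by_dollar = y$key
by_variable
by_dollar  # NULL, since there is no element named "key"
```

This is why double brackets are usually the safer choice inside functions and loops, where the name is held in a variable.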
Negative integers omit the specified positions:
rmatx
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 10 7 19 6 6 19 16 5 16 8
[2,] 6 7 4 13 20 14 18 10 5 15
[3,] 16 11 4 1 14 20 3 7 18 17
[4,] 19 18 20 9 11 14 5 12 8 12
[5,] 14 12 16 9 13 5 12 9 8 8
[6,] 12 3 7 9 14 15 20 11 2 10
[7,] 6 19 16 11 3 15 10 11 19 3
[8,] 4 8 11 7 14 4 15 3 12 6
[9,] 6 18 20 1 4 12 10 9 5 16
[10,] 2 2 2 9 19 19 12 16 14 6
rmatx[-c(1,3,5,7,9),c(2,4,6,8,10)]
[,1] [,2] [,3] [,4] [,5]
[1,] 7 13 14 10 15
[2,] 18 9 14 12 12
[3,] 3 9 15 11 10
[4,] 8 7 4 3 6
[5,] 2 9 19 16 6
gives the even rows and even columns.
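On a smaller matrix the effect of negative indices is easier to check by eye. This is a sketch with a made-up 3x4 matrix m.

```r
m = matrix(1:12, nrow = 3, ncol = 4)

# Drop the first row, keep all columns
no_first_row = m[-1, ]

# Keep all rows, drop the first two columns
last_two_cols = m[, -c(1, 2)]

no_first_row
last_two_cols

# Positive and negative indices cannot be mixed in one position:
# m[c(-1, 2), ] stops with an error
```

The result keeps the remaining rows and columns in their original order.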
Logical vectors select the elements of the matrix where the index vector is TRUE.
This is one of the most useful ways to index.
If you want to get only the rolls of the die that were less than 10, then you create a matrix of logicals
rmatx < 10
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
[2,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[3,] FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
[4,] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
[5,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
[6,] FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[7,] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[8,] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
[9,] TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
[10,] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
and use it to index the matrix:
rmatx[rmatx < 10]
[1] 6 6 4 6 2 7 7 3 8 2 4 4 7 2 6 1 9 9 9 7 1 9 6 3 4 5 4 3 5 5 7 9 3 9 5 8 8 2
[39] 5 8 8 3 6 6
Notice here that you get a vector back, not a matrix. This is because when you give a single index to a matrix, it treats the matrix like a vector with the first column first, then the second column, etc. (i.e., column-major order). It also makes sense that you don’t get a matrix back since the “< 10” condition could be met anywhere, any number of times in the matrix (i.e., there is no guarantee that the result would be rectangular like a matrix).
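If you need to know where the matching elements sit rather than just their values, the base function which() with arr.ind = TRUE reports row and column positions. The sketch below uses a small made-up matrix m so the result is easy to verify.

```r
m = matrix(c(3, 12, 7, 15, 1, 20), nrow = 2)
m

# Values that satisfy the condition, in column-major order
low = m[m < 10]

# Counting works because TRUE counts as 1 and FALSE as 0
n_low = sum(m < 10)

# Row and column position of each match
where = which(m < 10, arr.ind = TRUE)
low
n_low
where
```

In this toy matrix all three matches are in the first row, one per column.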
You could also just slice rows based on columns (or vice versa). This is the kind of slicing we’ll often do on a data table since we will want all rows (say, results from different experiments) whose column (say, factor in the experiment) matches a certain condition. For example, suppose we want to get all rows of the matrix rmatx with a fourth column whose roll is less than 10. We first slice the fourth column with rmatx[,4] and then compare it to 10,
rmatx
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 10 7 19 6 6 19 16 5 16 8
[2,] 6 7 4 13 20 14 18 10 5 15
[3,] 16 11 4 1 14 20 3 7 18 17
[4,] 19 18 20 9 11 14 5 12 8 12
[5,] 14 12 16 9 13 5 12 9 8 8
[6,] 12 3 7 9 14 15 20 11 2 10
[7,] 6 19 16 11 3 15 10 11 19 3
[8,] 4 8 11 7 14 4 15 3 12 6
[9,] 6 18 20 1 4 12 10 9 5 16
[10,] 2 2 2 9 19 19 12 16 14 6
rmatx[,4] < 10
[1] TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
Note that we get a vector of boolean values indicating whether the element in that row of the fourth column is less than 10. To get only those rows of the matrix rmatx, we then use this vector to slice the matrix:
rmatx[rmatx[,4] < 10,]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 10 7 19 6 6 19 16 5 16 8
[2,] 16 11 4 1 14 20 3 7 18 17
[3,] 19 18 20 9 11 14 5 12 8 12
[4,] 14 12 16 9 13 5 12 9 8 8
[5,] 12 3 7 9 14 15 20 11 2 10
[6,] 4 8 11 7 14 4 15 3 12 6
[7,] 6 18 20 1 4 12 10 9 5 16
[8,] 2 2 2 9 19 19 12 16 14 6
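Row conditions can be combined with & (and) or | (or) before slicing. This is a minimal sketch on a made-up 4x2 matrix called scores; the variable names are hypothetical.

```r
scores = matrix(c(5, 12, 8, 3,
                  9, 2, 14, 7), nrow = 4)
scores

# TRUE for rows where column 1 is below 10 AND column 2 is above 5
keep = scores[, 1] < 10 & scores[, 2] > 5
keep

# Use the logical vector in the row position, leave columns blank
kept = scores[keep, ]
kept
```

Storing the logical vector in its own variable first makes longer conditions easier to read and to check.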
Given that you can slice matrices now, you will at some point want to apply some function to each element of that slice or to each row or column of the matrix. This can be done with a for loop, but there are functions that simplify this. Such functions are apply functions in R and map functions in Python and Mathematica.
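To see what an apply function replaces, here is a sketch that computes the same per-element sums with an explicit for loop and with sapply. The names totals_loop and totals_apply are made up for the comparison.

```r
x = list(1:3, 4:6, 7:9)

# Loop version: pre-allocate the result, then fill it one element at a time
totals_loop = numeric(length(x))
for (i in seq_along(x)) {
  totals_loop[i] = sum(x[[i]])
}

# Apply version: one call does the iteration and collects the results
totals_apply = sapply(x, sum)

totals_loop
totals_apply
```

Both give the same three totals; the apply version just removes the bookkeeping.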
In R, there are actually many apply functions since there are different types of list objects. To see them all, you type
??base::apply
which searches for all functions with “apply” in the description in the “base” package.
If you have a list, the easiest apply function is sapply, which applies a function to each element of a vector or list and returns the output as a vector or matrix if possible. For example, you could sum each vector in the list you created above using the sum function.
x
$`1 fancy name`
[1] 1 2 3
$b
[1] 4 5 6
$c
[1] 7 8 9
sapply(x, sum)
1 fancy name b c
6 15 24
Note that sapply gave you back a vector but with each element named according to the names of the list elements (handy!).
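The phrase "a vector or matrix if possible" can be illustrated by comparing sapply with two of its base R relatives, lapply and vapply. This is a sketch on a small named list; the result names are hypothetical.

```r
x = list(a = 1:3, b = 4:6, c = 7:9)

# lapply always returns a list, one element per input
as_list = lapply(x, sum)

# sapply simplifies: range() returns two values per element,
# so the result becomes a 2 x 3 matrix with one column per list element
as_matrix = sapply(x, range)

# vapply is like sapply but you state the expected type and length up front
checked = vapply(x, sum, integer(1))

str(as_list)
as_matrix
checked
```

vapply stops with an error if a result does not match the declared shape, which can catch surprises that sapply would silently reshape.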
If you want to apply a function over the rows or columns of a matrix, use the apply function. To add all the elements in each row (dimension 1) or column (dimension 2) of the matrix matx, you can try
matx
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
apply(matx, 1, sum)
[1] 28 32 36 40
apply(matx, 2, sum)
[1] 10 26 42 58
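For the specific case of sums, base R also provides rowSums and colSums, which give the same answers as the apply calls above. The sketch below rebuilds matx and also shows apply with a different function (max); col_max is a made-up name.

```r
matx = matrix(1:16, nrow = 4, ncol = 4)

rowSums(matx)  # same totals as apply(matx, 1, sum)
colSums(matx)  # same totals as apply(matx, 2, sum)

# apply is more general: any function of a vector works
col_max = apply(matx, 2, max)
col_max
```

apply remains the tool to reach for when no dedicated row or column function exists.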
However, you can do more complicated things by making custom functions and “applying” them. For example, in order to get the number of die rolls less than 10 in each row of your die roll matrix, you could try
rmatx
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 10 7 19 6 6 19 16 5 16 8
[2,] 6 7 4 13 20 14 18 10 5 15
[3,] 16 11 4 1 14 20 3 7 18 17
[4,] 19 18 20 9 11 14 5 12 8 12
[5,] 14 12 16 9 13 5 12 9 8 8
[6,] 12 3 7 9 14 15 20 11 2 10
[7,] 6 19 16 11 3 15 10 11 19 3
[8,] 4 8 11 7 14 4 15 3 12 6
[9,] 6 18 20 1 4 12 10 9 5 16
[10,] 2 2 2 9 19 19 12 16 14 6
apply(rmatx, 1, function(x) sum(x < 10))
[1] 5 4 4 3 5 4 3 6 5 5
where you have a small “anonymous” (unnamed) function here that gets a logical vector for each row, TRUE for < 10 and FALSE for >= 10, and sums that vector (TRUE = 1 and FALSE = 0).
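The same idea works with a named function, and extra arguments after the function are passed along by apply. This sketch uses a small made-up matrix (rolls) and a hypothetical helper count_low so the counts can be checked by hand.

```r
rolls = matrix(c(10, 6, 16,
                 7, 7, 11,
                 19, 4, 4), nrow = 3)
rolls

# Named helper: how many values fall below the cutoff?
count_low = function(x, cutoff = 10) sum(x < cutoff)

by_row = apply(rolls, 1, count_low)  # one count per row
by_col = apply(rolls, 2, count_low)  # one count per column

# Arguments after the function are forwarded to it
by_row_5 = apply(rolls, 1, count_low, cutoff = 5)

by_row
by_col
by_row_5
```

Switching the second argument between 1 and 2 is all it takes to move from rows to columns.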
Now let’s slice some real DATA. We’ll return to the COVID-19 data from the New York Times (https://github.com/nytimes/covid-19-data). Below we load the data and transform it from cumulative counts to daily counts of cases. Finally, we select case data for 2021 only.
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.2.1 ✔ stringr 1.4.1
✔ readr 2.1.2 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
us = read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
Rows: 51134 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, fips
dbl (2): cases, deaths
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
us_cases_2021_wide = us %>%
  # make each state its own column
  pivot_wider(id_cols = date, names_from = state, values_from = cases) %>%
  # for each column, subtract the previous day's cumulative count
  mutate(across(where(is.numeric), ~ .x - lag(.x))) %>%
  # keep only 2021 cases
  filter(date >= "2021-01-01" & date <= "2021-12-31")
us_cases_2021 = us_cases_2021_wide %>%
  # put back into tidy format
  pivot_longer(-date, names_to = "state", values_to = "cases") %>%
  # filter out rows with NA cases
  filter(!is.na(cases))
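The step that turns cumulative counts into daily counts subtracts each value's predecessor. The base R sketch below shows the same arithmetic on a made-up vector of cumulative counts; it is an illustration of the idea, not the pipeline's own code.

```r
# Hypothetical cumulative case counts over four days
cumulative = c(10, 15, 15, 27)

# Day-over-day differences; the first day has no previous value, so it is NA
daily = c(NA, diff(cumulative))
daily
```

This matches what .x - lag(.x) produces for one column: the first entry is NA because there is no earlier day to subtract.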
You can get the first few rows with
head(us_cases_2021)
# A tibble: 6 × 3
date state cases
<date> <chr> <dbl>
1 2021-01-01 Washington 521
2 2021-01-01 Illinois 447
3 2021-01-01 California 37951
4 2021-01-01 Arizona 6438
5 2021-01-01 Massachusetts 0
6 2021-01-01 Wisconsin 2085
and compare it to the “wide” version
head(us_cases_2021_wide)
# A tibble: 6 × 57
date Washi…¹ Illin…² Calif…³ Arizona Massa…⁴ Wisco…⁵ Texas Nebra…⁶ Utah
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-01-01 521 447 37951 6438 0 2085 7779 772 0
2 2021-01-02 1705 11390 52194 8895 9003 1129 14057 763 5135
3 2021-01-03 5267 4428 35911 17222 3481 2593 16019 707 1726
4 2021-01-04 2771 5354 40723 5158 4920 1626 23104 741 2160
5 2021-01-05 2385 7052 37284 10134 4620 4019 30992 1289 3318
6 2021-01-06 2077 7444 36907 6947 6851 4109 25322 1558 3769
# … with 47 more variables: Oregon <dbl>, Florida <dbl>, `New York` <dbl>,
# `Rhode Island` <dbl>, Georgia <dbl>, `New Hampshire` <dbl>,
# `North Carolina` <dbl>, `New Jersey` <dbl>, Colorado <dbl>, Maryland <dbl>,
# Nevada <dbl>, Tennessee <dbl>, Hawaii <dbl>, Indiana <dbl>, Kentucky <dbl>,
# Minnesota <dbl>, Oklahoma <dbl>, Pennsylvania <dbl>,
# `South Carolina` <dbl>, `District of Columbia` <dbl>, Kansas <dbl>,
# Missouri <dbl>, Vermont <dbl>, Virginia <dbl>, Connecticut <dbl>, …
The data are saved as a tibble, which is really a fancy data.frame, which is a special kind of list that we will discuss in more detail next week. For the moment, you can practice slicing the data.frame as if it were a matrix.
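As a sketch of what matrix-style slicing looks like on a data frame, here is a toy example with made-up values (the data frame toy and its contents are hypothetical and unrelated to the COVID data).

```r
toy = data.frame(
  state = c("A", "B", "A", "B"),
  cases = c(5, 20, 7, 1)
)

# Rows where a column meets a condition, all columns kept
big_days = toy[toy$cases > 6, ]

# Rows for one state, only the cases column
a_cases = toy[toy$state == "A", "cases"]
a_total = sum(a_cases)

big_days
a_total
```

One difference to keep in mind: with a plain data.frame a single selected column drops to a vector as above, whereas a tibble keeps it as a one-column tibble, so toy$cases[toy$state == "A"] is the more portable way to get a vector.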
Load the .csv file using the code above and show the R code in your .qmd file for each of the following problems. You should be able to use a few lines of R code to obtain the answer directly (i.e., you shouldn’t just scroll through the data table and give the answer by hand). Use the us_cases_2021 and us_cases_2021_wide data frames; the “long” format data frame us_cases_2021 may be easier to use in some problems whereas us_cases_2021_wide may be easier to use in others.
Which states have had more than 75,000 cases during one day in 2021?
How many cases has Kentucky had during 2021? (hint: use the sum function)
How many cases were there outside of California in the month of August? (hint: you’ll need the boolean & to combine multiple conditions)
Which US state has the second-fewest cases in 2021? (hint: use sapply with sum and then use the sort function)
Which day in 2021 had the highest number of cases in the country (all states and territories)? (hint: use apply and sum to get the totals for each date. Then use the order function to give you the right order to sort the dates…)