Vectors, slicing, and map(ping)

Author

Jeremy Van Cleve

Published

September 12, 2022

Outline for today

Matrices and arrays
Indexing and slicing
Mapping and applying

Matrices and arrays

Previously, you have seen vectors, which are just lists of objects of a single type. However, you often want a two-dimensional table of objects of a single type (or even multiple types!) or an even higher-dimensional group of objects. The two-dimensional version of a vector is called a matrix and the n-dimensional version is called an array.

You can think of a matrix as a vector,

vec = 1:16
vec

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

except you’ve specified that there are a certain number of rows and columns:

matx = matrix(vec, nrow = 4, ncol = 4)
matx

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Notice how the vector “filled” each column of the matrix. This is because R is a “column-major” language (Fortran, MATLAB, and Julia are other column-major languages). Some languages, such as C and Python, have row-major order.
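
As a small illustration of column-major filling, the sketch below builds the same matrix both ways. The `byrow` argument is part of base R's `matrix` function; the names `by_col` and `by_row` are made up for this example.

```r
vec = 1:16
# default: fill down each column first (column-major)
by_col = matrix(vec, nrow = 4, ncol = 4)
# byrow = TRUE: fill across each row first
by_row = matrix(vec, nrow = 4, ncol = 4, byrow = TRUE)
by_col[1, ]   # first row when filled by column
by_row[1, ]   # first row when filled by row
# here, filling by row gives the transpose of filling by column
all(by_row == t(by_col))
```

If your data were written out row by row (as is common in files exported from other tools), `byrow = TRUE` is probably what you want.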

Since you often deal with matrices and matrix-like objects, R has two functions to give you the number of rows and columns. Also, the length of the matrix is just the rows times the columns.

nrow(matx)

[1] 4

ncol(matx)

[1] 4

length(matx)

[1] 16
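
A short sketch of the "matrix is a vector plus a shape" idea, assuming only base R: `dim` reports rows and columns together, and assigning to `dim` reshapes a plain vector. The variable `v` is just an illustrative name.

```r
vec = 1:16
matx = matrix(vec, nrow = 4, ncol = 4)
dim(matx)          # rows and columns in one call
# giving a plain vector a dim attribute turns it into a matrix
v = 1:16
dim(v) = c(4, 4)
is.matrix(v)
all(v == matx)     # same values in the same positions
```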

Likewise, you can convert the vector to a 4x4 array:

arr = array(vec, dim = c(4,4))
arr

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Two-dimensional arrays are exactly the same as matrices:

str(matx)

 int [1:4, 1:4] 1 2 3 4 5 6 7 8 9 10 ...

str(arr)

 int [1:4, 1:4] 1 2 3 4 5 6 7 8 9 10 ...

In the above, the dim argument specifies the dimensionality of the array. Thus, you can convert the vector to a multidimensional array of 2x2x2x2 as well:

arr = array(vec, dim = c(2,2,2,2))
arr

, , 1, 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2, 1

     [,1] [,2]
[1,]    5    7
[2,]    6    8

, , 1, 2

     [,1] [,2]
[1,]    9   11
[2,]   10   12

, , 2, 2

     [,1] [,2]
[1,]   13   15
[2,]   14   16

Note that this is a four-dimensional object, so you can’t print it without having to “flatten” it in some way.
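
To see how the printed pieces relate to the array, here is a minimal sketch that picks out one element and one 2x2 piece using one index per dimension (leaving a spot blank keeps that whole dimension).

```r
arr = array(1:16, dim = c(2, 2, 2, 2))
dim(arr)          # one length per dimension
arr[2, 1, 2, 1]   # one index per dimension picks a single element
arr[, , 2, 1]     # the 2x2 piece printed under the ", , 2, 1" header
```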

Indexing and slicing

Indexing matrices works similarly to indexing vectors except that you give a list of the elements you want from each dimension. First, suppose that you roll a twenty-sided die 100 times (because you are a super nerd) and you collect the results in a 10x10 matrix:

set.seed(100) # this gives us the same "random" matrix each time
rmatx = matrix(sample(1:20, 100, replace = TRUE), nrow = 10, ncol = 10)
rmatx

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6

To obtain the element in the second row, eighth column,

rmatx[2,8]

[1] 10

You can also get a “slice” of the matrix by using the colon operator:

rmatx[2,1:5]

[1]  6  7  4 13 20

yields the first five elements of the second row. Note that within the entry for the column element, you actually used a vector for the index. Thus, you can give any list of indices in any order you choose. For example,

rmatx[c(1,3,5,7,9),c(2,4,6,8,10)]

     [,1] [,2] [,3] [,4] [,5]
[1,]    7    6   19    5    8
[2,]   11    1   20    7   17
[3,]   12    9    5    9    8
[4,]   19   11   15   11    3
[5,]   18    1   12    9   16

returns a slice of the matrix with only the odd rows and even columns. You can keep all the elements in a specific dimension by just leaving that spot blank. For example, to get the fourth row,

rmatx[4,]

 [1] 19 18 20  9 11 14  5 12  8 12

In general, slicing an array involves giving a list of indices for each dimension of the array.
The magic of data wrangling and analysis with complex data comes in the many creative ways one can create these lists of indices and thus the slices that contain exactly the subset of the data that you want.
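
One detail worth knowing when slicing: a slice with a single row or column is returned as a plain vector by default. The sketch below, using a small example matrix `m` (a made-up name), shows the base R `drop = FALSE` argument that keeps the matrix shape.

```r
m = matrix(1:16, nrow = 4, ncol = 4)
row4 = m[4, ]                     # a single row comes back as a plain vector
dim(row4)                         # NULL: the matrix shape is gone
row4_mat = m[4, , drop = FALSE]   # drop = FALSE keeps it as a 1x4 matrix
dim(row4_mat)
```

This matters mostly when later code expects a matrix, for example when calling `nrow` or `apply` on the result.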

Slicing lists

Recall that lists are like vectors but with potentially multiple types of objects. Thus, they are a little bit more complicated to slice. Take the following list,

x = list(1:3, 4:6, 7:9)
x

[[1]]
[1] 1 2 3

[[2]]
[1] 4 5 6

[[3]]
[1] 7 8 9

which is a list of three vectors each three elements long. To get the first two elements of the list, you would normally do

str(x[c(1,2)])

List of 2
 $ : int [1:3] 1 2 3
 $ : int [1:3] 4 5 6

and (using the str function) you can see that this returned a list of length two containing the first two vectors. In general, the single brackets simply return another list that contains the elements requested, even when you ask for just one element. You can give the single brackets a vector of indices, so it’s natural that you should get a list back. To get the component of the list itself, you must use double brackets:

str(x[[1]])

 int [1:3] 1 2 3
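
The difference between the two kinds of brackets can be checked directly. This is a minimal sketch; `first_as_list` and `first_itself` are illustrative names.

```r
x = list(1:3, 4:6, 7:9)
first_as_list = x[1]     # single brackets: a list of length one
first_itself = x[[1]]    # double brackets: the integer vector itself
is.list(first_as_list)
is.list(first_itself)
sum(x[[1]])              # works, because this is a vector of numbers
# sum(x[1]) would fail, because sum() cannot add up a list
```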

Double brackets will be useful too when we’re dealing with data.frames, which are basically just special lists. Another useful thing with lists (and other objects as we’ll see next week) is that you can name the elements with strings (ideally ones that satisfy R’s object naming rules!). This allows you to access elements of the list using the name instead of the index. For example, if

x = list(a = 1:3, b = 4:6, c = 7:9)
x

$a
[1] 1 2 3

$b
[1] 4 5 6

$c
[1] 7 8 9

then we can access the first element using its name “a” and the “$” operator:

x$a

[1] 1 2 3

which is equivalent to

x[[1]]

[1] 1 2 3

x[["a"]]

[1] 1 2 3

x$"a"

[1] 1 2 3

x$`a`

[1] 1 2 3

A valid R name doesn’t need quotes when used with $, but fancy names that aren’t valid R names need backticks (or quotes):

x = list(`1 fancy name`=1:3,  b = 4:6, c = 7:9)
x$`1 fancy name`

[1] 1 2 3

x[["1 fancy name"]]

[1] 1 2 3
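
Names also work inside single brackets, which gives a smaller list back. The sketch below uses a separate list called `lst` (a made-up name) so that it does not overwrite `x`.

```r
lst = list(a = 1:3, b = 4:6, c = 7:9)
names(lst)                 # the names as a character vector
sub = lst[c("a", "c")]     # single brackets with names: a smaller list
names(sub)
lst[["b"]][2]              # chain indexing: second element of component b
```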

Other ways to slice

Negative integers omit the specified positions:

rmatx

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6

rmatx[-c(1,3,5,7,9),c(2,4,6,8,10)]

     [,1] [,2] [,3] [,4] [,5]
[1,]    7   13   14   10   15
[2,]   18    9   14   12   12
[3,]    3    9   15   11   10
[4,]    8    7    4    3    6
[5,]    2    9   19   16    6

gives the even rows and even columns.
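
Negative indices work the same way on vectors and can be used in more than one dimension at once. A small sketch with made-up objects `v` and `m`:

```r
v = c(10, 20, 30, 40, 50)
v[-1]             # everything except the first element
v[-c(1, 2)]       # everything except the first two
m = matrix(1:16, nrow = 4, ncol = 4)
m[-1, -4]         # drop the first row and the last column
# v[c(-1, 2)] would be an error: positive and negative indices cannot be mixed
```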

Logical vectors select elements of the matrix where the index vector is TRUE.
This is one of the most useful ways to index.
If you want to get only the rolls of the die that were less than 10, then you create a matrix of logicals

rmatx < 10

       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
 [2,]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [3,] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
 [4,] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
 [5,] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
 [6,] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 [7,]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [8,]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
 [9,]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
[10,]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE

and use it to index the matrix:

rmatx[rmatx < 10]

 [1] 6 6 4 6 2 7 7 3 8 2 4 4 7 2 6 1 9 9 9 7 1 9 6 3 4 5 4 3 5 5 7 9 3 9 5 8 8 2
[39] 5 8 8 3 6 6

Notice here that you get a vector back, not a matrix. This is because when you give a single index to a matrix, it treats the matrix like a vector with the first column first, then the second column, etc. (i.e., column-major order). It also makes sense that you don’t get a matrix back since the “< 10” condition could be met anywhere, any number of times, in the matrix (i.e., there is no guarantee that the results would fill a rectangular shape like a matrix).
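
If you need to know where the matching elements sit, not just their values, one option is base R's `which` with `arr.ind = TRUE`. This is a sketch on a small example matrix `m`, not on the die rolls.

```r
m = matrix(1:16, nrow = 4, ncol = 4)
m[m > 12]                            # values only, in column-major order
pos = which(m > 12, arr.ind = TRUE)  # row and column of each match
pos
```

Each row of `pos` gives the row and column position of one matching element.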

You could also just slice rows based on columns (or vice versa). This is the kind of slicing we’ll often do on a data table since we will want all rows (say, results from different experiments) whose column (say, factor in the experiment) matches a certain condition. For example, suppose we want to get all rows of the matrix rmatx with a fourth column whose roll is less than 10. We first slice the fourth column with rmatx[,4] and then compare it to 10,

rmatx

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6

rmatx[,4] < 10

 [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

Note that we get a vector of boolean values indicating whether the element in that row of the fourth column is less than 10. To get only those rows of the matrix rmatx, we then use this vector to slice the matrix:

rmatx[rmatx[,4] < 10,]

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]   10    7   19    6    6   19   16    5   16     8
[2,]   16   11    4    1   14   20    3    7   18    17
[3,]   19   18   20    9   11   14    5   12    8    12
[4,]   14   12   16    9   13    5   12    9    8     8
[5,]   12    3    7    9   14   15   20   11    2    10
[6,]    4    8   11    7   14    4   15    3   12     6
[7,]    6   18   20    1    4   12   10    9    5    16
[8,]    2    2    2    9   19   19   12   16   14     6
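
Conditions on several columns can be combined with `&` (and) or `|` (or) before slicing. The sketch below uses a small made-up matrix `m` so the result is easy to check by eye.

```r
# each line inside c() is one column, since R fills by column
m = matrix(c( 6, 13,  1,  9,
             12,  3, 19,  8,
              5, 15,  2, 11), nrow = 4, ncol = 3)
# rows where both the first and the third column are below 10
keep = m[, 1] < 10 & m[, 3] < 10
keep
m[keep, ]
```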

Mapping and applying

Given that you can slice matrices now, you will at some point want to apply some function to each element of that slice or to each row or column of the matrix. This can be done with a for loop, but there are functions that simplify this. Such functions are called apply functions in R and map functions in Python and Mathematica.

In R, there are actually many apply functions since there are different types of list objects. To see them all, you type

??base::apply

which searches for all functions with “apply” in the description in the “base” package.

If you have a list, the easiest apply function is sapply, which applies a function to each element of a vector or list and returns the output as a vector or matrix if possible. For example, you could sum each vector in the list you created above using the sum function.

x

$`1 fancy name`
[1] 1 2 3

$b
[1] 4 5 6

$c
[1] 7 8 9

sapply(x, sum)

1 fancy name            b            c 
           6           15           24

Note that sapply gave you back a vector but with each element named according to the names of the list elements (handy!).
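
For comparison, here is a sketch of two close relatives of sapply in base R: lapply, which always returns a list, and vapply, which simplifies like sapply but makes you declare the type of each result. The list `lst` and the result names are made up for this example.

```r
lst = list(a = 1:3, b = 4:6, c = 7:9)
as_list = lapply(lst, mean)               # always returns a list
as_vec = sapply(lst, mean)                # simplifies to a named vector when it can
checked = vapply(lst, mean, numeric(1))   # each result must be one number
str(as_list)
as_vec
checked
```

vapply is a reasonable choice inside scripts, since it stops with an error if a result is not the shape you declared.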

To apply a function over the rows or columns of a matrix, use the apply function. To add all the elements in each row (dimension 1) or column (dimension 2) of the matrix matx, you can try

matx

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

apply(matx, 1, sum)

[1] 28 32 36 40

apply(matx, 2, sum)

[1] 10 26 42 58
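
For sums specifically, base R also provides rowSums and colSums, which give the same totals. The sketch below also shows what apply returns when the function gives back more than one value.

```r
matx = matrix(1:16, nrow = 4, ncol = 4)
rowSums(matx)   # same totals as apply(matx, 1, sum)
colSums(matx)   # same totals as apply(matx, 2, sum)
# range returns two values (min and max), so apply returns a matrix
# with one column of results per column of matx
apply(matx, 2, range)
```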

However, you can do more complicated things by making custom functions and “applying” them. For example, in order to get the number of die rolls less than 10 in each row of your die roll matrix, you could try

rmatx

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6

apply(rmatx, 1, function(x) sum(x < 10))

 [1] 5 4 4 3 5 4 3 6 5 5

where you have a small “anonymous” (unnamed) function here that gets a logical vector for each row, TRUE for < 10 and FALSE for >= 10, and sums that vector (TRUE = 1 and FALSE = 0).
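
The second argument of apply decides whether the function sees rows (1) or columns (2). Here is a sketch on a small made-up matrix `m`, with the counting function given a name (`count_low`, chosen for this example) instead of being anonymous.

```r
m = matrix(c( 6, 13,  1,  9,    # column 1
             12,  3, 19,  8,    # column 2
              5, 15,  2, 11),   # column 3
           nrow = 4, ncol = 3)
count_low = function(x) sum(x < 10)   # named version of the anonymous function
apply(m, 1, count_low)   # count per row
apply(m, 2, count_low)   # count per column
colSums(m < 10)          # same per-column counts without apply
```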

Lab

Now let’s slice some real DATA. We’ll return to the COVID-19 data from the New York Times (https://github.com/nytimes/covid-19-data). Below we load the data and transform it from cumulative counts to daily counts of cases. Finally, we select case data for 2021 only.

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

us = read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")

Rows: 51134 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): state, fips
dbl  (2): cases, deaths
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

us_cases_2021_wide = us %>%
  # make each state its own column, keeping one row per date
  pivot_wider(id_cols = date, names_from = state, values_from = cases) %>% 
  # for each state column, subtract the previous day's cumulative count
  mutate(across(where(is.numeric), ~ .x - lag(.x))) %>%
  # keep only 2021 dates
  filter(date >= "2021-01-01" & date <= "2021-12-31")

us_cases_2021 = us_cases_2021_wide %>% 
  # put back into tidy (long) format
  pivot_longer(-date, names_to = "state", values_to = "cases") %>%
  # drop rows with NA cases
  filter(!is.na(cases))

You can get the first few rows with

head(us_cases_2021)

# A tibble: 6 × 3
  date       state         cases
  <date>     <chr>         <dbl>
1 2021-01-01 Washington      521
2 2021-01-01 Illinois        447
3 2021-01-01 California    37951
4 2021-01-01 Arizona        6438
5 2021-01-01 Massachusetts     0
6 2021-01-01 Wisconsin      2085

and compare it to the “wide” version

head(us_cases_2021_wide)

# A tibble: 6 × 57
  date       Washi…¹ Illin…² Calif…³ Arizona Massa…⁴ Wisco…⁵ Texas Nebra…⁶  Utah
  <date>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
1 2021-01-01     521     447   37951    6438       0    2085  7779     772     0
2 2021-01-02    1705   11390   52194    8895    9003    1129 14057     763  5135
3 2021-01-03    5267    4428   35911   17222    3481    2593 16019     707  1726
4 2021-01-04    2771    5354   40723    5158    4920    1626 23104     741  2160
5 2021-01-05    2385    7052   37284   10134    4620    4019 30992    1289  3318
6 2021-01-06    2077    7444   36907    6947    6851    4109 25322    1558  3769
# … with 47 more variables: Oregon <dbl>, Florida <dbl>, `New York` <dbl>,
#   `Rhode Island` <dbl>, Georgia <dbl>, `New Hampshire` <dbl>,
#   `North Carolina` <dbl>, `New Jersey` <dbl>, Colorado <dbl>, Maryland <dbl>,
#   Nevada <dbl>, Tennessee <dbl>, Hawaii <dbl>, Indiana <dbl>, Kentucky <dbl>,
#   Minnesota <dbl>, Oklahoma <dbl>, Pennsylvania <dbl>,
#   `South Carolina` <dbl>, `District of Columbia` <dbl>, Kansas <dbl>,
#   Missouri <dbl>, Vermont <dbl>, Virginia <dbl>, Connecticut <dbl>, …

The data are saved as a tibble, which is really a fancy data.frame, which is a special kind of list that we will discuss in more detail next week. For the moment, you can practice slicing the data.frame as if it were a matrix.
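
As a warm-up, here is a sketch of matrix-style slicing on a tiny data frame. The data frame `toy` and all of its numbers are made up for illustration; it only mimics the column layout of the long-format data.

```r
# toy data with made-up numbers, not the NYT data
toy = data.frame(
  date  = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02")),
  state = c("A", "B", "A", "B"),
  cases = c(10, 200, 30, 400)
)
toy[2, 3]                    # row 2, column 3, just like a matrix
toy[toy$cases > 100, ]       # rows that meet a condition, all columns
# combine conditions with & and pick one column by name
toy[toy$state == "A" & toy$date >= as.Date("2021-01-02"), "cases"]
```

On a tibble, single-bracket slices like these return a tibble rather than dropping to a vector; use double brackets or $ to pull out a column itself.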

Problems

Load the .csv file using the code above and show the R code in your .qmd file for each of the following problems. You should be able to use a few lines of R code to obtain the answer directly (i.e., you shouldn’t just scroll through the data table and give the answer by hand). Use the us_cases_2021 and us_cases_2021_wide data frames; the “long” format data frame us_cases_2021 may be easier to use in some problems whereas the “wide” format data frame us_cases_2021_wide may be easier to use in others.

Which states have had more than 75,000 cases during one day in 2021?
How many cases has Kentucky had during 2021? (hint: use the sum function)
How many cases were there outside of California in the month of August? (hint: you’ll need the boolean & to combine multiple conditions)
Which US state has the second fewest number of cases in 2021? (hint: use sapply with sum and then use the sort function)
Which day in 2021 had the highest number of cases in the country (all states and territories)? (hint: use apply and sum to get the totals for each date. Then use the order function to give you the right order to sort the dates…)