Vectors, slicing, and map(ping)

Author

Jeremy Van Cleve

Published

January 23, 2024

Outline for today

Matrices and arrays
Indexing and slicing
Mapping and applying

Matrices and arrays

Previously, you have seen vectors, which are just lists of objects of a single type. However, you often want a matrix of objects of a single (or multiple!) types or even a higher-dimensional group of objects. The two-dimensional version of a vector is called a matrix and the n-dimensional version is called an array.

You can think of a matrix as a vector,

vec = 1:16
vec

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

except you’ve specified that there are a certain number of rows and columns:

matx = matrix(vec, nrow = 4, ncol = 4)
matx

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Notice how the vector “filled” each column of the matrix. This is because R is a “column-major” language (Fortran, MATLAB, and Julia are other column-major languages). Some languages, such as C and Python, have row-major order.
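If you would rather fill the matrix one row at a time, the matrix function takes a byrow argument. A minimal sketch of the difference (the names by_col and by_row are just for illustration):

```r
vec = 1:16

# Default: values fill down each column (column-major order)
by_col = matrix(vec, nrow = 4, ncol = 4)

# byrow = TRUE: values fill across each row instead
by_row = matrix(vec, nrow = 4, ncol = 4, byrow = TRUE)

by_col[1, ]  # first row is 1 5 9 13
by_row[1, ]  # first row is 1 2 3 4

# The two layouts are transposes of one another
all(by_row == t(by_col))  # TRUE
```

Either way, R still stores the result in column-major order internally; byrow only changes how the input vector is laid out.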

Since you often deal with matrices and matrix-like objects, R has two functions to give you the number of rows and columns. Also, the length of the matrix is just the rows times the columns.

nrow(matx)

[1] 4

ncol(matx)

[1] 4

length(matx)

[1] 16
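The dim function returns both numbers at once, which can be handy when you don’t know ahead of time how many dimensions an object has. A short sketch:

```r
matx = matrix(1:16, nrow = 4, ncol = 4)

# dim returns the number of rows and columns together
dim(matx)                         # 4 4

# so the length is just the product of the dimensions
prod(dim(matx)) == length(matx)   # TRUE

# a plain vector has no dimensions, so dim returns NULL
dim(1:16)                         # NULL
```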

Likewise, you can convert the vector to a 4x4 array:

arr = array(vec, dim = c(4,4))
arr

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Two-dimensional arrays are exactly the same as matrices:

str(matx)

 int [1:4, 1:4] 1 2 3 4 5 6 7 8 9 10 ...

str(arr)

 int [1:4, 1:4] 1 2 3 4 5 6 7 8 9 10 ...

In the above, the dim argument specifies the dimensionality of the array. Thus, you can convert the vector to a multidimensional array of 2x2x2x2 as well:

arr = array(vec, dim = c(2,2,2,2))
arr

, , 1, 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2, 1

     [,1] [,2]
[1,]    5    7
[2,]    6    8

, , 1, 2

     [,1] [,2]
[1,]    9   11
[2,]   10   12

, , 2, 2

     [,1] [,2]
[1,]   13   15
[2,]   14   16

Note that this is a four-dimensional object, so you can’t print it without having to “flatten” it in some way.
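The headers in the printout (such as “, , 2, 1”) are the indices of the third and fourth dimensions for each 2x2 block. As a small sketch, you could pull out a single block or a single element yourself (the name block is just for illustration):

```r
arr = array(1:16, dim = c(2, 2, 2, 2))

# One index per dimension picks out a single element
arr[2, 1, 2, 1]    # 6

# Leaving the first two dimensions blank returns the 2x2 block
# printed above under ", , 2, 1"
block = arr[, , 2, 1]
block
dim(block)         # 2 2

dim(arr)           # 2 2 2 2
```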

Indexing and slicing

Indexing matrices works similarly to indexing vectors except that you give a list of the elements you want from each dimension. First, suppose that you roll a twenty-sided die 100 times (because you are a super nerd) and you collect the results in a 10x10 matrix:

set.seed(100) # this gives us the same "random" matrix each time
rmatx = matrix(sample(1:20, 100, replace = TRUE), nrow = 10, ncol = 10)
rmatx

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6

To obtain the element in the second row, eighth column,

rmatx[2,8]

[1] 10

You can also get a “slice” of the matrix by using the colon operator:

rmatx[2,1:5]

[1]  6  7  4 13 20

yields the first five elements of the second row. Note that within the entry for the column element, you actually used a vector for the index. Thus, you can give any list of indices in any order you choose. For example,

rmatx[c(1,3,5,7,9),c(2,4,6,8,10)]

     [,1] [,2] [,3] [,4] [,5]
[1,]    7    6   19    5    8
[2,]   11    1   20    7   17
[3,]   12    9    5    9    8
[4,]   19   11   15   11    3
[5,]   18    1   12    9   16

returns a slice of the matrix with only the odd rows and even columns. You can keep all the elements in a specific dimension by just leaving that spot blank. For example, to get the fourth row,

rmatx[4,]

 [1] 19 18 20  9 11 14  5 12  8 12

In general, slicing an array involves giving a list of indices for each dimension of the array.
The magic of data wrangling and analysis with complex data comes in the many creative ways one can create these lists of indices and thus the slices that contain exactly the subset of the data that you want and often in an order that you specify.
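One detail worth knowing: when a slice has only one row or one column, R drops the matrix structure and hands back a plain vector (as with rmatx[4,] above). The sketch below uses a small made-up matrix m to show this, along with the drop = FALSE argument that keeps the result a matrix and the fact that indices can be reordered or repeated.

```r
m = matrix(1:12, nrow = 3, ncol = 4)

# A single row comes back as a plain vector: the matrix structure is dropped
row2 = m[2, ]
dim(row2)          # NULL

# drop = FALSE keeps the result as a 1x4 matrix
row2_mat = m[2, , drop = FALSE]
dim(row2_mat)      # 1 4

# Indices can be reordered or repeated:
# rows 3, 1, 1 with the columns reversed
reordered = m[c(3, 1, 1), 4:1]
reordered
```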

Slicing lists

Recall that lists are like vectors but with potentially multiple types of objects. Thus, they are a little bit more complicated to slice. Take the following list,

x = list(1:3, 4:6, 7:9)
x

[[1]]
[1] 1 2 3

[[2]]
[1] 4 5 6

[[3]]
[1] 7 8 9

which is a list of three vectors each three elements long. To get the first element of the list as if it were a vector, you would try

x[1]

[[1]]
[1] 1 2 3

str(x[1])

List of 1
 $ : int [1:3] 1 2 3

but (using the str function) you can see that this actually returned the first vector as a list of length one. Thus, the single brackets simply return another list that contains the elements requested. You can give the single brackets a vector of indices, so it’s natural that you should get a list back. This is actually no different for vectors; when you slice them, you get vectors back, and for R a single number is just a vector of length one (so everything is consistent).

To get the component of the list itself, you must use double brackets:

str(x[[1]])

 int [1:3] 1 2 3
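To make the contrast concrete, here is a short sketch of single versus double brackets on the same list (the name sub is just for illustration):

```r
x = list(1:3, 4:6, 7:9)

# Single brackets always return a list, even for several elements
sub = x[c(1, 3)]
length(sub)        # 2
is.list(sub)       # TRUE

# Double brackets return the component itself, which can then be indexed
x[[2]]             # 4 5 6
x[[2]][3]          # 6
```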

This will be useful too when we’re dealing with data.frames, which are basically just special lists. Another useful thing with lists (and other objects as we’ll see next week) is that you can name the elements with strings (that satisfy R’s object naming rules!). This allows you to access elements of the list using the name instead of the index. For example, if

x = list(a = 1:3, b = 4:6, c = 7:9)
x

$a
[1] 1 2 3

$b
[1] 4 5 6

$c
[1] 7 8 9

then we can access the first element using its name a and the $ operator:

x$a

[1] 1 2 3

This is actually equivalent to

x[[1]]

[1] 1 2 3

x[["a"]]

[1] 1 2 3

x$"a"

[1] 1 2 3

x$`a`

[1] 1 2 3

Any valid R name doesn’t need quotes or backticks for using the $, but you need them for fancy non-R friendly names:

x = list(`1 fancy name`=1:3,  b = 4:6, c = 7:9)
x$"1 fancy name"

[1] 1 2 3

x$`1 fancy name`

[1] 1 2 3

x[["1 fancy name"]]

[1] 1 2 3
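One practical difference between $ and double brackets shows up when the name you want is stored in a variable. A small sketch (the variable nm is hypothetical):

```r
x = list(`1 fancy name` = 1:3, b = 4:6, c = 7:9)

# names lists the element names
names(x)                 # "1 fancy name" "b" "c"

# A name stored in a variable works with double brackets ...
nm = "b"
x[[nm]]                  # 4 5 6

# ... but $ looks for an element literally called "nm", which does not exist
x$nm                     # NULL
```

So double brackets are usually the safer choice inside functions or loops where the element name is computed rather than typed.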

Other ways to slice

Negative integers omit the specified positions:

rmatx

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6

rmatx[-c(1,3,5,7,9),c(2,4,6,8,10)]

     [,1] [,2] [,3] [,4] [,5]
[1,]    7   13   14   10   15
[2,]   18    9   14   12   12
[3,]    3    9   15   11   10
[4,]    8    7    4    3    6
[5,]    2    9   19   16    6

gives the even rows and even columns.
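Negative indices work the same way on plain vectors. A quick sketch with a made-up vector v, including what happens if you try to mix positive and negative indices:

```r
v = c(10, 20, 30, 40, 50)

v[-1]            # drop the first element: 20 30 40 50
v[-length(v)]    # drop the last element: 10 20 30 40
v[-c(2, 4)]      # drop the second and fourth: 10 30 50

# Positive and negative indices cannot be mixed in one index vector;
# v[c(-1, 2)] stops with an error, which is caught here for illustration
mixed = tryCatch(v[c(-1, 2)], error = function(e) "error")
mixed            # "error"
```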

Logical vectors select elements of the matrix where the index vector is TRUE.

This is one of the most useful ways to index.

If you want to get only the rolls of the die that were less than 8, then you create a matrix of logicals (TRUE or FALSE)

rmatx < 8

       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
 [2,]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [3,] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
 [4,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [5,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [6,] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [7,]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [8,]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
 [9,]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
[10,]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

and use it to index the matrix:

rmatx[rmatx < 8]

 [1] 6 6 4 6 2 7 7 3 2 4 4 7 2 6 1 7 1 6 3 4 5 4 3 5 5 7 3 5 2 5 3 6 6

Notice here that you get a vector back, not a matrix. This is because when you give a single index to a matrix, it treats the matrix like a vector with the first column first, then the second column, etc. (i.e., column-major order). It also makes sense that you don’t get a matrix back since the “< 8” condition could be met anywhere any number of times in the matrix (i.e., there is no guarantee that the result would be rectangular like a matrix).
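If you also need to know where the matching values sit in the matrix, the which function can report their positions. The sketch below uses a small made-up matrix m rather than rmatx so the output is easy to check by eye; pos is just an illustrative name.

```r
m = matrix(c(3, 12, 9, 5, 15, 7), nrow = 2, ncol = 3)
m

# Logical indexing returns the matching values in column-major order
m[m < 8]                      # 3 5 7

# which gives the positions of those values in the flattened matrix ...
which(m < 8)                  # 1 4 6

# ... and arr.ind = TRUE reports them as (row, column) pairs instead
pos = which(m < 8, arr.ind = TRUE)
pos
```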

You could also just slice rows based on columns (or vice versa). This is the kind of slicing we’ll often do on a data table since we will want all rows (say, results from different experiments) whose column (say, variable or factor in the experiment) matches a certain condition. For example, suppose we want to get all rows of the matrix rmatx with a fourth column whose roll is less than 8. We first slice the fourth column with rmatx[,4] and then compare it to 8,

rmatx

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6

rmatx[,4] < 8

 [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

Note that we get a vector of logical values indicating whether the element in that row of the fourth column is less than 8. To get only those rows of the matrix rmatx, we then use this vector to slice the matrix:

rmatx[rmatx[,4] < 8,]

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]   10    7   19    6    6   19   16    5   16     8
[2,]   16   11    4    1   14   20    3    7   18    17
[3,]    4    8   11    7   14    4   15    3   12     6
[4,]    6   18   20    1    4   12   10    9    5    16

We can also then slice to get a subset of the columns. Say we only want the even columns

rmatx[rmatx[,4] < 8, seq(2,10,2)]

     [,1] [,2] [,3] [,4] [,5]
[1,]    7    6   19    5    8
[2,]   11    1   20    7   17
[3,]    8    7    4    3    6
[4,]   18    1   12    9   16

The function seq(start,end,increment) is a handy generalization of the start:end colon operator that allows us to choose the increment between successive numbers in the sequence.
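In R’s own documentation, the three arguments are named from, to, and by, and there is also a length.out argument if you would rather fix how many numbers you get. A few hedged examples of what seq can do:

```r
seq(2, 10, 2)                     # 2 4 6 8 10
seq(from = 10, to = 2, by = -2)   # counting down: 10 8 6 4 2
seq(1, 10, by = 3)                # 1 4 7 10
seq(0, 1, length.out = 5)         # 0.00 0.25 0.50 0.75 1.00
```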

Mapping or applying

Given that you can slice matrices now, you will at some point want to apply some function to each element of that slice or to each row or column of the matrix. This can be done with a for loop, but there are functions that simplify this. Such functions are apply functions in R and map functions in Python, Julia, and Mathematica.

In R, there are actually many apply functions since there are vectors, lists, and other types of objects with multiple elements that one might want to iterate over. To see them all, you type

??base::apply

which searches for all functions with “apply” in the description in the “base” package.

If you have a vector or list, the easiest apply function is sapply, which applies a function to each element of a vector or list and returns the output as a vector or matrix if possible. For example, you could sum each vector in the list you created above using the sum function.

x

$`1 fancy name`
[1] 1 2 3

$b
[1] 4 5 6

$c
[1] 7 8 9

sapply(x, sum)

1 fancy name            b            c 
           6           15           24

Note that sapply gave you back a vector but with each element named according to the names of the list elements (handy!). The s in sapply stands (I think…) for simplify. The function lapply works like sapply but returns a list (hence l(ist)apply),

lapply(x, sum)

$`1 fancy name`
[1] 6

$b
[1] 15

$c
[1] 24

so sapply is like taking the output of lapply and converting it to a vector if possible.
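To see what these functions save you, here is a sketch of the same calculation written as a for loop, next to the sapply and lapply versions. It uses a freshly defined list x with simple names, and totals is just an illustrative name.

```r
x = list(a = 1:3, b = 4:6, c = 7:9)

# The for loop version: make an empty result, then fill it in one element at a time
totals = numeric(length(x))
names(totals) = names(x)
for (i in seq_along(x)) {
  totals[i] = sum(x[[i]])
}
totals                      # a = 6, b = 15, c = 24

# sapply does the same work in one line
sapply(x, sum)

# and flattening the list from lapply gives the same vector
unlist(lapply(x, sum))
```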

If you want to apply a function over a matrix, you need the apply function. To add all the elements in each row (dimension 1) or column (dimension 2) of the matrix rmatx, you can try

rmatx

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6

apply(rmatx, 1, sum)

 [1] 112 112 111 128 106 103 113  84 101 101

apply(rmatx, 2, sum)

 [1]  95 105 119  75 118 137 121  93 107 101
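For the specific case of sums, base R also has the shortcut functions rowSums and colSums, which give the same totals. The sketch below checks this on a small made-up matrix m and also shows that apply accepts other functions such as max.

```r
m = matrix(1:6, nrow = 2, ncol = 3)
m
# rows of m: 1 3 5 and 2 4 6

apply(m, 1, sum)   # 9 12
rowSums(m)         # 9 12

apply(m, 2, sum)   # 3 7 11
colSums(m)         # 3 7 11

# apply works with any function that takes a vector
apply(m, 1, max)   # 5 6
```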

However, you can do more complicated things by making custom functions and “applying” them. For example, in order to get the number of die rolls less than 8 in each row of your die roll matrix, you could try

rmatx

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6

apply(rmatx, 1, \(x) sum(x < 8))

 [1] 4 4 4 1 1 3 3 5 4 4

where we use the anonymous function syntax from last time that creates a function that gets a logical vector for each row, TRUE for < 8 and FALSE for >= 8, and sums that vector (TRUE = 1 and FALSE = 0). For the number of rolls less than 8 in each column, we do

apply(rmatx, 2, \(x) sum(x < 8))

 [1] 5 4 4 4 3 2 2 3 3 3

Finally, what if we wanted the number of rolls less than 8 in the whole matrix? This is actually easier than getting the answer for rows or columns. This is because if you compare rmatx to 8,

rmatx < 8

       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
 [2,]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [3,] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
 [4,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [5,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [6,] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [7,]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [8,]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
 [9,]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
[10,]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

we get a matrix of logicals. We can then just use sum, which sums over all the elements of the matrix (recall: a matrix is really just a vector where you line up the columns in a long list),

sum(rmatx < 8)

[1] 33

which of course matches other ways of doing the same calculation, such as first getting the number of elements less than 8 for each row and then summing,

sum(apply(rmatx, 1, \(x) sum(x < 8)))

[1] 33
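Because TRUE counts as 1 and FALSE as 0, the same trick works with other summary functions. As a sketch on a small made-up matrix m, mean gives the fraction of elements that meet the condition, and rowSums and colSums give the per-row and per-column counts:

```r
m = matrix(c(3, 12, 9, 5, 15, 7), nrow = 2, ncol = 3)

sum(m < 8)        # number of elements less than 8: 3
mean(m < 8)       # fraction of elements less than 8: 0.5

rowSums(m < 8)    # count per row: 1 2
colSums(m < 8)    # count per column: 1 1 1
```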

Lab

Data

Now let’s slice some real DATA. We’ll return to COVID-19 data from the CDC. These data are split up between hospitalizations and deaths. The data are saved locally, and we load them in below and combine them (see the code for an example of some of the data wrangling necessary to do this).

library(tidyverse)

# Read in hospitalization and deaths
us_hosps  = read_csv("US_COVID19_Hosps_ByWeek_ByState_20240125.csv")
us_deaths = read_csv("US_COVID19_Deaths_ByWeek_ByState_20240125.csv")

us_hosps_deaths = 
  us_deaths |> 
  mutate(`Week Ending Date` = mdy(`Week Ending Date`)) |> # death table dates need conversion
  rename(week_ending_date = `Week Ending Date`, state = State) |> # rename columns so they match across the two tables
  select(-c(`Data as of`, `Start Date`, `End Date`, Group, Year, Month, `MMWR Week`, Footnote)) |> # get rid of excess columns in deaths table
  inner_join( # join the two tables together
    us_hosps |>
      rename(state_abbrv = state) |> # hosps has states as abbreviations so we'll need to add full state names
      left_join(tibble(state_abbrv = state.abb, state = state.name) |> 
                  add_row(state_abbrv = c("USA", "DC", "PR"), state = c("United States", "District of Columbia", "Puerto Rico")))) |>
  filter(state != "United States")

You can get the first few rows with

head(us_hosps_deaths)

# A tibble: 6 × 20
  week_ending_date state `COVID-19 Deaths` `Total Deaths` Percent of Expected …¹
  <date>           <chr>             <dbl>          <dbl>                  <dbl>
1 2020-08-08       Alab…               264           1379                    143
2 2020-08-15       Alab…               230           1305                    137
3 2020-08-22       Alab…               209           1303                    140
4 2020-08-29       Alab…               185           1216                    127
5 2020-09-05       Alab…               156           1216                    125
6 2020-09-12       Alab…               138           1232                    128
# ℹ abbreviated name: ¹`Percent of Expected Deaths`
# ℹ 15 more variables: `Pneumonia Deaths` <dbl>,
#   `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
#   `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, state_abbrv <chr>,
#   avg_adm_all_covid_confirmed <dbl>,
#   pct_chg_avg_adm_all_covid_confirmed_per_100k <dbl>,
#   total_adm_all_covid_confirmed_past_7days <dbl>, …

The data are saved as a tibble, which is really a fancy data.frame, which is a special kind of list that we will discuss in more detail next week. For the moment, you can practice slicing the data.frame as if it were a matrix.
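As a rough sketch of what matrix-style slicing looks like on a table, the example below uses a tiny data.frame called toy with made-up numbers (it is not the CDC data, and the deaths column is hypothetical). Rows are chosen with a logical vector and columns by name. One caveat: with a tibble, a single-column slice such as us_hosps_deaths[, 1] stays a tibble, so use $ or double brackets when you need a plain vector.

```r
# Toy data with made-up numbers (not the CDC data)
toy = data.frame(
  week_ending_date = as.Date(c("2020-12-26", "2021-01-02", "2021-01-09", "2021-01-02")),
  state = c("A", "A", "A", "B"),
  deaths = c(5, 9, 7, 4)
)

# Rows are selected with a logical vector, just as with rmatx[rmatx[,4] < 8, ]
in_2021 = toy$week_ending_date >= as.Date("2021-01-01")
toy[in_2021, ]

# Conditions are combined with &, and columns can be picked by name
a_2021 = toy[in_2021 & toy$state == "A", c("week_ending_date", "deaths")]
a_2021

# A single column is extracted as a vector with $ (or [[ ]])
sum(a_2021$deaths)      # 16
```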

Problems

Use the code above in your problem set to load the .csv files and combine the two tables into the single us_hosps_deaths table. For each of the problems, show the R code in your .qmd file. You should be able to use a few lines of R code to obtain the answer directly (i.e., you shouldn’t just scroll through the data table and give the answer by hand). Use the vector indexing operations above to do your slicing (even if you know another way using dplyr or other R packages).

This is a big table, so use the str(us_hosps_deaths) function to get information on all the columns; view(us_hosps_deaths) will bring up a GUI viewer of the table. Also, make use of the links above in the “Data” section to the CDC website to see more info on what data each column contains.

1. Which state had the most deaths during one week in 2021? Which week was that?
   (hint: remember you can compare dates: e.g., “2021-01-02” > “2021-01-01” is TRUE, etc.)
   (hint: you’ll need the boolean & to combine multiple conditions)
   (hint: your solution might use the which.max function)
2. How many COVID-19 deaths did Kentucky have during 2021?
   (hint: use the sum function)
3. How many COVID-19 deaths were there outside of California in the month of August 2021?
4. Which US state had the second fewest COVID-19 hospitalizations in 2021? Use the column total_adm_all_covid_confirmed_past_7days
   (hint: get a vector of all the (unique) states)
   (hint: use sapply on that vector with an anonymous function that sums the hospitalization column for the right state and dates)
   (hint: use sort to help you find the second fewest hospitalizations)
5. Which week in 2021 had the highest number of hospitalizations in the country (across all states and territories)? Use the column total_adm_all_covid_confirmed_past_7days
   (hint: use sapply and sum to get the totals for each date in a way similar to question 4. Then use the order function to give you the right order to sort the dates…)