Vectors, slicing, and map(ping)

Author

Jeremy Van Cleve

Published

September 10, 2024

Outline for today

  • Matrices and arrays
  • Indexing and slicing
  • Mapping and applying

Matrices and arrays

Previously, you have seen vectors, which are just lists of objects of a single type. However, you often want a matrix of objects of a single (or multiple!) types or even a higher dimensions group of objects. The two dimensional version of a vector is called a matrix and the n-dimensional version is called an array.

You can think of a matrix as a vector,

vec = 1:16
vec
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

except you’ve specified that there are a certain number of rows and columns:

matx = matrix(vec, nrow = 4, ncol = 4)
matx
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Notice how the vector “filled” each column of the matrix. This is because R is a “column-major” language (Fortran, MATLAB, and Julia are other column-major langauges). Some languages, such as C and Python, and have row-major order.

Since you often deal with matrices and matrix-like objects, R has two functions to give you the number of rows and columns. Also, the length of the matrix is just the rows times the columns.

nrow(matx)
[1] 4
ncol(matx)
[1] 4
length(matx)
[1] 16

Likewise, you can convert the vector to a 4x4 array:

arr = array(vec, dim = c(4,4))
arr
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Two-dimensional arrays are exactly the same as matrices:

str(matx)
 int [1:4, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
str(arr)
 int [1:4, 1:4] 1 2 3 4 5 6 7 8 9 10 ...

In the above, the dim argument specifies the dimensionality of the array. Thus, you can convert the vector to a multidimensional array of 2x2x2x2 as well:

arr = array(vec, dim = c(2,2,2,2))
arr
, , 1, 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2, 1

     [,1] [,2]
[1,]    5    7
[2,]    6    8

, , 1, 2

     [,1] [,2]
[1,]    9   11
[2,]   10   12

, , 2, 2

     [,1] [,2]
[1,]   13   15
[2,]   14   16

Note that this is a four-dimensional object, so you can’t print it without having to “flatten” it in some way.

Indexing and slicing

Indexing matrices works similarly to indexing vectors except that you give a list of the elements you want from each dimension. First, suppose that you roll a twenty-sided die 100 times (because you are a super nerd) and you collect the results in a 10x10 matrix:

set.seed(100) # this gives us the same "random" matrix each time
rmatx = matrix(sample(1:20, 100, replace = TRUE), nrow = 10, ncol = 10)
rmatx
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6

To obtain the element in the second row, eighth column,

rmatx[2,8]
[1] 10

You can also get a “slice” of the matrix by using the colon operator:

rmatx[2,1:5]
[1]  6  7  4 13 20

yields the first five elements of the second row. Note that within the entry for the column element, you actually used a vector for the index. Thus, you can give any list of indices in any order you choose. For example,

rmatx[c(1,3,5,7,9),c(2,4,6,8,10)]
     [,1] [,2] [,3] [,4] [,5]
[1,]    7    6   19    5    8
[2,]   11    1   20    7   17
[3,]   12    9    5    9    8
[4,]   19   11   15   11    3
[5,]   18    1   12    9   16

returns a slice of the matrix with only the odd rows and even columns. You can keep all the elements in a specific dimension by just leaving that spot blank. For example, to get the fourth row,

rmatx[4,]
 [1] 19 18 20  9 11 14  5 12  8 12

In general, slicing an array involves giving a list of indices for each dimension of the array.
The magic of data wrangling and analysis with complex data comes in the many creative ways one can create these lists of indices and thus the slices that contain exactly the subset of the data that you want and often in an order that you specify.

Slicing lists

Recall that lists are like vectors but with potentially multiple types of objects. Thus, they are a little bit more complicated to slice. Take the following list,

x = list(1:3, 4:6, 7:9)
x
[[1]]
[1] 1 2 3

[[2]]
[1] 4 5 6

[[3]]
[1] 7 8 9

which is a list of three vectors each three elements long. To get the first element of the list as if it were a vector, you would try

x[1]
[[1]]
[1] 1 2 3
str(x[1])
List of 1
 $ : int [1:3] 1 2 3

but (using the str function) you can see that actually returned the first vector as a list of length one. Thus, the single brackets simply return another list that contains the elements requested. You can give the single brackets a vector of indices, so its natural that you should get a list back. This is actually no different for vectors; when you slice them, you get vectors back, and for R a single number is just a vector of length one (so everything is consistent).

To get the component of the list itself, you must use double brackets:

str(x[[1]])
 int [1:3] 1 2 3

This will be useful too when we’re dealing with data.frames, which are basically just special lists. Another useful thing with lists (and other objects as we’ll see next week) is that you can name the elements with strings (that satisfy R objects naming rules!). This allows you to access elements of the list using the name instead of the index. For example, if

x = list(a = 1:3, b = 4:6, c = 7:9)
x
$a
[1] 1 2 3

$b
[1] 4 5 6

$c
[1] 7 8 9

then we can access the first element using its name a and the $ operator:

x$a
[1] 1 2 3

This is actually equivalent to

x[[1]]
[1] 1 2 3
x[["a"]]
[1] 1 2 3
x$"a"
[1] 1 2 3
x$`a`
[1] 1 2 3

Any valid R name doesn’t need quotes or backticks for using the $, but you need them for fancy non-R friendly names:

x = list(`1 fancy name`=1:3,  b = 4:6, c = 7:9)
x$"1 fancy name"
[1] 1 2 3
x$`1 fancy name`
[1] 1 2 3
x[["1 fancy name"]]
[1] 1 2 3

Other ways to slice

Negative integers omit the specified positions:

rmatx
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6
rmatx[-c(1,3,5,7,9),c(2,4,6,8,10)]
     [,1] [,2] [,3] [,4] [,5]
[1,]    7   13   14   10   15
[2,]   18    9   14   12   12
[3,]    3    9   15   11   10
[4,]    8    7    4    3    6
[5,]    2    9   19   16    6

gives the even rows and even columns.

Logical vectors selects elements of the matrix where the index vector is TRUE.

This is one of the most useful ways to index.

If you want to get only the rolls of the die that were less than 8, then you create a matrix of logicals (TRUE or FALSE)

rmatx < 8
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
 [2,]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [3,] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
 [4,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [5,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [6,] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [7,]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [8,]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
 [9,]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
[10,]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

and use it to index the matrix:

rmatx[rmatx < 8]
 [1] 6 6 4 6 2 7 7 3 2 4 4 7 2 6 1 7 1 6 3 4 5 4 3 5 5 7 3 5 2 5 3 6 6

Notice here that you get a vector back, not a matrix. This is because when you give a single index to a matrix, it treats the matrix like a vector with the first column first, then the second column, etc (i.e., column-major order). It also makes sense that you don’t get a matrix back since the “< 10” condition could be met anywhere any number of times in the matrix (i.e., no guarantee that the result would be square like a matrix).

You could also just slice rows based on columns (or vice versa). This is the kind of slicing we’ll often do on a data table since we will want all rows (say, results from different experiments) whose column (say, variable or factor in the experiment) matches a certain condition. For example, suppose we want to get all rows of the matrix rmatx with a fourth column whose roll is less than 10. We first slice the fourth column with rmatx[,4] and then compare it to 10,

rmatx
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6
rmatx[,4] < 8
 [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

Note that we get a vector of logical values indicating whether the element in that row of the fourth column is less than 10. To get only those rows of the matrix rmatx, we then use this vector to slice the matrix:

rmatx[rmatx[,4] < 8,]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]   10    7   19    6    6   19   16    5   16     8
[2,]   16   11    4    1   14   20    3    7   18    17
[3,]    4    8   11    7   14    4   15    3   12     6
[4,]    6   18   20    1    4   12   10    9    5    16

We can also then slice to get a subset of the columns. Say we only want the even columns

rmatx[rmatx[,4] < 8, seq(2,10,2)]
     [,1] [,2] [,3] [,4] [,5]
[1,]    7    6   19    5    8
[2,]   11    1   20    7   17
[3,]    8    7    4    3    6
[4,]   18    1   12    9   16

The function seq(start,end,increment) is a handy generalization of the start:end colon operator that allows us to choose a value to increment or sequence of numbers by.

Let’s recap this last way of slicing a matrix since its very common and has the same logic as our later slice operations on data tables. If we want all rows of the matrix mat where column n is equal to x, then our slice statement is mat[ mat[,n]==x, ].

Mapping or applying

Given that you can slice matrices now, you will at some point want to apply some function to each element of that slice or to each row or column of the matrix. This can be done with a for loop like we have already seen, but there are functions designed precisely for this task. Such functions are apply functions in R and map functions in Python, Julia, and Mathematica.

In R, there are actually many apply functions since there are vectors, lists, and other types of objects with multiple elements that one might to iterate over. To see them all, you type

??base::apply

which searches for all functions with “apply” in the description in the “base” package.

If you have a vector or list, the easiest apply function is sapply, which applies a function to each element of a vector or list and returns the output as a vector or matrix if possible. For example, you could sum each vector in the list you created above using the sum function.

x
$`1 fancy name`
[1] 1 2 3

$b
[1] 4 5 6

$c
[1] 7 8 9
sapply(x, sum)
1 fancy name            b            c 
           6           15           24 

Note that sapply gave you back a vector but with each element named according to the names of the list elements (handy!). The s in sapply stands (I think…) for simplify. The function lapply works like sapply but returns a list (hence l(ist)apply),

lapply(x, sum)
$`1 fancy name`
[1] 6

$b
[1] 15

$c
[1] 24

so sapply is like taking the output of lapply and converting it to a vector if possible.

If you want to use apply over a matrix, the apply function is required. To add all the elements in each row (dimension 1) or column (dimension 2) of the matrix matx, you can try

rmatx
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6
apply(rmatx, 1, sum)
 [1] 112 112 111 128 106 103 113  84 101 101
apply(rmatx, 2, sum)
 [1]  95 105 119  75 118 137 121  93 107 101

However, you can do more complicated things by making custom functions and “applying” them. For example, in order to get the number of die rolls less than 8 in each row of your die roll matrix, you could try

rmatx
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   10    7   19    6    6   19   16    5   16     8
 [2,]    6    7    4   13   20   14   18   10    5    15
 [3,]   16   11    4    1   14   20    3    7   18    17
 [4,]   19   18   20    9   11   14    5   12    8    12
 [5,]   14   12   16    9   13    5   12    9    8     8
 [6,]   12    3    7    9   14   15   20   11    2    10
 [7,]    6   19   16   11    3   15   10   11   19     3
 [8,]    4    8   11    7   14    4   15    3   12     6
 [9,]    6   18   20    1    4   12   10    9    5    16
[10,]    2    2    2    9   19   19   12   16   14     6
apply(rmatx, 1, \(x) sum(x < 8))
 [1] 4 4 4 1 1 3 3 5 4 4

where we use the anonymous function syntax from last time that creates a function that gets a logical vector for each row, TRUE for < 8 and FALSE for >= 8, and sums that vector (TRUE = 1 and FALSE = 0). For the number of rolls less than 8 in each column, we do

apply(rmatx, 2, \(x) sum(x < 8))
 [1] 5 4 4 4 3 2 2 3 3 3

Finally, what if we wanted the number of rolls less than 8 in the whole matrix? This is actually easier than getting the answer for rows or columns. This is because if compare rmat to 8,

rmatx < 8
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
 [2,]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [3,] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
 [4,] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [5,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [6,] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [7,]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [8,]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
 [9,]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
[10,]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

we get a matrix of logicals. We can then just use sum, which sums over all the elements of the matrix (recall: a matrix is really just a vector where you line up the columns in a long list),

sum(rmatx < 8)
[1] 33

which matches of course other ways of doing the same calculation, say first getting the number of elements less than 8 for each row and then summing,

sum(apply(rmatx, 1, \(x) sum(x < 8)))
[1] 33

Lab

Data

Now let’s slice some real DATA. We’ll use data on respiratory infections including COVID-19 from the Centers for Disease Control (CDC). These data are split up between hospitalizations and deaths. The data are accessible from a CDC database online, and we load them in below and combine them (see the code for an example of some of the data wrangling necessary to do this).

library(tidyverse)
library(RSocrata) 

# Read in hospitalization and deaths
us_hosps  = read.socrata("https://data.cdc.gov/api/odata/v4/aemt-mg7g") |> as_tibble()
us_deaths = read.socrata("https://data.cdc.gov/api/odata/v4/r8kw-7aab") |> as_tibble()
#us_hosps  = read_csv("US_COVID19_Hosps_ByWeek_ByState_20240125.csv")
#us_deaths = read_csv("US_COVID19_Deaths_ByWeek_ByState_20240125.csv")

us_hosps_deaths = 
  us_deaths |> 
  rename(week_end_date = week_ending_date) |> # rename this column to match column in `us_hosps`
  select(-c(`data_as_of`, `start_date`, `end_date`, group, year, month, mmwr_week, footnote)) |> # get rid of excess columns in deaths table
  inner_join( # join the two tables together
    us_hosps |>
      rename(state_abbrv = jurisdiction) |> # `us_hosps` has states as abbreviations so we'll need to add full state names
      inner_join(tibble(state_abbrv = state.abb, state = state.name) |> 
                  add_row(state_abbrv = c("USA", "DC", "PR"), state = c("United States", "District of Columbia", "Puerto Rico")))) |>
  filter(state != "United States")

You can get the first few rows with

head(us_hosps_deaths)
# A tibble: 6 × 72
  week_end_date       state  covid_19_deaths total_deaths percent_of_expected_…¹
  <dttm>              <chr>            <int>        <int>                  <dbl>
1 2020-08-08 00:00:00 Alaba…             264         1379                    143
2 2020-08-15 00:00:00 Alaba…             230         1305                    137
3 2020-08-22 00:00:00 Alaba…             209         1303                    140
4 2020-08-29 00:00:00 Alaba…             185         1216                    127
5 2020-09-05 00:00:00 Alaba…             156         1216                    125
6 2020-09-12 00:00:00 Alaba…             138         1232                    128
# ℹ abbreviated name: ¹​percent_of_expected_deaths
# ℹ 67 more variables: pneumonia_deaths <int>,
#   pneumonia_and_covid_19_deaths <int>, influenza_deaths <int>,
#   pneumonia_influenza_or_covid_19_deaths <int>, state_abbrv <chr>,
#   weekly_actual_days_reporting_any_data <int>,
#   weekly_percent_days_reporting_any_data <dbl>,
#   num_hospitals_previous_day_admission_adult_covid_confirmed <int>, …

The data are saved as a tibble, which is really a fancy data.frame, which is a special kind of list that we will discuss in more detail next week. For this lab, we will practice slicing the data.frame as if it were a matrix.

Please include the code chunk above that loads in the data and creates the us_hosps_deaths data table as the first code chunk in the markdown file that you submit for this assignment. That way, the table will be present when I run the code in your markdown file.

Problems

Use us_hosps_deaths data table from the code above for each of the problems belove. You should be able to use a few lines of R code to obtain the answer directly (i.e., you shouldn’t just scroll though the data table and give the answer by hand). Use the vector indexing operations discussed this week to do your slicing (even if you know another way using dplyr or other R packages).

This is a big table so use the str(us_hosps_deaths) function to get information on all the columns and view(us_hosps_deaths) will bring up the GUI viewer of the table. Also, make use of the links above in the “Data” section to the CDC website to see more info on what data each column contains.

  1. Which state had the most deaths during one week in 2021? Which week was that?
    (hint: remember you can compare dates: e.g., “2021-01-02” > “2021-01-01” is TRUE, etc.)
    (hint: you’ll need the boolean &, single ampersand, to combine multiple conditions when they involve vectors)
    (hint: your solution might use the which.max function)

  2. How many COVID-19 deaths has Kentucky had during 2021?
    (hint: use the sum function)

  3. How many COVID-19 deaths were there outside of California in the month of August 2021?

  4. Which US state has the second fewest COVID-19 hospitalizations in 2021? Use the column total_admissions_all_covid_confirmed
    (hint: get a vector of all the (unique) states)
    (hint: use sapply on that vector with an anonymous function that sums the hospitalization column for the right state and dates)
    (hint: use sort to help you find the second fewest hospitalizations)

  5. Which week in 2021 had the highest number of hospitalizations in the country (across all states and territories)? Use the column total_admissions_all_covid_confirmed
    (hint: use sapply and sum to get the totals for each date in a way similar to question 4. Then use the order function to give you the right order to sort the dates…)