Joins and Databases

Author

Jeremy Van Cleve

Published

February 13, 2024

Outline for today

  • Putting data frames together: joins
  • Using dplyr to talk to databases

Joining data

Joining data can be a superpower

Remember the CDC COVID-19 data on hospitalization and deaths we used a few weeks back? Here it is:

library(tidyverse)

# Read in hospitalization and deaths
us_hosps  = read_csv("US_COVID19_Hosps_ByWeek_ByState_20240125.csv")
us_deaths = read_csv("US_COVID19_Deaths_ByWeek_ByState_20240125.csv") |> mutate(`Week Ending Date` = mdy(`Week Ending Date`)) # death table dates need conversion

us_hosps_deaths = 
  us_deaths |> 
  rename(week_ending_date = `Week Ending Date`, state = State) |> # rename columns so they match across the two tables
  select(-c(`Data as of`, `Start Date`, `End Date`, Group, Year, Month, `MMWR Week`, Footnote)) |> # get rid of excess columns in deaths table
  inner_join( # join the two tables together
    us_hosps |>
      rename(state_abbrv = state) |> # hosps has states as abbreviations so we'll need to add full state names
      left_join(tibble(state_abbrv = state.abb, state = state.name) |> 
                  add_row(state_abbrv = c("USA", "DC", "PR"), state = c("United States", "District of Columbia", "Puerto Rico")))) |>
  filter(state != "United States")

Looking at the read_csv commands, you’ll notice that we start with two separate csv files, one for hospitalizations and one for deaths; the CDC provides these tables separately. Putting these data tables together requires “joining” them. Not only that, we have to do some extra work just to make the join possible, which we accomplish with another join. By learning about joins, we’ll not only be able to understand this example, we’ll be able to combine data from all kinds of places.

Keys

One key to understanding joins (pardon the pun 😜) is understanding “keys”. Joining data tables is really about variables the tables have in common: two tables can be “joined” by matching rows that share the same values of those common variables. These common variables are called keys. We can look for common variables in us_hosps and us_deaths:

us_hosps
# A tibble: 11,913 × 12
   week_ending_date state    avg_adm_all_covid_confirmed pct_chg_avg_adm_all_c…¹
   <date>           <chr>                          <dbl>                   <dbl>
 1 2020-12-05       Region 6                       1879                     4.4 
 2 2020-12-12       Region 6                       1949.                    3.7 
 3 2020-12-19       Region 6                       2042                     4.79
 4 2020-12-26       Region 6                       2231.                    9.27
 5 2021-01-02       Region 6                       2602.                   16.6 
 6 2021-01-09       Region 6                       2799.                    7.58
 7 2021-01-16       Region 6                       2627.                   -6.14
 8 2021-01-23       Region 6                       2327.                  -11.4 
 9 2021-01-30       Region 6                       2015.                  -13.4 
10 2021-02-06       Region 6                       1724.                  -14.4 
# ℹ 11,903 more rows
# ℹ abbreviated name: ¹​pct_chg_avg_adm_all_covid_confirmed_per_100k
# ℹ 8 more variables: total_adm_all_covid_confirmed_past_7days <dbl>,
#   total_adm_all_covid_confirmed_past_7days_per_100k <dbl>,
#   sum_adm_all_covid_confirmed <dbl>,
#   avg_total_patients_hospitalized_covid_confirmed <dbl>,
#   avg_percent_inpatient_beds_occupied_covid_confirmed <dbl>, …
us_deaths
# A tibble: 14,364 × 17
   `Data as of` `Start Date` `End Date` Group   Year      Month `MMWR Week`
   <chr>        <chr>        <chr>      <chr>   <chr>     <dbl>       <dbl>
 1 01/18/2024   12/29/2019   01/04/2020 By Week 2019/2020    NA           1
 2 01/18/2024   01/05/2020   01/11/2020 By Week 2020         NA           2
 3 01/18/2024   01/12/2020   01/18/2020 By Week 2020         NA           3
 4 01/18/2024   01/19/2020   01/25/2020 By Week 2020         NA           4
 5 01/18/2024   01/26/2020   02/01/2020 By Week 2020         NA           5
 6 01/18/2024   02/02/2020   02/08/2020 By Week 2020         NA           6
 7 01/18/2024   02/09/2020   02/15/2020 By Week 2020         NA           7
 8 01/18/2024   02/16/2020   02/22/2020 By Week 2020         NA           8
 9 01/18/2024   02/23/2020   02/29/2020 By Week 2020         NA           9
10 01/18/2024   03/01/2020   03/07/2020 By Week 2020         NA          10
# ℹ 14,354 more rows
# ℹ 10 more variables: `Week Ending Date` <date>, State <chr>,
#   `COVID-19 Deaths` <dbl>, `Total Deaths` <dbl>,
#   `Percent of Expected Deaths` <dbl>, `Pneumonia Deaths` <dbl>,
#   `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
#   `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Footnote <chr>

We notice that the two data tables record their information for each week and for each state, so each combination of these variables denotes a unique observation and hence the tables are tidy. Moreover, those two variables, week ending date and state, are the keys we can use to join the tables together. We can check that the combination of these variables uniquely identifies each observation by seeing whether each combination occurs only once in the dataset:

us_hosps |> 
  count(week_ending_date, state) |>
  filter(n > 1)
# A tibble: 0 × 3
# ℹ 3 variables: week_ending_date <date>, state <chr>, n <int>

Great, what about us_deaths:

us_deaths |> 
  count(`Week Ending Date`, State) |>
  filter(n > 1)
# A tibble: 54 × 3
   `Week Ending Date` State                    n
   <date>             <chr>                <int>
 1 NA                 Alabama                 55
 2 NA                 Alaska                  55
 3 NA                 Arizona                 55
 4 NA                 Arkansas                55
 5 NA                 California              55
 6 NA                 Colorado                55
 7 NA                 Connecticut             55
 8 NA                 Delaware                55
 9 NA                 District of Columbia    55
10 NA                 Florida                 55
# ℹ 44 more rows

Ruh roh. It looks like an NA date is coming up 55 times for each state! Let’s check out why by filtering for those NA dates.

us_deaths |> filter(is.na(`Week Ending Date`))
# A tibble: 2,970 × 17
   `Data as of` `Start Date` `End Date` Group    Year  Month `MMWR Week`
   <chr>        <chr>        <chr>      <chr>    <chr> <dbl>       <dbl>
 1 01/18/2024   01/01/2020   01/31/2020 By Month 2020      1          NA
 2 01/18/2024   02/01/2020   02/29/2020 By Month 2020      2          NA
 3 01/18/2024   03/01/2020   03/31/2020 By Month 2020      3          NA
 4 01/18/2024   04/01/2020   04/30/2020 By Month 2020      4          NA
 5 01/18/2024   05/01/2020   05/31/2020 By Month 2020      5          NA
 6 01/18/2024   06/01/2020   06/30/2020 By Month 2020      6          NA
 7 01/18/2024   07/01/2020   07/31/2020 By Month 2020      7          NA
 8 01/18/2024   08/01/2020   08/31/2020 By Month 2020      8          NA
 9 01/18/2024   09/01/2020   09/30/2020 By Month 2020      9          NA
10 01/18/2024   10/01/2020   10/31/2020 By Month 2020     10          NA
# ℹ 2,960 more rows
# ℹ 10 more variables: `Week Ending Date` <date>, State <chr>,
#   `COVID-19 Deaths` <dbl>, `Total Deaths` <dbl>,
#   `Percent of Expected Deaths` <dbl>, `Pneumonia Deaths` <dbl>,
#   `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
#   `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Footnote <chr>

If we look at the Group column, it says “By Month”, so these rows must be capturing monthly totals. Are there other rows we should be worried about?

us_deaths |> distinct(Group)
# A tibble: 4 × 1
  Group   
  <chr>   
1 By Week 
2 By Month
3 By Year 
4 By Total

Ah, yes. What if we keep only the “By Week” rows: do all of those have a Week Ending Date?

us_deaths |> 
  filter(Group == "By Week", is.na(`Week Ending Date`))
# A tibble: 0 × 17
# ℹ 17 variables: Data as of <chr>, Start Date <chr>, End Date <chr>,
#   Group <chr>, Year <chr>, Month <dbl>, MMWR Week <dbl>,
#   Week Ending Date <date>, State <chr>, COVID-19 Deaths <dbl>,
#   Total Deaths <dbl>, Percent of Expected Deaths <dbl>,
#   Pneumonia Deaths <dbl>, Pneumonia and COVID-19 Deaths <dbl>,
#   Influenza Deaths <dbl>, Pneumonia, Influenza, or COVID-19 Deaths <dbl>,
#   Footnote <chr>

They do! Now we know that the week ending date and state in both the hospitalization and death tables can be used to match observations between them.

If we want to join the death data into the hospitalization table, we would call week_ending_date and state the primary keys, which identify unique observations in the first table in the join, and Week Ending Date and State the foreign keys, which are the keys that correspond to the primary keys in the second table in the join.
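To make the terminology concrete, here’s a minimal sketch with two tiny made-up tables (patients and visits are hypothetical, not part of our CDC data): patient_id is the primary key of patients and the foreign key in visits that points back to it.

patients = tibble(patient_id = c(1, 2, 3), name = c("Ana", "Bo", "Cy"))
visits   = tibble(visit_id = 1:4, patient_id = c(1, 1, 2, 3), week = c(1, 2, 1, 1))

# each value of a valid primary key should occur only once in its own table
patients |>
  count(patient_id) |>
  filter(n > 1) # 0 rows means patient_id uniquely identifies each patient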


Basic joins

There are two basic kinds of data table joins: joins that combine data from one table into another, which are called mutating joins, and joins that use one table to filter the rows of another table, which are called filtering joins. Mutating joins are usually done with the functions left_join, inner_join, right_join, and full_join, and filtering joins with semi_join and anti_join.

Mutating joins

We’ll start by examining left_join(x,y), which takes variables from table y and adds them to table x and keeps all the rows of table x (hence, “left” since x is the left argument). We use left_join in our CDC example to accomplish an important task in joining the hospitalization and death data. To see what that is, let’s look at the combinations of our key variables again for each table:

us_deaths |>
  distinct(`Week Ending Date`, State) |>
  arrange(State)
# A tibble: 11,448 × 2
   `Week Ending Date` State  
   <date>             <chr>  
 1 2020-01-04         Alabama
 2 2020-01-11         Alabama
 3 2020-01-18         Alabama
 4 2020-01-25         Alabama
 5 2020-02-01         Alabama
 6 2020-02-08         Alabama
 7 2020-02-15         Alabama
 8 2020-02-22         Alabama
 9 2020-02-29         Alabama
10 2020-03-07         Alabama
# ℹ 11,438 more rows
us_hosps |>
  distinct(week_ending_date, state) |>
  arrange(state)
# A tibble: 11,913 × 2
   week_ending_date state
   <date>           <chr>
 1 2020-08-08       AK   
 2 2020-08-15       AK   
 3 2020-08-22       AK   
 4 2020-08-29       AK   
 5 2020-09-05       AK   
 6 2020-09-12       AK   
 7 2020-09-19       AK   
 8 2020-09-26       AK   
 9 2020-10-03       AK   
10 2020-10-10       AK   
# ℹ 11,903 more rows

Ruh roh again. The deaths table’s State has the full state name while the hospitalizations table’s state has the two-letter state abbreviation, so we’ll need to convert one to the other. Lucky for us, R has some built-in vectors that help, state.abb and state.name, which we combine into a new data table along with abbreviations for Washington DC, Puerto Rico, and the United States.

states = 
  tibble(state = state.abb, state_full = state.name) |> 
  add_row(state = c("USA", "DC", "PR"), 
          state_full = c("United States", "District of Columbia", "Puerto Rico"))
states
# A tibble: 53 × 2
   state state_full 
   <chr> <chr>      
 1 AL    Alabama    
 2 AK    Alaska     
 3 AZ    Arizona    
 4 AR    Arkansas   
 5 CA    California 
 6 CO    Colorado   
 7 CT    Connecticut
 8 DE    Delaware   
 9 FL    Florida    
10 GA    Georgia    
# ℹ 43 more rows

We can join this new states table with our us_hosps (we also select a subset of the columns, using matches, to check that it worked; for other handy ways to select columns, see ?select).

us_hosps = 
  us_hosps |>
  left_join(states) |>
  arrange(state) |>
  select(week_ending_date, state, state_full, matches("covid"))
Joining with `by = join_by(state)`

We can see that R told us it was joining with `by = join_by(state)`. This means R guessed which key variables to use by looking for columns with identical names in both tables; this is called a natural join. In this specific case, the only column with the same name in both tables is state, so that is the primary and foreign key used for the join. Also note that since this is a left join and us_hosps is the primary (left) table, all of its rows are kept and we simply get the new state_full column added.
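If we’d rather not rely on that guess, we can spell the key out ourselves. A sketch of the same join written explicitly (as it would look applied to the original, un-joined us_hosps):

# explicit version of the natural join above; no "Joining with ..." message
us_hosps |>
  left_join(states, join_by(state))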

Now we are ready to join the deaths and hospitalization tables together since we have full state names in both tables. We do this by telling left_join which variables are the primary and foreign keys since they have different names in the two tables.

us_hosps_deaths =
us_deaths |>
  left_join(us_hosps, join_by(`Week Ending Date` == week_ending_date, State == state_full))
us_hosps_deaths
# A tibble: 14,364 × 28
   `Data as of` `Start Date` `End Date` Group   Year      Month `MMWR Week`
   <chr>        <chr>        <chr>      <chr>   <chr>     <dbl>       <dbl>
 1 01/18/2024   12/29/2019   01/04/2020 By Week 2019/2020    NA           1
 2 01/18/2024   01/05/2020   01/11/2020 By Week 2020         NA           2
 3 01/18/2024   01/12/2020   01/18/2020 By Week 2020         NA           3
 4 01/18/2024   01/19/2020   01/25/2020 By Week 2020         NA           4
 5 01/18/2024   01/26/2020   02/01/2020 By Week 2020         NA           5
 6 01/18/2024   02/02/2020   02/08/2020 By Week 2020         NA           6
 7 01/18/2024   02/09/2020   02/15/2020 By Week 2020         NA           7
 8 01/18/2024   02/16/2020   02/22/2020 By Week 2020         NA           8
 9 01/18/2024   02/23/2020   02/29/2020 By Week 2020         NA           9
10 01/18/2024   03/01/2020   03/07/2020 By Week 2020         NA          10
# ℹ 14,354 more rows
# ℹ 21 more variables: `Week Ending Date` <date>, State <chr>,
#   `COVID-19 Deaths` <dbl>, `Total Deaths` <dbl>,
#   `Percent of Expected Deaths` <dbl>, `Pneumonia Deaths` <dbl>,
#   `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
#   `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Footnote <chr>,
#   state <chr>, avg_adm_all_covid_confirmed <dbl>, …

If we look at the hospitalization columns of us_hosps_deaths, we can see that they are all initially NA, which reflects the fact that rows with these dates and states don’t exist in us_hosps. In fact, the data in us_hosps doesn’t start until August of 2020.
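We can verify that with a quick check; a small sketch that looks up the earliest week in the hospitalization table:

# earliest week_ending_date present in the hospitalization data
us_hosps |>
  summarize(first_week = min(week_ending_date, na.rm = TRUE))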

We could take this a step further by adding in more data. Suppose we wanted to get population size into the table so that we could eventually calculate per capita COVID-19 deaths and hospitalizations for each state. First, we need state-level population numbers, which we can get from the United States Census, https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html, and which we’ve downloaded to the GitHub repository. We can load it with read_excel and do a little cleaning.

library(readxl)

pop = 
  read_excel("NST-EST2023-POP.xlsx", 
                 range = "A10:F62", 
                 col_names = c("state", "Pop Apr 2020", "Pop Jul 2020", "Pop Jul 2021", "Pop Jul 2022", "Pop Jul 2023")) |> 
  mutate(state = sub("^\\.", "", state)) # get rid of annoying "." due to space in state column in original excel file
pop 
# A tibble: 53 × 6
   state             `Pop Apr 2020` `Pop Jul 2020` `Pop Jul 2021` `Pop Jul 2022`
   <chr>                      <dbl>          <dbl>          <dbl>          <dbl>
 1 Alabama                  5024294        5031864        5050380        5073903
 2 Alaska                    733374         732964         734923         733276
 3 Arizona                  7157902        7186683        7272487        7365684
 4 Arkansas                 3011490        3014348        3028443        3046404
 5 California              39538212       39503200       39145060       39040616
 6 Colorado                 5773707        5785219        5811596        5841039
 7 Connecticut              3605912        3577586        3603691        3608706
 8 Delaware                  989946         991862        1004881        1019459
 9 District of Colu…         689548         670839         669037         670949
10 Florida                 21538216       21591299       21830708       22245521
# ℹ 43 more rows
# ℹ 1 more variable: `Pop Jul 2023` <dbl>

Adding in these population values is simply another left_join:

us_hosps_deaths |>
  left_join(pop, join_by(State == state)) |>
  select(`Week Ending Date`, State, matches("covid"), matches("pop"))
# A tibble: 14,364 × 20
   `Week Ending Date` State         `COVID-19 Deaths` Pneumonia and COVID-19 D…¹
   <date>             <chr>                     <dbl>                      <dbl>
 1 2020-01-04         United States                 0                          0
 2 2020-01-11         United States                 1                          1
 3 2020-01-18         United States                 2                          2
 4 2020-01-25         United States                 3                          0
 5 2020-02-01         United States                 0                          0
 6 2020-02-08         United States                 4                          1
 7 2020-02-15         United States                 6                          1
 8 2020-02-22         United States                 6                          3
 9 2020-02-29         United States                 9                          5
10 2020-03-07         United States                38                         19
# ℹ 14,354 more rows
# ℹ abbreviated name: ¹​`Pneumonia and COVID-19 Deaths`
# ℹ 16 more variables: `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>,
#   avg_adm_all_covid_confirmed <dbl>,
#   pct_chg_avg_adm_all_covid_confirmed_per_100k <dbl>,
#   total_adm_all_covid_confirmed_past_7days <dbl>,
#   total_adm_all_covid_confirmed_past_7days_per_100k <dbl>, …

Mutating joins in theory

A conceptual diagram of our left_join: rows are matched by their keys (the colored columns in the diagram), rows in the right table y whose keys don’t exist in the left table x are left out, and we get NAs in the variables taken from the right table wherever the left table has a key value that the right table doesn’t. The complementary situation occurs for a right_join. An inner_join(x, y) only keeps rows whose key values appear in both tables, and a full_join keeps all rows or observations from both the left and right tables (x and y). We can also represent these different join types with Venn diagrams.
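To make these pictures concrete, here’s a minimal sketch with two tiny made-up tables x and y, where key 3 appears only in x and key 4 only in y:

x = tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y = tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

left_join(x, y, join_by(key))  # all rows of x; key 3 gets NA for val_y
right_join(x, y, join_by(key)) # all rows of y; key 4 gets NA for val_x
inner_join(x, y, join_by(key)) # only keys found in both tables (1 and 2)
full_join(x, y, join_by(key))  # all keys from both tables (1, 2, 3, and 4)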

Filtering joins

Filtering joins are handy for filtering tables when the conditions for the filter might be complex enough to be contained within another table. A semi_join(x,y) keeps the rows in x that match a row in y whereas anti_join(x,y) keeps the rows in x that don’t have a match in y. For example, suppose we want to filter our us_hosps_deaths for a set of specific states. We could create a new tibble for this:

filter_dt = tribble(
  ~state, ~val,
  "California", 1,
  "Kentucky", 1,
  "New York", 1,
  "Texas", 1
)

and then do the join, being careful to use join_by since the column names are different (even though just by capitalization!):

us_hosps_deaths |>
  semi_join(filter_dt, join_by(State == state))
# A tibble: 1,064 × 28
   `Data as of` `Start Date` `End Date` Group   Year      Month `MMWR Week`
   <chr>        <chr>        <chr>      <chr>   <chr>     <dbl>       <dbl>
 1 01/18/2024   12/29/2019   01/04/2020 By Week 2019/2020    NA           1
 2 01/18/2024   01/05/2020   01/11/2020 By Week 2020         NA           2
 3 01/18/2024   01/12/2020   01/18/2020 By Week 2020         NA           3
 4 01/18/2024   01/19/2020   01/25/2020 By Week 2020         NA           4
 5 01/18/2024   01/26/2020   02/01/2020 By Week 2020         NA           5
 6 01/18/2024   02/02/2020   02/08/2020 By Week 2020         NA           6
 7 01/18/2024   02/09/2020   02/15/2020 By Week 2020         NA           7
 8 01/18/2024   02/16/2020   02/22/2020 By Week 2020         NA           8
 9 01/18/2024   02/23/2020   02/29/2020 By Week 2020         NA           9
10 01/18/2024   03/01/2020   03/07/2020 By Week 2020         NA          10
# ℹ 1,054 more rows
# ℹ 21 more variables: `Week Ending Date` <date>, State <chr>,
#   `COVID-19 Deaths` <dbl>, `Total Deaths` <dbl>,
#   `Percent of Expected Deaths` <dbl>, `Pneumonia Deaths` <dbl>,
#   `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
#   `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Footnote <chr>,
#   state <chr>, avg_adm_all_covid_confirmed <dbl>, …

The anti_join works analogously; here, we use it to find which rows in us_hosps_deaths have states that don’t have population data from the census pop data.

us_hosps_deaths |>
  anti_join(pop, join_by(State == state)) |>
  distinct(State)
# A tibble: 2 × 1
  State        
  <chr>        
1 United States
2 New York City

Interesting: besides the national “United States” totals, us_hosps_deaths has death data recorded specifically for “New York City”, which doesn’t appear in the census pop table.
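As a quick sanity check (a small sketch), we can peek at a few of those New York City rows:

us_hosps_deaths |>
  filter(State == "New York City") |>
  select(`Week Ending Date`, State, `COVID-19 Deaths`, `Total Deaths`)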

Filtering joins in theory

A conceptual diagram of our semi_join: only the rows of x that match a key in y are kept, and no columns of y are added. Likewise, for anti_join, only the non-matching rows of x are retained.
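Using the same style of tiny made-up tables as in the mutating-join sketch above, the filtering joins look like this:

x = tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y = tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

semi_join(x, y, join_by(key)) # rows of x with a match in y (keys 1 and 2); no columns of y added
anti_join(x, y, join_by(key)) # rows of x without a match in y (key 3)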

Non-equi joins

One key assumption we’ve made in the joins above is that rows are matched when the key columns have equal values in the left and right tables. However, the matching condition need not be equality; it could be greater than, less than, or some other criterion. The dplyr package identifies four particularly useful types of non-equi joins:

  • Cross joins match every pair of rows.
  • Inequality joins use <, <=, >, and >= instead of ==.
  • Rolling joins are similar to inequality joins but only find the closest match.
  • Overlap joins are a special type of inequality join designed to work with ranges.

For more information, see https://r4ds.hadley.nz/joins#sec-non-equi-joins.
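As a small illustration (the employees and policies tables here are made up, not part of our CDC or bike share data), an inequality join and its rolling variant could look like this:

employees = tibble(name = c("Ana", "Bo"),
                   hired = as.Date(c("2020-03-01", "2021-06-15")))
policies  = tibble(policy = c("v1", "v2"),
                   effective = as.Date(c("2019-01-01", "2021-01-01")))

# inequality join: every policy already in effect when each employee was hired
employees |>
  left_join(policies, join_by(hired >= effective))

# rolling join: only the closest (most recent) policy in effect at hiring
employees |>
  left_join(policies, join_by(closest(hired >= effective)))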

Databases

Even though there is a lot of data in Excel spreadsheets, csv files, and similar tabular files, there might be even more data living in databases, which are organized collections of data accessible by a user through special software. Many database systems use a special language called Structured Query Language, or SQL, for accessing the data contained in the database. Lucky for us, there are R packages and tools that translate the data wrangling commands we’ve been learning into equivalent SQL commands, and we’ll briefly discuss some of the features of these tools.

There are a few important differences between databases and the typical data frame you used in R up until now:

  1. Databases are stored on disk and can be very, very large, whereas data frames are typically small enough to be stored entirely within the working memory (RAM) of a computer. This means some kinds of datasets, such as all the users of Facebook and the information about them, must be stored in databases.

  2. Databases often have an index for quickly accessing specific rows, which is important when the database is many gigabytes or terabytes in size. Data frames do not have or really need an index.

  3. Databases are often row-oriented, which means data are stored row-by-row instead of column-oriented like data frames. This makes adding data faster but data wrangling slower.

Connecting to databases

You need to use a database interface, or DBI, to connect to a database. The details are specific to the kind of database (PostgreSQL, MySQL, SQLite, etc.), but the R package DBI provides a common interface that backend packages, such as RSQLite, plug into. In order to experiment, we’ll use a SQLite database of San Francisco bike share data from 2018 from https://github.com/ropensci/bikedata. You will first need to download the file.

download.file("https://www.dropbox.com/s/pe4aja9wbcpan9a/bikedb.sqlite?dl=1", "bikedb.sqlite", mode="wb")

Now you can connect to the database

library(DBI)
library(dbplyr)

Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':

    ident, sql
dbcon = dbConnect(RSQLite::SQLite(), "bikedb.sqlite")
dbcon
<SQLiteConnection>
  Path: /Users/vancleve/science/teaching/UK/BIO-580/2024-Spring/bikedb.sqlite
  Extensions: TRUE

Databases may have multiple tables. To see which tables are in this database, we do

dbcon |> 
  dbListTables()
[1] "datafiles" "stations"  "trips"    

Let’s look at the stations table.

dbcon |>
  dbReadTable("stations") |>
  as_tibble()
# A tibble: 332 × 6
      id city  stn_id name                       longitude         latitude     
   <int> <chr> <chr>  <chr>                      <chr>             <chr>        
 1     1 sf    sf10   Washington St at Kearny St -122.404770255089 37.795392937…
 2     2 sf    sf100  Bryant St at 15th St       -122.410662       37.7671004   
 3     3 sf    sf101  San Bruno Ave at 16th St   -122.405676841736 37.766007742…
 4     4 sf    sf102  Irwin St at 8th St         -122.3995794      37.7668828   
 5     5 sf    sf104  4th St at 16th St          -122.390833497047 37.767044579…
 6     6 sf    sf105  16th St at Prosper St      -122.4318042      37.764285    
 7     7 sf    sf106  Sanchez St at 17th St      -122.4306746      37.7632417   
 8     8 sf    sf107  17th St at Dolores St      -122.4264968      37.7630152   
 9     9 sf    sf108  16th St Mission BART       -122.419956922531 37.764710085…
10    10 sf    sf109  17th St at Valencia St     -122.4219039      37.7633158   
# ℹ 322 more rows

We can see this is just a list of locations of bike share stations.

dbplyr

The package dbplyr lets us access the database as if we were using our normal dplyr commands, while in the background SQL commands are sent to the database. Let’s load the trips table.

sftrips = dbcon |>
  tbl("trips")
sftrips
# Source:   table<trips> [?? x 11]
# Database: sqlite 3.45.0 [/Users/vancleve/science/teaching/UK/BIO-580/2024-Spring/bikedb.sqlite]
      id city  trip_duration start_time          stop_time      start_station_id
   <int> <chr>         <int> <chr>               <chr>          <chr>           
 1     1 sf            75284 2018-01-31 22:52:35 2018-02-01 19… sf120           
 2     2 sf            85422 2018-01-31 16:13:34 2018-02-01 15… sf15            
 3     3 sf            71576 2018-01-31 14:23:55 2018-02-01 10… sf304           
 4     4 sf            61076 2018-01-31 14:53:23 2018-02-01 07… sf75            
 5     5 sf            39966 2018-01-31 19:52:24 2018-02-01 06… sf74            
 6     6 sf             6477 2018-01-31 22:58:44 2018-02-01 00… sf236           
 7     7 sf              453 2018-01-31 23:53:53 2018-02-01 00… sf110           
 8     8 sf              180 2018-01-31 23:52:09 2018-01-31 23… sf81            
 9     9 sf              996 2018-01-31 23:34:56 2018-01-31 23… sf134           
10    10 sf              825 2018-01-31 23:34:14 2018-01-31 23… sf305           
# ℹ more rows
# ℹ 5 more variables: end_station_id <chr>, bike_id <chr>, user_type <chr>,
#   birth_year <chr>, gender <chr>

You can see that the table shows the number of columns but not the number of rows. This is because the table is loaded lazily, which means data aren’t accessed until they must be. Since all we did was essentially ask for the beginning of the table, it didn’t have to read the whole thing into memory.

july4th = sftrips |> 
  filter(start_time < "2018-07-05 00:00:00", start_time >= "2018-07-04 00:00:00") |>
  group_by(start_station_id) |>
  summarize(mean_duration = mean(trip_duration)) |>
  arrange(desc(mean_duration))
july4th
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
# Source:     SQL [?? x 2]
# Database:   sqlite 3.45.0 [/Users/vancleve/science/teaching/UK/BIO-580/2024-Spring/bikedb.sqlite]
# Ordered by: desc(mean_duration)
   start_station_id mean_duration
   <chr>                    <dbl>
 1 sf152                   28855 
 2 sf155                   12484.
 3 sf206                   10292.
 4 sf149                    5198.
 5 sf132                    4019.
 6 sf180                    3608.
 7 sf202                    3590.
 8 sf207                    3517 
 9 sf228                    3372 
10 sf72                     3323.
# ℹ more rows

We could go back to the stations table to get location information about the stations and then use a join to add that information here; we’ll sketch that below. It can also be helpful to actually pull the data into memory in R, which we can do with collect:

july4th |>
  collect()
# A tibble: 282 × 2
   start_station_id mean_duration
   <chr>                    <dbl>
 1 sf152                   28855 
 2 sf155                   12484.
 3 sf206                   10292.
 4 sf149                    5198.
 5 sf132                    4019.
 6 sf180                    3608.
 7 sf202                    3590.
 8 sf207                    3517 
 9 sf228                    3372 
10 sf72                     3323.
# ℹ 272 more rows

Now we can see the output is a normal tibble.
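As mentioned above, we could pull in station locations with a join. A sketch, assuming (as the tables above suggest) that stn_id in stations matches start_station_id in trips:

july4th |>
  collect() |>
  left_join(dbcon |> dbReadTable("stations") |> as_tibble(),
            join_by(start_station_id == stn_id)) |>
  select(start_station_id, name, longitude, latitude, mean_duration)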

Finally, we can see what the SQL code is to generate the same query that we execute with our dplyr code:

sftrips |> 
  filter(start_time < "2018-07-05 00:00:00", start_time >= "2018-07-04 00:00:00") |>
  group_by(start_station_id) |>
  summarize(mean_duration = mean(trip_duration)) |>
  arrange(desc(mean_duration)) |>
  show_query()
<SQL>
SELECT `start_station_id`, AVG(`trip_duration`) AS `mean_duration`
FROM (
  SELECT `trips`.*
  FROM `trips`
  WHERE
    (`start_time` < '2018-07-05 00:00:00') AND
    (`start_time` >= '2018-07-04 00:00:00')
) AS `q01`
GROUP BY `start_station_id`
ORDER BY `mean_duration` DESC

If you want to learn more about SQL (and you don’t have to, since you have dbplyr!), you can find an introduction in the “R for Data Science” book: https://r4ds.hadley.nz/databases#sql.

Lab

Problems

For some of these problems, we’ll use data from the package nycflights13, which contains airline flight data from 2013; you will need to install this package. Once loaded, you have access to five tables: airlines, airports, flights, planes, and weather.

library(nycflights13)

  1. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

  2. Imagine you’ve found the top 10 most popular destinations using this code:

    top_dest = flights |>
      count(dest, sort = TRUE) |>
      head(10)

    How can you find all flights to those destinations?

  3. Add the latitude and the longitude of the origin and destination airports to flights.

  4. What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

  5. Use the bikedb.sqlite data and find the station with the most number of trips in the database. Use a join to find out where that station is located.