Joins and Databases

Author

Jeremy Van Cleve

Published

October 1, 2024

Outline for today

  • Putting data frames together: joins
  • Using dplyr to talk to databases

Joining data

Joining data can be a superpower

Remember the CDC COVID-19 data on hospitalizations and deaths we used a few weeks back? Here it is:

library(tidyverse)
library(RSocrata) 

# Read in hospitalization and deaths
us_deaths = read.socrata("https://data.cdc.gov/api/odata/v4/r8kw-7aab") |> as_tibble()
us_hosps  = read.socrata("https://data.cdc.gov/api/odata/v4/aemt-mg7g") |> as_tibble()

us_deaths_hosps = 
  us_deaths |> 
  rename(week_end_date = week_ending_date) |> # rename this column to match column in `us_hosps`
  select(-c(`data_as_of`, `start_date`, `end_date`, group, year, month, mmwr_week, footnote)) |> 
  inner_join( # join the two tables together
    us_hosps |>
      rename(state_abbrv = jurisdiction) |> # `us_hosps` has states as abbreviations so we'll need to add full state names
      inner_join(tibble(state_abbrv = state.abb, state = state.name) |> 
                  add_row(state_abbrv = c("USA", "DC", "PR"), state = c("United States", "District of Columbia", "Puerto Rico"))))

Looking at the read.socrata commands, you’ll notice that we start with two separate tables, one for hospitalizations and one for deaths; the CDC publishes them separately. Putting these data tables together requires “joining” them. Not only that, we have to do some extra work to make the join possible, which we accomplish with another join. By learning about joins, we’ll not only be able to understand this example, we’ll be able to combine data from all kinds of places.

Keys

One key to understanding joins (pardon the pun 😜) is understanding “keys”. Joining two data tables is really about a variable they have in common: the tables can be “joined” by matching rows that share the same values of that common variable. These common variables are called keys. We can look for common variables in us_deaths and us_hosps:

us_deaths
# A tibble: 16,794 × 17
   data_as_of          start_date          end_date            group year  month
   <dttm>              <dttm>              <dttm>              <chr> <chr> <int>
 1 2024-10-03 00:00:00 2019-12-29 00:00:00 2020-01-04 00:00:00 By W… 2019…    NA
 2 2024-10-03 00:00:00 2020-01-05 00:00:00 2020-01-11 00:00:00 By W… 2020     NA
 3 2024-10-03 00:00:00 2020-01-12 00:00:00 2020-01-18 00:00:00 By W… 2020     NA
 4 2024-10-03 00:00:00 2020-01-19 00:00:00 2020-01-25 00:00:00 By W… 2020     NA
 5 2024-10-03 00:00:00 2020-01-26 00:00:00 2020-02-01 00:00:00 By W… 2020     NA
 6 2024-10-03 00:00:00 2020-02-02 00:00:00 2020-02-08 00:00:00 By W… 2020     NA
 7 2024-10-03 00:00:00 2020-02-09 00:00:00 2020-02-15 00:00:00 By W… 2020     NA
 8 2024-10-03 00:00:00 2020-02-16 00:00:00 2020-02-22 00:00:00 By W… 2020     NA
 9 2024-10-03 00:00:00 2020-02-23 00:00:00 2020-02-29 00:00:00 By W… 2020     NA
10 2024-10-03 00:00:00 2020-03-01 00:00:00 2020-03-07 00:00:00 By W… 2020     NA
# ℹ 16,784 more rows
# ℹ 11 more variables: mmwr_week <int>, week_ending_date <dttm>, state <chr>,
#   covid_19_deaths <int>, total_deaths <int>,
#   percent_of_expected_deaths <dbl>, pneumonia_deaths <int>,
#   pneumonia_and_covid_19_deaths <int>, influenza_deaths <int>,
#   pneumonia_influenza_or_covid_19_deaths <int>, footnote <chr>
us_hosps
# A tibble: 12,369 × 64
   week_end_date       jurisdiction weekly_actual_days_reporting_any_data
   <dttm>              <chr>                                        <int>
 1 2020-08-08 00:00:00 AZ                                             458
 2 2020-08-15 00:00:00 AZ                                             453
 3 2020-08-22 00:00:00 AZ                                             455
 4 2020-08-29 00:00:00 AZ                                             433
 5 2020-09-05 00:00:00 AZ                                             466
 6 2020-09-12 00:00:00 AZ                                             435
 7 2020-09-19 00:00:00 AZ                                             465
 8 2020-09-26 00:00:00 AZ                                             460
 9 2020-10-03 00:00:00 AZ                                             447
10 2020-10-10 00:00:00 AZ                                             463
# ℹ 12,359 more rows
# ℹ 61 more variables: weekly_percent_days_reporting_any_data <dbl>,
#   num_hospitals_previous_day_admission_adult_covid_confirmed <int>,
#   num_hospitals_previous_day_admission_pediatric_covid_confirmed <int>,
#   num_hospitals_previous_day_admission_influenza_confirmed <int>,
#   num_hospitals_total_patients_hospitalized_confirmed_influenza <int>,
#   num_hospitals_icu_patients_confirmed_influenza <int>, …

We notice that the two data tables record their information for each week and for each state or jurisdiction. So each combination of these variables denotes a unique observation and hence the tables are tidy. Moreover, those two variables, week_ending_date/week_end_date and state/jurisdiction, are our keys that we can use to join together the tables. We can check that the combination of these variables works to uniquely identify each observation by looking to see if each combination occurs only once in the dataset:

us_deaths |> 
  count(week_ending_date, state) |>
  filter(n > 1)
# A tibble: 54 × 3
   week_ending_date state                    n
   <dttm>           <chr>                <int>
 1 NA               Alabama                 63
 2 NA               Alaska                  63
 3 NA               Arizona                 63
 4 NA               Arkansas                63
 5 NA               California              63
 6 NA               Colorado                63
 7 NA               Connecticut             63
 8 NA               Delaware                63
 9 NA               District of Columbia    63
10 NA               Florida                 63
# ℹ 44 more rows

Ruh roh. It looks like combinations of an NA week_ending_date and a state come up many times each! Let’s check out why by filtering those NA dates:

us_deaths |> filter(is.na(week_ending_date))
# A tibble: 3,402 × 17
   data_as_of          start_date          end_date            group year  month
   <dttm>              <dttm>              <dttm>              <chr> <chr> <int>
 1 2024-10-03 00:00:00 2020-01-01 00:00:00 2020-01-31 00:00:00 By M… 2020      1
 2 2024-10-03 00:00:00 2020-02-01 00:00:00 2020-02-29 00:00:00 By M… 2020      2
 3 2024-10-03 00:00:00 2020-03-01 00:00:00 2020-03-31 00:00:00 By M… 2020      3
 4 2024-10-03 00:00:00 2020-04-01 00:00:00 2020-04-30 00:00:00 By M… 2020      4
 5 2024-10-03 00:00:00 2020-05-01 00:00:00 2020-05-31 00:00:00 By M… 2020      5
 6 2024-10-03 00:00:00 2020-06-01 00:00:00 2020-06-30 00:00:00 By M… 2020      6
 7 2024-10-03 00:00:00 2020-07-01 00:00:00 2020-07-31 00:00:00 By M… 2020      7
 8 2024-10-03 00:00:00 2020-08-01 00:00:00 2020-08-31 00:00:00 By M… 2020      8
 9 2024-10-03 00:00:00 2020-09-01 00:00:00 2020-09-30 00:00:00 By M… 2020      9
10 2024-10-03 00:00:00 2020-10-01 00:00:00 2020-10-31 00:00:00 By M… 2020     10
# ℹ 3,392 more rows
# ℹ 11 more variables: mmwr_week <int>, week_ending_date <dttm>, state <chr>,
#   covid_19_deaths <int>, total_deaths <int>,
#   percent_of_expected_deaths <dbl>, pneumonia_deaths <int>,
#   pneumonia_and_covid_19_deaths <int>, influenza_deaths <int>,
#   pneumonia_influenza_or_covid_19_deaths <int>, footnote <chr>

If we look at the group column, it says “By Month”, so these rows must be capturing monthly totals. Are there other rows we should be worried about?

us_deaths |> distinct(group)
# A tibble: 4 × 1
  group   
  <chr>   
1 By Week 
2 By Month
3 By Year 
4 By Total

Ah, yes. What if we keep only the “By Week” rows: do they all have a week_ending_date?

us_deaths |> 
  filter(group == "By Week", is.na(week_ending_date))
# A tibble: 0 × 17
# ℹ 17 variables: data_as_of <dttm>, start_date <dttm>, end_date <dttm>,
#   group <chr>, year <chr>, month <int>, mmwr_week <int>,
#   week_ending_date <dttm>, state <chr>, covid_19_deaths <int>,
#   total_deaths <int>, percent_of_expected_deaths <dbl>,
#   pneumonia_deaths <int>, pneumonia_and_covid_19_deaths <int>,
#   influenza_deaths <int>, pneumonia_influenza_or_covid_19_deaths <int>,
#   footnote <chr>

They do! This means that week_ending_date only has values for the “By Week” rows in us_deaths. Now we can check us_hosps to see if it has unique combinations of week end date and jurisdiction:

us_hosps |> 
  count(week_end_date, jurisdiction) |>
  filter(n > 1)
# A tibble: 0 × 3
# ℹ 3 variables: week_end_date <dttm>, jurisdiction <chr>, n <int>

Great, it does! That means the week ending date and state/jurisdiction columns in the deaths and hospitalizations tables should let us match up corresponding observations.

If we want to join the death and hospitalization tables, we would call week_ending_date and state the primary keys, which uniquely identify each observation in the first table in the join, and week_end_date and jurisdiction the foreign keys, which are the variables in the second table that correspond to the primary keys in the first.

Basic joins

There are two basic kinds of data table joins: joins that combine data from one table into another, which are called mutating joins, and joins that use one table to filter the rows of another table, which are called filtering joins. Mutating joins are usually done with the functions left_join, inner_join, right_join, and full_join, and filtering joins with semi_join and anti_join.

Mutating joins

We’ll start by examining inner_join(x, y), which takes variables from table y, adds them to table x, and keeps only the rows whose key values exist in both x and y (the name “inner” will make more sense when we look at the other kinds of mutating joins). We use inner_join in our CDC example to join the hospitalization and death data. To do the join, we need our keys, so let’s look again at the combinations of the key variables in each table:

us_deaths |>
  distinct(week_ending_date, state) |>
  arrange(state)
# A tibble: 13,446 × 2
   week_ending_date    state  
   <dttm>              <chr>  
 1 2020-01-04 00:00:00 Alabama
 2 2020-01-11 00:00:00 Alabama
 3 2020-01-18 00:00:00 Alabama
 4 2020-01-25 00:00:00 Alabama
 5 2020-02-01 00:00:00 Alabama
 6 2020-02-08 00:00:00 Alabama
 7 2020-02-15 00:00:00 Alabama
 8 2020-02-22 00:00:00 Alabama
 9 2020-02-29 00:00:00 Alabama
10 2020-03-07 00:00:00 Alabama
# ℹ 13,436 more rows
us_hosps |>
  distinct(week_end_date, jurisdiction) |>
  arrange(jurisdiction)
# A tibble: 12,369 × 2
   week_end_date       jurisdiction
   <dttm>              <chr>       
 1 2020-08-08 00:00:00 AK          
 2 2020-08-15 00:00:00 AK          
 3 2020-08-22 00:00:00 AK          
 4 2020-08-29 00:00:00 AK          
 5 2020-09-05 00:00:00 AK          
 6 2020-09-12 00:00:00 AK          
 7 2020-09-19 00:00:00 AK          
 8 2020-09-26 00:00:00 AK          
 9 2020-10-03 00:00:00 AK          
10 2020-10-10 00:00:00 AK          
# ℹ 12,359 more rows

Ruh roh again. The state column in us_deaths has the full state name while the jurisdiction column in us_hosps has the two-letter state abbreviation. This means that R can’t join the tables yet since it won’t know which state in us_deaths goes with which jurisdiction in us_hosps. So, we’ll need to convert one to the other. Lucky for us, R has some built-in vectors that help, state.abb and state.name, which we combine into a new data table along with abbreviations for Washington DC, Puerto Rico, and the United States as a whole.

states = 
  tibble(state_abbrv = state.abb, state = state.name) |> 
  add_row(state_abbrv = c("USA", "DC", "PR"), 
          state = c("United States", "District of Columbia", "Puerto Rico"))
states
# A tibble: 53 × 2
   state_abbrv state      
   <chr>       <chr>      
 1 AL          Alabama    
 2 AK          Alaska     
 3 AZ          Arizona    
 4 AR          Arkansas   
 5 CA          California 
 6 CO          Colorado   
 7 CT          Connecticut
 8 DE          Delaware   
 9 FL          Florida    
10 GA          Georgia    
# ℹ 43 more rows

How can we combine the state names from states into us_hosps? We can do this with a join! To see this, let’s join our new states table with us_hosps.

us_hosps_w_states = 
  us_hosps |>
  rename(state_abbrv = jurisdiction) |>
  inner_join(states) |>
  arrange(state) |>
  relocate(week_end_date, state_abbrv, state) # this moves these columns to the beginning of the data frame
Joining with `by = join_by(state_abbrv)`
us_hosps_w_states
# A tibble: 11,501 × 65
   week_end_date       state_abbrv state   weekly_actual_days_reporting_any_data
   <dttm>              <chr>       <chr>                                   <int>
 1 2020-08-08 00:00:00 AL          Alabama                                   629
 2 2020-08-15 00:00:00 AL          Alabama                                   637
 3 2020-08-22 00:00:00 AL          Alabama                                   632
 4 2020-08-29 00:00:00 AL          Alabama                                   636
 5 2020-09-05 00:00:00 AL          Alabama                                   640
 6 2020-09-12 00:00:00 AL          Alabama                                   637
 7 2020-09-19 00:00:00 AL          Alabama                                   631
 8 2020-09-26 00:00:00 AL          Alabama                                   635
 9 2020-10-03 00:00:00 AL          Alabama                                   641
10 2020-10-10 00:00:00 AL          Alabama                                   637
# ℹ 11,491 more rows
# ℹ 61 more variables: weekly_percent_days_reporting_any_data <dbl>,
#   num_hospitals_previous_day_admission_adult_covid_confirmed <int>,
#   num_hospitals_previous_day_admission_pediatric_covid_confirmed <int>,
#   num_hospitals_previous_day_admission_influenza_confirmed <int>,
#   num_hospitals_total_patients_hospitalized_confirmed_influenza <int>,
#   num_hospitals_icu_patients_confirmed_influenza <int>, …

We can see that R told us when doing this join that it’s joining with `by = join_by(state_abbrv)`. This means that R guessed which key variables it should use to join the tables. It did this by looking at the columns in both tables and finding ones with identical names. This is called a natural join. In this specific case, the only column with the same name in both tables is state_abbrv, so that is the key used for the join. Also note that because this is an inner join with states, rows of us_hosps whose state_abbrv has no match in states are dropped (we’ll see below that these are US territories), and the remaining rows simply get the new state column added.
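
Relying on this natural-join guessing works here, but it is often safer (and quieter) to name the key yourself. Here is a minimal sketch of the same join with the key spelled out explicitly via join_by (same tables and columns as above):

us_hosps |>
  rename(state_abbrv = jurisdiction) |>
  inner_join(states, by = join_by(state_abbrv)) |> # name the key explicitly instead of relying on a natural join
  relocate(week_end_date, state_abbrv, state)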

Now we are ready to join the deaths and hospitalization tables together since we have full state names in both tables. We do this by telling inner_join which variables are the primary and foreign keys since they have different names in the two tables.

us_deaths_hosps =
us_deaths |>
  inner_join(us_hosps_w_states, join_by(week_ending_date == week_end_date, state == state)) |>
  select(-data_as_of, -start_date, -end_date, -group, -year, -month)
us_deaths_hosps
# A tibble: 11,501 × 74
   mmwr_week week_ending_date    state         covid_19_deaths total_deaths
       <int> <dttm>              <chr>                   <int>        <int>
 1        32 2020-08-08 00:00:00 United States            7870        63719
 2        33 2020-08-15 00:00:00 United States            7270        63644
 3        34 2020-08-22 00:00:00 United States            6389        62562
 4        35 2020-08-29 00:00:00 United States            5755        61104
 5        36 2020-09-05 00:00:00 United States            5021        60247
 6        37 2020-09-12 00:00:00 United States            4628        59665
 7        38 2020-09-19 00:00:00 United States            4280        59740
 8        39 2020-09-26 00:00:00 United States            4305        60613
 9        40 2020-10-03 00:00:00 United States            4246        59802
10        41 2020-10-10 00:00:00 United States            4823        61779
# ℹ 11,491 more rows
# ℹ 69 more variables: percent_of_expected_deaths <dbl>,
#   pneumonia_deaths <int>, pneumonia_and_covid_19_deaths <int>,
#   influenza_deaths <int>, pneumonia_influenza_or_covid_19_deaths <int>,
#   footnote <chr>, state_abbrv <chr>,
#   weekly_actual_days_reporting_any_data <int>,
#   weekly_percent_days_reporting_any_data <dbl>, …

Let’s quickly compare how big this combined table is to the two original tables.

nrow(us_deaths)
[1] 16794
nrow(us_hosps)
[1] 12369
nrow(us_deaths_hosps)
[1] 11501

The table created with the inner_join has fewer rows than either of the original tables. This is because an inner join only keeps rows in the output where the combination of key values exists in both of the joined tables. A left_join(x, y) or right_join(x, y), on the other hand, keeps all the rows of the x (“left”) or y (“right”) table, respectively. This means, however, that some rows of the output have no match in the other table, so R fills the variables coming from that table with NA values. To see this, suppose we do a left_join that combines us_hosps into us_deaths.

us_deaths |>
  left_join(us_hosps_w_states, join_by(week_ending_date == week_end_date, state == state)) |>
  select(-data_as_of, -start_date, -end_date, -group, -year, -month)
# A tibble: 16,794 × 74
   mmwr_week week_ending_date    state         covid_19_deaths total_deaths
       <int> <dttm>              <chr>                   <int>        <int>
 1         1 2020-01-04 00:00:00 United States               0        60170
 2         2 2020-01-11 00:00:00 United States               1        60734
 3         3 2020-01-18 00:00:00 United States               2        59362
 4         4 2020-01-25 00:00:00 United States               3        59162
 5         5 2020-02-01 00:00:00 United States               0        58834
 6         6 2020-02-08 00:00:00 United States               4        59482
 7         7 2020-02-15 00:00:00 United States               6        58815
 8         8 2020-02-22 00:00:00 United States               6        58912
 9         9 2020-02-29 00:00:00 United States               9        59334
10        10 2020-03-07 00:00:00 United States              38        59694
# ℹ 16,784 more rows
# ℹ 69 more variables: percent_of_expected_deaths <dbl>,
#   pneumonia_deaths <int>, pneumonia_and_covid_19_deaths <int>,
#   influenza_deaths <int>, pneumonia_influenza_or_covid_19_deaths <int>,
#   footnote <chr>, state_abbrv <chr>,
#   weekly_actual_days_reporting_any_data <int>,
#   weekly_percent_days_reporting_any_data <dbl>, …

If we look at the first rows of this table, we can see that many of the hospitalization variables are NA. This is because these dates, which are in early 2020, don’t exist in the us_hosps data, which only starts in August of 2020:

us_hosps |> arrange(week_end_date)
# A tibble: 12,369 × 64
   week_end_date       jurisdiction weekly_actual_days_reporting_any_data
   <dttm>              <chr>                                        <int>
 1 2020-08-08 00:00:00 AZ                                             458
 2 2020-08-08 00:00:00 LA                                             919
 3 2020-08-08 00:00:00 SC                                             424
 4 2020-08-08 00:00:00 IA                                             802
 5 2020-08-08 00:00:00 NM                                             128
 6 2020-08-08 00:00:00 PA                                             991
 7 2020-08-08 00:00:00 GU                                              13
 8 2020-08-08 00:00:00 ME                                             177
 9 2020-08-08 00:00:00 ND                                             264
10 2020-08-08 00:00:00 WA                                             575
# ℹ 12,359 more rows
# ℹ 61 more variables: weekly_percent_days_reporting_any_data <dbl>,
#   num_hospitals_previous_day_admission_adult_covid_confirmed <int>,
#   num_hospitals_previous_day_admission_pediatric_covid_confirmed <int>,
#   num_hospitals_previous_day_admission_influenza_confirmed <int>,
#   num_hospitals_total_patients_hospitalized_confirmed_influenza <int>,
#   num_hospitals_icu_patients_confirmed_influenza <int>, …

In some cases, we may prefer the left_join here since we want our table to reflect the fact that deaths are being recorded by the CDC for early 2020 even though hospitalization numbers are not.

Mutating joins in theory

In a conceptual diagram of our left_join, rows are matched by their keys. Rows in the right table y whose keys don’t exist in the left table x are left out, and we get NAs in the variables coming from the right table wherever the left table has a row that the right table doesn’t. The complementary situation occurs for a right_join. An inner_join(x, y) only keeps rows that have matching key values in both tables, and a full_join keeps all rows or observations from both the left and right tables (x and y). These join types are often represented with Venn diagrams as well.
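
To make the four mutating joins concrete, here is a minimal sketch on two tiny, made-up tibbles x and y (the key values and columns here are invented purely for illustration):

x = tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y = tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

left_join(x, y, join_by(key))  # all rows of x; val_y is NA for key 3
right_join(x, y, join_by(key)) # all rows of y; val_x is NA for key 4
inner_join(x, y, join_by(key)) # only keys 1 and 2, which appear in both tables
full_join(x, y, join_by(key))  # keys 1 through 4; NAs fill the non-matching sides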

Filtering joins

Filtering joins are handy when the conditions for a filter are complex enough that they are best stored in another table. A semi_join(x, y) keeps the rows in x that match a row in y, whereas anti_join(x, y) keeps the rows in x that don’t have a match in y. For example, suppose we want to filter our us_deaths_hosps for a set of specific states. We could create a new tibble for this:

filter_dt = tribble(
  ~state, ~val,
  "California", 1,
  "Kentucky", 1,
  "New York", 1,
  "Texas", 1
)

and then do the join:

us_deaths_hosps |>
  semi_join(filter_dt)
Joining with `by = join_by(state)`
# A tibble: 868 × 74
   mmwr_week week_ending_date    state      covid_19_deaths total_deaths
       <int> <dttm>              <chr>                <int>        <int>
 1        32 2020-08-08 00:00:00 California             998         6309
 2        33 2020-08-15 00:00:00 California             878         6291
 3        34 2020-08-22 00:00:00 California             830         6478
 4        35 2020-08-29 00:00:00 California             728         5878
 5        36 2020-09-05 00:00:00 California             614         5784
 6        37 2020-09-12 00:00:00 California             530         5874
 7        38 2020-09-19 00:00:00 California             482         5755
 8        39 2020-09-26 00:00:00 California             453         5563
 9        40 2020-10-03 00:00:00 California             346         5671
10        41 2020-10-10 00:00:00 California             375         5568
# ℹ 858 more rows
# ℹ 69 more variables: percent_of_expected_deaths <dbl>,
#   pneumonia_deaths <int>, pneumonia_and_covid_19_deaths <int>,
#   influenza_deaths <int>, pneumonia_influenza_or_covid_19_deaths <int>,
#   footnote <chr>, state_abbrv <chr>,
#   weekly_actual_days_reporting_any_data <int>,
#   weekly_percent_days_reporting_any_data <dbl>, …

The anti_join works analogously; here, we can use it to note something about what happened when we gave us_hosps its full state name.

us_hosps |>
  anti_join(states, join_by(jurisdiction == state_abbrv)) |>
  distinct(jurisdiction)
# A tibble: 4 × 1
  jurisdiction
  <chr>       
1 GU          
2 VI          
3 MP          
4 AS          

The jurisdiction values in this table are the ones that do not match any state_abbrv in the states table. These jurisdictions are US territories like Guam, the US Virgin Islands, etc. Our previous use of inner_join tossed out these rows since they have no match in states.

Filtering joins in theory

In a conceptual diagram of our semi_join, only the rows of x that match a key in y are kept, and no columns from y are added. Likewise, for an anti_join, only the non-matching rows of x are retained.
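
Using the same made-up x and y tibbles from the mutating-join sketch above, the two filtering joins look like this:

semi_join(x, y, join_by(key)) # rows of x with keys 1 and 2; no columns from y are added
anti_join(x, y, join_by(key)) # only the row of x with key 3, which has no match in y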

Non-equi joins

One key assumption we’ve made in the joins above is that rows are matched, and therefore retained in the joined table, when the key columns have equal values in the left and right tables. However, the matching condition need not be equality; it could be greater than, less than, or some other criterion. The dplyr package identifies four particularly useful types of non-equi joins (a couple are sketched after this list):

  • Cross joins match every pair of rows (generates the Cartesian product of the two tables).
  • Inequality joins use <, <=, >, and >= instead of ==.
  • Rolling joins are similar to inequality joins but only find the closest match.
  • Overlap joins are a special type of inequality join designed to work with ranges.

For more information, see https://r4ds.hadley.nz/joins#sec-non-equi-joins.
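
As a minimal sketch of the first two types, here is a cross join and an inequality join on tiny, made-up tibbles (all names and values are invented for illustration):

employees = tibble(name = c("ana", "bo"))
quarters  = tibble(quarter = c("Q1", "Q2"))
cross_join(employees, quarters) # every combination of employee and quarter

sales = tibble(sale_id = 1:3, amount = c(5, 20, 50))
tiers = tibble(tier = c("small", "large"), min_amount = c(0, 25))
inner_join(sales, tiers, join_by(amount >= min_amount)) # each sale matches every tier whose minimum it meets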

Databases

Even though there is a lot of data in Excel spreadsheets, CSV files, and similar tabular files, there may be even more data living in databases, which are organized collections of data that users access through special software. Many database systems use a special language called Structured Query Language, or SQL, for accessing the data contained in the database. Lucky for us, there are R packages and tools that translate the data wrangling commands we’ve been learning into equivalent SQL commands, and we’ll briefly discuss some of the features of these tools.

There are a few important differences between databases and the typical data frames you have used in R up until now:

  1. Databases are stored on disk and can be very, very large, whereas data frames are typically small enough to be stored entirely within the working memory (RAM) of a computer. This means some kinds of datasets, such as all the users of Facebook and the information about them, must be stored in databases.

  2. Databases often have an index for quickly accessing specific rows, which is important when the database is many gigabytes or terabytes in size. Data frames do not have or really need an index.

  3. Databases are often row-oriented, which means data is stored row-by-row, instead of column-oriented like data frames. This makes adding data faster but data wrangling slower.

Connecting to databases

You need to use a database interface, or DBI, to connect to a database. The interface is specific to the kind of database (PostgreSQL, MySQL, SQLite, etc.), but the R package DBI provides a common front end that works with backend packages for each database type. In order to experiment, we’ll use a SQLite database of San Francisco bike share data from 2018 from https://github.com/ropensci/bikedata. The file, bikedb.sqlite, should be in the project template “BIO540-DWV” in the course space on Posit Cloud. You can connect to the database using the code below.

library(DBI)
library(dbplyr)

Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':

    ident, sql
dbcon = dbConnect(RSQLite::SQLite(), "bikedb.sqlite")
dbcon
<SQLiteConnection>
  Path: /Users/vancleve/science/teaching/UK/BIO-540/2024-Fall/bikedb.sqlite
  Extensions: TRUE

Databases may have multiple tables. To see which tables are in this database, we do

dbcon |> 
  dbListTables()
[1] "datafiles" "stations"  "trips"    

Let’s look at the stations table.

dbcon |>
  dbReadTable("stations") |>
  as_tibble()
# A tibble: 332 × 6
      id city  stn_id name                       longitude         latitude     
   <int> <chr> <chr>  <chr>                      <chr>             <chr>        
 1     1 sf    sf10   Washington St at Kearny St -122.404770255089 37.795392937…
 2     2 sf    sf100  Bryant St at 15th St       -122.410662       37.7671004   
 3     3 sf    sf101  San Bruno Ave at 16th St   -122.405676841736 37.766007742…
 4     4 sf    sf102  Irwin St at 8th St         -122.3995794      37.7668828   
 5     5 sf    sf104  4th St at 16th St          -122.390833497047 37.767044579…
 6     6 sf    sf105  16th St at Prosper St      -122.4318042      37.764285    
 7     7 sf    sf106  Sanchez St at 17th St      -122.4306746      37.7632417   
 8     8 sf    sf107  17th St at Dolores St      -122.4264968      37.7630152   
 9     9 sf    sf108  16th St Mission BART       -122.419956922531 37.764710085…
10    10 sf    sf109  17th St at Valencia St     -122.4219039      37.7633158   
# ℹ 322 more rows

We can see this is just a list of locations of bike share stations.

dbplyr

The package dbplyr lets us access the database as if we were using our normal dplyr commands, but in the background SQL commands are sent to the database. Let’s load the trips table.

sftrips = dbcon |>
  tbl("trips")
sftrips
# Source:   table<`trips`> [?? x 11]
# Database: sqlite 3.46.0 [/Users/vancleve/science/teaching/UK/BIO-540/2024-Fall/bikedb.sqlite]
      id city  trip_duration start_time          stop_time      start_station_id
   <int> <chr>         <int> <chr>               <chr>          <chr>           
 1     1 sf            75284 2018-01-31 22:52:35 2018-02-01 19… sf120           
 2     2 sf            85422 2018-01-31 16:13:34 2018-02-01 15… sf15            
 3     3 sf            71576 2018-01-31 14:23:55 2018-02-01 10… sf304           
 4     4 sf            61076 2018-01-31 14:53:23 2018-02-01 07… sf75            
 5     5 sf            39966 2018-01-31 19:52:24 2018-02-01 06… sf74            
 6     6 sf             6477 2018-01-31 22:58:44 2018-02-01 00… sf236           
 7     7 sf              453 2018-01-31 23:53:53 2018-02-01 00… sf110           
 8     8 sf              180 2018-01-31 23:52:09 2018-01-31 23… sf81            
 9     9 sf              996 2018-01-31 23:34:56 2018-01-31 23… sf134           
10    10 sf              825 2018-01-31 23:34:14 2018-01-31 23… sf305           
# ℹ more rows
# ℹ 5 more variables: end_station_id <chr>, bike_id <chr>, user_type <chr>,
#   birth_year <chr>, gender <chr>

You can see that the table shows the number of columns but not the number of rows. This is because the table is loaded lazily, which means data aren’t accessed until they must be. Since all we did was essentially ask for the beginning of the table, R didn’t have to read the whole thing into memory. Suppose we wanted to collect all the bike trips taken on July 4th, which is a US holiday, and see which station generated the longest trips on average. We could use our normal dplyr tools as follows:

july4th = sftrips |> 
  filter(start_time < "2018-07-05 00:00:00", start_time >= "2018-07-04 00:00:00") |>
  group_by(start_station_id) |>
  summarize(mean_duration = mean(trip_duration)) |>
  arrange(desc(mean_duration)) |>
  collect()
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
july4th
# A tibble: 282 × 2
   start_station_id mean_duration
   <chr>                    <dbl>
 1 sf152                   28855 
 2 sf155                   12484.
 3 sf206                   10292.
 4 sf149                    5198.
 5 sf132                    4019.
 6 sf180                    3608.
 7 sf202                    3590.
 8 sf207                    3517 
 9 sf228                    3372 
10 sf72                     3323.
# ℹ 272 more rows

The final command above, collect, pulls all the results of the query out of the database so that we now have a normal tibble in R containing just the results we asked for. This can be useful since repeatedly pulling from the database can be slow, but only do this if the resulting table is small enough to fit in RAM.
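
If we’re not sure the result will fit in memory, one option (a sketch using the same query as above) is to have the database count the rows of the result before we collect it:

sftrips |> 
  filter(start_time < "2018-07-05 00:00:00", start_time >= "2018-07-04 00:00:00") |>
  group_by(start_station_id) |>
  summarize(mean_duration = mean(trip_duration)) |>
  count() |> # translated to a SQL COUNT; only the row count comes back into R
  collect()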

We could go back to the stations table to get location information about the stations and then use a join to add that information here.

stations = dbcon |>
  dbReadTable("stations") |>
  as_tibble()

july4th |>
  left_join(stations, by = join_by(start_station_id == stn_id))
# A tibble: 282 × 7
   start_station_id mean_duration    id city  name            longitude latitude
   <chr>                    <dbl> <int> <chr> <chr>           <chr>     <chr>   
 1 sf152                   28855     51 sf    47th St at San… -122.281… 37.8356…
 2 sf155                   12484.    54 sf    Emeryville Pub… -122.293… 37.8405…
 3 sf206                   10292.   105 sf    College Ave at… -122.251… 37.8381…
 4 sf149                    5198.    47 sf    Emeryville Tow… -122.285… 37.8312…
 5 sf132                    4019.    31 sf    24th St at Cha… -122.426… 37.7518…
 6 sf180                    3608.    80 sf    Telegraph Ave … -122.268… 37.8126…
 7 sf202                    3590.   101 sf    Washington St … -122.274… 37.8007…
 8 sf207                    3517    106 sf    Broadway at Co… -122.251… 37.8357…
 9 sf228                    3372    129 sf    Foothill Blvd … -122.217… 37.77993
10 sf72                     3323.   305 sf    Page St at Sco… -122.435… 37.7724…
# ℹ 272 more rows

We can see that the station that generated the longest average rides on July 4th is the 47th St at San Pablo Ave station.

Finally, we can see the SQL code that generates the same query we executed with our dplyr code:

sftrips |> 
  filter(start_time < "2018-07-05 00:00:00", start_time >= "2018-07-04 00:00:00") |>
  group_by(start_station_id) |>
  summarize(mean_duration = mean(trip_duration)) |>
  arrange(desc(mean_duration)) |>
  show_query()
<SQL>
SELECT `start_station_id`, AVG(`trip_duration`) AS `mean_duration`
FROM (
  SELECT `trips`.*
  FROM `trips`
  WHERE
    (`start_time` < '2018-07-05 00:00:00') AND
    (`start_time` >= '2018-07-04 00:00:00')
) AS `q01`
GROUP BY `start_station_id`
ORDER BY `mean_duration` DESC

If you’re interested more in SQL (and you don’t have to be since you have dbplyr!), you can get some intro info in the “R for Data Science” book: https://r4ds.hadley.nz/databases#sql.

Lab

For some of these problems, we’ll use data from the library nycflights13, which contains airline flight data from 2013. Once loaded, you have access to five tables: airlines, airports, flights, planes, and weather.

library(nycflights13)

  1. Find the day that has the worst total departure delay (over all flights for that day).

  2. Imagine you’ve found the top 10 most popular destinations using this code:

    top_dest = flights |>
      count(dest, sort = TRUE) |>
      head(10)

    How can you find all flights to those destinations?

  3. Add the latitude and longitude of the origin and destination airports to flights.

  4. What do the tail numbers in flights that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

  5. Use the bikedb.sqlite data to find the station with the most trips in the database. Use a join to find out where that station is located.

  • Challenge problem (+5 extra credit)

    Using bikedb.sqlite, find the longest bike trip by distance between its starting and ending stations. What are the names of the starting and ending stations for this trip?

    You may calculate the distance between the stations from their longitude and latitude using the geosphere package.