library(tidyverse)
# Read in hospitalization and deaths
us_hosps = read_csv("US_COVID19_Hosps_ByWeek_ByState_20240125.csv")
us_deaths = read_csv("US_COVID19_Deaths_ByWeek_ByState_20240125.csv") |>
  mutate(`Week Ending Date` = mdy(`Week Ending Date`)) # death table dates need conversion

us_hosps_deaths =
  us_deaths |>
  rename(week_ending_date = `Week Ending Date`, state = State) |> # rename columns so they match across the two tables
  select(-c(`Data as of`, `Start Date`, `End Date`, Group, Year, Month, `MMWR Week`, Footnote)) |> # get rid of excess columns in deaths table
  inner_join( # join the two tables together
    us_hosps |>
      rename(state_abbrv = state) |> # hosps has states as abbreviations so we'll need to add full state names
      left_join(tibble(state_abbrv = state.abb, state = state.name) |>
                  add_row(state_abbrv = c("USA", "DC", "PR"), state = c("United States", "District of Columbia", "Puerto Rico")))) |>
  filter(state != "United States")
Joins and Databases
Outline for today
- Putting data frames together: joins
- Using dplyr to talk to databases
Joining data
Joining data can be a superpower
Remember the CDC COVID-19 data on hospitalization and deaths we used a few weeks back? Here it is:
Looking at the read_csv commands, you’ll notice that we start with two separate csv files because the CDC gives us these tables separately: one for hospitalizations and one for deaths. Putting these data tables together requires “joining” them. Not only that, we have to do some extra work to even make the join possible, which we accomplish with another join. By learning about joins, we’ll not only be able to understand this example, we’ll be able to combine data from all kinds of places.
Keys
One key to understanding joins (pardon the pun 😜) is understanding “keys”. Joining two data tables is really about a variable they share: the tables are “joined” by matching rows that have the same value of that shared variable. These shared variables are called keys. We can look for common variables in us_hosps and us_deaths:
us_hosps
# A tibble: 11,913 × 12
week_ending_date state avg_adm_all_covid_confirmed pct_chg_avg_adm_all_c…¹
<date> <chr> <dbl> <dbl>
1 2020-12-05 Region 6 1879 4.4
2 2020-12-12 Region 6 1949. 3.7
3 2020-12-19 Region 6 2042 4.79
4 2020-12-26 Region 6 2231. 9.27
5 2021-01-02 Region 6 2602. 16.6
6 2021-01-09 Region 6 2799. 7.58
7 2021-01-16 Region 6 2627. -6.14
8 2021-01-23 Region 6 2327. -11.4
9 2021-01-30 Region 6 2015. -13.4
10 2021-02-06 Region 6 1724. -14.4
# ℹ 11,903 more rows
# ℹ abbreviated name: ¹pct_chg_avg_adm_all_covid_confirmed_per_100k
# ℹ 8 more variables: total_adm_all_covid_confirmed_past_7days <dbl>,
# total_adm_all_covid_confirmed_past_7days_per_100k <dbl>,
# sum_adm_all_covid_confirmed <dbl>,
# avg_total_patients_hospitalized_covid_confirmed <dbl>,
# avg_percent_inpatient_beds_occupied_covid_confirmed <dbl>, …
us_deaths
# A tibble: 14,364 × 17
`Data as of` `Start Date` `End Date` Group Year Month `MMWR Week`
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 01/18/2024 12/29/2019 01/04/2020 By Week 2019/2020 NA 1
2 01/18/2024 01/05/2020 01/11/2020 By Week 2020 NA 2
3 01/18/2024 01/12/2020 01/18/2020 By Week 2020 NA 3
4 01/18/2024 01/19/2020 01/25/2020 By Week 2020 NA 4
5 01/18/2024 01/26/2020 02/01/2020 By Week 2020 NA 5
6 01/18/2024 02/02/2020 02/08/2020 By Week 2020 NA 6
7 01/18/2024 02/09/2020 02/15/2020 By Week 2020 NA 7
8 01/18/2024 02/16/2020 02/22/2020 By Week 2020 NA 8
9 01/18/2024 02/23/2020 02/29/2020 By Week 2020 NA 9
10 01/18/2024 03/01/2020 03/07/2020 By Week 2020 NA 10
# ℹ 14,354 more rows
# ℹ 10 more variables: `Week Ending Date` <date>, State <chr>,
# `COVID-19 Deaths` <dbl>, `Total Deaths` <dbl>,
# `Percent of Expected Deaths` <dbl>, `Pneumonia Deaths` <dbl>,
# `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
# `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Footnote <chr>
We notice that the two data tables record their information for each week and for each state. So each combination of these variables denotes a unique observation and hence the tables are tidy. Moreover, those two variables, week ending date and state, are our keys that we can use to join the tables together. We can check that the combination of these variables works to uniquely identify each observation by looking to see if each combination occurs only once in the dataset:
us_hosps |>
  count(week_ending_date, state) |>
  filter(n > 1)
# A tibble: 0 × 3
# ℹ 3 variables: week_ending_date <date>, state <chr>, n <int>
Great, what about us_deaths:
us_deaths |>
  count(`Week Ending Date`, State) |>
  filter(n > 1)
# A tibble: 54 × 3
`Week Ending Date` State n
<date> <chr> <int>
1 NA Alabama 55
2 NA Alaska 55
3 NA Arizona 55
4 NA Arkansas 55
5 NA California 55
6 NA Colorado 55
7 NA Connecticut 55
8 NA Delaware 55
9 NA District of Columbia 55
10 NA Florida 55
# ℹ 44 more rows
Ruh roh. It looks like an NA week ending date comes up 55 times for each state! Let’s check out why by filtering those NA dates:
us_deaths |> filter(is.na(`Week Ending Date`))
# A tibble: 2,970 × 17
`Data as of` `Start Date` `End Date` Group Year Month `MMWR Week`
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 01/18/2024 01/01/2020 01/31/2020 By Month 2020 1 NA
2 01/18/2024 02/01/2020 02/29/2020 By Month 2020 2 NA
3 01/18/2024 03/01/2020 03/31/2020 By Month 2020 3 NA
4 01/18/2024 04/01/2020 04/30/2020 By Month 2020 4 NA
5 01/18/2024 05/01/2020 05/31/2020 By Month 2020 5 NA
6 01/18/2024 06/01/2020 06/30/2020 By Month 2020 6 NA
7 01/18/2024 07/01/2020 07/31/2020 By Month 2020 7 NA
8 01/18/2024 08/01/2020 08/31/2020 By Month 2020 8 NA
9 01/18/2024 09/01/2020 09/30/2020 By Month 2020 9 NA
10 01/18/2024 10/01/2020 10/31/2020 By Month 2020 10 NA
# ℹ 2,960 more rows
# ℹ 10 more variables: `Week Ending Date` <date>, State <chr>,
# `COVID-19 Deaths` <dbl>, `Total Deaths` <dbl>,
# `Percent of Expected Deaths` <dbl>, `Pneumonia Deaths` <dbl>,
# `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
# `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Footnote <chr>
If we look at the Group column, it says “By Month”, so these rows must be capturing monthly totals. Are there other rows we should be worried about?
us_deaths |> distinct(Group)
# A tibble: 4 × 1
Group
<chr>
1 By Week
2 By Month
3 By Year
4 By Total
Ah, yes. What if we keep only the “By Week” rows: do all of these have a Week Ending Date?
us_deaths |>
  filter(Group == "By Week", is.na(`Week Ending Date`))
# A tibble: 0 × 17
# ℹ 17 variables: Data as of <chr>, Start Date <chr>, End Date <chr>,
# Group <chr>, Year <chr>, Month <dbl>, MMWR Week <dbl>,
# Week Ending Date <date>, State <chr>, COVID-19 Deaths <dbl>,
# Total Deaths <dbl>, Percent of Expected Deaths <dbl>,
# Pneumonia Deaths <dbl>, Pneumonia and COVID-19 Deaths <dbl>,
# Influenza Deaths <dbl>, Pneumonia, Influenza, or COVID-19 Deaths <dbl>,
# Footnote <chr>
They do! Now we know that week ending date and state can be used to match observations between the hospitalization and death tables.
If we want to join the death data into the hospitalization table, we would call week_ending_date and state the primary keys, which identify unique observations in the first table in the join, and Week Ending Date and State the foreign keys, which are the keys in the second table that correspond to the primary keys.
us_deaths
# A tibble: 14,364 × 17
`Data as of` `Start Date` `End Date` Group Year Month `MMWR Week`
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 01/18/2024 12/29/2019 01/04/2020 By Week 2019/2020 NA 1
2 01/18/2024 01/05/2020 01/11/2020 By Week 2020 NA 2
3 01/18/2024 01/12/2020 01/18/2020 By Week 2020 NA 3
4 01/18/2024 01/19/2020 01/25/2020 By Week 2020 NA 4
5 01/18/2024 01/26/2020 02/01/2020 By Week 2020 NA 5
6 01/18/2024 02/02/2020 02/08/2020 By Week 2020 NA 6
7 01/18/2024 02/09/2020 02/15/2020 By Week 2020 NA 7
8 01/18/2024 02/16/2020 02/22/2020 By Week 2020 NA 8
9 01/18/2024 02/23/2020 02/29/2020 By Week 2020 NA 9
10 01/18/2024 03/01/2020 03/07/2020 By Week 2020 NA 10
# ℹ 14,354 more rows
# ℹ 10 more variables: `Week Ending Date` <date>, State <chr>,
# `COVID-19 Deaths` <dbl>, `Total Deaths` <dbl>,
# `Percent of Expected Deaths` <dbl>, `Pneumonia Deaths` <dbl>,
# `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
# `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Footnote <chr>
Basic joins
There are two basic kinds of data table joins: joins that combine data from one table into another, which are called mutating joins, and joins that use one table to filter the rows of another table, which are called filtering joins. Mutating joins are usually done with the functions left_join, inner_join, right_join, and full_join, and filtering joins with semi_join and anti_join.
Mutating joins
We’ll start by examining left_join(x, y), which takes variables from table y and adds them to table x, keeping all the rows of table x (hence “left”, since x is the left argument). We use left_join in our CDC example to accomplish an important task in joining the hospitalization and death data. To see what that is, let’s look at the combinations of our key variables again for each table:
us_deaths |>
  distinct(`Week Ending Date`, State) |>
  arrange(State)
# A tibble: 11,448 × 2
`Week Ending Date` State
<date> <chr>
1 2020-01-04 Alabama
2 2020-01-11 Alabama
3 2020-01-18 Alabama
4 2020-01-25 Alabama
5 2020-02-01 Alabama
6 2020-02-08 Alabama
7 2020-02-15 Alabama
8 2020-02-22 Alabama
9 2020-02-29 Alabama
10 2020-03-07 Alabama
# ℹ 11,438 more rows
us_hosps |>
  distinct(week_ending_date, state) |>
  arrange(state)
# A tibble: 11,913 × 2
week_ending_date state
<date> <chr>
1 2020-08-08 AK
2 2020-08-15 AK
3 2020-08-22 AK
4 2020-08-29 AK
5 2020-09-05 AK
6 2020-09-12 AK
7 2020-09-19 AK
8 2020-09-26 AK
9 2020-10-03 AK
10 2020-10-10 AK
# ℹ 11,903 more rows
Ruh roh again. The deaths State has the full state name and the hospitalizations state has the two-letter state abbreviation. So we’ll need to convert one to the other. Lucky for us, R has some built-in vectors that help, state.abb and state.name, which we combine into a new data table along with entries for Washington DC, Puerto Rico, and the United States.
states =
  tibble(state = state.abb, state_full = state.name) |>
  add_row(state = c("USA", "DC", "PR"),
          state_full = c("United States", "District of Columbia", "Puerto Rico"))
states
# A tibble: 53 × 2
state state_full
<chr> <chr>
1 AL Alabama
2 AK Alaska
3 AZ Arizona
4 AR Arkansas
5 CA California
6 CO Colorado
7 CT Connecticut
8 DE Delaware
9 FL Florida
10 GA Georgia
# ℹ 43 more rows
We can join this new states table with our us_hosps (we also select a subset of the columns using matches to see that it worked; for other handy ways to select columns, see ?select).
us_hosps =
  us_hosps |>
  left_join(states) |>
  arrange(state) |>
  select(week_ending_date, state, state_full, matches("covid"))
Joining with `by = join_by(state)`
We can see that R told us when doing this join that it’s joining with `by = join_by(state)`. This means R tried to guess which key variables it should use to join the tables. It did this by looking at the columns in both tables and finding ones with identical names. This is called a natural join. In this specific case, the only column with the same name in both tables is state, so that is the primary and foreign key that was used for the join. Also note that since this is a left join and us_hosps is the primary table, all of its rows are kept and we simply get the new state_full column added.
Now we are ready to join the deaths and hospitalization tables together since we have full state names in both tables. We do this by telling left_join which variables are the primary and foreign keys since they have different names in the two tables.
us_hosps_deaths =
  us_deaths |>
  left_join(us_hosps, join_by(`Week Ending Date` == week_ending_date, State == state_full))
us_hosps_deaths
# A tibble: 14,364 × 28
`Data as of` `Start Date` `End Date` Group Year Month `MMWR Week`
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 01/18/2024 12/29/2019 01/04/2020 By Week 2019/2020 NA 1
2 01/18/2024 01/05/2020 01/11/2020 By Week 2020 NA 2
3 01/18/2024 01/12/2020 01/18/2020 By Week 2020 NA 3
4 01/18/2024 01/19/2020 01/25/2020 By Week 2020 NA 4
5 01/18/2024 01/26/2020 02/01/2020 By Week 2020 NA 5
6 01/18/2024 02/02/2020 02/08/2020 By Week 2020 NA 6
7 01/18/2024 02/09/2020 02/15/2020 By Week 2020 NA 7
8 01/18/2024 02/16/2020 02/22/2020 By Week 2020 NA 8
9 01/18/2024 02/23/2020 02/29/2020 By Week 2020 NA 9
10 01/18/2024 03/01/2020 03/07/2020 By Week 2020 NA 10
# ℹ 14,354 more rows
# ℹ 21 more variables: `Week Ending Date` <date>, State <chr>,
# `COVID-19 Deaths` <dbl>, `Total Deaths` <dbl>,
# `Percent of Expected Deaths` <dbl>, `Pneumonia Deaths` <dbl>,
# `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
# `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Footnote <chr>,
# state <chr>, avg_adm_all_covid_confirmed <dbl>, …
If we look at the hospitalization columns of us_hosps_deaths, we can see that they are all initially NA, which reflects the fact that rows with these dates and states don’t exist in us_hosps. In fact, the data in us_hosps doesn’t start until August of 2020.
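We can check that directly; here’s a quick sketch using the us_hosps table from above:

# earliest week in the hospitalization data (the text above says August 2020)
us_hosps |>
  summarize(first_week = min(week_ending_date))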
We could take this a step further by adding in more data. Suppose we wanted to get population size into the table so that we could eventually calculate per capita COVID-19 deaths and hospitalizations for each state. First, we need state-level population numbers, which we can get from the United States Census, https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html, and which have been downloaded to the GitHub repository. We can load the file with read_excel and do a little cleaning first.
library(readxl)
pop = read_excel("NST-EST2023-POP.xlsx",
                 range = "A10:F62",
                 col_names = c("state", "Pop Apr 2020", "Pop Jul 2020", "Pop Jul 2021", "Pop Jul 2022", "Pop Jul 2023")) |>
  mutate(state = sub("^\\.", "", state)) # get rid of annoying "." due to space in state column in original excel file
pop
# A tibble: 53 × 6
state `Pop Apr 2020` `Pop Jul 2020` `Pop Jul 2021` `Pop Jul 2022`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Alabama 5024294 5031864 5050380 5073903
2 Alaska 733374 732964 734923 733276
3 Arizona 7157902 7186683 7272487 7365684
4 Arkansas 3011490 3014348 3028443 3046404
5 California 39538212 39503200 39145060 39040616
6 Colorado 5773707 5785219 5811596 5841039
7 Connecticut 3605912 3577586 3603691 3608706
8 Delaware 989946 991862 1004881 1019459
9 District of Colu… 689548 670839 669037 670949
10 Florida 21538216 21591299 21830708 22245521
# ℹ 43 more rows
# ℹ 1 more variable: `Pop Jul 2023` <dbl>
Adding in these population values is simply another left_join:
us_hosps_deaths |>
  left_join(pop, join_by(State == state)) |>
  select(`Week Ending Date`, State, matches("covid"), matches("pop"))
# A tibble: 14,364 × 20
`Week Ending Date` State `COVID-19 Deaths` Pneumonia and COVID-19 D…¹
<date> <chr> <dbl> <dbl>
1 2020-01-04 United States 0 0
2 2020-01-11 United States 1 1
3 2020-01-18 United States 2 2
4 2020-01-25 United States 3 0
5 2020-02-01 United States 0 0
6 2020-02-08 United States 4 1
7 2020-02-15 United States 6 1
8 2020-02-22 United States 6 3
9 2020-02-29 United States 9 5
10 2020-03-07 United States 38 19
# ℹ 14,354 more rows
# ℹ abbreviated name: ¹`Pneumonia and COVID-19 Deaths`
# ℹ 16 more variables: `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>,
# avg_adm_all_covid_confirmed <dbl>,
# pct_chg_avg_adm_all_covid_confirmed_per_100k <dbl>,
# total_adm_all_covid_confirmed_past_7days <dbl>,
# total_adm_all_covid_confirmed_past_7days_per_100k <dbl>, …
Mutating joins in theory
A conceptual diagram of our left_join would look like this:
Rows are matched by their keys, which are the colored columns. Rows in the right table y whose keys don’t exist in the left table x are left out. We get NAs in the columns taken from the right table for rows whose keys exist in the left table but not in the right. The complementary situation occurs for a right join:
An inner_join(x, y) only keeps rows that have values for the key variables in both tables,
and a full_join keeps all rows or observations in both the left and right tables (or x and y),
We can represent these different join types with Venn diagrams too:
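To make these behaviors concrete in code, here is a minimal sketch with two tiny made-up tibbles, x and y (hypothetical data, not the CDC tables):

x = tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y = tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

x |> left_join(y, join_by(key))  # all rows of x kept; key 3 gets NA for val_y
x |> right_join(y, join_by(key)) # all rows of y kept; key 4 gets NA for val_x
x |> inner_join(y, join_by(key)) # only keys present in both tables: 1 and 2
x |> full_join(y, join_by(key))  # all keys from both tables: 1, 2, 3, and 4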
Filtering joins
Filtering joins are handy for filtering tables when the conditions for the filter might be complex enough to be contained within another table. A semi_join(x, y) keeps the rows in x that match a row in y, whereas anti_join(x, y) keeps the rows in x that don’t have a match in y. For example, suppose we want to filter our us_hosps_deaths for a set of specific states. We could create a new tibble for this:
filter_dt = tribble(
  ~state, ~val,
  "California", 1,
  "Kentucky", 1,
  "New York", 1,
  "Texas", 1
)
and then do the join, being careful to use join_by since the column names are different in the two tables (even though just by capitalization!):
us_hosps_deaths |>
  semi_join(filter_dt, join_by(State == state))
# A tibble: 1,064 × 28
`Data as of` `Start Date` `End Date` Group Year Month `MMWR Week`
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 01/18/2024 12/29/2019 01/04/2020 By Week 2019/2020 NA 1
2 01/18/2024 01/05/2020 01/11/2020 By Week 2020 NA 2
3 01/18/2024 01/12/2020 01/18/2020 By Week 2020 NA 3
4 01/18/2024 01/19/2020 01/25/2020 By Week 2020 NA 4
5 01/18/2024 01/26/2020 02/01/2020 By Week 2020 NA 5
6 01/18/2024 02/02/2020 02/08/2020 By Week 2020 NA 6
7 01/18/2024 02/09/2020 02/15/2020 By Week 2020 NA 7
8 01/18/2024 02/16/2020 02/22/2020 By Week 2020 NA 8
9 01/18/2024 02/23/2020 02/29/2020 By Week 2020 NA 9
10 01/18/2024 03/01/2020 03/07/2020 By Week 2020 NA 10
# ℹ 1,054 more rows
# ℹ 21 more variables: `Week Ending Date` <date>, State <chr>,
# `COVID-19 Deaths` <dbl>, `Total Deaths` <dbl>,
# `Percent of Expected Deaths` <dbl>, `Pneumonia Deaths` <dbl>,
# `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
# `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Footnote <chr>,
# state <chr>, avg_adm_all_covid_confirmed <dbl>, …
The anti_join works analogously; here, we use it to find which rows in us_hosps_deaths have states that don’t have population data in the census pop table.
us_hosps_deaths |>
  anti_join(pop, join_by(State == state)) |>
  distinct(State)
# A tibble: 2 × 1
State
<chr>
1 United States
2 New York City
Interesting: the us_hosps_deaths table has death data specifically for “New York City”.
Filtering joins in theory
A conceptual diagram of our semi_join would look like this
where only the rows of x that match the key in y are kept and no columns of y are added. Likewise, for anti_join we get
where only the non-matching rows of x are retained.
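Using the same toy tibbles x and y from the mutating joins sketch above, the two filtering joins look like this:

x |> semi_join(y, join_by(key)) # rows of x with a match in y (keys 1 and 2); no columns from y are added
x |> anti_join(y, join_by(key)) # rows of x without a match in y (key 3)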
Non-equi joins
One key assumption we’ve made in the joins above is that rows are matched when the key columns have equal values in the left and right data tables. However, these values need not be equal; they could be greater than, less than, or satisfy another criterion. The dplyr package identifies four particularly useful types of non-equi joins:
- Cross joins match every pair of rows.
- Inequality joins use <, <=, >, and >= instead of ==.
- Rolling joins are similar to inequality joins but only find the closest match.
- Overlap joins are a special type of inequality join designed to work with ranges.
For more information, see https://r4ds.hadley.nz/joins#sec-non-equi-joins.
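As a small illustration of an inequality join, here is a sketch with made-up tables (hypothetical data): match each measurement to every policy that was already in effect when the measurement was taken.

measurements = tibble(meas_date = as.Date(c("2020-04-01", "2020-06-01")), value = c(10, 20))
policies = tibble(policy = c("A", "B"), start_date = as.Date(c("2020-03-15", "2020-05-01")))

measurements |>
  left_join(policies, join_by(meas_date >= start_date)) # each measurement matches all policies that started on or before it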
Databases
Even though there is a lot of data in Excel spreadsheets, csv files, and similar tabular files, there might be even more data living in databases, which are organized collections of data accessible by a user through special software. Many database systems use a special language called Structured Query Language, or SQL, for accessing the data contained in the database. Lucky for us, there are R packages and tools that translate the data wrangling commands we’ve been learning into equivalent SQL commands, and we’ll briefly discuss some of the features of these tools.
There are a few important differences between databases and the typical data frame you used in R up until now:
Databases are stored on disk and can be very, very large, whereas data frames are typically small enough to be stored entirely within the working memory or RAM of a computer. This means some kinds of datasets, such as all the users of Facebook and the information about them, must be stored in databases.
Databases often have an index for quickly accessing specific rows, which is important when the database is many gigabytes or terabytes in size. Data frames do not have or really need an index.
Databases are often row-oriented, which means data are stored row by row instead of column by column like data frames. This makes adding data faster but data wrangling slower.
Connecting to databases
You need to use a database interface or DBI to connect to a database. These are specific to the kind of database (PostgreSQL, MySQL, etc.), but the R package DBI is helpful here and has many of the interfaces built in. To experiment, we’ll use a SQLite database of San Francisco bike share data from 2018 from https://github.com/ropensci/bikedata. You will first need to download the file.
download.file("https://www.dropbox.com/s/pe4aja9wbcpan9a/bikedb.sqlite?dl=1", "bikedb.sqlite", mode="wb")
Now you can connect to the database
library(DBI)
library(dbplyr)
Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':
ident, sql
dbcon = dbConnect(RSQLite::SQLite(), "bikedb.sqlite")
dbcon
<SQLiteConnection>
Path: /Users/vancleve/science/teaching/UK/BIO-580/2024-Spring/bikedb.sqlite
Extensions: TRUE
Databases may have multiple tables. To see which tables are in this database, we do
dbcon |>
  dbListTables()
[1] "datafiles" "stations" "trips"
Let’s look at the stations table.
dbcon |>
  dbReadTable("stations") |>
  as_tibble()
# A tibble: 332 × 6
id city stn_id name longitude latitude
<int> <chr> <chr> <chr> <chr> <chr>
1 1 sf sf10 Washington St at Kearny St -122.404770255089 37.795392937…
2 2 sf sf100 Bryant St at 15th St -122.410662 37.7671004
3 3 sf sf101 San Bruno Ave at 16th St -122.405676841736 37.766007742…
4 4 sf sf102 Irwin St at 8th St -122.3995794 37.7668828
5 5 sf sf104 4th St at 16th St -122.390833497047 37.767044579…
6 6 sf sf105 16th St at Prosper St -122.4318042 37.764285
7 7 sf sf106 Sanchez St at 17th St -122.4306746 37.7632417
8 8 sf sf107 17th St at Dolores St -122.4264968 37.7630152
9 9 sf sf108 16th St Mission BART -122.419956922531 37.764710085…
10 10 sf sf109 17th St at Valencia St -122.4219039 37.7633158
# ℹ 322 more rows
We can see this is just a list of locations of bike share stations.
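One small wrinkle: the longitude and latitude columns came in as character. If we wanted to compute with or map them, here is a quick sketch of converting them to numbers (using the same dbReadTable call as above):

dbcon |>
  dbReadTable("stations") |>
  as_tibble() |>
  mutate(across(c(longitude, latitude), as.numeric)) # convert the character coordinates to numbers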
dbplyr
The package dbplyr lets us access the database as if we were using our normal dplyr commands, but in the background SQL commands are sent to the database. Let’s load the trips table.
sftrips = dbcon |>
  tbl("trips")
sftrips
# Source: table<trips> [?? x 11]
# Database: sqlite 3.45.0 [/Users/vancleve/science/teaching/UK/BIO-580/2024-Spring/bikedb.sqlite]
id city trip_duration start_time stop_time start_station_id
<int> <chr> <int> <chr> <chr> <chr>
1 1 sf 75284 2018-01-31 22:52:35 2018-02-01 19… sf120
2 2 sf 85422 2018-01-31 16:13:34 2018-02-01 15… sf15
3 3 sf 71576 2018-01-31 14:23:55 2018-02-01 10… sf304
4 4 sf 61076 2018-01-31 14:53:23 2018-02-01 07… sf75
5 5 sf 39966 2018-01-31 19:52:24 2018-02-01 06… sf74
6 6 sf 6477 2018-01-31 22:58:44 2018-02-01 00… sf236
7 7 sf 453 2018-01-31 23:53:53 2018-02-01 00… sf110
8 8 sf 180 2018-01-31 23:52:09 2018-01-31 23… sf81
9 9 sf 996 2018-01-31 23:34:56 2018-01-31 23… sf134
10 10 sf 825 2018-01-31 23:34:14 2018-01-31 23… sf305
# ℹ more rows
# ℹ 5 more variables: end_station_id <chr>, bike_id <chr>, user_type <chr>,
# birth_year <chr>, gender <chr>
You can see that the table shows the number of columns but not the number of rows. This is because the table is loaded lazily, which means data aren’t accessed until they must be. Since all we did was essentially ask for the beginning of the table, it didn’t have to read the whole thing into memory.
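One way to see the laziness in action (a small check, not part of the original workflow): asking R for the number of rows doesn’t trigger a query, but asking the database to count does.

nrow(sftrips)      # NA: the lazy table hasn't been pulled into R, so R doesn't know the row count
sftrips |> count() # this sends a COUNT query to the database and prints the result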
july4th = sftrips |>
  filter(start_time < "2018-07-05 00:00:00", start_time >= "2018-07-04 00:00:00") |>
  group_by(start_station_id) |>
  summarize(mean_duration = mean(trip_duration)) |>
  arrange(desc(mean_duration))
july4th
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
# Source: SQL [?? x 2]
# Database: sqlite 3.45.0 [/Users/vancleve/science/teaching/UK/BIO-580/2024-Spring/bikedb.sqlite]
# Ordered by: desc(mean_duration)
start_station_id mean_duration
<chr> <dbl>
1 sf152 28855
2 sf155 12484.
3 sf206 10292.
4 sf149 5198.
5 sf132 4019.
6 sf180 3608.
7 sf202 3590.
8 sf207 3517
9 sf228 3372
10 sf72 3323.
# ℹ more rows
We could go back to the stations table to get location information about the stations and then use a join to add that information here. It can be helpful to actually get the data into memory in R, which we can do with collect,
july4th |>
  collect()
# A tibble: 282 × 2
start_station_id mean_duration
<chr> <dbl>
1 sf152 28855
2 sf155 12484.
3 sf206 10292.
4 sf149 5198.
5 sf132 4019.
6 sf180 3608.
7 sf202 3590.
8 sf207 3517
9 sf228 3372
10 sf72 3323.
# ℹ 272 more rows
Now we can see the output is a normal tibble.
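And here is a sketch of the join to the stations table mentioned above; because stations lives in the same database, dbplyr can do the join there before we collect the result (assuming a recent dbplyr that translates join_by):

july4th |>
  left_join(dbcon |> tbl("stations"), join_by(start_station_id == stn_id)) |> # add station info from the stations table
  select(start_station_id, name, longitude, latitude, mean_duration) |>
  collect()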
Finally, we can see the SQL code that generates the same query as our dplyr code:
sftrips |>
  filter(start_time < "2018-07-05 00:00:00", start_time >= "2018-07-04 00:00:00") |>
  group_by(start_station_id) |>
  summarize(mean_duration = mean(trip_duration)) |>
  arrange(desc(mean_duration)) |>
  show_query()
<SQL>
SELECT `start_station_id`, AVG(`trip_duration`) AS `mean_duration`
FROM (
SELECT `trips`.*
FROM `trips`
WHERE
(`start_time` < '2018-07-05 00:00:00') AND
(`start_time` >= '2018-07-04 00:00:00')
) AS `q01`
GROUP BY `start_station_id`
ORDER BY `mean_duration` DESC
If you’re interested in learning more SQL (and you don’t have to be, since you have dbplyr!), you can get some intro info in the “R for Data Science” book: https://r4ds.hadley.nz/databases#sql.
Lab
Problems
For some of these problems, we’ll use data from the library nycflights13, which contains airline flight data from 2013; you will need to install this package. Once loaded, you have access to five tables: airlines, airports, flights, planes, and weather.
library(nycflights13)
Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?
Imagine you’ve found the top 10 most popular destinations using this code:
top_dest = flights |> count(dest, sort = TRUE) |> head(10)
How can you find all flights to those destinations?
Add the latitude and the longitude of the origin and destination airports to flights.
What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)
Use the bikedb.sqlite data and find the station with the largest number of trips in the database. Use a join to find out where that station is located.