library(tidyverse)
library(RSocrata)
# Read in hospitalization and deaths
us_deaths = read.socrata("https://data.cdc.gov/api/odata/v4/r8kw-7aab") |> as_tibble()
us_hosps = read.socrata("https://data.cdc.gov/api/odata/v4/aemt-mg7g") |> as_tibble()

us_deaths_hosps =
    us_deaths |>
    rename(week_end_date = week_ending_date) |> # rename this column to match column in `us_hosps`
    select(-c(`data_as_of`, `start_date`, `end_date`, group, year, month, mmwr_week, footnote)) |>
    inner_join( # join the two tables together
        us_hosps |>
        rename(state_abbrv = jurisdiction) |> # `us_hosps` has states as abbreviations so we'll need to add full state names
        inner_join(tibble(state_abbrv = state.abb, state = state.name) |>
                   add_row(state_abbrv = c("USA", "DC", "PR"), state = c("United States", "District of Columbia", "Puerto Rico"))))
Joins and Databases
Outline for today
- Putting data frames together: joins
- Using dplyr to talk to databases
Joining data
Joining data can be a superpower
Remember the CDC COVID-19 data on hospitalizations and deaths we used a few weeks back? The code at the top of these notes reads it in and puts it together.
Looking at the read.socrata commands, you'll notice that we start with two separate tables, one for hospitalizations and one for deaths; the CDC gives us these tables separately. Putting these data tables together requires "joining" them. Not only that, we have to do some extra work to even make the join work, which we accomplish with another join. By learning about joins, we'll not only be able to understand this example, we'll also be able to combine data from all kinds of places.
Keys
One key to understanding joins (pardon the pun 😜) is understanding "keys". Joining data tables is really about a variable that two different tables have in common: the tables can be "joined" by matching rows that share values of that common variable. These common variables are called keys. We can look for common variables in us_deaths and us_hosps:
us_deaths
# A tibble: 16,794 × 17
data_as_of start_date end_date group year month
<dttm> <dttm> <dttm> <chr> <chr> <int>
1 2024-10-03 00:00:00 2019-12-29 00:00:00 2020-01-04 00:00:00 By W… 2019… NA
2 2024-10-03 00:00:00 2020-01-05 00:00:00 2020-01-11 00:00:00 By W… 2020 NA
3 2024-10-03 00:00:00 2020-01-12 00:00:00 2020-01-18 00:00:00 By W… 2020 NA
4 2024-10-03 00:00:00 2020-01-19 00:00:00 2020-01-25 00:00:00 By W… 2020 NA
5 2024-10-03 00:00:00 2020-01-26 00:00:00 2020-02-01 00:00:00 By W… 2020 NA
6 2024-10-03 00:00:00 2020-02-02 00:00:00 2020-02-08 00:00:00 By W… 2020 NA
7 2024-10-03 00:00:00 2020-02-09 00:00:00 2020-02-15 00:00:00 By W… 2020 NA
8 2024-10-03 00:00:00 2020-02-16 00:00:00 2020-02-22 00:00:00 By W… 2020 NA
9 2024-10-03 00:00:00 2020-02-23 00:00:00 2020-02-29 00:00:00 By W… 2020 NA
10 2024-10-03 00:00:00 2020-03-01 00:00:00 2020-03-07 00:00:00 By W… 2020 NA
# ℹ 16,784 more rows
# ℹ 11 more variables: mmwr_week <int>, week_ending_date <dttm>, state <chr>,
# covid_19_deaths <int>, total_deaths <int>,
# percent_of_expected_deaths <dbl>, pneumonia_deaths <int>,
# pneumonia_and_covid_19_deaths <int>, influenza_deaths <int>,
# pneumonia_influenza_or_covid_19_deaths <int>, footnote <chr>
us_hosps
# A tibble: 12,369 × 64
week_end_date jurisdiction weekly_actual_days_reporting_any_data
<dttm> <chr> <int>
1 2020-08-08 00:00:00 AZ 458
2 2020-08-15 00:00:00 AZ 453
3 2020-08-22 00:00:00 AZ 455
4 2020-08-29 00:00:00 AZ 433
5 2020-09-05 00:00:00 AZ 466
6 2020-09-12 00:00:00 AZ 435
7 2020-09-19 00:00:00 AZ 465
8 2020-09-26 00:00:00 AZ 460
9 2020-10-03 00:00:00 AZ 447
10 2020-10-10 00:00:00 AZ 463
# ℹ 12,359 more rows
# ℹ 61 more variables: weekly_percent_days_reporting_any_data <dbl>,
# num_hospitals_previous_day_admission_adult_covid_confirmed <int>,
# num_hospitals_previous_day_admission_pediatric_covid_confirmed <int>,
# num_hospitals_previous_day_admission_influenza_confirmed <int>,
# num_hospitals_total_patients_hospitalized_confirmed_influenza <int>,
# num_hospitals_icu_patients_confirmed_influenza <int>, …
We notice that the two data tables record their information for each week and for each state or jurisdiction. So each combination of these variables denotes a unique observation and hence the tables are tidy. Moreover, those two variables, week_ending_date/week_end_date and state/jurisdiction, are the keys that we can use to join the tables together. We can check that the combination of these variables uniquely identifies each observation by looking to see whether each combination occurs only once in the dataset:
us_deaths |>
    count(week_ending_date, state) |>
    filter(n > 1)
# A tibble: 54 × 3
week_ending_date state n
<dttm> <chr> <int>
1 NA Alabama 63
2 NA Alaska 63
3 NA Arizona 63
4 NA Arkansas 63
5 NA California 63
6 NA Colorado 63
7 NA Connecticut 63
8 NA Delaware 63
9 NA District of Columbia 63
10 NA Florida 63
# ℹ 44 more rows
Ruh roh. It looks like an NA date paired with each state shows up many times! Let's check out why by filtering those NA dates:
us_deaths |> filter(is.na(week_ending_date))
# A tibble: 3,402 × 17
data_as_of start_date end_date group year month
<dttm> <dttm> <dttm> <chr> <chr> <int>
1 2024-10-03 00:00:00 2020-01-01 00:00:00 2020-01-31 00:00:00 By M… 2020 1
2 2024-10-03 00:00:00 2020-02-01 00:00:00 2020-02-29 00:00:00 By M… 2020 2
3 2024-10-03 00:00:00 2020-03-01 00:00:00 2020-03-31 00:00:00 By M… 2020 3
4 2024-10-03 00:00:00 2020-04-01 00:00:00 2020-04-30 00:00:00 By M… 2020 4
5 2024-10-03 00:00:00 2020-05-01 00:00:00 2020-05-31 00:00:00 By M… 2020 5
6 2024-10-03 00:00:00 2020-06-01 00:00:00 2020-06-30 00:00:00 By M… 2020 6
7 2024-10-03 00:00:00 2020-07-01 00:00:00 2020-07-31 00:00:00 By M… 2020 7
8 2024-10-03 00:00:00 2020-08-01 00:00:00 2020-08-31 00:00:00 By M… 2020 8
9 2024-10-03 00:00:00 2020-09-01 00:00:00 2020-09-30 00:00:00 By M… 2020 9
10 2024-10-03 00:00:00 2020-10-01 00:00:00 2020-10-31 00:00:00 By M… 2020 10
# ℹ 3,392 more rows
# ℹ 11 more variables: mmwr_week <int>, week_ending_date <dttm>, state <chr>,
# covid_19_deaths <int>, total_deaths <int>,
# percent_of_expected_deaths <dbl>, pneumonia_deaths <int>,
# pneumonia_and_covid_19_deaths <int>, influenza_deaths <int>,
# pneumonia_influenza_or_covid_19_deaths <int>, footnote <chr>
If we look at the group column, it says "By Month", so these rows must be capturing monthly totals. Are there other rows we should be worried about?
us_deaths |> distinct(group)
# A tibble: 4 × 1
group
<chr>
1 By Week
2 By Month
3 By Year
4 By Total
Ah, yes. What if we keep only the "By Week" rows: do all of those have a week_ending_date?
us_deaths |>
    filter(group == "By Week", is.na(week_ending_date))
# A tibble: 0 × 17
# ℹ 17 variables: data_as_of <dttm>, start_date <dttm>, end_date <dttm>,
# group <chr>, year <chr>, month <int>, mmwr_week <int>,
# week_ending_date <dttm>, state <chr>, covid_19_deaths <int>,
# total_deaths <int>, percent_of_expected_deaths <dbl>,
# pneumonia_deaths <int>, pneumonia_and_covid_19_deaths <int>,
# influenza_deaths <int>, pneumonia_influenza_or_covid_19_deaths <int>,
# footnote <chr>
They do! This means the "By Week" rows in us_deaths all have week_ending_date values. Now we can check us_hosps to see if it has unique combinations of week end date and jurisdiction:
us_hosps |>
    count(week_end_date, jurisdiction) |>
    filter(n > 1)
# A tibble: 0 × 3
# ℹ 3 variables: week_end_date <dttm>, jurisdiction <chr>, n <int>
Great, it does! That means that, for the weekly rows, the combination of week ending date and state/jurisdiction uniquely identifies observations in both the deaths and hospitalizations tables, so we can use those variables to match up corresponding observations.
If we want to join the death data and the hospitalization table, we would call week_ending_date and state the primary keys, which identify unique observations in the first table in the join, and week_end_date and jurisdiction the foreign keys, which are the keys that correspond to the primary keys in the second table in the join.
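Later on, we will tell dplyr explicitly how the primary keys line up with the foreign keys using join_by(). Just as a preview of that syntax (a sketch only; as we'll see below, the state values themselves need some extra work before they actually match), the key specification for these two tables could be written as:
join_by(week_ending_date == week_end_date, # primary key == foreign key
        state == jurisdiction)             # primary key == foreign key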
Basic joins
There are two basic kinds of data table joins: joins that combine data from one table into another, which are called mutating joins, and joins that use one table to filter the rows of another table, which are called filtering joins. Mutating joins are usually done with the functions left_join, inner_join, right_join, and full_join, and filtering joins with semi_join and anti_join.
Mutating joins
We'll start by examining inner_join(x, y), which takes variables from table y and adds them to table x, keeping only the rows where the key values exist in both tables x and y (the name "inner" will make more sense when we talk about all kinds of mutating joins). We use inner_join in our CDC example to join the hospitalization and death data. To do the join, we need our keys, so let's look at the combinations of our key variables again for each table:
us_deaths |>
    distinct(week_ending_date, state) |>
    arrange(state)
# A tibble: 13,446 × 2
week_ending_date state
<dttm> <chr>
1 2020-01-04 00:00:00 Alabama
2 2020-01-11 00:00:00 Alabama
3 2020-01-18 00:00:00 Alabama
4 2020-01-25 00:00:00 Alabama
5 2020-02-01 00:00:00 Alabama
6 2020-02-08 00:00:00 Alabama
7 2020-02-15 00:00:00 Alabama
8 2020-02-22 00:00:00 Alabama
9 2020-02-29 00:00:00 Alabama
10 2020-03-07 00:00:00 Alabama
# ℹ 13,436 more rows
us_hosps |>
    distinct(week_end_date, jurisdiction) |>
    arrange(jurisdiction)
# A tibble: 12,369 × 2
week_end_date jurisdiction
<dttm> <chr>
1 2020-08-08 00:00:00 AK
2 2020-08-15 00:00:00 AK
3 2020-08-22 00:00:00 AK
4 2020-08-29 00:00:00 AK
5 2020-09-05 00:00:00 AK
6 2020-09-12 00:00:00 AK
7 2020-09-19 00:00:00 AK
8 2020-09-26 00:00:00 AK
9 2020-10-03 00:00:00 AK
10 2020-10-10 00:00:00 AK
# ℹ 12,359 more rows
Ruh roh again. The deaths table's state column has the full state name while the hospitalizations table's jurisdiction column has the two-letter state abbreviation. This means that R can't join the tables yet since it won't know which state in us_deaths goes with which jurisdiction in us_hosps. So, we'll need to convert one to the other. Lucky for us, R has some built-in vectors that help, state.abb and state.name, which we combine into a new data table along with abbreviations for Washington DC, Puerto Rico, and the United States.
states =
    tibble(state_abbrv = state.abb, state = state.name) |>
    add_row(state_abbrv = c("USA", "DC", "PR"),
            state = c("United States", "District of Columbia", "Puerto Rico"))
states
# A tibble: 53 × 2
state_abbrv state
<chr> <chr>
1 AL Alabama
2 AK Alaska
3 AZ Arizona
4 AR Arkansas
5 CA California
6 CO Colorado
7 CT Connecticut
8 DE Delaware
9 FL Florida
10 GA Georgia
# ℹ 43 more rows
How can we combine the state names from states into us_hosps? We can do this with a join! To see this, let's join our new states table with us_hosps.
us_hosps_w_states =
    us_hosps |>
    rename(state_abbrv = jurisdiction) |>
    inner_join(states) |>
    arrange(state) |>
    relocate(week_end_date, state_abbrv, state) # this moves these columns to the beginning of the data frame
Joining with `by = join_by(state_abbrv)`
us_hosps_w_states
# A tibble: 11,501 × 65
week_end_date state_abbrv state weekly_actual_days_reporting_any_data
<dttm> <chr> <chr> <int>
1 2020-08-08 00:00:00 AL Alabama 629
2 2020-08-15 00:00:00 AL Alabama 637
3 2020-08-22 00:00:00 AL Alabama 632
4 2020-08-29 00:00:00 AL Alabama 636
5 2020-09-05 00:00:00 AL Alabama 640
6 2020-09-12 00:00:00 AL Alabama 637
7 2020-09-19 00:00:00 AL Alabama 631
8 2020-09-26 00:00:00 AL Alabama 635
9 2020-10-03 00:00:00 AL Alabama 641
10 2020-10-10 00:00:00 AL Alabama 637
# ℹ 11,491 more rows
# ℹ 61 more variables: weekly_percent_days_reporting_any_data <dbl>,
# num_hospitals_previous_day_admission_adult_covid_confirmed <int>,
# num_hospitals_previous_day_admission_pediatric_covid_confirmed <int>,
# num_hospitals_previous_day_admission_influenza_confirmed <int>,
# num_hospitals_total_patients_hospitalized_confirmed_influenza <int>,
# num_hospitals_icu_patients_confirmed_influenza <int>, …
We can see that R told us when doing this join that it's joining with `by = join_by(state_abbrv)`. This means R guessed which key variables it should use to join the tables. It did this by looking at the columns in both tables and finding ones with identical names. This is called a natural join. In this specific case, the only column with the same name in both tables is state_abbrv, so that is the primary and foreign key that was used for the join. Also note that since this is an inner join, only the rows of us_hosps whose abbreviation appears in states are kept (we'll see below exactly which rows get dropped), and we get the new state column added.
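If we'd rather not rely on dplyr guessing the key (the natural join), we can state it explicitly; this sketch should give the same result as the join above:
us_hosps |>
    rename(state_abbrv = jurisdiction) |>
    inner_join(states, by = join_by(state_abbrv)) # key written out instead of guessed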
Now we are ready to join the deaths and hospitalization tables together since we have full state names in both tables. We do this by telling inner_join which variables are the primary and foreign keys since they have different names in the two tables.
us_deaths_hosps =
    us_deaths |>
    inner_join(us_hosps_w_states, join_by(week_ending_date == week_end_date, state == state)) |>
    select(-data_as_of, -start_date, -end_date, -group, -year, -month)
us_deaths_hosps
# A tibble: 11,501 × 74
mmwr_week week_ending_date state covid_19_deaths total_deaths
<int> <dttm> <chr> <int> <int>
1 32 2020-08-08 00:00:00 United States 7870 63719
2 33 2020-08-15 00:00:00 United States 7270 63644
3 34 2020-08-22 00:00:00 United States 6389 62562
4 35 2020-08-29 00:00:00 United States 5755 61104
5 36 2020-09-05 00:00:00 United States 5021 60247
6 37 2020-09-12 00:00:00 United States 4628 59665
7 38 2020-09-19 00:00:00 United States 4280 59740
8 39 2020-09-26 00:00:00 United States 4305 60613
9 40 2020-10-03 00:00:00 United States 4246 59802
10 41 2020-10-10 00:00:00 United States 4823 61779
# ℹ 11,491 more rows
# ℹ 69 more variables: percent_of_expected_deaths <dbl>,
# pneumonia_deaths <int>, pneumonia_and_covid_19_deaths <int>,
# influenza_deaths <int>, pneumonia_influenza_or_covid_19_deaths <int>,
# footnote <chr>, state_abbrv <chr>,
# weekly_actual_days_reporting_any_data <int>,
# weekly_percent_days_reporting_any_data <dbl>, …
Let’s quickly compare how big this combined table is to the two original tables.
nrow(us_deaths)
[1] 16794
nrow(us_hosps)
[1] 12369
nrow(us_deaths_hosps)
[1] 11501
The table created with the inner_join has fewer rows than either of the original tables. This is because inner_join only keeps rows in the output table where the combination of key values exists in both of the tables in the join. A left_join(x, y) or right_join(x, y), on the other hand, will keep all the rows in the x ("left") or y ("right") table, respectively. This, however, means that some rows have no values for the variables coming from the y table (for a left_join) or the x table (for a right_join), and so R puts NA values in those places. To see this, suppose we do a left_join and combine us_hosps into us_deaths.
us_deaths |>
    left_join(us_hosps_w_states, join_by(week_ending_date == week_end_date, state == state)) |>
    select(-data_as_of, -start_date, -end_date, -group, -year, -month)
# A tibble: 16,794 × 74
mmwr_week week_ending_date state covid_19_deaths total_deaths
<int> <dttm> <chr> <int> <int>
1 1 2020-01-04 00:00:00 United States 0 60170
2 2 2020-01-11 00:00:00 United States 1 60734
3 3 2020-01-18 00:00:00 United States 2 59362
4 4 2020-01-25 00:00:00 United States 3 59162
5 5 2020-02-01 00:00:00 United States 0 58834
6 6 2020-02-08 00:00:00 United States 4 59482
7 7 2020-02-15 00:00:00 United States 6 58815
8 8 2020-02-22 00:00:00 United States 6 58912
9 9 2020-02-29 00:00:00 United States 9 59334
10 10 2020-03-07 00:00:00 United States 38 59694
# ℹ 16,784 more rows
# ℹ 69 more variables: percent_of_expected_deaths <dbl>,
# pneumonia_deaths <int>, pneumonia_and_covid_19_deaths <int>,
# influenza_deaths <int>, pneumonia_influenza_or_covid_19_deaths <int>,
# footnote <chr>, state_abbrv <chr>,
# weekly_actual_days_reporting_any_data <int>,
# weekly_percent_days_reporting_any_data <dbl>, …
If we look at the first rows of this table, we can see that many of the hospitalization variables are NA. This is because these dates, which are in early 2020, don't exist in the us_hosps data, which only starts in August of 2020:
us_hosps |> arrange(week_end_date)
# A tibble: 12,369 × 64
week_end_date jurisdiction weekly_actual_days_reporting_any_data
<dttm> <chr> <int>
1 2020-08-08 00:00:00 AZ 458
2 2020-08-08 00:00:00 LA 919
3 2020-08-08 00:00:00 SC 424
4 2020-08-08 00:00:00 IA 802
5 2020-08-08 00:00:00 NM 128
6 2020-08-08 00:00:00 PA 991
7 2020-08-08 00:00:00 GU 13
8 2020-08-08 00:00:00 ME 177
9 2020-08-08 00:00:00 ND 264
10 2020-08-08 00:00:00 WA 575
# ℹ 12,359 more rows
# ℹ 61 more variables: weekly_percent_days_reporting_any_data <dbl>,
# num_hospitals_previous_day_admission_adult_covid_confirmed <int>,
# num_hospitals_previous_day_admission_pediatric_covid_confirmed <int>,
# num_hospitals_previous_day_admission_influenza_confirmed <int>,
# num_hospitals_total_patients_hospitalized_confirmed_influenza <int>,
# num_hospitals_icu_patients_confirmed_influenza <int>, …
In some cases, we may prefer the left_join here since we want our table to reflect the fact that deaths were being recorded by the CDC in early 2020 even though hospitalization numbers were not.
Mutating joins in theory
A conceptual diagram of our left_join would look like this:
Rows are matched by their keys, which are the colored columns. Rows in the right table y whose keys don't exist in the left table x are left out, and we get NAs in the columns coming from the right table for rows where the left table has a key value that the right table doesn't. The complementary situation occurs for a right join:
An inner_join(x, y) only keeps rows that have values for the key variables in both tables,
and a full_join keeps all rows or observations in both the left and right tables (or x and y).
We can represent these different join types with Venn diagrams too:
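To make the four mutating joins concrete, here is a minimal sketch using two small made-up tables x and y (the keys and values are invented purely for illustration):
x = tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y = tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

left_join(x, y, by = join_by(key))  # keeps keys 1, 2, 3; key 3 gets NA for val_y
right_join(x, y, by = join_by(key)) # keeps keys 1, 2, 4; key 4 gets NA for val_x
inner_join(x, y, by = join_by(key)) # keeps only keys 1 and 2, which appear in both tables
full_join(x, y, by = join_by(key))  # keeps keys 1 through 4, with NAs wherever a table lacks a key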
Filtering joins
Filtering joins are handy for filtering tables when the conditions for the filter might be complex enough to be contained within another table. A semi_join(x, y) keeps the rows in x that match a row in y, whereas anti_join(x, y) keeps the rows in x that don't have a match in y. For example, suppose we want to filter our us_deaths_hosps for a set of specific states. We could create a new tibble for this:
filter_dt = tribble(
    ~state, ~val,
    "California", 1,
    "Kentucky", 1,
    "New York", 1,
    "Texas", 1
)
and then do the join:
us_deaths_hosps |>
    semi_join(filter_dt)
Joining with `by = join_by(state)`
# A tibble: 868 × 74
mmwr_week week_ending_date state covid_19_deaths total_deaths
<int> <dttm> <chr> <int> <int>
1 32 2020-08-08 00:00:00 California 998 6309
2 33 2020-08-15 00:00:00 California 878 6291
3 34 2020-08-22 00:00:00 California 830 6478
4 35 2020-08-29 00:00:00 California 728 5878
5 36 2020-09-05 00:00:00 California 614 5784
6 37 2020-09-12 00:00:00 California 530 5874
7 38 2020-09-19 00:00:00 California 482 5755
8 39 2020-09-26 00:00:00 California 453 5563
9 40 2020-10-03 00:00:00 California 346 5671
10 41 2020-10-10 00:00:00 California 375 5568
# ℹ 858 more rows
# ℹ 69 more variables: percent_of_expected_deaths <dbl>,
# pneumonia_deaths <int>, pneumonia_and_covid_19_deaths <int>,
# influenza_deaths <int>, pneumonia_influenza_or_covid_19_deaths <int>,
# footnote <chr>, state_abbrv <chr>,
# weekly_actual_days_reporting_any_data <int>,
# weekly_percent_days_reporting_any_data <dbl>, …
The anti_join works analogously; here, we can use it to note something about what happened when we gave us_hosps its full state names.
us_hosps |>
    anti_join(states, join_by(jurisdiction == state_abbrv)) |>
    distinct(jurisdiction)
# A tibble: 4 × 1
jurisdiction
<chr>
1 GU
2 VI
3 MP
4 AS
The rows in this table are the ones in us_hosps whose jurisdiction does not have a matching state_abbrv value in the states table. These jurisdictions are US territories like Guam, the US Virgin Islands, etc. Our previous use of inner_join tossed out these rows since they don't have a match in state_abbrv.
Filtering joins in theory
A conceptual diagram of our semi_join would look like this
where only the rows of x are kept that match a key in y and no columns of y are added. Likewise, for anti_join we get
where only the non-matching rows of x are retained.
Non-equi joins
One key assumption we've made in the above joins is that rows are kept in the joined data table when the key columns have equal values in the left and right data tables. However, these values need not be equal; they could be greater than, less than, or satisfy another criterion. The dplyr package identifies four particularly useful types of non-equi joins:
- Cross joins match every pair of rows (generates the Cartesian product of the two tables).
- Inequality joins use <, <=, >, and >= instead of ==.
- Rolling joins are similar to inequality joins but only find the closest match.
- Overlap joins are a special type of inequality join designed to work with ranges.
For more information, see https://r4ds.hadley.nz/joins#sec-non-equi-joins.
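As a quick taste, here is a sketch of an inequality join on invented data (the tables employees and policies and their columns are made up purely for illustration):
employees = tibble(name = c("Ada", "Ben"), hired = c(2015, 2021))
policies = tibble(policy = c("old plan", "new plan"), starts = c(2010, 2020))

# match each employee to every policy that started on or before their hire year
employees |>
    inner_join(policies, by = join_by(hired >= starts))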
Databases
Even though there is a lot of data in Excel spreadsheets, csv files, and similar tabular files, there might be even more data living in databases, which are organized collections of data accessible by a user through special software. Many database systems use a special language called Structured Query Language, or SQL, for accessing the data contained in the database. Lucky for us, there are R packages and tools that translate the data wrangling commands we've been learning into equivalent SQL commands, and we'll briefly discuss some of the features of these tools.
There are a few important differences between databases and the typical data frame you used in R up until now:
- Databases are stored on disk and can be very, very large, whereas data frames are typically small enough to be stored entirely within the working memory (RAM) of a computer. This means some kinds of datasets, such as all the users of Facebook and the information about them, must be stored in databases.
- Databases often have an index for quickly accessing specific rows, which is important when the database is many gigabytes or terabytes in size. Data frames do not have, or really need, an index.
- Databases are often row-oriented, which means data is stored row-by-row instead of column-by-column like data frames. This makes adding data faster but doing data wrangling slower.
Connecting to databases
You need to use a database interface, or DBI, to connect to a database. These are specific to the kind of database (PostgreSQL, MySQL, etc.), but the R package DBI is helpful here and has many of the interfaces built in. In order to experiment, we'll use a SQLite database of San Francisco bike share data from 2018 from https://github.com/ropensci/bikedata. The file, bikedb.sqlite, should be in the project template "BIO540-DWV" in the course space on Posit Cloud. You can connect to the database using the code below.
library(DBI)
library(dbplyr)
Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':
ident, sql
dbcon = dbConnect(RSQLite::SQLite(), "bikedb.sqlite")
dbcon
<SQLiteConnection>
Path: /Users/vancleve/science/teaching/UK/BIO-540/2024-Fall/bikedb.sqlite
Extensions: TRUE
Databases may have multiple tables. To see which tables are in this database, we do
dbcon |>
    dbListTables()
[1] "datafiles" "stations" "trips"
Let's look at the stations table.
dbcon |>
    dbReadTable("stations") |>
    as_tibble()
# A tibble: 332 × 6
id city stn_id name longitude latitude
<int> <chr> <chr> <chr> <chr> <chr>
1 1 sf sf10 Washington St at Kearny St -122.404770255089 37.795392937…
2 2 sf sf100 Bryant St at 15th St -122.410662 37.7671004
3 3 sf sf101 San Bruno Ave at 16th St -122.405676841736 37.766007742…
4 4 sf sf102 Irwin St at 8th St -122.3995794 37.7668828
5 5 sf sf104 4th St at 16th St -122.390833497047 37.767044579…
6 6 sf sf105 16th St at Prosper St -122.4318042 37.764285
7 7 sf sf106 Sanchez St at 17th St -122.4306746 37.7632417
8 8 sf sf107 17th St at Dolores St -122.4264968 37.7630152
9 9 sf sf108 16th St Mission BART -122.419956922531 37.764710085…
10 10 sf sf109 17th St at Valencia St -122.4219039 37.7633158
# ℹ 322 more rows
We can see this is just a list of locations of bike share stations.
dbplyr
The package dbplyr lets us access the database as if we were using our normal dplyr commands, but in the background SQL commands are sent to the database. Let's load the trips table.
sftrips = dbcon |>
    tbl("trips")
sftrips
# Source: table<`trips`> [?? x 11]
# Database: sqlite 3.46.0 [/Users/vancleve/science/teaching/UK/BIO-540/2024-Fall/bikedb.sqlite]
id city trip_duration start_time stop_time start_station_id
<int> <chr> <int> <chr> <chr> <chr>
1 1 sf 75284 2018-01-31 22:52:35 2018-02-01 19… sf120
2 2 sf 85422 2018-01-31 16:13:34 2018-02-01 15… sf15
3 3 sf 71576 2018-01-31 14:23:55 2018-02-01 10… sf304
4 4 sf 61076 2018-01-31 14:53:23 2018-02-01 07… sf75
5 5 sf 39966 2018-01-31 19:52:24 2018-02-01 06… sf74
6 6 sf 6477 2018-01-31 22:58:44 2018-02-01 00… sf236
7 7 sf 453 2018-01-31 23:53:53 2018-02-01 00… sf110
8 8 sf 180 2018-01-31 23:52:09 2018-01-31 23… sf81
9 9 sf 996 2018-01-31 23:34:56 2018-01-31 23… sf134
10 10 sf 825 2018-01-31 23:34:14 2018-01-31 23… sf305
# ℹ more rows
# ℹ 5 more variables: end_station_id <chr>, bike_id <chr>, user_type <chr>,
# birth_year <chr>, gender <chr>
You can see that the table shows the number of columns but not the number of rows. This is because the table is loaded lazily, which means data aren't accessed until they must be. Since all we did was essentially ask for the beginning of the table, it didn't have to read the whole thing into memory. Suppose we wanted to collect all the bike trips that were taken on July 4th, which is a US holiday, and we wanted to see which station generated the longest trips on average. We could use our normal dplyr tools as follows:
july4th = sftrips |>
    filter(start_time < "2018-07-05 00:00:00", start_time >= "2018-07-04 00:00:00") |>
    group_by(start_station_id) |>
    summarize(mean_duration = mean(trip_duration)) |>
    arrange(desc(mean_duration)) |>
    collect()
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
july4th
# A tibble: 282 × 2
start_station_id mean_duration
<chr> <dbl>
1 sf152 28855
2 sf155 12484.
3 sf206 10292.
4 sf149 5198.
5 sf132 4019.
6 sf180 3608.
7 sf202 3590.
8 sf207 3517
9 sf228 3372
10 sf72 3323.
# ℹ 272 more rows
The final command above, collect, pulls the results of the query out of the database so that we now have a normal tibble containing just the results we asked for. This can be useful since pulling from the database can be a little slow, but only do this if the resulting table is small enough to store in RAM.
We could go back to the stations table to get location information about the stations and then use a join to add that information here.
stations = dbcon |>
    dbReadTable("stations") |>
    as_tibble()

july4th |>
    left_join(stations, by = join_by(start_station_id == stn_id))
# A tibble: 282 × 7
start_station_id mean_duration id city name longitude latitude
<chr> <dbl> <int> <chr> <chr> <chr> <chr>
1 sf152 28855 51 sf 47th St at San… -122.281… 37.8356…
2 sf155 12484. 54 sf Emeryville Pub… -122.293… 37.8405…
3 sf206 10292. 105 sf College Ave at… -122.251… 37.8381…
4 sf149 5198. 47 sf Emeryville Tow… -122.285… 37.8312…
5 sf132 4019. 31 sf 24th St at Cha… -122.426… 37.7518…
6 sf180 3608. 80 sf Telegraph Ave … -122.268… 37.8126…
7 sf202 3590. 101 sf Washington St … -122.274… 37.8007…
8 sf207 3517 106 sf Broadway at Co… -122.251… 37.8357…
9 sf228 3372 129 sf Foothill Blvd … -122.217… 37.77993
10 sf72 3323. 305 sf Page St at Sco… -122.435… 37.7724…
# ℹ 272 more rows
We can see that the station that generated the longest rides, on average, on July 4th is the 47th St at San Pablo Ave station.
Finally, we can see the SQL code that generates the same query we executed with our dplyr code:
sftrips |>
    filter(start_time < "2018-07-05 00:00:00", start_time >= "2018-07-04 00:00:00") |>
group_by(start_station_id) |>
summarize(mean_duration = mean(trip_duration)) |>
arrange(desc(mean_duration)) |>
show_query()
<SQL>
SELECT `start_station_id`, AVG(`trip_duration`) AS `mean_duration`
FROM (
SELECT `trips`.*
FROM `trips`
WHERE
(`start_time` < '2018-07-05 00:00:00') AND
(`start_time` >= '2018-07-04 00:00:00')
) AS `q01`
GROUP BY `start_station_id`
ORDER BY `mean_duration` DESC
If you're interested in more SQL (and you don't have to be since you have dbplyr!), you can get some intro info in the "R for Data Science" book: https://r4ds.hadley.nz/databases#sql.
Lab
For some of these problems, we'll use data from the library nycflights13, which contains airline flight data from 2013. Once loaded, you have access to five tables: airlines, airports, flights, planes, and weather.
library(nycflights13)
Find the day that has the worst total departure delay (over all flights for that day).
Imagine you’ve found the top 10 most popular destinations using this code:
top_dest = flights |> count(dest, sort = TRUE) |> head(10)
How can you find all flights to those destinations?
Add the latitude and the longitude of the origin and destination airports to flights.
What do the tail numbers in flights that don't have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)
Use the bikedb.sqlite data and find the station with the most trips in the database. Use a join to find out where that station is located.
Challenge problem (+5 extra credit)
Using bikedb.sqlite, find the longest bike trip by distance between stations. What are the names of the starting and ending stations for this trip? You may calculate the distance between the stations from their longitude and latitude using the geosphere package.