The basics of `ggplot2`

Author

Rosana Zenil-Ferguson and Jeremy Van Cleve

Published

March 6, 2025

Workshop materials

Workshop materials can be found at https://github.com/vancleve/data2design_uk.

Download R and RStudio

If you haven’t already downloaded and installed R and RStudio, please do so now.

R

Windows: https://cran.rstudio.com/bin/windows/base/
macOS: https://cran.rstudio.com/bin/macosx/

RStudio

https://posit.co/download/rstudio-desktop/

Getting a Project going in RStudio

RStudio cheatsheet

Create a new project in RStudio

Using a “project” allows us to keep all our files in one place and helps RStudio understand which files belong to which project.

Go to File->New Project->New directory->New project
Select an easy to find directory
Name your project something descriptive like “data2design_ggplot_tutorial”.
Save this qmd file into your project directory.
Go to https://github.com/vancleve/data2design_uk/raw/refs/heads/main/ggplot_tutorial.qmd and Save As... in your browser.

Packages

Packages are groups of different functions or “actions” that allow us to do special tasks. People write packages to help others with reproducibility of analyses.

We will be working with the plotting package ggplot2 and a set of packages for data wrangling called tidyverse (https://www.tidyverse.org/). By installing just tidyverse, we can get all those packages at once. We’ll also install a package, cowplot, for helping with finessing and saving our plots.

install.packages("tidyverse") # packages for data wrangling and plotting including dplyr, readr, tidyr, stringr, readxl, ggplot2, and others
install.packages("cowplot")
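If you are not sure whether a package is already installed, a small check like the one below can help. This is only a sketch: the helper name missing_packages is made up for illustration, and it relies on the base R function requireNamespace, which returns FALSE when a package cannot be found.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages that ship with R are always available, so nothing is missing here.
missing_packages(c("stats", "utils"))

# To install only what is missing for this tutorial, uncomment the next line:
# install.packages(missing_packages(c("tidyverse", "cowplot")))
```

Installing only the missing packages avoids re-downloading packages you already have.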

Once installed, packages must be loaded into our “environment” (it is like bringing a pen if you are planning to write, not only buying it!). To load them we use the function library().

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)
library(cowplot)


Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp

Reading data

The function read_excel from the package readxl allows us to read Excel files and read_csv allows us to read csv (comma separated value) files. We will first load a dataset from Onstein et al. (2019)¹ on frugivory‐related traits and dispersal in the custard apple family. In the code below, we’ll load the data directly from Dryad (https://datadryad.org/dataset/doi:10.5061/dryad.2hd8b0s), which is a repository for archiving datasets, by first downloading the xlsx file and then loading it with read_excel. We’ll also simplify some of the column names for clarity and convert some of the columns from log values back to the original scale.

tf = tempfile(fileext = ".xlsx")
curl::curl_download("https://datadryad.org/downloads/file_stream/82494", tf)

fruits <- 
  read_excel(tf, sheet = "Matrix for analysis", na = "NA") |>
  rename(Taxon_tree = Species_tree, Taxon = Species_PROTEUS) |>
  mutate(across(starts_with("Log_"), exp, .names = "Exp_{col}")) |>
  rename_with(\(x) gsub("Exp_Log_", "", x, fixed = TRUE))

## Check what is inside the dataset, was it read fine?
fruits

# A tibble: 228 × 27
   Taxon_tree                   Taxon   Log_Fruit_length_avg Log_Fruit_width_avg
   <chr>                        <chr>                  <dbl>               <dbl>
 1 Alphonsea_boniana_JB57       Alphon…               NA                 NA     
 2 Alphonsea_elliptica_JB34     Alphon…                0.763              0.580 
 3 Alphonsea_kinabaluensis_JB85 Alphon…                0.415              0.279 
 4 Ambavia_gerrardii_PB         Ambavi…                0.301              0.114 
 5 Anaxagorea_javanica_JB       Anaxag…                0.544              0.544 
 6 Anaxagorea_luzonensis_JB     Anaxag…                0.415             -0.222 
 7 Anaxagorea_phaeocarpa_0498   Anaxag…                0.505              0.301 
 8 Anaxagorea_silvatica_0113    Anaxag…                0.176              0.0414
 9 Annickia_chlorantha_0976     Annick…                0.114             -0.155 
10 Annickia_kummeriae_MWC7004   Annick…                0.342              0.0414
# ℹ 218 more rows
# ℹ 23 more variables: Log_Fruit_number_avg <dbl>, Log_Stipe_length_avg <dbl>,
#   Log_Seed_length_avg <dbl>, Log_Seed_width_avg <dbl>,
#   Log_Seed_number_avg <dbl>, Log_Height_max <dbl>, Bright <chr>,
#   Pubescent <chr>, Defence <chr>, Fruit_type <chr>, Cauliflory <chr>,
#   Moniliform <chr>, Dehiscence <chr>, Shrub <chr>, Conspicuousness <chr>,
#   Fruit_length_avg <dbl>, Fruit_width_avg <dbl>, Fruit_number_avg <dbl>, …

There is a lot going on in the data wrangling above that we won’t have time to cover today, but one thing to point out now is the use of the pipe operator, |>, above. The pipe operator helps us string together data wrangling commands; what’s on the left hand side of the pipe goes into the first argument of what’s on the right hand side. This means the above code starts with the read_excel function, which returns the raw table. That table is given to the rename function, which renames a couple of the columns and returns a modified table that is then passed to mutate, and so on. Finally, the result is saved in the fruits variable.
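As a minimal illustration of the pipe, the sketch below computes the same quantity twice using only base R functions. The vector x is made up for the example.

```r
# A made-up vector to work with.
x <- c(1, 4, 9)

# Nested calls are read from the inside out.
nested <- sum(sqrt(x))

# With the pipe, each step is read from left to right:
# x goes into sqrt(), and the result goes into sum().
piped <- x |> sqrt() |> sum()

identical(nested, piped)
```

Both versions give the same answer; the piped version simply lists the steps in the order they happen.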

Plotting and understanding our data

Basics of `ggplot2`: bar plots

The ggplot2 package is built on the idea that graphics can have a “grammar”, or set of rules, that specifies how they can and should be constructed. Implementing these rules not only makes creating graphics easier, but it makes such graphics consistent and clear. Hadley Wickham, the creator of ggplot2, borrows this idea from the book “The Grammar of Graphics” by Wilkinson, Anand, and Grossman (2005)². While this structure may seem a bit artificial at first, it makes creating graphics very modular and building up complex graphics much easier.

Our first plot will be one of the simplest, which is a bar plot. Let’s plot the number of species by their Fruit_type.

ggplot(fruits) +
  geom_bar(aes(x = Fruit_type))

The above ggplot command has two pieces. The first is a call to ggplot with the name of the data table. This command by itself creates a blank canvas onto which we can plot using data from the data table. The second piece is to “add” a layer to this canvas with +, and that layer is a bar plot. The geom part is short for geometry since all the graphical elements of different plot types are made up of geometric pieces. In the argument to geom_bar, we give an “aesthetic mapping” with aes(x = Fruit_type), which says we want the x-axis to map to the Fruit_type column of our data. Finally, geom_bar builds bars automatically whose height is given by the number of rows in the data table with each x value.
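To check that geom_bar really counts rows, here is a small sketch on made-up data (the toy table and its fruit_type column are assumptions for illustration). It uses layer_data from ggplot2 to look at the values that were computed for the bars.

```r
library(ggplot2)

# Hypothetical toy data: one row per species.
toy <- data.frame(
  fruit_type = c("apocarpous", "apocarpous", "apocarpous",
                 "syncarpous", "syncarpous")
)

p_bar <- ggplot(toy) +
  geom_bar(aes(x = fruit_type))

# Counts computed by hand ...
table(toy$fruit_type)

# ... and the counts geom_bar() computed for the bar heights.
bar_data <- layer_data(p_bar)
bar_data$count
```

The two sets of counts should agree, which is why we never had to compute the bar heights ourselves.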

This plot is a bit boring but we can spice it up with another aesthetic. Let’s try the color of the bar, which is known as the “fill”. We’ll set the fill to the Cauliflory variable.

ggplot(fruits) +
  geom_bar(aes(x = Fruit_type, fill = Cauliflory))

ggplot is smart here and automatically adds a legend for the fill aesthetic since you otherwise wouldn’t know which color mapped to which value of Cauliflory. In fact, ggplot also automatically added the tick labels for the x-axis aesthetic for the same reason.

Grids of plots

Looking at our dataset, there are a few other fruit variables we can look at like Dehiscence and Moniliform. What if we wanted to see how the number of each Fruit_type varies as a function of different values of Dehiscence and Moniliform? One way to visualize this would be to replicate the bar plot above for each combination of Dehiscence and Moniliform. It turns out that ggplot lets us do this very easily by adding a facet_wrap layer onto our bar plot.

ggplot(fruits) +
  geom_bar(aes(x = Fruit_type, fill = Cauliflory)) +
  facet_wrap(vars(Dehiscence, Moniliform))

What can we observe from this?

Notice that the y-axis scales are all the same. This is on purpose and allows us to compare the data across the panels of the plot. However, we miss the variation in the panels with fewer observations. To allow the y-axis scales to adjust in each panel, we can add scales = "free_y" to facet_wrap.

ggplot(fruits) +
  geom_bar(aes(x = Fruit_type, fill = Cauliflory)) +
  facet_wrap(vars(Dehiscence, Moniliform), scales = "free_y")

Scatter plots

When do we use scatter plots? Scatter plots are great for looking at how variables are correlated to one another and for seeing the full distribution of the data since you typically plot every single point.

Let’s make our first basic scatter plot. Notice that we put the aes command as an argument to the ggplot command. This works just as well as putting it in the geom_point command except that we can add on additional geometries without having to repeat the aes command.

ggplot(fruits, aes(x=Fruit_length_avg, y=Fruit_width_avg)) + 
  geom_point()

Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_point()`).

What do we notice?

Now let’s add a regression (linear model) line through the points.

ggplot(fruits, aes(x = Fruit_length_avg, y = Fruit_width_avg)) + 
  geom_point() +
  geom_smooth(method="lm")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 24 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_point()`).

The geom_smooth function is what adds the linear regression line. Note here we have to add method="lm" to tell geom_smooth to add a straight line (i.e., linear); otherwise, it defaults to a more complex method that fits a smooth curve.

ggplot(fruits, aes(x = Fruit_length_avg, y = Fruit_width_avg)) + 
  geom_point() +
  geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 24 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_point()`).
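To see what method = "lm" is doing, here is a minimal sketch on made-up data (the toy table and its length and width columns are assumptions for illustration, not part of the fruits dataset). It fits the line directly with lm and checks that geom_smooth draws the same fitted values.

```r
library(ggplot2)

# Hypothetical toy data with a roughly linear relationship.
toy <- data.frame(
  length = c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5),
  width  = c(0.9, 1.2, 1.8, 2.1, 2.4, 3.1)
)

# Fit the straight line directly with lm().
fit <- lm(width ~ length, data = toy)
coef(fit)  # intercept and slope of the line

p_lm <- ggplot(toy, aes(x = length, y = width)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smooth; its y values are points along the fitted line.
smooth_data <- layer_data(p_lm, 2)
from_lm <- predict(fit, newdata = data.frame(length = smooth_data$x))
all.equal(unname(from_lm), smooth_data$y)
```

If you need the slope or intercept themselves, fitting the model with lm as above is the way to get them; geom_smooth only draws the line.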

By adding a color aesthetic, the points are colored based on the value of the variable we map to color. Here, let’s map Shrub to color and keep the linear regression.

fruits_plt <- ggplot(fruits, aes(x = Fruit_length_avg, y = Fruit_width_avg, color = Shrub)) + 
  geom_point() + 
  geom_smooth(method = lm, se = FALSE, fullrange = TRUE)
fruits_plt

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 24 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_point()`).

Are the groups different?

Volcano plots

Scatter plots can be used to visualize all kinds of data and are especially popular in genomics. For example, differential expression analyses using RNA-seq are very common and a “volcano plot” is often used to show which genes have increased and which have decreased expression in an experiment. We’ll use a dataset common in RNA-seq differential expression tutorials³, airway, which comes from an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone⁴. The data from airway can be loaded from the workshop GitHub site.

airway_de = read_csv("https://raw.githubusercontent.com/vancleve/data2design_uk/refs/heads/main/airway_de_transcript.csv")

Rows: 15926 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): .feature, albut, gene_id, gene_name, gene_biotype, seq_name, symbol
dbl (8): gene_seq_start, gene_seq_end, seq_strand, logFC, logCPM, F, PValue,...
lgl (4): entrezid, seq_coord_system, GRangesList, .abundant

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

airway_de

# A tibble: 15,926 × 19
   .feature        albut gene_id  gene_name entrezid gene_biotype gene_seq_start
   <chr>           <chr> <chr>    <chr>     <lgl>    <chr>                 <dbl>
 1 ENSG00000000003 untrt ENSG000… TSPAN6    NA       protein_cod…       99883667
 2 ENSG00000000419 untrt ENSG000… DPM1      NA       protein_cod…       49551404
 3 ENSG00000000457 untrt ENSG000… SCYL3     NA       protein_cod…      169818772
 4 ENSG00000000460 untrt ENSG000… C1orf112  NA       protein_cod…      169631245
 5 ENSG00000000971 untrt ENSG000… CFH       NA       protein_cod…      196621008
 6 ENSG00000001036 untrt ENSG000… FUCA2     NA       protein_cod…      143815948
 7 ENSG00000001084 untrt ENSG000… GCLC      NA       protein_cod…       53362139
 8 ENSG00000001167 untrt ENSG000… NFYA      NA       protein_cod…       41040684
 9 ENSG00000001460 untrt ENSG000… STPG1     NA       protein_cod…       24683489
10 ENSG00000001461 untrt ENSG000… NIPAL3    NA       protein_cod…       24742284
# ℹ 15,916 more rows
# ℹ 12 more variables: gene_seq_end <dbl>, seq_name <chr>, seq_strand <dbl>,
#   seq_coord_system <lgl>, symbol <chr>, GRangesList <lgl>, .abundant <lgl>,
#   logFC <dbl>, logCPM <dbl>, F <dbl>, PValue <dbl>, FDR <dbl>

Each row of the table is a gene whose expression was measured in the experiment, logFC is the log2 of the ratio of the expression in the treatment vs the control, PValue is the significance of the comparison, and FDR is the false discovery rate.
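The short sketch below works through what these transformed values mean, using made-up expression numbers (control and treatment are hypothetical values, not taken from the airway data).

```r
# Hypothetical expression values for one gene.
control   <- 10
treatment <- 40

# log2 fold change: 2 means a 4-fold increase, -2 a 4-fold decrease.
logFC <- log2(treatment / control)
logFC

# Converting back from logFC to the fold change itself.
2^logFC

# -log10 turns small p-values into large numbers, so the most
# significant genes end up at the top of a volcano plot.
-log10(c(0.05, 0.001, 1e-9))
```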

Plotting the volcano plot is as simple as plotting the logFC on the x-axis and the -log10(PValue) on the y-axis. We can color the points by whether they have an FDR < 0.05.

ggplot(airway_de, aes(x = logFC, y = -log10(PValue), colour = FDR < 0.05)) +
    geom_point()

What if we also want to color points by whether they have a large change in expression up or down? Since logFC is a log2 ratio, abs(logFC) >= 2 corresponds to at least a 4-fold change. We can do a little wrangling to create a new significance column and use that for the color.

airway_de_sig = airway_de |>
  mutate(significance =
           case_when(
             FDR < 0.05 & abs(logFC) >= 2 ~ "significant FDR and FC",
             FDR < 0.05 & abs(logFC) < 2 ~ "significant FDR",
             FDR >= 0.05 & abs(logFC) >= 2 ~ "significant FC",
             .default = "non-significant"
           ))


ggplot(airway_de_sig, aes(x = logFC, y = -log10(PValue), colour = significance)) +
  geom_point()
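To see how case_when assigns the categories, here is a sketch on a made-up table with one gene for each combination of the two criteria (the gene names and values are hypothetical). Because case_when uses the first condition that is TRUE, the later conditions in this version do not need to repeat the earlier checks.

```r
library(dplyr)

# Hypothetical genes covering each combination of the two criteria.
toy_de <- tibble(
  gene  = c("g1", "g2", "g3", "g4"),
  FDR   = c(0.01, 0.01, 0.50, 0.50),
  logFC = c(3.0, 0.5, -2.5, 0.1)
)

toy_sig <- toy_de |>
  mutate(significance =
           case_when(
             FDR < 0.05 & abs(logFC) >= 2 ~ "significant FDR and FC",
             FDR < 0.05                   ~ "significant FDR",
             abs(logFC) >= 2              ~ "significant FC",
             .default = "non-significant"
           ))
toy_sig
```

Checking a wrangling step on a tiny table like this is a quick way to confirm that every row lands in the category you expect before applying it to thousands of genes.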

Finally, what about labeling some of the genes with the most significant (i.e., lowest) p-values? To do this, we need a little more data wrangling and a new geometry called geom_text that places a label on top of a data point.

airway_de_sig_lbl = 
  airway_de_sig |>
  mutate(label = ifelse(PValue < 5e-9, gene_name, ""))
  
airway_plt = ggplot(airway_de_sig_lbl, aes(x = logFC, y = -log10(PValue))) +
  geom_point(aes(colour = significance)) +
  geom_text(aes(label = label))
airway_plt

This looks great except the labels are right on top of the points and they overlap. Maybe we also want to modify the background or add horizontal or vertical lines separating the regions? We can do all these things in Adobe Illustrator! Let’s save this plot now so we can load it into Illustrator later.

ggsave("airway_de_plot.pdf", airway_plt, width = 20, height = 12)
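Some of these adjustments can also be made in ggplot2 itself before exporting. The sketch below uses made-up data (the gene names and values are hypothetical) and shows one possible approach: the vjust and check_overlap arguments of geom_text for the labels, and geom_vline and geom_hline for guide lines. The thresholds are illustrative choices, not recommendations.

```r
library(ggplot2)

# Hypothetical toy results; the gene names are made up.
toy_de <- data.frame(
  logFC  = c(-3.1, -2.9, 0.2, 2.8, 3.0),
  PValue = c(1e-10, 2e-10, 0.4, 1e-9, 2e-9),
  label  = c("geneA", "geneB", "", "geneC", "geneD")
)

toy_plt <- ggplot(toy_de, aes(x = logFC, y = -log10(PValue))) +
  geom_point() +
  # vjust = -0.5 moves each label just above its point;
  # check_overlap = TRUE skips labels that would overlap earlier ones.
  geom_text(aes(label = label), vjust = -0.5, check_overlap = TRUE) +
  # Dashed guide lines at |logFC| = 2 and p = 0.05 (illustrative thresholds).
  geom_vline(xintercept = c(-2, 2), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed")
toy_plt
```

Note that check_overlap hides labels rather than moving them, so fine placement of every label may still be easier in Illustrator.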

Plotting distributions

Often we are interested in understanding how often observations of a certain size appear in our data. For example, how many species have fruit of a certain length? To answer questions like this, we need to visualize the distribution of the data.

Questions about distributions often come up when our variables are continuous or metric, which means that we can add and subtract values of the variable and create summaries of the variable like its mean value.

Histograms

One of the most common ways of visualizing a distribution is with a histogram, which groups observations into bins (or cajitas, Spanish for “little boxes”) to show us which kinds of observations are more frequent.

Let’s make a basic histogram!

ggplot(fruits, aes(x=Fruit_length_avg)) + 
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_bin()`).

Let’s interpret it!
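The message printed with the histogram suggests picking a binwidth. Here is a small sketch on made-up data (the toy table and its len column are assumptions for illustration) that bins the values by hand and then asks geom_histogram for the same bins, using its binwidth and boundary arguments.

```r
library(ggplot2)

# Hypothetical toy measurements.
toy <- data.frame(len = c(0.5, 1.2, 1.7, 2.3, 2.8, 2.9, 4.1))

# Bins by hand: intervals of width 1 starting at 0.
table(cut(toy$len, breaks = 0:5))

# The same bins in ggplot2: binwidth sets the width of each bin and
# boundary fixes where a bin edge falls.
p_bins <- ggplot(toy, aes(x = len)) +
  geom_histogram(binwidth = 1, boundary = 0)

# The counts ggplot2 computed for each bar.
bin_data <- layer_data(p_bins)
bin_data$count
```

Choosing the bin width yourself matters because the same data can look quite different with wide or narrow bins.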

Making histograms pretty and more useful

We saw how to change the fill of bars in a bar chart, but we can also change the color of the surrounding box. Here’s how:

p_histogram <- ggplot(fruits, aes(x = Fruit_length_avg)) + 
  geom_histogram(color = "darkblue", fill = "hotpink")
p_histogram

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 15 rows containing non-finite outside the scale range
(`stat_bin()`).

There are some pesky NAs that are causing the warnings above. What happens in the histogram if we remove them? For that we will use the function filter and save the filtered data into fruits_mod.

fruits_mod <- filter(fruits, !is.na(Fruit_length_avg))

p_histogram <- ggplot(fruits_mod, aes(x = Fruit_length_avg)) + 
  geom_histogram(color = "darkblue", fill = "hotpink")
p_histogram

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What if we want to separate the data by the type of plant? We can use the variable Shrub in the dataset, which tells us whether each species is a shrub or not a shrub.

p_histogram <- ggplot(fruits_mod, aes(x = Fruit_length_avg, color = Shrub)) + 
  geom_histogram(fill = "white")
p_histogram

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What if we want to add the mean of each group? We can first calculate the mean as shown below.

## Calculating the mean per group using dplyr
## we are naming the mean of each group mean_fl (as a short for mean fruit length)
mu <- fruits_mod |>
  filter(!is.na(Shrub)) |>
  group_by(Shrub) |>
  summarize(mean_fl = mean(Fruit_length_avg, na.rm = TRUE),
            .groups = 'drop')
mu

# A tibble: 2 × 2
  Shrub     mean_fl
  <chr>       <dbl>
1 not_shrub    1.62
2 shrub        1.56

Ok, so what is happening with the dimensions of mu? What do you notice?

dim(mu)

[1] 2 2
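The sketch below repeats the same group_by and summarize pattern on a tiny made-up table (toy, group, and value are hypothetical names) so that the change in dimensions is easy to follow by hand.

```r
library(dplyr)

# Hypothetical toy data: five measurements in two groups.
toy <- tibble(
  group = c("a", "a", "a", "b", "b"),
  value = c(1, 2, 3, 4, 6)
)

toy_means <- toy |>
  group_by(group) |>
  summarize(mean_value = mean(value), .groups = "drop")

toy_means       # one row per group: the means are 2 and 5
dim(toy)        # 5 rows, 2 columns
dim(toy_means)  # 2 rows, 2 columns
```

summarize collapses each group to a single row, which is why the summary table has as many rows as there are groups.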

Okay, so now we can finally add the group means to the histogram as dashed vertical lines.

p_histogram <- ggplot(fruits_mod, aes(x = Fruit_length_avg, color = Shrub)) +
  geom_histogram(fill = "white", position = "dodge") +
  geom_vline(data = mu, aes(xintercept = mean_fl, color = Shrub),
             linetype = "dashed") +
  theme(legend.position = "top")
p_histogram

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What if I want my own colors? Let’s check some cool options (aka coolors).

p_histogram <- p_histogram + 
  scale_color_manual(values = c("hotpink", "#56B4E9")) # one color per Shrub value, in alphabetical order (not_shrub, then shrub)
p_histogram

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
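If you would rather not rely on the order of the values, scale_color_manual also accepts a named vector. The sketch below uses made-up data (the toy table is an assumption for illustration) and deliberately lists the colors in the "wrong" order to show that the names, not the positions, decide the mapping.

```r
library(ggplot2)

# Hypothetical toy data with the same two Shrub values as the fruits table.
toy <- data.frame(
  x     = c(1, 2, 3, 4),
  y     = c(2, 1, 4, 3),
  Shrub = c("shrub", "not_shrub", "shrub", "not_shrub")
)

p_named <- ggplot(toy, aes(x = x, y = y, color = Shrub)) +
  geom_point() +
  # Names tie each color to a value, whatever order they are written in.
  scale_color_manual(values = c(shrub = "#56B4E9", not_shrub = "hotpink"))

# Inspect which color each group received
# (groups are numbered alphabetically: 1 = not_shrub, 2 = shrub).
color_data <- layer_data(p_named)
unique(color_data[, c("group", "colour")])
```

Named colors keep the mapping stable even if a group is later filtered out of the data.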

But I want to make it publication quality! Let’s make nice labels on the axes and clean the background

p_histogram <- p_histogram + xlab("Average Fruit Length")
p_histogram

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p_histogram <- p_histogram + theme_classic()
p_histogram

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Ok, what about a histogram for our airway differential gene expression data? Let’s take a look at the distribution of p-values in the data.

ggplot(airway_de) +
  geom_histogram(aes(x = PValue), fill = "gray", color = "black") +
  theme_classic()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What should this distribution look like under the null hypothesis that there is no effect of the drug on gene expression?

Density plots

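What are density plots? They are smoothed versions of histograms. The total area under the curve is 1, so the area under the curve between two values estimates the proportion of observations in that range. Later this will help us reason about probability!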
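As a rough check on the idea of area under the curve, the sketch below estimates a density for simulated data with the base R function density and adds up the area with the trapezoid rule. The simulated values and the helper trapezoid_area are assumptions made for illustration.

```r
# Simulated measurements (made up for this example).
set.seed(1)
x <- rnorm(200, mean = 1.6, sd = 0.5)

# Kernel density estimate evaluated on a grid of x values.
d <- density(x)

# Hypothetical helper: area under a curve by the trapezoid rule.
trapezoid_area <- function(xs, ys) {
  sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)
}

# The total area under the density curve is very close to 1.
total_area <- trapezoid_area(d$x, d$y)
total_area

# The area between 1 and 2 approximates the share of values in that range.
in_range <- d$x >= 1 & d$x <= 2
trapezoid_area(d$x[in_range], d$y[in_range])
mean(x >= 1 & x <= 2)
```

The last two numbers will not match exactly because the density is a smoothed estimate, but they should be close.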

Let’s make a simple one!

p_density <- ggplot(fruits_mod, aes(x = Fruit_length_avg)) + 
  geom_density()
p_density

Let’s give it some pretty colors and add some transparency or reduce the opacity (called alpha).

p_density <- ggplot(fruits_mod, aes(x = Fruit_length_avg)) +
  geom_density(color = "darkblue", fill = "lightblue", alpha = 0.5)
p_density

We will add the mean.

mean.fruit <- mean(fruits_mod$Fruit_length_avg, na.rm = TRUE)
mean.fruit

[1] 1.596013

p_density <- p_density + 
  geom_vline(aes(xintercept = mean.fruit),
              color = "blue", linetype = "dashed", linewidth = 1)
p_density

Change density plot line colors by groups

p_density <- ggplot(fruits_mod, aes(x = Fruit_length_avg, color = Shrub)) +
  geom_density() +
  geom_vline(data = mu, aes(xintercept = mean_fl, color = Shrub),
             linetype = "dashed")
p_density

# Fill them in 
p_density <- ggplot(fruits_mod, aes(x = Fruit_length_avg, fill = Shrub)) + 
  geom_density(alpha = 0.4) + 
  geom_vline(data = mu, aes(xintercept = mean_fl, color = Shrub), linetype = "dashed")
p_density

Adding my color scheme

p_density <- p_density + 
  scale_fill_manual(values = c("hotpink", "#56B4E9"))
p_density

Making it publication style

p_density <- p_density + 
  xlab("Average Fruit Length") + 
  theme_classic()
p_density

Let’s now do a density plot for the log fold change for the gene expression data, looking at the distributions for FDR < 0.05 and FDR >= 0.05. What do we see?

ggplot(airway_de) +
  geom_density(aes(x = logFC, fill = FDR < 0.05), alpha = 0.4) +
  scale_fill_manual(values = c("hotpink", "#56B4E9")) +
  theme_classic()

Histogram with density plot

p_histdensity <- ggplot(fruits_mod, aes(x = Fruit_length_avg)) + 
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white")+
  geom_density(alpha = .2, fill = "hotpink") 
p_histdensity

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
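Mapping y to after_stat(density) rescales the histogram so that the bar areas add up to 1, which puts it on the same scale as the density curve. The sketch below checks this on made-up data (the toy table and its len column are assumptions for illustration).

```r
library(ggplot2)

# Hypothetical toy measurements.
toy <- data.frame(len = c(0.5, 1.2, 1.7, 2.3, 2.8, 2.9, 4.1))

p_dens_hist <- ggplot(toy, aes(x = len)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, boundary = 0)

# Area of each bar = height (density) times width; the areas sum to 1.
hist_data <- layer_data(p_dens_hist)
bar_area <- sum(hist_data$density * (hist_data$xmax - hist_data$xmin))
bar_area
```

Without this rescaling the histogram would show raw counts and the density curve would look flat next to it.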

Making Box and Whisker plots

What are box and whisker plots? What are quartiles and the interquartile range?

ggplot(data = fruits_mod, aes(x = Fruit_length_avg)) + 
  geom_boxplot()
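The numbers behind a box and whisker plot can be computed by hand. The sketch below uses a made-up vector and the common rule, also the default in geom_boxplot, that whiskers reach the most extreme values within 1.5 times the interquartile range of the box; anything beyond is drawn as an individual point.

```r
# Made-up values with one unusually large observation.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 20)

# Quartiles: the box spans the 25% and 75% values, with a line at the median.
q <- quantile(x, c(0.25, 0.5, 0.75))
q

# Interquartile range: the height of the box.
iqr <- IQR(x)
iqr

# Values farther than 1.5 * IQR from the box are drawn as separate points.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]
outliers

# The whiskers end at the most extreme values inside the fences.
whiskers <- range(x[x >= lower_fence & x <= upper_fence])
whiskers
```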

How do we make it vertical???

p_boxplot <- ggplot(data = fruits_mod, aes(y = Fruit_length_avg)) + 
  geom_boxplot()
p_boxplot

How do we make box and whisker plots by group?

p_boxplot <- ggplot(fruits_mod, aes(y = Fruit_length_avg, color = Shrub)) +
  geom_boxplot()
p_boxplot

How do we change the color?

p_boxplot <- ggplot(fruits_mod, aes(x = Shrub, y = Fruit_length_avg, color = Shrub)) +
  geom_boxplot() + 
  scale_color_manual(values = c("hotpink", "#56B4E9"))
p_boxplot

Notice that the x-axis tick labels updated, great! But now the legend is redundant, so let’s turn it off.

p_boxplot <- ggplot(fruits_mod, aes(x = Shrub, y = Fruit_length_avg, color = Shrub)) +
  geom_boxplot(show.legend = FALSE) + 
  scale_color_manual(values = c("hotpink", "#56B4E9"))
p_boxplot

How do we fill them in?

p2 <- ggplot(fruits_mod, aes(x = Shrub, y = Fruit_length_avg, fill = Shrub)) +
  geom_boxplot(show.legend = FALSE)

p2 + scale_fill_manual(values = c("hotpink", "#56B4E9"))

How do we add the mean?

p_boxplot <- p_boxplot + 
  geom_point(data = mu, aes(x = Shrub, y = mean_fl), shape = 23, size = 4, show.legend = FALSE)
p_boxplot

Adding good labels

p_boxplot <- p_boxplot + 
  xlab("Shrub status") + 
  ylab("Average fruit length")
p_boxplot

How do we add all the samples?

p_boxplot <- p_boxplot + 
  geom_jitter(shape = 16, position = position_jitter(0.2), show.legend = FALSE)
p_boxplot

All pretty and ready for publication

p_boxplot <- p_boxplot + theme_classic()
p_boxplot

Box plots are useful for all kinds of data; let’s create a boxplot for the differential gene expression data that looks at the log fold change as a function of the FDR.

ggplot(airway_de) +
  geom_boxplot(aes(y = logFC, x = FDR < 0.05, color = FDR < 0.05), alpha = 0.4, show.legend = FALSE) +
  scale_color_manual(values = c("hotpink", "#56B4E9")) +
  theme_classic()

Finally, let’s save two of our plots for modification in Adobe Illustrator.

plts = plot_grid(fruits_plt, airway_plt) # plot_grid comes from library(cowplot)
plts

save_plot("Fruit_length_avg_v_airway_de_sig_lbl.pdf", plts, ncol = 2, base_width = 6, base_height = 5) # save_plot comes from library(cowplot)

Further information

There are some recent books on data science and visualization (all written in RMarkdown, which is a predecessor and alternative to Quarto) that cover much of the material in the course.

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science (2e). O’Reilly. https://r4ds.hadley.nz/
Wilke, Claus O. 2018. Fundamentals of Data Visualization. https://clauswilke.com/dataviz/
Healy, Kieran. 2018. Data Visualization: A Practical Introduction. http://socviz.co/
Ismay, Chester and Kim, Albert Y. 2018. An Introduction to Statistical and Data Sciences via R. https://moderndive.com/
Silge, Julia and Robinson, David. 2018. Text Mining with R: A Tidy Approach. https://www.tidytextmining.com/

If you want to become an R wizard in the style of Hadley Wickham, this book is for you.

Wickham, Hadley. 2019. Advanced R. https://adv-r.hadley.nz/

Footnotes

¹ Onstein, R. E., W. D. Kissling, L. W. Chatrou, T. L. P. Couvreur, H. Morlon, and H. Sauquet. 2019. Which frugivory-related traits facilitated historical long-distance dispersal in the custard apple family (Annonaceae)? Journal of Biogeography 46:1874–1888.
² Wilkinson, L. 2005. The grammar of graphics. Statistics and computing. Springer New York.
³ https://stemangiola.github.io/rpharma2020_tidytranscriptomics/articles/tidytranscriptomics.html
⁴ Himes, B. E., X. Jiang, P. Wagner, R. Hu, Q. Wang, B. Klanderman, R. M. Whitaker, et al. 2014. RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells. PLOS ONE 9:e99625.