install.packages("tidyverse")
BIO 540: Data Wrangling and Visualization
- I was an R newbie too!
- Go over syllabus
- Install R and RStudio
- What is R?
- What are Markdown and R Markdown?
Syllabus
- The syllabus will be located here: https://github.com/vancleve/BIO540-DWV.
- This is also where R and R Markdown / Quarto documents will be located. The only files that will be kept on the Canvas site are the PDF copies of the reference books and the completed labs and figures that you submit.
- I’ll update the syllabus and schedule of topics as necessary.
Using R, RStudio, and Quarto
Local installation
We will use R, RStudio, and Quarto via a server and will not have to install them locally on our computers. However, if you would like to do so, here are some brief instructions.
First step, download R, RStudio, and Quarto - To download R, go here: https://cran.r-project.org/. - To download RStudio, go here: https://posit.co/download/rstudio-desktop/ - To download Quarto, go here: https://quarto.org/docs/get-started/.
If you need directions, this may help: https://quarto.org/docs/get-started/hello/rstudio.html. If you already have R and need to update your packages, go here: https://stat545.com/install.html (these are a little dated now since the release of Quarto, but should still be helpful).
Server-based R, RStudio, and Quarto via Posit Cloud
We have access to R, RStudio, and Quarto via the Posit Cloud. Go to the Canvas site and find the Announcement where I welcome you to the course. There will be a link there to the course content on the Posit Cloud. You should then arrive at a page that looks like this,
If you already have a Posit Cloud account, then login with that. Otherwise, select “Sign Up” and create an account. Once you’ve created your account, you may have to click on the link to the course content again. Then, you should be able to join the workspace and see this:
Finally, you can create a new project based on the files in the course github repository by selecting “New Project” -> “New Project From Template” and then “BIO540-DWV” as seen below.
You can rename your project “Week 1”. For each week, you should repeat this process of creating a new project from the template and then giving the project the requisite week name.
Running RStudio
What is RStudio?
RStudio is an Integrated Development Environment (IDE) for the R programming language.
While you can write R code in any text editor you like and then run that code with the R interpreter, there are many things that an IDE can do that help you be more efficient when programming.
- Syntax highlighting
- New visual editor (WYSIWYM editing: https://en.wikipedia.org/wiki/WYSIWYM)
- File/project organization (see “Files” pane)
- Examining variables that you’ve set (see the “Environment” pane)
- Easily execute code and examine its text output (see “Console”) or graphical output (see “Viewer”)
- View help files (see “Help”)
- Installing/updating R packages (see “Tools” menu)
- Debugging (see “Debug” menu)
- Projects (see “File” and “Tools”)
- Version control (see “Tools” menu)
When using RStudio, I encourage you to:
Play around with different arrangements for the windows/panes
LEARN KEYBOARD SHORTCUTS. They can make you much, much more efficient.
- Shift+Alt+K (MAC/PC): Keyboard shortcut quick reference.
- Shift+Command/Control+K (MAC/PC): Render (or “knit”) current document (i.e., turn in HTML/DOCX/PDF)
- Command/Control+Enter (MAC/PC): Run selected lines
- Shift+Command/Control+Enter: Run current “chunk” (R Notebook only)
- etc
Use a Project for all assignments in the course (save them in a single directory or its subdirectories).
Course files as a project
The lecture notes and problems that are stored on the GitHub site and are already downloaded as part of the template.
Updating the BIO540-DWV project from the GitHub repository
If there is additional content uploaded to the GitHub repository that was not in the project when you created it from the template, then you can update the files by going to “Tools”->“Version Control”->“Pull Branches”.
Tidyverse
Many of the packages that packages that we will use have been helpfully collected together into a metapackage called tidyverse
. Lucky for us, this is already installed in the project template. If we needed to install it, we’d type the following at the console:
RStudio cheatsheet
For a cheatsheet on RStudio, go here:
https://github.com/rstudio/cheatsheets/raw/main/rstudio-ide.pdf
What is R?
First, there was “S”.
- S was designed by John Chambers and others at Bell Labs in the 1970s specifically for data analysis and statistics.
- R was developed in 1991 by Ross Ihaka and Robert Gentleman and re-implemented much of S after S became licensed software.
Then there was “R”
- R was open-sourced in 1995 and John Chambers and other statisticians are part of its core development team.
- R is object-oriented (i.e., you can build containers of variables and the like), like Java.
- R is interpreted (i.e., the interpreter turns interprets your text code immediately and runs it), like Python.
- R has a tonne of packages for statistics and biology.
- More on the basics of R next time.
What are Markdown, R Markdown, and Quarto?
Markdown: a plain text (i.e., just characters) way of “formatting” text
- Markdown is a kind of “markup” language (e.g., HTML).
- Designed for simplicity and readability
- No need to “view” Markdown to read it easily
- Often used in code documentation
- Increasingly used in full document preparation (journal article, books, etc)
Ok, let’s get Markdown!
First, this whole document is Markdown, so you can quickly see examples of the following:
Headings are created with hash symbols
#
#
is the “first” heading##
is the “second” heading, which is smaller (how much smaller? set by the “stylesheet”)- etc
Italic text in encapsulated by one * or asterisk, Bold text by two asterisks. Code can be added in line with backticks,
code
.Lists can start with -, *, or +. (Quarto requires a blank line before the list starts)
Sublists are indented at minimum to line up with the content of the enclosing list. Use with 4 spaces (tab twice in RStudio) for syntax highlighting.
Numbered lists can be used too.
- This is the first element.
- This is the second.
Numbered lists can be automatically numbered by using
(@)
as the list marker. In-fact, these lists will be numbered continuously across multiple lists- This is the eleventh element since it continues the
(@)
from above. - This is the twelfth element.
- This is the eleventh element since it continues the
New paragraphs are separated by blank lines.
Line breaks, without a new paragraph, need two spaces
in order to be recognized. These can be used in lists too.Links use angle brackets <>: https://www.r-project.org/
Links with different text use
[text](http://link.to.something)
Images use
![image caption](path.to.image)
. Add{width=50%}
after the(path.to.image)
to have the image use 50% of its possible size.E.g., here is an elephant that only takes up 25% of the space.
Tables can be created too:
Right Left Default Center 12 12 12 12 123 123 123 123 1 1 1 1 For more details, see the Quarto docs. Quarto uses a “flavor” of Markdown created for Pandoc, which is the tool Quarto uses to convert Markdown into all the different document types (e.g., PDF, DOCX, HTML, etc).
R Markdown and Quarto: mixing text (Markdown) and R code (R)
Ok, lets put some R in this thing.
Make a code “chunk”” with three back ticks followed by an r in braces. End the chunk with three back ticks:
paste("Hello", "World!")
[1] "Hello World!"
Place code inline with a single back ticks. The first back tick must be followed by an R, like this: Hello World!.
You can control how the chunks of R code work in the rendered document by adding options like #| echo: false
to hide the code when you create the HTML
[1] 0.30473206 0.41902696 0.70735310 0.61861072 0.01619336 0.67893882
[7] 0.75059961 0.64710759 0.40688601 0.28575588
or you can add the option eval: FALSE
so that the code isn’t evaluated (see the install.packages
lines in this .qmd
)
You can do plots too. Let’s do one with a caption (“This is a cool_plot”, which you can see Quarto output, which is useful for debugging) that is just a bunch (1000) of normally distributed numbers (mean=0, std=1):
plot(rnorm(1000, 0, 1))
Now, lets do something a little more interesting. We can pull in data from all kinds of places including websites and actively updated databases. For example, here are COVID-19 death numbers for the United States for Kentucky and Tennessee from the CDC:
library(tidyverse)
library(RSocrata)
# https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-by-Week-Ending-D/r8kw-7aab/about_data
= read.socrata("https://data.cdc.gov/api/odata/v4/r8kw-7aab")
us
|>
us filter(state == "Kentucky" | state == "Tennessee") |>
filter(group == "By Week") |>
drop_na(covid_19_deaths) |>
ggplot(aes(x=week_ending_date, y=covid_19_deaths, color=state)) +
geom_line() +
scale_color_manual(values=c("#0033A0", "#FF8200")) +
labs(x="Date", y="COVID-19 Weekly Deaths", color="State") +
theme_classic()
More about using R with Markdown via Quarto can be found in the here: https://quarto.org/docs/computations/r.html.
Why Quarto?
You may be asking yourself how this plain text gets mixed or “knitted” together with the R code and output and converted in HTML (or another document format). This is where Quarto comes in. Quarto is also a program that reads the .qmd
first. It looks for R code blocks, runs them, and then takes that output and knits it together into a Markdown document. Quarto gives that Markdown document to Pandoc, which converts it to the output format of choice (e.g., HTML). This knitting process used to be done by an R package called rmarkdown
, which is actually still used by Quarto. R Markdown however was focused on R whereas Quarto can be used with other languages such as Python and Julia. Hence, learning how to interact with Quarto can be helpful even if you need to switch to compiling documents with calculations done using those languages. In fact, you can use Quarto to build a document that uses each of those languages in different section if that is useful.
Reproducible Research
Lab
Create your own Quarto Markdown document (“File”->“New File”->“Quarto Document”)
Give the document a title, author, and date (see here).
Write a paragraph about what kind of data you are thinking of analyzing and visualizing. Include an image with a caption to help with the description (you may need to upload the image to via the “Upload” button in the “Files” pane).
Create a list of your top 5 favorite TV shows.
Take the list and create a table that include the following columns: show name, year it premiered, country it premiered in, your ranking
As you do the above, use some Markdown elements to structure your document. Include the following:
- headings
- a link
- some bold or italic text
Make sure you can get the
.qmd
file to render properly in HTML in RStudio. (Select the “Preview in Viewer Pane” option under the gear icon for seeing the HTML)Download your files from the server and upload them to Canvas under Week 1. To this, follow the instructions below.
- If you didn’t download any images (e.g., only linked to images on the web), select your
qmd
file in the “Files” pane, select the gear icon “More”, and select “Export”. Then upload theqmd
file to Canvas. - If you did download an image to the server, you’ll need to select the image and the
qmd
file in the “Files” pane, select the gear icon “More”, and select “Export”. This will download a zip file. Upload that zip file to canvas.
- If you didn’t download any images (e.g., only linked to images on the web), select your