Intro to R: Data types, flow control, and functions

Author

Jeremy Van Cleve

Published

August 29, 2022

Outline for today

  • Objects and functions
  • Statements and style
  • Types of objects
  • Flow control
  • Loops
  • More on functions

Obejcts are functions and functions are objects

Everything in R is either a function or an object (actually functions are objects too!). For example numbers are objects,

a_number = 10

strings of characters, AKA “strings”, are objects,

a_string = "hello world\n"

and even functions, like print, which prints a string, is an object,

a_function = print
a_function("hello world")
[1] "hello world"

What is an object then?

An object is data stored in memory with a name (or identifier) that can have attributes and functions (often called methods) that do specific things to objects with specific attributes.

For example, the names attribute comes up a lot and can provide a convenient way to access an element of a vector:

vector_with_named_elements = c(a=1, b=2)
attributes(vector_with_named_elements)
$names
[1] "a" "b"

What is a function then?

A function (or subroutine or method or procedure) is a set of instructions packaged as a unit. It may or may not take input and may or may not produce output.

For example, this function takes a number as input, calculates it square, gives us the answer as output:

num_sqrd = function(n) {
  n^2
}

num_sqrd(2)
[1] 4

Statements and style

Assignment statements

To create an object, you perform an “assignment statement” where the name of the object is on the left, the value of the object is on the right, and in the middle is the assignment operator, =,

an_object = "a string object"

or <-

an_object <- "a string object"

Either <- or = works as the assignment operator though most R aficionados will use <-. However, I prefer (strongly) the = operator since that’s what many other languages use. The ones that don’t use = often use :=. It also takes fewer keystrokes (1 vs 3) to type = compared to <-. I think <- looks funny too. I could keep going…

Style of statements

  • Every command in R must end in a “newline” (what you get when you hit “return”) or semicolon. Newlines are the norm in most R code and I will use them too.

  • If an R statement is incomplete and you enter a newline, R will let you continue the statement:

new_statement = print("this continues on the
next line")
[1] "this continues on the\nnext line"
  • Use plenty of space in your statements to help readability.

    • This is more readable

      a = (100 + 3) - 2
      mean(c(a / 100, 642564624.34))
      [1] 321282313
    • than this

      a=(100+3)-2
      mean(c(a/100,642564624.34))
      [1] 321282313
  • Use “tabs” to set off blocks of code like loops and function definition. Good coding style will make these blocks look obvious since they are indented and spaced nicely. This makes understanding which parts of the code are run when much easier.

  • The “tidyverse” (R packages for data science) has a style guide: https://style.tidyverse.org/ that produces consistent and easy to read code.

Naming conventions

Now that you can create objects using statements, you have to know how to name them. R has some simple rules for this:

A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as “.2way” are not valid, and neither are the reserved words.

This information comes from the help for the make.names command, which you can get type typing ?makes.names where “?” before a function gives you the help page:

?make.names

Then, you convert any string into a valid R variable name using the make.names command:

not_valid_name = "1 way to not be a valid name"
valid_name = make.names(not_valid_name)
print(valid_name)
[1] "X1.way.to.not.be.a.valid.name"

Even when your variable names are valid, knowing the rules for naming is important for the names attribute of objects like vectors and data.frames. When you read your data from a file into R, the column names must be “valid” or enclosed within backticks (``). Some of the nicer functions will use the backticks (e.g., read_csv from tidyverse) so that the name is preserved while others will convert the names into valid ones (e.g., read.csv in the utils package).

A couple tips for naming objects:

  • Err on the side of using longer object names that describe what that object stores: e.g., populationSize is often better than N. Note that R itself doesn’t always obey this rule; e.g., the function c combines values. Never name a function a single letter.

  • Maintain a convention for using multiple words in object names. E.g., underscores or periods for spaces or capitalize every word. The tidyverse style guide suggests:

    Variable and function names should use only lowercase letters, numbers, and _. Use underscores (_) (so called snake case) to separate words within a name.

  • Remember that R is case sensitive, so populationSize is a different object from populationsize.

  • Never use object names that differ only by upper or lower case; this will almost certainly lead to user created bugs.

Types of objects

Scalars

Scalars are the simplest object in R. They hold a single value like a number (called “numeric” by R) or a string (called “character” by R)

number = 3.1459
string = "hello again world"

How do you know these objects are the types you think they are? You can check with the function class, which gives you the value of the object’s class attribute:

class(number)
[1] "numeric"
class(string)
[1] "character"

You can ask for a little more by getting the “structure” of the object using the function str (by the way, this function is another example of a name that is too short; strike two R).

str(number)
 num 3.15
str(string)
 chr "hello again world"

These commands may not be super useful now, but R has enough different types of object that you’ll want to be able to ask an object about itself at some point.

Vectors

Vectors are everywhere in R. They are simply lists of objects of a common type, like numeric or character. Actually, our scalars were just vectors with a single element. To create a vector, use the very poorly named c or combine function.

number_vector = c(1,10,100,1000)
number_vector
[1]    1   10  100 1000
string_vector = c("hello", "world", "for", "yet", "another", "time")
string_vector
[1] "hello"   "world"   "for"     "yet"     "another" "time"   

To find out the length of the vectors you just created, use the function length

length(number_vector)
[1] 4
length(string_vector)
[1] 6

Accessing elements of a vector is accomplishing using “indexing”. Vectors are index from 1 to the number of elements in the vector. To access an element, use brackets after the object name; printing the second element of string_vector looks like this:

print(string_vector[2])
[1] "world"

Lists

Suppose that you want a vector that mixes numbers and strings. You could try

mixed = c(1,2,3,"one","two","three")

but looking at the result

mixed
[1] "1"     "2"     "3"     "one"   "two"   "three"

you will find that R converted the numbers to strings. That’s because vectors only contain a single object type.

Lists however can contain multiple types. Thus, using the list function works:

mixed = list(1,2,3,"one","two","three")
str(mixed)
List of 6
 $ : num 1
 $ : num 2
 $ : num 3
 $ : chr "one"
 $ : chr "two"
 $ : chr "three"

You can index them just like a vector too, though they require double brackets (more on this later):

mixed[[1]]
[1] 1
mixed[[4]]
[1] "one"

Flow control and logical tests

Flow control is a technical term for telling a program which statements or blocks of code to execute. In R, you use “if-else” statements to accomplish this. For example, you can print a statement only if a variable takes a certain value:

test = -10
if (test > 0) {
  print("test is greater than zero")
} else {
  print ("test is not greater than zero")
}
[1] "test is not greater than zero"

The curly brackets enclose the block of statements that get executed if the condition is true or not. Note that the else is on the same line as the final curly bracket of the “if”. The “>” (greater than) symbol is an example of a relational operator; others include equals, “==”, not equals, “!=”, and “greater than or equal”, “>=” (likewise, less than operators exist too).

The outcome of a comparison with a relational operator is a “TRUE” or “FALSE” value, which are “Booleans”. You can combine TRUE and FALSE values with Boolean operators like “&&” (“and”), “||” (“or”), and “!” (“not”). For example, you can check our whether our test variable is greater than zero or less than -1:

test = -10
if (test > 0 || test < -1) {
  print("test is greater than zero or less than -1")
} else {
  print ("test is between zero and -1")
}
[1] "test is greater than zero or less than -1"

Loops

Loops are ways to tell R to perform a block of commands repeatedly. The most useful kind of loop is the “for” loop. First, you create an empty vector. Then, you use the for loop to fill its elements.

vec = c()
for (i in 1:100) {
  vec[i] = i*i
}
str(vec) 
 int [1:100] 1 4 9 16 25 36 49 64 81 100 ...

The structure of the for loop is for (index.object in index.values) { code block }. The first run of the loop sets index.object to the first element of index.value and executes the code block. The next run sets the index.object to the second value of index.value, executes the code block, and so on until the loop has run as many times as there are elements in index.value.

Note that the above loops shows you something else about vectors. You can build them dynamically by just assigning values to their elements. This is a very convenient thing that R lets you do, but many other languages will not (and sometimes for good reasons).

More on functions

Functions are important so we will discuss them in more detail. Functions take “arguments” between the parentheses and these arguments are the data you want the function to use. For example, the runif function generates uniformly distributed random numbers. There are three arguments to the function:

str(runif)
function (n, min = 0, max = 1)  

The first argument is how many numbers we want, the second is the minimum value for them, and the third is the maximum value. Thus, to create 10 random numbers between -1 and 1, you do this:

print(runif(10, -1, 1))
 [1] -0.45665584 -0.81272271  0.09909869 -0.61793683 -0.30278301  0.10201705
 [7] -0.60862903 -0.32128506 -0.66764281  0.50995666

In RStudio, a handy way to see what functions are available or to jog your memory about a function’s name is to begin typing the name and then hit “tab” when you have a few characters. RStudio will show you a list of possible functions and if you hit “tab” again it will complete your typing with the highlighted name. If you’re typing inside the parenthesis, hitting “tab” will help remind you what the arguments to the function are.

In R, the arguments to functions often have “names”. If they do, you can give the argument with it’s name:

print(runif(min = -1, max = 1, n = 10))
 [1] -0.65297191  0.11482369  0.77288002  0.90163225 -0.64160696  0.04281076
 [7] -0.70096232 -0.30479331  0.14169347  0.27863098

Notice that giving the name of the argument allows us to give the arguments in any order. If an arguments doesn’t have a name, then you must give the arguments in exactly the order they are listed in the definition of the function (see the help for a function with “?name_of_function”).

One of the most powerful parts of programming in any language is writing your own functions. To do this, you start with a name for the function and assign the object to function(arg1, arg2, ...) { function body }. This is called a function definition. For example, the function

runit = function(n = 1) {
  return(runif(n, max = 1, min = 0))
}

generates a random number in the unit interval (0,1) where the number of values you want is the only argument. We can exclude that argument since it has a “default” value of 1 or give the argument explicitly, with or without a name:

print(runit())
[1] 0.1491875
print(runit(n = 2))
[1] 0.4800240 0.6613675
print(runit(5))
[1] 0.32569603 0.07472495 0.52346383 0.99809008 0.22544366

Finally, the function returns whatever is in the “return” function or the last value in the function body. Thus, we can easily capture our random numbers or the output from any function:

rvals = runit(5)
print(rvals)
[1] 0.2004879 0.1008170 0.7687202 0.7293823 0.1136611

Lab

Write a function that plots a date slice of COVID-19 cases for any one state.

In order to do this, we’ll use functions from the tidyverse set of packages. The relevant syntax will be given here and we’ll learn how to operate these functions more generally soon.

  • Use the following code to import the tidyverse packages and load the COVID-19 data from the New York Times. This code should go above the function definition since its something you only need to run once per R session (whereas your function could run many times). I’ve added code to pull out just the cumulative number of cases, convert them into daily cases, and put the states into their own columns, which is easier for this exercise.
library(tidyverse)

us_cases = read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv") %>%
  pivot_wider(date, names_from = state, values_from = cases) %>%
  mutate(across(where(is.numeric), ~ .x - lag(.x)))
  • Try plotting the cases for Kentucky.
plot(us_cases[["date"]], us_cases[["Kentucky"]], type='l')

  • Write a function that has the following four arguments and plots the us_cases data between two dates:
    1. the data you want to plot
    2. a beginning date
    3. an ending date
    4. the state you want to plot (e.g, Kentucky)
  • This function should plot the data only between the beginning and ending date. Thanks to the lubridate and dplyr packages in tidyverse, this isn’t too hard. You can use the following code to get a “slice” of the date between two dates (e.g., “2018-01-01” and “2018-08-25”):
my_slice = filter(us_cases, date > "2020-10-01" & date < "2021-08-25")

Plotting this slice shows that it works.

plot(my_slice[["date"]], my_slice[["Kentucky"]], type='l')

  • Finally, the function should also plot a horizontal line located at the mean value of the cases in the interval. For example, if your function was called with the us_cases date on the dates “2020-10-01” and “2021-08-25” for Kentucky cases, which have a mean in that interval of 1454.682, then your plot should look like this (the code below shows you how to add horizontal lines to existing plots):
plot(my_slice[["date"]], my_slice[["Kentucky"]], type='l')
abline(h=1454.682, col='red')