Text manipulation: regular expressions

Author

Jeremy Van Cleve

Published

October 8, 2024

Reminders

Think about data to use and plots to make for your project talk and report!
E-mail me if you want to discuss your ideas.

Outline for today

Text basics
Basic string manipulation using stringr
Regular expressions: basics
Regex: character classes
Regex: quantifiers
Regex: grouping
Regex: replacing matches
Regex: practice!

Text basics

Working with text and strings can feel like the most boring and tedious aspect of data analysis and wrangling (and it often can be!), but it is important and sometimes the most crucial element. Not only are data sets often untidy, but they are also often full of quirks related to how the data were initially entered. For example, categorical variables, like a country name, might be mixed with numerical variables, such as dates or population sizes, and these must then be separated intelligently in order to be further analyzed in R. String analysis tools are crucial for this task. The first set of tools you will use includes functions from the stringr package. The second tool, more difficult but arguably the most crucial, is the “regular expression”, which is like a Swiss Army knife for finding patterns in text.

`stringr`

R comes with some base functions for manipulating text, but these can be inconsistent in terms of their arguments and their outputs. As is normal for material in this course, Hadley Wickham comes to the rescue with a package, stringr (loaded when you load tidyverse), that provides some basic functions for manipulating strings and for using regular expressions. All the functions start with str_, which makes seeing a list of them easy using the “tab complete” functionality in RStudio. While it is certainly useful to learn the base functions for string manipulation, the hope is that learning a simpler framework with stringr can make the concepts easier so that if you need to understand how to work with one of the base functions, the task will be easier.

Encoding

Computers store information in bits, and while it can seem intuitive to convert bits to numbers (go from base 2 to base 10), converting bits to text requires some arbitrary choices. How many characters do you encode? Which special characters (e.g., ?, !, ^, etc.) do you include beyond letters and digits? Once you settle on which characters to include, you can simply list them in a table with the corresponding binary value that represents them. Initially, since English speakers were the dominant users of computers, the standard table, or “encoding”, was ASCII (for “American Standard Code for Information Interchange”), which encoded only 128 characters (CS challenge: how many bits do you need to encode 128 characters?).

Since 128 characters is not enough to encode the characters found in many other languages (particularly those with non-Roman characters), other encodings exist that do much better. The main standard in use on the internet now is “Unicode”, which contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts. The most common variant of Unicode is “UTF-8”, which was designed for backward compatibility with ASCII. Instead of each character being represented by a specific 8-bit code, Unicode gives each letter or character a “code point”. If you’ve seen things like U+0041, this is a Unicode code point and happens to correspond to LATIN CAPITAL LETTER A. UTF-8 does a nice trick where the first 128 characters are stored like ASCII and other characters use more bytes.
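To make the difference between a code point and its stored bytes concrete, here is a small sketch using the base R functions utf8ToInt() and charToRaw(). The example characters are my own choice, and the é is written with a \u escape (and passed through enc2utf8()) so that the result should not depend on how this file or your session is encoded.

```r
# Code point of "A": U+0041 is 65 in decimal
a_code <- utf8ToInt("A")
a_code                       # 65

# In UTF-8, "A" is stored as a single byte, exactly as in ASCII
a_bytes <- charToRaw("A")
a_bytes                      # 41 (shown in hexadecimal)

# "é" (U+00E9) lies outside ASCII, so UTF-8 stores it in two bytes
e_acute <- enc2utf8("\u00e9")
e_code  <- utf8ToInt(e_acute)
e_bytes <- charToRaw(e_acute)
e_code                       # 233
e_bytes                      # c3 a9
```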

Most computer software now uses UTF-8 as the default encoding and so encoding is something users don’t have to worry about generally. However, some older software, including older R functions, may read in text expecting ASCII or some limited encoding when text is actually UTF-8. This means that some or all of the text will look mangled, say like this:

ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔÇµÇ≠Ç»Ç¢

There are some ways to fix this kind of problem. Dr. Jenny Bryan discusses some of these solutions here https://stat545.com/character-encoding.html and more generally gives some examples of how to convert from one encoding to another. For the purpose of learning more about text wrangling in this course, we can assume that everything is in UTF-8 and that we do not have to convert anything. You can often do this in real work though you may occasionally run into problems and need to google/pray to the R gods. For more on encodings than you probably want to know, try this article: http://kunststube.net/encoding/.
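As a rough sketch of how this kind of mangling arises (and can sometimes be undone), the base R code below takes the UTF-8 bytes for a single character, deliberately mislabels them as Latin-1, and then reverses the mistake. The choice of character and of Latin-1 as the "wrong" encoding are assumptions for illustration; real cases may involve other encodings and may not be reversible.

```r
# The two UTF-8 bytes that encode "é" (U+00E9)
utf8_bytes <- charToRaw(enc2utf8("\u00e9"))

# Wrongly declare those bytes to be Latin-1: each byte becomes its own character
misread <- rawToChar(utf8_bytes)
Encoding(misread) <- "latin1"
mangled <- enc2utf8(misread)
nchar(mangled)               # 2 characters instead of 1

# Undo it: write the two characters back out as Latin-1 bytes,
# then declare (correctly this time) that those bytes are UTF-8
repaired <- iconv(mangled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"
nchar(repaired)              # 1
```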

Creating strings: quotes and special characters

You’ve already been creating strings all over the place, but we’ll step through things a bit more methodically here.

Creating strings is easy: just use quotes, single or double (N.B. pairs of single and double quotes do the same thing so you can use a pair of one or the other interchangeably)

iamstring = "I am very worried about the country right now."
alsostring = 'So very very worried'

print(iamstring)

[1] "I am very worried about the country right now."

print(alsostring)

[1] "So very very worried"

Including a quote within a string is easy if you mix the quote styles.

print("A 'quote' within a quote")

[1] "A 'quote' within a quote"

print('Another "quote" within a quote')

[1] "Another \"quote\" within a quote"

Note in the second example how R converted the single quotes defining the string to double quotes and put backslashes before the double quotes inside the string. This is how you can tell R to ignore the special function some characters have. Within a double quoted string, a double quote would mean “end of the string”. Putting a backslash before the double quote says “this is just a double quote character.” Generally, putting a backslash before a character in a string “escapes” the character. Other special “escape characters”, which are often characters you can’t type with the keyboard easily, include tab, “\t”, and the end-of-line character, “\n”.

cat("a tab separates this\tand\tthis\n")

a tab separates this    and this

cat("\n")

cat("creating multiple lines\nis easy with the end-of-line\ncharacter")

creating multiple lines
is easy with the end-of-line
character
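A short sketch of two consequences of escaping that matter later for regular expressions: an escape sequence counts as a single character, and a literal backslash must itself be escaped. The variable names are made up for this example, and stringr is loaded directly here rather than through tidyverse.

```r
library(stringr)

# An escape sequence takes two keystrokes to type but is one character
tabbed <- "a\tb"
str_length(tabbed)           # 3

# print() shows the escaped form; writeLines() shows the actual text
quoted <- 'Another "quote" within a quote'
print(quoted)
writeLines(quoted)

# A literal backslash is written as two backslashes but is one character
backslash <- "\\"
str_length(backslash)        # 1
writeLines(backslash)
```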

Basic string manipulation using `stringr`

The package stringr has a number of useful functions for working with strings. First, load stringr via tidyverse.

library(tidyverse)

Length

The function str_length() not only tells you how long a string is,

str_length("ugh. so disappointed. so very disappointed.")

[1] 43

but it “threads” over vectors and will tell you how long each string is in a vector.

whysosad = c("trump", "gingrich", "giuliani")
str_length(whysosad)

[1] 5 8 8

This “threading” behavior is useful since it can help you avoid writing a for loop. It is also very useful for dplyr functions like mutate since this means that the function will work on whole columns or variables in a data frame.
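For instance, here is a minimal sketch of str_length() inside mutate(). The tibble and its column names are hypothetical, invented only for this illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data frame, made up for this example
snacks <- tibble(name = c("apple", "kiwi fruit", "fig"))

# str_length() is applied to the whole column at once, no loop needed
snacks_len <- snacks %>%
  mutate(name_length = str_length(name))
snacks_len
```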

Combining strings

Putting strings together is done using str_c(), where the “c” stands for “concatenate” (again, R, use more characters for this word!).

str_c("x", "y", "z", "w")

[1] "xyzw"

To put characters separating the strings, add the sep argument:

str_c("x", "y", "z", "w", sep = " is a character and ")

[1] "x is a character and y is a character and z is a character and w"

str_c is also “vectorized” and threads over vectors,

str_c("prefix-", c("a", "b", "c"), "-suffix")

[1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

Combined strings can be collapsed into a single string with the collapse argument.

str_c(c("x", "y", "z", "w")) # just a vector still!

[1] "x" "y" "z" "w"

str_c(c("x", "y", "z", "w"), collapse = ", ") # now, a single string

[1] "x, y, z, w"
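The sep and collapse arguments can also be used together or one after the other. A small sketch, with made-up labels:

```r
library(stringr)

# sep goes between the pieces of each element (the number 1:3 is recycled)
labels <- str_c("sample", 1:3, sep = "_")
labels                       # "sample_1" "sample_2" "sample_3"

# collapse then joins the elements of the vector into one string
one_line <- str_c(labels, collapse = "; ")
one_line                     # "sample_1; sample_2; sample_3"
```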

Detecting a string

Use str_detect() to see which strings in a vector contain a given string. Use str_subset() to return only those strings that match.

# vector of strings from tidyverse
fruit

 [1] "apple"             "apricot"           "avocado"          
 [4] "banana"            "bell pepper"       "bilberry"         
 [7] "blackberry"        "blackcurrant"      "blood orange"     
[10] "blueberry"         "boysenberry"       "breadfruit"       
[13] "canary melon"      "cantaloupe"        "cherimoya"        
[16] "cherry"            "chili pepper"      "clementine"       
[19] "cloudberry"        "coconut"           "cranberry"        
[22] "cucumber"          "currant"           "damson"           
[25] "date"              "dragonfruit"       "durian"           
[28] "eggplant"          "elderberry"        "feijoa"           
[31] "fig"               "goji berry"        "gooseberry"       
[34] "grape"             "grapefruit"        "guava"            
[37] "honeydew"          "huckleberry"       "jackfruit"        
[40] "jambul"            "jujube"            "kiwi fruit"       
[43] "kumquat"           "lemon"             "lime"             
[46] "loquat"            "lychee"            "mandarine"        
[49] "mango"             "mulberry"          "nectarine"        
[52] "nut"               "olive"             "orange"           
[55] "pamelo"            "papaya"            "passionfruit"     
[58] "peach"             "pear"              "persimmon"        
[61] "physalis"          "pineapple"         "plum"             
[64] "pomegranate"       "pomelo"            "purple mangosteen"
[67] "quince"            "raisin"            "rambutan"         
[70] "raspberry"         "redcurrant"        "rock melon"       
[73] "salal berry"       "satsuma"           "star fruit"       
[76] "strawberry"        "tamarillo"         "tangerine"        
[79] "ugli fruit"        "watermelon"

str_detect(fruit, "fruit")

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[37] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

# get the subset that match
my_fruit <- str_subset(fruit, "fruit")
print(my_fruit)

[1] "breadfruit"   "dragonfruit"  "grapefruit"   "jackfruit"    "kiwi fruit"  
[6] "passionfruit" "star fruit"   "ugli fruit"

Again, notice that this function is vectorized and thus can work nicely for slicing data with dplyr.
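As a sketch of that idea, the code below puts the fruit vector into a tibble and keeps (or drops) rows with filter() and str_detect(). The tibble and column name are assumptions made for this example.

```r
library(dplyr)
library(stringr)

# `fruit` is the example vector that ships with stringr
fruit_tbl <- tibble(name = fruit)

# Keep only the rows whose name contains "berry"
berries <- fruit_tbl %>%
  filter(str_detect(name, "berry"))
berries

# Negate the logical vector to keep everything else
not_berries <- fruit_tbl %>%
  filter(!str_detect(name, "berry"))
```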

Splitting a string

Passing a delimiter to str_split() will split strings or vectors of strings based on the delimiter:

str_split(my_fruit, " ") # split on a space

[[1]]
[1] "breadfruit"

[[2]]
[1] "dragonfruit"

[[3]]
[1] "grapefruit"

[[4]]
[1] "jackfruit"

[[5]]
[1] "kiwi"  "fruit"

[[6]]
[1] "passionfruit"

[[7]]
[1] "star"  "fruit"

[[8]]
[1] "ugli"  "fruit"
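Note that str_split() returns a list, because different strings can split into different numbers of pieces. A brief sketch of two ways to work with that result, using a made-up vector; the simplify argument asks str_split() for a character matrix instead of a list.

```r
library(stringr)

two_words <- c("kiwi fruit", "star fruit", "grapefruit")

# Default: a list with one character vector per input string
pieces <- str_split(two_words, " ")
pieces[[1]]                  # "kiwi" "fruit"
lengths(pieces)              # 2 2 1

# simplify = TRUE: a character matrix, padded with "" where pieces are missing
piece_matrix <- str_split(two_words, " ", simplify = TRUE)
piece_matrix[, 1]            # "kiwi" "star" "grapefruit"
```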

Subsetting a string

You can pull out parts of a string with str_sub(), which is also vectorized,

x <- c("Trump", "Gingrich", "Giuliani")
str_sub(x, 1, 3)

[1] "Tru" "Gin" "Giu"

# negative numbers count backwards from end
str_sub(x, -3, -1)

[1] "ump" "ich" "ani"

and works for assignment too,

str_sub(x, 1, 3) = "XXX"
print(x)

[1] "XXXmp"    "XXXgrich" "XXXliani"

Replacing a substring within a string

Replacing a substring can be done using str_replace():

str_replace(my_fruit, "fruit", "XXX")

[1] "breadXXX"   "dragonXXX"  "grapeXXX"   "jackXXX"    "kiwi XXX"  
[6] "passionXXX" "star XXX"   "ugli XXX"

my_fruit # it didn't modify the original (as usual for most R functions)

[1] "breadfruit"   "dragonfruit"  "grapefruit"   "jackfruit"    "kiwi fruit"  
[6] "passionfruit" "star fruit"   "ugli fruit"
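One detail worth knowing: str_replace() changes only the first match in each string, while str_replace_all() changes every match. A small sketch with a made-up string:

```r
library(stringr)

# Only the first "an" is replaced
first_only <- str_replace("banana", "an", "AN")
first_only                   # "bANana"

# Every "an" is replaced
every_one <- str_replace_all("banana", "an", "AN")
every_one                    # "bANANa"
```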

Regular expressions: basics

In the previous examples, strings were matched directly and entirely; either the whole pattern (e.g., “fruit”) matched some piece of the string exactly or it didn’t match at all. Regular expressions (“regex” for short) provide a syntax for constructing much more complicated ways to match strings than an exact match. Using a regular expression, you can construct a pattern that will only match dates (no matter the date!) or one that will only match URLs (no matter the URL!). Needless to say, this can be extremely helpful when you have to modify/edit tons of data, but the drawback is that regular expressions have a steep learning curve.

Metacharacters

Like the quote characters and backslash, regex have special characters, called metacharacters, that are meant to represent something other than the character itself. The metacharacters are . \ | ( ) [ { ^ $ * + and ?. To see how some of them work, you can use the function str_view(), which shows visually all places where a regex matches a string.

.: match any character

str_view("some characters #*$#(", ".")

[1] │ <s><o><m><e>< ><c><h><a><r><a><c><t><e><r><s>< ><#><*><$><#><(>

^: match at the beginning of the string

str_view(c("ogres", "are trolls ogres?", "no, they aren't"), "^og")

[1] │ <og>res

$: match at the end of the string

str_view(c("ogres", "are trolls ogres?", "no, they aren't"), "s$")

[1] │ ogre<s>

|, []: creating alternatives and character classes; see below
*, +, ?, {}: quantifiers; see below
(): create “groups”; see below
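Two small consequences of the metacharacters above, sketched with made-up strings: combining ^ and $ forces the pattern to match the whole string, and a metacharacter such as . must be escaped with a backslash to match itself. Because the backslash is also special inside an R string, it is typed twice.

```r
library(stringr)

# ^ and $ together: the pattern must match the entire string
words <- c("ogres", "ogres are trolls", "two ogres")
whole <- str_detect(words, "^ogres$")
whole                        # TRUE FALSE FALSE

# An unescaped . matches any character; "\\." matches only a literal period
files <- c("a.c", "abc")
any_char    <- str_detect(files, "a.c")
literal_dot <- str_detect(files, "a\\.c")
any_char                     # TRUE TRUE
literal_dot                  # TRUE FALSE
```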

Regex: character classes

Suppose that you want to match any of the characters “a” or “b”? The “|” operator, which is an “or” operator, works for this because it matches either pattern on its sides. For example,

str_view("is 'party like it is 1999' the best prince song????", "a|b")

[1] │ is 'p<a>rty like it is 1999' the <b>est prince song????

What about matching “a”, “b”, or “c”? A character class is used for this where “[abc]” denotes the class of characters “a”, “b”, and “c” and will match any of those characters.

str_view("is 'party like it is 1999' the best prince song????", "[abc]")

[1] │ is 'p<a>rty like it is 1999' the <b>est prin<c>e song????

You can create an “inverted” character class that matches all characters except those specified by putting “^” at the start of the class:

str_view("is 'party like it is 1999' the best prince song????", "[^abc]")

[1] │ <i><s>< ><'><p>a<r><t><y>< ><l><i><k><e>< ><i><t>< ><i><s>< ><1><9><9><9><'>< ><t><h><e>< >b<e><s><t>< ><p><r><i><n>c<e>< ><s><o><n><g><?><?><?><?>

Finally, there are some handy character classes that are built into regex. For example,

\d, \D, \s, \S, and \w represent any decimal digit, not a digit, a space character, not a space character, and a ‘word’ character, respectively

# we need the extra "\" so that the "\" is actually included in the string!
str_view("is 'party like it is 1999' the best prince song????", "\\d")

[1] │ is 'party like it is <1><9><9><9>' the best prince song????

str_view("is 'party like it is 1999' the best prince song????", "\\D")

[1] │ <i><s>< ><'><p><a><r><t><y>< ><l><i><k><e>< ><i><t>< ><i><s>< >1999<'>< ><t><h><e>< ><b><e><s><t>< ><p><r><i><n><c><e>< ><s><o><n><g><?><?><?><?>

str_view("is 'party like it is 1999' the best prince song????", "\\s")

[1] │ is< >'party< >like< >it< >is< >1999'< >the< >best< >prince< >song????

str_view("is 'party like it is 1999' the best prince song????", "\\S")

[1] │ <i><s> <'><p><a><r><t><y> <l><i><k><e> <i><t> <i><s> <1><9><9><9><'> <t><h><e> <b><e><s><t> <p><r><i><n><c><e> <s><o><n><g><?><?><?><?>

str_view("is 'party like it is 1999' the best prince song????", "\\w")

[1] │ <i><s> '<p><a><r><t><y> <l><i><k><e> <i><t> <i><s> <1><9><9><9>' <t><h><e> <b><e><s><t> <p><r><i><n><c><e> <s><o><n><g>????

\b and \B: match the empty string at either edge of a word, or not at the edge of a word, respectively. The example below shows how this allows you to match the character “i” either at the start of a word or within a word.

str_view("is 'party like it is 1999' the best prince song????", "\\bi")

[1] │ <i>s 'party like <i>t <i>s 1999' the best prince song????

str_view("is 'party like it is 1999' the best prince song????", "\\Bi")

[1] │ is 'party l<i>ke it is 1999' the best pr<i>nce song????
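The same built-in classes work with any stringr function that takes a pattern, not just str_view(). As a small sketch, str_count() can tally the matches shown above (the variable names are my own):

```r
library(stringr)

song <- "is 'party like it is 1999' the best prince song????"

n_digits       <- str_count(song, "\\d")    # digits
n_spaces       <- str_count(song, "\\s")    # whitespace characters
n_word_start_i <- str_count(song, "\\bi")   # "i" at the start of a word

n_digits                     # 4
n_spaces                     # 9
n_word_start_i               # 3
```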

Other built in character classes are specified with “[::]” or ranges such as [a-z]. Some of these include

[:digit:]: same as \d and [0-9].
[:lower:]: lower-case letters, equivalent to [a-z].
[:upper:]: upper-case letters, equivalent to [A-Z].
[:alpha:]: alphabetic characters, equivalent to [A-Za-z].
[:alnum:]: alphanumeric characters, equivalent to [A-Za-z0-9].
[:blank:]: blank characters, i.e. space and tab.
[:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space.
[:punct:]: punctuation characters, ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
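A caution about ranges: they follow the underlying character order, and in ASCII a handful of punctuation characters sit between Z and a. The sketch below, with a made-up test vector, shows why a range meant to cover all letters is usually written [A-Za-z] rather than [A-z].

```r
library(stringr)

chars <- c("a", "Z", "_", "^", "5")

# [A-z] also spans the characters between "Z" and "a", such as ^ and _
wide <- str_detect(chars, "[A-z]")
wide                         # TRUE TRUE TRUE TRUE FALSE

# [A-Za-z] matches only the unaccented letters
letters_only <- str_detect(chars, "[A-Za-z]")
letters_only                 # TRUE TRUE FALSE FALSE FALSE
```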

Finally, it is important to note that unless otherwise specified, when we put a character like “a” in a regex, the regex will match “a” and not “A”. In other words, regex are case sensitive:

str_view("Is 'party like it is 1999' the BEST Prince song????", "[abc]")

[1] │ Is 'p<a>rty like it is 1999' the BEST Prin<c>e song????

We could get this to match by adding additional uppercase characters to the character class:

str_view("Is 'party like it is 1999' the BEST Prince song????", "[abcABC]")

[1] │ Is 'p<a>rty like it is 1999' the <B>EST Prin<c>e song????

but sometimes it’s simpler to tell the regex to match regardless of case:

str_view("Is 'party like it is 1999' the BEST Prince song????", regex("[abc]", ignore_case = TRUE))

[1] │ Is 'p<a>rty like it is 1999' the <B>EST Prin<c>e song????

This is simpler especially if you worry there could be weird combinations of upper and lower case letters.
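Here is a short sketch comparing a case-sensitive match, regex(ignore_case = TRUE), and the alternative of lower-casing the text first with str_to_lower(). The example titles are made up.

```r
library(stringr)

titles <- c("the BEST song", "Best of", "second best", "worst")

# Case-sensitive by default
sensitive <- str_detect(titles, "best")
sensitive                    # FALSE FALSE TRUE FALSE

# Ignore case inside the regex
insensitive <- str_detect(titles, regex("best", ignore_case = TRUE))
insensitive                  # TRUE TRUE TRUE FALSE

# Or normalize the text before matching
lowered <- str_detect(str_to_lower(titles), "best")
lowered                      # TRUE TRUE TRUE FALSE
```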

Regex: quantifiers

One of the most important pieces of a regex pattern is a quantifier that denotes how many times each piece of the pattern should match. There are four basic quantifiers:

*: match zero or more times
+: match one or more times
?: match zero or one times
{n}, {n,}, or {n,m}: match n times, at least n times, or between n and m times.

str_view("is 'party like it is 1999' the best prince song????", "\\?+")

[1] │ is 'party like it is 1999' the best prince song<????>

str_view("is 'party like it is 1999' the best prince song????", "\\??")

[1] │ <>i<>s<> <>'<>p<>a<>r<>t<>y<> <>l<>i<>k<>e<> <>i<>t<> <>i<>s<> <>1<>9<>9<>9<>'<> <>t<>h<>e<> <>b<>e<>s<>t<> <>p<>r<>i<>n<>c<>e<> <>s<>o<>n<>g<?><?><?><?><>

str_view("is 'party like it is 1999' the best prince song????", "\\?{3}")

[1] │ is 'party like it is 1999' the best prince song<???>?

str_view("is 'party like it is 1999' the best prince song????", "\\?+?")

[1] │ is 'party like it is 1999' the best prince song<?><?><?><?>

The last example shows how adding a ? after either * or + makes the quantifier “not greedy” (or “lazy”), so that it matches as few characters as possible.
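Greedy versus lazy matching is easiest to see when the pattern has something to stop at. In this sketch (the HTML-like string is made up), str_extract() returns the first match of each pattern:

```r
library(stringr)

html <- "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, so the match runs to the last ">"
greedy <- str_extract(html, "<.+>")
greedy                       # "<b>bold</b> and <i>italic</i>"

# Lazy: .+? stops at the first ">" it can
lazy <- str_extract(html, "<.+?>")
lazy                         # "<b>"

# {n,m} is greedy too: with four "?" available it takes the maximum of three
bounded <- str_extract("song????", "\\?{2,3}")
bounded                      # "???"
```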

Regex: grouping and detecting matches.

Eventually you will want to extract pieces of a match. For example, if you try to match four digit years, you may want to know what years the regex matched. In order to extract this piece from a larger pattern, you can “group” the year part of the pattern.

str_match("is 'party like it is 1999' the best prince song????", "(\\w+) +(\\d{4})")

     [,1]      [,2] [,3]  
[1,] "is 1999" "is" "1999"

The above example shows a match for a word followed by a four digit year. Using the str_match() function, you get the matched string and then the grouped expressions afterwards in columns two and three.
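Because str_match() is vectorized, it returns one row per input string, with NA where the pattern does not match. A small sketch with made-up phrases, pulling the year group out as a number:

```r
library(stringr)

phrases <- c("like it is 1999", "back in 2024", "no year here")

# Column 1 is the full match; columns 2 and 3 are the two groups
m <- str_match(phrases, "(\\w+) +(\\d{4})")
m

# The third column holds the year; strings without a match give NA
years <- as.integer(m[, 3])
years                        # 1999 2024 NA
```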

If you want a true/false for whether the pattern matched, you can use str_detect() instead of str_match()

str_detect("is 'party like it is 1999' the best prince song????", "(\\w+) +(\\d{4})")

[1] TRUE

and str_count() to get the number of matches in the whole string

str_count("is 'party like it is 1999' the best prince song????", "\\?{2}")

[1] 2

Note here that matches don’t overlap; otherwise you would count more than 2 matches in the four question marks.

Regex: replacing matches

Replacing text with regular expression involves supplying a pattern to match and a replacement string. For example, using str_replace(),

str_replace("is 'party like it is 1999' the best prince song????", "(\\w+) +(\\d{4})", "is 2024")

[1] "is 'party like it is 2024' the best prince song????"

The groups can also be taken advantage of in the replacement, where the “backreferences” \1, \2, and so on contain the first, second, and later grouped parts of the pattern. (Inside an R string each backslash is doubled, so they are typed as “\\1” and “\\2”.)

str_replace("is 'party like it is 1999' the best prince song????", "(\\w+) +(\\d{4})", "was 1899 but \\1 now \\2")

[1] "is 'party like it was 1899 but is now 1999' the best prince song????"
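Backreferences are handy for rearranging pieces of a match. As a sketch, the code below reorders made-up year-month-day dates into month/day/year by capturing three groups and referring to them in a different order:

```r
library(stringr)

dates <- c("2024-10-08", "1999-12-31")

# Group 1 = year, group 2 = month, group 3 = day
us_style <- str_replace(dates, "(\\d{4})-(\\d{2})-(\\d{2})", "\\2/\\3/\\1")
us_style                     # "10/08/2024" "12/31/1999"
```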

Regex: practice!

There are a couple of nice websites that allow you to easily construct a regex pattern and see how it matches text at the same time. This can be very useful when trying to construct a pattern to do a very specific job.

This is a good place to learn the basics in a more fun way, too:

https://regexcrossword.com/challenges/tutorial

Lab

For reference, here is the RStudio regular expression cheat sheet: pdf on github.

Problems

Go to https://regexcrossword.com/challenges/beginner. Solve the first four puzzles and submit the answers to them in your qmd file.
Write your own 2x2 regex crossword in the style of the puzzles in problem #1. Provide the puzzle and solution.
Use GWAS Catalog data for this problem, which you can load like this:
```
gwas = read_tsv("gwas-catalog-v1.0.3.1-studies-r2024-09-22.tsv")
```
Unlike our previous GWAS data, these data are just the studies (not the individual SNPs) but they are all the studies in the database.
For a description of the columns in this dataset, look here: https://www.ebi.ac.uk/gwas/docs/fileheaders#_file_headers_for_unpublished_studies.

How many different studies in the database study something to do with the thyroid? (use str_detect and DISEASE/TRAIT for this)
How many of these studies reported no significant association for their DISEASE/TRAIT? (use ASSOCIATION COUNT)
Using the GWAS data above, how many studies use at least 50,000 individuals in their INITIAL SAMPLE SIZE?
Hint: you’ll definitely need a regular expression for this and str_extract should be useful. Getting all the numbers from INITIAL SAMPLE SIZE is tricky so get the first one as a start. If you’re interested in a challenge, get all the numbers and check them.