3 Strings

3.1 Basic String Functions

A string is a sequence of characters that are bound together, where a character is a symbol in a written language.

In R, a string is of mode (or class) character and is bounded by quotes (either single or double):

mode("r")
class('r')

Two notes:

R helps keep your life complicated via two similar but not identical concepts: mode and class. Feel free to lose time trying to fully internalize how they differ.
Double quotes are preferable, because then one can use single quotes as apostrophes in strings.
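
As a small illustration of both notes, here is a sketch (the object names f, s1, and s2 are made up for this example). A factor is one case where mode and class disagree, and double quotes let an apostrophe appear without an escape.

```r
# A factor stores integer codes, so its mode and its class differ
f <- factor(c("a", "b"))
mode(f)      # "numeric"
class(f)     # "factor"

# For a plain string, the two agree
mode("r")    # "character"
class("r")   # "character"

# An apostrophe needs no escape inside double quotes,
# but must be escaped inside single quotes
s1 <- "it's"
s2 <- 'it\'s'
identical(s1, s2)   # TRUE
```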

There are special symbols that one uses for tabbing (\t) and for forcing line breaks (\n):

message <- "To-do list for students:\n\tHomework\n\tLabs\n\tFinal Project"
cat(message)

To count the number of characters in a string, use the function nchar() (as opposed to length(), which counts the number of elements in a vector).

To illustrate the difference, let’s define a vector of strings:

str.vec <- c("I","will","master","R")
nchar(str.vec)
length(str.vec)
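
A further sketch of the same distinction, using a made-up vector named words: nchar() works element by element, so an empty string counts as zero characters and an escape sequence such as \t counts as one, while length() only counts elements.

```r
words <- c("", "a\tb", "master")
nchar(words)       # 0 3 6 (the tab is a single character)
length(words)      # 3

print(words[2])    # print() shows the escape sequence as typed
cat(words[2], "\n")  # cat() renders the tab
```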

Potentially useful functions to use with vectors that we have not seen previously include:

head(str.vec,2)       # display/extract the first two elements
tail(str.vec,2)       #                 the last two elements
toupper(str.vec)      # change all characters to upper case
tolower(str.vec)      # change all characters to lower case

Let’s define another string vector:

candidates <- c("Trump","Clinton","Johnson","Stein")

If you, for instance, want to extract the first two letters of each name, you can use substr():

substr(candidates,1,2)

If you want to extract the last two letters, utilize nchar(), since the total number of letters in each name is different:

substr(candidates,nchar(candidates)-1,nchar(candidates))
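
If you need this often, the pattern can be wrapped in a small function. The helper below, last_n(), is hypothetical (it is not part of base R) and is only a sketch of one way to do it.

```r
# Hypothetical helper: return the last n characters of each string in x
last_n <- function(x, n) {
  len <- nchar(x)
  # pmax() keeps the start position at 1 or above for strings shorter than n
  substr(x, pmax(len - n + 1, 1), len)
}

last_n(c("Trump", "Clinton", "Johnson", "Stein"), 2)   # "mp" "on" "on" "in"
last_n("R", 2)                                         # "R"
```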

To replace the last two letters with something else, put substr() on the left-hand side of an assignment:

substr(candidates,nchar(candidates)-1,nchar(candidates)) <- ":)"
candidates

If there is a mismatch between the number of characters to replace (here: 2) and the number of characters in the replacement string (here: 6), R will simply truncate. Here, that means only the first two characters in the replacement string are used.

substr(candidates,nchar(candidates)-1,nchar(candidates)) <- ":(:|:)"
candidates
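
The mismatch can also go the other way. The sketch below (with made-up variable names) shows both cases: when the replacement is shorter than the span, only as many characters as the replacement contains are overwritten.

```r
short <- "abcdef"
substr(short, 1, 3) <- "z"      # replacement shorter than the 3-character span
short                           # "zbcdef": only the first character changes

long <- "abcdef"
substr(long, 1, 3) <- "WXYZ"    # replacement longer than the span
long                            # "WXYef": the replacement is truncated to "WXY"
```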

3.1.1 Basic String Functions: Lab Exercises

Please download and work through the following R Markdown file:

string_functions.Rmd

The solutions are provided here.

3.2 Regular Expressions

Regexes are specially constructed strings that allow for flexible pattern matching. The rules for constructing regexes are independent of R; you may already know them. (But even then, reviewing them cannot hurt. Too much.)

Literals: strings we want to literally match (e.g., “fly”, which does not match with “flies”)
“Or”: for more flexible matching; if you want to match “fly” or “flies”, use “fly|flies”.
Concatenation: “(a|b) (c|d)” is a concatenation of two regexes…use this if you want to find “a” or “b” followed by a space followed by “c” or “d”. Here, the parentheses define the group of possible literals that we are trying to match.

The grep() function carries out matching.

str.vec <- c("fly","Fly","flies")
grep("fly",str.vec)               # grep is case-sensitive! This returns the index (indices) of the match(es).
grep("fly",str.vec,value=TRUE)    #                         This returns the vector value(s) for match(es).
grep("fl",str.vec,value=TRUE)
str.vec <- c("time flies","fruit fly","fruit flies")
grep("(time|fruit) flies",str.vec,value=TRUE)
grep("fruit (time|times)",str.vec,value=TRUE)       # no error, just a zero-length vector

NOTE: a very useful alternative to grep() is grepl(), or “logical grep”: instead of returning the indices of matching elements or the values themselves, it returns, for each element, TRUE if there is a match and FALSE otherwise.

grepl("(time|fruit) flies",str.vec)
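
Because grepl() returns one logical value per element, its output can be used directly for subsetting and counting. A short sketch, using a made-up vector named flies, which also shows the ignore.case argument as one way around case sensitivity:

```r
flies <- c("fly", "Fly", "flies")

grep("fly", flies, ignore.case = TRUE)   # 1 2: now "Fly" matches too

hits <- grepl("fl", flies)               # TRUE FALSE TRUE
flies[hits]                              # "fly" "flies"
sum(hits)                                # 2 matching elements
```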

We use square brackets when we want to specify a range of possible matching characters:

“[abcde]” means “look for any string that contains a, b, c, d, or e” (case sensitive!)
“[a-e]” means the same thing; the dash denotes a range
“[^a-e]” means “look for any string that contains at least one character other than a, b, c, d, or e”
“[1-4][2-6]” matches strings that contain the numbers 12-16, 22-26, 32-36, or 42-46

str.vec <- c("I am 18 years old","I turned 24 yesterday","His age is 112")
grep("[w-z]",str.vec,value=TRUE)
grep("[^w-z]",str.vec,value=TRUE)
grep("[1-4][2-6]",str.vec,value=TRUE)

Note: do not use, e.g., “[2020]” to try to match the year 2020! This actually will match any string that has 2, 0, 2, or 0, or stated more concisely, 0 or 2, in it. Thus it will match with “2020” and “2002” and “2200” and “2345” and “0135”, etc.
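
To see the difference, compare the bracketed pattern with the plain literal. The vector years is made up for this sketch.

```r
years <- c("2020", "2002", "2345", "1999")

grepl("[2020]", years)   # TRUE TRUE TRUE FALSE: any string containing a 0 or a 2
grepl("2020", years)     # TRUE FALSE FALSE FALSE: the literal sequence 2020
```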

Other ways to specify particular character types include:

“[[:alnum:]]”, which is the same as “[a-zA-Z0-9]”
“[[:punct:]]”, which means “match any string that contains a punctuation mark”
“[[:space:]]”, which means “match any string that contains a space, a tab, or a new line”
“.”, which matches any single character (and so is meaningless unless applied as in, e.g., “(a|b).(c|d)”)

str.vec <- c("R2D2","r2d2","R2 D2","R2-D2")
grep("[A-Z][0-9]",str.vec,value=TRUE)
grep("[[:space:]]",str.vec,value=TRUE)
grep("[[:punct:]]",str.vec,value=TRUE)

See ?regex for more possibilities.
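
As a sketch of two of those possibilities, the example below uses “[[:upper:]]” (any upper-case letter, one of the classes listed in ?regex) and the “.” wildcard placed between two literals. The vector name droids is made up.

```r
droids <- c("R2D2", "r2d2", "R2 D2", "R2-D2")

grepl("[[:upper:]]", droids)   # TRUE FALSE TRUE TRUE
grepl("2.D", droids)           # FALSE FALSE TRUE TRUE: a 2, then any one character, then D
```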

Metacharacters are characters that are not to be interpreted literally! The symbols . $ ^ * + ? | { } [ ] ( ) are all metacharacters.

To find occurrences of these symbols in strings, we need to use an escape sequence: we place a backslash in front of the symbol. (Note that in R code we have to type two backslashes, because the backslash is also the escape character in R string literals: "\\?" is the R string that holds the regex \?.)

str.vec <- c("que?","these symbols-[ and ]-are square brackets",":)",":->")
grep("\\?",str.vec,value=TRUE)
grep("\\]",str.vec,value=TRUE)
grep(":\\)",str.vec,value=TRUE)
grep(">",str.vec,value=TRUE)

Note: a single backslash in any string will not be interpreted as a single backslash, but rather as a backslash plus whatever character follows it. (For example, you cannot search for the backslash in the string "\n" because to R, the backslash and the “n” are implicitly combined together into a single entity that means “line break”.)
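
A short sketch of these points, with a made-up vector named esc. It also shows the fixed=TRUE argument of grepl(), which treats the pattern as a literal string and is one alternative to escaping.

```r
esc <- c("a\\b", "a\nb", "que?")

nchar(esc)                      # 3 3 4: "\\" and "\n" are each one character
grepl("\\\\", esc)              # TRUE FALSE FALSE: four backslashes find one literal backslash
grepl("\\?", esc)               # FALSE FALSE TRUE
grepl("?", esc, fixed = TRUE)   # FALSE FALSE TRUE: no escaping needed
```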

Quantifiers in regexes allow us greater flexibility in searches.

“+” means “occurs 1 or more times”
“*” means “occurs 0 or more times”
“?” means “the preceding regex is optional”…it differs from “*” in that “?” means “occurs 0 or 1 time” only
“{n}”, “{n,}”, and “{n,m}” mean “exactly n”, “n or more”, and “between n and m inclusive” times; note that these only work in the way you’d expect in combination with other regexes

A quantifier’s scope is what it is applied to. By default, the scope is the character preceding the quantifier. If you wish to have the quantifier apply to an entire group of characters, place parentheses around those characters.

str.vec <- c("a","ab","ac","bc","cb","abb?","abab","10","100","1000")
grep("ab+",str.vec,value=TRUE)
grep("ab*",str.vec,value=TRUE)
grep("bc?",str.vec,value=TRUE)
grep("bc?\\?",str.vec,value=TRUE)
grep("10{1,2}",str.vec,value=TRUE)         # doesn't actually match just 10 and 100

Admittedly, this last one is a bit confusing. Think of it as asking whether there is a “10” or a “100” in the string. “1000” has both a “10” and a “100”. You have to add more to the regex to limit the matches to the strings “10” and “100” only. See, e.g., the concept of anchoring below.

grep("(ab)+",str.vec,value=TRUE)
grep("(ab){2}",str.vec,value=TRUE)
grep("(00){2}",str.vec,value=TRUE)

When ^ is used outside of square brackets, it means we will only look for the match at the beginning of the string. (It “anchors” the search to the beginning of the string.)

str.vec <- c("Win!","winner","I win.")
grep("^[Ww]in",str.vec,value=TRUE)

Similarly, “$” looks for a match only at the end of the string.

str.vec <- c("WIN","win","Winner winner chicken dinner")
grep("[Nn]$",str.vec,value=TRUE)
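
Using both anchors at once forces the regex to account for the entire string. As a sketch, this resolves the earlier “10{1,2}” example (the vector name nums is made up):

```r
nums <- c("10", "100", "1000")

grep("10{1,2}", nums, value = TRUE)     # "10" "100" "1000"
grep("^10{1,2}$", nums, value = TRUE)   # "10" "100": nothing may precede or follow the match
```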

Regexes can also specify where to split a string, via the function strsplit():

str <- "keep ... on ....swimming... swimming.......swimming..."
strsplit(str,split=" *\\.+ *")       # 0+ spaces followed by 1+ periods followed by 0+ spaces

Note that the output of strsplit() is a list. Here, the input is a single string, so the output list has only one element that contains the split vector of strings. In cases like this, applying unlist() can be helpful.

str <- "keep on swimming; swimming;  swimming;   swimming"
unlist(strsplit(str,split=";? +"))   # optional semi-colon followed by 1+ spaces
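
When the input is a vector with more than one string, the list has one element per input string. A minimal sketch, with made-up names lists and pieces:

```r
lists  <- c("a,b", "c,d,e")
pieces <- strsplit(lists, split = ",")

lengths(pieces)          # 2 3: number of pieces per input string
sapply(pieces, `[`, 1)   # "a" "c": the first piece of each string
unlist(pieces)           # "a" "b" "c" "d" "e": flattened, per-string grouping is lost
```

Whether to call unlist() depends on whether you still need to know which input string each piece came from.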

3.2.1 Regular Expressions: Lab Exercises

Please download and work through the following R Markdown file:

regex.Rmd

The solutions are provided here.

3.3 String Extraction and Replacement

Let’s say that instead of receiving an entire string that contains a matching substring, which is what you would receive if you use grep() with value=TRUE, you just want that substring. To get just the substring, use a combination of regexpr() and regmatches():

regexpr() returns the location of the first match in the target string (or -1 if no match is found); and
regmatches() takes the output of regexpr() and returns the matching substring.

An example of string extraction:

str.vec <- c("abccc","ababc","cabcc","ccabb","ccccc")
reg.exp <- regexpr("ab+",str.vec,useBytes=TRUE)
reg.exp
regmatches(str.vec,reg.exp)

The first line of output tells you where in each string the match occurs, and the match.length attribute tells you the length of the matching substring in characters.
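
Note that regmatches() silently drops the strings that have no match, so its output can be shorter than its input. The sketch below (with made-up names abc.vec, m, and out) pulls out the positions and lengths explicitly and shows one way to keep the result aligned with the input.

```r
abc.vec <- c("abccc", "ababc", "cabcc", "ccabb", "ccccc")
m <- regexpr("ab+", abc.vec)

as.integer(m)                 # 1 1 2 3 -1: start of the first match, -1 for none
attr(m, "match.length")       # 2 2 2 3 -1

regmatches(abc.vec, m)        # "ab" "ab" "ab" "abb": only four values

# Keep one entry per input string, with NA where there is no match
out <- rep(NA_character_, length(abc.vec))
out[m > 0] <- regmatches(abc.vec, m)
out                           # "ab" "ab" "ab" "abb" NA
```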

You can also use regmatches() to specify a replacement for these substrings:

regmatches(str.vec,reg.exp) <- "xy"
str.vec

Notice that in the second string, only the first instance of “ab” is replaced. Again, as we have been using it up to now, regexpr() will only return the first matching substring. To get all the matching substrings in a given string, use gregexpr():

str.vec  <- c("abab","abAB","ABAB","ccAbCC")
greg.exp <- gregexpr("ab|AB",str.vec)
regmatches(str.vec,greg.exp)                # returns a list, one element per input string
regmatches(str.vec,greg.exp) <- "xy"
str.vec
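
Since the extraction returns a list, counting the matches in each string is a matter of taking the length of each list element. A sketch, using the made-up names ab.vec and all.hits:

```r
ab.vec   <- c("abab", "abAB", "ABAB", "ccAbCC")
all.hits <- regmatches(ab.vec, gregexpr("ab|AB", ab.vec))

lengths(all.hits)   # 2 2 2 0: "Ab" is neither "ab" nor "AB"
```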

As an alternative, we can use sub() and gsub():

str.vec <- c("abab","abAB","ABAB","ccAbCC")
sub("ab|AB","xy",str.vec)                   # sub(): replace first occurrence of matching substring
str.vec

Note that unlike before, the original string vector is not itself changed! If you need, e.g., to compare how your initial and final strings appear, sub() or gsub() is the way to go. The difference: sub() replaces the first occurrence of the matching substring, while gsub() replaces all occurrences of the matching substring.

str.vec  <- c("abab","abAB","ABAB","ccAbCC")
gsub.out <- gsub("ab|AB","xy",str.vec)
gsub.out
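
A side-by-side sketch of the two functions, with made-up variable names. It also uses the ignore.case argument, which is one way to catch mixed-case substrings like “Ab”.

```r
orig <- c("abab", "ccAbCC")

first.only <- sub("ab", "xy", orig, ignore.case = TRUE)
every      <- gsub("ab", "xy", orig, ignore.case = TRUE)

first.only   # "xyab" "ccxyCC"
every        # "xyxy" "ccxyCC"
orig         # "abab" "ccAbCC": the input is left untouched
```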

3.3.1 String Extraction and Replacement: Lab Exercises

Please download and work through the following R Markdown file:

string_extraction.Rmd

The solutions are provided here.