Strings

Authors

Jose Storopoli

Kevin Bonham

Juan Oneto

A lot of data, especially metadata about clinical samples or subjects, are stored in arbitrary text, rather than numbers. In programming languages like Julia, the data type used to encode text is called a String.

Julia has numerous facilities for working with Strings, some of which have already been introduced in previous tutorials. Here, we will repeat some of those tools and introduce many more, with the hope that this tutorial will serve as a one-stop shop for all of your text processing needs.

Caution

In R, string values are called “character” instead of “string.”

r$> as.character(1001)
[1] "1001"

r$> typeof(as.character(1001))
[1] "character"

Julia also has a “character” (Char) type, but this is used to represent individual characters, rather than whole strings. A String can be thought of as combining 0 or more Chars, and if you index a single value out of a String, you get a Char. Notice also that single quotes (') can only be used to make a Char, and double quotes (") are required for Strings.

julia> s = "A String"
"A String"

julia> 'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> 'A String'
ERROR: syntax: character literal contains multiple characters
Stacktrace:
 [1] top-level scope
   @ none:1

julia> s[3]
'S': ASCII/Unicode U+0053 (category Lu: Letter, uppercase)
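
As a small sketch of how the two types relate (the variable names are ours, chosen only for illustration), a String can be taken apart into its Chars with collect() and put back together with String():

```julia
chars = collect("abc")      # Vector{Char}: ['a', 'b', 'c']
rebuilt = String(chars)     # back to the String "abc"

# A Char and a one-character String are different values
same = 'a' == "a"           # false
as_string = string('a')     # "a", now a String
```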

1 Strings as containers

In many ways, Strings are treated as “scalar” values - that is, they are atomic data points. But in other ways, they are like vectors of characters.

For example, we can index into strings.

ex_string = "abcdefg"
"abcdefg"
ex_string[1]
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
ex_string[end]
'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
ex_string[3:5]
"cde"
Caution

In many cases, it is one character per index, but not always! Julia indexes Strings by byte position (UTF-8 “code units”), rather than by character. This doesn’t matter for ASCII text, where every character is a single byte, but if you want more details, see the end of the tutorial.
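
To make the caution concrete, here is a minimal sketch using a string of Greek letters (our own example, not taken from the tutorial). Each of these letters occupies two bytes in UTF-8, so the valid indices skip every other number:

```julia
greek = "αβγ"                             # each letter takes 2 bytes in UTF-8

n_chars = length(greek)                   # 3 characters
n_bytes = ncodeunits(greek)               # 6 bytes
valid = collect(eachindex(greek))         # [1, 3, 5]: where each character starts

first_char = greek[1]                     # 'α'
second_char = greek[nextind(greek, 1)]    # 'β', which lives at index 3

# greek[2] would throw a StringIndexError, because byte 2 is in the middle of 'α'
```

When in doubt, iterating over the string (or over eachindex()) avoids having to compute indices by hand.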

2 Building Strings

In Julia, Strings are constructed using double quotes: ".

"This is a String"
"This is a String"
"a" # Also a string
"a"
"" # an empty string
""

A multi-line string can be easily constructed using 3 double quotes ("""). For convenience, when using this syntax, whitespace that’s found at the beginning of every line is ignored, as is a new line at the beginning.

That is, the following are equivalent:

ozy1 = """I met a traveller from an antique land,
Who said—“Two vast and trunkless legs of stone
Stand in the desert. . . .
"""
"I met a traveller from an antique land,\nWho said—“Two vast and trunkless legs of stone\nStand in the desert. . . .\n"
ozy2 = """
I met a traveller from an antique land,
Who said—“Two vast and trunkless legs of stone
Stand in the desert. . . .
"""
"I met a traveller from an antique land,\nWho said—“Two vast and trunkless legs of stone\nStand in the desert. . . .\n"
ozy3 = """
    I met a traveller from an antique land,
    Who said—“Two vast and trunkless legs of stone
    Stand in the desert. . . .
    """
"I met a traveller from an antique land,\nWho said—“Two vast and trunkless legs of stone\nStand in the desert. . . .\n"
ozy1 == ozy2 == ozy3
true
Tip

In the output, the \n character is a unix “newline” character. If this were written to a file, the result of each \n would be a line break in most text editors.

Note

These lines are taken from the poem Ozymandias, by Percy Shelley.

s1 = "string 1"
s2 = "string 2"
s3 = "string 3"
"string 3"

Strings can be “concatenated” using the * operator.

s1 * s2 * s3
"string 1string 2string 3"

Alternatively, one can use the string() function, which takes any number of arguments and combines them into a single String.

string(s1, s2, s3) == s1 * s2 * s3
true

One advantage to using string() is that one can provide non-string arguments, and they will be automatically converted to strings.

n = 42
42
string(s1, " and ", n)
"string 1 and 42"

Whereas if you try this using concatenation, you’ll get an error:

s1 * " and " * n
MethodError: no method matching *(::String, ::Int64)

Closest candidates are:
  *(::Any, ::Any, !Matched::Any, !Matched::Any...)
   @ Base operators.jl:587
  *(!Matched::BigFloat, ::Union{Int16, Int32, Int64, Int8})
   @ Base mpfr.jl:447
  *(!Matched::Missing, ::Number)
   @ Base missing.jl:123
  ...

Stacktrace:
 [1] *(::String, ::String, ::Int64)
   @ Base ./operators.jl:587
 [2] top-level scope
   @ ~/_work/PumasTutorials.jl/PumasTutorials.jl/tutorials/DataWranglingInJulia/06-strings.qmd:259

2.1 🪢 join()

Another way to join multiple strings is to use join(). join() takes a collection (eg a vector) of strings as the first argument, an optional delimiter (that will be added between each item), and an optional last delimiter (that will be added between the last and penultimate item instead of the primary delimiter).

ss = [s1, s2, s3, "string 4"]
4-element Vector{String}:
 "string 1"
 "string 2"
 "string 3"
 "string 4"
join(ss, "; ")
"string 1; string 2; string 3; string 4"
join(ss, ", ", ", and ")
"string 1, string 2, string 3, and string 4"

2.2 Mixed join

As with string(), elements of a collection passed to join() will be converted to Strings if they aren’t already.

join([1, 2, 3], " and ")
"1 and 2 and 3"

2.3 Interpolation (adding in expressions)

It is often quite useful to insert the contents of variables or expressions into a string. This is called “interpolating” the value into the string.

In Julia, this is done using $. If the expression (usually a variable) only contains numbers, letters, or underscores, you can just use $. Otherwise, wrap it in parentheses, eg $(expr).

x = 42
42
"There's this number, $x. Half of it is $(x ÷ 2)"
"There's this number, 42. Half of it is 21"
Tip

Occasionally, you may want to put a floating point number into a string, but it has far too many decimal places. The simplest way to deal with this is to use round(). round() can be used outside the context of Strings as well, but it’s particularly useful in this case.

julia> "🥧 is ≈ $(22 / 7)"
"🥧 is ≈ 3.142857142857143"

julia> "🥧 is ≈ $(round(22 / 7; digits = 2))"
"🥧 is ≈ 3.14"

julia> round(123.456; sigdigits = 2)
120.0

3 🐣 Basic string operations

The “length” of a String indicates how many characters are present:

length("abc")
3

3.1 🛌 lpad() and rpad()

If you need strings to be longer, you can “pad” them to the right or left with rpad() and lpad() respectively. These functions take the original string, a length, and the Char or String to pad with.

lpad("thing", 10, "xyz")
"xyzxything"
rpad("thing", 10, "xyz")
"thingxyzxy"

This can be useful when building eg identification numbers with a fixed width, since you can also provide non-string numbers as the first argument:

"X" .* lpad.(1:10:102, 3, '0')
11-element Vector{String}:
 "X001"
 "X011"
 "X021"
 "X031"
 "X041"
 "X051"
 "X061"
 "X071"
 "X081"
 "X091"
 "X101"

3.2 ✂️ Trim white space with strip()

The strip() family of functions (including rstrip() and lstrip()) removes leading and/or trailing characters (usually spaces) from a String.

spacey = "    far out! 👾     "
"    far out! 👾     "
strip(spacey)
"far out! 👾"
lstrip(spacey)
"far out! 👾     "
rstrip(spacey)
"    far out! 👾"

Each of these functions can also take a character or vector of characters to strip, or a boolean function, such that a character that returns true will be removed.

strip(isuppercase, "TOP of the morning")
" of the morning"
rstrip("What! A! Crazy! DAY!!?!?!", ['!', '?'])
"What! A! Crazy! DAY"

3.3 💼 Changing case

Julia has convenient functions for modifying the case of Strings.

  • uppercase()
  • lowercase()
  • titlecase()
my_str = "Who's on first? That's what I SAID!"
"Who's on first? That's what I SAID!"
lowercase(my_str)
"who's on first? that's what i said!"
uppercase(my_str)
"WHO'S ON FIRST? THAT'S WHAT I SAID!"
titlecase(my_str)
"Who'S On First? That'S What I Said!"
titlecase(my_str, strict = false)
"Who'S On First? That'S What I SAID!"
titlecase(my_str, wordsep = isspace) # this one is subtle, look at the 's' after '
"Who's On First? That's What I Said!"

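A common use of strip() together with the case functions is tidying up inconsistently typed labels. The labels below are made up for illustration:

```julia
raw_labels = ["  Placebo", "PLACEBO ", "treatment\t", " Treatment  "]

# strip() removes leading/trailing whitespace, lowercase() unifies the case
clean_labels = lowercase.(strip.(raw_labels))

unique_labels = unique(clean_labels)    # ["placebo", "treatment"]
```
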
3.4 🪓 Breaking up Strings with split()

The split() function is the inverse of join(), dividing a String into pieces at particular characters or strings (whitespace by default).

split("what a great day!")
4-element Vector{SubString{String}}:
 "what"
 "a"
 "great"
 "day!"
split("what a great day!", "t ")
3-element Vector{SubString{String}}:
 "wha"
 "a grea"
 "day!"

By default split() will break on each delimiter, even if there’s nothing in between them. You may use the keepempty keyword argument to override this behavior.

split("who,,,,wrote,,,,this?", ",")
9-element Vector{SubString{String}}:
 "who"
 ""
 ""
 ""
 "wrote"
 ""
 ""
 ""
 "this?"
split("who,,,,wrote,,,,this?", ","; keepempty = false)
3-element Vector{SubString{String}}:
 "who"
 "wrote"
 "this?"

You can also pass an array of delimiters, or regular expressions (see below) to break up the string on multiple patterns.

split(
    "Hmm, this might be... complicated!",
    [' ', ',', '.']; # a space, a comma, and a period
    keepempty = false,
)
5-element Vector{SubString{String}}:
 "Hmm"
 "this"
 "might"
 "be"
 "complicated!"

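Two further behaviors are worth a quick sketch (the example strings are ours): the limit keyword caps the number of pieces, and split() followed by join() with the same delimiter gives back the original string.

```julia
record = "subject=ID=0042"

# limit = 2 stops after the first delimiter and leaves the rest intact
key, value = split(record, "="; limit = 2)    # "subject" and "ID=0042"

# splitting and re-joining on the same delimiter is lossless
pieces = split("a,b,,c", ",")
roundtrip = join(pieces, ",") == "a,b,,c"     # true
```
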
4 🔁 replace()

If we want to replace part of a String with something else, we use the replace() function, which takes a String as the first argument, and a Pair that shows what to replace.

replace("Goodbye, cruel world!", "Goodbye, cruel" => "Hello,")
"Hello, world!"

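replace() has a few more options than the single example shows. The sketch below (with our own example string) uses the count keyword, a function as the replacement, and several pairs in one call; the last form requires Julia v1.7 or later.

```julia
s = "banana"

first_only = replace(s, "a" => "o"; count = 1)    # "bonana": only the first match
shouted = replace(s, "an" => uppercase)           # "bANANa": function applied to each match

# Julia 1.7 and later accept several pairs in a single call
swapped = replace(s, "b" => "B", "n" => "N")      # "BaNaNa"
```
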
5 🔎 Finding matches

There are a number of ways to look into the contents of strings. In the documentation for these functions, they use the phrase “needle” for the thing that is being searched for and “haystack” for the thing being searched.

Note

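In case you’re not familiar with it, the English idiom “looking for a needle in a haystack” describes searching for something small hidden inside something much larger.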

5.1 contains()

The most basic way to match is the contains(haystack, needle) function, which is a Boolean function that takes an AbstractString as the first argument (the haystack), and something to look for as the second argument (the needle) - it may be another String, a Char, or a Regex (which we’ll get to later). contains() returns true if there are any matches, false if not.

contains("banana", "ana")
true
contains("banana", 'z')
false
Note

There is also the occursin() function, which is essentially the reverse signature - occursin(needle, haystack).

julia> occursin("ana", "banana")
true
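
When called with only the needle, contains() returns a function, which is convenient for filtering a collection. The related startswith() and endswith() functions check only the beginning or end of a string. A brief sketch with a made-up vector (the one-argument forms need Julia v1.5 or later):

```julia
fruits = ["banana", "apple", "cantaloupe", "mango"]

# contains("an") is a function that tests whether its argument contains "an"
with_an = filter(contains("an"), fruits)      # ["banana", "cantaloupe", "mango"]

a_fruits = filter(startswith("a"), fruits)    # ["apple"]
e_fruits = filter(endswith("e"), fruits)      # ["apple", "cantaloupe"]
```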

5.2 find*()

There is a family of find*() functions that may be used to identify the indices of matches within a String. These are related to similar functions used on arrays and other containers, but the two that are used most often with Strings are:

  • findfirst(): Identifies the first index / index range that matches
  • findall(): Identifies all indices or index ranges that match

The return value can be used to index into the original string and return the match.

findfirst("an", "banana")
2:3
findfirst('a', "banana")
2
findall("an", "banana")
2-element Vector{UnitRange{Int64}}:
 2:3
 4:5
findall("a", "banana")
3-element Vector{UnitRange{Int64}}:
 2:2
 4:4
 6:6
Note

Did you notice that searching with a String, even if there’s only 1 character in the String, returns a 1-element range rather than an integer index? This is so that using the return value will produce exactly what was searched for. Using a single index would return a Char instead of a String:

julia> "banana"[2]
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> "banana"[2:2]
"a"
Tip

There is no findall(::Char, ::String) in Julia v1.6 (though it was added in v1.7). If you’re using findall() in Julia v1.6 and need to get a single character, you can use only(str) to get the Char from the string:

julia> only("a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
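
Putting this together, here is a short sketch of using the return values to index back into the original string, and of what comes back when nothing is found:

```julia
haystack = "banana"

rng = findfirst("an", haystack)    # 2:3
found = haystack[rng]              # "an"

# every range from findall() can be used the same way
all_found = [haystack[r] for r in findall("an", haystack)]    # ["an", "an"]

# with no match, findfirst() returns nothing and findall() an empty vector
missing_first = findfirst("xyz", haystack)
missing_all = findall("xyz", haystack)
```

Because findfirst() may return nothing, it is worth checking the result before using it as an index.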

5.3 match()

The match() function is more complex, and has much more functionality. The signature is match(needle, haystack), but the needle needs to be a Regex type, which is Julia’s representation of a “regular expression.”

Regular expressions are an enormous topic, and a comprehensive treatment is beyond the scope of this tutorial. But a brief primer with some basic information may be useful.

Tip

If you’d like a deeper dive on regular expressions than we’re providing here, we can recommend RegexOne for learning, and Regex101 for practicing and testing your regular expressions.

5.3.1 Aside: Regular expressions in Julia

The easiest way to make a regular expression is to use the “string literal” syntax, which is just putting an r before the first quote:

r"a regular expression"
r"a regular expression"
Tip

This is special Julia syntax for calling a macro @r_str. There are a number of other string literals in Julia, all of which are called @something_str. One that we’ll see in just a moment is raw"", short for @raw_str.

You may also construct a regular expression using a String and the Regex() constructor.

Regex("a different regular expression")
r"a different regular expression"

Regular expressions often use special characters (like \d to represent a digit) which can be used directly in the string literal:

r"a digit: \d"
r"a digit: \d"

But if you need to use the Regex() constructor, you’ll need to “escape” the first \ with another \, or use the raw"" string literal. If you don’t…

Regex("a digit: \d")
ParseError:
# Error @ /build/_work/PumasTutorials.jl/PumasTutorials.jl/tutorials/DataWranglingInJulia/06-strings.qmd:693:17
#| error: true
Regex("a digit: \d")
#               └┘ ── invalid escape sequence
Stacktrace:
 [1] top-level scope
   @ ~/_work/PumasTutorials.jl/PumasTutorials.jl/tutorials/DataWranglingInJulia/06-strings.qmd:693
Regex("a digit: \\d") == Regex(raw"a digit: \d") == r"a digit: \d"
true

In many cases, if you just need to look for a simple string, you can just turn that string directly into a Regex without worrying too much. However, there are a number of characters that have special meaning in Regex and are worth being aware of.

  • \: used to denote many special characters (like \d mentioned above) and also to “escape” other characters, which means to give them back their normal meaning. To match a literal \, you need \\ in a regex 🤯
  • .: Used to match “any” character.
  • Parentheses, square brackets, and curly brackets (()[]{}) all have special meanings in regular expressions. To match them literally, escape them. So to match Hello (world), you’d write r"Hello \(world\)"
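
The effect of these special characters can be checked quickly with occursin(). The following is a small sketch with test strings of our own:

```julia
dot_any = occursin(r"a.c", "abc")          # true: . matches the 'b'
dot_escaped = occursin(r"a\.c", "abc")     # false: \. only matches a literal '.'
dot_literal = occursin(r"a\.c", "a.c")     # true

parens = occursin(r"Hello \(world\)", "Hello (world)")     # true
unescaped = occursin(r"Hello (world)", "Hello (world)")    # false: ( ) form a capture group
backslash = occursin(r"\\", raw"C:\temp")                  # true: \\ matches one literal \
```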

5.3.2 Ok, back to match()

Suppose you have a sample ID system in which each ID consists of two uppercase letters that indicate the state the sample was collected in, followed by 4 digits that represent the sample number.

In other words, here’s a DataFrame with a column of valid IDs.

using DataFrames
samples = DataFrame(sample_id = ["CA0001", "MA0034", "TN1004", "GA0042"])
4×1 DataFrame
 Row │ sample_id
     │ String
─────┼───────────
   1 │ CA0001
   2 │ MA0034
   3 │ TN1004
   4 │ GA0042

Now, we’d like to add a second column with the state, and a 3rd column with the numerical ID.

One way to do this if you know everything is formatted the same way is to just use indices, eg str[1:2] will get you the first 2 characters and str[3:end] gets you the rest.

using DataFramesMeta
@rtransform samples begin
    :state = :sample_id[1:2]
    :id = parse(Int, :sample_id[3:end])
end
4×3 DataFrame
 Row │ sample_id  state   id
     │ String     String  Int64
─────┼──────────────────────────
   1 │ CA0001     CA          1
   2 │ MA0034     MA         34
   3 │ TN1004     TN       1004
   4 │ GA0042     GA         42
Tip

If you don’t remember how the parse() function works, see our tutorial on Julia syntax, or take a look at the live docs!

One problem with this approach is that it will happily take an incorrectly formatted ID and return gibberish.
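
As a quick illustration, consider a hypothetical malformed ID with only one letter. Index-based slicing raises no error and quietly produces a wrong state code:

```julia
bad_id = "C0001"    # hypothetical malformed ID: one letter instead of two

state = bad_id[1:2]               # "C0", which is not a state code
id = parse(Int, bad_id[3:end])    # 1, silently accepted
```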

Instead, we can use a regular expression and match(). The regular expression that I’ll use is r"([A-Z][A-Z])(\d\d\d\d)".

  • [A-Z] matches any capital letter.
  • \d matches any digit character.
  • The parentheses give us “capture groups,” so ([A-Z][A-Z]) matches 2 capital letters and (\d\d\d\d) matches 4 digits.
Tip

You can also use numbers in curly braces to repeat a pattern a set number of times. So we could instead use r"([A-Z]{2})(\d{4})" to match {2} of [A-Z] and {4} of \d.

There’s A LOT more that can be done with regular expressions. Alas, getting too deep into it would completely derail the tutorial, but if you have complex text-inputs, we highly encourage you to explore. Julia is an excellent language to work with regex.

To see how this works, let’s just start with the first ID.

exid = first(samples.sample_id)
"CA0001"
mch = match(r"([A-Z][A-Z])(\d\d\d\d)", exid)
RegexMatch("CA0001", 1="CA", 2="0001")

As you can see, the return value contains the match itself ("CA0001"), and the returned capture groups (1 - "CA" and 2 - "0001"). Note that in this case, the whole match is identical to the parent string, but match() will also pick out matches in longer strings.

match(r"([A-Z][A-Z])(\d\d\d\d)", "Hey, here's CA0001 in a sentence.")
RegexMatch("CA0001", 1="CA", 2="0001")

If there is no match, the return value is nothing.

match(r"([A-Z][A-Z])(\d\d\d\d)", "No match!") |> typeof
Nothing

With the RegexMatch value, we can pull out the match itself, and get a list of the capture groups.

mch.match
"CA0001"
mch.captures
2-element Vector{Union{Nothing, SubString{String}}}:
 "CA"
 "0001"

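Because match() returns nothing when the pattern is not found, accessing .captures directly would fail on a malformed ID. One way to guard against that is sketched below. The helper name parse_sample_id is hypothetical, and the ^ and $ anchors are our addition so that the entire string has to fit the pattern:

```julia
# Hypothetical helper (not part of the tutorial's code): returns a named tuple
# for a well-formed ID, or nothing if the ID does not fit the pattern.
function parse_sample_id(id::AbstractString)
    # ^ and $ anchor the pattern so the whole string must match
    m = match(r"^([A-Z][A-Z])(\d\d\d\d)$", id)
    m === nothing && return nothing
    return (state = String(m.captures[1]), id = parse(Int, m.captures[2]))
end

good = parse_sample_id("CA0001")    # (state = "CA", id = 1)
bad = parse_sample_id("C0001")      # nothing
```

Without the anchors, an ID embedded in a longer string would also be accepted, which may or may not be what you want.
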
So, returning to our original problem:

@rtransform samples @astable begin
    m = match(r"([A-Z][A-Z])(\d\d\d\d)", :sample_id)

    :state = m.captures[1]
    :id = parse(Int, m.captures[2])
end
4×3 DataFrame
 Row │ sample_id  state      id
     │ String     SubStrin…  Int64
─────┼─────────────────────────────
   1 │ CA0001     CA             1
   2 │ MA0034     MA            34
   3 │ TN1004     TN          1004
   4 │ GA0042     GA            42
Note

In this example, we use @astable to create the :state and :id columns within one expression, which allows us to reuse the intermediate variable m. See here for more information.

5.3.3 Dealing with multiple matches

By default, match() identifies the first substring in haystack that matches needle. If there are multiple matches in a String, and you want to get all of them, you can use eachmatch(), which creates an iterator that runs through each match.

eachmatch(r"(\w)a", "banana") |> collect
3-element Vector{RegexMatch}:
 RegexMatch("ba", 1="b")
 RegexMatch("na", 1="n")
 RegexMatch("na", 1="n")
Note

Recall that in a regular expression, \w matches any “word” character, eg letters, numbers or underscores. And parentheses create capture groups.

Note

Often, you would use eachmatch() in a for loop, and do something with the match in the body of the loop. Here, we use collect() to just put them in a vector for demo purposes.
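
A minimal sketch of that loop pattern, using a made-up sentence containing several sample IDs:

```julia
text = "Samples CA0001 and MA0034 arrived; TN1004 is delayed."

states = String[]
for m in eachmatch(r"([A-Z]{2})(\d{4})", text)
    # m is a RegexMatch, just like the return value of match()
    push!(states, m.captures[1])
end
states    # ["CA", "MA", "TN"]
```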

By default, eachmatch() only returns non-overlapping matches. Use the keyword argument overlap=true to return all matches, regardless of overlap.

eachmatch(r"(\w)na", "banana") |> collect
1-element Vector{RegexMatch}:
 RegexMatch("ana", 1="a")

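Here, the only match is “ana” at indices 2:4. The second possible match, “ana” at indices 4:6, is skipped because its first character (the second ‘a’ in “banana”) was already consumed by the first match.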

eachmatch(r"(\w)na", "banana"; overlap = true) |> collect
2-element Vector{RegexMatch}:
 RegexMatch("ana", 1="a")
 RegexMatch("ana", 1="a")
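
Since both overlapping matches print identically, it can help to look at where each one starts. A small sketch using the offset field of RegexMatch:

```julia
pattern = r"(\w)na"

non_overlapping = [m.offset for m in eachmatch(pattern, "banana")]                   # [2]
overlapping = [m.offset for m in eachmatch(pattern, "banana"; overlap = true)]       # [2, 4]
```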

6 😰 Conclusion

Phew! That was a LOT!

Dealing with strings in any data wrangling situation can be a slog, but Julia has many tools for manipulating and matching strings that can make it, if not easy, then at least manageable. You will almost never need all of these tools at once, but when you need one, you’ll be glad to know it’s there.