A lot of data, especially metadata about clinical samples or subjects, is stored as arbitrary text rather than numbers. In programming languages like Julia, the data type used to encode text is called a String.
Julia has numerous facilities for working with Strings, some of which have already been introduced in previous tutorials. Here, we will repeat some of those tools and introduce many more, with the hope that this tutorial will serve as a one-stop shop for all of your text processing needs.
In R, string values are called “character” instead of “string.”
R> as.character(1001)
[1] "1001"
R> typeof(as.character(1001))
[1] "character"
Julia also has a “character” (Char) type, but this is used to represent individual characters, rather than whole strings. A String can be thought of as combining 0 or more Chars, and if you index a single value out of a String, you get a Char. Notice also that single quotes (') can only be used to make a Char, and double quotes (") are required for Strings.
julia> s = "A String"
"A String"
julia> 'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
julia> 'A String'
ERROR: syntax: character literal contains multiple characters
Stacktrace:
[1] top-level scope
@ none:1
julia> s[3]
'S': ASCII/Unicode U+0053 (category Lu: Letter, uppercase)
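The relationship between String and Char described above can be checked directly. This is a minimal sketch; the variable names are chosen here only for illustration.

```julia
chars = collect("abc")     # a Vector{Char}: ['a', 'b', 'c']
rebuilt = String(chars)    # combining the Chars again gives the String "abc"

# A one-character String and a Char are different values, even if they look alike
same = ("a" == 'a')

string_type = typeof("a")  # String
char_type = typeof('a')    # Char
```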
Strings as containers
In many ways, Strings are treated as “scalar” values - that is, they are atomic data points. But in other ways, they are like vectors of characters.
For example, we can index into strings.
ex_string = "abcdefg"
"abcdefg"
ex_string[1]
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
ex_string[end]
'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
ex_string[3:5]
"cde"
In many cases, there is one character per index, but not always! Julia indexes Strings by byte position (UTF-8 “code units”), rather than by character. This doesn’t matter for ASCII text, but characters outside of ASCII can span more than one index. If you want more details, see the end of the tutorial.
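To make the difference between indices and characters concrete, here is a small sketch using a string of Greek letters (an example chosen here, not taken from the tutorial). Each of these letters occupies two bytes in UTF-8, so only every other index is the start of a character.

```julia
greek = "αβγ"                       # illustrative string; each letter is 2 bytes in UTF-8

n_chars = length(greek)             # number of characters
n_bytes = ncodeunits(greek)         # number of bytes (code units)
valid = collect(eachindex(greek))   # the indices at which a character starts

second_idx = nextind(greek, 1)      # index of the character that follows index 1
second_char = greek[second_idx]

# greek[2] would throw a StringIndexError, because index 2 falls inside 'α'
index_2_ok = isvalid(greek, 2)
```

When looping over a string, iterating with `for c in greek` or over `eachindex(greek)` avoids having to think about byte positions at all.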
Strings
In Julia, Strings are constructed using double quotes: ".
"This is a String"
"This is a String"
"a" # Also a string
"a"
"" # an empty string
""
A multi-line string can be easily constructed using 3 double quotes ("""). For convenience, when using this syntax, the indentation shared by every line is ignored, as is a new line immediately after the opening quotes.
That is, the following are equivalent:
ozy1 = """I met a traveller from an antique land,
    Who said—“Two vast and trunkless legs of stone
    Stand in the desert. . . .
    """
"I met a traveller from an antique land,\nWho said—“Two vast and trunkless legs of stone\nStand in the desert. . . .\n"
ozy2 = """
    I met a traveller from an antique land,
    Who said—“Two vast and trunkless legs of stone
    Stand in the desert. . . .
    """
"I met a traveller from an antique land,\nWho said—“Two vast and trunkless legs of stone\nStand in the desert. . . .\n"
ozy3 = """
I met a traveller from an antique land,
Who said—“Two vast and trunkless legs of stone
Stand in the desert. . . .
"""
"I met a traveller from an antique land,\nWho said—“Two vast and trunkless legs of stone\nStand in the desert. . . .\n"
ozy1 == ozy2 == ozy3
true
true
In the output, the \n character is a Unix “newline” character. If this were written to a file, each \n would appear as a line break in most text editors.
These lines are taken from the poem Ozymandias, by Percy Shelley.
s1 = "string 1"
s2 = "string 2"
s3 = "string 3"
"string 3"
Strings can be “concatenated” using the * operator.
s1 * s2 * s3
"string 1string 2string 3"
Alternatively, one can use the string() function, which takes any number of arguments and combines them into a single String.
string(s1, s2, s3) == s1 * s2 * s3
true
One advantage to using string() is that one can provide non-string arguments, and they will be automatically converted to strings.
n = 42
42
string(s1, " and ", n)
"string 1 and 42"
Whereas if you try this using concatenation, you’ll get an error:
s1 * " and " * n
MethodError: MethodError(*, ("string 1 and ", 42), 0x00000000000082cd)
MethodError: no method matching *(::String, ::Int64)
Closest candidates are:
*(::Any, ::Any, !Matched::Any, !Matched::Any...)
@ Base operators.jl:578
*(!Matched::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8}
@ Base int.jl:88
*(::Union{AbstractChar, AbstractString}, !Matched::Union{AbstractChar, AbstractString}...)
@ Base strings/basic.jl:260
...
Stacktrace:
[1] *(::String, ::String, ::Int64)
@ Base ./operators.jl:578
[2] top-level scope
@ ~/_work/PumasTutorials.jl/PumasTutorials.jl/tutorials/DataWranglingInJulia/06-strings.qmd:259
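If you do want to stay with the * operator, one possible workaround is to convert the number yourself before concatenating. A minimal sketch, reusing the values from the example above:

```julia
s1 = "string 1"
n = 42

# Convert the Int to a String explicitly, then concatenate as usual
fixed = s1 * " and " * string(n)
```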
join()
Another way to join multiple strings is to use join(). join() takes a collection (e.g. a vector) of strings as the first argument, an optional delimiter (that will be added between each item), and an optional last delimiter (that will be added between the last and penultimate item instead of the primary delimiter).
ss = [s1, s2, s3, "string 4"]
4-element Vector{String}:
"string 1"
"string 2"
"string 3"
"string 4"
join(ss, "; ")
"string 1; string 2; string 3; string 4"
join(ss, ", ", ", and ")
"string 1, string 2, string 3, and string 4"
join
As with string(), elements of a collection passed to join() will be converted to Strings if they aren’t already.
join([1, 2, 3], " and ")
"1 and 2 and 3"
It is often quite useful to insert the contents of variables or expressions into a string. This is called “interpolating” the value into the string.
In Julia, this is done using $. If the expression is just a variable name (made up of letters, numbers, or underscores), you can just use $ followed by the name. Otherwise, wrap it in parentheses, e.g. $(expr).
x = 42
42
"There's this number, $x. Half of it is $(x ÷ 2)"
"There's this number, 42. Half of it is 21"
Occasionally, you may want to put a floating point number into a string, but it has way too many decimal places. A convenient way to deal with this is using round(). round() can be used outside the context of Strings as well, but it’s particularly useful in this case.
julia> "🥧 is ≈ $(22 / 7)"
"🥧 is ≈ 3.142857142857143"
julia> "🥧 is ≈ $(round(22 / 7; digits = 2))"
"🥧 is ≈ 3.14"
julia> round(123.456; sigdigits = 2)
120.0
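One thing round() does not do is keep trailing zeros, since the result is still a number. If a fixed number of decimal places matters for display, one alternative (not covered in this tutorial, so treat it as a side note) is the @sprintf macro from the Printf standard library:

```julia
using Printf  # part of Julia's standard library

rounded = round(3.1; digits = 2)           # still the Float64 3.1
in_string = "value: $(rounded)"            # the trailing zero is not shown

# @sprintf formats to exactly two decimal places
formatted = @sprintf("value: %.2f", 3.1)
```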
The “length” of a String indicates how many characters are present:
length("abc")
3
lpad() and rpad()
If you need strings to be longer, you can “pad” them to the right or left with rpad() and lpad() respectively. These functions take the original string, a length, and the Char or String to pad with.
lpad("thing", 10, "xyz")
"xyzxything"
rpad("thing", 10, "xyz")
"thingxyzxy"
This can be useful when building e.g. identification numbers with a fixed width, since you can also provide non-string values, such as numbers, as the first argument:
"X" .* lpad.(1:10:102, 3, '0')
11-element Vector{String}:
"X001"
"X011"
"X021"
"X031"
"X041"
"X051"
"X061"
"X071"
"X081"
"X091"
"X101"
strip()
The strip() family of functions (including rstrip() and lstrip()) removes leading and/or trailing characters (whitespace by default) from a String.
spacey = " far out! 👾 "
" far out! 👾 "
strip(spacey)
"far out! 👾"
lstrip(spacey)
"far out! 👾 "
rstrip(spacey)
" far out! 👾"
Each of these functions can also take a character or vector of characters to strip, or a boolean function, such that any character for which it returns true will be removed.
strip(isuppercase, "TOP of the morning")
" of the morning"
rstrip("What! A! Crazy! DAY!!?!?!", ['!', '?'])
"What! A! Crazy! DAY"
Julia has convenient functions for modifying the case of Strings.
uppercase()
lowercase()
titlecase()
my_str = "Who's on first? That's what I SAID!"
"Who's on first? That's what I SAID!"
lowercase(my_str)
"who's on first? that's what i said!"
uppercase(my_str)
"WHO'S ON FIRST? THAT'S WHAT I SAID!"
titlecase(my_str)
"Who'S On First? That'S What I Said!"
titlecase(my_str, strict = false)
"Who'S On First? That'S What I SAID!"
titlecase(my_str, wordsep = isspace) # this one is subtle, look at the 's' after '
"Who's On First? That's What I Said!"
Strings with split()
The split() function is the inverse of join(), dividing a String into pieces at particular characters or strings (whitespace by default).
split("what a great day!")
4-element Vector{SubString{String}}:
"what"
"a"
"great"
"day!"
split("what a great day!", "t ")
3-element Vector{SubString{String}}:
"wha"
"a grea"
"day!"
By default, split() will break on each delimiter, even if there’s nothing in between them. You may use the keepempty keyword argument to override this behavior.
split("who,,,,wrote,,,,this?", ",")
9-element Vector{SubString{String}}:
"who"
""
""
""
"wrote"
""
""
""
"this?"
split("who,,,,wrote,,,,this?", ","; keepempty = false)
3-element Vector{SubString{String}}:
"who"
"wrote"
"this?"
You can also pass an array of delimiters, or regular expressions (see below) to break up the string on multiple patterns.
split(
    "Hmm, this might be... complicated!",
    [' ', ',', '.']; # a space, a comma, and a period
    keepempty = false,
)
5-element Vector{SubString{String}}:
"Hmm"
"this"
"might"
"be"
"complicated!"
replace()
If we want to replace part of a String with something else, we use the replace() function, which takes a String as the first argument, and a Pair (written old => new) that shows what to replace.
replace("Goodbye, cruel world!", "Goodbye, cruel" => "Hello,")
"Hello, world!"
There are a number of ways to look into the contents of strings. In the documentation for these functions, they use the phrase “needle” for the thing that is being searched for and “haystack” for the thing being searched.
Just in case you’re not familiar with this English idiom, see here.
contains()
The most basic way to match is the contains(haystack, needle) function, which is a Boolean function that takes an AbstractString as the first argument (the haystack), and something to look for as the second argument (the needle) - it may be another String, a Char, or a Regex (which we’ll get to later). contains() returns true if there are any matches, false if not.
contains("banana", "ana")
true
contains("banana", 'z')
false
There is also the occursin() function, which has essentially the reverse signature - occursin(needle, haystack).
julia> occursin("ana", "banana")
true
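Both functions are convenient for checking many strings at once. As a sketch (the vector of fruit names is made up), contains() with a single argument returns a function that can be passed to filter(), and occursin() can be broadcast over a vector with a dot:

```julia
fruits = ["banana", "cherry", "mango"]   # illustrative data

# contains("an") is a function that checks whether its argument contains "an"
with_an = filter(contains("an"), fruits)

# Broadcasting gives one true/false value per element
flags = occursin.("an", fruits)
```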
find*()
The find* family of functions may be used to identify the indexes of matches within a String. These are related to similar functions used on arrays and other containers, but the two that are used most often with Strings are:
findfirst(): Identifies the first index / index range that matches
findall(): Identifies all indices or index ranges that match
The return value can be used to index into the original string and return the match.
findfirst("an", "banana")
2:3
findfirst('a', "banana")
2
findall("an", "banana")
2-element Vector{UnitRange{Int64}}:
2:3
4:5
findall("a", "banana")
3-element Vector{UnitRange{Int64}}:
2:2
4:4
6:6
Did you notice that searching with a String, even if there’s only 1 character in the String, returns a 1-element range rather than an integer index? This is so that using the return value will produce exactly what was searched for. Using a single index would return a Char instead of a String:
julia> "banana"[2]
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
julia> "banana"[2:2]
"a"
There is no findall(::Char, ::String) in Julia v1.6 (though it was added in v1.7). If you’re using findall() in Julia v1.6 and need to get a single character, you can use only(str) to get the Char from the string:
julia> only("a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
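As noted earlier, the values returned by the find* functions can be used to index back into the original string. A short sketch, which also shows what happens when nothing is found:

```julia
haystack = "banana"

r = findfirst("an", haystack)    # the range 2:3
found = haystack[r]              # indexing with the range returns the match

# Index with every range returned by findall()
all_found = [haystack[rng] for rng in findall("an", haystack)]

# When there is no match, findfirst() returns `nothing`
no_match = findfirst("xyz", haystack)
```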
match()
The match() function is more complex, and has much more functionality. The signature is match(needle, haystack), but the needle needs to be a Regex, which is Julia’s representation of a “regular expression.”
Regular expressions are an enormous topic, and a comprehensive treatment is beyond the scope of this tutorial. But a brief primer with some basic information may be useful.
The easiest way to make a regular expression is to use the “string literal” syntax, which is just putting an r before the first quote:
r"a regular expression"
r"a regular expression"
This is special Julia syntax for calling a macro @r_str. There are a number of other string literals in Julia, all of which are called @something_str. One that we’ll see in just a moment is raw"", short for @raw_str.
You may also construct a regular expression using a String and the Regex() constructor.
Regex("a different regular expression")
r"a different regular expression"
Regular expressions often use special characters (like \d to represent a digit) which can be used directly in the string literal:
r"a digit: \d"
r"a digit: \d"
But if you need to use the Regex() constructor, you’ll need to “escape” the first \ with another \, or use the raw"" string literal. If you don’t…
Regex("a digit: \d")
ErrorException: ErrorException("syntax: invalid escape sequence")
syntax: invalid escape sequence
Stacktrace:
[1] top-level scope
@ ~/_work/PumasTutorials.jl/PumasTutorials.jl/tutorials/DataWranglingInJulia/06-strings.qmd:693
Regex("a digit: \\d") == Regex(raw"a digit: \d") == r"a digit: \d"
true
In many cases, if you just need to look for a simple string, you can just turn that string directly into a Regex without worrying too much. However, there are a number of characters that have special meaning in a Regex and are worth being aware of.
\: Used to denote many special characters (like \d mentioned above) and also to “escape” other characters, which means to give them back their normal meaning. To match a literal \, you need \\ in a regex 🤯.
.: Used to match “any” character.
Brackets (()[]{}) all have special meanings in regular expressions. To match them literally, escape them. So to match Hello (world), you’d write r"Hello \(world\)"
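The effect of escaping can be checked with occursin(), which accepts a Regex as the needle. This is a small sketch with strings chosen here for illustration:

```julia
dot_any = occursin(r"a.c", "abc")        # `.` matches any character, including 'b'
dot_literal = occursin(r"a\.c", "abc")   # `\.` only matches an actual period
dot_period = occursin(r"a\.c", "a.c")

# Escaped parentheses match literal parentheses
parens = occursin(r"Hello \(world\)", "Hello (world)")

# `\\` in a regex matches one literal backslash; the haystack is C:\temp
backslash = occursin(r"\\", "C:\\temp")
```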
match()
Suppose you have a sample ID system that contains two uppercase letters that indicate the state it was collected in, and 4 numbers that represent the sample number.
In other words, here’s a DataFrame with a column of valid IDs.
using DataFrames
samples = DataFrame(sample_id = ["CA0001", "MA0034", "TN1004", "GA0042"])
Row | sample_id |
---|---|
 | String |
1 | CA0001 |
2 | MA0034 |
3 | TN1004 |
4 | GA0042 |
Now, we’d like to add a second column with the state, and a 3rd column with the numerical ID.
One way to do this if you know everything is formatted the same way is to just use indices, e.g. str[1:2] will get you the first 2 characters and str[3:end] gets you the rest.
using DataFramesMeta
@rtransform samples begin
:state = :sample_id[1:2]
:id = parse(Int, :sample_id[3:end])
end
Row | sample_id | state | id |
---|---|---|---|
 | String | String | Int64 |
1 | CA0001 | CA | 1 |
2 | MA0034 | MA | 34 |
3 | TN1004 | TN | 1004 |
4 | GA0042 | GA | 42 |
If you don’t remember how the parse() function works, see our tutorial on Julia syntax, or take a look at the live docs!
One problem with this approach is that it will happily take an incorrectly formatted ID and return gibberish.
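To see the problem concretely, here is what index-based parsing does with a malformed ID. The ID below is made up: it has one letter and five digits instead of two letters and four digits.

```julia
bad_id = "X12345"   # hypothetical malformed ID

# No error is raised, but neither value means what we want it to mean
state = bad_id[1:2]
id = parse(Int, bad_id[3:end])
```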
Instead, we can use a regular expression and match(). The regular expression that I’ll use is r"([A-Z][A-Z])(\d\d\d\d)".
[A-Z] matches any capital letter.
\d matches any digit character.
([A-Z][A-Z]) matches 2 capital letters and \d\d\d\d matches 4 digits.
You can also use numbers in curly braces to match a set number of repetitions of a particular pattern. So we could instead use r"([A-Z]{2})(\d{4})" to match {2} of [A-Z] and {4} of \d.
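As a quick check that the two ways of writing the pattern behave the same, the sketch below tests both against a few IDs. The last two IDs are made-up invalid examples.

```julia
long_form = r"([A-Z][A-Z])(\d\d\d\d)"
short_form = r"([A-Z]{2})(\d{4})"

# Two valid IDs, one with lowercase letters, one with too few digits
ids = ["CA0001", "MA0034", "ca0001", "TN12"]

long_hits = [occursin(long_form, s) for s in ids]
short_hits = [occursin(short_form, s) for s in ids]
```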
There’s A LOT more that can be done with regular expressions. Alas, getting too deep into it would completely derail the tutorial, but if you have complex text-inputs, we highly encourage you to explore. Julia is an excellent language to work with regex.
To see how this works, let’s just start with the first ID.
exid = first(samples.sample_id)
"CA0001"
mch = match(r"([A-Z][A-Z])(\d\d\d\d)", exid)
RegexMatch("CA0001", 1="CA", 2="0001")
As you can see, the return value contains the match itself ("CA0001"), and the returned capture groups (1 - "CA" and 2 - "0001"). Note that in this case, the whole match is identical to the parent string, but match() will also pick out matches in longer strings.
match(r"([A-Z][A-Z])(\d\d\d\d)", "Hey, here's CA0001 in a sentence.")
RegexMatch("CA0001", 1="CA", 2="0001")
If there is no match, the return value is nothing.
match(r"([A-Z][A-Z])(\d\d\d\d)", "No match!") |> typeof
Nothing
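Because match() can return nothing, code that uses it usually needs to check for that case before touching the captures. The helper below is a hypothetical sketch, not part of the tutorial's code. It also adds the anchors ^ and $ (an assumption made here, and not used elsewhere in this tutorial) so that the whole string has to fit the pattern.

```julia
# Hypothetical helper: returns a NamedTuple for a valid ID, `nothing` otherwise
function parse_sample_id(id)
    # ^ and $ anchor the pattern to the start and end of the string
    m = match(r"^([A-Z]{2})(\d{4})$", id)
    m === nothing && return nothing
    return (state = String(m.captures[1]), id = parse(Int, m.captures[2]))
end

good = parse_sample_id("CA0001")
bad = parse_sample_id("X12345")   # malformed, so no match
```

Returning nothing for malformed input makes the failure explicit, instead of silently producing a wrong state or number.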
With the RegexMatch value, we can pull out the match itself, and get a list of the capture groups.
mch.match
"CA0001"
mch.captures
2-element Vector{Union{Nothing, SubString{String}}}:
"CA"
"0001"
So, returning to our original problem:
@rtransform samples @astable begin
    m = match(r"([A-Z][A-Z])(\d\d\d\d)", :sample_id)

    :state = m.captures[1]
    :id = parse(Int, m.captures[2])
end
Row | sample_id | state | id |
---|---|---|---|
 | String | SubStrin… | Int64 |
1 | CA0001 | CA | 1 |
2 | MA0034 | MA | 34 |
3 | TN1004 | TN | 1004 |
4 | GA0042 | GA | 42 |
In this example, we use @astable to create the :state and :id columns within one expression, which allows us to reuse the intermediate variable m. See here for more information.
By default, match() identifies the first substring in haystack that matches needle. If there are multiple matches in a String, and you want to get all of them, you can use eachmatch(), which creates an iterator that runs through each match.
eachmatch(r"(\w)a", "banana") |> collect
3-element Vector{RegexMatch}:
RegexMatch("ba", 1="b")
RegexMatch("na", 1="n")
RegexMatch("na", 1="n")
Recall that in a regular expression, \w matches any “word” character, e.g. letters, numbers or underscores. And parentheses create capture groups.
Often, you would use eachmatch() in a for loop, and do something with the match in the body of the loop. Here, we use collect() to just put them in a vector for demo purposes.
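The loop form mentioned above might look like the following sketch, which records each captured letter along with the position where its match starts. The vector name is illustrative.

```julia
found = String[]

for m in eachmatch(r"(\w)a", "banana")
    # m.captures[1] is the first capture group, m.offset is where the match begins
    push!(found, "$(m.captures[1]) at index $(m.offset)")
end
```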
By default, eachmatch() only returns non-overlapping matches. Use the keyword argument overlap=true to return all matches, regardless of overlap.
eachmatch(r"(\w)na", "banana") |> collect
1-element Vector{RegexMatch}:
RegexMatch("ana", 1="a")
Here, the only match is the “ana” at indices 2:4. The second “ana” (indices 4:6) is skipped, since its leading “a” was already consumed by the first match.
eachmatch(r"(\w)na", "banana"; overlap = true) |> collect
2-element Vector{RegexMatch}:
RegexMatch("ana", 1="a")
RegexMatch("ana", 1="a")
Phew! That was a LOT!
Dealing with strings in any data wrangling situation can be a slog, but Julia has many tools for manipulating and matching strings that can make it, if not easy, then at least manageable. You will almost never need all of these tools at once, but when you need one, you’ll be glad to know it’s there.