Getting started with Julia

Authors

Jose Storopoli

Kevin Bonham

Juan Oneto

1 Why Julia?

So, here you are, a quantitative scientist in pharma, perhaps proficient in R, or at least able to google your way through writing R code, and you are being asked to learn a whole new programming language.

You might be asking, “Why?”

1.1 🏎️ Julia is fast (to run)!

For many applications, speed isn’t all that critical. Getting a 10x speedup on a 100ms operation that you do a couple of times a day isn’t a big deal: you are unlikely to notice a 90ms saving.

But as the size of your datasets and the complexity of your models increase, these things start to matter. For example, imagine you have a large dataset, and you want to calculate some statistics on different groups.

Don’t worry too much about the syntax in the following code block - it just creates an artificial dataset with 10_000 rows, then groups them into 4 categories (‘A’, ‘B’, ‘C’, and ‘D’), and calculates the mean on one column and the median of another.

(Example adapted from here)

How does this stack up to other programming languages? Here are the results I get on my laptop (times in ms):

Julia    R       Python
0.411    2.74    1.114

Julia:

using Statistics
using DataFrames
using Chain
using BenchmarkTools

df = DataFrame(x = rand(["A", "B", "C", "D"], 10_000), y = rand(10_000), z = randn(10_000))

benchmark_results = @benchmark @chain $df begin
    groupby(:x)
    combine(:y => median, :z => mean)
end

R:

library(dplyr)

df <- tibble(
    x = sample(c("A", "B", "C", "D"), 10000, replace = TRUE),
    y = runif(10000),
    z = rnorm(10000)
)


print(bench::mark(
    df %>%
        group_by(x) %>%
        summarize(
            median(y),
            mean(z)
        )
))
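
Python:

#!/usr/bin/env python3

import timeit
import statistics
import pandas as pd
import numpy as np

# Build the dataset once, outside the timed function
# (y uniform, z normal, matching the Julia and R versions)
df = pd.DataFrame({'x': np.random.choice(['A', 'B', 'C', 'D'], 10000, replace=True),
                   'y': np.random.rand(10000),
                   'z': np.random.randn(10000)})

def bench():
    df.groupby('x').agg({'y': 'median', 'z': 'mean'})

# timeit reports in seconds; print the median of 1000 single runs in ms
bmks = [timeit.Timer(bench).timeit(number=1) for _ in range(1000)]
print(statistics.median(bmks) * 1000)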

1.2 🏁 Julia is fast (to develop)

To be fair, many common operations may be as fast or faster using libraries in Python or R. But this is not because Python and R are fast.

In fact, the group-by functions in pandas and dplyr used above are not written in Python and R; they’re written in C or C++ and exposed through bindings in Python and R, respectively. This is true of almost all performance-critical operations in languages like Python and R.

But this is not the case in Julia. Performance-critical Julia packages are typically written in Julia itself, so one can prototype and optimize in the same language, dramatically increasing the speed of development.
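
As a rough illustration of staying in one language, here is a minimal sketch of a grouped mean written as a plain Julia loop. The function name grouped_mean and the toy data are hypothetical, and this is not how DataFrames.jl implements groupby; it only shows that such a routine needs no compiled extension in another language.

```julia
# Hypothetical helper for illustration: average `vals` within each group label.
# Everything here is ordinary Julia; no C or C++ extension is involved.
function grouped_mean(groups, vals)
    sums = Dict{eltype(groups), Float64}()
    counts = Dict{eltype(groups), Int}()
    for (g, v) in zip(groups, vals)
        sums[g] = get(sums, g, 0.0) + v      # running total per group
        counts[g] = get(counts, g, 0) + 1    # number of items per group
    end
    return Dict(g => sums[g] / counts[g] for g in keys(sums))
end

# Toy data, made up for this example.
groups = ["A", "B", "A", "B"]
vals = [1.0, 2.0, 3.0, 6.0]
result = grouped_mean(groups, vals)
```

If this loop turned out to be a bottleneck, it could be profiled and tuned in place, without rewriting it in a second language.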

While this might not be immediately relevant to end-users, it means that cutting-edge software and algorithms can be developed more quickly in Julia, which is one of the reasons PumasAI is developed in Julia.

1.3 🔁 Julia is reproducible

When writing and running code for science, it’s important to keep track of the versions of all of the packages you use so that your code is reproducible, both for your future self and for others that need to repeat or evaluate your work.

If you’ve ever attempted to manage R project dependencies using packrat or renv, you know that this can be a challenge. Further, not all of your code dependencies can be handled by R directly, as things like tools for compiling C code (which underlies many R packages) are often handed off to the operating system. This also means that users of different operating systems may have different dependency chains, in a way that is invisible to most users of your code.

In Julia, project management is built in. Every Julia project is associated with plain-text files (Project.toml and Manifest.toml) that define the project environment, including all direct and indirect dependencies. The Julia language also maintains a large library of binary dependencies that are cross-compiled for multiple operating systems and can be managed by the same package management system, ensuring that code results are reproducible (nearly) everywhere.
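
A minimal sketch of what this looks like in practice, using the built-in Pkg standard library. The temporary folder is an assumption made for illustration so that no real project is touched, and the steps that need network access are left as comments.

```julia
using Pkg  # Julia's built-in package manager (standard library)

# Hypothetical project folder for illustration: a fresh temporary directory.
project_dir = mktempdir()

# Make that folder the active project environment.
Pkg.activate(project_dir)

# Path of the plain-text file that lists this project's direct dependencies.
project_file = Base.active_project()

# With network access, the next call would record a dependency in Project.toml
# and pin the exact versions of everything it needs in Manifest.toml:
#   Pkg.add("DataFrames")
# A collaborator with both files could then recreate the environment with:
#   Pkg.instantiate()
```

Sharing the Project.toml and Manifest.toml files alongside the analysis code is what lets someone else rebuild the same set of package versions.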

1.4 🎨 Julia is expressive

The first advice often given to R users who want their code to be fast is to “vectorize” everything. That is, express the computation as operations on whole arrays or table columns rather than looping over individual items.

That’s great when your problem is easily solved in vectorized form, but sometimes you really need a loop, and sometimes other patterns are more sensible.

In Julia, you can write code in the way that makes sense for the problem, and it’s almost always fast (or can be made fast), meaning you are not limited to a particular coding pattern to solve your problem.
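
To make the point concrete, here is a small sketch of the same calculation written both ways. The function names and data are hypothetical, chosen only for this example; both forms are ordinary Julia, and neither is the required one.

```julia
# Loop style: accumulate the sum of squared deviations from a reference value.
function sum_sq_loop(xs, ref)
    total = 0.0
    for x in xs
        total += (x - ref)^2
    end
    return total
end

# "Vectorized" style: the same calculation using broadcasting (the dot syntax).
sum_sq_broadcast(xs, ref) = sum((xs .- ref) .^ 2)

# Toy data, made up for this example.
xs = [1.0, 2.0, 3.0, 4.0]
loop_result = sum_sq_loop(xs, 2.5)
broadcast_result = sum_sq_broadcast(xs, 2.5)
```

Both calls compute the same quantity, so the choice between them can be made on readability.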

Julia is also a modern language, which (among other things) means that it has native support for things like Unicode.

One benefit is that variable names can use symbols such as σ (written with \sigma<tab>): in Julia, we can write code that looks like the underlying math that it represents.

For example, \frac{1}{\sqrt{x}} can be written in Julia as 1/√x.
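
A short sketch of how this reads in code. The variable names and values are made up for illustration.

```julia
# Unicode operators and identifiers are ordinary Julia syntax.
x = 4.0
inv_sqrt = 1 / √x      # √ is typed as \sqrt<TAB>; equivalent to 1 / sqrt(x)

σ₁ = 2.0               # typed as \sigma<TAB> followed by \_1<TAB>
variance = σ₁^2
```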

You can even use emojis right in your code!

function 🐺(🐰)
    "There are now $(🐰 - 1) bunnies"
end
🐺 (generic function with 1 method)
🐇 = 10
🐺(🐇)
"There are now 9 bunnies"

Here’s another example:

Tip

Julia has robust support for Unicode. For example, to type the variable σ₁, write \sigma, then press <TAB>, then write \_1 and press <TAB> again.

You can even use emoji in code 🎉 (that’s \:tada: followed by <TAB>).

using Distributions
using CairoMakie

μ = 0
σ = 1
𝔑 = Normal(μ, σ)
plot(𝔑)

2 Wrapping up

You may still be thinking, “Why do I have to learn a new language?” That’s understandable, but stick with it! We came from other languages too (R for Jose, Python for Kevin and Juan), and found it challenging at first. But we think the benefits outweigh the cost of getting on the new learning curve, and we will do our best to make the process less painful (and even fun!).

For a deeper dive into notable differences between R and Julia, see here.