I write a lot of stream-oriented code, where you’re working with data in chunks and manipulating them as they pass through the system. Julia 0.6 introduced some new functionality for expressing array manipulations more efficiently, which are described in depth in Steven G. Johnson’s post here.
When implementing read and write methods for streams you often end up copying chunks of data between arrays, so I wanted to explore how some of Julia’s features work together to make that as efficient as possible. From reading the performance tips section of the manual we already know that pre-allocating your result arrays is one way to reduce allocations and speed things up (assuming you’ll be writing to the destination array many times), so we’ll use that as a starting point and assume we have a result array a that we’re going to fill by doing some operations on b and c arrays.
I’ll be using the following toy operation to test some variants and discuss what’s going on:
a = rand(2000)
b = rand(2000)
c = rand(2000)
using Compat # currently needed because of an issue with BenchmarkTools
using BenchmarkTools
function naive(a, b, c)
    a[1:1000] = b[1:1000] * 2 + c[1001:2000]
end
And we can benchmark the code to see how long it takes to run:
julia> @benchmark naive($a, $b, $c)
BenchmarkTools.Trial:
  memory estimate:  31.75 KiB
  allocs estimate:  4
  --------------
  minimum time:     3.339 μs (0.00% GC)
  median time:      3.760 μs (0.00% GC)
  mean time:        4.666 μs (16.71% GC)
  maximum time:     95.044 μs (87.97% GC)
  --------------
  samples:          10000
  evals/sample:     8
Now it’s a little hard to evaluate whether 3.8μs is a reasonable amount of time to spend, but almost 32KB seems like a lot of memory to allocate when all our arrays are already allocated. Consider that we’re moving around blocks of 1000 elements of type Float64, so each block should be about 8000B (7.8KiB). So this allocation represents 4 copies plus some extra.
For reference, we can check against the following python code using numpy:
In [57]: a = np.random.rand(2000)
In [58]: b = np.random.rand(2000)
In [59]: c = np.random.rand(2000)
In [60]: %timeit a[0:999] = b[0:999] * 2 + c[1000:1999]
The slowest run took 30.76 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.7 µs per loop
Trying my best to match methodologies1, our naïve Julia version ends up taking about 35% longer than the python equivalent with numpy.
Let’s discuss a little bit what happens when Julia executes this function. First we have the array slices b[1:1000] and c[1001:2000], which will each create a new copy of the 1000 sliced elements. We’re not modifying these arrays and throwing away these copies immediately after, so we’d prefer to create a view into the same data. We could do this with view(b, 1:1000), but we can also use the @views macro to turn both our indexing operations into views:
function views(a, b, c)
    @views a[1:1000] = b[1:1000] * 2 + c[1001:2000]
end
This definitely improves our run time:
julia> @benchmark views($a, $b, $c)
BenchmarkTools.Trial:
  memory estimate:  15.97 KiB
  allocs estimate:  4
  --------------
  minimum time:     1.615 μs (0.00% GC)
  median time:      2.332 μs (0.00% GC)
  mean time:        2.757 μs (12.35% GC)
  maximum time:     71.391 μs (90.06% GC)
  --------------
  samples:          10000
  evals/sample:     10
We see we’ve cut our allocations in half and runtime by about 40%. The allocation reduction makes sense as we’ve avoided 2 copies by using views instead. Numpy uses views by default, which probably accounted for the initial performance edge of the numpy version over the naïve Julia version.
Let’s try another tactic and use dot broadcasting and loop fusion, as described in SGJ’s post. To summarize the issue (though I recommend reading the post for more details), elementwise operations between arrays often create temporary copies that we’d like to avoid. For instance, y = 3x.^2 + 5x + 2 will create temporary arrays for 3x, and 5x, then another one for 3x.^2, and so on, until it finally assigns the identifier y to the final result. Even if you pre-allocate y and assign the elements with y[:] = ... it will still allocate temporaries and then at the end copy the final result into y.
The solution is to use dot syntax on the operators y.= 3.*x.^2.+5.*x.+2, which is transformed (more or less) into:
for i in eachindex(x)
    y[i] = 3x[i]^2+5x[i]+2
end
Which does all the processing in a single pass over the data without allocating any extra temporaries, and is super fast. Now that’s a lot of dots and looks pretty ugly, so you can use the @. macro to add the dots to all the operations in the expression.
Now let’s try this technique with our array slicing. We’ll take out the views so we can separate out the two effects:
function dotted(a, b, c)
    @. a[1:1000] = b[1:1000] * 2 + c[1001:2000]
end
Benchmarking this version shows that it avoids 2 copies for temporary values, and also gives a tidy speedup.
julia> @benchmark dotted($a, $b, $c)
BenchmarkTools.Trial:
  memory estimate:  15.92 KiB
  allocs estimate:  3
  --------------
  minimum time:     1.490 μs (0.00% GC)
  median time:      1.712 μs (0.00% GC)
  mean time:        2.085 μs (15.33% GC)
  maximum time:     57.836 μs (89.50% GC)
  --------------
  samples:          10000
  evals/sample:     10
Now the obvious question is whether we can combine these approaches, and of course we can! The following is exactly the same as our original function with both the macros prepended:
function dottedviews(a, b, c)
    @. @views a[1:1000] = b[1:1000] * 2 + c[1001:2000]
end
So now we’re creating the two views rather than copies, and using loop fusion to avoid the temporary copies. How does this affect our benchmark?
julia> @benchmark dottedviews($a, $b, $c)
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  2
  --------------
  minimum time:     284.270 ns (0.00% GC)
  median time:      304.295 ns (0.00% GC)
  mean time:        310.547 ns (0.88% GC)
  maximum time:     3.001 μs (84.90% GC)
  --------------
  samples:          10000
  evals/sample:     278
There’s still some allocation, probably because views still allocate a small object that refers to the original array (there are improvements around the corner that would allow us to stack-allocate these as well), and this version now takes less than 10% as long as our original.
Conclusion
When trying to improve the performance of your code, checking for excessive allocation can be a very fruitful approach. Here we improved our performance by 10X by adding two macros to the front of our expression.
Notes
- 
For the python benchmark, %timeitturns off garbage collection and gives the best mean from 3 runs, so the equivalent julia timing comes out to 3.64µs (calculated by running the benchmark 3 times, subtracting the GC time from the mean value, and picking the best result). There’s also the caching warning printed by%timeit, so the actual python runtime might be somewhat slower. ↩