13 How to write high-performance Julia

High-performance code

Avoid global variables

The value and type of a global variable can change at any time, which makes it difficult for the compiler to optimize code that uses it. Prefer local variables, or pass values into functions as parameters.

Any code whose performance matters, or whose performance you want to measure, should be placed inside a function.

Declaring global variables as constants can greatly improve performance.

const VAL = 0

If you must use a global variable, you can annotate its type at the point of use to help the compiler.

global x = rand(10000)

function lp1()
    s = 0.0
    for i in x
        s += i
    end
end

Use @time to estimate code runtime

Because of JIT compilation, the first call to a function includes compilation time and serves as a warm-up; the result of the second run reflects the real running time.
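
A minimal sketch of this warm-up effect (no timings shown, since the exact numbers depend on the machine):

@time lp1()   # first call: includes compilation time
@time lp1()   # second call: closer to the true running time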

The @time results below all exclude JIT compilation time.

From the @time output we can see not only the running time but also how much memory the code allocates; the allocation figure alone gives a rough indication of whether something is wrong with the program.

For example, the function lp1() allocates 733 KB of memory just to add up 64-bit floating-point numbers, which indicates that there are allocations that could be avoided.

Below, we instead annotate the type of the global where it is used (lp2), or pass the data in as a function parameter (lp3):

function lp2()
    s = 0.0
    for i in x::Vector{Float64}
        s += i
    end
end

function lp3(x)
    s = 0.0
    for i in x
        s += i
    end
end

It can be seen that the running time drops dramatically, but 160 bytes are still allocated; this is caused by running @time in the global scope.

If we put the test code inside a function, this overhead disappears.

For more rigorous performance testing, use the BenchmarkTools.jl package, which runs the function many times and reports statistics to reduce noise.
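
As a minimal sketch of such a comparison, assuming the lp1/lp2/lp3 definitions above (the $ interpolates x so the untimed global lookup does not distort the measurement):

using BenchmarkTools

@btime lp1()      # reads the untyped global x
@btime lp2()      # asserts x::Vector{Float64} inside the loop
@btime lp3($x)    # receives the data as a typed argument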

It can be seen that the way to specify the variable type is the fastest.

Code generation

Julia provides a family of @code_* macros for inspecting the stages of code generation. In the REPL:

  • @code_lowered: the lowered form of the code;
  • @code_typed: the code after type inference, showing how types propagate;
  • @code_llvm: the LLVM IR produced by the compiler;
  • @code_native: the generated machine code;
  • @code_warntype: highlights places where type inference produces abstract types.

First we define a simple function

cal(a, b, c) = a + b * c
cal(1, 2, 3)
>>7
cal(1.0, 2.0, 3.0)
>>7.0
cal(1, 2, 3.0)
>>7.0

Use @code_lowered to view the lowered code:

@code_lowered cal(1, 2, 3)
@code_lowered cal(1.0, 2.0, 3.0)
@code_lowered cal(1, 2, 3.0)

The lowered code for the three calls is identical, and it has to be: the argument types differ, but the sequence of multiplication and addition is the same.

Next, use @code_typed to view the types after inference:

@code_typed cal(1, 2, 3)
@code_typed cal(1.0, 2.0, 3.0)
@code_typed cal(1, 2, 3.0)

The first two calls produce the same structure, with the operations specialized for the respective argument types; the third call contains an extra step that converts the two Int arguments to Float64.

Then use @code_llvm to view the LLVM IR:

@code_llvm cal(1, 2, 3)
@code_llvm cal(1.0, 2.0, 3.0)
@code_llvm cal(1, 2, 3.0)

For the first two calls the LLVM IR is essentially the same, while the third call contains many more steps.

Similarly, use @code_native to view the generated machine code.

The result is similar to @code_llvm: the first two calls produce essentially the same instructions, and the third produces many more.

Understanding code generation will be helpful for the explanations that follow.

Abstract and concrete types

When the element type of a container is concrete, for example an Int64 array [1, 2, 3, 4], the values are stored contiguously in memory.

But if the element type is abstract, such as Any, what is stored contiguously are only pointers to the actual values, and data access becomes slower.

Look at the example below.

import Base: +, *
struct IntNum
    n::Int64
end

struct AnyNum
    n::Number
end

+(a::IntNum, b::IntNum) = IntNum(a.n + b.n)
*(a::IntNum, b::IntNum) = IntNum(a.n * b.n)

+(a::AnyNum, b::AnyNum) = AnyNum(a.n + b.n)
*(a::AnyNum, b::AnyNum) = AnyNum(a.n * b.n)

You can use @code_native +(IntNum(1), IntNum(2)) and @code_native +(AnyNum(1), AnyNum(2)) to compare the length of the machine code generated for the two types. It is clear that AnyNum requires far more instructions than IntNum.

function calc_Int_time(a,b,c,n)
    for i in 1:n
        cal(IntNum(a), IntNum(b), IntNum(c))
    end
end

function calc_Any_time(a,b,c,n)
    for i in 1:n
        cal(AnyNum(a), AnyNum(b), AnyNum(c))
    end
end
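
A sketch of how the two might be timed, assuming the definitions above (the argument values are arbitrary and the numbers will vary by machine):

@time calc_Int_time(1, 2, 3, 10^6)
@time calc_Any_time(1, 2, 3, 10^6)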

The allocation counts reported for the two functions tell the same story: IntNum allocates far less than AnyNum. Computing with concrete types is therefore faster than computing with abstract types.

We can use @code_warntype to check whether an abstract type appears inside a running function.

For the concrete type, every inferred type in the @code_warntype output is concrete.

For the abstract type, the abstractly typed values are highlighted in red in the output.
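
A minimal sketch of the two checks, assuming the IntNum and AnyNum definitions above:

@code_warntype +(IntNum(1), IntNum(2))   # every inferred type is concrete (Int64)
@code_warntype +(AnyNum(1), AnyNum(2))   # the Number field appears as an abstract type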


Hidden type conversion

In C++ every variable has a fixed declared type, but in Julia the type of a variable can be left implicit and inferred from its initial value, so unintended type conversions often go unnoticed.

function cums(n)
    r = 0
    for i in 1:n
        r += cos(i)
    end
    r
end

function cums_f(n)
    r = 0.0
    for i in 1:n
        r += cos(i)
    end
    r
end

With the two functions side by side, anyone with basic programming experience can see that r in the first function starts out as an Int64, so on every iteration it must be converted to Float64 before the addition. But if the two functions were not placed together, how many of us would notice?

Comparing the @code_warntype output of the two functions shows a type instability in cums but not in cums_f; comparing their running times shows that the type-stable version is faster.
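
A sketch of how both comparisons could be made, assuming the definitions above:

@code_warntype cums(10)     # r starts as Int64 and becomes Float64 in the loop
@code_warntype cums_f(10)   # r stays Float64 throughout

@time cums(10^6)
@time cums_f(10^6)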

Therefore, when defining a variable, try to keep its type consistent with the operations that follow. Several functions help with this:

  • zero(value)
  • eltype(array)
  • one(value)
  • similar(array)

When we define variables, we can use these 4 functions to avoid common mistakes.

eg.

pos(x) = x < 0 ? zero(x) : x

eg. Find the sum of an array

x = [1.0, 2.1, 4.5]
sm = zero(x[1])   # or equivalently zero(eltype(x))
sm = sum(x)       # in practice sum(x) alone suffices; defining sm first only illustrates zero()

The usage of the other three is basically the same, so I will not give an example here.

Avoid containers with abstract type parameters

When defining a struct, we often write

mutable struct MyType1
           x::AbstractFloat
       end

But it would be better to use

mutable struct MyType2{T<:AbstractFloat}
           x::T
       end

Check their types:

a = MyType1(2.3)
typeof(a)
>>MyType1
b = MyType2(2.3)
typeof(b)
>>MyType2{Float64}

The concrete type of x can be determined from the type of b, but not from the type of a. The type of b.x will never change; it is always Float64, whereas the type of a.x can change.

typeof(a.x)
>>Float64
a.x = 2.3f0
typeof(a.x)
>>Float32

b.x = 2.3f0
typeof(b.x)
>>Float64

When an Array may hold elements of several types and we happen to know the type of a particular element, we can annotate it explicitly:

function foo(a::Array{Any,1})
    x = a[1]::Int32
    b = x+1
    ...
end

If you know the type x should have but cannot guarantee the type of a[1], use convert() together with the assertion:

x = convert(Int32, a[1])::Int32

From the above we can also extract a general optimization strategy: keep the program as simple and explicit as possible, so the compiler knows exactly what you want to do instead of having to guess.

It follows that a function with loosely specified parameters (such as untyped varargs) will generally not be as fast as one whose parameters are spelled out explicitly.

However, such type annotations only help when the type is known at compile time; if the type is only determined at runtime, the annotation hurts rather than helps:

function nr(n, prec)
    ctype = prec == 32 ? Float32 : Float64
    b = Complex{ctype}(2.3)
    d = 0
    for i in 1:n
        c = (b + 1.0f0)::Complex{ctype}
        d += c
    end
    d
end

How do we get the best performance when the type really can only be determined at runtime? Use a function barrier: pass the value to an inner function, as below.

function nr_op(n, prec)
    ctype = prec == 32 ? Float32 : Float64
    b = Complex{ctype}(2.3)
    d = opFunc(n, b::Complex{ctype})
end

function opFunc(n, b::Complex{T}) where {T}
    d = 0
    for i in 1:n
        c = (b + 1.0f0)::Complex{T}
        d += c
    end
    d
end
@time nr(10000, 64)
>>0.003382 seconds (20.00 k allocations: 625.188 KiB)

@time nr_op(10000, 64)
>>0.000023 seconds (7 allocations: 240 bytes)

We push this kind of type-dependent computation into a separate function because argument types are fixed at the moment the function is called, so inside the function there is no remaining type uncertainty during the calculation. Without the function barrier, the variable's type has to be determined after every assignment and the result converted accordingly, which takes far more time.

In short: pin down the types first, then compute.

Parameter type optimization

The same principle applies when a value determines a type parameter: hand it to the compiler in a form it can specialize on.

# Create an N-dimensional array (each dimension of size 3)
function array3(fillval, N)
           fill(fillval, ntuple(d->3, N))
       end
typeof(array3(2.0, 4))
>>Array{Float64,4}

Here N determines the dimensionality, so the concrete type of the returned array, Array{Float64,N}, depends on the value of N and cannot be known by the compiler in advance. If N is instead encoded as a type parameter via Val, the function is specialized for each N and performance improves:

function array3(fillval, ::Val{N}) where N
           fill(fillval, ntuple(d->3, Val(N)))
       end

When using Val(), pay attention to how it is called; used improperly, performance will be worse than not using Val() at all:

function call_array3(fillval, n)
    A = array3(fillval, Val(n))
end

Written this way, the compiler cannot see the value of n inside call_array3, so Val(n) must be constructed and dispatched at runtime, which defeats the original intention.
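
It is fine, however, to construct the Val from a value the compiler can see, for example a literal (an illustrative call, assuming the Val-based array3 above):

A = array3(5.0, Val(4))   # N = 4 is known at compile time, so array3 specializes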

Here is another example, comparing varargs parameters with explicitly named parameters, to show the difference in running time between the two:

function func1(n, x...)
    res = 0
    for i in 1:n
        res += x[2]
    end
    res
end

function func2(n, x, y)
    res = 0
    for i in 1:n
        res += y
    end
    res
end
@time func1(1000000000, 10, 10)
>> 0.408809 seconds (5 allocations: 176 bytes)

@time func2(1000000000, 10, 10)
>> 0.000003 seconds (5 allocations: 176 bytes)

With the parameters spelled out explicitly, the code runs several orders of magnitude faster.

Use methods instead of type checks inside functions

using LinearAlgebra

function mynorm(A)
    if isa(A, Vector)
        return sqrt(real(dot(A,A)))
    elseif isa(A, Matrix)
        return maximum(svdvals(A))
    else
        error("mynorm: invalid argument")
    end
end

The function mynorm is more efficient when written as separate methods:

mynorm(x::Vector) = sqrt(real(dot(x, x)))
mynorm(A::Matrix) = maximum(svdvals(A))
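
Usage stays the same; dispatch selects the right method from the argument's type (illustrative calls):

mynorm(rand(5))      # Vector method
mynorm(rand(3, 3))   # Matrix method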

Matrix optimization

In Julia, multi-dimensional arrays are stored in column-major order, the same as in MATLAB.

x = [1 2; 3 4]

# Flatten x into a 1-dimensional array; note the column-major order
x[:]
>>[1, 3, 2, 4]

In other words, the elements of each column occupy contiguous memory addresses, while the elements of a row do not, and operating on contiguous addresses is much faster than on scattered ones. Here is a matrix-copy example.

# Copy by column
function copy_cols(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for i = inds
        out[:, i] = x
    end
    return out
end

# Copy by row
function copy_rows(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for i = inds
        out[i, :] = x
    end
    return out
end

# Copy by element, in order of columns
function copy_col_row(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for col = inds, row = inds
        out[row, col] = x[row]
    end
    return out
end

# Copy by element, in row order
function copy_row_col(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for row = inds, col = inds
        out[row, col] = x[col]
    end
    return out
end
x = randn(10000)
fmt(f) = println(rpad(string(f)*": ", 14, ' '), @elapsed f(x))
map(fmt, Any[copy_cols, copy_rows, copy_col_row, copy_row_col])
>>copy_cols: 0.205115039
 copy_rows: 0.856341406
 copy_col_row: 0.279589601
 copy_row_col: 0.845328085

Dot operations

In array expressions, add a dot before each operator (or use the @. macro) so the operations fuse into a single loop; see the following example.

f(x) = 5x.^3 - 2x.^2 + 3x
fdot(x) = @. 5x^3 - 2x^2 + 3x
x = rand(10^6);
@time f(x);
>>0.040295 seconds (20 allocations: 53.407 MiB, 17.06% gc time)

@time fdot(x);
>>0.002195 seconds (8 allocations: 7.630 MiB)

@time f.(x);
>>0.002047 seconds (8 allocations: 7.630 MiB)

The fused (dotted) versions are much faster because the whole expression is evaluated in a single loop without allocating intermediate arrays.
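
Roughly speaking, the fused version behaves like the following hand-written loop; this is only an illustrative sketch of the fusion idea, not literally what the compiler emits:

function fdot_manual(x)
    out = similar(x)
    # a single pass over x, with no temporary arrays
    for i in eachindex(x)
        out[i] = 5x[i]^3 - 2x[i]^2 + 3x[i]
    end
    return out
end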

Vectorization does not improve Julia's running speed

Many people coming from MATLAB or Python assume that vectorized operations are always much faster than loops; in Julia there is no such rule, and this deserves attention.

function loop_sum()
    s = 0.0
    for i = 1:1000
        s = 0.0
        for k = 1:10000
            s += 1.0/(k*k)
        end
    end
    s
end

Run @code_warntype loop_sum() to check the code for type-stability problems.

No type-conversion warning appears in the output.

@benchmark loop_sum()
>>BenchmarkTools.Trial: 
  memory estimate: 0 bytes
  allocs estimate: 0
  --------------
  minimum time: 8.868 ms (0.00% GC)
  median time: 8.869 ms (0.00% GC)
  mean time: 8.870 ms (0.00% GC)
  maximum time: 8.925 ms (0.00% GC)
  --------------
  samples: 564
  evals/sample: 1

Now complete the same accumulation with vectorized operations:

function vec_sum()
    s = 0.0
    k = collect(1:10000)
    for i = 1:1000
        s = sum(1 ./ (k .^ 2))
    end
    s
end

@benchmark vec_sum()
>>BenchmarkTools.Trial: 
  memory estimate: 76.48 MiB
  allocs estimate: 3002
  --------------
  minimum time: 19.575 ms (17.75% GC)
  median time: 22.603 ms (21.13% GC)
  mean time: 23.038 ms (22.79% GC)
  maximum time: 59.079 ms (68.95% GC)
  --------------
  samples: 217
  evals/sample: 1

@simd can speed up a loop when its operations may be safely reordered; here it accelerates the floating-point accumulation:

function loop_sum_simd()
    s = 0.0
    for i = 1:1000
        s = 0.0
        @simd for k = 1:10000
            s += 1.0/(k*k)
        end
    end
    s
end

@benchmark loop_sum_simd()
>>BenchmarkTools.Trial: 
  memory estimate: 0 bytes
  allocs estimate: 0
  --------------
  minimum time: 6.202 ms (0.00% GC)
  median time: 6.236 ms (0.00% GC)
  mean time: 6.240 ms (0.00% GC)
  maximum time: 7.119 ms (0.00% GC)
  --------------
  samples: 802
  evals/sample: 1

This shows that vectorizing code does not by itself make Julia faster, a point the official documentation also makes repeatedly.

Output pre-allocation

What does this mean? When a computation repeatedly produces an Array (or another mutable container), allocate the output once up front and write the results into it in place; performance improves significantly.

function xinc(x)
           return [x, x+1, x+2]
end

function loopinc()
   y = 0
   for i = 1:10^7
       ret = xinc(i)
       y += ret[2]
   end
   return y
end

function xinc!(ret::AbstractVector{T}, x::T) where T
   ret[1] = x
   ret[2] = x+1
   ret[3] = x+2
   nothing
end

function loopinc_prealloc()
   ret = Vector{Int}(undef, 3)
   y = 0
   for i = 1:10^7
       xinc!(ret, i)
       y += ret[2]
   end
   return y
end
@time loopinc()
>>0.428358 seconds (40.00 M allocations: 1.490 GiB, 10.25% gc time)

@time loopinc_prealloc()
>>0.024745 seconds (6 allocations: 288 bytes)

Copy and view

A copy duplicates the original variable's data, which involves a memory copy; a view is only a mapping onto the original data, with no copying.

First, an example of views:

A = zeros(3, 3);
@views for row in 1:3
   b = A[row, :]
   b[:] .= row # The value of A will also change accordingly
end

Now define a copying version and a viewing version of the same function:

fcopy(x) = sum(x[2:end-1]);
@views fview(x) = sum(x[2:end-1]);
x = rand(10^6);
@time fcopy(x)
>>0.031803 seconds (7 allocations: 7.630 MiB, 85.25% gc time)

@time fview(x)
>>0.001218 seconds (6 allocations: 224 bytes)

IO operation optimization

When writing to a file, a variable's value can be interpolated into a string with $, or the variable can be passed directly:

println(file, "$a $b")

The following form is better: interpolation first builds an intermediate string and then writes it to the file, whereas passing the values directly writes them without that intermediate step.

println(file, a, " ", b)
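
A small self-contained sketch of both forms (the file name out.txt and the values are arbitrary):

open("out.txt", "w") do file
    a, b = 1, 2
    println(file, "$a $b")     # builds an intermediate string, then writes it
    println(file, a, " ", b)   # writes the values directly
end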

Other performance optimization tips

  • Avoid unnecessary Arrays: to sum x, y and z, write x + y + z rather than sum([x, y, z]).
  • To compute the squared modulus of a complex number, use abs2(z) rather than abs(z)^2; in general, prefer an existing specialized function over writing the computation yourself.
  • For integer results, use div(x, y) instead of trunc(x/y), fld(x, y) instead of floor(x/y), and cld(x, y) instead of ceil(x/y), for the same reason (see the sketch after this list).
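
A few of these in action (values chosen only for illustration):

z = 3 + 4im
abs2(z)      # 25, without the square root that abs(z)^2 computes and then undoes
div(7, 2)    # 3
fld(-7, 2)   # -4
cld(7, 2)    # 4

Finally, a comparison of an explicit in-place loop against an in-place broadcast with @., both applying tanh to every element of r:
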
function test_loop(r)
    for i in 1:10000
        for j in eachindex(r)   # iterate over every index of r
            r[j] = tanh(r[j])
        end
    end
end

function test_vectorize(r)
    for i in 1:10000
        @. r = tanh(r)
    end
end
r = rand(1000)

@time test_vectorize(r)
>> 0.126987 seconds (4 allocations: 160 bytes)

@time test_loop(r)
>>0.000289 seconds (4 allocations: 160 bytes)
Reference: https://cloud.tencent.com/developer/article/1652992