The value and type of a global variable can change at any time, which makes it hard for the compiler to optimize code that uses it. Whenever possible, make variables local or pass them to functions as parameters.

**Any performance-critical code, or code whose performance you want to measure, should be placed inside a function.**

Declaring global variables as constants can greatly improve performance.

```julia
const VAL = 0
```
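As a sketch of why this matters (the names here are illustrative, not from the original): summing over a `const` global lets the compiler see a fixed type, while a plain global forces dynamic type checks on every access.

```julia
# Illustrative comparison: a plain global vs. a const global.
global gvec = rand(1000)   # type could change at any time
const CVEC = rand(1000)    # binding's type is fixed, so the compiler can specialize

function sum_nonconst()
    s = 0.0
    for v in gvec   # gvec's type must be checked dynamically
        s += v
    end
    s
end

function sum_const()
    s = 0.0
    for v in CVEC   # CVEC is known to be Vector{Float64}
        s += v
    end
    s
end
```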

If you must declare global variables, you can annotate their types at the point of use to improve efficiency.

```julia
global x = rand(10000)

function lp1()
    s = 0.0
    for i in x
        s += i
    end
end
```

Use `@time` to estimate code runtime.

The first time a function runs it must be compiled (JIT warm-up), so the result of the second run reflects the real running time.
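A minimal illustration (the function name is arbitrary): time the same call twice and compare.

```julia
g(x) = sum(abs2, x)   # arbitrary small function

x = rand(1000)
@time g(x)   # first call: includes JIT compilation time
@time g(x)   # second call: the real running time
```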

**The following `@time` results all exclude JIT compilation time.**

From the `@time` output we can read not only the running time but also how much memory the code allocated; the allocation size alone gives a rough indication of whether the program has a problem.

For example, the function `lp1()` allocates 733 KB of memory even though it only sums 64-bit floating-point numbers, which indicates there is overhead that can be eliminated.

Below we run the test again, annotating the variable's type in one version and passing the data as a function parameter in the other:

```julia
function lp2()
    s = 0.0
    for i in x::Vector{Float64}
        s += i
    end
end

function lp3(x)
    s = 0.0
    for i in x
        s += i
    end
end
```

The running time drops sharply, but the call still allocates 160 bytes, which is caused by running `@time` in the global scope.

If we put the test code inside a function, this problem disappears.
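For instance (a sketch; `work` and `timed_sum` are made-up names), wrapping both the computation and the `@time` call in functions keeps everything out of global scope:

```julia
function work(x)
    s = 0.0
    for i in x
        s += i
    end
    s
end

# Calling @time from inside a function avoids the global-scope allocations
function timed_sum(x)
    @time work(x)
end

timed_sum(rand(10^4))
```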

*For more rigorous performance testing, use the BenchmarkTools.jl package, which runs the function many times to reduce noise.*
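A sketch of BenchmarkTools usage (assuming the package is installed; `lp` is a hypothetical function under test). The `$` interpolates the global so that global-variable lookup is not part of the measurement:

```julia
using BenchmarkTools

lp(x) = sum(x)        # hypothetical function under test
x = rand(10^4)

@btime lp($x)         # $x keeps the global-variable lookup out of the timing
@benchmark lp($x)     # full statistics over many samples
```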

It can be seen that annotating the variable's type is the fastest approach.

In the REPL, the `@code_*` macros let us inspect each stage of code generation:

- @code_lowered: Julia's lowered form of the code;
- @code_typed: the types inferred while the program runs;
- @code_llvm: the IR produced by the LLVM compiler;
- @code_native: the machine code generated for the program;
- @code_warntype: checks for type warnings when the program runs.

First we define a simple function:

```julia
cal(a, b, c) = a + b * c

cal(1, 2, 3)
>> 7
cal(1.0, 2.0, 3.0)
>> 7.0
cal(1, 2, 3.0)
>> 7.0
```

Use `@code_lowered` to view the lowered form:

```julia
@code_lowered cal(1, 2, 3)
@code_lowered cal(1.0, 2.0, 3.0)
@code_lowered cal(1, 2, 3.0)
```

The lowered code of the three calls is identical, as it must be: the argument types differ, but the sequence of multiplication and addition is the same.

Next, use `@code_typed` to view the inferred runtime types:

```julia
@code_typed cal(1, 2, 3)
@code_typed cal(1.0, 2.0, 3.0)
@code_typed cal(1, 2, 3.0)
```

The first two calls follow the same process, with the generated code specialized to the argument types; the third call has an extra step that converts the two Int arguments to Float64.

Then use `@code_llvm` to view the LLVM compiler's output:

```julia
@code_llvm cal(1, 2, 3)
@code_llvm cal(1.0, 2.0, 3.0)
@code_llvm cal(1, 2, 3.0)
```

For the first two calls the LLVM IR is essentially the same, while the third has many extra steps.

Similarly, use `@code_native` to view the generated machine code.

The result is similar to `@code_llvm`: the first two calls compile to essentially the same code, and the third has many more steps.

Understanding code generation will help with the explanations that follow.

When we define a function, if the element type of an argument is concrete, such as an Int64 array [1, 2, 3, 4], the elements are stored contiguously in memory.

But if the element type is Any, what is stored contiguously is only pointers to the actual values, so data access slows down.

Look at the example below.

```julia
import Base: +, *

struct IntNum
    n::Int64
end

struct AnyNum
    n::Number
end

+(a::IntNum, b::IntNum) = IntNum(a.n + b.n)
*(a::IntNum, b::IntNum) = IntNum(a.n * b.n)
+(a::AnyNum, b::AnyNum) = AnyNum(a.n + b.n)
*(a::AnyNum, b::AnyNum) = AnyNum(a.n * b.n)
```

You can use `@code_native +(IntNum(1), IntNum(2))` and `@code_native +(AnyNum(1), AnyNum(2))` to compare the length of the machine code of the two methods. It is clear that AnyNum requires far more operations than IntNum.

```julia
function calc_Int_time(a, b, c, n)
    for i in 1:n
        cal(IntNum(a), IntNum(b), IntNum(c))
    end
end

function calc_Any_time(a, b, c, n)
    for i in 1:n
        cal(AnyNum(a), AnyNum(b), AnyNum(c))
    end
end
```

The memory allocated also shows that IntNum uses far less than AnyNum. Computing with concrete types therefore saves time compared with abstract types.

We can use `@code_warntype` to check whether an abstract type appears inside a running function.

For the concrete type, `@code_warntype` shows no warnings; for the abstract type, every place an abstract type appears is marked in red.

Let's look at another example involving Julia's built-in functions.

In C++ every variable has a fixed declared type, but in Julia, since a variable can be defined without a type annotation, implicit type conversions often go unnoticed.

```julia
function cums(n)
    r = 0
    for i in 1:n
        r += cos(i)
    end
    r
end

function cums_f(n)
    r = 0.0
    for i in 1:n
        r += cos(i)
    end
    r
end
```

Seeing the two functions side by side, anyone with basic programming experience can spot that `r` in the first function starts as an Int64, so at runtime it must be converted to Float64 before the addition. But if the two functions were not shown together, how many would notice?

Compare the type warnings of the two functions:

Compare the running times of the two functions:

Therefore, when defining a variable, keep its type consistent with the operations that follow. Several functions can help with this:

- zero(value)
- eltype(array)
- one(value)
- similar(array)

When we define variables, we can use these 4 functions to avoid common mistakes.

e.g.:

```julia
pos(x) = x < 0 ? zero(x) : x
```

e.g. summing an array:

```julia
x = [1.0, 2.1, 4.5]
sm = zero(x[1])
sm = sum(x)  # of course you could call sum(x) directly without defining sm; this is just an example
```

The other three are used in basically the same way, so no further examples are given here.
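Still, quick sketches of the other three may help (the values are illustrative):

```julia
x = [1.0, 2.1, 4.5]

one(x[1])               # 1.0: multiplicative identity with the same type as x[1]
eltype(x)               # Float64: handy for building type-stable accumulators
acc = zero(eltype(x))   # 0.0, matching the element type

out = similar(x)        # uninitialized Vector{Float64} of the same size as x
out .= x .* 2           # fill it in place, with no mismatched-type allocation
```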

When defining a struct, we often write:

```julia
mutable struct MyType1
    x::AbstractFloat
end
```

But it would be better to use:

```julia
mutable struct MyType2{T<:AbstractFloat}
    x::T
end
```

Check their types:

```julia
a = MyType1(2.3)
typeof(a)
>> MyType1
b = MyType2(2.3)
typeof(b)
>> MyType2{Float64}
```

The type of `b.x` can be determined from the type of `b`, but the type of `a.x` cannot be determined from `a`. In fact `b.x` will always be Float64, while the type of `a.x` can change:

```julia
a.x = 2.3f0
typeof(a.x)
>> Float32
b.x = 2.3f0
typeof(b.x)
>> Float64
```

When an Array can hold values of multiple types, if we know the type of one of the elements, we should annotate it explicitly:

```julia
function foo(a::Array{Any,1})
    x = a[1]::Int32
    b = x + 1
    ...
end
```

If you know only the type you want `x` to have, while the type of `a[1]` cannot be guaranteed, use `convert()`:

```julia
x = convert(Int32, a[1])::Int32
```
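A self-contained illustration (the array contents are made up for the example):

```julia
a = Any[1, 2.5, "three"]           # heterogeneous array; elements are boxed as Any
x = convert(Int32, a[1])::Int32    # convert first, then assert the resulting type
```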

**From the above we can extract a general optimization strategy: keep the program as simple as possible, and let the compiler know exactly what you want to do instead of making it guess.**

It can therefore be inferred that a function whose parameter types vary at runtime is certainly not as fast as one whose parameter types are fixed.

But annotating a variable's type needs care: if the type is not determined at compile time but only at runtime, the annotation hurts performance.

```julia
function nr(n, prec)
    ctype = prec == 32 ? Float32 : Float64
    b = Complex{ctype}(2.3)
    d = 0
    for i in 1:n
        c = (b + 1.0f0)::Complex{ctype}
        d += c
    end
    d
end
```

How can we handle types that are only determined at runtime and still get maximum performance?

```julia
function nr_op(n, prec)
    ctype = prec == 32 ? Float32 : Float64
    b = Complex{ctype}(2.3)
    d = opFunc(n, b::Complex{ctype})
end

function opFunc(n, b::Complex{T}) where {T}
    d = 0
    for i in 1:n
        c = (b + 1.0f0)::Complex{T}
        d += c
    end
    d
end
```

```julia
@time nr(10000, 64)
>> 0.003382 seconds (20.00 k allocations: 625.188 KiB)
@time nr_op(10000, 64)
>> 0.000023 seconds (7 allocations: 240 bytes)
```

We complete this kind of type-dependent computation inside a function because, when the function is called, the types of the input arguments are determined first, so type uncertainty disappears inside the body. Without the function barrier, the variable's type must be looked up after every assignment and the result converted to the corresponding type, which costs much more time.

In short, **determine the variable's type first, then compute**.

When the type itself is a function parameter, we should use the same approach and pin it down first:

```julia
# Create an N-dimensional array, each dimension of size 3
function array3(fillval, N)
    fill(fillval, ntuple(d -> 3, N))
end

typeof(array3(2.0, 4))
>> Array{Float64,4}
```

Here N determines the array's dimensionality, so the type parameter of the resulting array is the value of N. If we fix the value of N first and then build the array, performance is better:

```julia
function array3(fillval, ::Val{N}) where N
    fill(fillval, ntuple(d -> 3, Val(N)))
end
```

When using `Val()`, pay attention to how it is invoked; used improperly, it performs worse than not using `Val()` at all:

```julia
function call_array3(fillval, n)
    A = array3(fillval, Val(n))
end
```

Written this way, the compiler cannot know the value of `n` at compile time, which is completely contrary to our original intention.
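`Val` only pays off when its argument is a compile-time constant, such as a literal at the call site. A sketch, repeating the `array3` definition from above so it stands alone:

```julia
function array3(fillval, ::Val{N}) where N
    fill(fillval, ntuple(d -> 3, Val(N)))
end

# Good: N is a literal, so Val(2) is a compile-time constant
A = array3(2.0, Val(2))
size(A)   # (3, 3)
```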

Let's take another example, comparing variadic parameters with explicit keyword-style parameters, to show the difference in running time:

```julia
function func1(n, x...)
    res = 0
    for i in 1:n
        res += x[2]
    end
    res
end

function func2(n, x, y)
    res = 0
    for i in 1:n
        res += y
    end
    res
end
```

```julia
@time func1(1000000000, 10, 10)
>> 0.408809 seconds (5 allocations: 176 bytes)
@time func2(1000000000, 10, 10)
>> 0.000003 seconds (5 allocations: 176 bytes)
```

With the parameters made explicit, the code runs several orders of magnitude faster.

```julia
using LinearAlgebra

function mynorm(A)
    if isa(A, Vector)
        return sqrt(real(dot(A, A)))
    elseif isa(A, Matrix)
        return maximum(svdvals(A))
    else
        error("mynorm: invalid argument")
    end
end
```

If the function mynorm is written in the following form instead, it will be more efficient:

```julia
norm(x::Vector) = sqrt(real(dot(x, x)))
norm(A::Matrix) = maximum(svdvals(A))
```

In Julia, multi-dimensional arrays are stored in column-major order, the same as in MATLAB:

```julia
x = [1 2; 3 4]
# Flatten x to a 1-dimensional array (in column-major order)
x[:]
```

In other words, the elements of each column of a Julia matrix are contiguous in memory, while the elements of a row are not, and operating on contiguous addresses is much faster than on non-contiguous ones. Here is a matrix-copy example.

```julia
# Copy column by column
function copy_cols(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for i = inds
        out[:, i] = x
    end
    return out
end

# Copy row by row
function copy_rows(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for i = inds
        out[i, :] = x
    end
    return out
end

# Copy element by element, in column order
function copy_col_row(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for col = inds, row = inds
        out[row, col] = x[row]
    end
    return out
end

# Copy element by element, in row order
function copy_row_col(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for row = inds, col = inds
        out[row, col] = x[col]
    end
    return out
end
```

```julia
x = randn(10000)
fmt(f) = println(rpad(string(f) * ": ", 14, ' '), @elapsed f(x))
map(fmt, Any[copy_cols, copy_rows, copy_col_row, copy_row_col])
>> copy_cols:    0.205115039
>> copy_rows:    0.856341406
>> copy_col_row: 0.279589601
>> copy_row_col: 0.845328085
```

**Dot operations**

In array operations, add a `.` before the operator to fuse the broadcasts; see the following example:

```julia
f(x) = 5x.^3 - 2x.^2 + 3x
fdot(x) = @. 5x^3 - 2x^2 + 3x
```

```julia
x = rand(10^6);
@time f(x);
>> 0.040295 seconds (20 allocations: 53.407 MiB, 17.06% gc time)
@time fdot(x);
>> 0.002195 seconds (8 allocations: 7.630 MiB)
@time f.(x);
>> 0.002047 seconds (8 allocations: 7.630 MiB)
```

It can be seen that the fused dot versions are much faster.

**Vectorization does not improve Julia's running speed**

Many who have used MATLAB or Python assume that vectorized operations are much faster than loops, but no such rule holds in Julia. Keep this in mind.

```julia
function loop_sum()
    s = 0.0
    for i = 1:1000
        s = 0.0
        for k = 1:10000
            s += 1.0/(k*k)
        end
    end
    s
end
```

Run `@code_warntype loop_sum()` to check the code for type-conversion problems: there is no type warning.

```julia
@benchmark loop_sum()
>> BenchmarkTools.Trial:
>>   memory estimate:  0 bytes
>>   allocs estimate:  0
>>   --------------
>>   minimum time:     8.868 ms (0.00% GC)
>>   median time:      8.869 ms (0.00% GC)
>>   mean time:        8.870 ms (0.00% GC)
>>   maximum time:     8.925 ms (0.00% GC)
>>   --------------
>>   samples:          564
>>   evals/sample:     1
```

Now complete the same accumulation with a vectorized expression:

```julia
function vec_sum()
    s = 0.0
    k = collect(1:10000)
    for i = 1:1000
        s = sum(1 ./ (k.^2))
    end
end

@benchmark vec_sum()
>> BenchmarkTools.Trial:
>>   memory estimate:  76.48 MiB
>>   allocs estimate:  3002
>>   --------------
>>   minimum time:     19.575 ms (17.75% GC)
>>   median time:      22.603 ms (21.13% GC)
>>   mean time:        23.038 ms (22.79% GC)
>>   maximum time:     59.079 ms (68.95% GC)
>>   --------------
>>   samples:          217
>>   evals/sample:     1
```

`@simd` can accelerate a loop when its operations may be safely reordered; floating-point accumulation, for example, can be sped up this way:

```julia
function loop_sum_simd()
    s = 0.0
    for i = 1:1000
        s = 0.0
        @simd for k = 1:10000
            s += 1.0/(k*k)
        end
    end
    s
end

@benchmark loop_sum_simd()
>> BenchmarkTools.Trial:
>>   memory estimate:  0 bytes
>>   allocs estimate:  0
>>   --------------
>>   minimum time:     6.202 ms (0.00% GC)
>>   median time:      6.236 ms (0.00% GC)
>>   mean time:        6.240 ms (0.00% GC)
>>   maximum time:     7.119 ms (0.00% GC)
>>   --------------
>>   samples:          802
>>   evals/sample:     1
```

This shows that vectorized operations in Julia do not, by themselves, improve speed, a point made many times in the official Julia documentation.

**Pre-allocate outputs.** When operating on Arrays or other complex types, if we pre-allocate an Array (or other container) to hold the result and then perform the computation, performance improves significantly:

```julia
function xinc(x)
    return [x, x+1, x+2]
end

function loopinc()
    y = 0
    for i = 1:10^7
        ret = xinc(i)
        y += ret[2]
    end
    return y
end

function xinc!(ret::AbstractVector{T}, x::T) where T
    ret[1] = x
    ret[2] = x+1
    ret[3] = x+2
    nothing
end

function loopinc_prealloc()
    ret = Vector{Int}(undef, 3)
    y = 0
    for i = 1:10^7
        xinc!(ret, i)
        y += ret[2]
    end
    return y
end
```

```julia
@time loopinc()
>> 0.428358 seconds (40.00 M allocations: 1.490 GiB, 10.25% gc time)
@time loopinc_prealloc()
>> 0.024745 seconds (6 allocations: 288 bytes)
```

**Copies and views.** A copy duplicates the original variable, which involves a memory copy; a view is only a mapping onto the original data, with no memory copied.

First, an example of a view:

```julia
A = zeros(3, 3);
@views for row in 1:3
    b = A[row, :]
    b[:] .= row  # A changes accordingly, since b is a view into A
end
```

Define a copy version and a view version of the same function:

```julia
fcopy(x) = sum(x[2:end-1]);
@views fview(x) = sum(x[2:end-1]);
x = rand(10^6);
```

```julia
@time fcopy(x)
>> 0.031803 seconds (7 allocations: 7.630 MiB, 85.25% gc time)
@time fview(x)
>> 0.001218 seconds (6 allocations: 224 bytes)
```

When writing to a file, we can interpolate a variable's value with `$`, or pass the variable itself directly:

```julia
println(file, "$a $b")
```

The following form is better: the interpolation above first converts the expression into a string and then writes it to the file, whereas the form below writes the values directly.

```julia
println(file, a, " ", b)
```

- Avoid unnecessary Arrays. For example, to sum x, y and z, use `x + y + z` instead of `sum([x, y, z])`.
- To compute the squared modulus of a complex number, use `abs2(z)` rather than `abs(z)^2`. In short, don't hand-write a computation when a ready-made function exists.
- Use `div(x, y)` instead of `trunc(x/y)`, `fld(x, y)` instead of `floor(x/y)`, and `cld(x, y)` instead of `ceil(x/y)`; again, these ready-made functions exist, so don't write the computation yourself.
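Concrete illustrations of these substitutions:

```julia
z = 3 + 4im
abs2(z)        # 25: squared modulus, without the square root that abs(z)^2 computes

div(7, 2)      # 3: truncated integer quotient, instead of trunc(7/2)
fld(-7, 2)     # -4: floor division, instead of floor(-7/2)
cld(7, 2)      # 4: ceiling division, instead of ceil(7/2)
```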

One final comparison of an explicit loop with an in-place broadcast:

```julia
function test_loop(r)
    for i in 1:10000
        for j in 1:length(r)
            r[j] = tanh(r[j])
        end
    end
end

function test_vectorize(r)
    for i in 1:10000
        @. r = tanh(r)
    end
end
```

```julia
r = rand(1000)
@time test_vectorize(r)
>> 0.126987 seconds (4 allocations: 160 bytes)
@time test_loop(r)
>> 0.000289 seconds (4 allocations: 160 bytes)
```

Reference: https://cloud.tencent.com/developer/article/1652992 (13. How to write high-performance Julia, Tencent Cloud Developer Community)