3. Why Halide? 3
Halide's answer: decouples Algorithm from Scheduling
Algorithm: what is computed.
Schedule: where and when it's computed.
Easy for programmers to build pipelines
• simplifies algorithm code
• improves modularity
Easy for programmers to specify & explore optimizations
• fusion, tiling, parallelism, vectorization
• can’t break the algorithm
Easy for the compiler to generate fast code
10. Types – Var, Expr, Func and RDom 10
Func: represents a (schedulable) pipeline stage.
Func gradient;
Var: names to use as variables in the definition of a Func.
Var x, y;
Expr: calculations of those variables, expressions and other functions in a function.
Expr e = x + y;
gradient(x, y) = e;
add a definition for the Func object:
RDom: reduction domain, calculate a value from a area of inputs, as loops for calculation
RDom r(-1, 3) // MIN, EXTENTS
Expr e = sum(f(x+r, y));
11. Advanced things in Functions 11
bfloat16_t: truncated version 16b version of float32
Func ops;
ops(x, y) = Tuple( expr_add, expr_sub, expr_mul, expr_div);
Tuple: represents a Func with mutiple outputs
float16_t: IEEE754 16-bit float representation
Halide::* : special math or operations, refer to https://halide-lang.org/docs/namespace_halide.html
Expr u8val;
u8val = u8(clamp(out, 0, 255));
u8val = saturating_cast<uint8>(out);
// math, other like ceil, floor, pow, sin/cos/tan ...
Expr logval = log(x);
// select, works like “?:” in C or switch-case in complex cases
Expr c = select( c < 0, 0, c);
12. JIT Image Processing Example 12
// load the input image
Buffer<uint8_t> input = load_image("images/rgb.png");
// function used to brighter the image
Func brighter;
// variables used to define brighter function
Var x, y, c;
// 'value' Expr is used to define the procedure of image processing
Expr value = input(x, y, c);
value = Halide::cast<float>(value);
value = value * 1.5f;
value = Halide::min(value, 255.0f);
value = Halide::cast<uint8_t>(value);
// define the function
brighter(x, y, c) = value;
// get output result
Buffer<uint8_t> output =
brighter.realize(input.width(), input.height(), input.channels());
// save the output to a file
save_image(output, "brighter.png");
13. Put It All Together! - 3x3 Blur - In JIT 13
https://github.com/champyen/halide_2019.git
15. Scheduling Basics – Default Loop Structure 15
func_foo (a, b, c, … x, y, z) = …
inner-most loop
outermost loop
//default scheduling equal to the below loop:
for(z = 0; z < Z_MAX; z++){
for(y = 0; y < Y_MAX; y++){
for(x = 0; x < X_MAX; x++){
…
for(a = 0; a < A_MAX; A++){
// computing at here
}
…
}
}
}
16. Scheduling Basics - Reodering 16
func_foo.reorder (z, y, x, … c, b, a) = …
inner-most loop
outermost loop
//reordered scheduling equal to the below loop:
for(a = 0; a < A_MAX; a++){
for(b = 0; b < B_MAX; b++){
for(c = 0; c < C_MAX; c++){
…
for(z = 0; z < Z_MAX; Z++){
// computing at here
}
…
}
}
}
17. Scheduling Basics - Splitting 17
func_foo(x, y) = ...
func_foo.split(y, yo, yi, 32);
//splitted scheduling equal to the below loop:
for(yo = 0; yo < Y_MAX/32; yo++){
for(yi = 0; yi < 32; yi++){
for(x = 0; x < X_MAX; x++){
//computation is here
}
}
}
18. Scheduling Basics - Tiling 18
func_foo(x, y) = ...
func_foo.tile(x, y, xo, xi, yo, yi, 32, 32);
//tiled scheduling equal to the below loop:
for(yo = 0; yo < Y_MAX/32; yo++){
for(xo = 0; xo < X_MAX/32; xo++){
for(yi = 0; yi < 32; yi++){
for(xi = 0; xi < 32; xi++{
//computation is here
}
}
}
}
19. Schedule Basics - Fuse 19
func_foo(x, y) = ...
func_foo.fuse(x, y, fidx);
//fused scheduling equal to the below loop:
for(fidx = 0; fidx < X_MAX*Y_MAX; fidx++){
//computation is here
}
serialized by fidx
20. Scheduling – Vectorize, Parallel 20
func_foo(x, y) = ...
func_foo.vectorize(x, 8);
//vectorized scheduling equal to the below loop:
for(y = 0; y < Y_MAX; y++){
for(x = 0; x < X_MAX; x+=8){
//8-LANE auto-vectorization
}
}
func_foo(x, y) = ...
func_foo.parallel(y);
//parallel scheduling equal to the below loop:
#pragma omp paralle for
for(y = 0; y < Y_MAX; y++){
for(x = 0; x < X_MAX; x++){
//computation is here
}
}
Vectorize
Parallel
21. compute_at/store_at, compute_root/store_root
● store position should be same or outer than computation
● store_root => indicate the stage/function has whole frame buffer output
● compute_root => bread-first
○ and also mean store_root
● store_at(Func, Var)
○ the Func’s storage is declared in Var’s loop of Func
● compute_at( Func, Var )
○ computed in Var’s loop of Func
○ also mean store_at(Func, Var)
● Var::outermost()
21