Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

0

Share

Halide tutorial 2019

Download to read offline

Halide Language Tutorial

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all
  • Be the first to like this

Halide tutorial 2019

  1. 1. Introduction to Halide Champ Yen champ.yen@gmail.com https://tinyurl.com/ubqye3y
  2. 2. Overview of Halide 2
  3. 3. Why Halide? 3 Halide's answer: decouples Algorithm from Scheduling Algorithm: what is computed. Schedule: where and when it's computed. Easy for programmers to build pipelines  • simplifies algorithm code • improves modularity Easy for programmers to specify & explore optimizations  • fusion, tiling, parallelism, vectorization • can’t break the algorithm Easy for the compiler to generate fast code
  4. 4. Image Processing Tradeoffs Experienced Engineers always keep PARALLELISM, LOCALITY and REDUNDANT WORK in mind.
  5. 5. Processing Policies/Skills used in image processing coding 5 bh(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)/3 bv(x, y) = (bh(x, y-1) + bh(x, y) + bh(x, y+1)/3 Breadth-First Sliding-Window Fusion Tiling Sliding-Window with Tiling
  6. 6. Performance - It's All about Scheduling 6 To optimize is to find a better scheduling in the valid space.
  7. 7. Example – C++, Optimized C++ and Halide 7
  8. 8. How Halide works 8
  9. 9. define Algorithms & JIT in Halide 9
  10. 10. Types – Var, Expr, Func and RDom 10 Func: represents a (schedulable) pipeline stage. Func gradient; Var: names to use as variables in the definition of a Func. Var x, y; Expr: calculations of those variables, expressions and other functions in a function. Expr e = x + y; gradient(x, y) = e; add a definition for the Func object: RDom: reduction domain, calculate a value from a area of inputs, as loops for calculation RDom r(-1, 3) // MIN, EXTENTS Expr e = sum(f(x+r, y));
  11. 11. Advanced things in Functions 11 bfloat16_t: truncated version 16b version of float32 Func ops; ops(x, y) = Tuple( expr_add, expr_sub, expr_mul, expr_div); Tuple: represents a Func with mutiple outputs float16_t: IEEE754 16-bit float representation Halide::* : special math or operations, refer to https://halide-lang.org/docs/namespace_halide.html Expr u8val; u8val = u8(clamp(out, 0, 255)); u8val = saturating_cast<uint8>(out); // math, other like ceil, floor, pow, sin/cos/tan ... Expr logval = log(x); // select, works like “?:” in C or switch-case in complex cases Expr c = select( c < 0, 0, c);
  12. 12. JIT Image Processing Example 12 // load the input image Buffer<uint8_t> input = load_image("images/rgb.png"); // function used to brighter the image Func brighter; // variables used to define brighter function Var x, y, c; // 'value' Expr is used to define the procedure of image processing Expr value = input(x, y, c); value = Halide::cast<float>(value); value = value * 1.5f; value = Halide::min(value, 255.0f); value = Halide::cast<uint8_t>(value); // define the function brighter(x, y, c) = value; // get output result Buffer<uint8_t> output =          brighter.realize(input.width(), input.height(), input.channels()); // save the output to a file save_image(output, "brighter.png");
  13. 13. Put It All Together! - 3x3 Blur - In JIT 13 https://github.com/champyen/halide_2019.git
  14. 14. Scheduling in Halide 14
  15. 15. Scheduling Basics – Default Loop Structure 15 func_foo (a, b, c, … x, y, z) = … inner-most loop outermost loop //default scheduling equal to the below loop: for(z = 0; z < Z_MAX; z++){     for(y = 0; y < Y_MAX; y++){         for(x = 0; x < X_MAX; x++){             …                 for(a = 0; a < A_MAX; A++){                     // computing at here                 }             …          }     } }
  16. 16. Scheduling Basics - Reodering 16 func_foo.reorder (z, y, x, … c, b, a) = … inner-most loop outermost loop //reordered scheduling equal to the below loop: for(a = 0; a < A_MAX; a++){     for(b = 0; b < B_MAX; b++){         for(c = 0; c < C_MAX; c++){             …                 for(z = 0; z < Z_MAX; Z++){                     // computing at here                 }             …          }     } }
  17. 17. Scheduling Basics - Splitting 17 func_foo(x, y) = ... func_foo.split(y, yo, yi, 32); //splitted scheduling equal to the below loop: for(yo = 0; yo < Y_MAX/32; yo++){     for(yi = 0; yi < 32; yi++){         for(x = 0; x < X_MAX; x++){             //computation is here         }     } }
  18. 18. Scheduling Basics - Tiling 18 func_foo(x, y) = ... func_foo.tile(x, y, xo, xi, yo, yi, 32, 32); //tiled scheduling equal to the below loop: for(yo = 0; yo < Y_MAX/32; yo++){     for(xo = 0; xo < X_MAX/32; xo++){         for(yi = 0; yi < 32; yi++){             for(xi = 0; xi < 32; xi++{                 //computation is here             }         }     } }
  19. 19. Schedule Basics - Fuse 19 func_foo(x, y) = ... func_foo.fuse(x, y, fidx); //fused scheduling equal to the below loop: for(fidx = 0; fidx < X_MAX*Y_MAX; fidx++){     //computation is here } serialized by fidx
  20. 20. Scheduling – Vectorize, Parallel 20 func_foo(x, y) = ... func_foo.vectorize(x, 8); //vectorized scheduling equal to the below loop: for(y = 0; y < Y_MAX; y++){     for(x = 0; x < X_MAX; x+=8){         //8-LANE auto-vectorization     } } func_foo(x, y) = ... func_foo.parallel(y); //parallel scheduling equal to the below loop: #pragma omp paralle for for(y = 0; y < Y_MAX; y++){     for(x = 0; x < X_MAX; x++){         //computation is here     } } Vectorize Parallel
  21. 21. compute_at/store_at, compute_root/store_root ● store position should be same or outer than computation ● store_root => indicate the stage/function has whole frame buffer output ● compute_root => bread-first ○ and also mean store_root ● store_at(Func, Var) ○ the Func’s storage is declared in Var’s loop of Func ● compute_at( Func, Var ) ○ computed in Var’s loop of Func ○ also mean store_at(Func, Var) ● Var::outermost() 21
  22. 22. The Schedule Directives Combinations 22
  23. 23. Ahead-of-Time(AOT) Workflow 23 CodeGen Executable Halide Code Static Library (.a + .h) Function Implement Code Halide Shared Library (.so) Final Executable /Library Halide Runtime Buffer (.h)
  24. 24. AOT code structure & example 24 //box_aot.cpp: Box_2x2 DownSample class BoxDown2 : public Generator<BoxDown2> { public:     // Input/Output types are not specified, they are set in code-generation phase.     Input<Buffer<>> input{"input", 3};     Output<Func> output{"output", 3};     void generate() {         Func clamp_input = BoundaryConditions::repeat_edge(input);         output(x, y, c) = cast(output.type(),                              ((clamp_input(2*x, 2*y, c)+                             clamp_input(2*x+1, 2*y, c)+                              clamp_input(2*x, 2*y+1, c)+                              clamp_input(2*x+1, 2*y+1), c) >> 2) );     }     void schedule() {         output.vectorize(x, 16).parallel(y);     } private:     Var x, y, c; }; HALIDE_REGISTER_GENERATOR(BoxDown2, box_down2); $ clang++ -O3 -fno-rtti -std=c++11 -o box_aot box_aot.cpp $HALIDE_ROOT/tools/GenGen.cpp -I $HALIDE_ROOT/include/ -L $HALIDE_ROOT/bin/ -lHalide -ltinfo -lpthread -ldl;  //change targe to "arm-64-android" for Android usage $ LD_LIBRARY_PATH=$HALIDE_ROOT/bin/ ./box_aot -g box_down2 -o ./aot input.type=uint8 output.type=uint8 target=host
  25. 25. AOT code usage 25 //test.cpp … #include "halide_image_io.h" #include "HalideBuffer.h" #include "box_down2.h" … using namespace Halide::Tools; using Halide::Runtime::Buffer; int main(int argc, char** argv) {     Buffer<uint8_t> input = load_image(argv[1]);     Buffer<uint8_t> output(input.width()/2, input.height()/2, input.channels());     box_down2(input, output);     save_image(output, "output.png"); } $ clang++ -fno-rtti -std=c++11 -O3 -o test test.cpp aot/box_down2.a -I aot -I $HALIDE_ROOT/include -I aot/ -lpthread -ldl -ljpeg -ltinfo -lpng –lz $ ./test input.jpg
  26. 26. More about Runtime Buffer Manipulation Buffer<uint8_t> buf(width, height); //2D buffer // get buffer pointer unsigned char* buf_ptr = (unsigned char*)(buf.data()); // get ROI buffer object Buffer<uint8_t> crop_buf= buf.cropped(0, crop_x, crop_w).cropped(1, crop_y, crop_h); … // use external memory (from other place, eg: OpenCV mat) for Buffer creation uint8_t *data = (uint8*)malloc(width*height*channels); Buffer<uint8_t> external_buf(data, channels, width, height); 26
  27. 27. Put It All Together! - Matrix Multiplication • https://github.com/champyen/halide_2019 • halide_mm • Generator • mm_generator.cpp • Application • mm.cpp 27
  28. 28. Resource • Halide Official Tutorial • http://halide-lang.org/tutorials/tutorial_introduction.html • Halide Site • http://halide-lang.org/ • Halide GitHub • https://github.com/halide/Halide • https://suif.stanford.edu/~courses/cs243/lectures/l14-halide.pdf • Qualcomm Halide Software (in Hexagon SDK) • https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools 28
  29. 29. Q & A 29

Halide Language Tutorial

Views

Total views

30

On Slideshare

0

From embeds

0

Number of embeds

3

Actions

Downloads

2

Shares

0

Comments

0

Likes

0

×