Introduction to Halide
Champ Yen
champ.yen@gmail.com
https://tinyurl.com/ubqye3y
Overview of Halide
Why Halide?
Halide's answer: decouple the Algorithm from the Schedule
Algorithm: what is computed.
Schedule: where and when it's computed.
Easy for programmers to build pipelines 
• simplifies algorithm code
• improves modularity
Easy for programmers to specify & explore optimizations 
• fusion, tiling, parallelism, vectorization
• can’t break the algorithm
Easy for the compiler to generate fast code
Image Processing Tradeoffs
Experienced engineers always weigh PARALLELISM, LOCALITY and REDUNDANT WORK against each other.
Processing Strategies/Techniques Used in Image Processing Code
bh(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y))/3
bv(x, y) = (bh(x, y-1) + bh(x, y) + bh(x, y+1))/3
Breadth-First
Sliding-Window
Fusion
Tiling
Sliding-Window with Tiling
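To make the tradeoff concrete, here is a plain C++ sketch (not Halide code) of two of the strategies above applied to the bh/bv blur. `Image`, `blur_breadth_first` and `blur_fused` are hypothetical names, and borders are ignored so only the valid (W-2) x (H-2) region is produced. Breadth-first stores a full-frame bh buffer; fusion stores nothing but recomputes every bh value three times. Both use the same arithmetic, so they return identical results.

```cpp
#include <cassert>
#include <vector>

using Image = std::vector<int>;  // hypothetical: row-major, single channel

// Breadth-first: compute all of bh, then all of bv (full-frame intermediate).
Image blur_breadth_first(const Image& in, int W, int H) {
    const int OW = W - 2, OH = H - 2;
    Image bh(OW * H);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < OW; ++x)
            bh[y * OW + x] = (in[y * W + x] + in[y * W + x + 1] + in[y * W + x + 2]) / 3;
    Image out(OW * OH);
    for (int y = 0; y < OH; ++y)
        for (int x = 0; x < OW; ++x)
            out[y * OW + x] = (bh[y * OW + x] + bh[(y + 1) * OW + x] + bh[(y + 2) * OW + x]) / 3;
    return out;
}

// Fusion (inlining): evaluate bh on demand inside bv; no buffer, redundant work.
Image blur_fused(const Image& in, int W, int H) {
    const int OW = W - 2, OH = H - 2;
    auto bh = [&](int x, int y) {
        const int* row = &in[y * W];
        return (row[x] + row[x + 1] + row[x + 2]) / 3;
    };
    Image out(OW * OH);
    for (int y = 0; y < OH; ++y)
        for (int x = 0; x < OW; ++x)
            out[y * OW + x] = (bh(x, y) + bh(x, y + 1) + bh(x, y + 2)) / 3;
    return out;
}
```

Sliding-window and tiling sit between these two extremes: they trade some buffer space for less recomputation.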
Performance - It's All about Scheduling
To optimize is to find a better schedule within the space of valid schedules.
Example – C++, Optimized C++ and Halide
How Halide works
Defining Algorithms & JIT in Halide
Types – Var, Expr, Func and RDom
Func: represents a (schedulable) pipeline stage.
Func gradient;
Var: names used as variables in the definition of a Func.
Var x, y;
Expr: an expression built from Vars, other Exprs and calls to other Funcs.
Expr e = x + y;
Add a definition to the Func object:
gradient(x, y) = e;
RDom: reduction domain; computes a value from an area of inputs, acting like loops over that area.
RDom r(-1, 3); // min, extent
Expr e = sum(f(x + r, y));
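As a plain C++ sketch (not Halide code), the inline reduction above evaluates roughly the following loop: `r` starts at the RDom's min (-1) and covers extent (3) values, so the result is f(x-1, y) + f(x, y) + f(x+1, y). The function name `rdom_sum` and the use of `std::function` for `f` are illustrative assumptions.

```cpp
#include <cassert>
#include <functional>

// What `RDom r(-1, 3); sum(f(x + r, y))` computes at one (x, y).
int rdom_sum(const std::function<int(int, int)>& f, int x, int y) {
    const int r_min = -1, r_extent = 3;   // RDom r(min, extent)
    int acc = 0;                          // sum() starts from zero
    for (int r = r_min; r < r_min + r_extent; ++r)
        acc += f(x + r, y);
    return acc;
}
```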
Advanced Things in Functions
Tuple: represents a Func with multiple outputs
Func ops;
ops(x, y) = Tuple(expr_add, expr_sub, expr_mul, expr_div);
float16_t: IEEE 754 16-bit float representation
bfloat16_t: truncated 16-bit version of float32 (same 8-bit exponent, 7-bit mantissa)
Halide::* : special math and other operations, see https://halide-lang.org/docs/namespace_halide.html
Expr u8val;
u8val = u8(clamp(out, 0, 255));
u8val = saturating_cast<uint8_t>(out);
// math, others like ceil, floor, pow, sin/cos/tan ...
Expr logval = log(x);
// select works like "?:" in C, or like switch-case when given several condition/value pairs
// (v is some existing Expr)
Expr nonneg = select(v < 0, 0, v);
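For intuition, here is a plain C++ sketch of what the clamp-then-cast and `select` lines compute for a single value; it is not Halide code, and `to_u8_clamped` and `select_nonneg` are hypothetical names. It assumes float-to-integer conversion truncates toward zero, as in C.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Scalar analogue of u8(clamp(out, 0, 255)): clamp first, then narrow.
uint8_t to_u8_clamped(float out) {
    return static_cast<uint8_t>(std::min(std::max(out, 0.0f), 255.0f));
}

// Scalar analogue of select(v < 0, 0, v): the condition picks one value.
int select_nonneg(int v) {
    return v < 0 ? 0 : v;
}
```

Clamping before the cast matters: narrowing 300 straight to 8 bits would not give 255.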
JIT Image Processing Example
// load the input image
Buffer<uint8_t> input = load_image("images/rgb.png");
// function used to brighten the image
Func brighter;
// variables used to define the brighter function
Var x, y, c;
// the 'value' Expr describes the per-pixel computation
Expr value = input(x, y, c);
value = Halide::cast<float>(value);
value = value * 1.5f;
value = Halide::min(value, 255.0f);
value = Halide::cast<uint8_t>(value);
// define the function
brighter(x, y, c) = value;
// get output result
Buffer<uint8_t> output = 
        brighter.realize(input.width(), input.height(), input.channels());
// save the output to a file
save_image(output, "brighter.png");
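Under the default schedule, `brighter.realize(...)` behaves like the plain C++ loop nest below. This is a sketch, not Halide output: the function name `brighten` is hypothetical and it assumes a planar W x H x C layout in which x has stride 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical plain C++ equivalent of realizing `brighter` with the default schedule.
std::vector<uint8_t> brighten(const std::vector<uint8_t>& in, int W, int H, int C) {
    std::vector<uint8_t> out(in.size());
    for (int c = 0; c < C; ++c)                  // last Var: outermost loop
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {        // first Var: innermost loop
                int i = x + W * (y + H * c);
                float value = static_cast<float>(in[i]);   // cast<float>
                value = value * 1.5f;
                value = std::min(value, 255.0f);          // min(value, 255.0f)
                out[i] = static_cast<uint8_t>(value);     // cast<uint8_t>
            }
    return out;
}
```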
Put It All Together! - 3x3 Blur - In JIT
https://github.com/champyen/halide_2019.git
Scheduling in Halide
Scheduling Basics – Default Loop Structure
func_foo(a, b, c, … x, y, z) = …
(a: innermost loop … z: outermost loop)
// the default schedule is equivalent to the loop nest below:
for(z = 0; z < Z_MAX; z++){
    for(y = 0; y < Y_MAX; y++){
        for(x = 0; x < X_MAX; x++){
            …
                for(a = 0; a < A_MAX; a++){
                    // computation happens here
                }
            …
        }
    }
}
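A small plain C++ sketch of the default order for a 2D Func: the first argument (x) is the innermost loop, so consecutive iterations touch consecutive x values, which is unit-stride in a row-major buffer. `default_order` is a hypothetical helper that records the (x, y) visiting order.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Visiting order of func_foo(x, y) under the default schedule.
std::vector<std::pair<int, int>> default_order(int X_MAX, int Y_MAX) {
    std::vector<std::pair<int, int>> order;
    for (int y = 0; y < Y_MAX; ++y)       // last Var: outermost
        for (int x = 0; x < X_MAX; ++x)   // first Var: innermost
            order.push_back({x, y});
    return order;
}
```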
Scheduling Basics - Reordering
func_foo.reorder(z, y, x, … c, b, a);
(arguments are listed innermost first: z becomes the innermost loop, a the outermost)
// the reordered schedule is equivalent to the loop nest below:
for(a = 0; a < A_MAX; a++){
    for(b = 0; b < B_MAX; b++){
        for(c = 0; c < C_MAX; c++){
            …
                for(z = 0; z < Z_MAX; z++){
                    // computation happens here
                }
            …
        }
    }
}
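The same 2D sketch after `func_foo.reorder(y, x)`: y is now listed first and so becomes the innermost loop. Only the visiting order changes; every (x, y) point is still computed exactly once. `reordered_order` is a hypothetical helper.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Visiting order of func_foo(x, y) after func_foo.reorder(y, x).
std::vector<std::pair<int, int>> reordered_order(int X_MAX, int Y_MAX) {
    std::vector<std::pair<int, int>> order;
    for (int x = 0; x < X_MAX; ++x)       // x: now outermost
        for (int y = 0; y < Y_MAX; ++y)   // y: now innermost
            order.push_back({x, y});
    return order;
}
```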
Scheduling Basics - Splitting
func_foo(x, y) = ...
func_foo.split(y, yo, yi, 32);
// the split schedule is equivalent to the loop nest below
// (assuming Y_MAX is a multiple of 32):
for(yo = 0; yo < Y_MAX/32; yo++){
    for(yi = 0; yi < 32; yi++){
        // y = yo*32 + yi
        for(x = 0; x < X_MAX; x++){
            // computation happens here
        }
    }
}
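When the extent is not a multiple of the split factor, the outer loop must round up and the leftover iterations must be handled. The plain C++ sketch below uses a simple guard for that; this is only one option (Halide offers several tail strategies), and `split_visit` is a hypothetical helper returning the y values the split loops visit.

```cpp
#include <cassert>
#include <vector>

// y values visited by func_foo.split(y, yo, yi, factor), guarding the tail.
std::vector<int> split_visit(int Y_MAX, int factor) {
    std::vector<int> ys;
    for (int yo = 0; yo < (Y_MAX + factor - 1) / factor; ++yo)   // round up
        for (int yi = 0; yi < factor; ++yi) {
            int y = yo * factor + yi;
            if (y >= Y_MAX) continue;   // skip iterations past the end
            ys.push_back(y);
        }
    return ys;
}
```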
Scheduling Basics - Tiling
func_foo(x, y) = ...
func_foo.tile(x, y, xo, yo, xi, yi, 32, 32);
// the tiled schedule is equivalent to the loop nest below
// (assuming X_MAX and Y_MAX are multiples of 32):
for(yo = 0; yo < Y_MAX/32; yo++){
    for(xo = 0; xo < X_MAX/32; xo++){
        for(yi = 0; yi < 32; yi++){
            for(xi = 0; xi < 32; xi++){
                // computation happens here (x = xo*32 + xi, y = yo*32 + yi)
            }
        }
    }
}
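A plain C++ sketch of the tiled traversal, recording the (x, y) order: the walk finishes one T x T tile before moving to the next, which keeps the working set small. `tiled_order` is a hypothetical helper and assumes W and H are multiples of T.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Visiting order of func_foo(x, y) after tile(x, y, xo, yo, xi, yi, T, T).
std::vector<std::pair<int, int>> tiled_order(int W, int H, int T) {
    std::vector<std::pair<int, int>> order;
    for (int yo = 0; yo < H / T; ++yo)
        for (int xo = 0; xo < W / T; ++xo)
            for (int yi = 0; yi < T; ++yi)
                for (int xi = 0; xi < T; ++xi)
                    order.push_back({xo * T + xi, yo * T + yi});
    return order;
}
```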
Scheduling Basics - Fuse
func_foo(x, y) = ...
func_foo.fuse(x, y, fidx);
// the fused schedule is equivalent to the loop below:
for(fidx = 0; fidx < X_MAX*Y_MAX; fidx++){
    // computation happens here
}
x and y are serialized into the single index fidx = y*X_MAX + x
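Because the inner variable (x) varies fastest, each fused index maps back to exactly one (x, y) pair. The plain C++ sketch below recovers it; `unfuse` is a hypothetical helper. Fusing is handy before `parallel()`, since one fused loop offers more independent iterations than either original loop alone.

```cpp
#include <cassert>
#include <utility>

// Recover (x, y) from fidx for fuse(x, y, fidx), where fidx = y * X_MAX + x.
std::pair<int, int> unfuse(int fidx, int X_MAX) {
    return {fidx % X_MAX, fidx / X_MAX};
}
```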
Scheduling – Vectorize, Parallel
func_foo(x, y) = ...
func_foo.vectorize(x, 8);
// the vectorized schedule is equivalent to the loop nest below:
for(y = 0; y < Y_MAX; y++){
    for(x = 0; x < X_MAX; x+=8){
        // 8-lane vectorized computation of x .. x+7
    }
}
func_foo(x, y) = ...
func_foo.parallel(y);
// the parallel schedule is roughly equivalent to the OpenMP loop below
// (Halide actually uses its own runtime thread pool):
#pragma omp parallel for
for(y = 0; y < Y_MAX; y++){
    for(x = 0; x < X_MAX; x++){
        // computation happens here
    }
}
Vectorize
Parallel
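A scalar plain C++ model of `vectorize(x, 8)`: the x loop advances by 8 and each iteration handles 8 lanes. Real generated code would use one SIMD instruction per lane group; the inner `lane` loop only models that. The pipeline body (`in(x, y) * 2`) and the name `double_vectorized` are hypothetical, and X_MAX is assumed to be a multiple of 8.

```cpp
#include <cassert>
#include <vector>

// Models func_foo(x, y) = in(x, y) * 2 with func_foo.vectorize(x, 8).
std::vector<int> double_vectorized(const std::vector<int>& in, int X_MAX, int Y_MAX) {
    std::vector<int> out(in.size());
    for (int y = 0; y < Y_MAX; ++y)
        for (int x = 0; x < X_MAX; x += 8)           // one step per 8-lane vector
            for (int lane = 0; lane < 8; ++lane) {   // the 8 SIMD lanes
                int i = y * X_MAX + x + lane;
                out[i] = in[i] * 2;
            }
    return out;
}
```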
compute_at/store_at, compute_root/store_root
● the storage level must be the same as, or outside, the compute level
● store_root => the stage/function keeps its whole-frame output in a buffer
● compute_root => breadth-first
○ also implies store_root
● store_at(Func, Var)
○ the Func's storage is allocated inside the Var loop of the given Func
● compute_at(Func, Var)
○ computed inside the Var loop of the given Func
○ also implies store_at(Func, Var)
● Var::outermost()
○ refers to the outermost loop level of a Func
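As a plain C++ sketch (not Halide output), `bh.compute_at(bv, y)` for the earlier blur looks roughly like this: inside each iteration of bv's y loop, only the three bh rows that row needs are computed into a small scratch buffer allocated at that same level. Memory use is tiny, but each bh row is recomputed for about three consecutive output rows; pairing compute_at with an outer store_at is what enables a sliding window. `blur_compute_at_y` is a hypothetical name and borders are ignored.

```cpp
#include <cassert>
#include <vector>

using Image = std::vector<int>;  // hypothetical: row-major, single channel

// bv(x, y) with bh.compute_at(bv, y); output is the valid (W-2) x (H-2) region.
Image blur_compute_at_y(const Image& in, int W, int H) {
    const int OW = W - 2, OH = H - 2;
    Image out(OW * OH);
    for (int y = 0; y < OH; ++y) {
        std::vector<int> bh(3 * OW);              // storage at bv's y loop
        for (int r = 0; r < 3; ++r) {             // produce bh rows y .. y+2
            const int* row = &in[(y + r) * W];
            for (int x = 0; x < OW; ++x)
                bh[r * OW + x] = (row[x] + row[x + 1] + row[x + 2]) / 3;
        }
        for (int x = 0; x < OW; ++x)              // consume them for row y
            out[y * OW + x] = (bh[x] + bh[OW + x] + bh[2 * OW + x]) / 3;
    }
    return out;
}
```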
The Schedule Directives Combinations
Ahead-of-Time (AOT) Workflow
Halide code + Halide shared library (.so) → CodeGen executable
CodeGen executable → static library (.a + .h)
Function implementation code + Halide runtime Buffer (.h) + static library → final executable/library
AOT code structure & example
// box_aot.cpp: 2x2 box downsample
#include "Halide.h"
using namespace Halide;

class BoxDown2 : public Generator<BoxDown2> {
public:
    // Input/Output types are not specified here; they are set in the code-generation phase.
    Input<Buffer<>> input{"input", 3};
    Output<Func> output{"output", 3};
    void generate() {
        Func clamp_input = BoundaryConditions::repeat_edge(input);
        // widen before summing so four 8-bit values cannot overflow (assumes uint8 input)
        Func wide;
        wide(x, y, c) = cast<uint16_t>(clamp_input(x, y, c));
        output(x, y, c) = cast(output.type(),
                            (wide(2*x, 2*y, c) +
                             wide(2*x+1, 2*y, c) +
                             wide(2*x, 2*y+1, c) +
                             wide(2*x+1, 2*y+1, c)) >> 2);
    }
    void schedule() {
        output.vectorize(x, 16).parallel(y);
    }
private:
    Var x, y, c;
};
HALIDE_REGISTER_GENERATOR(BoxDown2, box_down2);
$ clang++ -O3 -fno-rtti -std=c++11 -o box_aot box_aot.cpp $HALIDE_ROOT/tools/GenGen.cpp -I $HALIDE_ROOT/include/ -L $HALIDE_ROOT/bin/ -lHalide -ltinfo -lpthread -ldl
// change target to "arm-64-android" for Android builds
$ LD_LIBRARY_PATH=$HALIDE_ROOT/bin/ ./box_aot -g box_down2 -o ./aot input.type=uint8 output.type=uint8 target=host
AOT code usage
//test.cpp
…
#include "halide_image_io.h"
#include "HalideBuffer.h"
#include "box_down2.h"
…
using namespace Halide::Tools;
using Halide::Runtime::Buffer;
int main(int argc, char** argv)
{
    Buffer<uint8_t> input = load_image(argv[1]);
    Buffer<uint8_t> output(input.width()/2, input.height()/2, input.channels());
    box_down2(input, output);
    save_image(output, "output.png");
}
$ clang++ -fno-rtti -std=c++11 -O3 -o test test.cpp aot/box_down2.a -I aot -I $HALIDE_ROOT/include -lpthread -ldl -ljpeg -ltinfo -lpng -lz
$ ./test input.jpg
More about Runtime Buffer Manipulation
Buffer<uint8_t> buf(width, height); //2D buffer
// get buffer pointer
unsigned char* buf_ptr = (unsigned char*)(buf.data());
// get ROI buffer object
Buffer<uint8_t> crop_buf = buf.cropped(0, crop_x, crop_w).cropped(1, crop_y, crop_h);
…
// wrap external memory (e.g. from an OpenCV Mat) in a Buffer
uint8_t *data = (uint8_t*)malloc(width*height*channels);
// dimension order (channels, width, height): interleaved pixels, as in an OpenCV Mat
Buffer<uint8_t> external_buf(data, channels, width, height);
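When a Buffer is built from a raw pointer and sizes, the data is treated as dense with dimension 0 at stride 1. With the (channels, width, height) order above, that makes element (c, x, y) live at the offset computed below. This is a plain C++ sketch; `interleaved_offset` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

// Offset of (c, x, y) in a dense buffer with dims (channels, width, height).
std::size_t interleaved_offset(int c, int x, int y, int channels, int width) {
    // stride 1 for c, `channels` for x, `channels * width` for y
    return static_cast<std::size_t>(c + channels * (x + width * y));
}
```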
Put It All Together! - Matrix Multiplication
• https://github.com/champyen/halide_2019
• halide_mm
  • Generator: mm_generator.cpp
  • Application: mm.cpp
Resources
• Halide Official Tutorial: http://halide-lang.org/tutorials/tutorial_introduction.html
• Halide Site: http://halide-lang.org/
• Halide GitHub: https://github.com/halide/Halide
• https://suif.stanford.edu/~courses/cs243/lectures/l14-halide.pdf
• Qualcomm Halide Software (in Hexagon SDK): https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools
Q & A
29

More Related Content

What's hot

LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)Wang Hsiangkai
 
圏論のモナドとHaskellのモナド
圏論のモナドとHaskellのモナド圏論のモナドとHaskellのモナド
圏論のモナドとHaskellのモナドYoshihiro Mizoguchi
 
範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコルMITSUNARI Shigeo
 
Cache-Oblivious データ構造入門 @DSIRNLP#5
Cache-Oblivious データ構造入門 @DSIRNLP#5Cache-Oblivious データ構造入門 @DSIRNLP#5
Cache-Oblivious データ構造入門 @DSIRNLP#5Takuya Akiba
 
Fast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible JavaFast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible JavaCharles Nutter
 
「みもふたもない」論文投稿必勝法
「みもふたもない」論文投稿必勝法「みもふたもない」論文投稿必勝法
「みもふたもない」論文投稿必勝法Makoto Iguchi
 
OPTIMAL BINARY SEARCH
OPTIMAL BINARY SEARCHOPTIMAL BINARY SEARCH
OPTIMAL BINARY SEARCHCool Guy
 
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte DataProblems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte DataJignesh Shah
 
Hybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS ApplicationsHybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS ApplicationsSteven Francia
 
Rolling Hashを殺す話
Rolling Hashを殺す話Rolling Hashを殺す話
Rolling Hashを殺す話Nagisa Eto
 
Refactoring and code smells
Refactoring and code smellsRefactoring and code smells
Refactoring and code smellsPaul Nguyen
 
C/C++プログラマのための開発ツール
C/C++プログラマのための開発ツールC/C++プログラマのための開発ツール
C/C++プログラマのための開発ツールMITSUNARI Shigeo
 
条件分岐とcmovとmaxps
条件分岐とcmovとmaxps条件分岐とcmovとmaxps
条件分岐とcmovとmaxpsMITSUNARI Shigeo
 
Dijkstra's algorithm presentation
Dijkstra's algorithm presentationDijkstra's algorithm presentation
Dijkstra's algorithm presentationSubid Biswas
 
4 greedy methodnew
4 greedy methodnew4 greedy methodnew
4 greedy methodnewabhinav108
 
Stressen's matrix multiplication
Stressen's matrix multiplicationStressen's matrix multiplication
Stressen's matrix multiplicationKumar
 
daa-unit-3-greedy method
daa-unit-3-greedy methoddaa-unit-3-greedy method
daa-unit-3-greedy methodhodcsencet
 

What's hot (20)

LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)
 
圏論のモナドとHaskellのモナド
圏論のモナドとHaskellのモナド圏論のモナドとHaskellのモナド
圏論のモナドとHaskellのモナド
 
範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル
 
Cache-Oblivious データ構造入門 @DSIRNLP#5
Cache-Oblivious データ構造入門 @DSIRNLP#5Cache-Oblivious データ構造入門 @DSIRNLP#5
Cache-Oblivious データ構造入門 @DSIRNLP#5
 
Fast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible JavaFast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible Java
 
「みもふたもない」論文投稿必勝法
「みもふたもない」論文投稿必勝法「みもふたもない」論文投稿必勝法
「みもふたもない」論文投稿必勝法
 
OPTIMAL BINARY SEARCH
OPTIMAL BINARY SEARCHOPTIMAL BINARY SEARCH
OPTIMAL BINARY SEARCH
 
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte DataProblems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
 
Hybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS ApplicationsHybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS Applications
 
Rolling Hashを殺す話
Rolling Hashを殺す話Rolling Hashを殺す話
Rolling Hashを殺す話
 
Refactoring and code smells
Refactoring and code smellsRefactoring and code smells
Refactoring and code smells
 
C/C++プログラマのための開発ツール
C/C++プログラマのための開発ツールC/C++プログラマのための開発ツール
C/C++プログラマのための開発ツール
 
条件分岐とcmovとmaxps
条件分岐とcmovとmaxps条件分岐とcmovとmaxps
条件分岐とcmovとmaxps
 
Dijkstra's algorithm presentation
Dijkstra's algorithm presentationDijkstra's algorithm presentation
Dijkstra's algorithm presentation
 
Rolling hash
Rolling hashRolling hash
Rolling hash
 
4 greedy methodnew
4 greedy methodnew4 greedy methodnew
4 greedy methodnew
 
Stressen's matrix multiplication
Stressen's matrix multiplicationStressen's matrix multiplication
Stressen's matrix multiplication
 
Disjoint sets
Disjoint setsDisjoint sets
Disjoint sets
 
直交領域探索
直交領域探索直交領域探索
直交領域探索
 
daa-unit-3-greedy method
daa-unit-3-greedy methoddaa-unit-3-greedy method
daa-unit-3-greedy method
 

Similar to Halide tutorial 2019

How to add an optimization for C# to RyuJIT
How to add an optimization for C# to RyuJITHow to add an optimization for C# to RyuJIT
How to add an optimization for C# to RyuJITEgor Bogatov
 
Circles graphic
Circles graphicCircles graphic
Circles graphicalldesign
 
Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。Mr. Vengineer
 
Machine-level Composition of Modularized Crosscutting Concerns
Machine-level Composition of Modularized Crosscutting ConcernsMachine-level Composition of Modularized Crosscutting Concerns
Machine-level Composition of Modularized Crosscutting Concernssaintiss
 
20.1 Java working with abstraction
20.1 Java working with abstraction20.1 Java working with abstraction
20.1 Java working with abstractionIntro C# Book
 
Introducción a Elixir
Introducción a ElixirIntroducción a Elixir
Introducción a ElixirSvet Ivantchev
 
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java scriptCodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java scriptCodiLime
 
Introduction to Coding
Introduction to CodingIntroduction to Coding
Introduction to CodingFabio506452
 
JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?Doug Hawkins
 
Covering a function using a Dynamic Symbolic Execution approach
Covering a function using a Dynamic Symbolic Execution approach Covering a function using a Dynamic Symbolic Execution approach
Covering a function using a Dynamic Symbolic Execution approach Jonathan Salwan
 
C++20 the small things - Timur Doumler
C++20 the small things - Timur DoumlerC++20 the small things - Timur Doumler
C++20 the small things - Timur Doumlercorehard_by
 
JCConf 2020 - New Java Features Released in 2020
JCConf 2020 - New Java Features Released in 2020JCConf 2020 - New Java Features Released in 2020
JCConf 2020 - New Java Features Released in 2020Joseph Kuo
 
C++ amp on linux
C++ amp on linuxC++ amp on linux
C++ amp on linuxMiller Lee
 
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...GeeksLab Odessa
 
Write Python for Speed
Write Python for SpeedWrite Python for Speed
Write Python for SpeedYung-Yu Chen
 
Coscup2021 - useful abstractions at rust and it's practical usage
Coscup2021 - useful abstractions at rust and it's practical usageCoscup2021 - useful abstractions at rust and it's practical usage
Coscup2021 - useful abstractions at rust and it's practical usageWayne Tsai
 
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)PROIDEA
 
COSCUP: Introduction to Julia
COSCUP: Introduction to JuliaCOSCUP: Introduction to Julia
COSCUP: Introduction to Julia岳華 杜
 

Similar to Halide tutorial 2019 (20)

How to add an optimization for C# to RyuJIT
How to add an optimization for C# to RyuJITHow to add an optimization for C# to RyuJIT
How to add an optimization for C# to RyuJIT
 
Circles graphic
Circles graphicCircles graphic
Circles graphic
 
Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。
 
Machine-level Composition of Modularized Crosscutting Concerns
Machine-level Composition of Modularized Crosscutting ConcernsMachine-level Composition of Modularized Crosscutting Concerns
Machine-level Composition of Modularized Crosscutting Concerns
 
Boosting Developer Productivity with Clang
Boosting Developer Productivity with ClangBoosting Developer Productivity with Clang
Boosting Developer Productivity with Clang
 
20.1 Java working with abstraction
20.1 Java working with abstraction20.1 Java working with abstraction
20.1 Java working with abstraction
 
Introducción a Elixir
Introducción a ElixirIntroducción a Elixir
Introducción a Elixir
 
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java scriptCodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
 
Introduction to Coding
Introduction to CodingIntroduction to Coding
Introduction to Coding
 
JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?
 
Covering a function using a Dynamic Symbolic Execution approach
Covering a function using a Dynamic Symbolic Execution approach Covering a function using a Dynamic Symbolic Execution approach
Covering a function using a Dynamic Symbolic Execution approach
 
C++20 the small things - Timur Doumler
C++20 the small things - Timur DoumlerC++20 the small things - Timur Doumler
C++20 the small things - Timur Doumler
 
JCConf 2020 - New Java Features Released in 2020
JCConf 2020 - New Java Features Released in 2020JCConf 2020 - New Java Features Released in 2020
JCConf 2020 - New Java Features Released in 2020
 
C++ amp on linux
C++ amp on linuxC++ amp on linux
C++ amp on linux
 
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
Java/Scala Lab: Анатолий Кметюк - Scala SubScript: Алгебра для реактивного пр...
 
Write Python for Speed
Write Python for SpeedWrite Python for Speed
Write Python for Speed
 
Coscup2021 - useful abstractions at rust and it's practical usage
Coscup2021 - useful abstractions at rust and it's practical usageCoscup2021 - useful abstractions at rust and it's practical usage
Coscup2021 - useful abstractions at rust and it's practical usage
 
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
 
COSCUP: Introduction to Julia
COSCUP: Introduction to JuliaCOSCUP: Introduction to Julia
COSCUP: Introduction to Julia
 
Cpp tutorial
Cpp tutorialCpp tutorial
Cpp tutorial
 

More from Champ Yen

Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Champ Yen
 
Simd programming introduction
Simd programming introductionSimd programming introduction
Simd programming introductionChamp Yen
 
Video Compression Standards - History & Introduction
Video Compression Standards - History & IntroductionVideo Compression Standards - History & Introduction
Video Compression Standards - History & IntroductionChamp Yen
 
OpenCL Kernel Optimization Tips
OpenCL Kernel Optimization TipsOpenCL Kernel Optimization Tips
OpenCL Kernel Optimization TipsChamp Yen
 
OpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming IntroductionOpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming IntroductionChamp Yen
 
Chrome OS Observation
Chrome OS ObservationChrome OS Observation
Chrome OS ObservationChamp Yen
 
Play With Android
Play With AndroidPlay With Android
Play With AndroidChamp Yen
 
Linux Porting
Linux PortingLinux Porting
Linux PortingChamp Yen
 

More from Champ Yen (8)

Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack Linux SD/MMC Driver Stack
Linux SD/MMC Driver Stack
 
Simd programming introduction
Simd programming introductionSimd programming introduction
Simd programming introduction
 
Video Compression Standards - History & Introduction
Video Compression Standards - History & IntroductionVideo Compression Standards - History & Introduction
Video Compression Standards - History & Introduction
 
OpenCL Kernel Optimization Tips
OpenCL Kernel Optimization TipsOpenCL Kernel Optimization Tips
OpenCL Kernel Optimization Tips
 
OpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming IntroductionOpenGL ES 2.x Programming Introduction
OpenGL ES 2.x Programming Introduction
 
Chrome OS Observation
Chrome OS ObservationChrome OS Observation
Chrome OS Observation
 
Play With Android
Play With AndroidPlay With Android
Play With Android
 
Linux Porting
Linux PortingLinux Porting
Linux Porting
 

Recently uploaded

Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptrcbcrtm
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineeringssuserb3a23b
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 

Recently uploaded (20)

Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 

Halide tutorial 2019

  • 1. Introduction to Halide Champ Yen champ.yen@gmail.com https://tinyurl.com/ubqye3y
  • 3. Why Halide? 3 Halide's answer: decouples Algorithm from Scheduling Algorithm: what is computed. Schedule: where and when it's computed. Easy for programmers to build pipelines  • simplifies algorithm code • improves modularity Easy for programmers to specify & explore optimizations  • fusion, tiling, parallelism, vectorization • can’t break the algorithm Easy for the compiler to generate fast code
  • 4. Image Processing Tradeoffs Experienced Engineers always keep PARALLELISM, LOCALITY and REDUNDANT WORK in mind.
  • 5. Processing Policies/Skills used in image processing coding 5 bh(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)/3 bv(x, y) = (bh(x, y-1) + bh(x, y) + bh(x, y+1)/3 Breadth-First Sliding-Window Fusion Tiling Sliding-Window with Tiling
  • 6. Performance - It's All about Scheduling 6 To optimize is to find a better scheduling in the valid space.
  • 7. Example – C++, Optimized C++ and Halide 7
  • 9. define Algorithms & JIT in Halide 9
  • 10. Types – Var, Expr, Func and RDom 10 Func: represents a (schedulable) pipeline stage. Func gradient; Var: names to use as variables in the definition of a Func. Var x, y; Expr: calculations of those variables, expressions and other functions in a function. Expr e = x + y; gradient(x, y) = e; add a definition for the Func object: RDom: reduction domain, calculate a value from a area of inputs, as loops for calculation RDom r(-1, 3) // MIN, EXTENTS Expr e = sum(f(x+r, y));
  • 11. Advanced Things in Functions
    Tuple: represents a Func with multiple outputs.
        Func ops;
        ops(x, y) = Tuple(expr_add, expr_sub, expr_mul, expr_div);
    float16_t: IEEE 754 16-bit floating-point representation.
    bfloat16_t: float32 truncated to 16 bits (same 8-bit exponent, 7-bit mantissa).
    Halide::* : special math and other operations; see https://halide-lang.org/docs/namespace_halide.html
        Expr u8val;
        u8val = u8(clamp(out, 0, 255));          // u8() comes from Halide::ConciseCasts
        u8val = saturating_cast<uint8_t>(out);
        // math: also ceil, floor, pow, sin/cos/tan ...
        Expr logval = log(x);
        // select works like "?:" in C, or like switch-case in more complex cases
        Expr pos = select(val < 0, 0, val);      // val: an existing Expr
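For intuition about the casts above, here is a plain C++ analogue. The helper names are hypothetical and the input is assumed to be an `int` for simplicity. `u8(clamp(out, 0, 255))` and `saturating_cast<uint8_t>(out)` both clamp before the narrowing cast, so out-of-range values saturate. A bare cast to an 8-bit type instead wraps modulo 256. `select` is shown as the element-wise counterpart of the ternary operator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Analogue of u8(clamp(v, 0, 255)) / saturating_cast<uint8_t>(v) for an int input.
uint8_t saturate_u8(int v) {
    return static_cast<uint8_t>(std::min(std::max(v, 0), 255));
}

// A bare narrowing cast wraps modulo 256, which is why the clamp matters.
uint8_t wrap_u8(int v) {
    return static_cast<uint8_t>(v);
}

// select(v < 0, 0, v) evaluated for a single element: the ternary operator.
int select_nonneg(int v) {
    return v < 0 ? 0 : v;
}
```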
  • 12. JIT Image Processing Example
        #include "Halide.h"
        #include "halide_image_io.h"
        using namespace Halide;
        using namespace Halide::Tools;

        int main() {
            // load the input image
            Buffer<uint8_t> input = load_image("images/rgb.png");
            // function used to brighten the image
            Func brighter;
            // variables used to define the brighter function
            Var x, y, c;
            // the 'value' Expr describes the per-pixel processing
            Expr value = input(x, y, c);
            value = Halide::cast<float>(value);
            value = value * 1.5f;
            value = Halide::min(value, 255.0f);
            value = Halide::cast<uint8_t>(value);
            // define the function
            brighter(x, y, c) = value;
            // JIT-compile and run it over the whole image
            Buffer<uint8_t> output =
                brighter.realize(input.width(), input.height(), input.channels());
            // save the output to a file
            save_image(output, "brighter.png");
            return 0;
        }
  • 13. Put It All Together! - 3x3 Blur - In JIT
    https://github.com/champyen/halide_2019.git
  • 15. Scheduling Basics: Default Loop Structure
    func_foo(a, b, c, … x, y, z) = …
    (the first argument, a, is the innermost loop; the last argument, z, is the outermost loop)
        // the default schedule is equivalent to this loop nest:
        for (z = 0; z < Z_MAX; z++) {
            for (y = 0; y < Y_MAX; y++) {
                for (x = 0; x < X_MAX; x++) {
                    …
                        for (a = 0; a < A_MAX; a++) {
                            // computation happens here
                        }
                    …
                }
            }
        }
  • 16. Scheduling Basics: Reordering
    func_foo.reorder(z, y, x, … c, b, a);
    (arguments are listed from innermost to outermost: z becomes the innermost loop and a the outermost)
        // the reordered schedule is equivalent to this loop nest:
        for (a = 0; a < A_MAX; a++) {
            for (b = 0; b < B_MAX; b++) {
                for (c = 0; c < C_MAX; c++) {
                    …
                        for (z = 0; z < Z_MAX; z++) {
                            // computation happens here
                        }
                    …
                }
            }
        }
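A schedule only changes the iteration order, so both loop nests must visit exactly the same points. The plain C++ sketch below uses no Halide, and its function names are hypothetical. It records the visiting order of a 2-D `func_foo(x, y)` under the default schedule and under `reorder(y, x)`. Because `reorder(y, x)` lists `y` first, it makes `y` the innermost loop.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;  // (x, y)

// Default schedule for func_foo(x, y): x (first argument) is the innermost loop.
std::vector<Point> default_order(int X_MAX, int Y_MAX) {
    assert(X_MAX >= 0 && Y_MAX >= 0);
    std::vector<Point> visited;
    for (int y = 0; y < Y_MAX; ++y)
        for (int x = 0; x < X_MAX; ++x)
            visited.push_back({x, y});
    return visited;
}

// After func_foo.reorder(y, x): y is innermost, x is outermost.
std::vector<Point> reordered(int X_MAX, int Y_MAX) {
    assert(X_MAX >= 0 && Y_MAX >= 0);
    std::vector<Point> visited;
    for (int x = 0; x < X_MAX; ++x)
        for (int y = 0; y < Y_MAX; ++y)
            visited.push_back({x, y});
    return visited;
}
```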
  • 17. Scheduling Basics: Splitting
    func_foo(x, y) = ...;
    func_foo.split(y, yo, yi, 32);
        // the split schedule is equivalent to this loop nest
        // (assuming Y_MAX is a multiple of 32):
        for (yo = 0; yo < Y_MAX/32; yo++) {
            for (yi = 0; yi < 32; yi++) {
                // y = yo*32 + yi
                for (x = 0; x < X_MAX; x++) {
                    // computation happens here
                }
            }
        }
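The loop on the slide assumes that `Y_MAX` is a multiple of 32. Otherwise `Y_MAX/32` silently drops the last rows. Halide handles the leftover iterations with a tail strategy chosen by the schedule, for example guarding the partial chunk with an `if` or shifting the last chunk inward. Which strategy applies by default depends on the Func and the Halide version. The plain C++ sketch below shows only the guarding approach. It uses a hypothetical `split_rows` helper that returns the `y` values in the order they are visited.

```cpp
#include <cassert>
#include <vector>

// Split of y by `factor` with a guarded partial last chunk
// (GuardWithIf-style tail handling); returns y values in visiting order.
std::vector<int> split_rows(int Y_MAX, int factor) {
    assert(factor > 0 && Y_MAX >= 0);
    std::vector<int> rows;
    int yo_extent = (Y_MAX + factor - 1) / factor;  // ceiling division
    for (int yo = 0; yo < yo_extent; ++yo)
        for (int yi = 0; yi < factor; ++yi) {
            int y = yo * factor + yi;  // reconstruct the original index
            if (y >= Y_MAX) break;     // guard the partial last chunk
            rows.push_back(y);
        }
    return rows;
}
```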
  • 18. Scheduling Basics: Tiling
    func_foo(x, y) = ...;
    func_foo.tile(x, y, xo, yo, xi, yi, 32, 32);
        // the tiled schedule is equivalent to this loop nest
        // (assuming X_MAX and Y_MAX are multiples of 32):
        for (yo = 0; yo < Y_MAX/32; yo++) {
            for (xo = 0; xo < X_MAX/32; xo++) {
                for (yi = 0; yi < 32; yi++) {
                    for (xi = 0; xi < 32; xi++) {
                        // computation happens here
                    }
                }
            }
        }
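A quick way to confirm that tiling is only a reordering is to count visits. This plain C++ sketch uses a hypothetical `tile_visits` helper and assumes, as the slide does, that the extents are multiples of the tile size. The four-level loop nest should visit every `(x, y)` in the domain exactly once. The expressions `x = xo*T + xi` and `y = yo*T + yi` recover the original coordinates.

```cpp
#include <cassert>
#include <vector>

// Counts how often the tiled loop nest visits each (x, y), stored row-major.
std::vector<int> tile_visits(int X_MAX, int Y_MAX, int T) {
    assert(T > 0 && X_MAX % T == 0 && Y_MAX % T == 0);
    std::vector<int> visits(X_MAX * Y_MAX, 0);
    for (int yo = 0; yo < Y_MAX / T; ++yo)
        for (int xo = 0; xo < X_MAX / T; ++xo)
            for (int yi = 0; yi < T; ++yi)
                for (int xi = 0; xi < T; ++xi) {
                    int x = xo * T + xi;
                    int y = yo * T + yi;
                    ++visits[y * X_MAX + x];  // stands in for the computation
                }
    return visits;
}
```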
  • 19. Scheduling Basics: Fuse
    func_foo(x, y) = ...;
    func_foo.fuse(x, y, fidx);
        // the fused schedule is equivalent to this single loop, serialized by fidx
        // (x = fidx % X_MAX, y = fidx / X_MAX):
        for (fidx = 0; fidx < X_MAX*Y_MAX; fidx++) {
            // computation happens here
        }
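The fused index can be decoded back into the original coordinates with a division and a remainder, with `x` as the inner dimension. This plain C++ sketch uses a hypothetical `fused_fill` helper, and `x + 100*y` stands in for the real `func_foo(x, y)`. A single fused loop like this is often used to give a parallel schedule one large loop to distribute across threads.

```cpp
#include <cassert>
#include <vector>

// One loop over fidx in [0, X_MAX * Y_MAX); x (inner) and y (outer) are recovered.
std::vector<int> fused_fill(int X_MAX, int Y_MAX) {
    assert(X_MAX > 0 && Y_MAX > 0);
    std::vector<int> out(X_MAX * Y_MAX);
    for (int fidx = 0; fidx < X_MAX * Y_MAX; ++fidx) {
        int x = fidx % X_MAX;              // inner dimension
        int y = fidx / X_MAX;              // outer dimension
        out[y * X_MAX + x] = x + 100 * y;  // stand-in for func_foo(x, y)
    }
    return out;
}
```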
  • 20. Scheduling: Vectorize, Parallel
    Vectorize:
        func_foo(x, y) = ...;
        func_foo.vectorize(x, 8);
        // the vectorized schedule is equivalent to this loop nest:
        for (y = 0; y < Y_MAX; y++) {
            for (x = 0; x < X_MAX; x += 8) {
                // one 8-lane vector operation covers x .. x+7
            }
        }
    Parallel:
        func_foo(x, y) = ...;
        func_foo.parallel(y);
        // the parallel schedule is equivalent to this loop nest:
        #pragma omp parallel for
        for (y = 0; y < Y_MAX; y++) {
            for (x = 0; x < X_MAX; x++) {
                // computation happens here
            }
        }
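The two directives are frequently combined, as in `vectorize(x, 8).parallel(y)`. The plain C++ sketch below mirrors that loop nest using a hypothetical `compute_vec_par` function, with `func_foo(x, y) = x + y` as a stand-in. Independent rows go to an OpenMP `parallel for`, and `x` is walked in 8-lane chunks that a C++ compiler may turn into SIMD code. Halide emits vector instructions itself and has its own tail strategies. This sketch simply finishes each row with a scalar loop when `X_MAX` is not a multiple of 8. Built without `-fopenmp`, it runs serially and gives identical results.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// func_foo(x, y) = x + y, evaluated with a vectorize(x, 8).parallel(y)-shaped loop nest.
std::vector<int> compute_vec_par(int X_MAX, int Y_MAX) {
    assert(X_MAX >= 0 && Y_MAX >= 0);
    std::vector<int> out(static_cast<std::size_t>(X_MAX) * Y_MAX);
    #pragma omp parallel for
    for (int y = 0; y < Y_MAX; ++y) {           // parallel(y): rows are independent
        std::size_t row = static_cast<std::size_t>(y) * X_MAX;
        int x = 0;
        for (; x + 8 <= X_MAX; x += 8)          // vectorize(x, 8): 8-lane chunks
            for (int lane = 0; lane < 8; ++lane)
                out[row + x + lane] = x + lane + y;
        for (; x < X_MAX; ++x)                  // scalar tail when X_MAX % 8 != 0
            out[row + x] = x + y;
    }
    return out;
}
```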
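  • 21. compute_at/store_at, compute_root/store_root
    ● The storage level must be the same as, or outside, the computation level.
    ● store_root: the stage/function's output is stored in a whole-frame buffer.
    ● compute_root: breadth-first; the whole stage is computed before its consumers run.
        ○ also implies store_root
    ● store_at(Func, Var)
        ○ the Func's storage is allocated inside the loop over Var of the given (consumer) Func
    ● compute_at(Func, Var)
        ○ computed inside the loop over Var of the given (consumer) Func
        ○ also implies store_at(Func, Var)
    ● Var::outermost(): refers to the outermost loop level of a Func, usable as the Var in compute_at/store_at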
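As an illustration of `bh.compute_at(bv, y)` for the blur from slide 5, here is a plain C++ sketch with a hypothetical function name that produces interior pixels only. Each output row needs three `bh` rows, and those are computed into a small buffer declared inside the `y` loop, which is also where `compute_at` places the storage. This gives good locality, but each `bh` row is recomputed for three consecutive output rows. Storing at the root while computing at `y` is what lets Halide turn this into a sliding window that reuses rows.

```cpp
#include <cassert>
#include <vector>

// bh computed at bv's y loop: per output row, the 3 needed bh rows go into a
// row buffer whose lifetime is one y iteration (interior pixels only).
std::vector<float> blur_compute_at_y(const std::vector<float>& in, int w, int h) {
    assert(static_cast<int>(in.size()) == w * h);
    std::vector<float> bv(in.size(), 0.0f);
    for (int y = 1; y < h - 1; ++y) {
        std::vector<float> bh(3 * w, 0.0f);  // storage scoped to this y iteration
        for (int dy = -1; dy <= 1; ++dy)     // compute bh rows y-1, y, y+1
            for (int x = 1; x < w - 1; ++x) {
                const float* row = &in[(y + dy) * w];
                bh[(dy + 1) * w + x] = (row[x - 1] + row[x] + row[x + 1]) / 3.0f;
            }
        for (int x = 1; x < w - 1; ++x)
            bv[y * w + x] = (bh[x] + bh[w + x] + bh[2 * w + x]) / 3.0f;
    }
    return bv;
}
```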
  • 22. Combinations of Schedule Directives
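  • 23. Ahead-of-Time (AOT) Workflow
    [Diagram: Halide code is built with the Halide shared library (.so) into a CodeGen executable, which emits a static library (.a + .h); the function implementation code, the static library and the Halide Runtime Buffer header (.h) are then built into the final executable/library.]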
  • 24. AOT Code Structure & Example
        // box_aot.cpp: 2x2 box downsample (BoxDown2)
        #include "Halide.h"
        using namespace Halide;

        class BoxDown2 : public Generator<BoxDown2> {
        public:
            // Input/Output element types are not specified here; they are set in the code-generation phase.
            Input<Buffer<>> input{"input", 3};
            Output<Func> output{"output", 3};

            void generate() {
                Func clamp_input = BoundaryConditions::repeat_edge(input);
                // widen to 16 bits so the sum of four 8-bit samples cannot overflow
                // (assumes an integer input of at most 8 bits, e.g. input.type=uint8)
                Func in16;
                in16(x, y, c) = cast<uint16_t>(clamp_input(x, y, c));
                output(x, y, c) = cast(output.type(),
                                       (in16(2*x, 2*y, c) + in16(2*x+1, 2*y, c) +
                                        in16(2*x, 2*y+1, c) + in16(2*x+1, 2*y+1, c)) >> 2);
            }

            void schedule() {
                output.vectorize(x, 16).parallel(y);
            }

        private:
            Var x, y, c;
        };

        HALIDE_REGISTER_GENERATOR(BoxDown2, box_down2);
    Build the generator, then run it to emit the static library and header:
        $ clang++ -O3 -fno-rtti -std=c++11 -o box_aot box_aot.cpp $HALIDE_ROOT/tools/GenGen.cpp -I $HALIDE_ROOT/include/ -L $HALIDE_ROOT/bin/ -lHalide -ltinfo -lpthread -ldl
        // change target to "arm-64-android" for Android
        $ LD_LIBRARY_PATH=$HALIDE_ROOT/bin/ ./box_aot -g box_down2 -o ./aot input.type=uint8 output.type=uint8 target=host
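When checking the generated `box_down2` library, it helps to have a scalar reference. The plain C++ sketch below computes the same 2x2 average using a hypothetical `box_down2_ref` function. It assumes a planar layout with `x` innermost, then `y`, then `c`. Coordinates are clamped to the image the way `BoundaryConditions::repeat_edge` does, and the four samples are summed in `int` before the `>> 2`, so 8-bit inputs cannot overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// out(x, y, c) = mean of the 2x2 block at (2x, 2y); planar layout (c * h + y) * w + x.
std::vector<uint8_t> box_down2_ref(const std::vector<uint8_t>& in, int w, int h, int ch) {
    assert(w > 0 && h > 0 && ch > 0 && static_cast<int>(in.size()) == w * h * ch);
    auto at = [&](int x, int y, int c) -> int {
        x = std::min(std::max(x, 0), w - 1);  // repeat_edge on x
        y = std::min(std::max(y, 0), h - 1);  // repeat_edge on y
        return in[(static_cast<std::size_t>(c) * h + y) * w + x];
    };
    int ow = w / 2, oh = h / 2;
    std::vector<uint8_t> out(static_cast<std::size_t>(ow) * oh * ch);
    for (int c = 0; c < ch; ++c)
        for (int y = 0; y < oh; ++y)
            for (int x = 0; x < ow; ++x) {
                // sum in int so four 8-bit samples cannot overflow, then divide by 4
                int sum = at(2 * x, 2 * y, c) + at(2 * x + 1, 2 * y, c) +
                          at(2 * x, 2 * y + 1, c) + at(2 * x + 1, 2 * y + 1, c);
                out[(static_cast<std::size_t>(c) * oh + y) * ow + x] = static_cast<uint8_t>(sum >> 2);
            }
    return out;
}
```

When the output is `width/2 x height/2`, as in the usage slide, the clamp never triggers. It only matters for other output sizes.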
  • 25. AOT Code Usage
        // test.cpp
        …
        #include "halide_image_io.h"
        #include "HalideBuffer.h"
        #include "box_down2.h"
        …
        using namespace Halide::Tools;
        using Halide::Runtime::Buffer;

        int main(int argc, char** argv) {
            Buffer<uint8_t> input = load_image(argv[1]);
            Buffer<uint8_t> output(input.width()/2, input.height()/2, input.channels());
            box_down2(input, output);
            save_image(output, "output.png");
        }
    $ clang++ -fno-rtti -std=c++11 -O3 -o test test.cpp aot/box_down2.a -I aot -I $HALIDE_ROOT/include -I $HALIDE_ROOT/tools -lpthread -ldl -ljpeg -ltinfo -lpng -lz
    $ ./test input.jpg
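  • 26. More about Runtime Buffer Manipulation
        Buffer<uint8_t> buf(width, height);  // 2D buffer
        // get the buffer pointer
        uint8_t* buf_ptr = buf.data();
        // get an ROI buffer object: cropped(dimension, min, extent); it shares buf's memory
        Buffer<uint8_t> crop_buf = buf.cropped(0, crop_x, crop_w).cropped(1, crop_y, crop_h);
        …
        // wrap external memory (allocated elsewhere, e.g. an OpenCV Mat) in a Buffer;
        // the Buffer does not take ownership, so freeing data stays the caller's job
        uint8_t* data = (uint8_t*)malloc(width * height * channels);
        // sizes are given innermost first (channels, width, height): an interleaved layout,
        // so elements are addressed as external_buf(c, x, y)
        Buffer<uint8_t> external_buf(data, channels, width, height);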
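A cropped `Buffer` is a view, not a copy. To illustrate the idea only, the sketch below uses a hypothetical `View2D` struct in plain C++ rather than Halide's `Buffer` class. Cropping shifts the origin pointer and shrinks the extents. The row stride and the original coordinate system stay the same, as with Halide's crops, which change only the min and extent of a dimension. Writes through the crop therefore land in the parent's memory.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical minimal 2D view: shared memory, shifted origin, unchanged stride.
struct View2D {
    uint8_t* origin;                          // address of element (min_x, min_y)
    int min_x, min_y, width, height, stride;  // stride: elements between rows

    uint8_t& operator()(int x, int y) const {
        assert(x >= min_x && x < min_x + width && y >= min_y && y < min_y + height);
        return origin[(y - min_y) * stride + (x - min_x)];
    }

    // Crop to [x0, x0 + w) x [y0, y0 + h), keeping the same coordinates.
    View2D cropped(int x0, int y0, int w, int h) const {
        assert(w > 0 && h > 0 && x0 >= min_x && y0 >= min_y &&
               x0 + w <= min_x + width && y0 + h <= min_y + height);
        return {&(*this)(x0, y0), x0, y0, w, h, stride};
    }
};
```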
  • 27. Put It All Together! - Matrix Multiplication
    • https://github.com/champyen/halide_2019
      • halide_mm
        • Generator: mm_generator.cpp
        • Application: mm.cpp
  • 28. Resources
    • Halide official tutorial: http://halide-lang.org/tutorials/tutorial_introduction.html
    • Halide site: http://halide-lang.org/
    • Halide GitHub: https://github.com/halide/Halide
    • https://suif.stanford.edu/~courses/cs243/lectures/l14-halide.pdf
    • Qualcomm Halide software (in the Hexagon SDK): https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools