What Is TensorFlow XLA Doing Under the Hood?

TensorFlow User Group Hardware Division #2
2017/4/21
What Is TensorFlow XLA Doing Under the Hood?
A walk through the source code of XLA
as released in TensorFlow r1.0 (r1.1)
@Vengineer

About me
Study groups organized:
Xilinx Zynq MPSoC (2016/02/20)
Altera SDK for OpenCL (2016/06/10)
Xilinx SDSoC (2017/01/28)
PYNQ祭り (PYNQ Festival) (2017/03/04)
Blog: Vengineerの戯言
http://blogs.yahoo.co.jp/verification_engineer
Twitter: @Vengineer
Book: SystemVerilogスタートアップ
http://www.cqpub.co.jp/hanbai/books/36/36191.htm

Design Solution Forum
http://www.dsforum.jp/
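Friday, October 13, 2017, in Shin-Yokohama
Now in its fourth year
More than 500 attendees every year
Five RISC-V talks are planned this year
Now recruiting speakers on deep learning topics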

"PYNQ Festival" extra round:
FPGA deep learning hands-on social (FPGAディープラーニング実践懇親会)
(2017/05/20)
https://fpgax.connpass.com/event/52935/
Let's actually try BNN-PYNQ
Sign up now

This deck summarizes my analysis of
the code related to TensorFlow XLA.
It is based on TensorFlow r1.1.
Use it at your own risk.

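What is TensorFlow XLA?
https://www.tensorflow.org/performance/xla/
XLA (Accelerated Linear Algebra) is a domain-specific compiler for
linear algebra that optimizes TensorFlow computations. The results are
improvements in speed, memory usage, and portability on server and
mobile platforms. Initially, most users will not see large benefits
from XLA, but they can start experimenting by using XLA through
JIT (Just-In-Time) compilation or AOT (Ahead-Of-Time) compilation.
Developers targeting new hardware accelerators are especially
encouraged to try XLA.
(From the official documentation; the original slide showed a Google Translate rendering of the English text.)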

I also wrote about this on my blog
TensorFlow XLAの衝撃 ("The Impact of TensorFlow XLA")
February 20, 2017
http://blogs.yahoo.co.jp/verification_engineer/71016304.html

So what is TensorFlow XLA?
"XLA compiler", presented at
Recap of TensorFlow DEV SUMMIT 2017
by Masahiko Adachi (Kabuku Inc.)
Please see his slides and commentary.
For details, see "TensorFlow XLAの情報と発表".

What this talk covers
0) How the TensorFlow graph that starts from a Python expression
   gets transformed
1) JIT (Just-In-Time) compilation
   Single machine only, with a single GPU
2) AOT (Ahead-Of-Time) compilation
   CPU only
   x86-64/ARM/AARCH64/PowerPC

0) How the TensorFlow graph that starts
from a Python expression
gets transformed

TensorFlow XLA
still works
only on a single machine,
so this walkthrough uses the DirectSession case

How Session.run works
python/client/session.py
SessionInterface => BaseSession => Session
def run(self, fetches, feed_dict=None,
        options=None, run_metadata=None):
  _run
    _do_run
      tf_session.TF_Run
From here on we are in C++
TF_Run function in c/c_api.cc
  TF_Run_Helper function in c/c_api.cc
    Session::Run (core/public/session.h)
      DirectSession::Run

DirectSession::Run in C++
DirectSession::Run (core/common_runtime/direct_session.cc)
Create the Executors
TF_RETURN_IF_ERROR(GetOrCreateExecutors(pool, input_tensor_names,
                                        output_names, target_nodes,
                                        &executors_and_keys,
                                        &run_state_args));
There are multiple Executors; each one runs independently,
and communication between Executors is asynchronous

DirectSession::Run in C++ (continued)
DirectSession::Run (core/common_runtime/direct_session.cc)
The execution part
for (const auto& item : executors_and_keys->items) {
  item.executor->RunAsync(args, barrier->Get());  // each Executor runs asynchronously
}
Wait until every Executor has finished
WaitForNotification(&run_state, &step_cancellation_manager,
                    run_options.timeout_in_ms() > 0
                        ? run_options.timeout_in_ms()
                        : operation_timeout_in_ms_);
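
The "start everything asynchronously, then wait for one notification" shape of this loop can be sketched in plain Python. This is only an analogy built on threads: run_executors, finish and the executor names are hypothetical and do not correspond to TensorFlow APIs.

```python
import threading

def run_executors(executors, timeout_s=None):
    """Start every executor asynchronously, then block until all have reported done."""
    if not executors:
        return {}
    results = {}
    remaining = len(executors)
    lock = threading.Lock()
    all_done = threading.Event()  # stands in for the notification that Run waits on

    def finish(name, value):
        nonlocal remaining
        with lock:
            results[name] = value
            remaining -= 1
            if remaining == 0:
                all_done.set()  # the last executor to finish wakes the waiter

    for name, work in executors.items():
        # default arguments bind the current name/work to this thread
        threading.Thread(target=lambda n=name, w=work: finish(n, w())).start()

    if not all_done.wait(timeout_s):
        raise TimeoutError("executors did not finish in time")
    return results

results = run_executors({
    "cpu": lambda: "cpu partition done",
    "gpu": lambda: "gpu partition done",
})
```

The countdown under a lock plays the role of the barrier callback: no executor knows about the others, yet the caller wakes up exactly once.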

executor->RunAsync
Executor::RunAsync (core/common_runtime/executor.h)
ExecutorImpl::RunAsync
ExecutorState::RunAsync
ExecutorState::ScheduleReady
ExecutorState::Process (core/common_runtime/executor.cc)
  ・device->ComputeAsync for the asynchronous case
  ・device->Compute for the synchronous case

Wait,
so where does the graph
actually get built?

Right here
The CreateGraphs function, called from DirectSession::GetOrCreateExecutors,
builds the graph and partitions it
TF_RETURN_IF_ERROR(CreateGraphs(options, &graphs, &ek->flib_def,
                                run_state_args));
After that,
each partition of the graph is run by an Executor

The graph is built in the following steps
  1) Add Feed/Fetch nodes
     subgraph::RewriteGraphForExecution
     (core/graph/subgraph.cc)
  2) Placement
     SimplePlacer::Run
     (core/common_runtime/simple_placer.cc)
  3) Partition the graph (by device and execution unit)
     Partition
     (core/graph/graph_partition.cc)

RewriteGraphForExecution
core/graph/subgraph.cc
Add Feed nodes (_Recv : .Attr("client_terminated", true))
if (!fed_outputs.empty()) {
  FeedInputs(g, device_info, fed_outputs, &name_index);
}
Add Fetch nodes (_Send : .Attr("client_terminated", true))
std::vector<Node*> fetch_nodes;
if (!fetch_outputs.empty()) {
  FetchOutputs(g, device_info, fetch_outputs,
               &name_index, &fetch_nodes);
}

SimplePlacer::Run
core/common_runtime/simple_placer.cc
1. First add all of the nodes.
2. Enumerate the constraint edges,
and use them to update the disjoint node set.
3. For each node, assign a device based on the constraints in the
disjoint node set.
4. Perform a second pass assignment for those nodes explicitly
skipped during the first pass.
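
Steps 1 to 3 read like a union-find over colocation constraints. The following is a minimal sketch under that reading; place, constraint_edges and requested are hypothetical names, and the second pass of step 4 plus all device-capability checks are left out.

```python
def place(nodes, constraint_edges, requested, default_device="cpu"):
    """Assign a device to every node, keeping constrained nodes together."""
    parent = {n: n for n in nodes}  # step 1: every node starts in its own set

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in constraint_edges:  # step 2: merge nodes that must share a device
        parent[find(a)] = find(b)

    group_device = {}
    for n, device in requested.items():  # collect the device requested inside each set
        root = find(n)
        if group_device.setdefault(root, device) != device:
            raise ValueError("conflicting device requests in one group")

    # step 3: each node takes its group's device, or the default if none was requested
    return {n: group_device.get(find(n), default_device) for n in nodes}

placement = place(
    nodes=["x", "two", "mul", "var", "assign"],
    constraint_edges=[("var", "assign")],
    requested={"mul": "gpu", "var": "gpu"},
)
```

Here "assign" never asked for a device, but it lands on gpu because it is constrained to live with "var".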

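Partition
core/graph/graph_partition.cc
  1) Split the graph into units that can each run on one device
     Devices: cpu / gpu / XLA_CPU / XLA_GPU
  2) Add _Send / _Recv nodes between devices
  For example, on a cpu => gpu edge,
  a _Send node is added on the cpu side and
  a _Recv node is added on the gpu side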
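
The two partitioning steps can be sketched on a toy edge list. This is an illustration only: partition, the edge representation and the _Send(...)/_Recv(...) naming scheme are assumptions made for this example, not what graph_partition.cc produces.

```python
def partition(edges, device_of):
    """Split (src, dst) edges by device, adding a _Send/_Recv pair on each cross-device edge."""
    parts = {}
    for src, dst in edges:
        src_dev, dst_dev = device_of[src], device_of[dst]
        if src_dev == dst_dev:
            parts.setdefault(src_dev, []).append((src, dst))
        else:
            send = f"_Send({src}->{dst_dev})"  # lives on the producer's device
            recv = f"_Recv({src}<-{src_dev})"  # lives on the consumer's device
            parts.setdefault(src_dev, []).append((src, send))
            parts.setdefault(dst_dev, []).append((recv, dst))
    return parts

# y = x * 2 with the multiplication placed on gpu
edges = [("x", "mul"), ("two", "mul"), ("mul", "y")]
device_of = {"x": "cpu", "two": "gpu", "mul": "gpu", "y": "cpu"}
parts = partition(edges, device_of)
```

Each partition ends up self-contained: every edge in it connects two nodes on the same device, and data crosses devices only through a matching _Send/_Recv pair.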

Let's check this
with sample code

With the device set to gpu
def test_gpu(self):
    with tf.Session() as sess:
        x = tf.placeholder(tf.float32, [2], name="x")
        with tf.device("gpu"):
            y = x * 2
        result = sess.run(y, {x: [1.5, 0.5]})

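0) Initial graph
[Figure: Feed(x) and Const are the inputs of Mul; the output of Mul is Fetch(y)]

1) Add Feed/Fetch nodes
[Figure: Feed(x) now enters through a _Recv node and Fetch(y) leaves through a _Send node; _Recv and Const feed Mul, and Mul feeds _Send]

2) Placement
[Figure: the same graph with device labels; Feed(x) and Fetch(y) are on cpu, and the rest of the graph is labeled gpu]

3) Partition the graph
[Figure: a cpu partition holding Feed(x) and Fetch(y) and a gpu partition holding Const and Mul, connected through _Send/_Recv pairs]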

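Using JIT Compilation
https://www.tensorflow.org/performance/xla/jit
The TensorFlow/XLA JIT compiler compiles and runs parts of a
TensorFlow graph with XLA.
The benefit over the standard TensorFlow implementation is that XLA can
fuse multiple operators (kernel fusion) into a small number of compiled
kernels. Compared with executing operators one at a time, as the
TensorFlow Executors do, fusing operators can reduce memory bandwidth
requirements and improve performance.
(From the official documentation; the original slide showed a Google Translate rendering of the English text.)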
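
Why fusion saves memory bandwidth can be seen in a toy example. Both functions below are hypothetical and run on plain Python lists; they only mimic "one kernel per operator" versus "one fused kernel" and say nothing about how XLA actually generates code.

```python
def unfused(xs):
    """One 'kernel' per operator: every step writes a full temporary buffer."""
    t1 = [x * 2.0 for x in xs]        # kernel 1: multiply
    t2 = [t + 1.0 for t in t1]        # kernel 2: add (reads t1 back)
    return [max(t, 0.0) for t in t2]  # kernel 3: relu (reads t2 back)

def fused(xs):
    """A single pass: each element is read once and written once, no temporaries."""
    return [max(x * 2.0 + 1.0, 0.0) for x in xs]

xs = [-1.0, 0.0, 1.5]
unfused_result = unfused(xs)
fused_result = fused(xs)
```

The results are identical; what differs is that the unfused version makes three trips over memory and allocates two intermediate buffers.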

Building with JIT enabled
"TensorFlowでXLAを使えるようにする" (Making XLA usable in TensorFlow)
by @adamrocker
http://blog.adamrocker.com/2017/03/build-tensorflow-xla-compiler.html
The section
"A: TensorFlowのビルド" (A: Building TensorFlow)
explains this in detail.

Directory layout
The compiler directory is TensorFlow XLA
・aot
・jit
・tests
・tf2xla
・xla
JIT-related code lives mainly in the jit directory

TensorFlow w/XLA: TensorFlow, Compiled! Expressiveness with performance
https://autodiff-workshop.github.io/slides/JeffDean.pdf
Devices supported by XLA

First,
let's see how the graph is changed
by the TensorFlow XLA JIT

Change gpu to XLA_CPU
def testXLA_JIT(self):
    with tf.Session() as sess:
        x = tf.placeholder(tf.float32, [2], name="x")
        with tf.device("device:XLA_CPU:0"):
            y = x * 2
        result = sess.run(y, {x: [1.5, 0.5]})

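2) Placement
[Figure: the same graph as in the gpu case, with the XLA_CPU label in place of gpu; Feed(x) and Fetch(y) stay on cpu]

[Figure: after partitioning, a cpu partition holding Feed(x) and Fetch(y) and an XLA_CPU partition holding Const and Mul, connected through _Send/_Recv pairs]

[Figure: in the XLA_CPU partition, a single _XlaLaunch node now sits between _Recv and _Send]

Multiple ops are converted into one _XlaLaunch op
[Figure: Mul and Const folded into a single _XlaLaunch node on XLA_CPU]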

Whaaat?
Why does it turn into
_XlaLaunch?
How come?

The graph is transformed by passes
compiler/jit/jit_compilation_pass_registration.cc
The REGISTER_OPTIMIZATION macro adds passes to
OptimizationPassRegistry::POST_REWRITE_FOR_EXEC
  ・MarkForCompilationPass // marks the nodes that can be compiled
    mark_for_compilation_pass.{h,cc}
  ・EncapsulateSubgraphsPass // turns each subgraph into a function node
    encapsulate_subgraphs_pass.{h,cc}
  ・BuildXlaLaunchOpsPass // replaces function nodes with _XlaLaunch
    build_xla_launch_ops_pass.{h,cc}
They run in order from top to bottom
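
A registry that groups passes by stage and runs them in a fixed order could look like the sketch below. PassRegistry, the numeric phase values and the list-appending lambdas are all assumptions made for illustration; only the grouping name and the three pass names come from the slide.

```python
from collections import defaultdict

class PassRegistry:
    """Toy registry: passes are grouped by stage and ordered by a phase number."""
    def __init__(self):
        self._groups = defaultdict(list)

    def register(self, grouping, phase, name, fn):
        self._groups[grouping].append((phase, name, fn))

    def run_grouping(self, grouping, graph):
        order = []
        for phase, name, fn in sorted(self._groups[grouping], key=lambda p: p[0]):
            graph = fn(graph)  # each pass receives the previous pass's output
            order.append(name)
        return graph, order

registry = PassRegistry()
# Registered out of order on purpose; the phase values are illustrative.
registry.register("POST_REWRITE_FOR_EXEC", 30, "BuildXlaLaunchOpsPass", lambda g: g + ["launch"])
registry.register("POST_REWRITE_FOR_EXEC", 10, "MarkForCompilationPass", lambda g: g + ["mark"])
registry.register("POST_REWRITE_FOR_EXEC", 20, "EncapsulateSubgraphsPass", lambda g: g + ["encapsulate"])
graph, order = registry.run_grouping("POST_REWRITE_FOR_EXEC", [])
```

Keying the order on an explicit number keeps it independent of registration order, which matters when registrations are spread across files.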

When do these passes run?
  1) Add Feed/Fetch nodes
     subgraph::RewriteGraphForExecution
     PRE_PLACEMENT passes run here
  2) Placement
     POST_PLACEMENT passes run here
     POST_REWRITE_FOR_EXEC passes run in
     SimpleGraphExecutionState::BuildGraph
  3) Partition the graph
     Partition
     POST_PARTITIONING passes run here

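TensorFlow XLA: what the JIT does
The nodes of each subgraph that can run on the same device
are squeezed together into a single
_XlaLaunch Op
and executed inside it
_XlaLaunch is implemented as
an Op dedicated to TensorFlow XLA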
How is the _XlaLaunch Op implemented?
・Register the new Op in a C++ file
・Implement the Op in C++
compiler/jit/kernels/xla_local_launch_op.h
compiler/jit/kernels/xla_local_launch_op.cc

Registering the _XlaLaunch Op
REGISTER_OP("_XlaLaunch")
.Input("constants: Tconstants")
.Attr("Tconstants: list(type) >= 0")
.Input("args: Targs")
.Attr("Targs: list(type) >= 0")
.Output("results: Tresults")
.Attr("Tresults: list(type) >= 0")
.Attr("function: func")
.Doc("XLA Launch Op. For use by the XLA JIT only.");

XlaLocalLaunchOp::Compute
  ・Create an instance of the XlaCompilationCache class (compiler)
  ・Compile the set of functions that run inside the _XlaLaunch Op
    This is where LLVM is used to produce binary code
    OP_REQUIRES_OK(ctx, compiler->Compile(function_,
                                          num_constant_args_, ctx,
                                          &kernel, &executable));
  ・Convert the parameters and the input list into XLA data
  ・Call Run on the executable (this runs the binary code)
    auto run_result = executable->Run(arg_ptrs, run_options);
  ・Convert the XLA data into the output list

What Compute does
[Figure: compiler->Compile followed by executable->Run, with a callout "LLVM is used here"]

Compile
Converting a TensorFlow graph
into executable code

XlaCompilationCache::Compile
jit/xla_compilation_cache.cc
The member compiler_ is an XlaCompiler
・If the function has not been compiled yet, compile it
  entry->compiled = true;
  entry->compilation_status = compiler_.CompileFunction(
      flr.get(), function, args, &entry->compilation_result);
・Build an Executable from the compiled code
  entry->compilation_status = compiler_.BuildExecutable(
      entry->compilation_result, &entry->executable);
  *executable = entry->executable.get();
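
The compile-once behaviour can be sketched as a dictionary keyed by a signature. CompilationCache, fake_compile and the choice of (function name, argument shapes) as the key are assumptions for this example; the slide does not show how the real cache builds its key.

```python
class CompilationCache:
    """Toy cache: compile on the first request for a signature, reuse afterwards."""
    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._entries = {}

    def compile(self, function_name, arg_shapes):
        # Assumed key: the function name plus the shapes of its arguments.
        key = (function_name, tuple(map(tuple, arg_shapes)))
        if key not in self._entries:
            self._entries[key] = self._compile_fn(function_name, arg_shapes)
        return self._entries[key]

compile_calls = []

def fake_compile(function_name, arg_shapes):
    compile_calls.append(function_name)  # record how often we really compile
    return f"executable for {function_name}{list(arg_shapes)}"

cache = CompilationCache(fake_compile)
first = cache.compile("cluster_0", [(2,)])
second = cache.compile("cluster_0", [(2,)])  # same signature: cache hit
third = cache.compile("cluster_0", [(4,)])   # new shape: compiled again
```

Because compiled code is specialized to its inputs, a change in shape means a new entry rather than a reuse of the old one.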

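XlaCompiler::CompileFunction
tf2xla/xla_compiler.cc
  ・CompileFunction takes the graph inside the function all the way to machine code
    1) Optimize the graph (OptimizeGraph)
       A standard TensorFlow function
    2) Compile the graph (CompileGraph)
       From the TensorFlow graph to an XLA (HLO) Computation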

XlaCompiler::CompileGraph is covered
in part 2), AOT

BuildExecutable
[Figure: BuildHloModule, then backend->compiler()->Compile, producing a CpuExecutable]

Service::BuildExecutable
xla/service/service.cc
Convert the graph to XLA HLO
for (const VersionedComputationHandle& versioned_handle : versioned_handles) {
  auto module = computation_tracker_.BuildHloModule(
      versioned_handle, true);
  modules.push_back(std::move(module));
}
….
Convert XLA HLO to LLVM IR and then to an Executable
std::vector<std::unique_ptr<Executable>> executables =
    backend->compiler()->Compile(
        std::move(modules), std::move(module_configs),
        hlo_dumper, std::move(executors));

Calling executable->Run
[Figure: executable->Run leads into CpuExecutable]

LocalExecutable::Run
xla/client/local_client.cc
StatusOr<std::unique_ptr<ShapedBuffer>> LocalExecutable::Run(
const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
const ExecutableRunOptions& options) {
ExecutableRunOptions actual_options = options;
…..
return executable_->ExecuteOnStream(&actual_options,
arguments, nullptr);
}

ExecuteOnStream
xla/service/cpu/cpu_executable.cc
se::Stream* stream = run_options->stream();
Allocate memory
DeviceMemoryAllocator* memory_allocator = run_options->allocator();
std::vector<se::DeviceMemoryBase> buffers(assignment_->Allocations().size());
AllocateBuffers(
    memory_allocator, stream->parent()->device_ordinal(), &buffers);
Run the function
ExecuteComputeFunction(run_options, arguments, buffers,
                       hlo_execution_profile);

ExecuteComputeFunction
xla/service/cpu/cpu_executable.cc
Run the function that was compiled to machine code (compute_function_)
compute_function_(result_buffer, run_options, args_array.data(),
                  buffer_pointers.data(), profile_counters.data());
compute_function_ is set in the CpuExecutable constructor
CpuExecutable::CpuExecutable( ….,
    const string& entry_function_name, …) {
  llvm::JITSymbol sym = jit_->FindSymbol(entry_function_name);
  compute_function_ = reinterpret_cast<ComputeFunctionType>(sym.getAddress());
}

Using AOT compilation
https://www.tensorflow.org/performance/xla/tfcompile
・What is tfcompile?
・What does tfcompile do?
・How to use tfcompile
As of TensorFlow r1.1, the only AOT targets
officially supported are CPUs (x86-64/ARM64).
The code, however, also contains support for other CPUs (ARM/PowerPC).

What is tfcompile?
・Compiles a TensorFlow graph
  into executable code
・Reduces binary size
  and runtime overhead
・Example use: converting an inference graph
  into executable code for mobile devices

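The runtime goes away
A TensorFlow graph is normally executed by the TensorFlow
runtime, so executing each node in the graph incurs
runtime overhead.
The TensorFlow runtime code also has to be included,
which makes the binary larger.
The executable code generated by tfcompile
does not use the TensorFlow runtime and depends only on
the kernels that the computation actually uses.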

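What does tfcompile do?
tfcompile takes a TensorFlow subgraph and generates
a function that executes that subgraph.
Feeds become the function's input arguments and Fetches
become its output arguments.
All Placeholders and Variables must be specified as Feeds,
that is, as input arguments of the function.
The files generated by tfcompile can be used
as the object file for that function.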
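
"Feeds become arguments, Fetches become outputs" can be illustrated with a toy graph turned into a plain Python function. compile_subgraph and the dictionary-based graph format are hypothetical; tfcompile itself emits an object file and a header, not a Python closure.

```python
def compile_subgraph(graph, feeds, fetches):
    """graph maps a node name to (op, input names); returns an ordinary function."""
    def compiled(*args):
        if len(args) != len(feeds):
            raise TypeError(f"expected {len(feeds)} arguments, got {len(args)}")
        values = dict(zip(feeds, args))  # feeds arrive as positional arguments

        def evaluate(name):
            if name not in values:
                op, inputs = graph[name]
                values[name] = op(*(evaluate(i) for i in inputs))
            return values[name]

        return tuple(evaluate(name) for name in fetches)  # fetches are the outputs
    return compiled

# y = x * 2, with x as the only feed and y as the only fetch
graph = {
    "two": (lambda: 2.0, []),
    "y": (lambda a, b: a * b, ["x", "two"]),
}
times_two = compile_subgraph(graph, feeds=["x"], fetches=["y"])
output = times_two(1.5)
```

The caller of times_two needs no session or graph object, which is the point of the AOT path: the subgraph has become an ordinary callable.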

How to use tfcompile
  1) Configure the subgraph to compile
  2) Compile the subgraph with
     the tf_library build macro
  3) Write code that invokes the subgraph
  4) Build the final binary

tfcompile
is not distributed as a binary,
so it has to be built
from source

How to build tfcompile
"TensorFlowでXLAを使えるようにする" (Making XLA usable in TensorFlow)
by @adamrocker
http://blog.adamrocker.com/2017/03/build-tensorflow-xla-compiler.html
The section
"B: tfcompileを試す" (B: Trying tfcompile)
explains this in detail.

tfcompile::Main
aot/tfcompile_main.cc
Read the config file and the graph file
ReadProtoFile("config", flags.config, &config);
ReadProtoFile("graph", flags.graph, &graph_def);
Initialize the graph
InitGraph(graph_def, config, flags, &flib, &graph);
Compile the graph
CompileGraph(std::move(graph), flags, &flib, &compile_result);
Write out the files (object file, header)
WriteStringToFile( …., …., …. );

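[Figure: graph information and config information are read in; the graph is converted to HLO (and optimized); LLVM converts the HLO to CPU executable code; the result is written to an object file]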

Initializing the graph
aot/compile.cc : InitGraph
Create the graph and a copy of the graph definition
std::unique_ptr<Graph> g(new Graph(flib));  // graph
GraphDef copy_def(graph_def);               // graph definition
AddDefaultAttrsToGraphDef(&copy_def, *g->op_registry(), 0);
Convert the graph definition (GraphDef) into a graph
ConvertGraphDefToGraph(GraphConstructorOptions(), copy_def, g.get());
Add the Feeds/Fetches to the graph as nodes (_Arg/_Retval)
RewriteAndPruneGraph(g.get(), config, flags);

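0) Initial graph
[Figure: y = x * 2. Feed(x) and Const are the inputs of Mul; the output of Mul is Fetch(y)]

1) Add Feed/Fetch nodes
[Figure: y = x * 2. Feed(x) enters through an _Arg node and Fetch(y) leaves through a _Retval node; _Arg and Const feed Mul, and Mul feeds _Retval]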

Compiling the graph
aot/compile.cc : CompileGraph
Convert the TensorFlow graph into XLA (HLO) form
ConvertGraphToXla(client, std::move(graph), flib,
                  &computation, &compile_result->has_context_arg);
Set the compile options
xla::cpu::CpuAotCompilationOptions aot_opts(
    flags.target_triple, flags.target_cpu, flags.target_features,
    flags.entry_point,
    xla::cpu::CpuAotCompilationOptions::RelocationModel::BigPic);
Compile the XLA (HLO) computation
return CompileXla(client, computation, aot_opts, compile_result);

ConvertGraphToXla
Converts the graph information into XLA (HLO)

ConvertGraphToXla
aot/compile.cc
Assign every node to DEVICE_CPU_XLA_JIT
for (Node* node : graph->nodes()) {
  node->set_assigned_device_name(DEVICE_CPU_XLA_JIT);
}
Initialize the XlaCompiler
XlaCompiler::Options compiler_options;
compiler_options.client = client;
compiler_options.device_type = DeviceType(DEVICE_CPU_XLA_JIT);
compiler_options.allow_cpu_custom_calls = true;
XlaCompiler compiler(compiler_options);

2) Assign the nodes to CPU_XLA_JIT
[Figure: the graph with _Arg, Const, Mul and _Retval, labeled CPU_XLA_JIT]

ConvertGraphToXla
aot/compile.cc
Call the XlaCompiler's CompileGraph function
std::unique_ptr<FunctionLibraryRuntime> flib_run(NewFunctionLibraryRuntime(
    compiler.device_mgr(), Env::Default(), compiler.device(),
    graph->versions().producer(), flib_def, OptimizerOptions()));
XlaCompiler::CompilationResult result;
compiler.CompileGraph("tfcompile", std::move(graph),
                      flib_run.get(), xla_args, false, &result);
Take the XLA Computation out of the compile result
*computation = std::move(result.computation);

XlaCompiler::CompileGraph
tf2xla/xla_compiler.cc
Compiling the graph (CompileGraph)

1) Build the arguments (BuildArguments)
2) Execute the graph (ExecuteGraph)
   Generates the XLA Computation
3) Build the execution (BuildComputation)
   Generates a LocalExecutable from the XLA Computation

XlaCompiler::ExecuteGraph
https://docs.google.com/presentation/d/197G6FWQ4pqMS5cFkbNMkgQMoUV3B4Sdo9CzPNHJ5LBU/edit#slide=id.g1d042a8a7f_0_729 (slide 20)

XlaCompiler::ExecuteGraph
  ・LocalExecutor
    Creates and runs the kernel of every node in the graph
  ・XLA Graph
  ・tf2xla kernels
    The compute function of each node performs the compilation (its Compile function)

The kernels
tf2xla/kernels
_Arg : declaration_op.cc, corresponds to Feed
_Retval : retval_op.cc, corresponds to Fetch
Beyond these, only ops with a kernel in this directory
can be converted to XLA (HLO)
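
"Executing" a graph whose kernels emit instructions instead of computing values can be sketched as follows. Builder, KERNELS, execute_graph and the instruction tuples are hypothetical stand-ins for this example, not the tf2xla API.

```python
class Builder:
    """Collects HLO-like instructions; emit returns a handle to the new instruction."""
    def __init__(self):
        self.instructions = []

    def emit(self, opcode, *operands):
        self.instructions.append((opcode, operands))
        return len(self.instructions) - 1

# One "kernel" per supported op. A kernel emits an instruction rather than computing a value.
KERNELS = {
    "_Arg": lambda b, inputs, attrs: b.emit("parameter", attrs["index"]),
    "Const": lambda b, inputs, attrs: b.emit("constant", attrs["value"]),
    "Mul": lambda b, inputs, attrs: b.emit("multiply", *inputs),
    "_Retval": lambda b, inputs, attrs: inputs[0],  # marks the result, emits nothing
}

def execute_graph(nodes):
    """nodes: (name, op, input names, attrs) tuples in topological order."""
    builder, handles = Builder(), {}
    for name, op, inputs, attrs in nodes:
        if op not in KERNELS:
            raise NotImplementedError(f"no kernel for op {op}")
        handles[name] = KERNELS[op](builder, [handles[i] for i in inputs], attrs)
    return builder.instructions

instructions = execute_graph([
    ("x", "_Arg", [], {"index": 0}),
    ("two", "Const", [], {"value": 2.0}),
    ("mul", "Mul", ["x", "two"], {}),
    ("y", "_Retval", ["mul"], {}),
])
```

An op without an entry in KERNELS raises an error, which mirrors the remark that only ops with a kernel in this directory can be converted.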

CompileXla
https://docs.google.com/presentation/d/197G6FWQ4pqMS5cFkbNMkgQMoUV3B4Sdo9CzPNHJ5LBU/edit#slide=id.g1d042a8a7f_0_729 (slide 22)

CompileXla
aot/compile.cc : CompileXla
xla::LocalClient* client;
xla::LocalClient::AheadOfTimeComputationInstance instance;
instance.computation = &computation;
instance.argument_layouts = std::move(arg_layouts);
instance.result_layout = &pshape->result();
xla::StatusOr<std::vector<std::unique_ptr<xla::AotCompilationResult>>>
aot_or = client->CompileAheadOfTime({instance}, aot_opts);

CompileAheadOfTime
xla/client/local_client.cc
std::vector<LocalService::AheadOfTimeComputationInstance> service_instances;
service_instances.reserve(computations.size());
for (const AheadOfTimeComputationInstance& instance : computations) {
service_instances.push_back({});
LocalService::AheadOfTimeComputationInstance& service_instance =
service_instances.back();
TF_RET_CHECK(instance.computation != nullptr);
service_instance.computation = instance.computation->handle();
service_instance.argument_layouts = instance.argument_layouts;
service_instance.result_layout = instance.result_layout;
}
local_service_->CompileAheadOfTime(service_instances, options);

CompileAheadOfTime
xla/service/local_service.cc
std::vector<std::unique_ptr<HloModule>> hlo_modules;
std::vector<std::unique_ptr<HloModuleConfig>> module_configs;
for (const AheadOfTimeComputationInstance& instance : computations) {
  …..
  std::unique_ptr<HloModule> hlo_module =
      computation_tracker_.BuildHloModule(
          versioned_handle, true);
  hlo_modules.push_back(std::move(hlo_module));
}
Compile the HLO
return execute_backend_->compiler()->CompileAheadOfTime(
    std::move(hlo_modules), std::move(module_configs), MakeHloDumper(),
    options);

BuildHloModule
xla/service/computation_tracker.cc
for (auto versioned_handle : post_order) {
  UserComputation* computation =
      ResolveInternal(versioned_handle.handle).ValueOrDie();
  std::unique_ptr<HloComputation> hlo_computation =
      computation->BuildHloComputation(
          versioned_handle.version, resolver, include_unused_parameters);
  hlo_computations[versioned_handle] = hlo_computation.get();
  if (computation == entry_computation) {
    module->AddEntryComputation(std::move(hlo_computation));
  } else {
    module->AddEmbeddedComputation(std::move(hlo_computation));
  }
}

BuildHloComputation
xla/service/user_computation.cc
Lower the computation to HLO
std::unique_ptr<HloComputation> hlo_computation =
ComputationLowerer::Lower(
tensorflow::strings::StrCat(name(), ".v", version),
session_computation_,
version,
std::move(hlo_resolver),
include_unused_parameters);
return std::move(hlo_computation);

Here it is at last:
CompileAheadOfTime in
xla/service/cpu/cpu_compiler.cc
It optimizes the HLO and uses LLVM to
generate a CPU object file

CompileAheadOfTime
xla/service/cpu/cpu_compiler.cc
Set up the LLVM target and data layout
Run the following for each HLO module
RunHloPasses(hlo_module, module_config, dump_hlo);
IrEmitter ir_emitter(*hlo_module, *module_config, *assignment,
                     &llvm_module, nullptr);
for (auto embedded_computation : computation->MakeEmbeddedComputationsList()) {
  ir_emitter.EmitComputation(embedded_computation,
                             embedded_computation->name(), false,
                             &module_sequence.at(embedded_computation)).status();
}

RunHloPasses
Applies the following optimizations to the HLO
Inliner / ConvCanonicalization / HloPassFix<HloPassPipeline>
AlgebraicSimplifier / ReshapeMover
HloSubcomputationUnification / HloCSE
CpuInstructionFusion / CpuLayoutAssignment
AlgebraicSimplifier / HloCSE / ParallelizationPreparation
CopyInsertion / Parallelization / HloDCE
return pipeline.Run(hlo_module).status();
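
The HloPassFix<HloPassPipeline> entry suggests a pipeline that is rerun until nothing changes. Below is a minimal sketch of that fixed-point idea on toy expression tuples; rewrite, mul_by_one, add_zero and run_to_fixed_point are hypothetical and unrelated to the actual XLA pass classes.

```python
def rewrite(expr, rule):
    """Apply one rewrite rule bottom-up over an expression tree of nested tuples."""
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(rewrite(e, rule) for e in expr[1:])
    return rule(expr)

def mul_by_one(e):
    # ("mul", a, 1) simplifies to a
    return e[1] if isinstance(e, tuple) and e[0] == "mul" and e[2] == 1 else e

def add_zero(e):
    # ("add", a, 0) simplifies to a
    return e[1] if isinstance(e, tuple) and e[0] == "add" and e[2] == 0 else e

def run_to_fixed_point(expr, passes):
    """Run the whole pipeline repeatedly until one full round changes nothing."""
    iterations = 0
    while True:
        before = expr
        for p in passes:
            expr = rewrite(expr, p)
        iterations += 1
        if expr == before:
            return expr, iterations

# x * (1 + 0): add_zero must fire before mul_by_one can, so one round is not enough.
simplified, iterations = run_to_fixed_point(
    ("mul", "x", ("add", 1, 0)), [mul_by_one, add_zero]
)
```

Rerunning matters because one pass can expose work for a pass that already ran earlier in the same round.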

IrEmitter::EmitComputation
xla/service/cpu/ir_emitter.cc
Create the LLVM Function declaration and attach it to the builder (ir_builder)
InitializeIrFunction(function_name, is_entry_computation);
Apply the Visitor pattern starting from the HloComputation's root_instruction()
computation->root_instruction()->AcceptOrdered(
    this, *instruction_order);                     // when an instruction order is given
computation->root_instruction()->Accept(this);     // otherwise
InsertOrDie(&emitted_functions_, computation, compute_function_);
In the end the result is a pointer to an llvm::Function
return compute_function_;
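
The visitor walk from the root instruction can be sketched with a post-order traversal that emits one line per instruction. Instr, ToyIrEmitter and the emitted text are hypothetical; the lines only imitate the look of LLVM IR and are not valid IR.

```python
class Instr:
    def __init__(self, opcode, *operands, value=None):
        self.opcode, self.operands, self.value = opcode, operands, value

    def accept(self, visitor, visited=None):
        """Visit operands first (post-order), then dispatch on this instruction's opcode."""
        visited = set() if visited is None else visited
        if id(self) in visited:
            return
        visited.add(id(self))
        for operand in self.operands:
            operand.accept(visitor, visited)
        getattr(visitor, "handle_" + self.opcode)(self)

class ToyIrEmitter:
    def __init__(self):
        self.lines, self.names = [], {}

    def _define(self, instr, rhs):
        name = f"%{len(self.names)}"
        self.names[id(instr)] = name
        self.lines.append(f"{name} = {rhs}")

    def handle_parameter(self, instr):
        self._define(instr, f"load arg{instr.value}")

    def handle_constant(self, instr):
        self._define(instr, f"const {instr.value}")

    def handle_multiply(self, instr):
        a, b = (self.names[id(o)] for o in instr.operands)
        self._define(instr, f"fmul {a}, {b}")

# y = x * 2, rooted at the multiply
root = Instr("multiply", Instr("parameter", value=0), Instr("constant", value=2.0))
emitter = ToyIrEmitter()
root.accept(emitter)
```

Visiting operands before the instruction itself guarantees that every value has a name by the time something uses it.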

Compiling the function
Disassembler disassembler(*target_machine);
CompilerFunctor compiler_functor(
    target_machine.get(), &disassembler, opt_level,
    CompilerFunctor::AllIntrinsics());
CompilerFunctor::operator() converts the llvm::Module into
a CPU object
llvm::object::OwningBinary<llvm::object::ObjectFile> object_file =
    compiler_functor(llvm_module);

After compiling with LLVM,
generate the ObjectFile
xla/service/cpu/compiler_functor.cc
CompilerFunctor::operator()

LLVM IR optimization passes
Set up the optimization passes
llvm::legacy::PassManager module_passes;
llvm::legacy::FunctionPassManager function_passes(&module);
AddOptimizationPasses(&module_passes, &function_passes);
Run the optimization passes
function_passes.doInitialization();
for (auto func = module.begin(); func != module.end(); ++func) {
  function_passes.run(*func);
}
function_passes.doFinalization();
module_passes.run(module);

Generating machine code
llvm::MCContext* mc_context;
llvm::legacy::PassManager codegen_passes;
target_machine_->addPassesToEmitMC(codegen_passes,
                                   mc_context, ostream);
codegen_passes.run(module);

Generating the ObjectFile
std::unique_ptr<llvm::MemoryBuffer> memory_buffer(
    new llvm::ObjectMemoryBuffer(std::move(stream_buffer)));
llvm::Expected<std::unique_ptr<llvm::object::ObjectFile>>
    object_file_or_error =
        llvm::object::ObjectFile::createObjectFile(
            memory_buffer->getMemBufferRef());
std::unique_ptr<llvm::object::ObjectFile> object_file =
    std::move(object_file_or_error.get());
return llvm::object::OwningBinary<llvm::object::ObjectFile>(
    std::move(object_file), std::move(memory_buffer));

How do you support a new CPU?
Modify
CpuCompiler

InitializeLLVMTarget
llvm::InitializeNativeTarget();
….
LLVMInitializeX86Target();
….
LLVMInitializeARMTarget();
….
LLVMInitializeAArch64Target();
….
LLVMInitializePowerPCTarget();
Officially, though, only x86-64 and AArch64 are listed as supported.

Thank you
Blog: Vengineerの戯言
http://blogs.yahoo.co.jp/verification_engineer
Twitter: @Vengineer
TensorFlow XLAの衝撃 ("The Impact of TensorFlow XLA")
February 20, 2017
