MILEPOST GCC: machine learning based research compiler
Grigori Fursin, Cupertino Miranda, Olivier Temam (INRIA Saclay, France)
Mircea Namolaru, Elad Yom-Tov, Ayal Zaks, Bilha Mendelson (IBM Haifa, Israel)
Edwin Bonilla, John Thomson, Hugh Leather, Chris Williams, Michael O’Boyle (University of Edinburgh, UK)
Phil Barnard, Elton Ashton (ARC International, UK)
Eric Courtois, Francois Bodin (CAPS Enterprise, France)

Contact: grigori.fursin@inria.fr


Abstract

Tuning hardwired compiler optimizations for rapidly evolving hardware makes porting an optimizing compiler for each new platform extremely challenging. Our radical approach is to develop a modular, extensible, self-optimizing compiler that automatically learns the best optimization heuristics based on the behavior of the platform. In this paper we describe MILEPOST¹ GCC, a machine-learning-based compiler that automatically adjusts its optimization heuristics to improve the execution time, code size, or compilation time of specific programs on different architectures. Our preliminary experimental results show that it is possible to considerably reduce the execution time of the MiBench benchmark suite on a range of platforms entirely automatically.

1    Introduction

Current architectures and compilers continue to evolve, bringing higher performance, lower power and smaller size while attempting to keep time to market as short as possible. Typical systems may now have multiple heterogeneous reconfigurable cores and a great number of compiler optimizations available, making manual compiler tuning increasingly infeasible. Furthermore, static compilers often fail to produce high-quality code due to a simplistic model of the underlying hardware.

The difficulty of achieving portable compiler performance has led to iterative compilation [11, 16, 15, 26, 33, 19, 30, 24, 25, 18, 20, 22] being proposed as a means of overcoming the fundamental problem of static modeling. The compiler's static model is replaced by a search of the space of compilation strategies to find the one which, when executed, best improves the program. Little or no knowledge of the current platform is needed, so programs can be adapted to different architectures. It is currently used in library generators and by some existing adaptive tools [36, 28, 31, 8, 1, 3]. However, it is largely limited to searching for combinations of global compiler optimization flags and tweaking a few fine-grain transformations within relatively narrow search spaces. The main barrier to its wider use is the currently excessive compilation and execution time needed to optimize each program, which prevents its adoption in general-purpose compilers.

Our approach is to use machine learning, which has the potential to reuse knowledge across iterative compilation runs, gaining the benefits of iterative compilation while reducing the number of executions needed.

The MILEPOST project's [4] objective is to develop compiler technology that can automatically learn how to best optimize programs for configurable heterogeneous embedded processors using machine learning. It aims to dramatically reduce the time to market of configurable systems. Rather than developing a specialised compiler by hand for each configuration, MILEPOST aims to produce optimizing compilers automatically.

¹ MILEPOST: MachIne Learning for Embedded PrOgramS opTimization [4]

A key goal of the project is to make machine learning based compilation a realistic technology for general-purpose compilation. Current approaches [29, 32, 10, 14] are highly preliminary, limited to global compiler flags or simple transformations considered in isolation. GCC was selected as the compiler infrastructure for MILEPOST as it is currently the most stable and robust open-source compiler. It supports multiple architectures and has many aggressive optimizations, making it a natural vehicle for our research. In addition, each new version usually features new transformations, demonstrating the need for a system that automatically retunes its optimization heuristics.

In this paper we present early experimental results showing that it is possible to improve the performance of the well-known MiBench [23] benchmark suite on a range of platforms including x86 and IA64. We ported our tools to the new ARC GCC 4.2.1 that targets ARC International's configurable core family. Using MILEPOST GCC, after a few weeks of training, we were able to learn a model that automatically improves the execution time of the MiBench benchmarks by 11%, demonstrating the use of our machine learning based compiler.

This paper is organized as follows: the next section describes the overall MILEPOST framework and is, itself, followed by a section detailing our implementation of the Interactive Compilation Interface for GCC that enables dynamic manipulation of optimization passes. Section 4 describes machine learning techniques used to predict good optimization passes for programs using static program features and optimization knowledge reuse. Section 5 provides experimental results and is followed by concluding remarks.

2   MILEPOST Framework

The MILEPOST project uses a number of components, at the heart of which is the machine learning enabled MILEPOST GCC, shown in Figure 1. MILEPOST GCC currently proceeds in two distinct phases, in accordance with typical machine learning practice: training and deployment.

Training During the training phase we need to gather information about the structure of programs and record how they behave when compiled under different optimization settings. Such information allows machine learning tools to correlate aspects of program structure, or features, with optimizations, building a strategy that predicts a good combination of optimizations.

In order to learn a good strategy, machine learning tools need a large number of compilations and executions as training examples. These training examples are generated by a tool, the Continuous Collective Compilation Framework [2] (CCC), which evaluates different compilation optimizations, storing execution time, code size and other metrics in a database. The features of the program are extracted from MILEPOST GCC via a plugin and are also stored in the database. Plugins allow fine-grained control and examination of the compiler, driven externally through shared libraries.
Deployment Once sufficient training data is gathered, a model is created using machine learning modeling. The model is able to predict good optimization strategies for a given set of program features and is built as a plugin so that it can be re-inserted into MILEPOST GCC. On encountering a new program, the plugin determines the program's features and passes them to the model, which determines the optimizations to be applied.

Framework In this paper we use a new version of the Interactive Compilation Interface (ICI) for GCC which controls the internal optimization decisions and their parameters using external plugins. It now allows the complete substitution of default internal optimization heuristics as well as the order of transformations.

We use the Continuous Collective Compilation Framework [2] to produce a training set for machine learning models, which learn how to optimize programs for the best performance, code size, power consumption or any other objective function needed by the end-user. This framework allows knowledge of the optimization space to be reused among different programs, architectures and data sets.

Together with the additional routines needed for machine learning, such as program feature extraction, this forms MILEPOST GCC, which transforms the compiler suite into a powerful research tool suitable for adaptive computing.

The next section describes the new ICI structure and explains how program features can be extracted for later machine learning in Section 4.
[Figure 1 diagram: MILEPOST GCC (with ICI and ML routines) uses IC plugins for recording pass sequences and extracting static program features during training, and for predicting and selecting "good" passes during deployment, connected to the Continuous Collective Compilation Framework and its drivers for iterative compilation and model training.]

Figure 1: Framework to automatically tune programs and improve default optimization heuristics using machine learning techniques: MILEPOST GCC with Interactive Compilation Interface (ICI) and program feature extractor, and the Continuous Collective Compilation Framework to train ML models and predict good optimization passes

3   Interactive Compilation Interface

This section describes the Interactive Compilation Interface (ICI). The ICI provides opportunities for external control and examination of the compiler. Optimization settings at a fine-grained level, beyond the capabilities of command line options or pragmas, can be managed through external shared libraries, leaving the compiler uncluttered.

The first version of ICI [21] was reactive and required minimal changes to GCC. It was, however, unable to modify the order of optimization passes within the compiler, so large opportunities for speedup were closed to it. The new version of ICI [9] expands on the capabilities of its predecessor, permitting the pass order to be modified. This version of ICI is used in MILEPOST GCC to automatically learn good sequences of optimization passes. By replacing default optimization heuristics, execution time, code size and compilation time can be improved.

3.1 Internal structure

To avoid the drawbacks of the first version of the ICI, we designed a new version, as shown in Figure 2. This version can transparently monitor the execution of passes or replace the GCC Controller (Pass Manager), if desired. Passes can be selected by an external plugin which may choose to drive them in a very different order to that currently used in GCC, even choosing different pass orderings for each and every function in the program being compiled. Furthermore, the plugin can provide its own passes, implemented entirely outside of GCC.

In an additional set of enhancements, a coherent event and data passing mechanism enables external plugins to discover the state of the compiler and to be informed as it changes. At various points in the compilation process, events (IC Events) are raised indicating decisions about transformations, and auxiliary data (IC Data) is registered if needed.

Since plugins now extend GCC through external shared libraries, experiments can be built with no further modifications to the underlying compiler.
[Figure 2 diagram: the original GCC pipeline (detection of optimization flags, GCC Controller / Pass Manager, Pass 1 .. Pass N, GCC data layer with AST, CFG, CF, etc.) compared with GCC extended by ICI, where IC Events and IC Data expose the same pipeline to dynamically linked IC plugins (selecting pass sequences, extracting static program features), high-level scripting (java, python, etc.), and the Continuous Collective Compilation Framework with ML drivers to optimize programs and tune compiler optimization heuristics.]

Figure 2: GCC Interactive Compilation Interface: a) original GCC, b) GCC with ICI and plugins

Modifications for different analysis, optimization and monitoring scenarios can therefore proceed in a tight engineering environment. These plugins communicate with external drivers and allow both high-level scripting and communication with machine learning frameworks such as MILEPOST GCC.

Note that it is not the goal of this project to develop a fully fledged plugin system. Rather, we show the utility of such approaches for iterative compilation and machine learning in compilers. We may later utilize the GCC plugin systems currently in development, for example [7] and [13].

Figure 3 shows some of the modifications needed to enable ICI in GCC, with an example of a passive plugin that monitors executed passes. The plugin is invoked by the new -fici GCC flag or by setting the ICI_USE environment variable to 1 (to enable non-intrusive optimizations without changes to Makefiles). When GCC detects these options, it loads a plugin (dynamic library) whose name is specified by the ICI_PLUGIN environment variable and checks for two functions, start and stop, as shown in Figure 3a.
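A rough sketch of this loading step, under stated assumptions, might look as follows; only -fici, ICI_USE, ICI_PLUGIN, start and stop are taken from the text, while the function and variable names below are assumptions and error handling is omitted.

    /* Rough sketch of the plugin loading described above (cf. Figure 3a). */
    #include <dlfcn.h>
    #include <stdlib.h>

    typedef void (*ici_hook_fn) (void);

    extern int flag_ici;                 /* assumed: set by the -fici flag */

    static void *ici_plugin_handle;
    static ici_hook_fn ici_plugin_start;
    static ici_hook_fn ici_plugin_stop;

    static void
    ici_load_plugin (void)
    {
      const char *use  = getenv ("ICI_USE");    /* alternative to -fici       */
      const char *name = getenv ("ICI_PLUGIN"); /* path of the plugin library */

      if (!(flag_ici || (use && *use == '1')) || name == NULL)
        return;                                 /* ICI not requested          */

      ici_plugin_handle = dlopen (name, RTLD_NOW);
      if (ici_plugin_handle == NULL)
        return;

      /* The plugin must export the two entry points named in the text.  */
      ici_plugin_start = (ici_hook_fn) dlsym (ici_plugin_handle, "start");
      ici_plugin_stop  = (ici_hook_fn) dlsym (ici_plugin_handle, "stop");

      if (ici_plugin_start)
        ici_plugin_start ();   /* stop() is called when compilation finishes */
    }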
The start function of the example plugin registers an event handler function, executed_pass, on an IC-Event called pass_execution.

Figure 3c shows the simple modifications to the GCC Controller (Pass Manager) needed to enable monitoring of executed passes. When the GCC Controller function execute_one_pass is invoked, we register an IC-Parameter called pass_name giving the real name of the executed pass and trigger the IC-Event pass_execution. This in turn invokes the plugin function executed_pass, where we can obtain the name of the function currently being compiled using ici_get_feature("function_name") and the pass name using ici_get_parameter("pass_name").

IC-Features provide read-only data about the compilation state. IC-Parameters, on the other hand, can be dynamically changed by plugins to alter the subsequent behavior of the compiler. Such behavior modification is demonstrated in the next subsection using an example with an avoid_gate parameter needed for dynamic pass manipulation.

Since we use the name field from the GCC pass structure to identify passes, we have had to ensure that each pass has a unique name. Previously, some passes had no name at all, and we suggest that always giving passes unique names would be good development practice in the future.
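For illustration, such a passive monitoring plugin might look roughly like the sketch below; ici_get_feature and ici_get_parameter are the calls named above, while the registration helper ici_register_event and the exact signatures and return types are assumptions.

    /* Hypothetical sketch of a passive ICI plugin that logs executed passes
       (in the spirit of Figure 3b).  The extern declarations stand in for an
       assumed ICI plugin header; only the event, feature and parameter names
       are taken from the text. */
    #include <stdio.h>

    extern void        ici_register_event (const char *event, void (*handler) (void));
    extern const char *ici_get_feature    (const char *feature_name);
    extern const char *ici_get_parameter  (const char *parameter_name);

    /* Handler invoked each time the IC-Event "pass_execution" is raised by
       the modified GCC Controller (Pass Manager).  */
    static void
    executed_pass (void)
    {
      const char *function_name = ici_get_feature ("function_name");
      const char *pass_name     = ici_get_parameter ("pass_name");

      fprintf (stderr, "pass %s executed for function %s\n",
               pass_name, function_name);
    }

    /* Called by GCC when the plugin is loaded (-fici or ICI_USE=1).  */
    void
    start (void)
    {
      ici_register_event ("pass_execution", executed_pass);
    }

    /* Called by GCC when compilation finishes.  */
    void
    stop (void)
    {
    }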

Figure 3: Some GCC modifications to enable ICI and an example of a plugin to monitor executed passes: a) IC Framework within GCC, b) IC Plugin to monitor executed passes, c) GCC Controller (Pass Manager) modification
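In outline, the Pass Manager change of Figure 3c could resemble the following sketch; pass->name and pass->gate are genuine fields of the GCC pass structure, while the ICI helper names ici_register_parameter and ici_call_event and the exact placement of the hooks are assumptions (the gate_status/avoid_gate hook anticipates Section 3.2).

    /* Rough sketch of the GCC Controller (Pass Manager) modification; a
       fragment of the kind of code found in GCC's passes.c, where bool and
       struct tree_opt_pass are already declared.  The ICI helpers below are
       assumed names for the IC-Parameter/IC-Event mechanism. */

    extern void ici_register_parameter (const char *name, const void *value);
    extern void ici_call_event (const char *event_name);

    static bool
    execute_one_pass (struct tree_opt_pass *pass)
    {
      /* Default gate decision: run the pass only if its associated
         optimization flag enabled it.  */
      bool gate_status = (pass->gate == NULL) ? true : pass->gate ();

      /* Expose the decision to plugins, which may override it through the
         IC-Parameter "gate_status" and the IC-Event "avoid_gate"
         (see Section 3.2).  */
      ici_register_parameter ("gate_status", &gate_status);
      ici_call_event ("avoid_gate");

      if (!gate_status)
        return false;

      /* Announce which pass is about to run so that monitoring plugins
         receive the IC-Event "pass_execution".  */
      ici_register_parameter ("pass_name", pass->name);
      ici_call_event ("pass_execution");

      /* ... original pass execution code follows here ... */
      return true;
    }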

3.2   Dynamic Manipulation of GCC Passes

Previous research shows great potential for improving program execution time or reducing code size by carefully selecting global compiler flags or transformation parameters using iterative compilation. The quality of generated code can also be improved by selecting different optimization orders, as shown in [16, 15, 17, 26]. Our approach combines the selection of optimal optimization orders and the tuning of transformation parameters at the same time.

The new version of ICI enables arbitrary selection of legal optimization passes and has a mechanism to change parameters or transformations within passes. Since GCC currently does not provide enough information about dependencies between passes to detect legal orders, and the optimization space is too large to check all possible combinations, we focused on detecting influential passes and legal orders of optimizations. We examined the pass orders generated by compiler flags that improved program execution time or code size under iterative compilation.

Before attempting to learn good optimization settings and pass orders, we first confirmed that there is indeed performance to be gained within GCC from such actions; otherwise there is no point in trying to learn. Using the Continuous Collective Compilation Framework [2] to randomly search the optimization flag space (50% probability of selecting each optimization flag) with MILEPOST GCC 4.2.2 on an AMD Athlon64 3700+ and an Intel Xeon 2800MHz, we could improve the execution time of susan_corners by around 16%, compile time by 22% and code size by 13% using Pareto-optimal points, as described in previous work [24, 25]. Note that the same combination of flags degrades execution time of this benchmark on an Itanium-2 1.3GHz by 80%, demonstrating the importance of adapting compilers to each new architecture. Figure 4a shows the combination of flags found for this benchmark on the AMD platform, while Figures 4b and 4c show the passes invoked and monitored by MILEPOST GCC for the default -O3 level and for the best combination of flags, respectively.
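The uniform random search strategy just described can be sketched as follows; the driver below and its flag list are purely illustrative and not the actual CCC implementation.

    /* Minimal sketch of the uniform random search over compiler flags
       described above: each flag is selected independently with probability
       50%.  The flag list and command construction are illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static const char *candidate_flags[] = {
      "-funroll-loops", "-ftree-vectorize", "-fgcse-sm", "-fpeel-loops", "-ftracer"
    };

    int
    main (void)
    {
      char cmd[1024] = "gcc -O3";

      srand ((unsigned) time (NULL));
      for (size_t i = 0; i < sizeof candidate_flags / sizeof *candidate_flags; i++)
        if (rand () % 2)                      /* 50% probability per flag */
          {
            strncat (cmd, " ", sizeof cmd - strlen (cmd) - 1);
            strncat (cmd, candidate_flags[i], sizeof cmd - strlen (cmd) - 1);
          }

      /* One randomly generated flag combination, to be compiled, executed
         and recorded by the framework.  */
      printf ("%s\n", cmd);
      return 0;
    }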

Given that there is good performance to be gained by searching for good compiler flags, we now wish to automatically select good optimization passes and transformation parameters. These should enable fine-grained program and compiler tuning as well as non-intrusive continuous program optimization without modifications to Makefiles, etc. The current version of ICI allows passes to be called directly using the ici_run_pass function, which in turn invokes the GCC function execute_one_pass. We can therefore circumvent the default GCC Pass Manager and either execute good sequences of passes previously found by the CCC Framework, as shown in Figure 2b, or search for new good orders of optimizations. However, we leave the determination of pass interactions and dependencies for future work.

To verify that we can change the default optimization pass orders using ICI, we recompiled the same benchmark with the -O3 flag but selecting the passes shown in Figure 4c. Note, however, that the GCC internal function execute_one_pass shown in Figure 3c has gate control (pass->gate()) that executes a pass only if its associated optimization flag is selected. To avoid this gate control we use the IC-Parameter gate_status and the IC-Event avoid_gate so that we can set gate_status to TRUE within plugins and thus force execution of the pass. Execution of the generated binary shows that we improve its execution time by 13% instead of 16%; the reason is that some compiler flags not only invoke an associated pass such as -funroll-loops but also select specific fine-grain transformation parameters and influence code generation in other passes. Thus, at this point we recompile programs with such flags always enabled, and in the future we plan to add support for such cases explicitly.
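A plugin exploiting this mechanism might look roughly like the sketch below; ici_run_pass, gate_status and avoid_gate come from the text, whereas ici_register_event, ici_set_parameter, the event name used to intercept per-function optimization and the pass names in the sequence are assumptions.

    /* Hypothetical sketch of a plugin that bypasses the default Pass Manager
       and runs a pass sequence previously found by the CCC framework,
       forcing pass gates open via the "avoid_gate" event.  The ICI
       declarations and the pass/event names marked below are assumptions. */

    extern void ici_register_event (const char *event, void (*handler) (void));
    extern void ici_set_parameter  (const char *name, const void *value);
    extern void ici_run_pass       (const char *pass_name);

    /* Illustrative pass sequence for one function (not a real tuned order). */
    static const char *good_sequence[] = { "fre", "copyprop", "dce", "pre", 0 };

    /* Handler for the IC-Event "avoid_gate": force the gate decision so that
       every pass we request is actually executed.  */
    static void
    force_gate (void)
    {
      static const int gate_on = 1;                 /* TRUE */
      ici_set_parameter ("gate_status", &gate_on);
    }

    /* Handler invoked when GCC is about to optimize a function (the event
       name "function_optimization" is assumed): run our own sequence.  */
    static void
    optimize_function (void)
    {
      for (const char **p = good_sequence; *p != 0; p++)
        ici_run_pass (*p);                          /* calls execute_one_pass */
    }

    void
    start (void)
    {
      ici_register_event ("avoid_gate", force_gate);
      ici_register_event ("function_optimization", optimize_function);
    }

    void
    stop (void)
    {
    }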
Figure 4: a) Selection of compiler flags found using the CCC Framework with a uniform random search strategy that improves execution and compilation time for the susan_corners benchmark over -O3, b) recorded compiler passes for -O3 using ICI, c) recorded compiler passes for the good selection of flags in (a)

3.3 Adding program feature extractor pass

Our machine-learnt model predicts the best GCC optimization to apply to an input program based on its program structure or program features. The program features are typically a summary of the internal program representation and characterize the essential aspects of a program needed by the model to distinguish between
good and bad optimizations.

The current version of ICI allows invoking auxiliary passes that are not part of the default GCC compiler. These passes can monitor and profile the compilation process or extract the data structures needed to generate program features.

During compilation, the program is represented by several data structures implementing the intermediate representation (tree-SSA, RTL, etc.), the control flow graph (CFG), def-use chains, the loop hierarchy, and so on. The data structures available depend on the compilation pass currently being performed. For statistical machine learning, the information about these data structures is encoded as a constant-size vector of numbers (i.e. features); this process is called feature extraction and is needed to enable optimization knowledge reuse among different programs.

Therefore, we implemented an additional GCC pass, ml-feat, to extract static program features. This pass is not invoked during default compilation but can be called using an extract_program_static_features plugin after any arbitrary pass starting from FRE, when all the GCC data necessary to produce features is ready.

In MILEPOST GCC, feature extraction is performed in two stages. In the first stage, a relational representation of the program is extracted from the compiler's internal data structures. Each such struct data type may introduce an entity, and its fields may introduce relations over the entity representing the enclosing struct data type and the entity representing the data type of the field. This data is collected using the ml-feat pass.

In the second stage, we provide a Prolog program defining the features to be computed from the Datalog relations extracted from the compiler's internal data structures in the first stage. The extract_program_static_features plugin invokes a Prolog compiler to execute this program, the result being the vector of features shown in Table 1, which can later be used by the Continuous Collective Compilation Framework to build machine learning models and predict the best sequences of passes for new programs. This example shows the flexibility and capabilities of the new version of ICI.
ing a extract_program_static_features plugin after any
                                                                    necessary to build a learning compiler. In this section
arbitrary pass starting from FRE when all the GCC data
                                                                    we describe how this infrastructure is used in building a
necessary to produce features is ready.
                                                                    model.
In the MILEPOST GCC, the feature extraction is per-
formed in two stages. In the first stage, a relational rep-          Our approach to selecting good passes for programs is
resentation of the program is extracted; in the second              based upon the construction of a probabilistic model on
stage, the vector of features is computed from this rep-            a set of M training programs and the use of this model in
resentation.                                                        order to make predictions of “good” optimization passes
                                                                    on unseen programs.
In the first stage, the program is considered to be charac-
terized by a number of entities and relations over these            Our specific machine learning method is similar to that
entities. The entities are a direct mapping of similar en-          of [10] where a probability distribution over “good” so-
tities defined by the language reference, or generated               lutions (i.e. optimization passes or compiler flags) is
during the compilation. Such examples of entities are               learnt across different programs. This approach has
variables, types, instructions, basic blocks, temporary             been referred in the literature to as Predictive Search
variables, etc.                                                     Distributions (PSD) [12]. However, unlike [10, 12]
                                                                    where such a distribution is used to focus the search of
A relation over a set of entities is a subset of their Carte-       compiler optimizations on a new program, we use the
sian product. The relations specify properties of the en-           distribution learned to make one-shot predictions on un-
tities or the connections between them. For describing              seen programs. Thus we do not search for the best opti-
the relations we used a notation based on logic - Datalog           mization, we automatically predict it.
is a Prolog-like language but with a simpler semantics,
suitable for expressing relations and operations between            4.1 The Machine Learning Model
them [35, 34]
For extracting the relational representation of the pro-            Given a set of training programs T 1 , . . . , T M , which can
gram, we used a simple method based on the examina-                 be described by (vectors of) features t1 . . . , tM , and for
tion of the include files. The compiler main data struc-             which we have evaluated different sequences of opti-
tures are struct data types, having a number of f ields.            mization passes (x) and their corresponding execution
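To make the two stages concrete, the short Python sketch below mimics the idea: Datalog-style relations over program entities are held as sets of tuples, and features are computed as simple queries over them. The relation names (basic_block, instruction, cfg_edge) and the two features are hypothetical illustrations, not the actual ml-feat relations or the real entries of Table 1; in MILEPOST GCC the computation is done by a Prolog program over the extracted Datalog relations.

# Illustrative sketch only: Datalog-style relations represented as Python sets
# of tuples, and two example features computed by querying them. Relation and
# feature names are hypothetical, not the actual ml-feat output.

basic_block = {("bb1",), ("bb2",), ("bb3",)}                 # entity: basic blocks
instruction = {("i1", "bb1"), ("i2", "bb1"),                 # relation: instruction -> enclosing block
               ("i3", "bb2"), ("i4", "bb3")}
cfg_edge = {("bb1", "bb2"), ("bb1", "bb3"), ("bb2", "bb3")}  # relation: CFG edges

def ft_number_of_basic_blocks():
    return len(basic_block)

def ft_avg_instructions_per_block():
    counts = {}
    for _, bb in instruction:
        counts[bb] = counts.get(bb, 0) + 1
    return sum(counts.values()) / len(basic_block)

# The second stage assembles such per-feature queries into a fixed-size vector.
feature_vector = [ft_number_of_basic_blocks(), ft_avg_instructions_per_block()]
print(feature_vector)   # [3, 1.3333333333333333]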

4 Using Machine Learning to Select Good Optimization Passes

The previous sections have described the infrastructure necessary to build a learning compiler. In this section we describe how this infrastructure is used to build a model.

Our approach to selecting good passes for programs is based on constructing a probabilistic model over a set of M training programs and using this model to predict "good" optimization passes for unseen programs.

Our specific machine learning method is similar to that of [10], where a probability distribution over "good" solutions (i.e. optimization passes or compiler flags) is learnt across different programs. This approach has been referred to in the literature as Predictive Search Distributions (PSD) [12]. However, unlike [10, 12], where such a distribution is used to focus the search for compiler optimizations on a new program, we use the learned distribution to make one-shot predictions on unseen programs. Thus we do not search for the best optimization, we automatically predict it.

4.1 The Machine Learning Model

Given a set of training programs T^1, ..., T^M, which can be described by (vectors of) features t^1, ..., t^M, and for which we have evaluated different sequences of optimization passes (x) and their corresponding execution times (or speed-ups y), we have for each program an associated dataset D^j = {(x_i, y_i)}_{i=1}^{N_j}, with j = 1, ..., M. Our goal is to predict a good sequence of optimization passes x^* when a new program T^* is presented.

We approach this problem by learning the mapping from the features t of a program to a distribution over good solutions q(x|t, θ), where θ are the parameters of the distribution. Once this distribution has been learnt, prediction for a new program T^* is straightforward: we sample at the mode of the distribution. In other words, we obtain the predicted sequence of passes by computing

    x^* = argmax_x q(x|t, θ).    (1)
4.2 Continuous Collective Compilation Framework

We used the Continuous Collective Compilation Framework [2] and MILEPOST GCC, shown in Figure 1, to generate a training set of programs together with compiler flags selected uniformly at random, the associated sequences of passes, program features, and speedups (along with code size and compilation time), which is stored in the externally accessible Global Optimization Database. We use this training set to build the machine learning model described in the next section, which in turn is used to predict the best sequence of passes for a new program given its feature vector. The current version of the CCC Framework requires minimal changes to the Makefile to pass optimization flags or sequences of passes, and has support for verifying the correctness of the binary by comparing the program output with the reference output, to avoid illegal combinations of optimizations.
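As a rough illustration of this correctness check, the sketch below compares the output of a binary built with a candidate flag combination against a reference output. The file names, benchmark source and flag list are placeholders, not the actual CCC Framework scripts.

# Minimal sketch (not the actual CCC Framework scripts): build a benchmark with
# a candidate set of flags, run it, and accept the combination only if its
# output matches the reference output produced by a known-good build.
import subprocess

def output_matches_reference(flags, src="benchmark.c", ref_output="reference.out"):
    subprocess.run(["gcc", *flags, "-o", "candidate", src], check=True)
    result = subprocess.run(["./candidate"], capture_output=True, check=True)
    with open(ref_output, "rb") as f:
        return result.stdout == f.read()

# Illegal or unsafe flag combinations are simply discarded from the training set.
candidate_flags = ["-O2", "-funroll-loops", "-ftree-vectorize"]
if output_matches_reference(candidate_flags):
    print("flag combination accepted for training data")
else:
    print("flag combination rejected")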
4.3 Learning and Predicting

In order to learn the model it is necessary to first fit a distribution over good solutions to each training program. These solutions can be obtained, for example, by uniform sampling or by running an estimation of distribution algorithm (EDA, see [27] for an overview) on each of the training programs. In our experiments we use uniform sampling, and we choose the set of good solutions to be those optimization settings that achieve at least 98% of the maximum speed-up available in the corresponding program-dependent dataset.

Let us denote the distribution over good solutions for each training program by P(x|T^j), with j = 1, ..., M. In principle, these distributions can belong to any parametric family. However, in our experiments we use an IID model where each element of the sequence is considered independently. In other words, the probability of a "good" sequence of passes is simply the product of the individual probabilities describing how likely each pass is to belong to a good solution:

    P(x|T^j) = ∏_{ℓ=1}^{L} P(x_ℓ|T^j),    (2)

where L is the length of the sequence.

As proposed in [12], once the individual training distributions P(x|T^j) have been obtained, the predictive distribution q(x|t, θ) can be learnt by maximizing the conditional likelihood or by using k-nearest-neighbor methods. In our experiments we use a 1-nearest-neighbor approach. In other words, we set the predictive distribution q(x|t, θ) to be the distribution corresponding to the training program that is closest in feature space to the new (test) program.

Note that we currently predict "good" sequences of optimization passes that are associated with the best combination of compiler flags. In future work we will investigate the application of our models to the general problem of determining the "optimal" order of optimization passes for programs.
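The following Python sketch illustrates, under simplifying assumptions (binary pass/flag decisions, Euclidean distance in feature space), the 1-nearest-neighbor variant of this approach: per-program probabilities of each pass belonging to a good solution are estimated from the good runs, and the distribution of the closest training program is used to make a one-shot prediction for a new program. It is an illustration of the method described above, not the actual MILEPOST/CCC implementation; the feature vectors and runs are made up.

# Sketch of the PSD-style 1-nearest-neighbor predictor described in the text.
# Training data is assumed to be a list of (feature_vector, good_runs) pairs,
# where good_runs is a list of 0/1 vectors marking which passes/flags were
# enabled in runs reaching at least 98% of the best speedup for that program.
import numpy as np

def fit_distributions(training_data):
    """For each training program, estimate P(pass l belongs to a good solution)."""
    return [(np.asarray(t), np.mean(good_runs, axis=0))
            for t, good_runs in training_data]

def predict(model, t_new):
    """1-NN in feature space, then the mode of the per-pass (IID) distribution."""
    t_new = np.asarray(t_new)
    _, probs = min(model, key=lambda m: np.linalg.norm(m[0] - t_new))
    return (probs >= 0.5).astype(int)   # mode of independent Bernoulli variables

# Toy usage with made-up feature vectors and runs over three passes.
training_data = [
    ([10.0, 2.0, 0.5], [[1, 0, 1], [1, 0, 1]]),
    ([1.0, 8.0, 3.0],  [[0, 1, 0], [0, 1, 1]]),
]
model = fit_distributions(training_data)
print(predict(model, [9.0, 2.5, 0.4]))  # -> [1 0 1] (closest to the first program)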
5 Experiments

We performed our experiments on four different platforms:

  • AMD – a cluster with 16 AMD Athlon 64 3700+ processors running at 2.4GHz

  • IA32 – a cluster with 4 Intel Xeon processors running at 2.8GHz

  • IA64 – a server with an Itanium2 processor running at 1.3GHz

  • ARC – FPGA implementation of the ARC 725D processor running GNU/Linux with a 2.4.29 kernel

[Figure 5: bar charts of speedup per MiBench benchmark. Panel (a): speedup obtained by iterative compilation on AMD, IA32 and IA64. Panel (b): iterative compilation versus the optimization passes predicted using ML and MILEPOST GCC, including the average across benchmarks.]
Figure 5: Experimental results when using iterative compilation with a random search strategy (500 iterations; 50% probability of selecting each flag; AMD, IA32, IA64) and when predicting the best optimization passes based on program features and the ML model (ARC)


In all cases the compiler used is MILEPOST GCC 4.2.x with ICI version 0.9.6. We decided to use the open-source MiBench benchmark suite with MiDataSets [20, 6] (dataset No. 1 in all cases) due to its applicability to both general-purpose and embedded domains.

5.1 Generating Training Data

In order to build a machine learning model, we need training data. This was generated by a random exploration of a vast optimization search space using the CCC Framework, which generated 500 random sequences of flags, each either turned on or off. These flag settings can be readily associated with different sequences of optimization passes. Although such a number of runs is very small with respect to the optimization space, we have shown that sufficient information can be gleaned from it to allow significant speedup. Indeed, the size of the space left unexplored serves to highlight our lack of knowledge in this area, and the need for further work.
Firstly, features for each benchmark were extracted from the programs using the new pass within MILEPOST GCC, and these features were then sent to the Global Optimization Database within the CCC Framework. An ML model for each benchmark was built using the execution times gathered from 500 separate runs with different random sequences of passes and a fixed data set (each run was repeated 5 times so that speedups were not caused by cache priming, etc.). After each run, the experimental results, including execution time, compilation time, code size and program features, are sent to the database, where they are stored for future reference.

Figure 5a shows that considerable speedups can already be obtained after iterative compilation on all platforms. However, this is a time-consuming process, and the different speedups across different platforms motivate the use of machine learning to automatically build specialized compilers and predict the best optimization flags or sequences of passes for different architectures.

5.2 Evaluating Model Performance

Once a model has been built for each of our benchmarks, we can evaluate the results by introducing a new program to the system and measuring how well the prediction provided by our model performs. In this work we did not use a separate testing benchmark suite due to time constraints, so leave-one-out cross-validation was used instead. Using this method, all training data relating to the benchmark being tested is excluded from the training process and the models are rebuilt. When a second benchmark is tested, the training data pertaining to the first benchmark is returned to the training set, that of the second benchmark is excluded, and so on. In this way we ensure that each benchmark is tested as a new program entering the system for the first time; of course, in real-world usage this process is unnecessary.
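This evaluation protocol can be summarized by the following sketch of leave-one-out cross-validation over the benchmarks; build_model and evaluate_prediction are hypothetical stand-ins for the model construction and measurement steps described in the text.

# Sketch of leave-one-out cross-validation over benchmarks. The helpers passed
# in (build_model, evaluate_prediction) are hypothetical placeholders for the
# model-building and measurement machinery described in the text.
from collections import namedtuple

Benchmark = namedtuple("Benchmark", "name features")

def leave_one_out(benchmarks, build_model, evaluate_prediction):
    """Evaluate each benchmark with a model trained on all the others."""
    speedups = {}
    for held_out in benchmarks:
        training_set = [b for b in benchmarks if b is not held_out]
        model = build_model(training_set)          # model rebuilt without held_out
        speedups[held_out.name] = evaluate_prediction(model, held_out)
    return speedups

# Toy usage with stub helpers; real versions would train the ML model and then
# compile/run the benchmark with the predicted passes versus the -O3 baseline.
benchmarks = [Benchmark("susan_c", [3, 1.2]), Benchmark("dijkstra", [5, 0.7])]
print(leave_one_out(benchmarks,
                    build_model=lambda ts: ts,
                    evaluate_prediction=lambda model, b: 1.0))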
When a new program is compiled, its features are first generated using MILEPOST GCC. These features are then sent to our ML model within the CCC Framework (implemented as a MATLAB server), which processes them and returns a predicted sequence of passes that should improve execution time, reduce code size, or both. We then evaluate the prediction by compiling the program with the suggested sequence of passes, measuring the execution time and comparing it with the original time for the default '-O3' optimization level. It is important to note that only one compilation occurs at evaluation; there is no search involved. Figure 5b shows these results for the ARC725D. It demonstrates that, except for a few pathological cases where the predicted flags degraded performance (whose analysis we leave for future work), using the CCC Framework, MILEPOST GCC and machine learning models we can improve on the original ARC GCC by around 11%.

This suggests that our techniques and tools can be efficient for building future iterative, adaptive, specialized compilers.

6 Conclusions and Future Work

In this paper we have shown that MILEPOST GCC has significant potential for the automatic tuning of GCC optimization. We plan to use these techniques and tools to further investigate the automatic selection of optimal orders of optimization passes and fine-grain tuning of transformation parameters. The overall framework will also allow the analysis of interactions between optimizations and investigation of the influence of program inputs and run-time state on program optimizations. Future work will also include fine-grain run-time adaptation for multiple program inputs on heterogeneous multi-core architectures.

7 Acknowledgments

The authors would like to thank Sebastian Pop for his help with modifying and debugging GCC, and John W. Lockhart for his help with editing this paper. This research is supported by the European Project MILEPOST (MachIne Learning for Embedded PrOgramS opTimization) [4] and by the HiPEAC European Network of Excellence (High-Performance Embedded Architecture and Compilation) [5].

References

 [1] ACOVEA: Using Natural Selection to Investigate Software Complexities. http://www.coyotegulch.com/products/acovea.

 [2] Continuous Collective Compilation Framework. http://cccpf.sourceforge.net.

 [3] ESTO: Expert System for Tuning Optimizations. http://www.haifa.ibm.com/projects/systems/cot/esto/index.html.

 [4] EU MILEPOST project (MachIne Learning for Embedded PrOgramS opTimization). http://www.milepost.eu.

 [5] European Network of Excellence on High-Performance Embedded Architecture and Compilation (HiPEAC). http://www.hipeac.net.

 [6] MiDataSets SourceForge development site. http://midatasets.sourceforge.net.

 [7] Plugin project. http://libplugin.sourceforge.net.

 [8] QLogic PathScale EKOPath Compilers. http://www.pathscale.com.

 [9] GCC Interactive Compilation Interface. http://gcc-ici.sourceforge.net, 2006.

[10] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. O'Boyle, J. Thomson, M. Toussaint, and C. Williams. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2006.

[11] F. Bodin, T. Kisuki, P. Knijnenburg, M. O'Boyle, and E. Rohou. Iterative compilation in a non-linear optimisation space. In Proceedings of the Workshop on Profile and Feedback Directed Compilation, 1998.

[12] E. V. Bonilla, C. K. I. Williams, F. V. Agakov, J. Cavazos, J. Thomson, and M. F. P. O'Boyle. Predictive search distributions. In W. W. Cohen and A. Moore, editors, Proceedings of the 23rd International Conference on Machine Learning, pages 121–128, New York, NY, USA, 2006. ACM.

[13] S. Callanan, D. J. Dean, and E. Zadok. Extending GCC with modular GIMPLE optimizations. In Proceedings of the GCC Developers' Summit 2007, 2007.

[14] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. O'Boyle, and O. Temam. Rapidly selecting good compiler optimizations using performance counters. In Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO), March 2007.

[15] K. Cooper, A. Grosul, T. Harvey, S. Reeves, D. Subramanian, L. Torczon, and T. Waterman. ACME: adaptive compilation made efficient. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2005.

[16] K. Cooper, P. Schielke, and D. Subramanian. Optimizing for reduced code space using genetic algorithms. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 1–9, 1999.

[17] B. Elliston. Studying optimisation sequences in the GNU Compiler Collection. Technical Report ZITE8199, School of Information Technology and Electrical Engineering, Australian Defence Force Academy, 2005.

[18] B. Franke, M. O'Boyle, J. Thomson, and G. Fursin. Probabilistic source-level optimisation of embedded programs. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2005.

[19] G. Fursin. Iterative Compilation and Performance Prediction for Numerical Applications. PhD thesis, University of Edinburgh, United Kingdom, 2004.

[20] G. Fursin, J. Cavazos, M. O'Boyle, and O. Temam. MiDataSets: Creating the conditions for a more realistic evaluation of iterative optimization. In Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2007), January 2007.

[21] G. Fursin and A. Cohen. Building a practical iterative interactive compiler. In 1st Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation (SMART'07), colocated with the HiPEAC 2007 conference, January 2007.

[22] G. Fursin, A. Cohen, M. O'Boyle, and O. Temam. A practical method for quickly evaluating program optimizations. In Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2005), pages 29–46, November 2005.

[23] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, December 2001.

[24] K. Heydemann and F. Bodin. Iterative compilation for two antagonistic criteria: Application to code size and performance. In Proceedings of the 4th Workshop on Optimizations for DSP and Embedded Systems, colocated with CGO, 2006.

[25] K. Hoste and L. Eeckhout. COLE: Compiler optimization level exploration. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2008.

[26] P. Kulkarni, W. Zhao, H. Moon, K. Cho, D. Whalley, J. Davidson, M. Bailey, Y. Paek, and K. Gallivan. Finding effective optimization phase sequences. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 12–23, 2003.

[27] P. Larrañaga and J. A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, Norwell, MA, USA, 2001.

[28] F. Matteo and S. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 1381–1384, Seattle, WA, May 1998.

[29] A. Monsifrot, F. Bodin, and R. Quiniou. A machine learning approach to automatic production of compiler heuristics. In Proceedings of the International Conference on Artificial Intelligence: Methodology, Systems, Applications, LNCS 2443, pages 41–50, 2002.

[30] Z. Pan and R. Eigenmann. Fast and effective orchestration of compiler optimizations for automatic performance tuning. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 319–332, 2006.

[31] B. Singer and M. Veloso. Learning to predict performance from formula modeling and training data. In Proceedings of the Conference on Machine Learning, 2000.

[32] M. Stephenson and S. Amarasinghe. Predicting unroll factors using supervised classification. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 123–134, 2005.

[33] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. August. Compiler optimization-space exploration. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 204–215, 2003.

[34] J. Ullman. Principles of Database and Knowledge-Base Systems, volume 1. Computer Science Press, 1988.

[35] J. Whaley and M. S. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2004.

[36] R. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proc. Alliance, 1998.

MILEPOST GCC: machine learning based research compiler

More Related Content

Viewers also liked

Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
슬라이드 1
슬라이드 1슬라이드 1
슬라이드 1butest
 
MIS 542 Syllabus 08.doc
MIS 542 Syllabus 08.docMIS 542 Syllabus 08.doc
MIS 542 Syllabus 08.docbutest
 
Designed by Identity MLP
Designed by Identity MLP Designed by Identity MLP
Designed by Identity MLP butest
 
Julie Acker, M.S.W., CMHA Lambton Julie Acker holds a Masters ...
Julie Acker, M.S.W., CMHA Lambton Julie Acker holds a Masters ...Julie Acker, M.S.W., CMHA Lambton Julie Acker holds a Masters ...
Julie Acker, M.S.W., CMHA Lambton Julie Acker holds a Masters ...butest
 
Basic Notions of Learning, Introduction to Learning ...
Basic Notions of Learning, Introduction to Learning ...Basic Notions of Learning, Introduction to Learning ...
Basic Notions of Learning, Introduction to Learning ...butest
 
Machine Learning Based Botnet Detection
Machine Learning Based Botnet DetectionMachine Learning Based Botnet Detection
Machine Learning Based Botnet Detectionbutest
 

Viewers also liked (7)

Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
슬라이드 1
슬라이드 1슬라이드 1
슬라이드 1
 
MIS 542 Syllabus 08.doc
MIS 542 Syllabus 08.docMIS 542 Syllabus 08.doc
MIS 542 Syllabus 08.doc
 
Designed by Identity MLP
Designed by Identity MLP Designed by Identity MLP
Designed by Identity MLP
 
Julie Acker, M.S.W., CMHA Lambton Julie Acker holds a Masters ...
Julie Acker, M.S.W., CMHA Lambton Julie Acker holds a Masters ...Julie Acker, M.S.W., CMHA Lambton Julie Acker holds a Masters ...
Julie Acker, M.S.W., CMHA Lambton Julie Acker holds a Masters ...
 
Basic Notions of Learning, Introduction to Learning ...
Basic Notions of Learning, Introduction to Learning ...Basic Notions of Learning, Introduction to Learning ...
Basic Notions of Learning, Introduction to Learning ...
 
Machine Learning Based Botnet Detection
Machine Learning Based Botnet DetectionMachine Learning Based Botnet Detection
Machine Learning Based Botnet Detection
 

Similar to MILEPOST GCC: machine learning based research compiler

The Impact of Compiler Auto-Optimisation on Arm-based HPC Microarchitectures
The Impact of Compiler Auto-Optimisation on Arm-based HPC MicroarchitecturesThe Impact of Compiler Auto-Optimisation on Arm-based HPC Microarchitectures
The Impact of Compiler Auto-Optimisation on Arm-based HPC MicroarchitecturesNECST Lab @ Politecnico di Milano
 
Developing Real-Time Systems on Application Processors
Developing Real-Time Systems on Application ProcessorsDeveloping Real-Time Systems on Application Processors
Developing Real-Time Systems on Application ProcessorsToradex
 
Collective Mind: bringing reproducible research to the masses
Collective Mind: bringing reproducible research to the massesCollective Mind: bringing reproducible research to the masses
Collective Mind: bringing reproducible research to the massesGrigori Fursin
 
Advanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCAdvanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCIJSRD
 
An Optimising Compiler For Generated Tiny Virtual Machines
An Optimising Compiler For Generated Tiny Virtual MachinesAn Optimising Compiler For Generated Tiny Virtual Machines
An Optimising Compiler For Generated Tiny Virtual MachinesLeslie Schulte
 
A REVIEW ON PARALLEL COMPUTING
A REVIEW ON PARALLEL COMPUTINGA REVIEW ON PARALLEL COMPUTING
A REVIEW ON PARALLEL COMPUTINGAmy Roman
 
Atmel - Next-Generation IDE: Maximizing IP Reuse [WHITE PAPER]
Atmel - Next-Generation IDE: Maximizing IP Reuse [WHITE PAPER]Atmel - Next-Generation IDE: Maximizing IP Reuse [WHITE PAPER]
Atmel - Next-Generation IDE: Maximizing IP Reuse [WHITE PAPER]Atmel Corporation
 
Kostiantyn Bokhan, N-iX. CD4ML based on Azure and Kubeflow
Kostiantyn Bokhan, N-iX. CD4ML based on Azure and KubeflowKostiantyn Bokhan, N-iX. CD4ML based on Azure and Kubeflow
Kostiantyn Bokhan, N-iX. CD4ML based on Azure and KubeflowIT Arena
 
186 devlin p-poster(2)
186 devlin p-poster(2)186 devlin p-poster(2)
186 devlin p-poster(2)vaidehi87
 
Tutorial on Parallel Computing and Message Passing Model - C2
Tutorial on Parallel Computing and Message Passing Model - C2Tutorial on Parallel Computing and Message Passing Model - C2
Tutorial on Parallel Computing and Message Passing Model - C2Marcirio Chaves
 
CK: from ad hoc computer engineering to collaborative and reproducible data s...
CK: from ad hoc computer engineering to collaborative and reproducible data s...CK: from ad hoc computer engineering to collaborative and reproducible data s...
CK: from ad hoc computer engineering to collaborative and reproducible data s...Grigori Fursin
 
A FRAMEWORK STUDIO FOR COMPONENT REUSABILITY
A FRAMEWORK STUDIO FOR COMPONENT REUSABILITYA FRAMEWORK STUDIO FOR COMPONENT REUSABILITY
A FRAMEWORK STUDIO FOR COMPONENT REUSABILITYcscpconf
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsCSCJournals
 
just in time JIT compiler
just in time JIT compilerjust in time JIT compiler
just in time JIT compilerMohit kumar
 
IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code OptimizationIRJET Journal
 
Deepcoder to Self-Code with Machine Learning
Deepcoder to Self-Code with Machine LearningDeepcoder to Self-Code with Machine Learning
Deepcoder to Self-Code with Machine LearningIRJET Journal
 

Similar to MILEPOST GCC: machine learning based research compiler (20)

The Impact of Compiler Auto-Optimisation on Arm-based HPC Microarchitectures
The Impact of Compiler Auto-Optimisation on Arm-based HPC MicroarchitecturesThe Impact of Compiler Auto-Optimisation on Arm-based HPC Microarchitectures
The Impact of Compiler Auto-Optimisation on Arm-based HPC Microarchitectures
 
Developing Real-Time Systems on Application Processors
Developing Real-Time Systems on Application ProcessorsDeveloping Real-Time Systems on Application Processors
Developing Real-Time Systems on Application Processors
 
Collective Mind: bringing reproducible research to the masses
Collective Mind: bringing reproducible research to the massesCollective Mind: bringing reproducible research to the masses
Collective Mind: bringing reproducible research to the masses
 
Advanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCAdvanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPC
 
An Optimising Compiler For Generated Tiny Virtual Machines
An Optimising Compiler For Generated Tiny Virtual MachinesAn Optimising Compiler For Generated Tiny Virtual Machines
An Optimising Compiler For Generated Tiny Virtual Machines
 
A REVIEW ON PARALLEL COMPUTING
A REVIEW ON PARALLEL COMPUTINGA REVIEW ON PARALLEL COMPUTING
A REVIEW ON PARALLEL COMPUTING
 
Test PDF file
Test PDF fileTest PDF file
Test PDF file
 
Atmel - Next-Generation IDE: Maximizing IP Reuse [WHITE PAPER]
Atmel - Next-Generation IDE: Maximizing IP Reuse [WHITE PAPER]Atmel - Next-Generation IDE: Maximizing IP Reuse [WHITE PAPER]
Atmel - Next-Generation IDE: Maximizing IP Reuse [WHITE PAPER]
 
Kostiantyn Bokhan, N-iX. CD4ML based on Azure and Kubeflow
Kostiantyn Bokhan, N-iX. CD4ML based on Azure and KubeflowKostiantyn Bokhan, N-iX. CD4ML based on Azure and Kubeflow
Kostiantyn Bokhan, N-iX. CD4ML based on Azure and Kubeflow
 
01 overview
01 overview01 overview
01 overview
 
01 overview
01 overview01 overview
01 overview
 
186 devlin p-poster(2)
186 devlin p-poster(2)186 devlin p-poster(2)
186 devlin p-poster(2)
 
Tutorial on Parallel Computing and Message Passing Model - C2
Tutorial on Parallel Computing and Message Passing Model - C2Tutorial on Parallel Computing and Message Passing Model - C2
Tutorial on Parallel Computing and Message Passing Model - C2
 
CK: from ad hoc computer engineering to collaborative and reproducible data s...
CK: from ad hoc computer engineering to collaborative and reproducible data s...CK: from ad hoc computer engineering to collaborative and reproducible data s...
CK: from ad hoc computer engineering to collaborative and reproducible data s...
 
A FRAMEWORK STUDIO FOR COMPONENT REUSABILITY
A FRAMEWORK STUDIO FOR COMPONENT REUSABILITYA FRAMEWORK STUDIO FOR COMPONENT REUSABILITY
A FRAMEWORK STUDIO FOR COMPONENT REUSABILITY
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core Processors
 
just in time JIT compiler
just in time JIT compilerjust in time JIT compiler
just in time JIT compiler
 
IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code Optimization
 
VIRTUAL LAB
VIRTUAL LABVIRTUAL LAB
VIRTUAL LAB
 
Deepcoder to Self-Code with Machine Learning
Deepcoder to Self-Code with Machine LearningDeepcoder to Self-Code with Machine Learning
Deepcoder to Self-Code with Machine Learning
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

MILEPOST GCC: machine learning based research compiler

  • 1. MILEPOST GCC: machine learning based research compiler Grigori Fursin, Mircea Namolaru, Edwin Bonilla, Cupertino Miranda, Elad Yom-Tov, John Thomson, Olivier Temam Ayal Zaks, Hugh Leather, INRIA Saclay, France Bilha Mendelson Chris Williams, IBM Haifa, Israel Michael O’Boyle University of Edinburgh, UK Phil Barnard, Elton Ashton Eric Courtois, Francois Bodin ARC International, UK CAPS Enterprise, France Contact: grigori.fursin@inria.fr Abstract The difficulty of achieving portable compiler perfor- mance has led to iterative compilation [11, 16, 15, 26, Tuning hardwired compiler optimizations for rapidly 33, 19, 30, 24, 25, 18, 20, 22] being proposed as a evolving hardware makes porting an optimizing com- means of overcoming the fundamental problem of static piler for each new platform extremely challenging. Our modeling. The compiler’s static model is replaced by radical approach is to develop a modular, extensible, a search of the space of compilation strategies to find self-optimizing compiler that automatically learns the the one which, when executed, best improves the pro- best optimization heuristics based on the behavior of the gram. Little or no knowledge of the current platform is platform. In this paper we describe MILEPOST1 GCC, needed so programs can be adapted to different archi- a machine-learning-based compiler that automatically tectures. It is currently used in library generators and adjusts its optimization heuristics to improve the exe- by some existing adaptive tools [36, 28, 31, 8, 1, 3]. cution time, code size, or compilation time of specific However it is largely limited to searching for combina- programs on different architectures. Our preliminary tions of global compiler optimization flags and tweaking experimental results show that it is possible to consider- a few fine-grain transformations within relatively nar- ably reduce execution time of the MiBench benchmark row search spaces. The main barrier to its wider use is suite on a range of platforms entirely automatically. the currently excessive compilation and execution time needed in order to optimize each program. This prevents its wider adoption in general purpose compilers. 1 Introduction Our approach is to use machine learning which has the potential to reuse knowledge across iterative compila- Current architectures and compilers continue to evolve tion runs, gaining the benefits of iterative compilation bringing higher performance, lower power and smaller while reducing the number of executions needed. size while attempting to keep time to market as short as possible. Typical systems may now have multiple het- The MILEPOST project’s [4] objective is to develop erogeneous reconfigurable cores and a great number of compiler technology that can automatically learn how to compiler optimizations available, making manual com- best optimize programs for configurable heterogeneous piler tuning increasingly infeasible. Furthermore, static embedded processors using machine learning. It aims to compilers often fail to produce high-quality code due to dramatically reduce the time to market of configurable a simplistic model of the underlying hardware. systems. Rather than developing a specialised compiler 1 MILEPOST - MachIne Learning for Embedded PrOgramS op- by hand for each configuration, MILEPOST aims to Timization [4] produce optimizing compilers automatically. 1
A key goal of the project is to make machine-learning-based compilation a realistic technology for general-purpose compilation. Current approaches [29, 32, 10, 14] are highly preliminary and limited to global compiler flags or simple transformations considered in isolation. GCC was selected as the compiler infrastructure for MILEPOST as it is currently the most stable and robust open-source compiler. It supports multiple architectures and has many aggressive optimizations, making it a natural vehicle for our research. In addition, each new version usually features new transformations, demonstrating the need for a system that automatically retunes its optimization heuristics.

In this paper we present early experimental results showing that it is possible to improve the performance of the well-known MiBench [23] benchmark suite on a range of platforms including x86 and IA64. We ported our tools to the new ARC GCC 4.2.1 that targets ARC International's configurable core family. Using MILEPOST GCC, after a few weeks of training, we were able to learn a model that automatically improves the execution time of the MiBench benchmarks by 11%, demonstrating the use of our machine-learning-based compiler.

This paper is organized as follows: the next section describes the overall MILEPOST framework and is followed by a section detailing our implementation of the Interactive Compilation Interface for GCC, which enables dynamic manipulation of optimization passes. Section 4 describes the machine learning techniques used to predict good optimization passes for programs using static program features and optimization knowledge reuse. Section 5 provides experimental results and is followed by concluding remarks.

2 MILEPOST Framework

The MILEPOST project uses a number of components, at the heart of which is the machine-learning-enabled MILEPOST GCC, shown in Figure 1. MILEPOST GCC currently proceeds in two distinct phases, in accordance with typical machine learning practice: training and deployment.

Figure 1: Framework to automatically tune programs and improve default optimization heuristics using machine learning techniques: MILEPOST GCC with the Interactive Compilation Interface (ICI) and program feature extractor, and the Continuous Collective Compilation Framework to train the ML model and predict good optimization passes

Training. During the training phase we need to gather information about the structure of programs and record how they behave when compiled under different optimization settings. Such information allows machine learning tools to correlate aspects of program structure, or features, with optimizations, building a strategy that predicts a good combination of optimizations.

In order to learn a good strategy, machine learning tools need a large number of compilations and executions as training examples. These training examples are generated by a tool, the Continuous Collective Compilation Framework [2] (CCC), which evaluates different compilation optimizations, storing execution time, code size and other metrics in a database. The features of the program are extracted from MILEPOST GCC via a plugin and are also stored in the database. Plugins allow fine-grained control and examination of the compiler, driven externally through shared libraries.

Deployment. Once sufficient training data is gathered, a model is created using machine learning. The model is able to predict good optimization strategies for a given set of program features and is built as a plugin so that it can be re-inserted into MILEPOST GCC. On encountering a new program, the plugin determines the program's features and passes them to the model, which determines the optimizations to be applied.

Framework. In this paper we use a new version of the Interactive Compilation Interface (ICI) for GCC, which controls the internal optimization decisions and their parameters using external plugins. It now allows the complete substitution of default internal optimization heuristics as well as the order of transformations.

We use the Continuous Collective Compilation Framework [2] to produce a training set for machine learning models to learn how to optimize programs for the best performance, code size, power consumption or any other objective function needed by the end-user. This framework allows knowledge of the optimization space to be reused among different programs, architectures and data sets.

Together with additional routines needed for machine learning, such as program feature extraction, this forms MILEPOST GCC, which transforms the compiler suite into a powerful research tool suitable for adaptive computing.

The next section describes the new ICI structure and explains how program features can be extracted for later machine learning in Section 4.
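To make the training phase more concrete, a single record gathered by the CCC Framework can be pictured roughly as the structure below. This is only an illustrative sketch: the field names, sizes and layout are assumptions, not the actual schema of the CCC database.

```c
#define MAX_FLAGS  256
#define N_FEATURES 64          /* size of the static feature vector (illustrative) */

/* One training example: the compilation strategy that was tried, the static
 * program features extracted by MILEPOST GCC, and the metrics recorded after
 * running the compiled program.  Field names and sizes are illustrative. */
struct training_record {
  char   flags[MAX_FLAGS];        /* e.g. "-O3 -funroll-loops ..."          */
  double features[N_FEATURES];    /* static program features                */
  double execution_time;          /* seconds                                */
  double compilation_time;        /* seconds                                */
  long   code_size;               /* bytes                                  */
};
```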
3 Interactive Compilation Interface

This section describes the Interactive Compilation Interface (ICI). The ICI provides opportunities for external control and examination of the compiler. Optimization settings at a fine-grained level, beyond the capabilities of command-line options or pragmas, can be managed through external shared libraries, leaving the compiler uncluttered.

The first version of ICI [21] was reactive and required minimal changes to GCC. It was, however, unable to modify the order of optimization passes within the compiler, so large opportunities for speedup were closed to it. The new version of ICI [9] expands on the capabilities of its predecessor, permitting the pass order to be modified. This version of ICI is used in MILEPOST GCC to automatically learn good sequences of optimization passes. By replacing default optimization heuristics, execution time, code size and compilation time can be improved.

3.1 Internal structure

To avoid the drawbacks of the first version of the ICI, we designed a new version, as shown in Figure 2. This version can transparently monitor the execution of passes or replace the GCC Controller (Pass Manager), if desired. Passes can be selected by an external plugin, which may choose to drive them in a very different order to that currently used in GCC, even choosing different pass orderings for each and every function in the program being compiled. Furthermore, the plugin can provide its own passes, implemented entirely outside of GCC.

In an additional set of enhancements, a coherent event and data passing mechanism enables external plugins to discover the state of the compiler and to be informed as it changes. At various points in the compilation process, events (IC-Events) are raised indicating decisions about transformations. Auxiliary data (IC-Data) is registered if needed.
Figure 2: GCC Interactive Compilation Interface: a) original GCC, b) GCC with ICI and plugins

Since plugins extend GCC through external shared libraries, experiments can be built with no further modifications to the underlying compiler. Modifications for different analysis, optimization and monitoring scenarios can therefore proceed in a tight engineering environment. These plugins communicate with external drivers and allow both high-level scripting and communication with machine learning frameworks such as MILEPOST GCC.

Note that it is not the goal of this project to develop a fully fledged plugin system. Rather, we show the utility of such approaches for iterative compilation and machine learning in compilers. We may later utilize the GCC plugin systems currently in development, for example [7] and [13].

Figure 3 shows some of the modifications needed to enable ICI in GCC, together with an example of a passive plugin that monitors executed passes. The plugin is invoked by the new -fici GCC flag or by setting the ICI_USE environment variable to 1 (to enable non-intrusive optimizations without changes to Makefiles). When GCC detects these options, it loads a plugin (dynamic library) with a name specified by the ICI_PLUGIN environment variable and checks for two functions, start and stop, as shown in Figure 3a.

The start function of the example plugin registers an event handler function executed_pass on an IC-Event called pass_execution.

Figure 3c shows simple modifications to the GCC Controller (Pass Manager) to enable monitoring of executed passes. When the GCC Controller function execute_one_pass is invoked, we register an IC-Parameter called pass_name giving the real name of the executed pass and trigger an IC-Event pass_execution. This in turn invokes the plugin function executed_pass, where we can obtain the name of the function currently being compiled using ici_get_feature("function_name") and the pass name using ici_get_parameter("pass_name").

IC-Features provide read-only data about the compilation state. IC-Parameters, on the other hand, can be dynamically changed by plugins to change the subsequent behavior of the compiler. Such behavior modification is demonstrated in the next subsection using the avoid_gate event and gate_status parameter needed for dynamic pass manipulation.

Since we use the name field from the GCC pass structure to identify passes, we have had to ensure that each pass has a unique name. Previously, some passes had no name at all, and we suggest that always having unique names would be a good development practice in the future.
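To make the plugin mechanics concrete, here is a minimal sketch of such a passive monitoring plugin. The functions start, stop and executed_pass, the pass_execution event, and the calls ici_get_feature("function_name") and ici_get_parameter("pass_name") are the names used above; the registration calls (ici_register_event, ici_unregister_event), the handler signature and the return types are assumptions and may differ from the real ICI headers.

```c
/* Minimal sketch of a passive ICI monitoring plugin, built as a shared
 * library and selected through the ICI_PLUGIN environment variable together
 * with -fici (or ICI_USE=1).  ici_get_feature and ici_get_parameter are the
 * calls named in the text; the registration functions and all prototypes
 * below are assumptions. */
#include <stdio.h>

/* Assumed ICI client API. */
extern void ici_register_event(const char *event_name, void (*handler)(void));
extern void ici_unregister_event(const char *event_name);
extern const void *ici_get_feature(const char *feature_name);
extern const void *ici_get_parameter(const char *parameter_name);

/* Invoked every time the Pass Manager raises the pass_execution event. */
static void executed_pass(void)
{
  const char *function_name = (const char *) ici_get_feature("function_name");
  const char *pass_name     = (const char *) ici_get_parameter("pass_name");
  fprintf(stderr, "ICI: compiling %s, executing pass %s\n",
          function_name, pass_name);
}

/* GCC looks for start and stop in the loaded plugin. */
int start(void)
{
  ici_register_event("pass_execution", executed_pass);
  return 0;
}

int stop(void)
{
  ici_unregister_event("pass_execution");
  return 0;
}
```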
Figure 3: Some GCC modifications to enable ICI and an example of a plugin to monitor executed passes: a) IC Framework within GCC, b) IC Plugin to monitor executed passes, c) GCC Controller (Pass Manager) modification

3.2 Dynamic Manipulation of GCC Passes

Previous research shows a great potential to improve program execution time or reduce code size by carefully selecting global compiler flags or transformation parameters using iterative compilation. The quality of generated code can also be improved by selecting different optimization orders, as shown in [16, 15, 17, 26]. Our approach combines the selection of optimal optimization orders with the tuning of transformation parameters at the same time.

The new version of ICI enables arbitrary selection of legal optimization passes and has a mechanism to change parameters of transformations within passes. Since GCC currently does not provide enough information about dependencies between passes to detect legal orders, and the optimization space is too large to check all possible combinations, we focused on detecting influential passes and legal orders of optimizations. We examined the pass orders generated by compiler flags that improved program execution time or code size using iterative compilation.

Before we attempt to learn good optimization settings and pass orders, we first confirmed that there is indeed performance to be gained within GCC from such actions; otherwise there is no point in trying to learn. Using the Continuous Collective Compilation Framework [2] to randomly search the optimization flag space (50% probability of selecting each optimization flag) with MILEPOST GCC 4.2.2 on an AMD Athlon64 3700+ and an Intel Xeon 2800MHz, we could improve the execution time of susan_corners by around 16%, compile time by 22% and code size by 13% using Pareto-optimal points, as described in previous work [24, 25]. Note that the same combination of flags degrades the execution time of this benchmark on an Itanium-2 1.3GHz by 80%, demonstrating the importance of adapting compilers to each new architecture. Figure 4a shows the combination of flags found for this benchmark on the AMD platform, while Figures 4b and 4c show the passes invoked and monitored by MILEPOST GCC for the default -O3 level and for the best combination of flags, respectively.

Given that there is good performance to be gained by searching for good compiler flags, we now wish to automatically select good optimization passes and transformation parameters. These should enable fine-grained program and compiler tuning as well as non-intrusive continuous program optimization without modifications to Makefiles, etc.
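The Pareto filtering mentioned above (keeping only those flag combinations that are not dominated simultaneously in execution time, compilation time and code size) can be sketched as follows. The struct layout and the sample data are purely illustrative.

```c
#include <stdio.h>

/* One evaluated combination of compiler flags (illustrative layout). */
struct point {
  double exec_time;   /* seconds */
  double comp_time;   /* seconds */
  double code_size;   /* bytes   */
};

/* Returns 1 if a is dominated by b: b is no worse in every metric and
 * strictly better in at least one. */
static int dominated(const struct point *a, const struct point *b)
{
  int no_worse = b->exec_time <= a->exec_time &&
                 b->comp_time <= a->comp_time &&
                 b->code_size <= a->code_size;
  int better   = b->exec_time <  a->exec_time ||
                 b->comp_time <  a->comp_time ||
                 b->code_size <  a->code_size;
  return no_worse && better;
}

/* Print the Pareto-optimal subset of n evaluated points. */
static void pareto_front(const struct point *p, int n)
{
  for (int i = 0; i < n; i++) {
    int is_dominated = 0;
    for (int j = 0; j < n && !is_dominated; j++)
      if (j != i && dominated(&p[i], &p[j]))
        is_dominated = 1;
    if (!is_dominated)
      printf("point %d: %.2fs exec, %.2fs compile, %.0f bytes\n",
             i, p[i].exec_time, p[i].comp_time, p[i].code_size);
  }
}

int main(void)
{
  const struct point evaluated[] = {
    { 1.20, 4.0, 52000.0 },
    { 1.05, 5.5, 61000.0 },
    { 1.25, 3.2, 48000.0 }
  };
  pareto_front(evaluated, 3);
  return 0;
}
```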
Figure 4: a) Selection of compiler flags found using the CCC Framework with a uniform random search strategy that improves execution and compilation time for the susan_corners benchmark over -O3, b) recorded compiler passes for -O3 using ICI, c) recorded compiler passes for the good selection of flags

The current version of ICI allows passes to be called directly using the ici_run_pass function, which in turn invokes the GCC function execute_one_pass. Therefore, we can circumvent the default GCC Pass Manager and either execute good sequences of passes previously found by the CCC Framework, as shown in Figure 2b, or search for new good orders of optimizations. However, we leave the determination of pass interactions and dependencies for future work.

To verify that we can change the default optimization pass orders using ICI, we recompiled the same benchmark with the -O3 flag but selecting the passes shown in Figure 4c. Note, however, that the GCC internal function execute_one_pass shown in Figure 3c has a gate control (pass->gate()) that executes a pass only if the associated optimization flag is selected. To avoid this gate control we use the IC-Parameter gate_status and the IC-Event avoid_gate, so that we can set gate_status to TRUE within plugins and thus force execution of the pass. Execution of the generated binary shows that we improve its execution time by 13% instead of 16%; the reason is that some compiler flags not only invoke an associated pass, such as -funroll-loops, but also select specific fine-grain transformation parameters and influence code generation in other passes. Thus, at this point we recompile programs with such flags always enabled, and in the future we plan to add support for such cases explicitly.
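A hypothetical plugin fragment that replays a recorded pass sequence in this way might look like the sketch below. ici_run_pass, the gate_status IC-Parameter and the avoid_gate IC-Event are the names used above; ici_register_event and ici_set_parameter, the prototypes and the example pass names are assumptions.

```c
/* Sketch of a plugin that replays a pass sequence previously recorded by the
 * CCC Framework and forces gated passes to run.  ici_run_pass, gate_status
 * and avoid_gate are the names used in the text; ici_register_event,
 * ici_set_parameter, the prototypes and the pass names are assumptions. */
extern void ici_register_event(const char *event_name, void (*handler)(void));
extern void ici_set_parameter(const char *parameter_name, const void *value);
extern void ici_run_pass(const char *pass_name); /* calls execute_one_pass */

/* Illustrative pass sequence found by iterative search. */
static const char *good_sequence[] = {
  "fre", "copyprop", "dce", "unroll", 0
};

/* Raised when a pass gate is about to be evaluated; force the gate open. */
static void force_gate(void)
{
  static const int true_value = 1;
  ici_set_parameter("gate_status", &true_value);
}

int start(void)
{
  ici_register_event("avoid_gate", force_gate);
  for (int i = 0; good_sequence[i] != 0; i++)
    ici_run_pass(good_sequence[i]);
  return 0;
}

int stop(void) { return 0; }
```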
3.3 Adding a program feature extractor pass

Our machine-learnt model predicts the best GCC optimizations to apply to an input program based on its program structure, or program features. The program features are typically a summary of the internal program representation and characterize the essential aspects of a program that the model needs in order to distinguish between good and bad optimizations.

The current version of ICI allows auxiliary passes that are not part of the default GCC compiler to be invoked. These passes can monitor and profile the compilation process or extract the data structures needed to generate program features.

During compilation, the program is represented by several data structures implementing the intermediate representation (tree-SSA, RTL, etc.), the control flow graph (CFG), def-use chains, the loop hierarchy, and so on. The data structures available depend on the compilation pass currently being performed. For statistical machine learning, the information about these data structures is encoded as a constant-size vector of numbers (i.e., features); this process is called feature extraction and is needed to enable optimization knowledge reuse among different programs.

Therefore, we implemented an additional GCC pass, ml-feat, to extract static program features. This pass is not invoked during default compilation but can be called using the extract_program_static_features plugin after any arbitrary pass starting from FRE, when all the GCC data necessary to produce features is ready.

In MILEPOST GCC, feature extraction is performed in two stages. In the first stage, a relational representation of the program is extracted; in the second stage, the vector of features is computed from this representation.

In the first stage, the program is considered to be characterized by a number of entities and relations over these entities. The entities are a direct mapping of similar entities defined by the language reference, or generated during compilation. Examples of such entities are variables, types, instructions, basic blocks, temporary variables, etc.

A relation over a set of entities is a subset of their Cartesian product. The relations specify properties of the entities or the connections between them. For describing the relations we used a notation based on logic: Datalog, a Prolog-like language with a simpler semantics, suitable for expressing relations and operations between them [35, 34].

For extracting the relational representation of the program, we used a simple method based on the examination of the include files. The compiler's main data structures are struct data types with a number of fields. Each such struct data type may introduce an entity, and its fields may introduce relations over the entity representing the enclosing struct data type and the entity representing the data type of the field. This data is collected by the ml-feat pass.

In the second stage, we provide a Prolog program defining the features to be computed from the Datalog relations extracted from the compiler's internal data structures in the first stage. The extract_program_static_features plugin invokes a Prolog compiler to execute this program, the result being the vector of features shown in Table 1, which can later be used by the Continuous Collective Compilation Framework to build machine learning models and predict the best sequences of passes for new programs. This example shows the flexibility and capabilities of the new version of ICI.
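As a toy illustration of the second stage, the counts collected over entities and relations are eventually flattened into a constant-size numeric vector, roughly as below. The feature names and the C representation are invented for illustration only; the real vector of Table 1 is produced by the Prolog program over the Datalog facts emitted by ml-feat.

```c
#include <stdio.h>

#define N_FEATURES 5   /* the real vector of Table 1 is much longer */

/* Counts gathered from the relational program representation (illustrative). */
struct program_relations {
  int n_basic_blocks;
  int n_instructions;
  int n_conditional_branches;
  int n_loops;
};

/* Flatten the counts into the constant-size feature vector consumed by the
 * machine learning model; ratios make programs of different sizes comparable. */
static void extract_features(const struct program_relations *r,
                             double features[N_FEATURES])
{
  int bbs = r->n_basic_blocks ? r->n_basic_blocks : 1;
  features[0] = r->n_basic_blocks;
  features[1] = r->n_instructions;
  features[2] = (double) r->n_instructions / bbs;   /* average block size */
  features[3] = r->n_conditional_branches;
  features[4] = r->n_loops;
}

int main(void)
{
  struct program_relations r = { 42, 310, 27, 4 };
  double f[N_FEATURES];
  extract_features(&r, f);
  for (int i = 0; i < N_FEATURES; i++)
    printf("feature[%d] = %.2f\n", i, f[i]);
  return 0;
}
```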
4 Using Machine Learning to Select Good Optimization Passes

The previous sections have described the infrastructure necessary to build a learning compiler. In this section we describe how this infrastructure is used to build a model.

Our approach to selecting good passes for programs is based upon the construction of a probabilistic model on a set of M training programs and the use of this model to make predictions of "good" optimization passes on unseen programs.

Our specific machine learning method is similar to that of [10], where a probability distribution over "good" solutions (i.e., optimization passes or compiler flags) is learnt across different programs. This approach has been referred to in the literature as Predictive Search Distributions (PSD) [12]. However, unlike [10, 12], where such a distribution is used to focus the search for compiler optimizations on a new program, we use the learned distribution to make one-shot predictions on unseen programs. Thus we do not search for the best optimization, we automatically predict it.

4.1 The Machine Learning Model

Given a set of training programs T^1, ..., T^M, which can be described by (vectors of) features t^1, ..., t^M, and for which we have evaluated different sequences of optimization passes x and their corresponding execution times (or speed-ups y), so that we have for each program T^j an associated dataset D_j = {(x_i, y_i)}_{i=1}^{N_j}, with j = 1, ..., M, our goal is to predict a good sequence of optimization passes x* when a new program T* is presented.

We approach this problem by learning a mapping from the features of a program t to a distribution over good solutions q(x | t, θ), where θ are the parameters of the distribution. Once this distribution has been learnt, prediction on a new program T* is straightforward: it is achieved by sampling at the mode of the distribution. In other words, we obtain the predicted sequence of passes by computing

    x* = argmax_x q(x | t, θ).                      (1)

4.2 Continuous Collective Compilation Framework

We used the Continuous Collective Compilation Framework [2] and MILEPOST GCC, shown in Figure 1, to generate a training set of programs together with compiler flags selected uniformly at random, the associated sequences of passes, program features and speedups (as well as code size and compilation time), stored in the externally accessible Global Optimization Database. We use this training set to build the machine learning model described in the next section, which in turn is used to predict the best sequence of passes for a new program given its feature vector. The current version of the CCC Framework requires minimal changes to the Makefile to pass optimization flags or sequences of passes, and it supports verifying the correctness of the binary by comparing program output with a reference output to avoid illegal combinations of optimizations.

4.3 Learning and Predicting

In order to learn the model it is necessary to fit a distribution over good solutions to each training program beforehand. These solutions can be obtained, for example, by uniform sampling or by running an estimation of distribution algorithm (EDA, see [27] for an overview) on each of the training programs. In our experiments we use uniform sampling, and we choose the set of good solutions to be those optimization settings that achieve at least 98% of the maximum speed-up available in the corresponding program-dependent dataset.

Let us denote the distribution over good solutions for each training program by P(x | T^j) with j = 1, ..., M. In principle, these distributions can belong to any parametric family. However, in our experiments we use an IID model where each element of the sequence is considered independently. In other words, the probability of a "good" sequence of passes is simply the product of the individual probabilities of each pass belonging to a good solution:

    P(x | T^j) = ∏_{l=1}^{L} P(x_l | T^j),          (2)

where L is the length of the sequence.

As proposed in [12], once the individual training distributions P(x | T^j) have been obtained, the predictive distribution q(x | t, θ) can be learnt by maximization of the conditional likelihood or by using k-nearest-neighbor methods. In our experiments we use a 1-nearest-neighbor approach. In other words, we set the predictive distribution q(x | t, θ) to be the distribution corresponding to the training program that is closest in feature space to the new (test) program.

Note that we currently predict "good" sequences of optimization passes that are associated with the best combinations of compiler flags. In future work we will investigate the application of our models to the general problem of determining the "optimal" order of optimization passes for programs.
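A compact sketch of the prediction step defined by Equations (1) and (2): find the training program whose feature vector is closest to the new program, take its per-pass probabilities of belonging to a good solution, and sample at the mode of the IID distribution by keeping every pass whose probability exceeds 0.5. All data and dimensions here are illustrative.

```c
#include <stdio.h>

#define N_FEATURES 4
#define N_PASSES   6
#define N_TRAIN    3

/* Per-training-program data: feature vector t_j and the IID distribution
 * P(x_l | T_j), i.e. how often pass l appears in the "good" solutions. */
static const double train_features[N_TRAIN][N_FEATURES] = {
  { 12,  340, 0.20,  4 },
  { 80, 2100, 0.60, 15 },
  {  7,   95, 0.10,  2 }
};
static const double good_pass_prob[N_TRAIN][N_PASSES] = {
  { 0.9, 0.1, 0.7, 0.2, 0.8, 0.3 },
  { 0.2, 0.8, 0.9, 0.6, 0.1, 0.7 },
  { 0.6, 0.4, 0.2, 0.9, 0.5, 0.1 }
};

static double sq_distance(const double *a, const double *b)
{
  double d = 0.0;
  for (int i = 0; i < N_FEATURES; i++)
    d += (a[i] - b[i]) * (a[i] - b[i]);
  return d;
}

/* Equation (1): the prediction is the mode of q(x|t), where q is the
 * distribution of the nearest training program (1-nearest neighbor). */
static void predict(const double *t_new, int selected[N_PASSES])
{
  int nearest = 0;
  for (int j = 1; j < N_TRAIN; j++)
    if (sq_distance(t_new, train_features[j]) <
        sq_distance(t_new, train_features[nearest]))
      nearest = j;

  /* Mode of the product of independent per-pass probabilities
   * (Equation (2)): keep pass l iff P(x_l | T_nearest) > 0.5. */
  for (int l = 0; l < N_PASSES; l++)
    selected[l] = good_pass_prob[nearest][l] > 0.5;
}

int main(void)
{
  const double t_new[N_FEATURES] = { 10, 300, 0.25, 3 };
  int selected[N_PASSES];
  predict(t_new, selected);
  for (int l = 0; l < N_PASSES; l++)
    printf("pass %d: %s\n", l, selected[l] ? "on" : "off");
  return 0;
}
```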
5 Experiments

We performed our experiments on four different platforms:

• AMD – a cluster with 16 AMD Athlon 64 3700+ processors running at 2.4GHz
• IA32 – a cluster with 4 Intel Xeon processors running at 2.8GHz
• IA64 – a server with an Itanium2 processor running at 1.3GHz
• ARC – an FPGA implementation of the ARC 725D processor running GNU/Linux with a 2.4.29 kernel
Figure 5: Experimental results a) when using iterative compilation with a random search strategy (500 iterations; 50% probability of selecting each flag; AMD, IA32, IA64), and b) when comparing iterative compilation against the optimization passes predicted using the ML model and MILEPOST GCC (ARC)

In all cases the compiler used is MILEPOST GCC 4.2.x with ICI version 0.9.6. We decided to use the open-source MiBench benchmark suite with MiDataSets [20, 6] (dataset No. 1 in all cases) due to its applicability to both general-purpose and embedded domains.

5.1 Generating Training Data

In order to build a machine learning model, we need training data. This was generated by a random exploration of a vast optimization search space using the CCC Framework, which generated 500 random sequences of flags, each either turned on or off. These flag settings can be readily associated with different sequences of optimization passes. Although such a number of runs is very small with respect to the optimisation space, we have shown that sufficient information can be gleaned from it to allow significant speedup. Indeed, the size of the space left unexplored serves to highlight our lack of knowledge in this area, and the need for further work.

First, features for each benchmark were extracted from the programs using the new pass within MILEPOST GCC, and these features were then sent to the Global Optimization Database within the CCC Framework. An ML model for each benchmark was built using the execution times gathered from 500 separate runs with different random sequences of passes and a fixed data set. Each run was repeated 5 times so that speedups were not caused by cache priming and similar effects. After each run, the experimental results, including execution time, compilation time, code size and program features, are sent to the database, where they are stored for future reference.
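The random exploration used to generate the training data can be sketched as follows: each training point switches every candidate flag on or off independently with probability 0.5, and the resulting command line is recorded for a CCC run. The flag list is an illustrative subset, not the set actually used.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative subset of candidate optimization flags. */
static const char *flags[] = {
  "-funroll-loops", "-ftree-vectorize", "-fpeel-loops",
  "-ftracer", "-fno-inline", "-funswitch-loops"
};
#define N_FLAGS     (sizeof(flags) / sizeof(flags[0]))
#define N_SEQUENCES 500

int main(void)
{
  srand((unsigned) time(NULL));
  for (int s = 0; s < N_SEQUENCES; s++) {
    printf("gcc -O3");
    for (size_t f = 0; f < N_FLAGS; f++)
      if (rand() % 2)              /* each flag selected with 50% probability */
        printf(" %s", flags[f]);
    printf(" benchmark.c\n");      /* command line recorded for a CCC run     */
  }
  return 0;
}
```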
Figure 5a shows that considerable speedups can already be obtained after iterative compilation on all platforms. However, this is a time-consuming process, and the different speedups across different platforms motivate the use of machine learning to automatically build specialized compilers and predict the best optimization flags or sequences of passes for different architectures.

5.2 Evaluating Model Performance

Once a model has been built for each of our benchmarks, we can evaluate the results by introducing a new program to the system and measuring how well the prediction provided by our model performs. In this work we did not use a separate testing benchmark suite due to time constraints, so leave-one-out cross-validation was used instead. Using this method, all training data relating to the benchmark being tested is excluded from the training process, and the models are rebuilt. When a second benchmark is tested, the training data pertaining to the first benchmark is returned to the training set, that of the second benchmark excluded, and so on. In this way we ensure that each benchmark is tested as a new program entering the system for the first time—of course, in real-world usage, this process is unnecessary.

When a new program is compiled, features are first generated using MILEPOST GCC. These features are then sent to our ML model within the CCC Framework (implemented as a MATLAB server), which processes them and returns a predicted sequence of passes that should improve execution time, reduce code size, or both. We then evaluate the prediction by compiling the program with the suggested sequence of passes, measuring the execution time and comparing it with the original time for the default -O3 optimization level. It is important to note that only one compilation occurs at evaluation—there is no search involved. Figure 5b shows these results for the ARC725D. It demonstrates that, except for a few pathological cases where the predicted flags degraded performance (whose analysis we leave for future work), using the CCC Framework, MILEPOST GCC and machine learning models we can improve the original ARC GCC by around 11%.

This suggests that our techniques and tools can be effective for building future iterative, adaptive, specialized compilers.
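The leave-one-out protocol described above can be sketched as the following loop over benchmarks. build_model, predict_passes and measure_speedup are stand-ins for the MATLAB model server, the MILEPOST GCC compilation and the timing runs; they are hypothetical helpers, not part of the real framework.

```c
#include <stdio.h>

#define N_BENCHMARKS 20

/* Stubs standing in for the real machinery (the MATLAB model server,
 * MILEPOST GCC and the timing runs); they only make the sketch compile. */
static void build_model(const int *training_set, int n)
{ (void) training_set; (void) n; }
static void predict_passes(int benchmark) { (void) benchmark; }
static double measure_speedup(int benchmark) { (void) benchmark; return 1.0; }

int main(void)
{
  int training_set[N_BENCHMARKS];

  for (int test = 0; test < N_BENCHMARKS; test++) {
    /* Exclude the benchmark under test from the training data. */
    int n = 0;
    for (int b = 0; b < N_BENCHMARKS; b++)
      if (b != test)
        training_set[n++] = b;

    build_model(training_set, n);   /* rebuild the model without it     */
    predict_passes(test);           /* one-shot prediction, no search   */
    printf("benchmark %d: speedup %.2f over -O3\n",
           test, measure_speedup(test));
  }
  return 0;
}
```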
6 Conclusions and Future Work

In this paper we have shown that MILEPOST GCC has significant potential for the automatic tuning of GCC optimizations. We plan to use these techniques and tools to further investigate the automatic selection of optimal orders of optimization passes and the fine-grain tuning of transformation parameters. The overall framework will also allow the analysis of interactions between optimizations and investigation of the influence of program inputs and run-time state on program optimizations. Future work will also include fine-grain run-time adaptation for multiple program inputs on heterogeneous multi-core architectures.

7 Acknowledgments

The authors would like to thank Sebastian Pop for his help with modifying and debugging GCC and John W. Lockhart for his help with editing this paper. This research is supported by the European project MILEPOST (MachIne Learning for Embedded PrOgramS opTimization) [4] and by the HiPEAC European Network of Excellence (High-Performance Embedded Architecture and Compilation) [5].

References

[1] ACOVEA: Using Natural Selection to Investigate Software Complexities. http://www.coyotegulch.com/products/acovea.

[2] Continuous Collective Compilation Framework. http://cccpf.sourceforge.net.

[3] ESTO: Expert System for Tuning Optimizations. http://www.haifa.ibm.com/projects/systems/cot/esto/index.html.

[4] EU MILEPOST project (MachIne Learning for Embedded PrOgramS opTimization). http://www.milepost.eu.

[5] European Network of Excellence on High-Performance Embedded Architecture and Compilation (HiPEAC). http://www.hipeac.net.

[6] MiDataSets SourceForge development site. http://midatasets.sourceforge.net.
[7] Plugin project. http://libplugin.sourceforge.net.

[8] QLogic PathScale EKOPath Compilers. http://www.pathscale.com.

[9] GCC Interactive Compilation Interface. http://gcc-ici.sourceforge.net, 2006.

[10] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. O'Boyle, J. Thomson, M. Toussaint, and C. Williams. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2006.

[11] F. Bodin, T. Kisuki, P. Knijnenburg, M. O'Boyle, and E. Rohou. Iterative compilation in a non-linear optimisation space. In Proceedings of the Workshop on Profile and Feedback Directed Compilation, 1998.

[12] E. V. Bonilla, C. K. I. Williams, F. V. Agakov, J. Cavazos, J. Thomson, and M. F. P. O'Boyle. Predictive search distributions. In W. W. Cohen and A. Moore, editors, Proceedings of the 23rd International Conference on Machine Learning, pages 121–128, New York, NY, USA, 2006. ACM.

[13] S. Callanan, D. J. Dean, and E. Zadok. Extending GCC with modular GIMPLE optimizations. In Proceedings of the GCC Developers' Summit 2007, 2007.

[14] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. O'Boyle, and O. Temam. Rapidly selecting good compiler optimizations using performance counters. In Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO), March 2007.

[15] K. Cooper, A. Grosul, T. Harvey, S. Reeves, D. Subramanian, L. Torczon, and T. Waterman. ACME: adaptive compilation made efficient. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2005.

[16] K. Cooper, P. Schielke, and D. Subramanian. Optimizing for reduced code space using genetic algorithms. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 1–9, 1999.

[17] B. Elliston. Studying optimisation sequences in the GNU Compiler Collection. Technical Report ZITE8199, School of Information Technology and Electrical Engineering, Australian Defence Force Academy, 2005.

[18] B. Franke, M. O'Boyle, J. Thomson, and G. Fursin. Probabilistic source-level optimisation of embedded programs. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2005.

[19] G. Fursin. Iterative Compilation and Performance Prediction for Numerical Applications. PhD thesis, University of Edinburgh, United Kingdom, 2004.

[20] G. Fursin, J. Cavazos, M. O'Boyle, and O. Temam. MiDataSets: Creating the conditions for a more realistic evaluation of iterative optimization. In Proceedings of the International Conference on High Performance Embedded Architectures and Compilers (HiPEAC 2007), January 2007.

[21] G. Fursin and A. Cohen. Building a practical iterative interactive compiler. In 1st Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation (SMART'07), colocated with HiPEAC 2007, January 2007.

[22] G. Fursin, A. Cohen, M. O'Boyle, and O. Temam. A practical method for quickly evaluating program optimizations. In Proceedings of the International Conference on High Performance Embedded Architectures and Compilers (HiPEAC 2005), pages 29–46, November 2005.

[23] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, December 2001.

[24] K. Heydemann and F. Bodin. Iterative compilation for two antagonistic criteria: Application to code size and performance. In Proceedings of the 4th Workshop on Optimizations for DSP and Embedded Systems, colocated with CGO, 2006.

[25] K. Hoste and L. Eeckhout. COLE: Compiler optimization level exploration. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2008.

[26] P. Kulkarni, W. Zhao, H. Moon, K. Cho, D. Whalley, J. Davidson, M. Bailey, Y. Paek, and K. Gallivan. Finding effective optimization phase sequences. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 12–23, 2003.

[27] P. Larrañaga and J. A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, Norwell, MA, USA, 2001.

[28] M. Frigo and S. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 1381–1384, Seattle, WA, May 1998.

[29] A. Monsifrot, F. Bodin, and R. Quiniou. A machine learning approach to automatic production of compiler heuristics. In Proceedings of the International Conference on Artificial Intelligence: Methodology, Systems, Applications, LNCS 2443, pages 41–50, 2002.

[30] Z. Pan and R. Eigenmann. Fast and effective orchestration of compiler optimizations for automatic performance tuning. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 319–332, 2006.

[31] B. Singer and M. Veloso. Learning to predict performance from formula modeling and training data. In Proceedings of the Conference on Machine Learning, 2000.

[32] M. Stephenson and S. Amarasinghe. Predicting unroll factors using supervised classification. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 123–134, 2005.

[33] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. August. Compiler optimization-space exploration. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 204–215, 2003.

[34] J. Ullman. Principles of Database and Knowledge-Base Systems, volume 1. Computer Science Press, 1988.

[35] J. Whaley and M. S. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2004.

[36] R. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proc. Alliance, 1998.