PyPy's approach to construct domain-specific language runtime
1. Tag: virtual machine, compiler, performance
PyPy’s Approach to Construct Domain-specific
Language Runtime
2. Tag: virtual machine, compiler, performance
Construct Domain-specific Language Runtime
using
3. Speed
7.4 times faster than CPython
http://speed.pypy.org
antocuni (PyCon Otto) PyPy Status Update April 07 2017 4 / 19
4. Why is Python slow?
Interpretation overhead
Boxed arithmetic and automatic overflow handling
Dynamic dispatch of operations
Dynamic lookup of methods and attributes
Everything can change at runtime
Extreme introspective and reflective capabilities
Francisco Fernandez Castano (@fcofdezc) PyPy November 8, 2014 8 / 51
5. Why is Python slow?
Boxed arithmetic and automatic overflow handling
i = 0
while i < 10000000:
    i = i + 1
6. Why is Python slow?
Dynamic dispatch of operations
# while i < 10000000
 9 LOAD_FAST                0 (i)
12 LOAD_CONST               2 (10000000)
15 COMPARE_OP               0 (<)
18 POP_JUMP_IF_FALSE       34
# i = i + 1
21 LOAD_FAST                0 (i)
24 LOAD_CONST               3 (1)
27 BINARY_ADD
28 STORE_FAST               0 (i)
31 JUMP_ABSOLUTE            9
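The listing above looks like CPython 2.7 output; opcode names and offsets differ between interpreter versions. A minimal sketch for reproducing such a listing with the standard `dis` module follows. The wrapper function `count` is an assumption made for illustration and is not part of the talk.

```python
import dis


def count():
    # Same loop as on the slide, wrapped in a function so that `i` is a
    # fast local. The function is only disassembled here, never called.
    i = 0
    while i < 10000000:
        i = i + 1
    return i


# Collect the bytecode instructions and print one per line.
instructions = list(dis.get_instructions(count))
for ins in instructions:
    print(ins.offset, ins.opname, ins.argrepr)
```

Every printed instruction is one trip through the interpreter's dispatch loop, which is the overhead this slide points at.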
7. Why is Python slow?
Dynamic lookup of methods and attributes
class MyExample(object):
    pass

def foo(target, flag):
    if flag:
        target.x = 42

obj = MyExample()
foo(obj, True)
print(obj.x)                # => 42
print(getattr(obj, "x"))    # => 42
8. Why is Python slow?
Everything can change at runtime
def fn():
    return 42

def hello():
    return 'Hi! PyConEs!'

def change_the_world():
    global fn
    fn = hello

print(fn())    # => 42
change_the_world()
print(fn())    # => 'Hi! PyConEs!'
9. Why is Python slow?
Everything can change at runtime
class Dog(object):
    def __init__(self):
        self.name = 'Jandemor'

    def talk(self):
        print("%s: guau!" % self.name)

class Cat(object):
    def __init__(self):
        self.name = 'CatInstance'

    def talk(self):
        print("%s: miau!" % self.name)
10. Why is Python slow?
Everything can change at runtime
my_pet = Dog()
my_pet.talk()    # => 'Jandemor: guau!'
my_pet.__class__ = Cat
my_pet.talk()    # => 'Jandemor: miau!'
11. Why is Python slow?
Extreme introspective and reflective capabilities
import sys

def fill_list(name):
    # Reach into the caller's frame and mutate one of its local variables.
    frame = sys._getframe().f_back
    lst = frame.f_locals[name]
    lst.append(42)

def foo():
    things = []
    fill_list('things')
    print(things)    # => [42]

foo()
15. PyPy based interpreters
• Topaz (Ruby)
• HippyVM (PHP)
• Pyrolog (Prolog)
• pycket (Racket)
• Various other interpreters (Scheme, JavaScript, Io, Game Boy)
16. Compiler / Interpreter
Source: Compiler Construction, Prof. O. Nierstrasz
17. Traditional 2 pass compiler
• intermediate representation (IR)
• front end maps legal code into IR
• back end maps IR onto target machine
• simplify retargeting
• allows multiple front ends
• multiple passes → better code
18. Traditional 3 pass compiler
• analyzes and changes IR
• goal is to reduce runtime
• must preserve values
19. Optimizer: middle end
• constant propagation and folding
• code motion
• reduction of operator strength
• common sub-expression elimination
• redundant store elimination
• dead code elimination
Modern optimizers are usually built as a set of passes
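To make one of the listed passes concrete, here is a minimal sketch of constant folding over a toy expression IR. The tuple-based IR and the name `fold` are assumptions made for illustration; they do not come from the talk or from any real compiler.

```python
def fold(expr):
    """Fold constant sub-expressions of a toy IR.

    An expression is an int (constant), a str (variable name),
    or a tuple (op, left, right) with op in {"add", "mul"}.
    """
    if not isinstance(expr, tuple):
        return expr  # constants and variables are already as simple as possible
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        # Both operands are known at compile time: compute the result now.
        return left + right if op == "add" else left * right
    return (op, left, right)


# x + (2 * 3) becomes x + 6; the multiplication disappears from the IR.
folded = fold(("add", "x", ("mul", 2, 3)))
print(folded)
```

A real middle end would chain several such passes, each taking IR and returning IR, which is what "built as a set of passes" refers to.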
20. Optimization Challenges
• Preserve language semantics
  • Reflection, Introspection, Eval
  • External APIs
• Interpreter consists of short sequences of code
  • Prevent global optimizations
  • Typically implemented as a stack machine
• Dynamic, imprecise type information
  • Variables can change type
  • Duck Typing: method works with any object that provides the accessed interfaces
  • Monkey Patching: add members to “class” after initialization
• Memory management and concurrency
• Function calls through packing of operands in fat object
22. RPython
• Python subset
• Statically typed
• Garbage collected
• Standard library almost entirely unavailable
• Some missing builtins (print, open(), …)
• rpython.rlib
• exceptions are (sometimes) ignored
• Not really a language, rather a "state"
26. CFG (Control Flow Graph)
• Consists of Blocks and Links
• Starting from entry_point
• “Static Single Information” form
def f(n):
    return 3 * n + 2

Block(v1): # input argument
    v2 = mul(Constant(3), v1)
    v3 = add(v2, Constant(2))
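The block above can be read as a straight-line program over single-assignment variables. As a rough sketch, the following code models that one block and evaluates it. The classes `Constant` and `Variable` and the helper `run_block` are stand-ins written for this example; they are not RPython's own flow-graph model.

```python
import operator


class Constant:
    def __init__(self, value):
        self.value = value


class Variable:
    def __init__(self, name):
        self.name = name


# Hypothetical operation table covering only what the example block needs.
OPS = {"mul": operator.mul, "add": operator.add}


def run_block(inputargs, operations, result, args):
    """Evaluate one branch-free block: each variable is assigned exactly once."""
    env = dict(zip(inputargs, args))  # Variable -> runtime value

    def value_of(x):
        return x.value if isinstance(x, Constant) else env[x]

    for res, opname, operands in operations:
        env[res] = OPS[opname](*[value_of(o) for o in operands])
    return env[result]


v1, v2, v3 = Variable("v1"), Variable("v2"), Variable("v3")
operations = [
    (v2, "mul", [Constant(3), v1]),   # v2 = mul(Constant(3), v1)
    (v3, "add", [v2, Constant(2)]),   # v3 = add(v2, Constant(2))
]
print(run_block([v1], operations, v3, [5]))  # same value as f(5)
```

Links between blocks are left out here because `f` has no branches.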
27. CFG: Static Single Information
def test(a):
    if a > 0:
        if a > 5:
            return 10
        return 4
    if a < -10:
        return 3
    return 10
• SSI: “PHIs” for all used variables
• Blocks as “functions without branches”
28. PyPy Advantages
• High Level Language Implementation
  • to implement new features: lazily computed objects and functions, pluggable garbage collection, runtime replacement of live objects, stackless concurrency
• JIT Generation
• Object space
• Stackless
  • infinite recursion
  • Microthreads: Coroutines, Tasklets and Channels, Greenlets
30. Assumptions
Pareto Principle (80-20 rule)
  • 20% of the program accounts for 80% of the runtime
  • hot-spots
Fast Path principle
  • optimize only what is necessary
  • fall back for uncommon cases
Most of runtime spent in loops
Always the same code paths (likely)
antocuni (Intel@Bucharest) PyPy Intro April 4 2016 9 / 32
38. Tracing JIT phases
[State diagram: Interpretation → Tracing → Compilation → Running]
• Interpretation → Tracing: hot loop detected
• Interpretation → Running: entering compiled loop
• Running → Interpretation: cold guard failed
• Running → Tracing: hot guard failed (guard failure → hot)
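The phase transitions on this slide can be sketched with a per-loop counter. This is a toy model under stated assumptions: the class `ToyJit`, the method `enter_loop`, and the threshold value are all hypothetical, and tracing and compilation are collapsed into a single step.

```python
HOT_THRESHOLD = 3  # hypothetical value, chosen only to keep the example short


class ToyJit:
    def __init__(self):
        self.counters = {}     # loop id -> number of interpreted entries
        self.compiled = set()  # loop ids that already have "machine code"
        self.log = []          # phase taken on each loop entry

    def enter_loop(self, loop_id):
        if loop_id in self.compiled:
            # Interpretation -> Running: entering compiled loop
            self.log.append("running")
            return
        count = self.counters.get(loop_id, 0) + 1
        self.counters[loop_id] = count
        if count >= HOT_THRESHOLD:
            # Interpretation -> Tracing -> Compilation: hot loop detected
            self.log.append("tracing+compiling")
            self.compiled.add(loop_id)
        else:
            self.log.append("interpreting")


jit = ToyJit()
for _ in range(5):
    jit.enter_loop("while i < N")
print(jit.log)
```

Guard failures are not modelled; in the diagram they send execution back to the interpreter, or to tracing again once the failing guard itself becomes hot.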
39. Trace trees (1)
tracetree.py
def foo():
    a = 0
    i = 0
    N = 100
    while i < N:
        if i % 2 == 0:
            a += 1
        else:
            a *= 2
        i += 1
    return a
51. Part 3
The PyPy JIT
52. Terminology (1)
translation time: when you run "rpython targetpypy.py" to get the pypy binary
runtime: everything which happens after you start pypy
interpretation, tracing, compiling
assembler/machine code: the output of the JIT compiler
execution time: when your Python program is being executed
  • by the interpreter
  • by the machine code
53. Terminology (2)
interp-level: things written in RPython
[PyPy] interpreter: the RPython program which executes the final Python programs
bytecode: "the output of dis.dis". It is executed by the PyPy interpreter.
app-level: things written in Python, and executed by the PyPy interpreter
54. Terminology (3)
(the following is not 100% accurate but it's enough to understand the general principle)
low level op or ResOperation
  • low-level instructions like "add two integers", "read a field out of a struct", "call this function"
  • (more or less) the same level as C ("portable assembler")
  • knows about GC objects (e.g. you have getfield_gc vs getfield_raw)
jitcodes: low-level representation of RPython functions
  • sequence of low level ops
  • generated at translation time
  • 1 RPython function --> 1 C function --> 1 jitcode
55. Terminology (4)
JIT traces or loops
  • a very specific sequence of llops as actually executed by your Python program
  • generated at runtime (more specifically, during tracing)
JIT optimizer: takes JIT traces and emits JIT traces
JIT backend: takes JIT traces and emits machine code
67. PyPy trace example
def fn():
    c = a+b
    ...

LOAD_GLOBAL A
LOAD_GLOBAL B
BINARY_ADD
STORE_FAST C

...
p0 = getfield_gc(p0, 'func_globals')
p2 = getfield_gc(p1, 'strval')
call(dict_lookup, p0, p2)
...
...
p0 = getfield_gc(p0, 'func_globals')
p2 = getfield_gc(p1, 'strval')
call(dict_lookup, p0, p2)
...
76. Virtuals (2)
unoptimized:
...
guard_class(p0, W_IntObject)
i1 = getfield_pure(p0, 'intval')
i2 = int_add(i1, 2)
p3 = new(W_IntObject)
setfield_gc(p3, i2, 'intval')
...
optimized:
...
i2 = int_add(i1, 2)
...
• The most important optimization (TM)
• It works both inside the trace and across the loop
• It works for tons of cases
  • e.g. function frames
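As a rough illustration of what removing virtuals means, the sketch below runs a toy pass over a trace written as `(result, opname, args)` tuples. The trace encoding and the name `remove_virtuals` are assumptions for this example, not PyPy's optimizer API. The pass also assumes that no allocated object escapes the trace; a real optimizer has to materialize an object as soon as it does.

```python
def remove_virtuals(trace):
    """Drop allocations whose fields are only written and read inside the trace."""
    virtuals = {}   # name of a non-escaping object -> {field name: value}
    aliases = {}    # result of a removed getfield -> the value it stood for
    optimized = []
    for result, opname, args in trace:
        # Replace any variable that a removed getfield would have produced.
        args = [aliases.get(a, a) for a in args]
        if opname == "new":
            virtuals[result] = {}                    # remember it, emit nothing
        elif opname == "setfield_gc" and args[0] in virtuals:
            obj, value, field = args
            virtuals[obj][field] = value             # track the field statically
        elif opname == "getfield_gc" and args[0] in virtuals:
            obj, field = args
            aliases[result] = virtuals[obj][field]   # read it back statically
        else:
            optimized.append((result, opname, args))
    return optimized


trace = [
    ("i2", "int_add", ["i1", 2]),
    ("p3", "new", ["W_IntObject"]),
    (None, "setfield_gc", ["p3", "i2", "intval"]),
    ("i4", "getfield_gc", ["p3", "intval"]),
    ("i5", "int_add", ["i4", 1]),
]
optimized_trace = remove_virtuals(trace)
for op in optimized_trace:
    print(op)
```

Only the two integer additions survive: the boxed intermediate result is never allocated.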
80. Constant folding (2)
unoptimized:
...
i1 = getfield_pure(p0, 'intval')
i2 = getfield_pure(<W_Int(2)>, 'intval')
i3 = int_add(i1, i2)
...
optimized:
...
i1 = getfield_pure(p0, 'intval')
i3 = int_add(i1, 2)
...
• It "finishes the job"
• Works well together with other optimizations (e.g. virtuals)
• It also does "normal, boring, static" constant-folding
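The rewrite shown on this slide can be mimicked with a small pass over a toy trace. This is a sketch under assumptions: the `(result, opname, a, b)` encoding, the `ConstObj` class, and the function `fold_constants` are invented for illustration and are not PyPy internals.

```python
class ConstObj:
    """Stand-in for a constant, immutable object such as <W_Int(2)>."""
    def __init__(self, intval):
        self.intval = intval


def fold_constants(trace):
    known = {}      # variable name -> value known at optimization time
    optimized = []
    for result, opname, a, b in trace:
        if opname == "getfield_pure" and isinstance(a, ConstObj):
            # Pure read from a constant object: the value is already known.
            known[result] = getattr(a, b)
            continue
        if opname == "int_add":
            a, b = known.get(a, a), known.get(b, b)
            if isinstance(a, int) and isinstance(b, int):
                known[result] = a + b   # "normal, boring, static" folding
                continue
        optimized.append((result, opname, a, b))
    return optimized


trace = [
    ("i1", "getfield_pure", "p0", "intval"),
    ("i2", "getfield_pure", ConstObj(2), "intval"),
    ("i3", "int_add", "i1", "i2"),
]
folded_trace = fold_constants(trace)
for op in folded_trace:
    print(op)
```

The read from the constant object disappears and its value is inlined into `int_add`, matching the optimized column above.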
83. Out of line guards (1)
outoflineguards.py
N = 2

def fn():
    i = 0
    while i < 5000:
        i += N
    return i
84. Out of line guards (2)
unoptimized:
...
quasiimmut_field(<Cell>, 'val')
guard_not_invalidated()
p0 = getfield_gc(<Cell>, 'val')
...
i2 = getfield_pure(p0, 'intval')
i3 = int_add(i1, i2)
optimized:
...
guard_not_invalidated()
...
i3 = int_add(i1, 2)
...
• Python is too dynamic, but we don't care :-)
• No overhead in assembler code
• Used a bit "everywhere"
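One way to picture an out-of-line guard: the compiled loop treats the global as a constant, and the cost of checking is moved to the rare write. The sketch below is a simplified model with hypothetical classes `Cell` and `CompiledLoop`. It checks a flag on entry, which is a simplification of what the slide describes as having no overhead in the assembler code.

```python
class Cell:
    """Holds a global's value plus the compiled loops that assumed it constant."""
    def __init__(self, val):
        self.val = val
        self.dependents = []

    def write(self, val):
        # The rare, slow path: changing the value invalidates dependent code.
        self.val = val
        for loop in self.dependents:
            loop.valid = False
        self.dependents = []


class CompiledLoop:
    def __init__(self, cell):
        self.valid = True
        self.step = cell.val            # constant baked in at "compile" time
        cell.dependents.append(self)

    def run(self):
        if not self.valid:
            raise RuntimeError("invalidated: fall back to the interpreter")
        i = 0
        while i < 5000:
            i += self.step              # no lookup of N inside the loop
        return i


N = Cell(2)
loop = CompiledLoop(N)
first = loop.run()
N.write(3)                              # someone rebinds the global
second = CompiledLoop(N).run()          # a fresh compilation sees the new value
print(first, second, loop.valid)
```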
93.
10 PRINT TAB(32);"HAMURABI"
20 PRINT TAB(15);"CREATIVE COMPUTING MORRISTOWN, NEW JERSEY"
30 PRINT:PRINT:PRINT
80 PRINT "TRY YOUR HAND AT GOVERNING ANCIENT SUMERIA"
90 PRINT "FOR A TEN-YEAR TERM OF OFFICE.":PRINT
95 D1=0: P1=0
100 Z=0: P=95:S=2800: H=3000: E=H-S
110 Y=3: A=H/Y: I=5: Q=1
210 D=0
215 PRINT:PRINT:PRINT "HAMURABI: I BEG TO REPORT TO YOU,": Z=Z+1
217 PRINT "IN YEAR";Z;",";D;"PEOPLE STARVED,";I;"CAME TO THE CITY,"
218 P=P+I
227 IF Q>0 THEN 230
228 P=INT(P/2)
229 PRINT "A HORRIBLE PLAGUE STRUCK! HALF THE PEOPLE DIED."
230 PRINT "POPULATION IS NOW";P
232 PRINT "THE CITY NOW OWNS ";A;"ACRES."
235 PRINT "YOU HARVESTED";Y;"BUSHELS PER ACRE."
250 PRINT "THE RATS ATE";E;"BUSHELS."
260 PRINT "YOU NOW HAVE ";S;"BUSHELS IN STORE.": PRINT
270 REM *** MORE CODE THAT DID NOT FIT INTO THE SLIDE FOLLOWS
99.
>>> from basic.lexer import lex
>>> source = open("hello.bas").read()
>>> for token in lex(source):
...     print(token)
...
Token("NUMBER", "10")
Token("PRINT", "PRINT")
Token("STRING", '"HELLO BASIC!"')
Token(":", "\n")
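The `basic.lexer` module itself is not shown in the slides. As a hedged stand-in, the sketch below produces the same token stream for this one-line program using only the standard `re` module. The rule table and the plain `(name, text)` tuples are assumptions; the talk's lexer returns `Token` objects and may be built quite differently.

```python
import re

# Hypothetical rule table: just enough to tokenize a PRINT statement.
RULES = [
    ("NUMBER", re.compile(r"\d+")),
    ("STRING", re.compile(r'"[^"]*"')),
    ("PRINT", re.compile(r"PRINT")),
    (":", re.compile(r"[:\n]")),   # a newline separates statements, like ':'
]


def lex(source):
    pos = 0
    while pos < len(source):
        if source[pos] in " \t":
            pos += 1               # skip blanks between tokens
            continue
        for name, pattern in RULES:
            match = pattern.match(source, pos)
            if match:
                yield (name, match.group())
                pos = match.end()
                break
        else:
            raise ValueError("unexpected character %r at offset %d"
                             % (source[pos], pos))


tokens = list(lex('10 PRINT "HELLO BASIC!"\n'))
for token in tokens:
    print(token)
```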
100. Grammar
• A set of formal rules that defines the syntax
• terminals = tokens
• nonterminals = rules defining a sequence of one or more (non)terminals
150.
class Line(BaseBox):
    ...
    def compile(self, program):
        program.lineno2instruction[self.lineno] = len(program.instructions)
        for statement in self.statements:
            statement.compile(program)
152.
class Print(Statement):
    def compile(self, program):
        for expression in self.expressions:
            expression.compile(program)
        program.instructions.append(
            bytecode.Print(
                len(self.expressions),
                self.newline
            )
        )
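To see how the `compile` methods above fit together, here is a self-contained sketch that compiles two PRINT lines into a flat instruction list and runs it on a small stack machine. Only the two `compile` bodies follow the slides; `Program`, `PushConst`, `PrintOp`, `Const`, the constructors, and `run` are hypothetical stand-ins for parts the slides do not show.

```python
class Program:
    def __init__(self):
        self.instructions = []          # flat bytecode list
        self.lineno2instruction = {}    # BASIC line number -> index in that list


class PushConst:                        # hypothetical bytecode: push a constant
    def __init__(self, value):
        self.value = value


class PrintOp:                          # plays the role of bytecode.Print
    def __init__(self, count, newline):
        self.count = count
        self.newline = newline


class Const:                            # hypothetical AST node for a literal
    def __init__(self, value):
        self.value = value

    def compile(self, program):
        program.instructions.append(PushConst(self.value))


class Print:
    def __init__(self, expressions, newline=True):
        self.expressions = expressions
        self.newline = newline

    def compile(self, program):         # same shape as on the slide
        for expression in self.expressions:
            expression.compile(program)
        program.instructions.append(PrintOp(len(self.expressions), self.newline))


class Line:
    def __init__(self, lineno, statements):
        self.lineno = lineno
        self.statements = statements

    def compile(self, program):         # same shape as on the slide
        program.lineno2instruction[self.lineno] = len(program.instructions)
        for statement in self.statements:
            statement.compile(program)


def run(program):
    """Execute the instruction list on an operand stack; return printed text."""
    stack, output = [], []
    for ins in program.instructions:
        if isinstance(ins, PushConst):
            stack.append(ins.value)
        else:
            split = len(stack) - ins.count      # pop `count` operands
            values, stack = stack[split:], stack[:split]
            output.append("".join(str(v) for v in values))
            if ins.newline:
                output.append("\n")
    return "".join(output)


program = Program()
for line in [Line(10, [Print([Const("HELLO "), Const("BASIC!")])]),
             Line(20, [Print([Const(42)])])]:
    line.compile(program)
print(program.lineno2instruction)
print(run(program))
```

The `lineno2instruction` table is what lets a later `GOTO` or `GOSUB` jump to a BASIC line number by instruction index.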
173. Benchmark
10 N = 1
20 IF N <= 10000 THEN 40
30 END
40 GOSUB 100
50 IF R = 0 THEN 70
60 PRINT "PRIME"; N
70 N = N + 1: GOTO 20
100 REM *** ISPRIME N -> R
110 IF N <= 2 THEN 170
120 FOR I = 2 TO (N - 1)
130 A = N: B = I: GOSUB 200
140 IF R <> 0 THEN 160
150 R = 0: RETURN
160 NEXT I
170 R = 1: RETURN
200 REM *** MOD A -> B -> R
210 R = A - (B * INT(A / B))
220 RETURN
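For readers who do not read BASIC, the benchmark can be paraphrased in Python roughly as follows. This is a hedged port rather than code from the talk, and the function names are made up. It keeps the quirk of line 110, where every N <= 2 (so both 1 and 2) is reported as prime, and it computes the remainder the same way as the subroutine at line 200.

```python
def is_prime_flag(n):
    """Port of the subroutine at line 100: returns R (1 = reported as prime)."""
    if n <= 2:
        return 1                        # line 110 jumps straight to R = 1
    for i in range(2, n):               # FOR I = 2 TO (N - 1)
        remainder = n - i * (n // i)    # line 210: A - (B * INT(A / B))
        if remainder == 0:
            return 0                    # line 150: a divisor was found
    return 1


def count_reported(limit):
    """Count how many N in 1..limit the main loop would print as PRIME."""
    return sum(is_prime_flag(n) for n in range(1, limit + 1))


print(count_reported(100))
```

The slide runs the loop up to 10000; a smaller limit is used here so the example finishes quickly on any interpreter.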
175. Project milestones
2008 Django support
2010 First JIT-compiler
2011 Compatibility with CPython 2.7
2014 Basic ARM support
CPython 3 support
Improve compatibility with C extensions
NumPyPy
Multi-threading support
178. PyPy STM
10 loops, best of 3: 1.2 sec per loop
10 loops, best of 3: 822 msec per loop
# threaded version
from threading import Thread

def count(n):
    while n > 0:
        n -= 1

def run():
    t1 = Thread(target=count, args=(10000000,))
    t1.start()
    t2 = Thread(target=count, args=(10000000,))
    t2.start()
    t1.join(); t2.join()

# sequential version
def count(n):
    while n > 0:
        n -= 1

def run():
    count(10000000)
    count(10000000)
Inside the Python GIL - David Beazley
179. PyPy in the real world (1)
High frequency trading platform for sports bets
  • low latency is a must
PyPy used in production since 2012
~100 PyPy processes running 24/7
up to 10x speedups
  • after careful tuning and optimizing for PyPy
180. PyPy in the real world (2)
Real-time online advertising auctions
  • tight latency requirement (<100ms)
  • high throughput (hundreds of thousands of requests per second)
30% speedup
"We run PyPy basically everywhere" (Julian Berman)
181. PyPy in the real world (3)
IoT on the cloud
5-10x faster
"We do not even run benchmarks on CPython because we just know that PyPy is way faster" (Tobias Oberstein)