1. In Search of the Perfect Global Interpreter Lock
David Beazley
http://www.dabeaz.com
@dabeaz
October 15, 2011
Presented at RuPy 2011
Poznan, Poland
Copyright (C) 2010, David Beazley, http://www.dabeaz.com 1
2. Introduction
• As many programmers know, Python and Ruby
feature a Global Interpreter Lock (GIL)
• More precisely: CPython and MRI
• It limits thread performance on multicore
• Theoretically restricts code to a single CPU
3. An Experiment
• Consider a trivial CPU-bound function
def countdown(n):
    while n > 0:
        n -= 1
• Run it once with a lot of work
COUNT = 100000000 # 100 million
countdown(COUNT)
• Now, divide the work across two threads
t1 = Thread(target=countdown, args=(COUNT//2,))
t2 = Thread(target=countdown, args=(COUNT//2,))
t1.start(); t2.start()
t1.join(); t2.join()
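A self-contained version of the experiment (the timing scaffolding is mine; the count is reduced from the talk's 100 million so the demo finishes quickly):

```python
import time
from threading import Thread

def countdown(n):
    # Pure CPU-bound loop: it never blocks, so it never gives up the GIL voluntarily
    while n > 0:
        n -= 1

# The talk uses 100 million; reduced here so the demo runs fast
COUNT = 10_000_000

start = time.time()
countdown(COUNT)
sequential = time.time() - start

# Same total work, split across two threads
t1 = Thread(target=countdown, args=(COUNT // 2,))
t2 = Thread(target=countdown, args=(COUNT // 2,))
start = time.time()
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.time() - start

print("Sequential: %.2fs  Threaded: %.2fs" % (sequential, threaded))
```

On a GIL'd interpreter the threaded run is no faster, and on multicore it is often slower.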
4. An Experiment
• Some Ruby
def countdown(n)
  while n > 0
    n -= 1
  end
end
• Sequential
COUNT = 100000000 # 100 million
countdown(COUNT)
• Subdivided across threads
t1 = Thread.new { countdown(COUNT/2) }
t2 = Thread.new { countdown(COUNT/2) }
t1.join
t2.join
5. Expectations
• Sequential and threaded versions perform the
same amount of work (same # calculations)
• There is the GIL... so no parallelism
• Performance should be about the same
11. Results
• Ruby 1.9 on Windows Server 2008 (2 cores)
Sequential : 3.32s
Threaded (2 threads) : 3.45s (~ same)
• Python 2.7 (same machine)
Sequential : 6.9s
Threaded (2 threads) : 63.0s (9.1x slower!)
• Why does it get that much slower on Windows?
12. Experiment: Messaging
• A request/reply server for size-prefixed messages
Client Server
• Each message: a size header + payload
• Similar: ZeroMQ
13. An Experiment: Messaging
• A simple test - message echo (pseudocode)
def client(nummsg, msg):
    while nummsg > 0:
        send(msg)
        resp = recv()
        sleep(0.001)
        nummsg -= 1

def server():
    while True:
        msg = recv()
        send(msg)
14. An Experiment: Messaging
• To be less evil, it's throttled (<1000 msg/sec)
• Not a messaging stress test
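A concrete version of this echo pair might look like the following sketch, using TCP sockets on localhost; the helper names, port choice, and message count are illustrative, not from the talk:

```python
import socket
import time
from threading import Thread

def server(listener):
    # Echo server: receive a message, send it straight back
    conn, _ = listener.accept()
    while True:
        msg = conn.recv(8192)
        if not msg:
            break
        conn.sendall(msg)
    conn.close()

def client(nummsg, msg, port):
    s = socket.create_connection(("localhost", port))
    total = 0
    while nummsg > 0:
        s.sendall(msg)
        resp = s.recv(8192)
        total += len(resp)
        time.sleep(0.001)   # throttle: stay under ~1000 msg/sec
        nummsg -= 1
    s.close()
    return total

listener = socket.socket()
listener.bind(("localhost", 0))   # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

Thread(target=server, args=(listener,), daemon=True).start()
total = client(100, b"x" * 1024, port)
print("echoed bytes:", total)
```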
15. An Experiment: Messaging
• A test: send/receive 1000 8K messages
• Scenario 1: Unloaded server
Client Server
• Scenario 2 : Server competing with one CPU-thread
CPU-Thread
Client Server
16. Results
• Messaging with no threads (OS-X, 4 cores)
C : 1.26s
Python 2.7 : 1.29s
Ruby 1.9 : 1.29s
17. Results
• Messaging with one CPU-bound thread*
C : 1.16s (~8% faster!?)
Python 2.7 : 12.3s (10x slower)
Ruby 1.9 : 42.0s (33x slower)
• Hmmm. Curious.
* On Ruby, the CPU-bound thread
was also given lower priority
18. Results
• Messaging with no threads (Linux, 8 CPUs)
C : 1.13s
Python 2.7 : 1.18s
Ruby 1.9 : 1.18s
19. Results
• Messaging with one CPU-bound thread
C : 1.11s (same)
Python 2.7 : 1.60s (1.4x slower) - better
Ruby 1.9 : 5839.4s (~5000x slower) - worse!
20. Results
• 5000x slower? Really? Why?
21. The Mystery Deepens
• Disable all but one CPU core
• CPU-bound threads (OS-X)
Python 2.7 (4 cores+hyperthreading) : 9.28s
Python 2.7 (1 core) : 7.9s (faster!)
• Messaging with one CPU-bound thread
Ruby 1.9 (4 cores+hyperthreading) : 42.0s
Ruby 1.9 (1 core) : 10.5s (much faster!)
• ?!?!?!?!?!?
22. Better is Worse
• Change software versions
• Let's upgrade to Python 3 (Linux)
Python 2.7 (Messaging) : 12.3s
Python 3.2 (Messaging) : 20.1s (1.6x slower)
• Let's downgrade to Ruby 1.8 (Linux)
Ruby 1.9 (Messaging) : 42.0s
Ruby 1.8.7 (Messaging) : 10.0s (4x faster)
• So much for progress (sigh)
23. What's Happening?
• The GIL does far more than limit cores
• It can make performance much worse
• Better performance by turning off cores?
• 5000x performance hit on Linux?
• Why?
24. Why You Might Care
• Must you abandon Python/Ruby for concurrency?
• Having threads restricted to one CPU core might
be okay if it were sane
• Analogy: A multitasking operating system
(e.g., Linux) runs fine on a single CPU
• Plus, threads get used a lot behind the scenes
(even in thread alternatives, e.g., async)
25. Why I Care
• It's an interesting little systems problem
• How do you make a better GIL?
• It's fun.
26. Some Background
• I have been discussing some of these issues
in the Python community since 2009
http://www.dabeaz.com/GIL
• I'm less familiar with Ruby, but I've looked at
its GIL implementation and experimented
• Very interested in commonalities/differences
27. A Tale of Two GILs
28. Thread Implementation
• Python: system threads (e.g., pthreads), managed by the OS, concurrent execution of the Python interpreter (written in C)
• Ruby: system threads (e.g., pthreads), managed by the OS, concurrent execution of the Ruby VM (written in C)
29. Alas, the GIL
• Parallel execution is forbidden
• There is a "global interpreter lock"
• The GIL ensures that only one thread runs in
the interpreter at once
• Simplifies many low-level details (memory
management, callouts to C extensions, etc.)
31. Thread Execution Model
• The GIL results in cooperative multitasking
[Diagram: three threads on one timeline, alternating run and block; the GIL is released at each block and acquired at each run]
• When a thread is running, it holds the GIL
• GIL released on blocking (e.g., I/O operations)
32. Threads for I/O
• For I/O it works great
• GIL is never held very long
• Most threads just sit around sleeping
• Life is good
33. Threads for Computation
• You may actually want to compute something!
• Fibonacci numbers
• Image/audio processing
• Parsing
• The CPU will be busy
• And it won't give up the GIL on its own
34. CPU-Bound Switching
• Python: releases and reacquires the GIL every 100 "ticks" (1 tick ~= 1 interpreter instruction)
• Ruby: a background thread generates a timer interrupt every 10ms; the GIL is released and reacquired by the current thread on interrupt
35. Python Thread Switching
[Diagram: a CPU-bound thread runs 100 ticks, releases the GIL, reacquires it, runs another 100 ticks, and so on]
• Every 100 VM instructions, the GIL is dropped, allowing other threads to run if they want
• Not time-based: the switching interval depends on the kind of instructions executed
36. Ruby Thread Switching
[Diagram: a timer thread fires every 10ms; on each tick the CPU-bound thread releases and reacquires the GIL]
• Loosely mimics the time-slice of the OS
• Every 10ms, GIL is released/acquired
37. A Common Theme
• Both Python and Ruby have C code like this:
void execute() {
while (inst = next_instruction()) {
// Run the VM instruction
...
if (must_release_gil) {
GIL_release();
/* Other threads may run now */
GIL_acquire();
}
}
}
• Exact details vary, but concept is the same
• Each thread has periodic release/acquire in the
VM to allow other threads to run
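A toy Python model of that C loop (purely illustrative; the real logic lives in C inside each VM):

```python
import threading

gil = threading.Lock()
CHECK_INTERVAL = 100   # stand-in for Python's 100 "ticks"

def run_thread(instructions, done, ident):
    # Toy model of the VM loop: run while holding the "GIL",
    # periodically releasing it so other threads may grab it
    gil.acquire()
    for i, inst in enumerate(instructions, 1):
        inst()
        if i % CHECK_INTERVAL == 0:
            gil.release()
            # other threads may (or may not!) run here
            gil.acquire()
    gil.release()
    done.append(ident)

done = []
insts = [lambda: None] * 250   # 250 trivial "instructions"
t1 = threading.Thread(target=run_thread, args=(insts, done, 1))
t2 = threading.Thread(target=run_thread, args=(insts, done, 2))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(done))   # [1, 2]
```

Note the comment in the middle: nothing here forces another thread to actually get the lock at the release point, which is exactly the problem the next slides explore.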
38. Question
• What can go wrong with this bit of code?
if (must_release_gil) {
GIL_release();
/* Other threads may run now */
GIL_acquire();
}
• Short answer: Everything!
40. Thread Switching
• Suppose you have two threads
[Diagram: Thread 1 running; Thread 2 READY, waiting for the GIL]
• Thread 1 : Running
• Thread 2 : Ready (Waiting for GIL)
41. Thread Switching
• Easy case : Thread 1 performs I/O (read/write)
[Diagram: Thread 1 releases the GIL and blocks on I/O; pthreads/OS schedules Thread 2, which acquires the GIL and starts running]
• Thread 1 : Releases GIL and blocks for I/O
• Thread 2 : Gets scheduled, starts running
42. Thread Switching
• Tricky case : Thread 1 runs until preempted
[Diagram: Thread 1 is preempted and releases the GIL; both threads are runnable. Which thread runs?]
43. Thread Switching
• You might expect that Thread 2 will run
[Diagram: Thread 1 is preempted and releases the GIL; pthreads/OS schedules Thread 2, which acquires the GIL and runs while Thread 1 sits READY]
• But you assume the GIL plays nice...
44. Thread Switching
• What might actually happen on multicore
[Diagram: Thread 1 releases the GIL and immediately reacquires it; Thread 2's scheduled acquire fails (GIL locked) and it stays READY]
• Both threads attempt to run simultaneously
• ... but only one will succeed (depends on timing)
45. Fallacy
• This code doesn't actually switch threads
if (must_release_gil) {
GIL_release();
/* Other threads may run now */
GIL_acquire();
}
• It might switch threads, but it depends
• What operating system
• # cores
• Lock scheduling policy (if any)
46. Fallacy
• This doesn't force switching (sleeping)
if (must_release_gil) {
GIL_release();
sleep(0);
/* Other threads may run now */
GIL_acquire();
}
• It might switch threads, but it depends
• What operating system
• # cores
• Lock scheduling policy (if any)
47. Fallacy
• Neither does this (calling the scheduler)
if (must_release_gil) {
GIL_release();
sched_yield();
/* Other threads may run now */
GIL_acquire();
}
• It might switch threads, but it depends
• What operating system
• # cores
• Lock scheduling policy (if any)
48. A Conflict
• There are conflicting goals
• Python/Ruby - wants to run on a single
CPU, but doesn't want to do thread
scheduling (i.e., let the OS do it).
• OS - "Oooh. Multiple cores."
Schedules as many runnable tasks as
possible at any instant
• Result: Threads fight with each other
49. Multicore GIL Battle
• Python 2.7 on OS-X (4 cores)
Sequential : 6.12s
Threaded (2 threads) : 9.28s (1.5x slower!)
[Diagram: Thread 1 runs in 100-tick bursts, releasing and reacquiring the GIL each time; Thread 2 is repeatedly scheduled, fails to acquire the GIL, and only eventually gets to run]
• Millions of failed GIL acquisitions
50. Multicore GIL Battle
• You can see it! (2 CPU-bound threads)
[Screenshot: CPU monitor showing combined utilization above 100%, with multiple cores partly busy. Why >100%?]
• Comment: In Python, it's very rapid
• GIL is released every few microseconds!
51. I/O Handling
• If there is a CPU-bound thread, I/O bound
threads have a hard time getting the GIL
[Timeline: Thread 1 (CPU-bound, on CPU 1) keeps running and reacquiring the GIL after each preemption; Thread 2 (on CPU 2) wakes for a network packet, and its GIL acquisitions fail, possibly repeating 100s-1000s of times, before one finally succeeds and it runs]
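One way to glimpse this effect from Python (an illustrative sketch, not the talk's benchmark): time how late a sleeping thread wakes up when a CPU-bound thread is hogging the GIL.

```python
import time
from threading import Thread

stop = False

def cpu_bound():
    # Hogs the GIL: it only drops it at the interpreter's switch points
    x = 0
    while not stop:
        x += 1

def worst_wakeup_delay(n):
    # Stand-in for an I/O thread: sleep briefly, then measure how far
    # past the deadline it gets back onto the CPU (i.e., reacquires the GIL)
    worst = 0.0
    for _ in range(n):
        t0 = time.time()
        time.sleep(0.001)
        worst = max(worst, time.time() - t0 - 0.001)
    return worst

baseline = worst_wakeup_delay(50)
t = Thread(target=cpu_bound, daemon=True)
t.start()
contended = worst_wakeup_delay(50)
stop = True
t.join()
print("alone: %.4fs  contended: %.4fs" % (baseline, contended))
```

On most machines the contended delay is noticeably larger than the baseline, though the exact numbers depend on OS, core count, and interpreter version.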
52. Messaging Pathology
• Messaging on Linux (8 Cores)
Ruby 1.9 (no threads) : 1.18s
Ruby 1.9 (1 CPU thread) : 5839.4s
• Locks in Linux have no fairness
• Consequence: Really hard to steal the GIL
• And Ruby only retries every 10ms
53. Let's Talk Fairness
• Fair-locking means that locks have some notion
of priorities, arrival order, queuing, etc.
[Diagram: t0 holds the lock while t1-t5 wait in order; after t0 releases, t1 runs and t0 joins the end of the line (t2 t3 t4 t5 t0)]
• Releasing means you go to end of line
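A sketch of what FIFO fairness means, as a toy ticket-style lock in Python; the class and its names are mine, not any interpreter's actual implementation:

```python
import collections
import threading
import time

class FairLock:
    # A FIFO ("ticket") lock: release hands the lock to the
    # longest-waiting thread instead of whoever grabs it first
    def __init__(self):
        self._mutex = threading.Lock()
        self._waiters = collections.deque()
        self._held = False

    def acquire(self):
        with self._mutex:
            if not self._held:
                self._held = True
                return
            ev = threading.Event()
            self._waiters.append(ev)
        ev.wait()          # woken exactly when it is our turn

    def release(self):
        with self._mutex:
            if self._waiters:
                self._waiters.popleft().set()   # hand off in arrival order
            else:
                self._held = False

lock = FairLock()
order = []

def worker(i):
    lock.acquire()
    order.append(i)
    lock.release()

lock.acquire()   # hold the lock so the workers queue up behind it
threads = []
for i in range(3):
    t = threading.Thread(target=worker, args=(i,))
    t.start()
    time.sleep(0.05)     # crude: fix the arrival order for the demo
    threads.append(t)
lock.release()
for t in threads:
    t.join()
print(order)             # arrival order, e.g. [0, 1, 2]
```

A plain `threading.Lock` gives no such guarantee: whichever thread the OS happens to run first wins the lock.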
54. Effect of Fair-Locking
• Ruby 1.9 (multiple cores)
Messages + 1 CPU Thread (OS-X) : 42.0s
Messages + 1 CPU Thread (Linux) : 5839.4s
• Question: Which one uses fair locking?
55. Effect of Fair-Locking
• Ruby 1.9 (multiple cores)
Messages + 1 CPU Thread (OS-X) : 42.0s (Fair)
Messages + 1 CPU Thread (Linux) : 5839.4s
• Benefit : I/O threads get their turn (yay!)
56. Effect of Fair-Locking
• Python 2.7 (multiple cores)
2 CPU-Bound Threads (OS-X) : 9.28s
2 CPU-Bound Threads (Windows) : 63.0s
• Question: Which one uses fair-locking?
57. Effect of Fair-Locking
• Python 2.7 (multiple cores)
2 CPU-Bound Threads (OS-X) : 9.28s
2 CPU-Bound Threads (Windows) : 63.0s (Fair)
• Problem: Too much context switching
58. Fair-Locking - Bah!
• In reality, you don't want fairness
• Messaging Revisited (OS X, 4 Cores)
Ruby 1.9 (No Threads) : 1.29s
Ruby 1.9 (1 CPU-Bound thread) : 42.0s (33x slower)
• Why is it still 33x slower?
• Answer: Fair locking! (and convoying)
59. Messaging Revisited
• Go back to the messaging server
def server():
    while True:
        msg = recv()
        send(msg)
60. Messaging Revisited
• The actual implementation (size-prefixed messages)
def server():
    while True:
        size = recv(4)
        msg = recv(size)
        send(size)
        send(msg)
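The pseudocode glosses over short reads; a fuller sketch of size-prefixed framing over real sockets (helper names are mine, not from the talk):

```python
import socket
import struct
from threading import Thread

def recv_exact(sock, n):
    # Loop because recv() may return fewer than n bytes
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def send_msg(sock, payload):
    # 4-byte big-endian size header, then the payload
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock):
    (size,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, size)

def server(listener):
    conn, _ = listener.accept()
    try:
        while True:
            send_msg(conn, recv_msg(conn))   # echo
    except ConnectionError:
        conn.close()

listener = socket.socket()
listener.bind(("localhost", 0))   # let the OS pick a free port
listener.listen(1)
Thread(target=server, args=(listener,), daemon=True).start()

c = socket.create_connection(listener.getsockname())
send_msg(c, b"A" * 8192)          # one 8K message, as in the benchmark
echoed = recv_msg(c)
print(len(echoed))
c.close()
```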
61. Performance Explained
• What actually happens under the covers
def server():
    while True:
        size = recv(4)     # GIL released
        msg = recv(size)   # GIL released
        send(size)         # GIL released
        send(msg)          # GIL released
• Why? Each operation might block
• Catch: Passes control back to CPU-bound thread
62. Performance Illustrated
[Diagram: data arrives, then each of the four recv/send operations hands the GIL to the CPU-bound thread for a full 10ms timer slice before the I/O thread can continue (recv, recv, send, send, done)]
• Each message has 40ms response cycle
• 1000 messages x 40ms = 40s (42.0s measured)
64. A Solution?
Don't use threads!
• Yes, yes, everyone hates threads
• However, that's only because they're useful!
• Threads are used for all sorts of things
• Even if they're hidden behind the scenes
65. A Better Solution
Make the GIL better
• It's probably not going away (very difficult)
• However, does it have to thrash wildly?
• Question: Can you do anything?
66. GIL Efforts in Python 3
• Python 3.2 has a new GIL implementation
• It's imperfect--in fact, it has a lot of problems
• However, people are experimenting with it
67. Python 3 GIL
• GIL acquisition now based on timeouts
[Diagram: Thread 1 runs holding the GIL; Thread 2 leaves I/O wait when data arrives and does wait(gil, TIMEOUT); after the 5ms timeout expires it sets drop_request, and Thread 1 releases the GIL]
• Involves waiting on a condition variable
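The switch interval is exposed in Python 3.2+ and can be inspected or tuned at runtime:

```python
import sys

# Python 3.2 replaced the 100-tick check with a time-based switch
# interval: a waiting thread times out, sets a drop request, and the
# running thread releases the GIL at the next check
print(sys.getswitchinterval())    # 0.005 (5ms) by default
sys.setswitchinterval(0.001)      # ask for more frequent switches
print(sys.getswitchinterval())
```

Shrinking the interval trades CPU-bound throughput for I/O responsiveness; it does not remove the convoying problem described below, only shortens the delay.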
68. Problem: Convoying
• CPU-bound threads significantly degrade I/O
[Diagram: each time data arrives, Thread 2 sits READY for the full 5ms timeout before Thread 1 releases the GIL and lets it briefly run]
• This is the same problem as in Ruby
• Just a shorter time delay (5ms)
69. Problem: Convoying
• You can directly observe the delays (messaging)
Python/Ruby (No threads) : 1.29s (no delays)
Python 3.2 (1 Thread) : 20.1s (5ms delays)
Ruby 1.9 (1 Thread) : 42.0s (10ms delays)
• Still not great, but problem is understood
71. Priorities
• Best promise : Priority scheduling
• Earlier versions of Ruby had it
• It works (OS-X, 4 cores)
Ruby 1.9 (1 Thread) : 42.0s
Ruby 1.8.7 (1 Thread) : 40.2s
Ruby 1.8.7 (1 Thread, lower priority) : 10.0s
• Comment: Ruby 1.9 allows thread priorities to be set via pthreads, but it doesn't seem to have much (if any) effect
72. Priorities
• Experimental Python-3.2 with priority scheduler
• Also features immediate preemption
• Messages (OS X, 4 Cores)
Python 3.2 (No threads) : 1.29s
Python 3.2 (1 Thread) : 20.2s
Python 3.2+priorities (1 Thread) : 1.21s (faster?)
• That's a lot more promising!
73. New Problems
• Priorities bring new challenges
• Starvation
• Priority inversion
• Implementation complexity
• Do you have to write a full OS scheduler?
• Hopefully not, but it's an open question
74. Final Words
• Implementing a GIL is a lot trickier than it looks
• Even work with priorities has problems
• Good example of how multicore is diabolical
75. Thanks for Listening!
• I hope you learned at least one new thing
• I'm always interested in feedback
• Follow me on Twitter (@dabeaz)