On Nov 11, 2016, Alibaba smashed its own online transaction record at the world's largest online shopping event (known as 11.11), by processing 175,000 online ordering transactions per second. Many Alibaba’s e-Commerce applications are written in Java. To cater for the specific needs to run these applications, we identified the requirements and optimized these features on the customized version of HotSpot.
3. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Java at Alibaba
Single VS multiple instance deployment
AliOS(xa64, Linux)
AJDK
CaiNiao
Middleware
C2C, B2C,
B2B
Alipay OthersAliYun
Alibaba/Alipay JDK, based on OpenJDK
All the Java applications developed in Alibaba and
affiliated companies, such like Alipay, CaiNiao are running
on AJDK
• Service oriented architecture
• Services communicate with each other via RPC
• Typically, many JVMs as one cluster per service
• Heterogeneous
• communicate between the native libraries written
in C/C++ and Java via JNI
Running Java at massive scale
• Million instances of JVMs, serving a very large
number of requests every second
4. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
China's Singles' Day
the world's biggest online shopping day
3200
14,000
42,000
80,000
140,000
175,000
1200 3,850
15,000
38,000
86,000
120,000
0
20000
40000
60000
80000
100000
120000
140000
160000
180000
200000
2011 2012 2013 2014 2015 2016
Transantions per Second(at peek)
Alibaba cloud platform Alibaba payment service
• On Singles Day in 2016, peak transactions/sec reached 175k, and peak payment orders processed/sec
reached 120k
• In contrast, Visa only processed a peak volume of 56K messages per second(source:
www.digitaltransactions.net)
6. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Single vs multiple instance deployment
Single VS multiple instance deployment
• JavaEE containers such as Apache Tomcat support deploying multiple web
applications into the same container
• We don’t do this in real production environment , one of major reasons is the absence of
resource consumption control
JavaEE container
App App1
App2
Java VM
JavaEE container
Java VM
Appx…
7. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Building containers inside JVM
Create “tenant” resource container per
application unit, referred as “tenant”.
The application unit may be mapped
as a single thread, a thread group, a
war or even an OSGi bundle.
The resources consumed by an
application unit are accounted for per
tenant
AJDK
Java module Java module Java module
Memory CPU I/O
Tenant Tenant Tenant
OS
8. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Tenant Container API
• TenantConfiguration: maintains resource configuration information for
tenant
• TenantContainer: represents resource container, thread switches to the
context of tenant by calling TenantContainer.run()
CPU Usage Quota
Heap Usage Quota
All relevant resource allocations(CPU/Heap) in this tenant
context are accounted for by the tenant.
9. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Thread and tenant
Attach/detach
• Attached to tenant automatically after entering into TenantContainer.run()
• Detached after leaving from run block
Inheritance
• Threads forked in TenantContainer.run() will be attached to parent thread’s
tenant automatically
Limit
• Provide option to limit the maximum number of threads a tenant is allowed to
use
1|TenantContainer tenant = getTenant();
2|tenant.run(()->{
3| …… //switches to new tenant context
4|});
10. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
CPU Throttling per Tenant
Control Group directly provided by Linux kernel
Allocate resources for aggregated
processes(threads)
Partition all processes and children into
groups
Organize groups in hierarchies
Associate groups with particular resources
Manage resource among groups
Delegate CPU management to Control
Group(cgroup)
So far, we only implement it on X64 platform.
1:1
12. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Tenant Heap Management
• G1GC divides the entire heap into
equally-sized regions, well suited for
tenant management
• TAC(tenant allocation context) manages
the regions for tenant
• Code running in a tenant allocates
objects in a region it owns
• New regions can be requested up to
tenant maximum reservation
S O O E E S S O O E S E E O O
TAC1 TAC2 TAC3
Tenant1 Tenant2 Tenant3
T1 T2 T3
T6T4 T5
HeapRegion
TAC
13. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Tenant Heap Management(2)
Eden Survivor FreeEden
Free
Free
Free
Tenant-2
OldSurvivor
Survivor
Survivor Free
Tenant-2
Tenant-1
Tenant-1
Before GC
After GC
• Mark and Copy: copy object from “from space” to “to space”
• Be sure to copy object into correct tenant region
14. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Other Changes for Tenant Isolation
o TLAB(Thread Local Allocation Buffer)
Option 1: Retire TLAB when thread switches into different tenant
• Easy to implement, but waste space on TLAB, might cause
frequent YGC
Option 2: Tenant aware TLAB, maps TLAB into correct heap region
of tenant
o IHOP(Initiating Heap Occupancy Percent)
Calculate IHOP per tenant
• Otherwise, may suffer FULL GC when only some tenant reaches
to limit
Do we need real “GC per tenant” ??
15. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Resource Usage Monitoring
o Monitor CPU/Heap usage for each of tenants in
runtime via JMX API
• CPU: cpuacct subsystem generates automatic
reports on CPU resources.
• HEAP: TAC helps us to track how many heap
regions are used for the tenant.
16. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Where we use this?
AliTomcat, use it to manage shared resource when multiple WAR applications are
deployed in same instance.
CloudEngine, built on OSGi technology, widely used in Alipay, use this for resource
management for OSGi bundle.
Some ecommerce/information management applications, build resource management
capacity upon this technology.
• E.g. limit the CPU/HEAP usage of groovy script, which is mostly used as
externalized script for business rule definition.
Haven’t yet implemented IO(network/disk) throttling for tenant, may consider it
in future plan.
18. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Thread Context Switch Cost
• 4 IO threads, and 200
business logic threads
• Much more of CPU
time(sy) are consumed by
thread context switch,
~100k / s (on 4 core
machine)
19. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Proactor Pattern in Java
Asynchronous I/O API(.aka. NIO.2), introduced by JSR203
Proactor pattern allows event-driven applications to efficiently demultiplex and dispatch
requests triggered by the completion of asynchronous IO
• Scales really well
• maximize throughput by avoiding unnecessary context switching
Completion Handler:
process completion events
processing of
requests proactively,
together with
completion
handlers
20. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Drawbacks in asynchronous programming
o Complexity of programming
• Hard to program applications using asynchrony mechanisms due to the separation
between operation invocation and completion
• Poor coding readability, hard to understand
o Complexity of debugging and testing
• 'single stepping‘ within a debugger doesn’t work for callbacks on application-
specific handlers
o A lot of existing applications don’t work in this way
• RPC framework
• Third party libraries
21. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Demo for Our ApproachDemo for Our Approach
Thread Dump
Transparent to java developer.
Coroutine used as likes Thread.
Simplicity of synchronous, performance of
asynchronous
• Control transferring between coroutines,
transparently made by underlying implementation.
1
2
3
Connect Write
Accept Read
yieldexecution suspended
23. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Java NIO Selector
Selector: allows only single thread to be used for managing
multiple channels
• Monitoring events for multiple socket channels
• Recognizing when one or more become available for data
transfer
Key idea : use Selector mechanism for coroutine scheduling for
blocked IO.
24. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
How Does It Work: IO Read Demystified
read(ByteBuffer bb) {
if ((n = ch.read(bb)) != 0)
return n;
do {
engine.registerEvent( ch,OP_READ);
engine.scheduleNext();
} while ((n = ch.read(bb)) == 0);
return n;
}
Register ‘read’ event with
scheduler(using selector)
Block current and yield to
another available co-routine.
Thread Scheduler1:1 1:N
Coroutine 1
Coroutine 2
Coroutine 3
…
Selector
• get notified when IO readable
or writable
IO
IO
IO
IO
25. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
How Does the Scheduler Work
o Wisp engine(Scheduler) is used to schedule next available coroutine, which is allocated per
thread.
o Interested event will be registered with Wisp engine when thread gets blocked
IO(network r/w)
Thread parking(e.g. Thread.sleep)
o Schedule next coroutine when:
Interested events happen:
• IO completed(via Selector)
• Timeout(e.g. Thread.sleep)
• Thread unparking(e.g. Object.notify)
New request arrives (newly submitted Runnable)
Coroutine/thread exit
WISP: is the light phenomenon traditionally ascribed to ghosts
26. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
WispEngine API
o WispEngine public APIs
o dispatch, run a task in a coroutine in same thread
o execute, allows you to execute a task in a coroutine in different
thread.
o WispThreadExecutor, which extends AbstractExecutorService,
which allows user to submit tasks into executor, which run in
coroutine
28. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Approach for synchronization
Modify synchronization in HotSpot, support
coroutine scheduling at all ‘blocked’ places.
Fast Lock
Determining lock ownership by address on
stack, natural support due to the fact: stack is
allocated per coroutine.
Biased lock
Use –XX:- UseBiasedLocking to work around
Inflated Lock(Very complex case)
Create “WispThread” instance per coroutine,
which extends from JavaThread
Use “WispThread” instance as current
thread(Self) for monitor manipulation.
coroutine A
monitor_enter
yeildTo
Object.wait
waiter list
monitor_enter
Object.notify
unpark
coroutine B
monitor_exit
monitor_exit
yeildTo
2
5
6
Main Thread
1
dispatch
3
dispatch4
Main Thread
yeildTo
7
29. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Performance Results
• Used in real world Alibaba
application, Carts, which is an e-
commerce application for shopping
carts.
• Coroutine is used to handle all
network IO requests, running each
thread on each logical processor.
• Reduced CPU usage from 60% to
54% (~10% saving)while serving
the same number of requests.
30. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Revisit Wisp Implementation
o We add coroutine scheduling logic to almost all blocked places in JDK
J.U.C stuff (Future, J.U.C lock, etc.)
Java network IO(java.net, java.nio)
Java primitive lock
• Synchronized(monitor enter/exit)
• Object.wait
• Object.notify/Object.notifyAll
o Java disk io(java.io) is not supported yet
o Native code in JNI is not supported so far
32. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Java Warmup Issue
CPU
Time
o High CPU consumption caused by
intensive methods compilation during
application warmup
• TieredCompilation can alleviate it, but
can not completely resolve it.
o Problems occurred when lots of user
requests come in after application online
• Much longer response time , hence
lots of ‘timeout’ errors.
Warmup
100%
33. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
JWarmUp: eliminate JVM warm-up
Server
user requests
class
profiling data
loading
record
Server
compile
beta production
classes
Before JWarmup, our application owners
usually use ‘mock’ data to warm-up JVM to
let JIT optimize before actual requests come
in.
‘JWarmup’ used to obviate the need for
"warming-up" by:
• Record the profiling data (in beta
testing)
• Let JIT compile code based on
recorded profiling data before requests
come in (in online run)
user requests
34. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
‘Under the hook view 'of JWarmup
o In recording phase
Record class initialization info
Record method compilation info
Flush them onto disk as record file after enough time
o In compiling phase
Eagerly load classes recorded in previous run
Eagerly initialize loaded classes
• Tricky case: <clinit> MUST be executed in determined order
• Initializing them only through constant pool might be problem
Submit for compilation
Bar.<clinit> is dependent on the
execution of Foo.test();
35. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Results from Alibaba Online Application(UMP)
(1) Warmup compiles code in advance before requests come in
(2) At the point when real user requests come in.
(3) Most of hot methods are completely compiled by TieredCompilation
(1)
(2)
(3)
UMP: ecommerce application for complex discounting calculation
36. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Other considerations
• Don’t record dynamically generated classes
• Generated by groovy, classes themselves might be changed between runs due to
the change of rules.
• Generated by Java Reflection/Proxy because class names get changed in next run.
• JWarmup compiles methods without ‘MethodData ’ information, because we see
unexpected de-optimization when real requests come in.
• Disable ‘Null check elimination’ optimization for avoiding unexpected de-optimization.
• Not compatible with –XX:+ TieredCompilation, consider it in next step.
38. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Support for Diagnostics and Troubleshooting
o Modify OpenJDK for better profiling and debugging
Method tracing
Heap profiling
Debug on late attach
• …
o Built upon that, we have our own internal tools:
ZProfiler: problem and performance diagnostic tool
• GC log analyzing
• Low-overhead hot method profiling
• Method tracing
ZDebugger: browser-based debugging tool
• Designed for debugging Alibaba cloud application
39. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
On demand debugging
The current implementation of HotSpot doesn’t
support the late attach for jdwp agent:
https://bugs.openjdk.java.net/browse/JDK-4841257
Why we need DOLA(Debug-on-Late-Attach)?
• In our production environment, for performance consideration, we don’t start
JVM with -agentlib:jdwp option
• It is hard to reproduce the problem if restarting the JVM after the problem
occurred
40. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
How to use it
• java -XX:+DebugOnLateAttach HelloWorld
• jcmd [pid] AgentLib.load agent.name=jdwp
agent.options=transport=dt_socket,server=y,suspend=n,address=2012
DOLA is implemented in Alibaba JDK as an experimental feature, only supports
breakpoint set and variable inspections, but they are sufficient for our use!
41. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Implementation of DOLA
• Modify the interpreter of HotSpot to support breakpoint post in bytecode patch path
• Modify management capacities of JVMTi to add:
• can_access_local_variables
• can_generate_breakpoint_events
• Add the Agent_OnAttach functionality to jdwp agent
• Extend the jcmd to support agent loading (usability improvements)
43. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
GC & Further Performance Optimization
o Garbage Collector
Real time processing for big data, requires minimal STW
‘Generation hypothesis” doesn’t fit for LRU, widely used in Alibaba applications
o Performance Optimization
Object serialization/de-serialization
• Hessian, Kryo are better, but still not enough.
44. 系统软件事业部:打造具备全球竞争力、效率最优的系统软件
Recap
• What we have mainly covered in this talk:
• Container-based resource management technology built for high density
deployments
• Wisp coroutine wins the same performance benefits as async, while
keeping a simple programming(sync) for developers.
• JWarump is used to obviate the need for "warming-up"