Optimizing JVM at Alibaba for e-commerce apps running on 100,000+ servers

系统软件事业部：打造具备全球竞争力、效率最优的系统软件
Optimizing JVM at Alibaba
for e-commerce applications running on 100,000+ servers
Sanhong Li
JVM Lead at Alibaba

Agenda
 How Java is used at Alibaba
 Optimizing JVM for our needs
 Diagnostics and Troubleshooting
 Recap
 Q&A

Java at Alibaba
Single VS multiple instance deployment
AliOS(xa64, Linux)
AJDK
CaiNiao
Middleware
C2C, B2C,
B2B
Alipay OthersAliYun
Alibaba/Alipay JDK, based on OpenJDK
 All the Java applications developed in Alibaba and
affiliated companies, such like Alipay, CaiNiao are running
on AJDK
• Service oriented architecture
• Services communicate with each other via RPC
• Typically, many JVMs as one cluster per service
• Heterogeneous
• communicate between the native libraries written
in C/C++ and Java via JNI
 Running Java at massive scale
• Million instances of JVMs, serving a very large
number of requests every second

China's Singles' Day
the world's biggest online shopping day
3200
14,000
42,000
80,000
140,000
175,000
1200 3,850
15,000
38,000
86,000
120,000
0
20000
40000
60000
80000
100000
120000
140000
160000
180000
200000
2011 2012 2013 2014 2015 2016
Transantions per Second(at peek)
Alibaba cloud platform Alibaba payment service
• On Singles Day in 2016, peak transactions/sec reached 175k, and peak payment orders processed/sec
reached 120k
• In contrast, Visa only processed a peak volume of 56K messages per second(source:
www.digitaltransactions.net)

Containers inside JVM
run multiple apps in same instance safely#1

Single vs multiple instance deployment
Single VS multiple instance deployment
• JavaEE containers such as Apache Tomcat support deploying multiple web
applications into the same container
• We don’t do this in real production environment , one of major reasons is the absence of
resource consumption control
JavaEE container
App App1
App2
Java VM
JavaEE container
Java VM
Appx…

Building containers inside JVM
 Create “tenant” resource container per
application unit, referred as “tenant”.
 The application unit may be mapped
as a single thread, a thread group, a
war or even an OSGi bundle.
 The resources consumed by an
application unit are accounted for per
tenant
AJDK
Java module Java module Java module
Memory CPU I/O
Tenant Tenant Tenant
OS

Tenant Container API
• TenantConfiguration: maintains resource configuration information for
tenant
• TenantContainer: represents resource container, thread switches to the
context of tenant by calling TenantContainer.run()
CPU Usage Quota
Heap Usage Quota
All relevant resource allocations(CPU/Heap) in this tenant
context are accounted for by the tenant.

Thread and tenant
 Attach/detach
• Attached to tenant automatically after entering into TenantContainer.run()
• Detached after leaving from run block
 Inheritance
• Threads forked in TenantContainer.run() will be attached to parent thread’s
tenant automatically
 Limit
• Provide option to limit the maximum number of threads a tenant is allowed to
use
1|TenantContainer tenant = getTenant();
2|tenant.run(()->{
3| …… //switches to new tenant context
4|});

CPU Throttling per Tenant
 Control Group directly provided by Linux kernel
 Allocate resources for aggregated
processes(threads)
 Partition all processes and children into
groups
 Organize groups in hierarchies
 Associate groups with particular resources
 Manage resource among groups
 Delegate CPU management to Control
Group(cgroup)
 So far, we only implement it on X64 platform.
1:1

CPU Throttling per Tenant(2)
• The directory tree mirrors the control group
Process ID
Tenant ID
Threads run in tenant
CPU Usage Quota

Tenant Heap Management
• G1GC divides the entire heap into
equally-sized regions, well suited for
tenant management
• TAC(tenant allocation context) manages
the regions for tenant
• Code running in a tenant allocates
objects in a region it owns
• New regions can be requested up to
tenant maximum reservation
S O O E E S S O O E S E E O O
TAC1 TAC2 TAC3
Tenant1 Tenant2 Tenant3
T1 T2 T3
T6T4 T5
HeapRegion
TAC

Tenant Heap Management(2)
Eden Survivor FreeEden
Free
Free
Free
Tenant-2
OldSurvivor
Survivor
Survivor Free
Tenant-2
Tenant-1
Tenant-1
Before GC
After GC
• Mark and Copy: copy object from “from space” to “to space”
• Be sure to copy object into correct tenant region

Other Changes for Tenant Isolation
o TLAB(Thread Local Allocation Buffer)
 Option 1: Retire TLAB when thread switches into different tenant
• Easy to implement, but waste space on TLAB, might cause
frequent YGC
 Option 2: Tenant aware TLAB, maps TLAB into correct heap region
of tenant
o IHOP(Initiating Heap Occupancy Percent)
 Calculate IHOP per tenant
• Otherwise, may suffer FULL GC when only some tenant reaches
to limit
 Do we need real “GC per tenant” ??

Resource Usage Monitoring
o Monitor CPU/Heap usage for each of tenants in
runtime via JMX API
• CPU: cpuacct subsystem generates automatic
reports on CPU resources.
• HEAP: TAC helps us to track how many heap
regions are used for the tenant.

Where we use this?
 AliTomcat, use it to manage shared resource when multiple WAR applications are
deployed in same instance.
 CloudEngine, built on OSGi technology, widely used in Alipay, use this for resource
management for OSGi bundle.
 Some ecommerce/information management applications, build resource management
capacity upon this technology.
• E.g. limit the CPU/HEAP usage of groovy script, which is mostly used as
externalized script for business rule definition.
 Haven’t yet implemented IO(network/disk) throttling for tenant, may consider it
in future plan.

Coroutine in Java
simplicity of synchronous, performance of asynchronous#2

Thread Context Switch Cost
• 4 IO threads, and 200
business logic threads
• Much more of CPU
time(sy) are consumed by
thread context switch,
~100k / s (on 4 core
machine)

Proactor Pattern in Java
 Asynchronous I/O API(.aka. NIO.2), introduced by JSR203
 Proactor pattern allows event-driven applications to efficiently demultiplex and dispatch
requests triggered by the completion of asynchronous IO
• Scales really well
• maximize throughput by avoiding unnecessary context switching
Completion Handler:
process completion events
processing of
requests proactively,
together with
completion
handlers

Drawbacks in asynchronous programming
o Complexity of programming
• Hard to program applications using asynchrony mechanisms due to the separation
between operation invocation and completion
• Poor coding readability, hard to understand
o Complexity of debugging and testing
• 'single stepping‘ within a debugger doesn’t work for callbacks on application-
specific handlers
o A lot of existing applications don’t work in this way
• RPC framework
• Third party libraries

Demo for Our ApproachDemo for Our Approach
Thread Dump
 Transparent to java developer.
 Coroutine used as likes Thread.
 Simplicity of synchronous, performance of
asynchronous
• Control transferring between coroutines,
transparently made by underlying implementation.
1
2
3
Connect Write
Accept Read
yieldexecution suspended

JVM Coroutines(MLVM implementation)
 Implemented as part of the
MLVM project
 Highlights:
 Separate stack is allocated
per coroutine
 Support symmetric coroutine
 yieldTo API is used to
transfer control to the next
coroutine

Java NIO Selector
 Selector: allows only single thread to be used for managing
multiple channels
• Monitoring events for multiple socket channels
• Recognizing when one or more become available for data
transfer
 Key idea : use Selector mechanism for coroutine scheduling for
blocked IO.

How Does It Work: IO Read Demystified
read(ByteBuffer bb) {
if ((n = ch.read(bb)) != 0)
return n;
do {
engine.registerEvent( ch,OP_READ);
engine.scheduleNext();
} while ((n = ch.read(bb)) == 0);
return n;
}
Register ‘read’ event with
scheduler(using selector)
Block current and yield to
another available co-routine.
Thread Scheduler1:1 1:N
Coroutine 1
Coroutine 2
Coroutine 3
…
Selector
• get notified when IO readable
or writable
IO
IO
IO
IO

How Does the Scheduler Work
o Wisp engine(Scheduler) is used to schedule next available coroutine, which is allocated per
thread.
o Interested event will be registered with Wisp engine when thread gets blocked
 IO(network r/w)
 Thread parking(e.g. Thread.sleep)
o Schedule next coroutine when:
 Interested events happen:
• IO completed(via Selector)
• Timeout(e.g. Thread.sleep)
• Thread unparking(e.g. Object.notify)
 New request arrives (newly submitted Runnable)
 Coroutine/thread exit
WISP: is the light phenomenon traditionally ascribed to ghosts

WispEngine API
o WispEngine public APIs
o dispatch, run a task in a coroutine in same thread
o execute, allows you to execute a task in a coroutine in different
thread.
o WispThreadExecutor, which extends AbstractExecutorService,
which allows user to submit tasks into executor, which run in
coroutine

Coroutine and Synchronization
coroutine A
coroutine B
1) Schedule and run test::foo in coroutine A
2) Current thread get blocked by wait();
3) No chance to schedule next coroutine B due to
the block in 2)
Thread Dump
1)
2)

Approach for synchronization
Modify synchronization in HotSpot, support
coroutine scheduling at all ‘blocked’ places.
 Fast Lock
 Determining lock ownership by address on
stack, natural support due to the fact: stack is
allocated per coroutine.
 Biased lock
 Use –XX:- UseBiasedLocking to work around
 Inflated Lock(Very complex case)
 Create “WispThread” instance per coroutine,
which extends from JavaThread
 Use “WispThread” instance as current
thread(Self) for monitor manipulation.
coroutine A
monitor_enter
yeildTo
Object.wait
waiter list
monitor_enter
Object.notify
unpark
coroutine B
monitor_exit
monitor_exit
yeildTo
2
5
6
Main Thread
1
dispatch
3
dispatch4
Main Thread
yeildTo
7

Performance Results
• Used in real world Alibaba
application, Carts, which is an e-
commerce application for shopping
carts.
• Coroutine is used to handle all
network IO requests, running each
thread on each logical processor.
• Reduced CPU usage from 60% to
54% (~10% saving)while serving
the same number of requests.

Revisit Wisp Implementation
o We add coroutine scheduling logic to almost all blocked places in JDK
 J.U.C stuff (Future, J.U.C lock, etc.)
 Java network IO(java.net, java.nio)
 Java primitive lock
• Synchronized(monitor enter/exit)
• Object.wait
• Object.notify/Object.notifyAll
o Java disk io(java.io) is not supported yet
o Native code in JNI is not supported so far

JWarmup
priming online application for speed#3

Java Warmup Issue
CPU
Time
o High CPU consumption caused by
intensive methods compilation during
application warmup
• TieredCompilation can alleviate it, but
can not completely resolve it.
o Problems occurred when lots of user
requests come in after application online
• Much longer response time , hence
lots of ‘timeout’ errors.
Warmup
100%

JWarmUp: eliminate JVM warm-up
Server
user requests
class
profiling data
loading
record
Server
compile
beta production
classes
 Before JWarmup, our application owners
usually use ‘mock’ data to warm-up JVM to
let JIT optimize before actual requests come
in.
 ‘JWarmup’ used to obviate the need for
"warming-up" by:
• Record the profiling data (in beta
testing)
• Let JIT compile code based on
recorded profiling data before requests
come in (in online run)
user requests

‘Under the hook view 'of JWarmup
o In recording phase
 Record class initialization info
 Record method compilation info
 Flush them onto disk as record file after enough time
o In compiling phase
 Eagerly load classes recorded in previous run
 Eagerly initialize loaded classes
• Tricky case: <clinit> MUST be executed in determined order
• Initializing them only through constant pool might be problem
 Submit for compilation
Bar.<clinit> is dependent on the
execution of Foo.test();

Results from Alibaba Online Application(UMP)
(1) Warmup compiles code in advance before requests come in
(2) At the point when real user requests come in.
(3) Most of hot methods are completely compiled by TieredCompilation
(1)
(2)
(3)
UMP: ecommerce application for complex discounting calculation

Other considerations
• Don’t record dynamically generated classes
• Generated by groovy, classes themselves might be changed between runs due to
the change of rules.
• Generated by Java Reflection/Proxy because class names get changed in next run.
• JWarmup compiles methods without ‘MethodData ’ information, because we see
unexpected de-optimization when real requests come in.
• Disable ‘Null check elimination’ optimization for avoiding unexpected de-optimization.
• Not compatible with –XX:+ TieredCompilation, consider it in next step.

Diagnostics and Troubleshooting
technical support for java developers#4

Support for Diagnostics and Troubleshooting
o Modify OpenJDK for better profiling and debugging
 Method tracing
 Heap profiling
 Debug on late attach
• …
o Built upon that, we have our own internal tools:
 ZProfiler: problem and performance diagnostic tool
• GC log analyzing
• Low-overhead hot method profiling
• Method tracing
 ZDebugger: browser-based debugging tool
• Designed for debugging Alibaba cloud application

On demand debugging
The current implementation of HotSpot doesn’t
support the late attach for jdwp agent:
https://bugs.openjdk.java.net/browse/JDK-4841257
Why we need DOLA(Debug-on-Late-Attach)?
• In our production environment, for performance consideration, we don’t start
JVM with -agentlib:jdwp option
• It is hard to reproduce the problem if restarting the JVM after the problem
occurred

How to use it
• java -XX:+DebugOnLateAttach HelloWorld
• jcmd [pid] AgentLib.load agent.name=jdwp
agent.options=transport=dt_socket,server=y,suspend=n,address=2012
DOLA is implemented in Alibaba JDK as an experimental feature, only supports
breakpoint set and variable inspections, but they are sufficient for our use!

Implementation of DOLA
• Modify the interpreter of HotSpot to support breakpoint post in bytecode patch path
• Modify management capacities of JVMTi to add:
• can_access_local_variables
• can_generate_breakpoint_events
• Add the Agent_OnAttach functionality to jdwp agent
• Extend the jcmd to support agent loading (usability improvements)

The last
We are still working hard on…#5

GC & Further Performance Optimization
o Garbage Collector
 Real time processing for big data, requires minimal STW
 ‘Generation hypothesis” doesn’t fit for LRU, widely used in Alibaba applications
o Performance Optimization
 Object serialization/de-serialization
• Hessian, Kryo are better, but still not enough.

Recap
• What we have mainly covered in this talk:
• Container-based resource management technology built for high density
deployments
• Wisp coroutine wins the same performance benefits as async, while
keeping a simple programming(sync) for developers.
• JWarump is used to obviate the need for "warming-up"

谢谢观看
thanks
Mail: sanhong.lsh@alibaba-inc.com
Twttier: @sanhong_li

Optimizing JVM at Alibaba for e-commerce apps running on 100,000+ servers

Recommended

Recommended

More Related Content

Recently uploaded

Recently uploaded (20)

Featured

Featured (20)

Optimizing JVM at Alibaba for e-commerce apps running on 100,000+ servers