XPDS13: Performance Optimization on Xen-based Android Device - Jack Ren, Intel and Xiantao Zhang, Intel

Performance Optimization on Xen-based Android Device
Jack Ren/Xiantao Zhang/Dongxiao Xu
Key contributor: Eddie Dong
Intel Corporation

Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO
LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF
INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN
MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• All products, dates, and figures specified are preliminary based on current expectations, and are subject to
change without notice.
• Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which
may cause the product to deviate from published specifications. Current characterized errata are available on
request.
• Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the
United States and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright © 2012 Intel Corporation.

Agenda
• Overview
• Design Details
• Gaps, Analysis & Optimizations
• Summary


Overview
• Back at Xen Summit 2011 in Seoul:
“Mobile virtualization will be more important… Xen has unique advantages there”
- "Mobile Virtualization using the Xen Technologies", Jun Nakajima, Intel.
Jun also proposed a Xen-based Android system:

Overview (continued)
• New use case: Android in Dom0, hypervisor as TEE
(TEE: Trusted Execution Engine)

[Figure: Dom0 software stack, from top to bottom]
Dom0:
  Android userland (ring 3)
    Applications: Gallery, VideoPlayer, Browser, …
    Android framework: Surface Manager, OpenGLES, Dalvik, …
  Android Kernel (ring 1): GFX, Video, PM, …
Xen (ring 0): Virtual CPU, Virtual MMU, Virtual IRQ, …

But we don’t want to sacrifice performance and power too much

Design Details
• Android runs almost natively

Component     Virtualization                                     Performance
I/O           Pass-through to Android                            Close to native performance
CPU           vCPUs pinned to physical CPUs                      Eliminates the vCPU scheduling penalty
MMU           Para-virtualized                                   Good run-time performance
IRQ           Xen owns, dispatches to Android via event channel  Main overhead: ring switch
                                                                 − For example, Quadrant I/O: 21% degradation
FPU           Para-virtualized                                   No vCPU scheduling, very good performance
CpuIdle       Pass-through to Android                            Completely consistent with Android PM
CpuFreq       Pass-through to Android                            Same as above
Standby (S3)  Pass-through to Android                            Same as above
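The CPU row relies on a 1:1 mapping between vCPUs and physical CPUs. The slides do not say how the pinning was configured, so the following is only a sketch: it builds `xl vcpu-pin` command lines (the Xen toolstack command for setting vCPU affinity) that pin vCPU n to physical CPU n. The domain name `Domain-0` and the CPU count of 4 are assumptions for illustration; on a real device the same effect may instead come from the `dom0_vcpus_pin` hypervisor boot option.

```python
def pin_commands(domain, num_vcpus):
    """Build one `xl vcpu-pin` command per vCPU, pinning vCPU n to physical CPU n."""
    return [f"xl vcpu-pin {domain} {vcpu} {vcpu}" for vcpu in range(num_vcpus)]


# Assumed values: Dom0 is named "Domain-0" and the device has 4 CPUs.
for command in pin_commands("Domain-0", 4):
    print(command)
```

The script only prints the commands; it does not run them, so it is safe to try on a machine without Xen installed.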

Standby (S3) is a little bit tricky…

Design Details (continued)
Re-design S3
• Dom0 owns the full suspend/resume logic.

• Xen assists Dom0 in issuing the real monitor/mwait.
• S3 resume is 2x faster than native.

[Figure: timeline of S3 suspend and resume across CPU0, CPU1, CPU2 and CPU3. Labels on the timeline: sleep, HYPERVISOR_vcpu_op(VCPUOP_down), HYPERVISOR_do_mwait_suspend(), do_mwait_suspend(), mwait, wake up, HYPERVISOR_vcpu_op(VCPUOP_up).]

Preliminary Power (normalized)
• > 90% of benchmarks reach 95% of native power
[Chart: Power KPIs, normalized to native; y-axis from 80% to 105%]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests,
such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any
change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully
evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance

But we still identified several gaps…

Preliminary Performance (normalized)
• > 90% of benchmarks reach 97% of native performance
[Chart: Performance KPIs, normalized to native; y-axis from 0% to 120%]
Benchmarks shown: USB MTP write…, USB MTP read large…, CF-Bench (malloc), WLAN download, 3G HSDPA download, H.264 video record, H.264/MPEG-4 AVC…, ColdBoot time to…, GLBenchmark 2.5.1…, GLBenchmark 2.5.1…, Quadrant IO, Quadrant 3D, Quadrant 2D, SmartBench 2012, BaseMarkES2v1…, BaseMarkES2v1 Taiji, FishIE Tank -200M, Octane, Browsermark, EEMBC BrowsingBench, Sunspider, AnTuTu 2.9.4 CPU Int, Micro Benchmark…, iSPEC00 - speed, CaffeineMark, Dhrystone - BENC, CoreMark, EEMBC, Micro Benchmark…
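The headline "> 90% of benchmarks reach 97% of native" is a statement about normalized scores. As a minimal sketch of how such a figure can be derived, the snippet below divides each Xen score by the native score and counts how many benchmarks meet a threshold. The benchmark names and numbers are made up for illustration and are NOT results from the talk; the function name `fraction_reaching` is also hypothetical. It assumes higher scores are better, so time-based metrics (such as a boot time) would need the ratio inverted.

```python
def fraction_reaching(scores, threshold):
    """Return per-benchmark normalized scores and the fraction meeting the threshold.

    scores maps a benchmark name to (native_score, xen_score), higher is better.
    """
    normalized = {name: xen / native for name, (native, xen) in scores.items()}
    reached = [name for name, ratio in normalized.items() if ratio >= threshold]
    return normalized, len(reached) / len(normalized)


# Hypothetical scores, not measurements from the slides.
scores = {
    "bench_a": (1000, 990),
    "bench_b": (200, 196),
    "bench_c": (50, 40),
    "bench_d": (80, 80),
}
normalized, fraction = fraction_reaching(scores, 0.97)
print(f"{fraction:.0%} of benchmarks reach 97% of native")
```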


But we still identified several gaps and need some tools to help us…

Tools Enabling
We enabled several tools for performance tuning:
• VTune
− Based on PMU, mainly used to tune Dom0

• Xentrace
− Based on the original Xentrace, but revised to count key events and hypercalls

• Perf
− Based on PMU, mainly used to tune Dom0

• Xenoprof
− Based on PMU, mainly used to tune Xen

These tools proved very helpful in the later performance and power tuning.
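The revised Xentrace is described as counting key events and hypercalls. The slides do not show its record format or implementation, so the following is only a sketch of the aggregation idea: it takes already-decoded records of the form (type, name, cost) and accumulates a count and a total cost per entry. The record layout, the `summarize` name and the sample values are all assumptions for illustration; real xentrace output is a binary stream that would need decoding first.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate (type, name, cost) records into a count and total cost per (type, name)."""
    summary = defaultdict(lambda: {"count": 0, "cost": 0})
    for kind, name, cost in records:
        entry = summary[(kind, name)]
        entry["count"] += 1
        entry["cost"] += cost
    return dict(summary)


# Made-up records for illustration only; not measurements from the talk.
sample = [
    ("hcall", "mmu_update", 4000),
    ("hcall", "mmu_update", 4200),
    ("event", "IRQ", 9000),
]
result = summarize(sample)
for (kind, name), entry in result.items():
    print(kind, name, entry["count"], entry["cost"])
```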

Case #1: Quadrant I/O (perf)
Gap: 21%
• Analysis:
Storage data is cached in the page cache, which is allocated from
high memory. Each page cache access needs a kmap/kunmap pair, which
leads to a lot of PVMMU hypercalls.

• Optimizations:
− Shrink the Xen memory footprint from 168M to 72M
− Force the page cache to be allocated from low memory

• Gap reduced to 8.5%
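To put the two optimizations in proportion, the arithmetic behind the figures above can be worked through directly. This is just a restatement of the slide's numbers; it assumes "168M" and "72M" mean megabytes.

```python
# Figures quoted on the slide.
gap_before = 21.0        # percent below native before the optimizations
gap_after = 8.5          # percent below native after the optimizations
xen_mem_before_mb = 168  # assumed to be megabytes
xen_mem_after_mb = 72

gap_closed_points = gap_before - gap_after
gap_closed_share = gap_closed_points / gap_before
mem_freed_mb = xen_mem_before_mb - xen_mem_after_mb

print(f"gap closed: {gap_closed_points:.1f} points ({gap_closed_share:.1%} of the original gap)")
print(f"Xen memory freed: {mem_freed_mb} MB")
```

In other words, the two changes removed a little under 60% of the original gap and freed 96 MB of Xen memory.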

Can we continue to optimize and close that gap of 8.5%?

Case #1: Quadrant I/O (perf) (continued)
Profiled by VTune
Among the 8.5%:
Xen overhead = 134/3138 ≈ 4.27%

Xen traces
type   name              count   cost          cost%
hcall  mwait_idle_op     3759    37142118744
hcall  multicall         12147   145492506     32.12%
hcall  mmu_update        27126   113270256     25.00%
hcall  mmuext_op         7781    50615724      11.17%
hcall  vcpu_op           6577    39658986      8.75%
hcall  event_channel_op  3405    26617650      5.88%
hcall  xen_version       4937    12374700      2.73%
event  PAGE_FAULT        9764    11719224      2.59%
event  IRQ               1119    10178934      2.25%
hcall  event_channel_op  1259    9081834       2.00%
hcall  physdev_op        1692    8251512       1.82%
hcall  event_channel_op  840     7024398       1.55%
hcall  event_channel_op  761     6150300       1.36%
event  TIMER_IRQ         472     5745738       1.27%
hcall  event_channel_op  545     4361118       0.96%
event  TRAP              1038    1040916       0.23%
event  PRIVOP            1032    872700        0.19%
hcall  fpu_taskswitch    1038    439638        0.10%
hcall  undefined         21      102672        0.02%
hcall  apic_op           3       5484          0.00%
total cost (excluding mwait_idle_op):  453004290

Among the 4.27%:
PVMMU overhead ≈ 70.88%

The 8.5% gap is hard to close further because of the PVMMU overhead
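The 70.88% figure can be reproduced from the Case #1 trace costs. The sketch below sums the costs of the entries that the Case #2 slide groups as PVMMU (PAGE_FAULT, mmu_update, multicall, mmuext_op) and divides by the total; applying the same grouping to Case #1 is an assumption, although it does reproduce the slide's number. mwait_idle_op is left out because the slide's total excludes it. Multiplying by the Xen overhead is a further step not shown on the slide, and it treats 134/3138 as a fraction of the whole run.

```python
# (type, name, cost) rows copied from the Case #1 Xen trace table.
# mwait_idle_op is omitted because the slide's total cost excludes it.
TRACE = [
    ("hcall", "multicall", 145492506),
    ("hcall", "mmu_update", 113270256),
    ("hcall", "mmuext_op", 50615724),
    ("hcall", "vcpu_op", 39658986),
    ("hcall", "event_channel_op", 26617650),
    ("hcall", "xen_version", 12374700),
    ("event", "PAGE_FAULT", 11719224),
    ("event", "IRQ", 10178934),
    ("hcall", "event_channel_op", 9081834),
    ("hcall", "physdev_op", 8251512),
    ("hcall", "event_channel_op", 7024398),
    ("hcall", "event_channel_op", 6150300),
    ("event", "TIMER_IRQ", 5745738),
    ("hcall", "event_channel_op", 4361118),
    ("event", "TRAP", 1040916),
    ("event", "PRIVOP", 872700),
    ("hcall", "fpu_taskswitch", 439638),
    ("hcall", "undefined", 102672),
    ("hcall", "apic_op", 5484),
]

# Grouping taken from the Case #2 slide; assumed to apply to Case #1 as well.
PVMMU_NAMES = {"PAGE_FAULT", "mmu_update", "multicall", "mmuext_op"}

total_cost = sum(cost for _, _, cost in TRACE)
pvmmu_cost = sum(cost for _, name, cost in TRACE if name in PVMMU_NAMES)
pvmmu_share = pvmmu_cost / total_cost

xen_overhead = 134 / 3138                  # Xen share reported by the profiler
pvmmu_of_run = xen_overhead * pvmmu_share  # PVMMU share of the whole run (derived)

print(f"total cost:     {total_cost}")
print(f"PVMMU share:    {pvmmu_share:.2%}")
print(f"Xen overhead:   {xen_overhead:.2%}")
print(f"PVMMU of run:   {pvmmu_of_run:.2%}")
```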

Case #2: Home Screen Scroll (power)
Gap: 1.2%
Profiled by VTune

Xen overhead = 30/3176 ≈ 1%

Xen traces
type   name              count   cost          cost%
event  IRQ               1843    18323532      7.04%
event  TRAP              88      131352        0.05%
event  PAGE_FAULT        943     3237852       1.24%
event  PRIVOP            1385    533748        0.21%
event  TIMER_IRQ         144     2062704       0.79%
hcall  mmu_update        990     8866296       3.41%
hcall  fpu_taskswitch    95      66816         0.03%
hcall  multicall         8736    109199952     41.96%
hcall  xen_version       3914    10860348      4.17%
hcall  vcpu_op           9694    55009236      21.13%
hcall  mmuext_op         3858    34409052      13.22%
hcall  event_channel_op  1188    10105920      3.88%
hcall  physdev_op        1078    7469256       2.87%
hcall  mwait_idle_op     3938    23493503868
total cost (excluding mwait_idle_op):  260276064
cost of PAGE_FAULT, mmu_update, multicall, mmuext_op:  155713152 (59.83%)

Among the 1%:
PVMMU overhead = 59.83%

PVMMU overhead again…
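Beyond the totals, the Case #2 trace also allows a rough per-invocation view: dividing each entry's cost by its count gives an average cost per call. This is a derived figure, not something stated on the slide, and the unit of "cost" is not given in the source, so the numbers are only meaningful relative to each other. mwait_idle_op is left out because it is idle time and the slide excludes it from the total.

```python
# (name, count, cost) rows copied from the Case #2 Xen trace table.
ROWS = [
    ("IRQ", 1843, 18323532),
    ("TRAP", 88, 131352),
    ("PAGE_FAULT", 943, 3237852),
    ("PRIVOP", 1385, 533748),
    ("TIMER_IRQ", 144, 2062704),
    ("mmu_update", 990, 8866296),
    ("fpu_taskswitch", 95, 66816),
    ("multicall", 8736, 109199952),
    ("xen_version", 3914, 10860348),
    ("vcpu_op", 9694, 55009236),
    ("mmuext_op", 3858, 34409052),
    ("event_channel_op", 1188, 10105920),
    ("physdev_op", 1078, 7469256),
]


def average_cost(rows):
    """Return (name, cost per call) pairs, most expensive per call first."""
    averages = [(name, cost / count) for name, count, cost in rows]
    return sorted(averages, key=lambda item: item[1], reverse=True)


ranking = average_cost(ROWS)
for name, avg in ranking:
    print(f"{name:<18}{avg:>10.0f}")
```

Read this way, multicall stands out as both frequent and expensive per call, which is consistent with the slide's conclusion that PVMMU dominates the Xen overhead.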

Other Gaps
Other cases have similar Xen overheads:
• PVMMU
• TLS/stack switching
Some cases could be optimized by reducing the number of hypercalls
through guest optimization
• For example, Quadrant I/O
However, some cases could be hard to optimize due to PV overhead
• For example, CF-Bench malloc

Could be fixed by HVM Dom0

Summary
• Dom0 Android achieved near-native power and performance
• Still found some power and performance gaps caused by PVOPS
− PVMMU
− TLS/Stack switch

• Those gaps could be fixed by HVM Dom0

Q&A
• Questions?
• or contact
Jack Ren <jack.ren@intel.com>
Xiantao Zhang <xiantao.zhang@intel.com>
