Mobile devices, such as smartphones and tablets, are becoming the de facto everyday computing and communication devices, and virtualization can bring them additional benefits in both security and manageability. An IT department may use a hypervisor as a highly secure way to manage authorized mobile devices, for example for network traffic monitoring, filtering, virus scanning, and/or OS updating/patching, even when the guest OS is completely dead. We insert Xen beneath the mobile OS Android to deprivilege Android as a guest for security and manageability purposes. However, the usage model of a mobile device is quite different from that of a server: for example, mobile devices run completely different benchmarks (mostly multimedia focused) from servers (mostly responsiveness focused). We analyze the gaps of Xen as a mobile hypervisor and present how we improved its performance.
4. Overview
• Back at Xen Summit 2011 in Seoul…
“Mobile virtualization will be more important…Xen has unique advantages there”
- <<Mobile Virtualization using the Xen Technologies>>, Jun Nakajima, Intel.
Jun proposed a Xen-based Android system:
5. Overview (continued)
• New use case: Android in Dom0, hypervisor as TEE
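[Figure: architecture stack. Dom0 contains Android userland (ring 3) with Gallery, VideoPlayer, Browser, …; the Android framework with Surface Manager, OpenGLES, Dalvik, …; and the Android Kernel (ring 1) with GFX, Video, PM, …. Beneath Dom0 sits Xen (ring 0), acting as the TEE (Trusted Execution Engine), with Virtual CPU, Virtual MMU, Virtual IRQ, ….]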
But we don’t want to sacrifice performance and power too much
6. Design Details
• Android runs almost natively
Component | Virtualization | Performance
I/O | Pass-through to Android | Close to native performance
CPU | vCPUs pinned to physical CPUs | Eliminates the vCPU scheduling penalty
MMU | Para-virtualized | Good run time performance
IRQ | Xen owns, dispatches to Android via event channel | Main overhead: ring switch (for example, Quadrant I/O: 21% downgrade)
FPU | Para-virtualized | No vCPU scheduling, very good performance
CpuIdle | Pass-through to Android | Completely consistent with Android PM
CpuFreq | Pass-through to Android | Same as above
Standby (S3) | Pass-through to Android | Same as above
Standby (S3) is a little bit tricky…
7. Design Details (continued)
Re-design S3
• Dom0 owns the full suspend/resume logic.
• Xen assists Dom0 to issue the real monitor/mwait.
• 2X faster than native for S3 resume.
[Figure: S3 suspend/resume time line across CPU0, CPU1, CPU2 and CPU3. Labeled steps: HYPERVISOR_vcpu_op(VCPUOP_down), HYPERVISOR_do_mwait_suspend(), do_mwait_suspend(), mwait, sleep, wake up, HYPERVISOR_vcpu_op(VCPUOP_up).]
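The re-designed S3 path can be illustrated with a small trace model. This is only a sketch of one plausible reading of the time line: the step names are the labels from the slide, but the exact ordering, the function name s3_cycle and the cpu= argument notation are assumptions made for illustration, not the actual Dom0 or Xen code.

```python
def s3_cycle(num_cpus=4):
    """Return (cpu, action) steps for one suspend/resume cycle.

    Assumed ordering (not confirmed by the slides): Dom0 on CPU0 takes the
    secondary vCPUs down, each parked CPU sits in mwait, CPU0 then asks Xen
    to issue the real mwait, and after wake up CPU0 brings the others back.
    """
    steps = []
    secondaries = range(1, num_cpus)
    for cpu in secondaries:
        # Dom0 drives the suspend logic; Xen only assists.
        steps.append((0, f"HYPERVISOR_vcpu_op(VCPUOP_down, cpu={cpu})"))
        steps.append((cpu, "mwait"))
    steps.append((0, "HYPERVISOR_do_mwait_suspend()"))
    steps.append((0, "mwait"))
    steps.append((0, "wake up"))
    for cpu in secondaries:
        steps.append((0, f"HYPERVISOR_vcpu_op(VCPUOP_up, cpu={cpu})"))
    return steps


if __name__ == "__main__":
    for cpu, action in s3_cycle():
        print(f"CPU{cpu}: {action}")
```

The point of the model is only that every decision is taken by Dom0 on CPU0, while Xen's role is limited to executing the hypercalls it is handed.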
8. Preliminary Power (normalized)
• > 90% of benchmarks reach 95% of native power
[Figure: Power KPIs, bar chart of power normalized to native; y-axis from 80% to 105%.]
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests,
such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any
change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully
evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance
But we still identified several gaps…
9. Preliminary Performance (normalized)
• > 90% of benchmarks reach 97% of native performance
[Figure: Performance KPIs, bar chart of performance normalized to native; y-axis from 0% to 120%. Benchmarks shown: USB MTP write…, USB MTP read large…, CF-Bench (malloc), WLAN download, 3G HSDPA download, H.264 video record, H.264/MPEG-4 AVC…, ColdBoot time to…, GLBenchmark 2.5.1… (two entries), Quadrant IO, Quadrant3D, Quadrant2D, Smartbench 2012, BaseMarkES2v1…, BaseMarkES2v1 Taiji, FishIE Tank -200M, Octane, Browsermark, EEMBC BrowsingBench, Sunspider, AnTuTu 2.9.4 CPU Int, Micro Benchmark… (two entries), iSPEC00 - speed, CaffeineMark, Dhrystone - BENC, CoreMark, EEMBC.]
But we still identified several gaps and need some tools to help us…
10. Tools Enabling
Enabled a lot of tools for performance tuning
• VTune
− Based on PMU, mainly used to tune Dom0
• Xentrace
− Based on original Xentrace, but revised to count key events and hypercalls
• Perf
− Based on PMU, mainly used to tune Dom0
• Xenoprofiler
− Based on PMU, mainly used to tune Xen
These tools proved very helpful in the later performance and power tuning
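Counting key events and hypercalls, as the revised Xentrace does, amounts to a simple aggregation. The sketch below shows the idea on made-up records; the (type, name, cost) record layout, the summarize function and the sample values are hypothetical and are not the real Xentrace output format.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate (type, name, cost) records into count, cost and cost share."""
    stats = defaultdict(lambda: [0, 0])  # (type, name) -> [count, cost]
    for kind, name, cost in records:
        entry = stats[(kind, name)]
        entry[0] += 1
        entry[1] += cost
    total = sum(cost for _, cost in stats.values())
    rows = [
        (kind, name, count, cost, cost / total if total else 0.0)
        for (kind, name), (count, cost) in stats.items()
    ]
    # Most expensive entries first, as in a typical profile report.
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    sample = [
        ("hcall", "mmu_update", 100),
        ("hcall", "mmu_update", 300),
        ("event", "IRQ", 100),
    ]
    for kind, name, count, cost, share in summarize(sample):
        print(f"{kind:5} {name:12} {count:5} {cost:8} {share:7.2%}")
```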
11. Case #1: Quadrant I/O (perf)
Gap: 21%
• Analysis:
Storage data is cached in the page cache, which is allocated from high memory. Each page cache access needs a kmap/kunmap pair, which leads to a lot of PVMMU hypercalls
• Optimizations:
− Shrink the Xen memory footprint from 168M to 72M
− Force the page cache to be allocated from low memory
• Gap reduced to 8.5%
Can we continue to optimize and close that gap of 8.5%?
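The effect of moving the page cache to low memory can be illustrated with a toy model. The assumptions are loud ones: every kmap and every kunmap of a high-memory page is counted as one PVMMU hypercall, low-memory pages are treated as permanently mapped and therefore free, and the page numbers and access pattern are invented. Only the 21% and 8.5% gap figures come from the slide.

```python
def mapping_hypercalls(accesses, highmem_pages):
    """Count kmap/kunmap hypercalls for a sequence of page cache accesses.

    Toy assumption: one hypercall per kmap and one per kunmap for a
    high-memory page, none for a (permanently mapped) low-memory page.
    """
    return sum(2 for page in accesses if page in highmem_pages)


# Hypothetical workload: 1000 accesses spread over 64 cached pages.
accesses = [n % 64 for n in range(1000)]
before = mapping_hypercalls(accesses, highmem_pages=set(range(64)))
after = mapping_hypercalls(accesses, highmem_pages=set())

# Share of the original gap that the two optimizations closed.
gap_before, gap_after = 21.0, 8.5
closed = (gap_before - gap_after) / gap_before

if __name__ == "__main__":
    print(f"hypercalls before: {before}, after: {after}")
    print(f"gap closed: {closed:.1%}")
```

In this model the mapping hypercalls disappear entirely, whereas the measured gap only dropped from 21% to 8.5%, which suggests that other sources of overhead remain.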
12. Case #1: Quadrant I/O (perf) (continued)
Profiled by VTune
Among the 8.5%: Xen overhead = 134/3138 ~= 4.27%
Xen traces
type | name | count | cost | cost%
hcall | mwait_idle_op | 3759 | 37142118744 | -
hcall | multicall | 12147 | 145492506 | 32.12%
hcall | mmu_update | 27126 | 113270256 | 25.00%
hcall | mmuext_op | 7781 | 50615724 | 11.17%
hcall | vcpu_op | 6577 | 39658986 | 8.75%
hcall | event_channel_op | 3405 | 26617650 | 5.88%
hcall | xen_version | 4937 | 12374700 | 2.73%
event | PAGE_FAULT | 9764 | 11719224 | 2.59%
event | IRQ | 1119 | 10178934 | 2.25%
hcall | event_channel_op | 1259 | 9081834 | 2.00%
hcall | physdev_op | 1692 | 8251512 | 1.82%
hcall | event_channel_op | 840 | 7024398 | 1.55%
hcall | event_channel_op | 761 | 6150300 | 1.36%
event | TIMER_IRQ | 472 | 5745738 | 1.27%
hcall | event_channel_op | 545 | 4361118 | 0.96%
event | TRAP | 1038 | 1040916 | 0.23%
event | PRIVOP | 1032 | 872700 | 0.19%
hcall | fpu_taskswitch | 1038 | 439638 | 0.10%
hcall | undefined | 21 | 102672 | 0.02%
hcall | apic_op | 3 | 5484 | 0.00%
total cost: 453004290 (sum of all rows except mwait_idle_op)
Among the 4.27%: PVMMU overhead ~= 70.88%
Hard to further close the gap of 8.5% due to PVMMU overhead
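The percentages on this slide can be re-derived from the trace costs. In the sketch below the cost values are copied from the Xen trace data above, with mwait_idle_op left out because the stated total does not include it. Treating multicall, mmu_update, mmuext_op and PAGE_FAULT as the PVMMU group is an assumption; it is the grouping that reproduces the 70.88% figure. The final product, PVMMU as a share of total run time, is a derived estimate and not a number from the slides.

```python
# (name, cost) pairs from the Xen trace data; mwait_idle_op is omitted
# because the stated total cost does not include it.
COSTS = [
    ("multicall", 145492506), ("mmu_update", 113270256),
    ("mmuext_op", 50615724), ("vcpu_op", 39658986),
    ("event_channel_op", 26617650), ("xen_version", 12374700),
    ("PAGE_FAULT", 11719224), ("IRQ", 10178934),
    ("event_channel_op", 9081834), ("physdev_op", 8251512),
    ("event_channel_op", 7024398), ("event_channel_op", 6150300),
    ("TIMER_IRQ", 5745738), ("event_channel_op", 4361118),
    ("TRAP", 1040916), ("PRIVOP", 872700),
    ("fpu_taskswitch", 439638), ("undefined", 102672),
    ("apic_op", 5484),
]

# Assumed PVMMU grouping (not stated on the slide).
PVMMU = {"multicall", "mmu_update", "mmuext_op", "PAGE_FAULT"}

total = sum(cost for _, cost in COSTS)
pvmmu_share = sum(cost for name, cost in COSTS if name in PVMMU) / total
xen_overhead = 134 / 3138  # VTune figure quoted on the slide
pvmmu_of_runtime = xen_overhead * pvmmu_share

if __name__ == "__main__":
    print(f"total cost: {total}")
    print(f"PVMMU share of Xen cost: {pvmmu_share:.2%}")
    print(f"Xen overhead: {xen_overhead:.2%}")
    print(f"PVMMU share of run time: {pvmmu_of_runtime:.2%}")
```

Under these assumptions PVMMU work alone accounts for roughly 3% of total run time, which is consistent with the conclusion that the remaining gap is hard to close while Dom0 stays para-virtualized.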
14. Other Gaps
Other cases show similar Xen overheads:
• PVMMU
• TLS/stack switching
Some cases could be optimized by reducing the number of hypercalls through guest optimization
• For example, Quadrant I/O
However, some cases are hard to optimize due to PV overhead
• For example, CF-Bench malloc
Could be fixed by HVM Dom0
15. Summary
• Dom0 Android achieved near-native power and performance
• Still found some power and performance gaps caused by PVOPS
− PVMMU
− TLS/Stack switch
• Those gaps could be fixed by HVM Dom0
16. Q&A
• Questions?
• or contact
Jack Ren <jack.ren@intel.com>
Xiantao Zhang <xiantao.zhang@intel.com>