Computer Measurement Group, India 1
Performance overheads of
Virtualization
Sandeep Joshi
Principal SDE, Storage Startup
26 April 2014
www.cmgindia.org
Computer Measurement Group, India 2
Contents
1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion
Computer Measurement Group, India 3
Not covered in this talk
• Mobile virtualization
- Motorola Evoke QA4 was the first phone to run two OSes.
- new Hyp mode in the ARM Cortex-A15 processor.
• Nested virtualization
- running one hypervisor on top of another.
• Network virtualization
- SDN, OpenFlow.
• Containers (aka OS-level virtualization)
- Solaris Zones, LXC, OpenVZ.
• Older hypervisors which did binary translation.
Computer Measurement Group, India 4
Classification
• Image : blog.technet.com/b/chenley
Computer Measurement Group, India 5
VMWare ESX
• Image : blog.vmware.com
Computer Measurement Group, India 6
VMWare ESX
• Each virtual machine has multiple worlds (threads): some correspond to guest CPUs, others are dedicated to device processing (run "esxtop" on the host).
• Monolithic kernel. Hardware support limited to
those drivers installed in the hypervisor.
Computer Measurement Group, India 7
KVM
Used in Google Cloud, Eucalyptus, and most OpenStack clouds.
• Image : Redhat Summit, June 2013
Computer Measurement Group, India 8
KVM
Linux is the hypervisor. Leverages Linux features
(device drivers, NAPI, CPU and IO schedulers,
cgroups, madvise, NUMA, etc.)
• Each guest OS sits inside a Linux process running QEMU; each virtual CPU is a thread inside this process.
• Uses QEMU for device virtualization. QEMU in one
guest is not aware of QEMU running in another
guest.
Computer Measurement Group, India 9
Microsoft HyperV
Used in Microsoft Azure cloud
Computer Measurement Group, India 10
Xen
When you use Amazon or Rackspace, you are using Xen.
Computer Measurement Group, India 11
Contents
1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion
Computer Measurement Group, India 12
Overheads introduced
1. CPU : nested scheduling triggers the lock preemption problem (use gang scheduling), VM exits are costly.
2. Memory : Nested page table, NUMA topology.
3. Disk : Nested filesystems, page cache, IO
schedulers, interrupt delivery, DMA.
4. Network : DMA, interrupt delivery.
The next few slides cover hardware assists, nested filesystems, nested IO schedulers and the benefits of IO paravirtualization.
Computer Measurement Group, India 13
Hardware assists
Hardware assists have considerably eased many
virtualization overheads:
1. CPU : Binary translation was replaced by new processor modes: root and non-root (guest) mode, each with 4 rings.
2. MMU : Shadow table in software replaced by EPT/Nested page
table.
3. IOMMU : during DMA, it translates Guest Physical Addresses used by the device into Machine Physical Addresses.
4. IO-APIC : interrupt delivery is done directly to the guest using
IDT.
5. SR-IOV : virtual functions implemented in the NIC (SR-IOV is also
defined for storage adapters but not yet implemented)
Benefits: Hardware assistance reduces CPU cache contention as well as the "Service Demand" on the VM (Service Demand = CPU Utilization / Throughput). Higher throughput is obtained at lower CPU utilization.
Computer Measurement Group, India 14
Hardware assists
IOMMU (image: intel.com) and APIC (image: virtualizationdeepdive.wordpress.com)
Computer Measurement Group, India 15
How much does hardware assist help?
Ganesan et al (2013) ran microbenchmarks on a 2-core Intel Xeon, comparing native vs Xen with and without hardware assistance (SR-IOV, IOMMU, Intel VT).
Finding: Network throughput is near-native with SR-IOV but CPU utilization still remains high (possibly because interrupt processing still triggers numerous guest VM-hypervisor transitions?).
Max throughput with iPerf:

            Throughput (Mbps)   Dom0 CPU (%)   VM CPU (%)
Native      940                 NA             16.68
SR-IOV      940                 20             65 (high)
No SR-IOV   192                 82             39
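As a sanity check, here is a minimal Python sketch (my addition) that applies the slide's definition Service Demand = CPU Utilization / Throughput to the table above, treating the CPU columns as utilization figures; the "per Gbps" normalization is my own choice of unit.

```python
# Hedged sketch: service demand derived from the iPerf table above.
results = {
    # name: (throughput in Mbps, VM CPU utilization)
    "Native":    (940, 16.68),
    "SR-IOV":    (940, 65.0),
    "No SR-IOV": (192, 39.0),
}

for name, (mbps, cpu) in results.items():
    demand = cpu / (mbps / 1000.0)   # VM CPU consumed per Gbps of traffic
    print(f"{name:9s}  {mbps:4d} Mbps  VM CPU {cpu:5.2f}  demand {demand:6.2f} per Gbps")
```

Although SR-IOV matches native throughput, its per-Gbps service demand on the VM is roughly four times the native figure, which is exactly the "CPU utilization still remains high" finding above.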
Computer Measurement Group, India 16
How much does hardware assist help?
Further results from Ganesan et al (2013). Disk throughput tested using RUBiS (disk+net intensive) and BLAST (disk intensive).
Finding: Disk IO is not yet benefiting from hardware assists. Most of
the RUBiS improvement comes from SR-IOV rather than IOMMU.
Similar finding with BLAST.
Computer Measurement Group, India 17
Nested IO scheduling
VM and hypervisor are both running IO scheduling algorithms (and
so is the disk drive). IO requests are rearranged and merged by the
IO scheduler (scheduler can be set in Xen Dom0 or KVM host but not
in ESX).
On Linux, there are 4 schedulers - CFQ, NOOP, Anticipatory,
Deadline. Each block device can have a different scheduler.
Image: dannykim.me
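On the Linux side (Xen Dom0 or the KVM host) the scheduler is exposed per block device through sysfs. A minimal Python sketch of inspecting and changing it is shown below; the device name "sda" is only an example, and writing requires root.

```python
# Hedged sketch: read and set the Linux IO scheduler for one block device via
# the standard /sys/block/<dev>/queue/scheduler interface.
from pathlib import Path

def current_scheduler(dev="sda"):
    text = Path(f"/sys/block/{dev}/queue/scheduler").read_text().strip()
    # The file looks like "noop deadline [cfq]"; the bracketed entry is active.
    active = text.split("[")[1].split("]")[0]
    available = text.replace("[", "").replace("]", "").split()
    return active, available

def set_scheduler(dev="sda", name="noop"):
    # Takes effect immediately and only for this device.
    Path(f"/sys/block/{dev}/queue/scheduler").write_text(name)

print(current_scheduler("sda"))
```

Sweeping this setting in the guest and in Dom0/host is how the scheduler combinations on the next slides can be reproduced.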
Computer Measurement Group, India 18
Nested IO scheduling
Results of Boutcher and Chandra
• Best combination of schedulers depends on workload.
• Tradeoff between fairness (across VMs) and throughput.
• Scheduler closest to workload has most impact.
• NOOP has the best throughput in the hypervisor but is least fair by Jain's fairness measure (a minimal sketch of the index follows this list).
• In guest VM, CFQ is 17% better than Anticipatory for FFSB
benchmark but for Postmark, Anticipatory is 18% better than CFQ.
• On Hypervisor, NOOP is 60% better than CFQ for FFSB and 72%
better for Postmark.
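For reference, Jain's fairness index used above is (Σx)² / (n · Σx²); it is 1.0 when all flows get equal throughput and approaches 1/n when one flow dominates. A minimal Python sketch with made-up numbers (not Boutcher's data):

```python
# Jain's fairness index over a set of per-VM throughputs.
def jain_fairness(throughputs):
    n = len(throughputs)
    total = sum(throughputs)
    return (total * total) / (n * sum(x * x for x in throughputs))

print(jain_fairness([100, 100, 100, 100]))  # 1.0   -> perfectly fair
print(jain_fairness([370, 10, 10, 10]))     # ~0.29 -> high total, very unfair
```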
Computer Measurement Group, India 19
Nested IO scheduling
Boutcher's numbers for FFSB on Xen 3.2, 128 threads, each VM
allocated contiguous 120GB space on 500GB SATA drive.
X-axis is scheduler in VM; Y-axis is scheduler in hypervisor.
Numbers in table are approx because they were converted from a
bar graph (Transactions per sec. on Xen).
Hypervisor \ VM   Anticipatory   CFQ   Deadline   NOOP
Anticipatory      200            260   175        240
CFQ               260            240   155        160
Deadline          315            360   250        255
NOOP              320            370   245        255
Computer Measurement Group, India 20
Sequential IO becomes random
• Sequential IO issued from multiple VMs to the same block device becomes random when aggregated in the hypervisor (see the toy illustration after this list).
• Set longer disk queue length in hypervisor to enable better
aggregation. On VMWare, you can set
disk.SchedReqNumOutstanding=NNN.
• Use PCI flash or SSDs to absorb random writes.
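A toy Python illustration of the first point (block ranges are arbitrary): two perfectly sequential streams become a seek-heavy pattern once the hypervisor merges them round-robin.

```python
# Two VMs each issue sequential LBAs; a round-robin merge in the hypervisor
# yields a stream where almost every transition is a seek.
vm1 = list(range(1000, 1008))       # sequential LBAs from VM1
vm2 = list(range(500000, 500008))   # sequential LBAs from VM2

merged = [lba for pair in zip(vm1, vm2) for lba in pair]

jumps = sum(1 for a, b in zip(merged, merged[1:]) if b != a + 1)
print(merged)
print(f"non-contiguous transitions: {jumps} of {len(merged) - 1}")
```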
Computer Measurement Group, India 21
Nested filesystems and page cache
A filesystem in the VM can map to a flat file on the underlying filesystem, a raw block device (local or iSCSI), or NFS. A flat file is preferred for ease of management; it can be in raw, qcow2, vmdk or vhd format.
[Diagram: guest filesystems on /dev/sda (VM1) and /dev/sdb (VM2) map to files under /vmfs on the hypervisor, which in turn sit on /dev/sdc.]
• Flat files introduce another performance overhead (next slide).
• KVM has four caching modes (none, writeback, writethrough, unsafe) which can disable/enable either cache.
• In Xen Dom0, the page cache comes into play when the file-
storage option is in use.
Computer Measurement Group, India 22
Nested filesystems and page cache
Le et al (2012) ran FileBench and “fio” on KVM. Tried 42 different
combinations of guest and host file systems. Found worst-case 67%
degradation.
Their conclusion:
• Read-dominated workloads benefit from readahead.
• Avoid journaled filesystems for write-dominated workloads.
• Latency goes up 10-30% in best case.
• The host FS should behave like a dumb disk or VirtFS; it should not make placement decisions on top of what the guest FS has already decided.
• Jannen (2013) found that overheads are worse for filesystems on SSD. On HDD, the overheads are masked by rotational latency.
Computer Measurement Group, India 23
Nested filesystems and page cache
Le et al (2012) – random file-write test using "fio".
Y-axis is the host file system; X-axis is the guest file system.
Throughput in MB/sec
Host \ Guest   ext2   ext3   reiser   xfs   jfs
ext2           60     55     65       80    95
ext3           60     55     65       80    75
ext4           60     55     55       70    95
reiser         60     55     65       80    100
xfs            60     40     60       70    65
jfs            60     50     65       80    105
Computer Measurement Group, India 24
Nested IO stacks : use paravirtualization
The hypervisor exposes a software-emulated virtual NIC or storage HBA to the VM. An IO request issued by the VM travels to the bottom of the guest stack before it is repackaged and reissued by the hypervisor.
Paravirtualization traps the IO request and uses shared memory to
route it faster to the hypervisor.
1. VMWare: Install “VMWare tools” and select “Paravirtual SCSI
controller” for storage and “vmxnet” driver for networking.
VMWare claims PVSCSI offers 12% throughput improvement with
18% less CPU cost with 8Kb block size (blogs.vmware.com)
2. KVM: use newer in-kernel “vhost-net” for networking and “virtio-
scsi” or “virtio-blk-data-plane” drivers for storage.
3. Xen: Split-driver used for PVM guests while HVM guests use
QEMU or StubDom. HVM can also use PV drivers.
Computer Measurement Group, India 25
Xen: PVM and HVM difference
HVM is 3-6% better than a PV guest for CPU+RAM intensive tests, measured on 1 VM with 4 vCPUs and a 6GB JVM heap (Yu 2012).
Computer Measurement Group, India 26
Xen: PVM and HVM difference
Here, HVM was using PV drivers; it outperforms PV by 39% for a disk-intensive test running on SSD (Yu 2012).
Computer Measurement Group, India 27
Contents
1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion
Computer Measurement Group, India 28
Virtual Machine Benchmarking
Two aspects
1. Performance : How does a consolidated server compare to a non-virtualized OS running on bare-metal hardware?
2. Isolation : Does overloading one VM bring down the performance of other VMs running on the same node?
Impact of factors
• Virtualization-sensitive instructions issued by the guest.
• VM exits and interrupt delivery.
• Architectural choices made within the hypervisor.
• Interference between VMs due to shared resources (visible and invisible).
Computer Measurement Group, India 29
Testing isolation capability of a hypervisor
1. Run application on cloud with collocated VMs.
2. Then run in isolation with no collocated VMs to find the gaps.
3. Then run it in a controlled environment, gradually adding
collocated VMs which create CPU, disk or network load, until you
can simulate the behaviour seen in the cloud.
Computer Measurement Group, India 30
CPU isolation on Xen
(Barker and Shenoy study 2010)
Find variation in completion times for a single thread which is
running a “floating point operations” test periodically over a few
hours.
• Amazon EC2 small instance: Average completion time was 500
ms, but there was significant jitter. Some tests even took an
entire second.
• Local setup: Same test on local Xen server showed almost NO
variation in completion time.
• Conclusion: CPU scheduler on Xen does not provide perfect
isolation.
• Further tests done to narrow down the problem in the CPU
scheduler.
Computer Measurement Group, India 31
Xen’s credit scheduler for CPU
(Barker and Shenoy study 2010)
Xen has 2 main CPU schedulers – EDF (realtime) and Credit (default).
Each VM runs on one or more virtual CPUs(vCPU). Hypervisor maps
vCPU to physical CPUs (floating or pinned).
For each VM, you can define (weight, cap).
1. Weight = proportion of CPU allocated.
2. Cap = max limit or ceiling on CPU time.
The credit scheduler periodically issues 30 ms of credit to each vCPU. The allocation is decremented in 10 ms accounting intervals. When its credits expire, the VM must wait until the next 30 ms cycle. If a VM receives an interrupt, it gets a "Boost" which inserts it at the top of the vCPU queue, provided it has not exhausted its credits.
Scheduler also has a work-conserving mode which transfers unused
capacity to those VMs that need it.
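A toy Python model (not Xen's actual implementation) of the mechanism just described: 30 ms of credit per vCPU, 10 ms accounting ticks, and a boost that only helps a vCPU that still has credit left.

```python
from collections import deque

CREDIT_MS, TICK_MS = 30, 10

class VCpu:
    def __init__(self, name):
        self.name, self.credit = name, CREDIT_MS

def run_period(vcpus, interrupts=()):
    """Simulate one 30 ms allocation period; 'interrupts' names boosted vCPUs."""
    for v in vcpus:
        v.credit = CREDIT_MS               # fresh credit for the new period
    queue = deque(vcpus)
    for name in interrupts:                # Boost: move to the head of the queue,
        for v in list(queue):              # but only if credit remains.
            if v.name == name and v.credit > 0:
                queue.remove(v)
                queue.appendleft(v)
    schedule = []
    while queue:
        v = queue.popleft()
        v.credit -= TICK_MS                # debit one 10 ms accounting tick
        schedule.append(v.name)
        if v.credit > 0:
            queue.append(v)                # still has credit: run again later
    return schedule

vcpus = [VCpu("vm1"), VCpu("vm2"), VCpu("vm3")]
print(run_period(vcpus, interrupts=["vm3"]))   # vm3 runs first thanks to Boost
```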
Computer Measurement Group, India 32
CPU isolation on Xen
(Barker and Shenoy study 2010)
On local setup, tied two VMs to same physical core. Varied (weight,
cap) of foreground VM while keeping background VM busy.
1. First test: Keep weights of both VMs equal. Observed jitter as
seen on EC2 test.
2. Second test: Vary “weight” while keeping “cap” constant.
Weight does not directly correspond to CPU time. Weight ratio of
1:1000 only translates into actual CPU ratio of 1:1.5 (33% more).
3. Third test: Vary “cap” on both VMs. CPU allocation of
foreground VM was in proportion to the “cap” (even when
background VM was idle).
Conclusion: Strong isolation requires pinning VM to a core or
setting the “cap”.
Computer Measurement Group, India 33
Disk isolation on Xen
Test jitter for small random or large streaming IO to simulate game
servers and media servers.(Barker, 2010)
Amazon EC2 : Found significant variation in completion time for reads and writes. Write bandwidth can vary up to 50% from the mean.
Read bandwidth variation can be due to caching side-effects.
Isolated local setup: Completion times are consistent if there is no
other VM on the Xen node.
Introduce a background VM: Run same tests with another
background VM doing heavy IO. Used CFQ IO scheduler in Dom0
and NOOP in guest VM.
Finding: Xen has no native disk isolation mechanism to identify per-VM disk flows. Throughput of the foreground VM dropped by 65-75% and latency increased by 70-80%. The degradation is bounded only by the round-robin policy of the Xen Dom0 driver.
Computer Measurement Group, India 34
Network isolation on Xen
(Barker 2010)
1.Measure “ping” time to next hop
2.Measure sum of “ping” time to first three hops.
3.Measure time to transfer 32KB block between local and EC2
instance.
Pop quiz: what is the point in conducting these three tests ?
Computer Measurement Group, India 35
Network isolation on Xen
(Barker 2010)
1.Measure “ping” time to next hop
2.Measure sum of “ping” time to first three hops.
3.Measure time to transfer 32KB block between local and EC2
instance.
Purpose:
a)First measurement captures jitter of network interface
b)Second captures jitter in routers inside Amazon data center.
c)Third captures Internet WAN transfer rate and jitter.
1. Saw no jitter in first measurement.
2. Significant variation in the second: most took 5 ms, but a significant number took an order of magnitude longer.
3. Third test showed regular variation (related to peak hours) typical
of most WAN applications.
Computer Measurement Group, India 36
Network isolation on Xen
Network latency tests on a Game server and a Media server on local
Xen cloud. (Barker 2010)
Found that “tc” defines per-VM flows using IP address and provides
good isolation. Two ways to allocate bandwidth using Linux “tc” tool.
1.Dedicated : Divide bandwidth between competing VMs and
prevent any VM from using more (i.e. Weight + cap).
2.Shared : Divide bandwidth but allow VMs to draw more if required
(i.e. Weight + work-conserving).
In both game and media server tests, results are consistent.
“Dedicated” mode produced lower latency while “shared”
mode produced lower jitter.
Interference   Mean      Std deviation
Dedicated      23.6 ms   29.6
Shared         33.9 ms   16.9
Computer Measurement Group, India 37
Long latency tails on EC2
(Xu et al, Bobtail, 2013)
Initial observations:
1. Median RTTs within EC2 are up to 0.6 ms, but the 99.9th-percentile RTT on EC2 is 4 times longer than that seen in dedicated data centers. (In other words, a few packets see much longer delays than normal.)
2. Small instances most susceptible to the problem.
3. Measured RTT between node pairs on same AZ. Pattern not
symmetric. Hence, long tail not caused by location of host on
network.
4. RTT between good and bad nodes in AZ can differ by order of
magnitude.
5. One AZ which had newer CPU models did not return that many
bad nodes.
Computer Measurement Group, India 38
Long latency tails on EC2
(Xu et al, Bobtail, 2013)
Experimental setup: On 4-core Xen server, dedicate 2 cores to
Dom0. Remaining 2 cores are shared between 5 VMs with 40%
share each. Vary the combination of latency-sensitive versus CPU-
intensive VMs.
Latency-sensitive   CPU-intensive   RTT
5                   0               1 ms
4                   1               1 ms
3                   2               <10 ms
2                   3               ~30 ms
1                   4               ~30 ms
The long tail emerges when the number of CPU-intensive VMs exceeds the number of shared cores.
Computer Measurement Group, India 39
Long latency tails on EC2
(Xu et al, Bobtail, 2013)
Hypothesis: Do all CPU-intensive VMs cause a problem?
Test: Vary the CPU usage of CPU-intensive VM to find out.
Long tail occurs when a competing VM does not use all its
CPU allocation. The Boost mechanism for quickly scheduling
latency-sensitive VMs fails against such VMs.
Computer Measurement Group, India 40
Long latency tails on EC2
(Xu et al, Bobtail, 2013)
1. Latency-sensitive VMs cannot respond in a timely manner
because they are starved of CPU by other VMs.
2. The VMs which starve them are those that are CPU-intensive but
are not using 100% of their allocation within 30ms.
3. The BOOST mechanism in the Xen scheduler runs in FIFO manner
and treats these two types of VMs equally instead of prioritizing
the latency-sensitive VM.
Authors designed “Bobtail” to select the EC2 instance on which to
place a latency-sensitive VM. (see paper)
Computer Measurement Group, India 41
EC2 Xen settings
Tested for small instances:
1. EC2 uses Xen credit scheduler in non-work-conserving mode,
which reduces efficiency but improves isolation.
2. It allows vCPUs to float across cores instead of pinning them to a
core.
3. Disk and network scheduling is work-conserving, but only
network scheduling has a max cap of 300 Mbps.
(Varadarajan, 2012)
Computer Measurement Group, India 42
Know your hypervisor : Xen
Xen :
• CPU has two schedulers : Credit (and Credit2) and EDF.
• The credit scheduler keeps a per-VM (weight, cap) and can be work-conserving or not. Work-conserving means "distribute any idle time to other runnable vCPUs"; otherwise the total CPU quantum is capped.
• I/O intensive VMs benefit from BOOST, which bumps a vCPU to
the head of the queue when it receives an interrupt, provided it
has not exhausted its credits.
Device scheduler:
• Disk and network IO goes through Domain 0, which schedules requests in batches in round-robin fashion. To control network bandwidth, use Linux tools to define per-VM flows.
Best practice: Increase CPU weight of Dom0 to be proportional to
the amount of IO. Dedicate core(s) to it. Dedicate memory and
prevent ballooning.
Computer Measurement Group, India 43
Know your hypervisor - KVM
• QEMU originally started as a complete machine emulator [Bellard, 2005]. Code emulation is done by TCG (tiny code generator), originally called "dyngen". KVM was later added as another code accelerator into the QEMU framework.
• Only one IO thread; the big QEMU lock is held in many IO functions.
• A Red Hat "fio" benchmark in Sept 2012 reported 1.4M IOPS with 4 guests, but this was using passthrough IO (i.e. bypassing QEMU).
• Similar numbers reported in Mar 2013 but this time using an
experimental virtio-dataplane feature which utilizes dedicated per-
device threads for IO.
• Performance of RTOS (as a guest OS) in KVM also suffers when it
comes in contact with QEMU [Kiszka].
Computer Measurement Group, India 44
Tile-based benchmarking to test consolidation
Traditional benchmarks are designed for individual servers. For
virtualization, tiles of virtual machines that mimic actual
consolidation are used.
1. SPECvirt sc2013 (supersedes SPECvirt sc2010)
2. VConsolidate(Intel): tile consisting of SPECjbb, Sysbench,
Webbench and a mail server
3. VMmark (VMware) : Exchange mail server, standby system, Apache server, database server.
SPECvirt sc2013:
• Run for 3 hours on a single node to stress CPU, RAM, disk, and network.
• Incorporates four workloads : a web server, 4 Java app servers connected to a backend database server (to test multiple vCPUs on SMP), a mail server and a batch server.
• Keep adding additional sets of virtual machines (tiles) until overall throughput reaches a peak. All VMs must continue to meet the required QoS (spec.org/virt_sc2013).
Computer Measurement Group, India 45
SPECvirt sc2013
Computer Measurement Group, India 46
Contents
1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion
Computer Measurement Group, India 47
Value of analytic model
Benchmarks have to:
• produce repeatable results,
• be easily comparable across architectures & platforms,
• have predictive power (extrapolation).
There is a tension between realism and reproducibility. Macrobenchmarks simulate real-world conditions but are not comparable and lack extrapolation.
Microbenchmarks determine the cost of primitive operations.
Need analytic model to tie benchmarks to prospective application
use. Seltzer proposed three approaches :
1. Vector-based: combine system vector with application vector.
2. Trace-based : Generate workload from trace to capture dynamic
sequence of requests.
3. Hybrid : combination of both.
(Mogul 1999; Seltzer et al 1999)
Computer Measurement Group, India 48
Analytic models for virtualization
1.Layered queuing network (Menasce; Benevenuto
2006).
2.Factor graphs to determine per-VM utilization (Lu
2011)
3.Trace-based approach (Wood, et al)
4.VMBench (Moller @ Karlsruhe)
5.Equations for cache and core interference
(Apparao, et al).
6.Machine learning
Computer Measurement Group, India 49
Layered Queueing network (for Xen)
[Diagram: layered queueing network on Xen — requests flow IN, are served by the VM, Domain 0 and the disk, then flow OUT.]
Computer Measurement Group, India 50
Layered Queueing network (for Xen)
Total response time R = R(VM) + R(dom0) + R(disk)
For M/M/1 with feedback, the response time of one resource is
R = D / (1 - U)
U = Utilization = λ * D = Arrival rate * Service demand (U lies between 0 and 1)
D = Service demand = total time taken by one request
D(resource by VM) = D(bare) * Slowdown(resource) / P(VM)
D(resource by Dom0) = D(VM) * Cost(Dom0) / P(IDD)
where
P = speedup of the VM's hardware compared to bare metal
Cost(Dom0) = BusyTime(Dom0) / BusyTime(VM)
Slowdown(resource) = BusyTime(virtual) / BusyTime(bare)
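A minimal numeric sketch of the model above; the service demands and arrival rate are invented placeholders, not measured values.

```python
# Each resource is treated as an M/M/1-like station: U = lambda * D and
# R = D / (1 - U); the per-request response time is the sum over the layers.
def response_time(demand_s, arrival_rate):
    u = arrival_rate * demand_s             # utilization of this resource
    if u >= 1:
        raise ValueError("resource saturated (U >= 1)")
    return demand_s / (1.0 - u)

demands = {"vm_cpu": 0.004, "dom0_cpu": 0.002, "disk": 0.006}  # seconds/request
arrival_rate = 60.0                                            # requests/second

for name, d in demands.items():
    print(f"{name:9s} U={arrival_rate * d:.2f} "
          f"R={response_time(d, arrival_rate) * 1000:.2f} ms")

total = sum(response_time(d, arrival_rate) for d in demands.values())
print(f"total response time = {total * 1000:.2f} ms")
```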
Computer Measurement Group, India 51
Factor graphs
Resource utilization inside each guest VM is known and the aggregate utilization at the hypervisor is known. How do we determine the function which defines per-VM utilization of each resource?
This can be modeled as a “source separation problem” studied in
digital signal processing.
Measurements inside VM and on hypervisor can differ:
1.Disk IO inside VM can be higher than on the hypervisor due to
merging of IOs in the hypervisor.
2.CPU utilization inside a VM can be half of that at the hypervisor
because Xen issues per-VM IO through Dom0 (seen via “xentop”).
Computer Measurement Group, India 52
Factor graphs
[Diagram: factor graph connecting the per-resource utilization (CPU, Disk, Net, Mem) of the Host, VM1 and VM2 through factor nodes h1-h4, f1-f4 and g1-g4.]
Computer Measurement Group, India 53
Trace-based approach
How to model the migration from bare-metal to virtual
environment?
1.Create platform profiles to measure cost of primitive
operations: Run same microbenchmarks on native (bare-metal)
and virtualized platform.
2.Relate native and virtualized : Formulate set of equations
which relate native metrics to virtualized.
3.Capture trace of application which is to be migrated:
Determine how many primitive operations it uses and plug it in.
(The actual process employs statistical methods and is more complicated.)
Computer Measurement Group, India 54
Trace-based approach
How to model the migration from bare-metal to virtual
environment?
Step 1: Create platform profiles.
Run carefully chosen CPU, disk and network-intensive
microbenchmarks on both the bare-metal and virtual environment.
Measure key metrics for each benchmark :
a)CPU – percentage time spent in user, kernel and iowait
b)Network – read and write packets/sec and bytes/sec
c)Disk – read and write blocks/sec and bytes/sec.
Metric      CPU user   CPU sys   CPU iowait
BareMetal   23         13        3
Virtual     32         20        8
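A hedged sketch of collecting one such profile row with the psutil library (my tool choice; the approach does not prescribe one). The iowait field is Linux-specific.

```python
import psutil

def profile_sample(interval=1.0):
    d0, n0 = psutil.disk_io_counters(), psutil.net_io_counters()
    cpu = psutil.cpu_times_percent(interval=interval)   # blocks for 'interval'
    d1, n1 = psutil.disk_io_counters(), psutil.net_io_counters()
    return {
        "cpu_user": cpu.user,
        "cpu_sys": cpu.system,
        "cpu_iowait": getattr(cpu, "iowait", 0.0),       # Linux only
        "disk_read_Bps": (d1.read_bytes - d0.read_bytes) / interval,
        "disk_write_Bps": (d1.write_bytes - d0.write_bytes) / interval,
        "net_rx_Bps": (n1.bytes_recv - n0.bytes_recv) / interval,
        "net_tx_Bps": (n1.bytes_sent - n0.bytes_sent) / interval,
    }

print(profile_sample())
```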
Computer Measurement Group, India 55
Trace-based approach
How to model the migration from bare-metal to virtual
environment?
Step 2: Relate native and virtualized : Formulate set of
equations which relate native metrics to virtualized.
e.g. Util(cpu on VM) = c0 + c1*M1 + c2*M2 + ... + cn*Mn
where Mk = metric gathered from the native microbenchmark.
Solve for the model coefficients using Least Squares Regression.
The coefficients c_k capture relation between native and virtualized
platform.
e.g. c0=4, c1=19, c2=23, etc...
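A minimal sketch of this regression step using numpy least squares; the metric and utilization numbers below are invented placeholders, not the paper's data.

```python
import numpy as np

# Rows = microbenchmark runs; columns = native metrics M1..Mn for each run.
native_metrics = np.array([
    [23.0, 13.0, 3.0],
    [40.0, 20.0, 5.0],
    [10.0,  5.0, 1.0],
    [55.0, 30.0, 9.0],
])
# CPU utilization observed when the same benchmarks run on the virtual platform.
virt_cpu_util = np.array([32.0, 55.0, 15.0, 78.0])

# Prepend a column of ones so c0 acts as the intercept.
X = np.hstack([np.ones((native_metrics.shape[0], 1)), native_metrics])
coeffs, *_ = np.linalg.lstsq(X, virt_cpu_util, rcond=None)
print("c0..cn:", coeffs)

# Step 3 preview: apply the fitted model to a new application's native trace.
new_trace = np.array([1.0, 30.0, 15.0, 4.0])   # leading 1 for the intercept
print("predicted VM CPU utilization:", new_trace @ coeffs)
```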
Computer Measurement Group, India 56
Trace-based approach
How to model the migration from bare-metal to virtual
environment?
Step 3: Capture trace of application which is to be migrated:
Find the new metrics Mk from the trace, plug them into the fitted equation, and solve. Voila!
Util(cpu on VM) = 4 + 19*M1 + 23*M2 + ...
Recap:
1. Create platform profiles for native and virtual.
2. Find coefficients which relate native & virtual.
3. Capture application trace and apply the equation.
Their findings:
1. A single model is not applicable to both Intel and AMD, since CPU utilization varies.
2. A feedback loop within the application can distort the performance prediction.
Computer Measurement Group, India 57
Conclusion
All problems in CS can be solved by another level of indirection.
- David Wheeler (1927-2004, first PhD in Computer Science)
... and performance problems introduced by indirection require
caching, interlayer cooperation and hardware assists (e.g. TLB
cache, EPT, paravirtualization).
Virtual machines have finally arrived. Dismissed for a
number of years as merely academic curiosities, they are
now seen as cost-effective techniques for organizing
computer systems resources to provide extraordinary
system flexibility and support for certain unique
applications.
[Goldberg, Survey of Virtual Machine Research, 1974]
Computer Measurement Group, India 58
References
1. Ganesan et al. Empirical study of performance benefits of
hardware assisted virtualization, 2013.
2. Boutcher and Chandra. Does virtualization make disk scheduling passé?
3. Le et al. Understanding Performance Implications of Nested File
Systems in a Virtualized Environment.
4. Jannen. Virtualize storage, not disks.
5. Yu. Xen PV Performance status and Optimization Opportunities.
6. Barker and Shenoy. Empirical evaluation of latency-sensitive
application performance in the cloud
7. Xu. Bobtail. Avoiding long tails in the cloud.
8. Varadarajan et al. Resource freeing attacks.
9. Bellard. QEMU, a fast and portable dynamic translator.
10.Kiszka. Using KVM as a realtime hypervisor
11.Mogul. Brittle metrics in operating system research.
12.Seltzer et al. The Case for Application-Specific Benchmarking
Computer Measurement Group, India 59
References
1. Menasce. VIRTUALIZATION: CONCEPTS, APPLICATIONS, AND
PERFORMANCE MODELING
2. Benevenuto et al. Performance Models for Virtualized Applications
3. Lu et al. Untangling Mixed Information to Calibrate Resource
Utilization in Virtual Machines, 2011.
4. Wood. Profiling and Modeling Resource Usage of Virtualized
Applications
Computer Measurement Group, India 60
BACKUP SLIDES
Computer Measurement Group, India 61
Classification
• OS-level virtualization : Does not run any intermediary
hypervisor. Modify the OS to support namespaces for
networking, processes and filesystem.
• Paravirtualization : Guest OS is modified and is aware
that it is running inside a hypervisor.
• Full virtualization : Guest OS runs unmodified.
Hypervisor emulates hardware devices.
Computer Measurement Group, India 62
NUMA/SMP
• If you run a monster server VM with many vCPUs, you may have
to worry about NUMA scaling. Depending on NUMA ratio, 30-40%
higher cost (latency and throughput) in accessing remote memory
• Hypervisor must be able to
1.manually pin a vCPU to a core.
2.export NUMA topology to the guest OS.
3.do automatic NUMA-aware scheduling of all guest VMs.
• VMWare introduced vNUMA in vSphere 5.
• On Xen, pin Dom0 to a core. In case of NUMA, put frontend and backend drivers on the same core.
• KVM exports NUMA topology to the VM but is still lagging on automatic scheduling.
Computer Measurement Group, India 63
NUMA/SMP
• Depending on NUMA ratio, 30-40% higher cost (latency and
throughput) in accessing remote memory
• Hypervisor must support ability to pin a vCPU to a core, and also
allocate memory from specific NUMA node.
• Hypervisor must export NUMA topology (ACPI tables) so guest OS
can do its job.
• Hypervisor should do automatic NUMA-aware scheduling of all
guest VMs.
• VMWare introduced vNUMA in vSphere 5.
• On Xen, pin Dom0 to a core. In case of NUMA, put frontend and
backend drivers on the same core.
• KVM exports NUMA topology to VM but it is still lagging on
automatic scheduling.
• Cross-call overhead : On a SMP machine, when a semaphore is
released by one thread, it issues a cross-call or inter-processor
interrupt if the waiting threads are sleeping on another core. On
a VM, the cross-call becomes a costly privileged op (Akther).
Interrupt delivery may also trigger a cross-call.
Computer Measurement Group, India 64
Nested CPU scheduling
• Each guest OS runs on one or more virtual CPUs. Hypervisor
schedules virtual CPUs on its run queue and then each guest OS
decides which task to run on that virtual CPU.
• Introduces lock preemption problem: A process in the guest OS
may get scheduled out by the hypervisor while holding a spin lock,
delaying other processes waiting for that spin lock.
• Guest OS would not schedule out a process holding a spin lock but
hypervisor is unaware of processes within the guest OS.
• Solution is some form of co-scheduling or “gang scheduling”.
VMWare actively seeks to reduce “skew” between multiple vCPUs
of the same VM.
Computer Measurement Group, India 65
Nested Page tables
• Page fault in VM may occur because the hypervisor has not
allocated RAM to the VM.
• Guest Page table : Guest Virtual address -> Hypervisor Virtual
• Hypervisor Page Table : Hypervisor Virtual -> Actual RAM.
• Earlier, hypervisors would maintain a "shadow page table" for each guest OS. This function has now moved into hardware from both Intel and AMD; it's called "nested page tables".
• Nested page tables require a costly two-dimensional page walk: for each step that is resolved in the guest table, you have to look up the host table (a rough cost sketch follows this list).
• Overhead can be alleviated by using “huge pages” and per-VM
tags in the TLB cache.
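A back-of-the-envelope Python sketch of the walk cost referenced above; the n*m + n + m count is the standard approximation for an uncached two-dimensional walk, not a figure taken from these slides.

```python
# Every one of the n guest-table steps needs an m-level host walk, plus a
# final host walk for the resulting guest physical address.
def nested_walk_refs(guest_levels=4, host_levels=4):
    return guest_levels * host_levels + guest_levels + host_levels

print("native 4-level walk  :", 4, "memory references")
print("nested 4+4 level walk:", nested_walk_refs(), "memory references")  # 24
```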
Computer Measurement Group, India 66
Memory overheads & solutions
Balloon driver : take back memory from guest.
• -- VMWare
• -- KVM (see virtio_balloon.c in linux_kernel/drivers/virtio)
• -- HyperV calls it Dynamic Memory
• -- Xen Transcendent Memory
Memory deduplication
• -- Present in System/370 (Smith & Nair);
• -- VMWare calls it Transparent Page Sharing (patented)
• -- KVM uses KSM (which calls Linux madvise())
• -- Xen uses KSM in HVM mode only.
Computer Measurement Group, India 67
Quantifying isolation
• Deshane et al (2007) defined BenchVM to test isolation.
• Run normal VMs alongside an overloaded VM and test whether the normal VM remains responsive.
• On the Overloaded VM, you run various stress tests:
1.CPU stress test
2.Memory stress test : calloc and touch memory without free() (a minimal sketch follows below)
3.Network : threaded UDP send and receive
4.Disk : IOzone
5.Fork bomb : test fast process creation and scheduling.
Their conclusion: Full virtualization provides better isolation than
container-based virtualization. Their other results may be outdated
due to advances in virtualization
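A minimal sketch of the memory stress test from the list above ("calloc and touch memory without free()"), transposed to Python; the 64 MiB chunk size and 10-second duration are arbitrary choices.

```python
import time

PAGE = 4096
CHUNK = 64 * 1024 * 1024          # 64 MiB per allocation (arbitrary)
hoard = []                        # references are kept so nothing is freed

def memory_stress(seconds=10):
    deadline = time.time() + seconds
    while time.time() < deadline:
        buf = bytearray(CHUNK)
        for off in range(0, CHUNK, PAGE):
            buf[off] = 1          # touch each page to force it resident
        hoard.append(buf)
        print(f"resident so far: {len(hoard) * CHUNK // (1024 * 1024)} MiB")

memory_stress()
```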
Computer Measurement Group, India 68
VM exits are costly
• Interrupt processing causes context switches between VM and
hypervisor.
• KVM EOI optimization : guest IDT (interrupt descriptor table) is
shadowed.
• VMWare detects cluster of instructions that can cause guest exits.
• Use combination of polling and interrupt (NAPI)
Computer Measurement Group, India 69
mClock
• Disk capacity varies dynamically and cannot be statically allocated
like CPU or RAM.
• Need proportional sharing algorithm to reserve disk capacity
• Gulati et al propose a dynamic algorithm which interleaves two
schedulers and uses three tags with every IO request.
Computer Measurement Group, India 70
Hadoop benchmark
• VMWare :
• HyperV (conducted on HDInsight – Microsoft's version of
Hadoop) :
• KVM:
• Xen: (Netflix runs map-reduce on AWS)
Computer Measurement Group, India 71
HPC/Scientific benchmark
• VMWare paper : SPEC MPI and SPEC OMP
• Xen : Jackson et al (2010) ran NERSC on Amazon EC2. Six times
slower than Linux cluster and 20 times slower than modern HPC
system. EC2 interconnect severely limits performance. Could not
use processor-specific compiler options since heterogenous mix of
CPUs on every node.
• In Jun 2010, Amazon launched “Cluster Compute Nodes” which are
basically nodes running Xen in hvm mode connected via 10G
ethernet (no Infiniband yet).
• KVM and OpenVZ : Regola (2010) ran NPB on these nodes.
Computer Measurement Group, India 72
Realtime benchmark
• In order to minimize jitter and limit the worst-case latency, a
realtime system must provide mechanisms for resource
reservation, process preemption and prevention of priority
inversion.
• Soft realtime (VoIP) vs Hard realtime. Soft means 20ms jitter
between packets acceptable.
• RT-XEN
• Kiszka KVM – QEMU driver lock.
Computer Measurement Group, India 73
Layered Queueing network (Xen)
Total response time R = R(vcpu) + R(dom0_cpu) + R(disk)
Resp. Time = Demand/[ 1- Utilization ]
R(vcpu) = D(vcpu)/ [ 1 – U (vcpu) ]
R(dom0_cpu) = D(dom0_cpu)/ [ 1 - U(dom0_cpu) ]
R(disk) = D(disk)/ [ 1 – U(disk) ]
Util = λ * D = Arrival rate * Demand
D(vm_cpu) = D(isol_cpu) * S(cpu)/P(vm) where S=slowdown,
P=speedup
D(dom0_cpu) = D(vm_cpu) * Cost(dom0_vm)/P(dom0)
Cost(dom0_vm) = B(dom0_cpu)/B(vm_cpu) where B = busy time
Slowdown(cpu) = B(vm_cpu)/B(bare_cpu)
Slowdown(disk) = B(vm_disk)/B(bare_disk)
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Virtualization overheads

  • 15. Computer Measurement Group, India 15 How much does hardware assist help?
Ganesan et al (2013) ran microbenchmarks on 2-core Intel Xeon. Compare native vs Xen, with and without hardware assistance (SR-IOV, IOMMU, Intel VT).
Finding: Network throughput is near-native with SR-IOV but CPU utilization still remains high (possibly because interrupt processing is still triggering numerous guest VM-hypervisor transitions?).
Chart shows Max throughput with iPerf:
              Mbps   Dom0 CPU   VM CPU
Native         940      NA       16.68
SR-IOV         940      20       65 (high)
No SR-IOV      192      82       39
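To make the cost comparison concrete, here is a small illustrative calculation (my own, not from the paper) that divides the reported VM CPU utilization by the achieved throughput, i.e. the service demand per Mbps:

```python
# Illustrative arithmetic on the iPerf table above: "service demand" here is
# the VM CPU utilization consumed per Mbps of achieved network throughput.
configs = {
    # name: (throughput in Mbps, VM CPU utilization in %)
    "Native":    (940, 16.68),
    "SR-IOV":    (940, 65.0),
    "No SR-IOV": (192, 39.0),
}

for name, (mbps, cpu) in configs.items():
    demand = cpu / mbps
    print(f"{name:10s}  {mbps:4d} Mbps  {cpu:5.2f}% VM CPU  "
          f"-> {demand:.3f} %CPU per Mbps")

# SR-IOV matches native throughput but burns roughly 4x the CPU per Mbps,
# which matches the slide's observation that utilization stays high.
```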
  • 16. Computer Measurement Group, India 16 How much does hardware assist help? Further results from Ganesan et al (2013). Disk throughput tested using RUBiS (disk+net intensive) and BLAST (disk intensive). Finding: Disk IO is not yet benefiting from hardware assists. Most of the RUBiS improvement comes from SR-IOV rather than IOMMU. Similar finding with BLAST.
  • 17. Computer Measurement Group, India 17 Nested IO scheduling VM and hypervisor are both running IO scheduling algorithms (and so is the disk drive). IO requests are rearranged and merged by the IO scheduler (scheduler can be set in Xen Dom0 or KVM host but not in ESX). On Linux, there are 4 schedulers - CFQ, NOOP, Anticipatory, Deadline. Each block device can have a different scheduler. Image: dannykim.me
  • 18. Computer Measurement Group, India 18 Nested IO scheduling
Results of Boutcher and Chandra:
• The best combination of schedulers depends on the workload.
• There is a tradeoff between fairness (across VMs) and throughput.
• The scheduler closest to the workload has the most impact.
• NOOP has the best throughput in the hypervisor but is the least fair by Jain's fairness measure (see the sketch below).
• In the guest VM, CFQ is 17% better than Anticipatory for the FFSB benchmark, but for Postmark, Anticipatory is 18% better than CFQ.
• On the hypervisor, NOOP is 60% better than CFQ for FFSB and 72% better for Postmark.
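For reference, Jain's fairness index is (sum of x)^2 / (n * sum of x^2); a minimal Python sketch with illustrative throughput numbers (not Boutcher's data):

```python
def jain_fairness(throughputs):
    """Jain's fairness index: 1.0 means perfectly fair,
    1/n means a single VM gets all of the throughput."""
    n = len(throughputs)
    total = sum(throughputs)
    return (total * total) / (n * sum(x * x for x in throughputs))

# Hypothetical per-VM throughputs (MB/s) under two hypervisor schedulers.
print(jain_fairness([100, 100, 100, 100]))  # 1.0   -> perfectly fair
print(jain_fairness([370, 40, 30, 20]))     # ~0.38 -> high aggregate, unfair
```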
  • 19. Computer Measurement Group, India 19 Nested IO scheduling
Boutcher's numbers for FFSB on Xen 3.2, 128 threads, each VM allocated contiguous 120GB space on 500GB SATA drive. X-axis is scheduler in VM; Y-axis is scheduler in hypervisor. Numbers in table are approx because they were converted from a bar graph (Transactions per sec. on Xen).
Hypervisor \ VM   Anticipatory   CFQ   Deadline   NOOP
Anticipatory          200        260      175      240
CFQ                   260        240      155      160
Deadline              315        360      250      255
NOOP                  320        370      245      255
  • 20. Computer Measurement Group, India 20 Sequential IO becomes random • Sequential IO issued from multiple VMs to same block device becomes random when aggregated in the hypervisor. • Set longer disk queue length in hypervisor to enable better aggregation. On VMWare, you can set disk.SchedReqNumOutstanding=NNN. • Use PCI flash or SSDs to absorb random writes.
  • 21. Computer Measurement Group, India 21 Nested filesystems and page cache
A filesystem in the VM can map to a flat file on an underlying filesystem, a raw block device (local or iSCSI), or NFS. A flat file on a filesystem is preferred for ease of management; it can be in raw, qcow2, vmdk or vhd format.
(Diagram: guest filesystems on /dev/sda in VM-1 and /dev/sdb in VM-2 map to files under /vmfs on the hypervisor's /dev/sdc.)
• Flat files introduce another performance overhead (next slide).
• KVM has four caching modes (none, writeback, writethrough, unsafe) which can disable/enable either cache.
• In Xen Dom0, the page cache comes into play when the file-storage option is in use.
  • 22. Computer Measurement Group, India 22 Nested filesystems and page cache Le et al (2012) ran FileBench and “fio” on KVM. Tried 42 different combinations of guest and host file systems. Found worst-case 67% degradation. Their conclusion: • Read-dominated workloads benefit from readahead. • Avoid journaled filesystems for write-dominated workloads. • Latency goes up 10-30% in best case. • Host FS should be like a dumb disk or VirtFS; it should not make placement decisions over what guest FS has decided. • Jannen (2013) found that overheads are worse for filesystems on SSD. On HDD, the overheads are masked by rotational latency.
  • 23. Computer Measurement Group, India 23 Nested filesystems and page cache
Le et al (2012) – random file write test using “fio”. Y-axis is Host file system; X-axis is Guest file system. Throughput in MB/sec:
Host \ Guest   ext2   ext3   reiser   xfs   jfs
ext2            60     55      65      80    95
ext3            60     55      65      80    75
ext4            60     55      55      70    95
reiser          60     55      65      80   100
xfs             60     40      60      70    65
jfs             60     50      65      80   105
  • 24. Computer Measurement Group, India 24 Nested IO stacks: use paravirtualization
The hypervisor exposes a software-emulated virtual NIC or storage HBA to the VM. An IO request issued by the VM travels to the bottom of the guest stack before it is repackaged and reissued by the hypervisor. Paravirtualization traps the IO request and uses shared memory to route it faster to the hypervisor.
1. VMWare: install “VMWare tools” and select the “Paravirtual SCSI controller” for storage and the “vmxnet” driver for networking. VMWare claims PVSCSI offers 12% throughput improvement with 18% less CPU cost at 8KB block size (blogs.vmware.com).
2. KVM: use the newer in-kernel “vhost-net” for networking and the “virtio-scsi” or “virtio-blk-data-plane” drivers for storage.
3. Xen: a split driver is used for PVM guests, while HVM guests use QEMU or StubDom. HVM can also use PV drivers.
  • 25. Computer Measurement Group, India 25 Xen: PVM and HVM difference HVM is 3-6% better than PV guest for CPU+RAM intensive tests. For 1 VM with 4 vCPU and 6GB JVM heap size (Yu 2012).
  • 26. Computer Measurement Group, India 26 Xen: PVM and HVM difference Here, HVM was using the PV driver. It outperforms PV by 39% for a disk-intensive test running on SSD (Yu 2012).
  • 27. Computer Measurement Group, India 27 Contents 1. Hypervisor classification and overview 2. Overheads introduced 3. Benchmarking 4. Analytic models 5. Conclusion
  • 28. Computer Measurement Group, India 28 Virtual Machine Benchmarking
Two aspects:
1. Performance: How does a consolidated server compare to a non-virtualized OS running on bare-metal hardware?
2. Isolation: Does overloading one VM bring down the performance of other VMs running on the same node?
Impact of factors:
• Virtualization-sensitive instructions issued by the guest.
• VM exits and interrupt delivery.
• Architectural choices made within the hypervisor.
• Interference between VMs due to shared resources (visible and invisible).
  • 29. Computer Measurement Group, India 29 Testing isolation capability of a hypervisor 1. Run application on cloud with collocated VMs. 2. Then run in isolation with no collocated VMs to find the gaps. 3. Then run it in a controlled environment, gradually adding collocated VMs which create CPU, disk or network load, until you can simulate the behaviour seen in the cloud.
  • 30. Computer Measurement Group, India 30 CPU isolation on Xen (Barker and Shenoy study 2010) Find variation in completion times for a single thread which is running a “floating point operations” test periodically over a few hours. • Amazon EC2 small instance: Average completion time was 500 ms, but there was significant jitter. Some tests even took an entire second. • Local setup: Same test on local Xen server showed almost NO variation in completion time. • Conclusion: CPU scheduler on Xen does not provide perfect isolation. • Further tests done to narrow down the problem in the CPU scheduler.
  • 31. Computer Measurement Group, India 31 Xen’s credit scheduler for CPU (Barker and Shenoy study, 2010)
Xen has 2 main CPU schedulers – EDF (realtime) and Credit (default). Each VM runs on one or more virtual CPUs (vCPUs). The hypervisor maps vCPUs to physical CPUs (floating or pinned). For each VM, you can define (weight, cap):
1. Weight = proportion of CPU allocated.
2. Cap = max limit or ceiling on CPU time.
The credit scheduler periodically issues 30ms of credit to each vCPU. The allocation is decremented in 10ms intervals. When credits expire, the VM must wait until the next 30ms cycle. If a VM receives an interrupt, it gets a “Boost” which inserts it at the top of the vCPU queue, provided it has not exhausted its credits. The scheduler also has a work-conserving mode which transfers unused capacity to those VMs that need it. (A simplified sketch of the credit/boost accounting follows below.)
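This is my own toy model of the accounting just described, not Xen source code; the constants, class and function names are illustrative only:

```python
from collections import deque

TICK_MS = 10            # credits are debited in 10 ms accounting ticks
REFILL_PERIOD_MS = 30   # credits are handed out every 30 ms

class VCpu:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight       # relative share of CPU
        self.credits = 0
        self.boosted = False       # set when an interrupt wakes this vCPU

def refill(vcpus, credits_per_period=300):
    """Called every REFILL_PERIOD_MS: hand out credits in proportion to weight."""
    total_weight = sum(v.weight for v in vcpus)
    for v in vcpus:
        v.credits += credits_per_period * v.weight // total_weight

def pick_next(run_queue):
    """Boosted vCPUs with credits left run first, then anyone with credits."""
    boosted = [v for v in run_queue if v.boosted and v.credits > 0]
    with_credits = [v for v in run_queue if v.credits > 0]
    return (boosted or with_credits or list(run_queue))[0]

vcpus = [VCpu("latency-sensitive", 256), VCpu("cpu-hog", 256)]
run_queue = deque(vcpus)
refill(vcpus)

vcpus[0].boosted = True            # e.g. a network interrupt just arrived
nxt = pick_next(run_queue)
print("scheduled:", nxt.name)      # -> latency-sensitive
nxt.credits -= TICK_MS             # debited at the next accounting tick
```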
  • 32. Computer Measurement Group, India 32 CPU isolation on Xen (Barker and Shenoy study 2010) On local setup, tied two VMs to same physical core. Varied (weight, cap) of foreground VM while keeping background VM busy. 1. First test: Keep weights of both VMs equal. Observed jitter as seen on EC2 test. 2. Second test: Vary “weight” while keeping “cap” constant. Weight does not directly correspond to CPU time. Weight ratio of 1:1000 only translates into actual CPU ratio of 1:1.5 (33% more). 3. Third test: Vary “cap” on both VMs. CPU allocation of foreground VM was in proportion to the “cap” (even when background VM was idle). Conclusion: Strong isolation requires pinning VM to a core or setting the “cap”.
  • 33. Computer Measurement Group, India 33 Disk isolation on Xen
Test jitter for small random or large streaming IO to simulate game servers and media servers (Barker, 2010).
Amazon EC2: Found significant variation in completion time for reads and writes. Write bandwidth can vary up to 50% from the mean. Read bandwidth variation can be due to caching side-effects.
Isolated local setup: Completion times are consistent if there is no other VM on the Xen node.
Introduce a background VM: Run the same tests with another background VM doing heavy IO. Used the CFQ IO scheduler in Dom0 and NOOP in the guest VM.
Finding: Xen has no native disk isolation mechanism to identify per-VM disk flows. Throughput of the foreground VM dropped by 65-75% and latency increased by 70-80%. The degradation is bounded only by the round-robin policy of the Xen Dom0 driver.
  • 34. Computer Measurement Group, India 34 Network isolation on Xen (Barker 2010)
1. Measure “ping” time to next hop.
2. Measure sum of “ping” time to first three hops.
3. Measure time to transfer 32KB block between local and EC2 instance.
Pop quiz: what is the point in conducting these three tests?
  • 35. Computer Measurement Group, India 35 Network isolation on Xen (Barker 2010)
1. Measure “ping” time to the next hop.
2. Measure the sum of “ping” times to the first three hops.
3. Measure the time to transfer a 32KB block between a local machine and an EC2 instance.
Purpose:
a) The first measurement captures jitter of the network interface.
b) The second captures jitter in routers inside the Amazon data center.
c) The third captures Internet WAN transfer rate and jitter.
Results:
1. Saw no jitter in the first measurement.
2. Significant variation in the second: most pings took 5ms, but a significant number took an order of magnitude longer.
3. The third test showed regular variation (related to peak hours) typical of most WAN applications.
  • 36. Computer Measurement Group, India 36 Network isolation on Xen
Network latency tests on a Game server and a Media server on local Xen cloud (Barker 2010). Found that “tc” defines per-VM flows using IP address and provides good isolation. Two ways to allocate bandwidth using the Linux “tc” tool:
1. Dedicated: Divide bandwidth between competing VMs and prevent any VM from using more (i.e. weight + cap).
2. Shared: Divide bandwidth but allow VMs to draw more if required (i.e. weight + work-conserving).
In both game and media server tests, results are consistent. “Dedicated” mode produced lower latency while “shared” mode produced lower jitter.
Interference   Mean      Std deviation
Dedicated      23.6 ms       29.6
Shared         33.9 ms       16.9
  • 37. Computer Measurement Group, India 37 Long latency tails on EC2 (Xu et al, Bobtail, 2013)
Initial observations:
1. Median RTTs within EC2 are up to 0.6ms, but the 99.9th-percentile RTT on EC2 is 4 times longer than that seen in dedicated data centers. (In other words, a few packets see much longer delays than normal.)
2. Small instances are most susceptible to the problem.
3. Measured RTT between node pairs in the same AZ. The pattern is not symmetric; hence, the long tail is not caused by the location of the host on the network.
4. RTT between good and bad nodes in an AZ can differ by an order of magnitude.
5. One AZ which had newer CPU models did not return that many bad nodes.
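A small sketch of how such a tail is quantified from raw RTT samples; the numbers are synthetic and chosen only to mimic the shape described above, not the paper's data:

```python
import random
import statistics

random.seed(1)
# Synthetic RTTs (ms): mostly ~0.5 ms, plus a rare multi-millisecond tail.
rtts = [random.uniform(0.3, 0.7) for _ in range(10000)]
rtts += [random.uniform(5, 40) for _ in range(30)]   # the "long tail"
rtts.sort()

def percentile(sorted_data, p):
    """Simple nearest-rank percentile on an ascending-sorted list."""
    return sorted_data[min(len(sorted_data) - 1, int(p / 100 * len(sorted_data)))]

print("median     :", round(statistics.median(rtts), 2), "ms")
print("99th pct   :", round(percentile(rtts, 99), 2), "ms")
print("99.9th pct :", round(percentile(rtts, 99.9), 2), "ms")
```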
  • 38. Computer Measurement Group, India 38 Long latency tails on EC2 (Xu et al, Bobtail, 2013)
Experimental setup: On 4-core Xen server, dedicate 2 cores to Dom0. Remaining 2 cores are shared between 5 VMs with 40% share each. Vary the combination of latency-sensitive versus CPU-intensive VMs.
Latency-sensitive   CPU-intensive   RTT
        5                 0          1 ms
        4                 1          1 ms
        3                 2         <10 ms
        2                 3         ~30 ms
        1                 4         ~30 ms
Long-tail emerges when CPU-intensive VMs exceed number of cores.
  • 39. Computer Measurement Group, India 39 Long latency tails on EC2 (Xu et al, Bobtail, 2013) Hypothesis: Do all CPU-intensive VMs cause a problem? Test: Vary the CPU usage of CPU-intensive VM to find out. Long tail occurs when a competing VM does not use all its CPU allocation. The Boost mechanism for quickly scheduling latency-sensitive VMs fails against such VMs.
  • 40. Computer Measurement Group, India 40 Long latency tails on EC2 (Xu et al, Bobtail, 2013) 1. Latency-sensitive VMs cannot respond in a timely manner because they are starved of CPU by other VMs. 2. The VMs which starve them are those that are CPU-intensive but are not using 100% of their allocation within 30ms. 3. The BOOST mechanism in the Xen scheduler runs in FIFO manner and treats these two types of VMs equally instead of prioritizing the latency-sensitive VM. Authors designed “Bobtail” to select the EC2 instance on which to place a latency-sensitive VM. (see paper)
  • 41. Computer Measurement Group, India 41 EC2 Xen settings Tested for small instances: 1. EC2 uses Xen credit scheduler in non-work-conserving mode, which reduces efficiency but improves isolation. 2. It allows vCPUs to float across cores instead of pinning them to a core. 3. Disk and network scheduling is work-conserving, but only network scheduling has a max cap of 300 Mbps. (Varadarajan, 2012)
  • 42. Computer Measurement Group, India 42 Know your hypervisor: Xen
Xen CPU scheduling has two schedulers: Credit(2) and EDF.
• The Credit scheduler keeps a per-VM (weight, cap). It can be work-conserving or not. Work-conserving means “distribute any idle time to other running processes”; otherwise the total CPU quantum is capped.
• I/O-intensive VMs benefit from BOOST, which bumps a vCPU to the head of the queue when it receives an interrupt, provided it has not exhausted its credits.
Device scheduler:
• Disk and network IO goes through Domain 0, which schedules them in batches in round-robin fashion. To control network bandwidth, use Linux tools to define per-VM flows.
Best practice: Increase the CPU weight of Dom0 to be proportional to the amount of IO. Dedicate core(s) to it. Dedicate memory and prevent ballooning.
  • 43. Computer Measurement Group, India 43 Know your hypervisor - KVM
• QEMU originally started as a complete machine emulator [Bellard, 2005]. Code emulation is done by TCG (the tiny code generator), originally called “dyngen”. KVM was later added as another code accelerator into the QEMU framework.
• There is only one IO thread; the big QEMU lock is held in many IO functions.
• A Redhat “fio” benchmark in Sept 2012 reported 1.4M IOPS with 4 guests, but this was using passthrough IO (i.e. bypassing QEMU).
• Similar numbers were reported in Mar 2013, but this time using an experimental virtio data-plane feature which uses dedicated per-device threads for IO.
• Performance of an RTOS (as a guest OS) in KVM also suffers when it comes in contact with QEMU [Kiszka].
  • 44. Computer Measurement Group, India 44 Tile-based benchmarking to test consolidation
Traditional benchmarks are designed for individual servers. For virtualization, tiles of virtual machines that mimic actual consolidation are used.
1. SPECvirt sc2013 (supersedes SPECvirt sc2010)
2. VConsolidate (Intel): a tile consisting of SPECjbb, Sysbench, Webbench and a mail server
3. VMMark (VMWare): Exchange mail server, standby system, Apache server, database server.
SPECvirt sc2013:
• Runs for 3 hours on a single node to stress CPU, RAM, disk, and network.
• Incorporates four workloads: a web server, 4 Java app servers connected to a backend database server (to test multiple vCPUs on SMP), a mail server and a batch server.
• Keep adding additional sets of virtual machines (tiles) until overall throughput reaches a peak, while all VMs continue to meet the required QoS (spec.org/virt_sc2013). A sketch of this scaling loop follows below.
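The tile-scaling procedure reduces to a simple loop: keep adding tiles while aggregate throughput still improves and every VM meets QoS. A hedged pseudo-harness follows; the hook functions add_tile, measure and meets_qos are placeholders of my own, not part of any SPEC kit:

```python
def run_tile_benchmark(add_tile, measure, meets_qos):
    """Illustrative SPECvirt-style scaling loop (not the real harness):
    add tiles of VMs until aggregate throughput stops improving or any
    VM in any tile violates its QoS target."""
    tiles = [add_tile()]
    best_score = measure(tiles)
    while True:
        tiles.append(add_tile())
        score = measure(tiles)
        if not meets_qos(tiles) or score <= best_score:
            tiles.pop()            # the last tile pushed the host past its peak
            return len(tiles), best_score
        best_score = score

# Tiny simulated demo: throughput grows sub-linearly, QoS breaks past 5 tiles.
if __name__ == "__main__":
    count = {"n": 0}
    def add_tile():
        count["n"] += 1
        return f"tile-{count['n']}"
    measure = lambda tiles: sum(100 / (1 + 0.2 * i) for i in range(len(tiles)))
    meets_qos = lambda tiles: len(tiles) <= 5
    print(run_tile_benchmark(add_tile, measure, meets_qos))
```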
  • 45. Computer Measurement Group, India 45 SPECvirt sc2013
  • 46. Computer Measurement Group, India 46 Contents 1. Hypervisor classification and overview 2. Overheads introduced 3. Benchmarking 4. Analytic models 5. Conclusion
  • 47. Computer Measurement Group, India 47 Value of analytic model
Benchmarks have to:
• Produce repeatable results.
• Be easily comparable across architectures & platforms.
• Have predictive power (extrapolation).
There is a tension between realism and reproducibility. Macrobenchmarks simulate real-world conditions but are not comparable and lack extrapolation. Microbenchmarks determine the cost of primitive operations. An analytic model is needed to tie benchmarks to prospective application use.
Seltzer proposed three approaches:
1. Vector-based: combine a system vector with an application vector (see the sketch below).
2. Trace-based: generate a workload from a trace to capture the dynamic sequence of requests.
3. Hybrid: a combination of both.
(Mogul 1999; Seltzer et al 1999)
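A minimal sketch of the vector-based idea, with made-up operation names and numbers of my own: the system vector holds microbenchmarked costs of primitive operations, the application vector holds how often a traced application performs each one, and their dot product predicts run time.

```python
# System vector: microbenchmarked cost of each primitive (microseconds/op).
system_vector = {"syscall": 0.8, "page_fault": 2.5, "disk_read_4k": 120.0}

# Application vector: how many of each primitive a traced run performs.
app_vector = {"syscall": 2_000_000, "page_fault": 50_000, "disk_read_4k": 10_000}

# Dot product of the two vectors gives the predicted run time.
predicted_us = sum(app_vector[op] * system_vector[op] for op in app_vector)
print(f"predicted run time: {predicted_us / 1e6:.2f} s (approx.)")
```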
  • 48. Computer Measurement Group, India 48 Analytic models for virtualization 1.Layered queuing network (Menasce; Benevenuto 2006). 2.Factor graphs to determine per-VM utilization (Lu 2011) 3.Trace-based approach (Wood, et al) 4.VMBench (Moller @ Karlsruhe) 5.Equations for cache and core interference (Apparao, et al). 6.Machine learning
  • 49. Computer Measurement Group, India 49 Layered Queueing network (for Xen)
(Diagram: a layered queueing network for Xen; incoming requests visit the VM, Domain 0 and the disk before completing.)
  • 50. Computer Measurement Group, India 50 Layered Queueing network (for Xen)
Total response time R = R(VM) + R(dom0) + R(disk)
For an M/M/1 queue with feedback, the response time of one resource is R = D / [1 - U]
U = Utilization = λ * D = Arrival rate * Service demand (U lies between 0 and 1).
D = Service demand = total time taken by one request.
D(resource by VM) = D(bare) * Slowdown(resource) / P(VM)
D(resource by Dom0) = D(VM) * Cost(Dom0) / P(IDD)
where P = speedup of the VM's hardware compared to bare metal.
Cost(Dom0) = BusyTime(Dom0) / BusyTime(VM)
Slowdown(resource) = BusyTime(virtual) / BusyTime(bare)
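Plugging these formulas into code shows how the pieces fit together; a minimal sketch with hypothetical service demands and arrival rate (not measured values):

```python
def response_time(demand, arrival_rate):
    """Open M/M/1-style station: R = D / (1 - U), with U = lambda * D."""
    utilization = arrival_rate * demand
    if utilization >= 1.0:
        raise ValueError("station saturated (U >= 1)")
    return demand / (1.0 - utilization)

# Hypothetical per-request service demands (seconds) along one Xen request path.
demands = {"vm_cpu": 0.004, "dom0_cpu": 0.002, "disk": 0.010}
arrival_rate = 50.0  # requests per second

# Total response time is the sum over the VM, Dom0 and disk stations.
R = sum(response_time(d, arrival_rate) for d in demands.values())
print(f"total response time: {R * 1000:.1f} ms")
```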
  • 51. Computer Measurement Group, India 51 Factor graphs Resource utilization at guest VMs is known and aggregate utilization at hypervisor is known. How to determine the function which defines per-VM utilization of each resource ? This can be modeled as a “source separation problem” studied in digital signal processing. Measurements inside VM and on hypervisor can differ: 1.Disk IO inside VM can be higher than on the hypervisor due to merging of IOs in the hypervisor. 2.CPU utilization inside a VM can be half of that at the hypervisor because Xen issues per-VM IO through Dom0 (seen via “xentop”).
  • 52. Computer Measurement Group, India 52 Factor graphs
(Diagram: a factor graph connecting the host's CPU/Disk/Net/Mem observations through factor nodes h1-h4, f1-f4 and g1-g4 to the per-resource utilizations of VM1 and VM2.)
  • 53. Computer Measurement Group, India 53 Trace-based approach How to model the migration from bare-metal to virtual environment? 1.Create platform profiles to measure cost of primitive operations: Run same microbenchmarks on native (bare-metal) and virtualized platform. 2.Relate native and virtualized : Formulate set of equations which relate native metrics to virtualized. 3.Capture trace of application which is to be migrated: Determine how many primitive operations it uses and plug it in. (*Actual process employs Statistical methods and is more complicated)
  • 54. Computer Measurement Group, India 54 Trace-based approach
How to model the migration from bare-metal to a virtual environment?
Step 1: Create platform profiles. Run carefully chosen CPU-, disk- and network-intensive microbenchmarks on both the bare-metal and the virtual environment. Measure key metrics for each benchmark:
a) CPU – percentage of time spent in user, kernel and iowait
b) Network – read and write packets/sec and bytes/sec
c) Disk – read and write blocks/sec and bytes/sec
Example profile:
Metric      CPU user   CPU sys   CPU iowait
BareMetal      23         13          3
Virtual        32         20          8
  • 55. Computer Measurement Group, India 55 Trace-based approach How to model the migration from bare-metal to virtual environment? Step 2: Relate native and virtualized : Formulate set of equations which relate native metrics to virtualized. e.g. Util(cpu on VM) = c0 + c1*M1 + c2*M2 + ... cn*Mn where Mk=Metric gathered from native microbenchmark Solve for the model coefficients using Least Squares Regression. The coefficients c_k capture relation between native and virtualized platform. e.g. c0=4, c1=19, c2=23, etc...
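Step 2 is an ordinary least-squares fit; a minimal sketch with synthetic profile data (the metric names and numbers are illustrative, not taken from the cited work):

```python
import numpy as np

# Rows = microbenchmark runs; columns = native metrics M1..M3
# (e.g. CPU user %, packets/sec, disk blocks/sec).
native = np.array([
    [23.0, 1200.0,  300.0],
    [40.0,  200.0, 2500.0],
    [10.0, 5000.0,  100.0],
    [55.0,  800.0,  900.0],
])
vm_cpu = np.array([32.0, 51.0, 21.0, 66.0])   # observed CPU util on the VM

X = np.hstack([np.ones((len(native), 1)), native])   # add the c0 intercept
coeffs, *_ = np.linalg.lstsq(X, vm_cpu, rcond=None)  # least-squares solve
print("model coefficients c0..c3:", np.round(coeffs, 3))

# Applying the fitted model to a traced application's native metrics (step 3).
app_trace = np.array([1.0, 30.0, 900.0, 700.0])
print("predicted VM CPU util:", round(float(app_trace @ coeffs), 1), "%")
```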
  • 56. Computer Measurement Group, India 56 Trace-based approach
How to model the migration from bare-metal to a virtual environment?
Step 3: Capture a trace of the application which is to be migrated, find its metrics Mk, plug them into the above equation and evaluate it. Voila!
Util(cpu on VM) = 4 + 19 * (M1) + 23 * (M2) + ...
Recap:
1. Create platform profiles for native and virtual.
2. Find coefficients which relate native & virtual.
3. Capture the application trace and apply the equation.
Their findings:
1. A single model is not applicable for both Intel and AMD since CPU utilization varies.
2. A feedback loop within the application can distort the performance prediction.
  • 57. Computer Measurement Group, India 57 Conclusion
"All problems in CS can be solved by another level of indirection" - David Wheeler (1927-2004, recipient of the first PhD in Computer Science)
... and the performance problems introduced by indirection require caching, interlayer cooperation and hardware assists (e.g. the TLB cache, EPT, paravirtualization).
"Virtual machines have finally arrived. Dismissed for a number of years as merely academic curiosities, they are now seen as cost-effective techniques for organizing computer systems resources to provide extraordinary system flexibility and support for certain unique applications." [Goldberg, Survey of Virtual Machine Research, 1974]
  • 58. Computer Measurement Group, India 58 References
1. Ganesan et al. Empirical study of performance benefits of hardware-assisted virtualization, 2013.
2. Boutcher and Chandra. Does virtualization make disk scheduling passé?
3. Le et al. Understanding Performance Implications of Nested File Systems in a Virtualized Environment.
4. Jannen. Virtualize storage, not disks.
5. Yu. Xen PV Performance Status and Optimization Opportunities.
6. Barker and Shenoy. Empirical evaluation of latency-sensitive application performance in the cloud.
7. Xu et al. Bobtail: Avoiding long tails in the cloud.
8. Varadarajan et al. Resource-freeing attacks.
9. Bellard. QEMU, a fast and portable dynamic translator.
10. Kiszka. Using KVM as a real-time hypervisor.
11. Mogul. Brittle metrics in operating systems research.
12. Seltzer et al. The Case for Application-Specific Benchmarking.
  • 59. Computer Measurement Group, India 59 References
1. Menasce. Virtualization: Concepts, Applications, and Performance Modeling.
2. Benevenuto et al. Performance Models for Virtualized Applications.
3. Lu et al. Untangling Mixed Information to Calibrate Resource Utilization in Virtual Machines, 2011.
4. Wood. Profiling and Modeling Resource Usage of Virtualized Applications.
  • 60. Computer Measurement Group, India 60 BACKUP SLIDES
  • 61. Computer Measurement Group, India 61 Classification • OS-level virtualization : Does not run any intermediary hypervisor. Modify the OS to support namespaces for networking, processes and filesystem. • Paravirtualization : Guest OS is modified and is aware that it is running inside a hypervisor. • Full virtualization : Guest OS runs unmodified. Hypervisor emulates hardware devices.
  • 62. Computer Measurement Group, India 62 NUMA/SMP
• If you run a monster server VM with many vCPUs, you may have to worry about NUMA scaling. Depending on the NUMA ratio, there is a 30-40% higher cost (latency and throughput) in accessing remote memory.
• The hypervisor must be able to:
1. manually pin a vCPU to a core.
2. export the NUMA topology to the guest OS.
3. do automatic NUMA-aware scheduling of all guest VMs.
• VMWare introduced vNUMA in vSphere 5.
• On Xen, pin Dom0 to a core. In case of NUMA, put the frontend and backend drivers on the same core.
• KVM exports the NUMA topology to the VM but is still lagging on automatic scheduling.
  • 63. Computer Measurement Group, India 63 NUMA/SMP • Depending on NUMA ratio, 30-40% higher cost (latency and throughput) in accessing remote memory • Hypervisor must support ability to pin a vCPU to a core, and also allocate memory from specific NUMA node. • Hypervisor must export NUMA topology (ACPI tables) so guest OS can do its job. • Hypervisor should do automatic NUMA-aware scheduling of all guest VMs. • VMWare introduced vNUMA in vSphere 5. • On Xen, pin Dom0 to a core. In case of NUMA, put frontend and backend drivers on the same core. • KVM exports NUMA topology to VM but it is still lagging on automatic scheduling. • Cross-call overhead : On a SMP machine, when a semaphore is released by one thread, it issues a cross-call or inter-processor interrupt if the waiting threads are sleeping on another core. On a VM, the cross-call becomes a costly privileged op (Akther). Interrupt delivery may also trigger a cross-call.
  • 64. Computer Measurement Group, India 64 Nested CPU scheduling
• Each guest OS runs on one or more virtual CPUs. The hypervisor schedules virtual CPUs on its run queue, and then each guest OS decides which task to run on that virtual CPU.
• This introduces the lock preemption problem: a process in the guest OS may get scheduled out by the hypervisor while holding a spin lock, delaying other processes waiting for that spin lock.
• The guest OS would not schedule out a process holding a spin lock, but the hypervisor is unaware of processes within the guest OS.
• The solution is some form of co-scheduling or “gang scheduling”. VMWare actively seeks to reduce “skew” between multiple vCPUs of the same VM.
  • 65. Computer Measurement Group, India 65 Nested Page tables
• A page fault in the VM may occur because the hypervisor has not allocated RAM to the VM.
• Guest page table: Guest Virtual address -> Hypervisor Virtual.
• Hypervisor page table: Hypervisor Virtual -> Actual RAM.
• Earlier, hypervisors would maintain a “shadow page table” for each guest OS. This function has now been moved into hardware by both Intel and AMD; it is called “Nested page tables”.
• Nested page tables require a costly two-dimensional page walk: for each step that is resolved in the guest table, you have to look up the host table (see the count below).
• The overhead can be alleviated by using “huge pages” and per-VM tags in the TLB cache.
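A back-of-the-envelope count shows why the two-dimensional walk hurts; this is the standard arithmetic for 4-level x86-64 paging, written out as a sketch rather than vendor-measured data:

```python
guest_levels = 4   # guest page-table levels (x86-64)
host_levels  = 4   # nested/EPT page-table levels

# Each of the guest's page-table pointers (plus the final data address) is a
# guest-physical address that must itself be walked through the host tables,
# in addition to reading the guest table entries themselves.
nested_refs = (guest_levels + 1) * (host_levels + 1) - 1
print("native TLB-miss walk :", guest_levels, "memory references")
print("nested TLB-miss walk :", nested_refs, "memory references")   # 24
```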
  • 66. Computer Measurement Group, India 66 Memory overheads & solutions
Balloon driver: take back memory from the guest.
-- VMWare
-- KVM (see virtio_balloon.c in linux_kernel/drivers/virtio)
-- HyperV calls it Dynamic Memory
-- Xen Transcendent Memory
Memory deduplication:
-- Present in System/370 (Smith & Nair)
-- VMWare calls it Transparent Page Sharing (patented)
-- KVM uses KSM (which calls Linux madvise())
-- Xen uses KSM in HVM mode only.
  • 67. Computer Measurement Group, India 67 Quantifying isolation
• Deshane et al (2007) defined BenchVM to test isolation.
• Run normal VMs alongside an overloaded VM and test whether the normal VMs remain responsive.
• On the overloaded VM, you run various stress tests:
1. CPU stress test
2. Memory stress test: calloc and touch memory without free() (a toy version follows below)
3. Network: threaded UDP send and receive
4. Disk: IOzone
5. Fork bomb: test fast process creation and scheduling.
Their conclusion: Full virtualization provides better isolation than container-based virtualization. Their other results may be outdated due to advances in virtualization.
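As an illustration of the memory stress test in item 2, here is a toy Python version; it is my own sketch, not BenchVM code, and should only be run inside a disposable VM:

```python
import time

CHUNK_MB = 64
chunks = []    # held forever on purpose, so nothing is ever freed

def memory_stress(total_mb=512):
    """Allocate and touch memory in CHUNK_MB chunks without freeing it,
    forcing the guest (and eventually the host) to back real pages."""
    allocated = 0
    while allocated < total_mb:
        buf = bytearray(CHUNK_MB * 1024 * 1024)
        for i in range(0, len(buf), 4096):   # touch one byte per 4 KB page
            buf[i] = 1
        chunks.append(buf)
        allocated += CHUNK_MB
        time.sleep(0.1)                      # pace the allocations a little
    return allocated

if __name__ == "__main__":
    print("allocated and touched", memory_stress(), "MB")
```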
  • 68. Computer Measurement Group, India 68 VM exits are costly • Interrupt processing causes context switches between VM and hypervisor. • KVM EOI optimization : guest IDT (interrupt descriptor table) is shadowed. • VMWare detects cluster of instructions that can cause guest exits. • Use combination of polling and interrupt (NAPI)
  • 69. Computer Measurement Group, India 69 mClock • Disk capacity varies dynamically and cannot be statically allocated like CPU or RAM. • A proportional-sharing algorithm is needed to reserve disk capacity. • Gulati et al propose a dynamic algorithm which interleaves two schedulers and uses three tags with every IO request.
  • 70. Computer Measurement Group, India 70 Hadoop benchmark • VMWare : • HyperV (conducted on HDInsight – Microsoft's version of Hadoop) : • KVM: • Xen: (Netflix runs map-reduce on AWS)
  • 71. Computer Measurement Group, India 71 HPC/Scientific benchmark • VMWare paper: SPEC MPI and SPEC OMP. • Xen: Jackson et al (2010) ran NERSC benchmarks on Amazon EC2: six times slower than a Linux cluster and 20 times slower than a modern HPC system. The EC2 interconnect severely limits performance. They could not use processor-specific compiler options due to the heterogeneous mix of CPUs across nodes. • In Jun 2010, Amazon launched “Cluster Compute Nodes”, which are basically nodes running Xen in HVM mode connected via 10G Ethernet (no InfiniBand yet). • KVM and OpenVZ: Regola (2010) ran NPB on these nodes.
  • 72. Computer Measurement Group, India 72 Realtime benchmark • In order to minimize jitter and limit the worst-case latency, a realtime system must provide mechanisms for resource reservation, process preemption and prevention of priority inversion. • Soft realtime (VoIP) vs hard realtime: soft means 20ms jitter between packets is acceptable. • RT-Xen. • Kiszka: KVM – QEMU driver lock.
  • 73. Computer Measurement Group, India 73 Layered Queueing network (Xen)
Total response time R = R(vcpu) + R(dom0_cpu) + R(disk)
Response time = Demand / [1 - Utilization]:
R(vcpu) = D(vcpu) / [1 - U(vcpu)]
R(dom0_cpu) = D(dom0_cpu) / [1 - U(dom0_cpu)]
R(disk) = D(disk) / [1 - U(disk)]
Util = λ * D = Arrival rate * Demand
D(vm_cpu) = D(isol_cpu) * S(cpu) / P(vm), where S = slowdown, P = speedup
D(dom0_cpu) = D(vm_cpu) * Cost(dom0_vm) / P(dom0)
Cost(dom0_vm) = B(dom0_cpu) / B(vm_cpu), where B = busy time
Slowdown(cpu) = B(vm_cpu) / B(bare_cpu)
Slowdown(disk) = B(vm_disk) / B(bare_disk)