Computer Measurement Group, India 1
Performance overheads of
Virtualization
Sandeep Joshi
Principal SDE, Storage Startup
26 April 2014
www.cmgindia.org
Computer Measurement Group, India 2
Contents
1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion
Computer Measurement Group, India 3
Not covered in this talk
• Mobile virtualization
- Motorola Evoke QA4 was the first phone to run two OSes.
- new Hyp mode in the ARM Cortex-A15 processor.
• Nested virtualization
- running one hypervisor on top of another.
• Network virtualization
- SDN, OpenFlow.
• Containers (aka OS-level virtualization)
- Solaris Zones, LXC, OpenVZ.
• Older hypervisors which did binary translation.
Computer Measurement Group, India 4
Classification
• Image : blog.technet.com/b/chenley
Computer Measurement Group, India 5
VMWare ESX
• Image : blog.vmware.com
Computer Measurement Group, India 6
VMWare ESX
• Each virtual machine has multiple worlds (threads): some correspond to guest CPUs, others are dedicated to device processing (run "esxtop" on the host).
• Monolithic kernel. Hardware support limited to
those drivers installed in the hypervisor.
Computer Measurement Group, India 7
KVM
Used in Google Cloud, Eucalyptus, and most OpenStack clouds.
• Image : Redhat Summit, June 2013
Computer Measurement Group, India 8
KVM
Linux is the hypervisor. Leverages Linux features
(device drivers, NAPI, CPU and IO schedulers,
cgroups, madvise, NUMA, etc.)
• Each guest OS sits inside a Linux process running QEMU; each virtual CPU is a thread inside this process.
• Uses QEMU for device virtualization. QEMU in one
guest is not aware of QEMU running in another
guest.
Computer Measurement Group, India 9
Microsoft HyperV
Used in Microsoft Azure cloud
Computer Measurement Group, India 10
Xen
When you use Amazon or Rackspace, you are using Xen.
Computer Measurement Group, India 11
Contents
1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion
Computer Measurement Group, India 12
Overheads introduced
1. CPU : nested scheduling triggers the lock preemption problem (use gang scheduling), VM exits are costly.
2. Memory : Nested page table, NUMA topology.
3. Disk : Nested filesystems, page cache, IO
schedulers, interrupt delivery, DMA.
4. Network : DMA, interrupt delivery.
The next few slides cover hardware assists, nested filesystems, nested IO schedulers and the benefits of IO paravirtualization.
Computer Measurement Group, India 13
Hardware assists
Hardware assists have considerably eased many
virtualization overheads:
1. CPU : Binary translation was replaced by new processor modes: root and non-root (guest) mode, each with 4 rings.
2. MMU : Shadow table in software replaced by EPT/Nested page
table.
3. IOMMU : during DMA, it translates Guest Physical Addresses used by the device into Machine Physical Addresses.
4. IO-APIC : interrupt delivery is done directly to the guest using
IDT.
5. SR-IOV : virtual functions implemented in the NIC (SR-IOV is also
defined for storage adapters but not yet implemented)
Benefits: Hardware assistance reduces CPU cache contention as well as the "Service Demand" on the VM (Service Demand = CPU Utilization / Throughput). Higher throughput is obtained at lower CPU utilization.
Computer Measurement Group, India 14
Hardware assists
IOMMU (image: intel.com) and APIC (image: virtualizationdeepdive.wordpress.com)
Computer Measurement Group, India 15
How much does hardware assist help?
Ganesan et al (2013) ran microbenchmarks on a 2-core Intel Xeon, comparing native vs Xen with and without hardware assistance (SR-IOV, IOMMU, Intel VT).
Finding: Network throughput is near-native with SR-IOV but CPU utilization still remains high (possibly because interrupt processing still triggers numerous guest VM-hypervisor transitions?).
Max throughput with iPerf:

            Throughput (Mbps)   Dom0 CPU (%)   VM CPU (%)
Native      940                 NA             16.68
SR-IOV      940                 20             65 (high)
No SR-IOV   192                 82             39
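As a sanity check, here is a minimal Python sketch (my addition) that applies the slide's definition Service Demand = CPU Utilization / Throughput to the table above, treating the CPU columns as utilization figures; the "per Gbps" normalization is my own choice of unit.

```python
# Hedged sketch: service demand derived from the iPerf table above.
results = {
    # name: (throughput in Mbps, VM CPU utilization)
    "Native":    (940, 16.68),
    "SR-IOV":    (940, 65.0),
    "No SR-IOV": (192, 39.0),
}

for name, (mbps, cpu) in results.items():
    demand = cpu / (mbps / 1000.0)   # VM CPU consumed per Gbps of traffic
    print(f"{name:9s}  {mbps:4d} Mbps  VM CPU {cpu:5.2f}  demand {demand:6.2f} per Gbps")
```

Although SR-IOV matches native throughput, its per-Gbps service demand on the VM is roughly four times the native figure, which is exactly the "CPU utilization still remains high" finding above.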
Computer Measurement Group, India 16
How much does hardware assist help?
Further results from Ganesan et al (2013). Disk throughput tested using RUBiS (disk+net intensive) and BLAST (disk intensive).
Finding: Disk IO is not yet benefiting from hardware assists. Most of
the RUBiS improvement comes from SR-IOV rather than IOMMU.
Similar finding with BLAST.
Computer Measurement Group, India 17
Nested IO scheduling
VM and hypervisor are both running IO scheduling algorithms (and
so is the disk drive). IO requests are rearranged and merged by the
IO scheduler (scheduler can be set in Xen Dom0 or KVM host but not
in ESX).
On Linux, there are 4 schedulers - CFQ, NOOP, Anticipatory,
Deadline. Each block device can have a different scheduler.
Image: dannykim.me
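On the Linux side (Xen Dom0 or the KVM host) the scheduler is exposed per block device through sysfs. A minimal Python sketch of inspecting and changing it is shown below; the device name "sda" is only an example, and writing requires root.

```python
# Hedged sketch: read and set the Linux IO scheduler for one block device via
# the standard /sys/block/<dev>/queue/scheduler interface.
from pathlib import Path

def current_scheduler(dev="sda"):
    text = Path(f"/sys/block/{dev}/queue/scheduler").read_text().strip()
    # The file looks like "noop deadline [cfq]"; the bracketed entry is active.
    active = text.split("[")[1].split("]")[0]
    available = text.replace("[", "").replace("]", "").split()
    return active, available

def set_scheduler(dev="sda", name="noop"):
    # Takes effect immediately and only for this device.
    Path(f"/sys/block/{dev}/queue/scheduler").write_text(name)

print(current_scheduler("sda"))
```

Sweeping this setting in the guest and in Dom0/host is how the scheduler combinations on the next slides can be reproduced.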
Computer Measurement Group, India 18
Nested IO scheduling
Results of Boutcher and Chandra
• Best combination of schedulers depends on workload.
• Tradeoff between fairness (across VMs) and throughput.
• Scheduler closest to workload has most impact.
• NOOP has the best throughput in the hypervisor but is least fair by Jain's fairness measure (a minimal sketch of the index follows this list).
• In guest VM, CFQ is 17% better than Anticipatory for FFSB
benchmark but for Postmark, Anticipatory is 18% better than CFQ.
• On Hypervisor, NOOP is 60% better than CFQ for FFSB and 72%
better for Postmark.
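For reference, Jain's fairness index used above is (Σx)² / (n · Σx²); it is 1.0 when all flows get equal throughput and approaches 1/n when one flow dominates. A minimal Python sketch with made-up numbers (not Boutcher's data):

```python
# Jain's fairness index over a set of per-VM throughputs.
def jain_fairness(throughputs):
    n = len(throughputs)
    total = sum(throughputs)
    return (total * total) / (n * sum(x * x for x in throughputs))

print(jain_fairness([100, 100, 100, 100]))  # 1.0   -> perfectly fair
print(jain_fairness([370, 10, 10, 10]))     # ~0.29 -> high total, very unfair
```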
Computer Measurement Group, India 19
Nested IO scheduling
Boutcher's numbers for FFSB on Xen 3.2, 128 threads, each VM
allocated contiguous 120GB space on 500GB SATA drive.
X-axis is scheduler in VM; Y-axis is scheduler in hypervisor.
Numbers in table are approx because they were converted from a
bar graph (Transactions per sec. on Xen).
Hypervisor \ VM   Anticipatory   CFQ   Deadline   NOOP
Anticipatory      200            260   175        240
CFQ               260            240   155        160
Deadline          315            360   250        255
NOOP              320            370   245        255
Computer Measurement Group, India 20
Sequential IO becomes random
• Sequential IO issued from multiple VMs to the same block device becomes random when aggregated in the hypervisor (see the toy illustration after this list).
• Set longer disk queue length in hypervisor to enable better
aggregation. On VMWare, you can set
disk.SchedReqNumOutstanding=NNN.
• Use PCI flash or SSDs to absorb random writes.
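A toy Python illustration of the first point (block ranges are arbitrary): two perfectly sequential streams become a seek-heavy pattern once the hypervisor merges them round-robin.

```python
# Two VMs each issue sequential LBAs; a round-robin merge in the hypervisor
# yields a stream where almost every transition is a seek.
vm1 = list(range(1000, 1008))       # sequential LBAs from VM1
vm2 = list(range(500000, 500008))   # sequential LBAs from VM2

merged = [lba for pair in zip(vm1, vm2) for lba in pair]

jumps = sum(1 for a, b in zip(merged, merged[1:]) if b != a + 1)
print(merged)
print(f"non-contiguous transitions: {jumps} of {len(merged) - 1}")
```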
Computer Measurement Group, India 21
Nested filesystems and page cache
A filesystem in the VM can map to a flat file on the underlying filesystem, a raw block device (local or iSCSI), or NFS. A flat file is preferred for ease of management; it can be in raw, qcow2, vmdk or vhd format.
[Diagram: guest filesystems on /dev/sda (VM1) and /dev/sdb (VM2) map to files under /vmfs on the hypervisor, which in turn sit on /dev/sdc.]
• Flat files introduce another performance overhead (next slide).
• KVM has four caching modes (none, writeback, writethrough, unsafe) which can disable/enable either cache.
• In Xen Dom0, the page cache comes into play when the file-
storage option is in use.
Computer Measurement Group, India 22
Nested filesystems and page cache
Le et al (2012) ran FileBench and “fio” on KVM. Tried 42 different
combinations of guest and host file systems. Found worst-case 67%
degradation.
Their conclusion:
• Read-dominated workloads benefit from readahead.
• Avoid journaled filesystems for write-dominated workloads.
• Latency goes up 10-30% in best case.
• The host FS should behave like a dumb disk or VirtFS; it should not make placement decisions on top of what the guest FS has already decided.
• Jannen (2013) found that overheads are worse for filesystems on SSD. On HDD, the overheads are masked by rotational latency.
Computer Measurement Group, India 23
Nested filesystems and page cache
Le et al (2012) – random file-write test using "fio".
Y-axis is the host file system; X-axis is the guest file system.
Throughput in MB/sec
Host \ Guest   ext2   ext3   reiser   xfs   jfs
ext2           60     55     65       80    95
ext3           60     55     65       80    75
ext4           60     55     55       70    95
reiser         60     55     65       80    100
xfs            60     40     60       70    65
jfs            60     50     65       80    105
Computer Measurement Group, India 24
Nested IO stacks : use paravirtualization
The hypervisor exposes a software-emulated virtual NIC or storage HBA to the VM. An IO request issued by the VM travels to the bottom of the guest stack before it is repackaged and reissued by the hypervisor.
Paravirtualization traps the IO request and uses shared memory to
route it faster to the hypervisor.
1. VMWare: Install “VMWare tools” and select “Paravirtual SCSI
controller” for storage and “vmxnet” driver for networking.
VMWare claims PVSCSI offers 12% throughput improvement with
18% less CPU cost with 8Kb block size (blogs.vmware.com)
2. KVM: use newer in-kernel “vhost-net” for networking and “virtio-
scsi” or “virtio-blk-data-plane” drivers for storage.
3. Xen: Split-driver used for PVM guests while HVM guests use
QEMU or StubDom. HVM can also use PV drivers.
Computer Measurement Group, India 25
Xen: PVM and HVM difference
HVM is 3-6% better than a PV guest for CPU+RAM intensive tests, measured on 1 VM with 4 vCPUs and a 6GB JVM heap (Yu 2012).
Computer Measurement Group, India 26
Xen: PVM and HVM difference
Here, HVM was using PV drivers; it outperforms PV by 39% for a disk-intensive test running on SSD (Yu 2012).
Computer Measurement Group, India 27
Contents
1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion
Computer Measurement Group, India 28
Virtual Machine Benchmarking
Two aspects
1. Performance : How does a consolidated server compare to a non-virtualized OS running on bare-metal hardware?
2. Isolation : Does overloading one VM bring down the performance of other VMs running on the same node?
Impact of factors
• Virtualization-sensitive instructions issued by the guest.
• VM exits and interrupt delivery.
• Architectural choices made within the hypervisor.
• Interference between VMs due to shared resources (visible and invisible).
Computer Measurement Group, India 29
Testing isolation capability of a hypervisor
1. Run application on cloud with collocated VMs.
2. Then run in isolation with no collocated VMs to find the gaps.
3. Then run it in a controlled environment, gradually adding
collocated VMs which create CPU, disk or network load, until you
can simulate the behaviour seen in the cloud.
Computer Measurement Group, India 30
CPU isolation on Xen
(Barker and Shenoy study 2010)
Find variation in completion times for a single thread which is
running a “floating point operations” test periodically over a few
hours.
• Amazon EC2 small instance: Average completion time was 500
ms, but there was significant jitter. Some tests even took an
entire second.
• Local setup: Same test on local Xen server showed almost NO
variation in completion time.
• Conclusion: CPU scheduler on Xen does not provide perfect
isolation.
• Further tests done to narrow down the problem in the CPU
scheduler.
Computer Measurement Group, India 31
Xen’s credit scheduler for CPU
(Barker and Shenoy study 2010)
Xen has 2 main CPU schedulers – EDF (realtime) and Credit (default).
Each VM runs on one or more virtual CPUs(vCPU). Hypervisor maps
vCPU to physical CPUs (floating or pinned).
For each VM, you can define (weight, cap).
1. Weight = proportion of CPU allocated.
2. Cap = max limit or ceiling on CPU time.
The credit scheduler periodically issues 30 ms of credit to each vCPU. The allocation is decremented in 10 ms accounting intervals. When its credits expire, the VM must wait until the next 30 ms cycle. If a VM receives an interrupt, it gets a "Boost" which inserts it at the top of the vCPU queue, provided it has not exhausted its credits.
Scheduler also has a work-conserving mode which transfers unused
capacity to those VMs that need it.
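A toy Python model (not Xen's actual implementation) of the mechanism just described: 30 ms of credit per vCPU, 10 ms accounting ticks, and a boost that only helps a vCPU that still has credit left.

```python
from collections import deque

CREDIT_MS, TICK_MS = 30, 10

class VCpu:
    def __init__(self, name):
        self.name, self.credit = name, CREDIT_MS

def run_period(vcpus, interrupts=()):
    """Simulate one 30 ms allocation period; 'interrupts' names boosted vCPUs."""
    for v in vcpus:
        v.credit = CREDIT_MS               # fresh credit for the new period
    queue = deque(vcpus)
    for name in interrupts:                # Boost: move to the head of the queue,
        for v in list(queue):              # but only if credit remains.
            if v.name == name and v.credit > 0:
                queue.remove(v)
                queue.appendleft(v)
    schedule = []
    while queue:
        v = queue.popleft()
        v.credit -= TICK_MS                # debit one 10 ms accounting tick
        schedule.append(v.name)
        if v.credit > 0:
            queue.append(v)                # still has credit: run again later
    return schedule

vcpus = [VCpu("vm1"), VCpu("vm2"), VCpu("vm3")]
print(run_period(vcpus, interrupts=["vm3"]))   # vm3 runs first thanks to Boost
```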
Computer Measurement Group, India 32
CPU isolation on Xen
(Barker and Shenoy study 2010)
On local setup, tied two VMs to same physical core. Varied (weight,
cap) of foreground VM while keeping background VM busy.
1. First test: Keep weights of both VMs equal. Observed jitter as
seen on EC2 test.
2. Second test: Vary “weight” while keeping “cap” constant.
Weight does not directly correspond to CPU time. Weight ratio of
1:1000 only translates into actual CPU ratio of 1:1.5 (33% more).
3. Third test: Vary “cap” on both VMs. CPU allocation of
foreground VM was in proportion to the “cap” (even when
background VM was idle).
Conclusion: Strong isolation requires pinning VM to a core or
setting the “cap”.
Computer Measurement Group, India 33
Disk isolation on Xen
Test jitter for small random or large streaming IO to simulate game
servers and media servers.(Barker, 2010)
Amazon EC2 : Found significant variation in completion time for reads and writes. Write bandwidth can vary up to 50% from the mean.
Read bandwidth variation can be due to caching side-effects.
Isolated local setup: Completion times are consistent if there is no
other VM on the Xen node.
Introduce a background VM: Run same tests with another
background VM doing heavy IO. Used CFQ IO scheduler in Dom0
and NOOP in guest VM.
Finding: Xen has no native disk isolation mechanism to identify per-VM disk flows. Throughput of the foreground VM dropped by 65-75% and latency increased by 70-80%. The degradation is bounded only by the round-robin policy of the Xen Dom0 driver.
Computer Measurement Group, India 34
Network isolation on Xen
(Barker 2010)
1.Measure “ping” time to next hop
2.Measure sum of “ping” time to first three hops.
3.Measure time to transfer 32KB block between local and EC2
instance.
Pop quiz: what is the point in conducting these three tests ?
Computer Measurement Group, India 35
Network isolation on Xen
(Barker 2010)
1.Measure “ping” time to next hop
2.Measure sum of “ping” time to first three hops.
3.Measure time to transfer 32KB block between local and EC2
instance.
Purpose:
a)First measurement captures jitter of network interface
b)Second captures jitter in routers inside Amazon data center.
c)Third captures Internet WAN transfer rate and jitter.
1. Saw no jitter in first measurement.
2. Significant variation in the second: most took 5 ms, but a significant number took an order of magnitude longer.
3. Third test showed regular variation (related to peak hours) typical
of most WAN applications.
Computer Measurement Group, India 36
Network isolation on Xen
Network latency tests on a Game server and a Media server on local
Xen cloud. (Barker 2010)
Found that “tc” defines per-VM flows using IP address and provides
good isolation. Two ways to allocate bandwidth using Linux “tc” tool.
1.Dedicated : Divide bandwidth between competing VMs and
prevent any VM from using more (i.e. Weight + cap).
2.Shared : Divide bandwidth but allow VMs to draw more if required
(i.e. Weight + work-conserving).
In both game and media server tests, results are consistent.
“Dedicated” mode produced lower latency while “shared”
mode produced lower jitter.
Interference   Mean      Std deviation
Dedicated      23.6 ms   29.6
Shared         33.9 ms   16.9
Computer Measurement Group, India 37
Long latency tails on EC2
(Xu et al, Bobtail, 2013)
Initial observations:
1. Median RTTs within EC2 are up to 0.6 ms, but the 99.9th-percentile RTT on EC2 is 4 times longer than that seen in dedicated data centers. (In other words, a few packets see much longer delays than normal.)
2. Small instances most susceptible to the problem.
3. Measured RTT between node pairs on same AZ. Pattern not
symmetric. Hence, long tail not caused by location of host on
network.
4. RTT between good and bad nodes in AZ can differ by order of
magnitude.
5. One AZ which had newer CPU models did not return that many
bad nodes.
Computer Measurement Group, India 38
Long latency tails on EC2
(Xu et al, Bobtail, 2013)
Experimental setup: On 4-core Xen server, dedicate 2 cores to
Dom0. Remaining 2 cores are shared between 5 VMs with 40%
share each. Vary the combination of latency-sensitive versus CPU-
intensive VMs.
Latency-sensitive   CPU-intensive   RTT
5                   0               1 ms
4                   1               1 ms
3                   2               <10 ms
2                   3               ~30 ms
1                   4               ~30 ms
The long tail emerges when the number of CPU-intensive VMs exceeds the number of shared cores.
Computer Measurement Group, India 39
Long latency tails on EC2
(Xu et al, Bobtail, 2013)
Hypothesis: Do all CPU-intensive VMs cause a problem?
Test: Vary the CPU usage of CPU-intensive VM to find out.
Long tail occurs when a competing VM does not use all its
CPU allocation. The Boost mechanism for quickly scheduling
latency-sensitive VMs fails against such VMs.
Computer Measurement Group, India 40
Long latency tails on EC2
(Xu et al, Bobtail, 2013)
1. Latency-sensitive VMs cannot respond in a timely manner
because they are starved of CPU by other VMs.
2. The VMs which starve them are those that are CPU-intensive but
are not using 100% of their allocation within 30ms.
3. The BOOST mechanism in the Xen scheduler runs in FIFO manner
and treats these two types of VMs equally instead of prioritizing
the latency-sensitive VM.
Authors designed “Bobtail” to select the EC2 instance on which to
place a latency-sensitive VM. (see paper)
Computer Measurement Group, India 41
EC2 Xen settings
Tested for small instances:
1. EC2 uses Xen credit scheduler in non-work-conserving mode,
which reduces efficiency but improves isolation.
2. It allows vCPUs to float across cores instead of pinning them to a
core.
3. Disk and network scheduling is work-conserving, but only
network scheduling has a max cap of 300 Mbps.
(Varadarajan, 2012)
Computer Measurement Group, India 42
Know your hypervisor : Xen
Xen :
• CPU has two schedulers : Credit (and Credit2) and EDF.
• The credit scheduler keeps a per-VM (weight, cap) and can be work-conserving or not. Work-conserving means "distribute any idle time to other runnable vCPUs"; otherwise the total CPU quantum is capped.
• I/O intensive VMs benefit from BOOST, which bumps a vCPU to
the head of the queue when it receives an interrupt, provided it
has not exhausted its credits.
Device scheduler:
• Disk and network IO goes through Domain 0, which schedules requests in batches in round-robin fashion. To control network bandwidth, use Linux tools to define per-VM flows.
Best practice: Increase CPU weight of Dom0 to be proportional to
the amount of IO. Dedicate core(s) to it. Dedicate memory and
prevent ballooning.
Computer Measurement Group, India 43
Know your hypervisor - KVM
• QEMU originally started as a complete machine emulator [Bellard, 2005]. Code emulation is done by TCG (tiny code generator), originally called "dyngen". KVM was later added as another code accelerator into the QEMU framework.
• Only one IO thread; the big QEMU lock is held in many IO functions.
• A Red Hat "fio" benchmark in Sept 2012 reported 1.4M IOPS with 4 guests, but this was using passthrough IO (i.e. bypassing QEMU).
• Similar numbers reported in Mar 2013 but this time using an
experimental virtio-dataplane feature which utilizes dedicated per-
device threads for IO.
• Performance of RTOS (as a guest OS) in KVM also suffers when it
comes in contact with QEMU [Kiszka].
Computer Measurement Group, India 44
Tile-based benchmarking to test consolidation
Traditional benchmarks are designed for individual servers. For
virtualization, tiles of virtual machines that mimic actual
consolidation are used.
1. SPECvirt sc2013 (supersedes SPECvirt sc2010)
2. VConsolidate(Intel): tile consisting of SPECjbb, Sysbench,
Webbench and a mail server
3. VMmark (VMware) : Exchange mail server, standby system, Apache server, database server.
SPECvirt sc2013:
• Run for 3 hours on a single node to stress CPU, RAM, disk, and network.
• Incorporates four workloads : a web server, 4 Java app servers connected to a backend database server (to test multiple vCPUs on SMP), a mail server and a batch server.
• Keep adding additional sets of virtual machines (tiles) until overall throughput reaches a peak. All VMs must continue to meet the required QoS (spec.org/virt_sc2013).
Computer Measurement Group, India 45
SPECvirt sc2013
Computer Measurement Group, India 46
Contents
1. Hypervisor classification and overview
2. Overheads introduced
3. Benchmarking
4. Analytic models
5. Conclusion
Computer Measurement Group, India 47
Value of analytic model
Benchmarks have to:
• produce repeatable results,
• be easily comparable across architectures & platforms,
• have predictive power (extrapolation).
There is a tension between realism and reproducibility. Macrobenchmarks simulate real-world conditions but are not comparable and lack extrapolation.
Microbenchmarks determine the cost of primitive operations.
Need analytic model to tie benchmarks to prospective application
use. Seltzer proposed three approaches :
1. Vector-based: combine system vector with application vector.
2. Trace-based : Generate workload from trace to capture dynamic
sequence of requests.
3. Hybrid : combination of both.
(Mogul 1999; Seltzer et al 1999)
Computer Measurement Group, India 48
Analytic models for virtualization
1.Layered queuing network (Menasce; Benevenuto
2006).
2.Factor graphs to determine per-VM utilization (Lu
2011)
3.Trace-based approach (Wood, et al)
4.VMBench (Moller @ Karlsruhe)
5.Equations for cache and core interference
(Apparao, et al).
6.Machine learning
Computer Measurement Group, India 49
Layered Queueing network (for Xen)
[Diagram: layered queueing network on Xen — requests flow IN, are served by the VM, Domain 0 and the disk, then flow OUT.]
Computer Measurement Group, India 50
Layered Queueing network (for Xen)
Total response time R = R(VM) + R(dom0) + R(disk)
For M/M/1 with feedback, the response time of one resource is
R = D / (1 - U)
U = Utilization = λ * D = Arrival rate * Service demand (U lies between 0 and 1)
D = Service demand = total time taken by one request
D(resource by VM) = D(bare) * Slowdown(resource) / P(VM)
D(resource by Dom0) = D(VM) * Cost(Dom0) / P(IDD)
where
P = speedup of the VM's hardware compared to bare metal
Cost(Dom0) = BusyTime(Dom0) / BusyTime(VM)
Slowdown(resource) = BusyTime(virtual) / BusyTime(bare)
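A minimal numeric sketch of the model above; the service demands and arrival rate are invented placeholders, not measured values.

```python
# Each resource is treated as an M/M/1-like station: U = lambda * D and
# R = D / (1 - U); the per-request response time is the sum over the layers.
def response_time(demand_s, arrival_rate):
    u = arrival_rate * demand_s             # utilization of this resource
    if u >= 1:
        raise ValueError("resource saturated (U >= 1)")
    return demand_s / (1.0 - u)

demands = {"vm_cpu": 0.004, "dom0_cpu": 0.002, "disk": 0.006}  # seconds/request
arrival_rate = 60.0                                            # requests/second

for name, d in demands.items():
    print(f"{name:9s} U={arrival_rate * d:.2f} "
          f"R={response_time(d, arrival_rate) * 1000:.2f} ms")

total = sum(response_time(d, arrival_rate) for d in demands.values())
print(f"total response time = {total * 1000:.2f} ms")
```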
Computer Measurement Group, India 51
Factor graphs
Resource utilization inside each guest VM is known and the aggregate utilization at the hypervisor is known. How do we determine the function which defines per-VM utilization of each resource?
This can be modeled as a “source separation problem” studied in
digital signal processing.
Measurements inside VM and on hypervisor can differ:
1.Disk IO inside VM can be higher than on the hypervisor due to
merging of IOs in the hypervisor.
2.CPU utilization inside a VM can be half of that at the hypervisor
because Xen issues per-VM IO through Dom0 (seen via “xentop”).
Computer Measurement Group, India 52
Factor graphs
[Diagram: factor graph connecting the per-resource utilization (CPU, Disk, Net, Mem) of the Host, VM1 and VM2 through factor nodes h1-h4, f1-f4 and g1-g4.]
Computer Measurement Group, India 53
Trace-based approach
How to model the migration from bare-metal to virtual
environment?
1.Create platform profiles to measure cost of primitive
operations: Run same microbenchmarks on native (bare-metal)
and virtualized platform.
2.Relate native and virtualized : Formulate set of equations
which relate native metrics to virtualized.
3.Capture trace of application which is to be migrated:
Determine how many primitive operations it uses and plug it in.
(The actual process employs statistical methods and is more complicated.)
Computer Measurement Group, India 54
Trace-based approach
How to model the migration from bare-metal to virtual
environment?
Step 1: Create platform profiles.
Run carefully chosen CPU, disk and network-intensive
microbenchmarks on both the bare-metal and virtual environment.
Measure key metrics for each benchmark :
a)CPU – percentage time spent in user, kernel and iowait
b)Network – read and write packets/sec and bytes/sec
c)Disk – read and write blocks/sec and bytes/sec.
Metric      CPU user   CPU sys   CPU iowait
BareMetal   23         13        3
Virtual     32         20        8
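A hedged sketch of collecting one such profile row with the psutil library (my tool choice; the approach does not prescribe one). The iowait field is Linux-specific.

```python
import psutil

def profile_sample(interval=1.0):
    d0, n0 = psutil.disk_io_counters(), psutil.net_io_counters()
    cpu = psutil.cpu_times_percent(interval=interval)   # blocks for 'interval'
    d1, n1 = psutil.disk_io_counters(), psutil.net_io_counters()
    return {
        "cpu_user": cpu.user,
        "cpu_sys": cpu.system,
        "cpu_iowait": getattr(cpu, "iowait", 0.0),       # Linux only
        "disk_read_Bps": (d1.read_bytes - d0.read_bytes) / interval,
        "disk_write_Bps": (d1.write_bytes - d0.write_bytes) / interval,
        "net_rx_Bps": (n1.bytes_recv - n0.bytes_recv) / interval,
        "net_tx_Bps": (n1.bytes_sent - n0.bytes_sent) / interval,
    }

print(profile_sample())
```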
Computer Measurement Group, India 55
Trace-based approach
How to model the migration from bare-metal to virtual
environment?
Step 2: Relate native and virtualized : Formulate set of
equations which relate native metrics to virtualized.
e.g. Util(cpu on VM) = c0 + c1*M1 + c2*M2 + ... + cn*Mn
where Mk = metric gathered from the native microbenchmark.
Solve for the model coefficients using Least Squares Regression.
The coefficients c_k capture relation between native and virtualized
platform.
e.g. c0=4, c1=19, c2=23, etc...
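A minimal sketch of this regression step using numpy least squares; the metric and utilization numbers below are invented placeholders, not the paper's data.

```python
import numpy as np

# Rows = microbenchmark runs; columns = native metrics M1..Mn for each run.
native_metrics = np.array([
    [23.0, 13.0, 3.0],
    [40.0, 20.0, 5.0],
    [10.0,  5.0, 1.0],
    [55.0, 30.0, 9.0],
])
# CPU utilization observed when the same benchmarks run on the virtual platform.
virt_cpu_util = np.array([32.0, 55.0, 15.0, 78.0])

# Prepend a column of ones so c0 acts as the intercept.
X = np.hstack([np.ones((native_metrics.shape[0], 1)), native_metrics])
coeffs, *_ = np.linalg.lstsq(X, virt_cpu_util, rcond=None)
print("c0..cn:", coeffs)

# Step 3 preview: apply the fitted model to a new application's native trace.
new_trace = np.array([1.0, 30.0, 15.0, 4.0])   # leading 1 for the intercept
print("predicted VM CPU utilization:", new_trace @ coeffs)
```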
Computer Measurement Group, India 56
Trace-based approach
How to model the migration from bare-metal to virtual
environment?
Step 3: Capture trace of application which is to be migrated:
Find the new metrics Mk from the trace, plug them into the fitted equation, and solve. Voila!
Util(cpu on VM) = 4 + 19*M1 + 23*M2 + ...
Recap:
1. Create platform profiles for native and virtual.
2. Find coefficients which relate native & virtual.
3. Capture application trace and apply the equation.
Their findings:
1. A single model is not applicable to both Intel and AMD, since CPU utilization varies.
2. A feedback loop within the application can distort the performance prediction.
Computer Measurement Group, India 57
Conclusion
All problems in CS can be solved by another level of indirection.
- David Wheeler (1927-2004, first PhD in Computer Science)
... and performance problems introduced by indirection require
caching, interlayer cooperation and hardware assists (e.g. TLB
cache, EPT, paravirtualization).
Virtual machines have finally arrived. Dismissed for a
number of years as merely academic curiosities, they are
now seen as cost-effective techniques for organizing
computer systems resources to provide extraordinary
system flexibility and support for certain unique
applications.
[Goldberg, Survey of Virtual Machine Research, 1974]
Computer Measurement Group, India 58
References
1. Ganesan et al. Empirical study of performance benefits of
hardware assisted virtualization, 2013.
2. Boutcher and Chandra. Does virtualization make disk scheduling passé?
3. Le et al. Understanding Performance Implications of Nested File
Systems in a Virtualized Environment.
4. Jannen. Virtualize storage, not disks.
5. Yu. Xen PV Performance status and Optimization Opportunities.
6. Barker and Shenoy. Empirical evaluation of latency-sensitive
application performance in the cloud
7. Xu. Bobtail. Avoiding long tails in the cloud.
8. Varadarajan et al. Resource freeing attacks.
9. Bellard. QEMU, a fast and portable dynamic translator.
10.Kiszka. Using KVM as a realtime hypervisor
11.Mogul. Brittle metrics in operating system research.
12.Seltzer et al. The Case for Application-Specific Benchmarking
Computer Measurement Group, India 59
References
1. Menasce. VIRTUALIZATION: CONCEPTS, APPLICATIONS, AND
PERFORMANCE MODELING
2. Benevenuto et al. Performance Models for Virtualized Applications
3. Lu et al. Untangling Mixed Information to Calibrate Resource
Utilization in Virtual Machines, 2011.
4. Wood. Profiling and Modeling Resource Usage of Virtualized
Applications
Computer Measurement Group, India 60
BACKUP SLIDES
Computer Measurement Group, India 61
Classification
• OS-level virtualization : Does not run any intermediary
hypervisor. Modify the OS to support namespaces for
networking, processes and filesystem.
• Paravirtualization : Guest OS is modified and is aware
that it is running inside a hypervisor.
• Full virtualization : Guest OS runs unmodified.
Hypervisor emulates hardware devices.
Computer Measurement Group, India 62
NUMA/SMP
• If you run a monster server VM with many vCPUs, you may have
to worry about NUMA scaling. Depending on NUMA ratio, 30-40%
higher cost (latency and throughput) in accessing remote memory
• Hypervisor must be able to
1.manually pin a vCPU to a core.
2.export NUMA topology to the guest OS.
3.do automatic NUMA-aware scheduling of all guest VMs.
• VMWare introduced vNUMA in vSphere 5.
• On Xen, pin Dom0 to a core. In case of NUMA, put frontend and backend drivers on the same core.
• KVM exports NUMA topology to the VM but is still lagging on automatic scheduling.
Computer Measurement Group, India 63
NUMA/SMP
• Depending on NUMA ratio, 30-40% higher cost (latency and
throughput) in accessing remote memory
• Hypervisor must support ability to pin a vCPU to a core, and also
allocate memory from specific NUMA node.
• Hypervisor must export NUMA topology (ACPI tables) so guest OS
can do its job.
• Hypervisor should do automatic NUMA-aware scheduling of all
guest VMs.
• VMWare introduced vNUMA in vSphere 5.
• On Xen, pin Dom0 to a core. In case of NUMA, put frontend and
backend drivers on the same core.
• KVM exports NUMA topology to VM but it is still lagging on
automatic scheduling.
• Cross-call overhead : On a SMP machine, when a semaphore is
released by one thread, it issues a cross-call or inter-processor
interrupt if the waiting threads are sleeping on another core. On
a VM, the cross-call becomes a costly privileged op (Akther).
Interrupt delivery may also trigger a cross-call.
Computer Measurement Group, India 64
Nested CPU scheduling
• Each guest OS runs on one or more virtual CPUs. Hypervisor
schedules virtual CPUs on its run queue and then each guest OS
decides which task to run on that virtual CPU.
• Introduces lock preemption problem: A process in the guest OS
may get scheduled out by the hypervisor while holding a spin lock,
delaying other processes waiting for that spin lock.
• Guest OS would not schedule out a process holding a spin lock but
hypervisor is unaware of processes within the guest OS.
• Solution is some form of co-scheduling or “gang scheduling”.
VMWare actively seeks to reduce “skew” between multiple vCPUs
of the same VM.
Computer Measurement Group, India 65
Nested Page tables
• Page fault in VM may occur because the hypervisor has not
allocated RAM to the VM.
• Guest Page table : Guest Virtual address -> Hypervisor Virtual
• Hypervisor Page Table : Hypervisor Virtual -> Actual RAM.
• Earlier, hypervisors would maintain a "shadow page table" for each guest OS. This function has now moved into hardware from both Intel and AMD; it's called "nested page tables".
• Nested page tables require a costly two-dimensional page walk: for each step that is resolved in the guest table, you have to look up the host table (a rough cost sketch follows this list).
• Overhead can be alleviated by using “huge pages” and per-VM
tags in the TLB cache.
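A back-of-the-envelope Python sketch of the walk cost referenced above; the n*m + n + m count is the standard approximation for an uncached two-dimensional walk, not a figure taken from these slides.

```python
# Every one of the n guest-table steps needs an m-level host walk, plus a
# final host walk for the resulting guest physical address.
def nested_walk_refs(guest_levels=4, host_levels=4):
    return guest_levels * host_levels + guest_levels + host_levels

print("native 4-level walk  :", 4, "memory references")
print("nested 4+4 level walk:", nested_walk_refs(), "memory references")  # 24
```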
Computer Measurement Group, India 66
Memory overheads & solutions
Balloon driver : take back memory from guest.
• -- VMWare
• -- KVM (see virtio_balloon.c in linux_kernel/drivers/virtio)
• -- HyperV calls it Dynamic Memory
• -- Xen Transcendent Memory
Memory deduplication
• -- Present in System/370 (Smith & Nair);
• -- VMWare calls it Transparent Page Sharing (patented)
• -- KVM uses KSM (which calls Linux madvise())
• -- Xen uses KSM in HVM mode only.
Computer Measurement Group, India 67
Quantifying isolation
• Deshane et al (2007) defined BenchVM to test isolation.
• Run normal VMs alongside an overloaded VM and test whether the normal VM remains responsive.
• On the Overloaded VM, you run various stress tests:
1.CPU stress test
2.Memory stress test : calloc and touch memory without free() (a minimal sketch follows below)
3.Network : threaded UDP send and receive
4.Disk : IOzone
5.Fork bomb : test fast process creation and scheduling.
Their conclusion: Full virtualization provides better isolation than
container-based virtualization. Their other results may be outdated
due to advances in virtualization
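A minimal sketch of the memory stress test from the list above ("calloc and touch memory without free()"), transposed to Python; the 64 MiB chunk size and 10-second duration are arbitrary choices.

```python
import time

PAGE = 4096
CHUNK = 64 * 1024 * 1024          # 64 MiB per allocation (arbitrary)
hoard = []                        # references are kept so nothing is freed

def memory_stress(seconds=10):
    deadline = time.time() + seconds
    while time.time() < deadline:
        buf = bytearray(CHUNK)
        for off in range(0, CHUNK, PAGE):
            buf[off] = 1          # touch each page to force it resident
        hoard.append(buf)
        print(f"resident so far: {len(hoard) * CHUNK // (1024 * 1024)} MiB")

memory_stress()
```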
Computer Measurement Group, India 68
VM exits are costly
• Interrupt processing causes context switches between VM and
hypervisor.
• KVM EOI optimization : guest IDT (interrupt descriptor table) is
shadowed.
• VMWare detects cluster of instructions that can cause guest exits.
• Use combination of polling and interrupt (NAPI)
Computer Measurement Group, India 69
mClock
• Disk capacity varies dynamically and cannot be statically allocated
like CPU or RAM.
• Need proportional sharing algorithm to reserve disk capacity
• Gulati et al propose a dynamic algorithm which interleaves two
schedulers and uses three tags with every IO request.
Computer Measurement Group, India 70
Hadoop benchmark
• VMWare :
• HyperV (conducted on HDInsight – Microsoft's version of
Hadoop) :
• KVM:
• Xen: (Netflix runs map-reduce on AWS)
Computer Measurement Group, India 71
HPC/Scientific benchmark
• VMWare paper : SPEC MPI and SPEC OMP
• Xen : Jackson et al (2010) ran NERSC on Amazon EC2. Six times
slower than Linux cluster and 20 times slower than modern HPC
system. EC2 interconnect severely limits performance. Could not
use processor-specific compiler options since heterogenous mix of
CPUs on every node.
• In Jun 2010, Amazon launched “Cluster Compute Nodes” which are
basically nodes running Xen in hvm mode connected via 10G
ethernet (no Infiniband yet).
• KVM and OpenVZ : Regola (2010) ran NPB on these nodes.
Computer Measurement Group, India 72
Realtime benchmark
• In order to minimize jitter and limit the worst-case latency, a
realtime system must provide mechanisms for resource
reservation, process preemption and prevention of priority
inversion.
• Soft realtime (VoIP) vs Hard realtime. Soft means 20ms jitter
between packets acceptable.
• RT-XEN
• Kiszka KVM – QEMU driver lock.
Computer Measurement Group, India 73
Layered Queueing network (Xen)
Total response time R = R(vcpu) + R(dom0_cpu) + R(disk)
Resp. Time = Demand/[ 1- Utilization ]
R(vcpu) = D(vcpu)/ [ 1 – U (vcpu) ]
R(dom0_cpu) = D(dom0_cpu)/ [ 1 - U(dom0_cpu) ]
R(disk) = D(disk)/ [ 1 – U(disk) ]
Util = λ * D = Arrival rate * Demand
D(vm_cpu) = D(isol_cpu) * S(cpu)/P(vm) where S=slowdown,
P=speedup
D(dom0_cpu) = D(vm_cpu) * Cost(dom0_vm)/P(dom0)
Cost(dom0_vm) = B(dom0_cpu)/B(vm_cpu) where B = busy time
Slowdown(cpu) = B(vm_cpu)/B(bare_cpu)
Slowdown(disk) = B(vm_disk)/B(bare_disk)
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Virtualization overheads

  • 15. Computer Measurement Group, India 15 How much does hardware assist help?
Ganesan et al (2013) ran microbenchmarks on 2-core Intel Xeon. Compare native vs Xen, with and without hardware assistance (SR-IOV, IOMMU, Intel VT).
Finding: Network throughput is near-native with SR-IOV but CPU utilization still remains high (possibly because interrupt processing is still triggering numerous guest VM-hypervisor transitions?).
Chart shows Max throughput with iPerf:
              Mbps   Dom0 CPU   VM CPU
Native         940      NA       16.68
SR-IOV         940      20       65 (high)
No SR-IOV      192      82       39
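To make the cost comparison concrete, here is a small illustrative calculation (my own, not from the paper) that divides the reported VM CPU utilization by the achieved throughput, i.e. the service demand per Mbps:

```python
# Illustrative arithmetic on the iPerf table above: "service demand" here is
# the VM CPU utilization consumed per Mbps of achieved network throughput.
configs = {
    # name: (throughput in Mbps, VM CPU utilization in %)
    "Native":    (940, 16.68),
    "SR-IOV":    (940, 65.0),
    "No SR-IOV": (192, 39.0),
}

for name, (mbps, cpu) in configs.items():
    demand = cpu / mbps
    print(f"{name:10s}  {mbps:4d} Mbps  {cpu:5.2f}% VM CPU  "
          f"-> {demand:.3f} %CPU per Mbps")

# SR-IOV matches native throughput but burns roughly 4x the CPU per Mbps,
# which matches the slide's observation that utilization stays high.
```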
  • 16. Computer Measurement Group, India 16 How much does hardware assist help? Further results from Ganesan et al (2013). Disk throughput tested using RUBiS (disk+net intensive) and BLAST (disk intensive). Finding: Disk IO is not yet benefiting from hardware assists. Most of the RUBiS improvement comes from SR-IOV rather than IOMMU. Similar finding with BLAST.
  • 17. Computer Measurement Group, India 17 Nested IO scheduling VM and hypervisor are both running IO scheduling algorithms (and so is the disk drive). IO requests are rearranged and merged by the IO scheduler (scheduler can be set in Xen Dom0 or KVM host but not in ESX). On Linux, there are 4 schedulers - CFQ, NOOP, Anticipatory, Deadline. Each block device can have a different scheduler. Image: dannykim.me
  • 18. Computer Measurement Group, India 18 Nested IO scheduling
Results of Boutcher and Chandra:
• The best combination of schedulers depends on the workload.
• There is a tradeoff between fairness (across VMs) and throughput.
• The scheduler closest to the workload has the most impact.
• NOOP has the best throughput in the hypervisor but is the least fair by Jain's fairness measure (see the sketch below).
• In the guest VM, CFQ is 17% better than Anticipatory for the FFSB benchmark, but for Postmark, Anticipatory is 18% better than CFQ.
• On the hypervisor, NOOP is 60% better than CFQ for FFSB and 72% better for Postmark.
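For reference, Jain's fairness index is (sum of x)^2 / (n * sum of x^2); a minimal Python sketch with illustrative throughput numbers (not Boutcher's data):

```python
def jain_fairness(throughputs):
    """Jain's fairness index: 1.0 means perfectly fair,
    1/n means a single VM gets all of the throughput."""
    n = len(throughputs)
    total = sum(throughputs)
    return (total * total) / (n * sum(x * x for x in throughputs))

# Hypothetical per-VM throughputs (MB/s) under two hypervisor schedulers.
print(jain_fairness([100, 100, 100, 100]))  # 1.0   -> perfectly fair
print(jain_fairness([370, 40, 30, 20]))     # ~0.38 -> high aggregate, unfair
```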
  • 19. Computer Measurement Group, India 19 Nested IO scheduling
Boutcher's numbers for FFSB on Xen 3.2, 128 threads, each VM allocated contiguous 120GB space on 500GB SATA drive. X-axis is scheduler in VM; Y-axis is scheduler in hypervisor. Numbers in table are approx because they were converted from a bar graph (Transactions per sec. on Xen).
Hypervisor \ VM   Anticipatory   CFQ   Deadline   NOOP
Anticipatory          200        260      175      240
CFQ                   260        240      155      160
Deadline              315        360      250      255
NOOP                  320        370      245      255
  • 20. Computer Measurement Group, India 20 Sequential IO becomes random • Sequential IO issued from multiple VMs to same block device becomes random when aggregated in the hypervisor. • Set longer disk queue length in hypervisor to enable better aggregation. On VMWare, you can set disk.SchedReqNumOutstanding=NNN. • Use PCI flash or SSDs to absorb random writes.
  • 21. Computer Measurement Group, India 21 Nested filesystems and page cache
A filesystem in the VM can map to a flat file on an underlying filesystem, a raw block device (local or iSCSI), or NFS. A flat file on a filesystem is preferred for ease of management; it can be in raw, qcow2, vmdk or vhd format.
(Diagram: guest filesystems on /dev/sda in VM-1 and /dev/sdb in VM-2 map to files under /vmfs on the hypervisor's /dev/sdc.)
• Flat files introduce another performance overhead (next slide).
• KVM has four caching modes (none, writeback, writethrough, unsafe) which can disable/enable either cache.
• In Xen Dom0, the page cache comes into play when the file-storage option is in use.
  • 22. Computer Measurement Group, India 22 Nested filesystems and page cache Le et al (2012) ran FileBench and “fio” on KVM. Tried 42 different combinations of guest and host file systems. Found worst-case 67% degradation. Their conclusion: • Read-dominated workloads benefit from readahead. • Avoid journaled filesystems for write-dominated workloads. • Latency goes up 10-30% in best case. • Host FS should be like a dumb disk or VirtFS; it should not make placement decisions over what guest FS has decided. • Jannen (2013) found that overheads are worse for filesystems on SSD. On HDD, the overheads are masked by rotational latency.
  • 23. Computer Measurement Group, India 23 Nested filesystems and page cache
Le et al (2012) – random file write test using “fio”. Y-axis is Host file system; X-axis is Guest file system. Throughput in MB/sec:
Host \ Guest   ext2   ext3   reiser   xfs   jfs
ext2            60     55      65      80    95
ext3            60     55      65      80    75
ext4            60     55      55      70    95
reiser          60     55      65      80   100
xfs             60     40      60      70    65
jfs             60     50      65      80   105
  • 24. Computer Measurement Group, India 24 Nested IO stacks: use paravirtualization
The hypervisor exposes a software-emulated virtual NIC or storage HBA to the VM. An IO request issued by the VM travels to the bottom of the guest stack before it is repackaged and reissued by the hypervisor. Paravirtualization traps the IO request and uses shared memory to route it faster to the hypervisor.
1. VMWare: install “VMWare tools” and select the “Paravirtual SCSI controller” for storage and the “vmxnet” driver for networking. VMWare claims PVSCSI offers 12% throughput improvement with 18% less CPU cost at 8KB block size (blogs.vmware.com).
2. KVM: use the newer in-kernel “vhost-net” for networking and the “virtio-scsi” or “virtio-blk-data-plane” drivers for storage.
3. Xen: a split driver is used for PVM guests, while HVM guests use QEMU or StubDom. HVM can also use PV drivers.
  • 25. Computer Measurement Group, India 25 Xen: PVM and HVM difference HVM is 3-6% better than PV guest for CPU+RAM intensive tests. For 1 VM with 4 vCPU and 6GB JVM heap size (Yu 2012).
  • 26. Computer Measurement Group, India 26 Xen: PVM and HVM difference Here, HVM was using the PV driver. It outperforms PV by 39% for a disk-intensive test running on SSD (Yu 2012).
  • 27. Computer Measurement Group, India 27 Contents 1. Hypervisor classification and overview 2. Overheads introduced 3. Benchmarking 4. Analytic models 5. Conclusion
  • 28. Computer Measurement Group, India 28 Virtual Machine Benchmarking
Two aspects:
1. Performance: How does a consolidated server compare to a non-virtualized OS running on bare-metal hardware?
2. Isolation: Does overloading one VM bring down the performance of other VMs running on the same node?
Impact of factors:
• Virtualization-sensitive instructions issued by the guest.
• VM exits and interrupt delivery.
• Architectural choices made within the hypervisor.
• Interference between VMs due to shared resources (visible and invisible).
  • 29. Computer Measurement Group, India 29 Testing isolation capability of a hypervisor 1. Run application on cloud with collocated VMs. 2. Then run in isolation with no collocated VMs to find the gaps. 3. Then run it in a controlled environment, gradually adding collocated VMs which create CPU, disk or network load, until you can simulate the behaviour seen in the cloud.
  • 30. Computer Measurement Group, India 30 CPU isolation on Xen (Barker and Shenoy study 2010) Find variation in completion times for a single thread which is running a “floating point operations” test periodically over a few hours. • Amazon EC2 small instance: Average completion time was 500 ms, but there was significant jitter. Some tests even took an entire second. • Local setup: Same test on local Xen server showed almost NO variation in completion time. • Conclusion: CPU scheduler on Xen does not provide perfect isolation. • Further tests done to narrow down the problem in the CPU scheduler.
  • 31. Computer Measurement Group, India 31 Xen’s credit scheduler for CPU (Barker and Shenoy study, 2010)
Xen has 2 main CPU schedulers – EDF (realtime) and Credit (default). Each VM runs on one or more virtual CPUs (vCPUs). The hypervisor maps vCPUs to physical CPUs (floating or pinned). For each VM, you can define (weight, cap):
1. Weight = proportion of CPU allocated.
2. Cap = max limit or ceiling on CPU time.
The credit scheduler periodically issues 30ms of credit to each vCPU. The allocation is decremented in 10ms intervals. When credits expire, the VM must wait until the next 30ms cycle. If a VM receives an interrupt, it gets a “Boost” which inserts it at the top of the vCPU queue, provided it has not exhausted its credits. The scheduler also has a work-conserving mode which transfers unused capacity to those VMs that need it. (A simplified sketch of the credit/boost accounting follows below.)
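This is my own toy model of the accounting just described, not Xen source code; the constants, class and function names are illustrative only:

```python
from collections import deque

TICK_MS = 10            # credits are debited in 10 ms accounting ticks
REFILL_PERIOD_MS = 30   # credits are handed out every 30 ms

class VCpu:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight       # relative share of CPU
        self.credits = 0
        self.boosted = False       # set when an interrupt wakes this vCPU

def refill(vcpus, credits_per_period=300):
    """Called every REFILL_PERIOD_MS: hand out credits in proportion to weight."""
    total_weight = sum(v.weight for v in vcpus)
    for v in vcpus:
        v.credits += credits_per_period * v.weight // total_weight

def pick_next(run_queue):
    """Boosted vCPUs with credits left run first, then anyone with credits."""
    boosted = [v for v in run_queue if v.boosted and v.credits > 0]
    with_credits = [v for v in run_queue if v.credits > 0]
    return (boosted or with_credits or list(run_queue))[0]

vcpus = [VCpu("latency-sensitive", 256), VCpu("cpu-hog", 256)]
run_queue = deque(vcpus)
refill(vcpus)

vcpus[0].boosted = True            # e.g. a network interrupt just arrived
nxt = pick_next(run_queue)
print("scheduled:", nxt.name)      # -> latency-sensitive
nxt.credits -= TICK_MS             # debited at the next accounting tick
```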
  • 32. Computer Measurement Group, India 32 CPU isolation on Xen (Barker and Shenoy study 2010) On local setup, tied two VMs to same physical core. Varied (weight, cap) of foreground VM while keeping background VM busy. 1. First test: Keep weights of both VMs equal. Observed jitter as seen on EC2 test. 2. Second test: Vary “weight” while keeping “cap” constant. Weight does not directly correspond to CPU time. Weight ratio of 1:1000 only translates into actual CPU ratio of 1:1.5 (33% more). 3. Third test: Vary “cap” on both VMs. CPU allocation of foreground VM was in proportion to the “cap” (even when background VM was idle). Conclusion: Strong isolation requires pinning VM to a core or setting the “cap”.
  • 33. Computer Measurement Group, India 33 Disk isolation on Xen
Test jitter for small random or large streaming IO to simulate game servers and media servers (Barker, 2010).
Amazon EC2: Found significant variation in completion time for reads and writes. Write bandwidth can vary up to 50% from the mean. Read bandwidth variation can be due to caching side-effects.
Isolated local setup: Completion times are consistent if there is no other VM on the Xen node.
Introduce a background VM: Run the same tests with another background VM doing heavy IO. Used the CFQ IO scheduler in Dom0 and NOOP in the guest VM.
Finding: Xen has no native disk isolation mechanism to identify per-VM disk flows. Throughput of the foreground VM dropped by 65-75% and latency increased by 70-80%. The degradation is bounded only by the round-robin policy of the Xen Dom0 driver.
  • 34. Computer Measurement Group, India 34 Network isolation on Xen (Barker 2010)
1. Measure “ping” time to next hop.
2. Measure sum of “ping” time to first three hops.
3. Measure time to transfer 32KB block between local and EC2 instance.
Pop quiz: what is the point in conducting these three tests?
  • 35. Computer Measurement Group, India 35 Network isolation on Xen (Barker 2010)
1. Measure “ping” time to the next hop.
2. Measure the sum of “ping” times to the first three hops.
3. Measure the time to transfer a 32KB block between a local machine and an EC2 instance.
Purpose:
a) The first measurement captures jitter of the network interface.
b) The second captures jitter in routers inside the Amazon data center.
c) The third captures Internet WAN transfer rate and jitter.
Results:
1. Saw no jitter in the first measurement.
2. Significant variation in the second: most pings took 5ms, but a significant number took an order of magnitude longer.
3. The third test showed regular variation (related to peak hours) typical of most WAN applications.
  • 36. Computer Measurement Group, India 36 Network isolation on Xen
Network latency tests on a Game server and a Media server on local Xen cloud (Barker 2010). Found that “tc” defines per-VM flows using IP address and provides good isolation. Two ways to allocate bandwidth using the Linux “tc” tool:
1. Dedicated: Divide bandwidth between competing VMs and prevent any VM from using more (i.e. weight + cap).
2. Shared: Divide bandwidth but allow VMs to draw more if required (i.e. weight + work-conserving).
In both game and media server tests, results are consistent. “Dedicated” mode produced lower latency while “shared” mode produced lower jitter.
Interference   Mean      Std deviation
Dedicated      23.6 ms       29.6
Shared         33.9 ms       16.9
  • 37. Computer Measurement Group, India 37 Long latency tails on EC2 (Xu et al, Bobtail, 2013)
Initial observations:
1. Median RTTs within EC2 are up to 0.6ms, but the 99.9th-percentile RTT on EC2 is 4 times longer than that seen in dedicated data centers. (In other words, a few packets see much longer delays than normal.)
2. Small instances are most susceptible to the problem.
3. Measured RTT between node pairs in the same AZ. The pattern is not symmetric; hence, the long tail is not caused by the location of the host on the network.
4. RTT between good and bad nodes in an AZ can differ by an order of magnitude.
5. One AZ which had newer CPU models did not return that many bad nodes.
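A small sketch of how such a tail is quantified from raw RTT samples; the numbers are synthetic and chosen only to mimic the shape described above, not the paper's data:

```python
import random
import statistics

random.seed(1)
# Synthetic RTTs (ms): mostly ~0.5 ms, plus a rare multi-millisecond tail.
rtts = [random.uniform(0.3, 0.7) for _ in range(10000)]
rtts += [random.uniform(5, 40) for _ in range(30)]   # the "long tail"
rtts.sort()

def percentile(sorted_data, p):
    """Simple nearest-rank percentile on an ascending-sorted list."""
    return sorted_data[min(len(sorted_data) - 1, int(p / 100 * len(sorted_data)))]

print("median     :", round(statistics.median(rtts), 2), "ms")
print("99th pct   :", round(percentile(rtts, 99), 2), "ms")
print("99.9th pct :", round(percentile(rtts, 99.9), 2), "ms")
```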
  • 38. Computer Measurement Group, India 38 Long latency tails on EC2 (Xu et al, Bobtail, 2013)
Experimental setup: On 4-core Xen server, dedicate 2 cores to Dom0. Remaining 2 cores are shared between 5 VMs with 40% share each. Vary the combination of latency-sensitive versus CPU-intensive VMs.
Latency-sensitive   CPU-intensive   RTT
        5                 0          1 ms
        4                 1          1 ms
        3                 2         <10 ms
        2                 3         ~30 ms
        1                 4         ~30 ms
Long-tail emerges when CPU-intensive VMs exceed number of cores.
  • 39. Computer Measurement Group, India 39 Long latency tails on EC2 (Xu et al, Bobtail, 2013) Hypothesis: Do all CPU-intensive VMs cause a problem? Test: Vary the CPU usage of CPU-intensive VM to find out. Long tail occurs when a competing VM does not use all its CPU allocation. The Boost mechanism for quickly scheduling latency-sensitive VMs fails against such VMs.
  • 40. Computer Measurement Group, India 40 Long latency tails on EC2 (Xu et al, Bobtail, 2013) 1. Latency-sensitive VMs cannot respond in a timely manner because they are starved of CPU by other VMs. 2. The VMs which starve them are those that are CPU-intensive but are not using 100% of their allocation within 30ms. 3. The BOOST mechanism in the Xen scheduler runs in FIFO manner and treats these two types of VMs equally instead of prioritizing the latency-sensitive VM. Authors designed “Bobtail” to select the EC2 instance on which to place a latency-sensitive VM. (see paper)
  • 41. Computer Measurement Group, India 41 EC2 Xen settings Tested for small instances: 1. EC2 uses Xen credit scheduler in non-work-conserving mode, which reduces efficiency but improves isolation. 2. It allows vCPUs to float across cores instead of pinning them to a core. 3. Disk and network scheduling is work-conserving, but only network scheduling has a max cap of 300 Mbps. (Varadarajan, 2012)
  • 42. Computer Measurement Group, India 42 Know your hypervisor: Xen
Xen CPU scheduling has two schedulers: Credit(2) and EDF.
• The Credit scheduler keeps a per-VM (weight, cap). It can be work-conserving or not. Work-conserving means “distribute any idle time to other running processes”; otherwise the total CPU quantum is capped.
• I/O-intensive VMs benefit from BOOST, which bumps a vCPU to the head of the queue when it receives an interrupt, provided it has not exhausted its credits.
Device scheduler:
• Disk and network IO goes through Domain 0, which schedules them in batches in round-robin fashion. To control network bandwidth, use Linux tools to define per-VM flows.
Best practice: Increase the CPU weight of Dom0 to be proportional to the amount of IO. Dedicate core(s) to it. Dedicate memory and prevent ballooning.
  • 43. Computer Measurement Group, India 43 Know your hypervisor - KVM
• QEMU originally started as a complete machine emulator [Bellard, 2005]. Code emulation is done by TCG (the tiny code generator), originally called “dyngen”. KVM was later added as another code accelerator into the QEMU framework.
• There is only one IO thread; the big QEMU lock is held in many IO functions.
• A Redhat “fio” benchmark in Sept 2012 reported 1.4M IOPS with 4 guests, but this was using passthrough IO (i.e. bypassing QEMU).
• Similar numbers were reported in Mar 2013, but this time using an experimental virtio data-plane feature which uses dedicated per-device threads for IO.
• Performance of an RTOS (as a guest OS) in KVM also suffers when it comes in contact with QEMU [Kiszka].
  • 44. Computer Measurement Group, India 44 Tile-based benchmarking to test consolidation
Traditional benchmarks are designed for individual servers. For virtualization, tiles of virtual machines that mimic actual consolidation are used.
1. SPECvirt sc2013 (supersedes SPECvirt sc2010)
2. VConsolidate (Intel): a tile consisting of SPECjbb, Sysbench, Webbench and a mail server
3. VMMark (VMWare): Exchange mail server, standby system, Apache server, database server.
SPECvirt sc2013:
• Runs for 3 hours on a single node to stress CPU, RAM, disk, and network.
• Incorporates four workloads: a web server, 4 Java app servers connected to a backend database server (to test multiple vCPUs on SMP), a mail server and a batch server.
• Keep adding additional sets of virtual machines (tiles) until overall throughput reaches a peak, while all VMs continue to meet the required QoS (spec.org/virt_sc2013). A sketch of this scaling loop follows below.
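The tile-scaling procedure reduces to a simple loop: keep adding tiles while aggregate throughput still improves and every VM meets QoS. A hedged pseudo-harness follows; the hook functions add_tile, measure and meets_qos are placeholders of my own, not part of any SPEC kit:

```python
def run_tile_benchmark(add_tile, measure, meets_qos):
    """Illustrative SPECvirt-style scaling loop (not the real harness):
    add tiles of VMs until aggregate throughput stops improving or any
    VM in any tile violates its QoS target."""
    tiles = [add_tile()]
    best_score = measure(tiles)
    while True:
        tiles.append(add_tile())
        score = measure(tiles)
        if not meets_qos(tiles) or score <= best_score:
            tiles.pop()            # the last tile pushed the host past its peak
            return len(tiles), best_score
        best_score = score

# Tiny simulated demo: throughput grows sub-linearly, QoS breaks past 5 tiles.
if __name__ == "__main__":
    count = {"n": 0}
    def add_tile():
        count["n"] += 1
        return f"tile-{count['n']}"
    measure = lambda tiles: sum(100 / (1 + 0.2 * i) for i in range(len(tiles)))
    meets_qos = lambda tiles: len(tiles) <= 5
    print(run_tile_benchmark(add_tile, measure, meets_qos))
```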
  • 45. Computer Measurement Group, India 45 SPECvirt sc2013
  • 46. Computer Measurement Group, India 46 Contents 1. Hypervisor classification and overview 2. Overheads introduced 3. Benchmarking 4. Analytic models 5. Conclusion
  • 47. Computer Measurement Group, India 47 Value of analytic model
Benchmarks have to:
• Produce repeatable results.
• Be easily comparable across architectures & platforms.
• Have predictive power (extrapolation).
There is a tension between realism and reproducibility. Macrobenchmarks simulate real-world conditions but are not comparable and lack extrapolation. Microbenchmarks determine the cost of primitive operations. An analytic model is needed to tie benchmarks to prospective application use.
Seltzer proposed three approaches:
1. Vector-based: combine a system vector with an application vector (see the sketch below).
2. Trace-based: generate a workload from a trace to capture the dynamic sequence of requests.
3. Hybrid: a combination of both.
(Mogul 1999; Seltzer et al 1999)
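A minimal sketch of the vector-based idea, with made-up operation names and numbers of my own: the system vector holds microbenchmarked costs of primitive operations, the application vector holds how often a traced application performs each one, and their dot product predicts run time.

```python
# System vector: microbenchmarked cost of each primitive (microseconds/op).
system_vector = {"syscall": 0.8, "page_fault": 2.5, "disk_read_4k": 120.0}

# Application vector: how many of each primitive a traced run performs.
app_vector = {"syscall": 2_000_000, "page_fault": 50_000, "disk_read_4k": 10_000}

# Dot product of the two vectors gives the predicted run time.
predicted_us = sum(app_vector[op] * system_vector[op] for op in app_vector)
print(f"predicted run time: {predicted_us / 1e6:.2f} s (approx.)")
```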
  • 48. Computer Measurement Group, India 48 Analytic models for virtualization 1.Layered queuing network (Menasce; Benevenuto 2006). 2.Factor graphs to determine per-VM utilization (Lu 2011) 3.Trace-based approach (Wood, et al) 4.VMBench (Moller @ Karlsruhe) 5.Equations for cache and core interference (Apparao, et al). 6.Machine learning
  • 49. Computer Measurement Group, India 49 Layered Queueing network (for Xen)
(Diagram: a layered queueing network for Xen; incoming requests visit the VM, Domain 0 and the disk before completing.)
  • 50. Computer Measurement Group, India 50 Layered Queueing network (for Xen)
Total response time R = R(VM) + R(dom0) + R(disk)
For an M/M/1 queue with feedback, the response time of one resource is R = D / [1 - U]
U = Utilization = λ * D = Arrival rate * Service demand (U lies between 0 and 1).
D = Service demand = total time taken by one request.
D(resource by VM) = D(bare) * Slowdown(resource) / P(VM)
D(resource by Dom0) = D(VM) * Cost(Dom0) / P(IDD)
where P = speedup of the VM's hardware compared to bare metal.
Cost(Dom0) = BusyTime(Dom0) / BusyTime(VM)
Slowdown(resource) = BusyTime(virtual) / BusyTime(bare)
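Plugging these formulas into code shows how the pieces fit together; a minimal sketch with hypothetical service demands and arrival rate (not measured values):

```python
def response_time(demand, arrival_rate):
    """Open M/M/1-style station: R = D / (1 - U), with U = lambda * D."""
    utilization = arrival_rate * demand
    if utilization >= 1.0:
        raise ValueError("station saturated (U >= 1)")
    return demand / (1.0 - utilization)

# Hypothetical per-request service demands (seconds) along one Xen request path.
demands = {"vm_cpu": 0.004, "dom0_cpu": 0.002, "disk": 0.010}
arrival_rate = 50.0  # requests per second

# Total response time is the sum over the VM, Dom0 and disk stations.
R = sum(response_time(d, arrival_rate) for d in demands.values())
print(f"total response time: {R * 1000:.1f} ms")
```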
  • 51. Computer Measurement Group, India 51 Factor graphs Resource utilization at guest VMs is known and aggregate utilization at hypervisor is known. How to determine the function which defines per-VM utilization of each resource ? This can be modeled as a “source separation problem” studied in digital signal processing. Measurements inside VM and on hypervisor can differ: 1.Disk IO inside VM can be higher than on the hypervisor due to merging of IOs in the hypervisor. 2.CPU utilization inside a VM can be half of that at the hypervisor because Xen issues per-VM IO through Dom0 (seen via “xentop”).
  • 52. Computer Measurement Group, India 52 Factor graphs
(Diagram: a factor graph connecting the host's CPU/Disk/Net/Mem observations through factor nodes h1-h4, f1-f4 and g1-g4 to the per-resource utilizations of VM1 and VM2.)
  • 53. Computer Measurement Group, India 53 Trace-based approach How to model the migration from bare-metal to virtual environment? 1.Create platform profiles to measure cost of primitive operations: Run same microbenchmarks on native (bare-metal) and virtualized platform. 2.Relate native and virtualized : Formulate set of equations which relate native metrics to virtualized. 3.Capture trace of application which is to be migrated: Determine how many primitive operations it uses and plug it in. (*Actual process employs Statistical methods and is more complicated)
  • 54. Computer Measurement Group, India 54 Trace-based approach
How to model the migration from bare-metal to a virtual environment?
Step 1: Create platform profiles. Run carefully chosen CPU-, disk- and network-intensive microbenchmarks on both the bare-metal and the virtual environment. Measure key metrics for each benchmark:
a) CPU – percentage of time spent in user, kernel and iowait
b) Network – read and write packets/sec and bytes/sec
c) Disk – read and write blocks/sec and bytes/sec
Example profile:
Metric      CPU user   CPU sys   CPU iowait
BareMetal      23         13          3
Virtual        32         20          8
  • 55. Computer Measurement Group, India 55 Trace-based approach How to model the migration from bare-metal to virtual environment? Step 2: Relate native and virtualized : Formulate set of equations which relate native metrics to virtualized. e.g. Util(cpu on VM) = c0 + c1*M1 + c2*M2 + ... cn*Mn where Mk=Metric gathered from native microbenchmark Solve for the model coefficients using Least Squares Regression. The coefficients c_k capture relation between native and virtualized platform. e.g. c0=4, c1=19, c2=23, etc...
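Step 2 is an ordinary least-squares fit; a minimal sketch with synthetic profile data (the metric names and numbers are illustrative, not taken from the cited work):

```python
import numpy as np

# Rows = microbenchmark runs; columns = native metrics M1..M3
# (e.g. CPU user %, packets/sec, disk blocks/sec).
native = np.array([
    [23.0, 1200.0,  300.0],
    [40.0,  200.0, 2500.0],
    [10.0, 5000.0,  100.0],
    [55.0,  800.0,  900.0],
])
vm_cpu = np.array([32.0, 51.0, 21.0, 66.0])   # observed CPU util on the VM

X = np.hstack([np.ones((len(native), 1)), native])   # add the c0 intercept
coeffs, *_ = np.linalg.lstsq(X, vm_cpu, rcond=None)  # least-squares solve
print("model coefficients c0..c3:", np.round(coeffs, 3))

# Applying the fitted model to a traced application's native metrics (step 3).
app_trace = np.array([1.0, 30.0, 900.0, 700.0])
print("predicted VM CPU util:", round(float(app_trace @ coeffs), 1), "%")
```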
  • 56. Computer Measurement Group, India 56 Trace-based approach
How to model the migration from bare-metal to a virtual environment?
Step 3: Capture a trace of the application which is to be migrated, find its metrics Mk, plug them into the above equation and evaluate it. Voila!
Util(cpu on VM) = 4 + 19 * (M1) + 23 * (M2) + ...
Recap:
1. Create platform profiles for native and virtual.
2. Find coefficients which relate native & virtual.
3. Capture the application trace and apply the equation.
Their findings:
1. A single model is not applicable for both Intel and AMD since CPU utilization varies.
2. A feedback loop within the application can distort the performance prediction.
  • 57. Computer Measurement Group, India 57 Conclusion
"All problems in CS can be solved by another level of indirection" - David Wheeler (1927-2004, recipient of the first PhD in Computer Science)
... and the performance problems introduced by indirection require caching, interlayer cooperation and hardware assists (e.g. the TLB cache, EPT, paravirtualization).
"Virtual machines have finally arrived. Dismissed for a number of years as merely academic curiosities, they are now seen as cost-effective techniques for organizing computer systems resources to provide extraordinary system flexibility and support for certain unique applications." [Goldberg, Survey of Virtual Machine Research, 1974]
  • 58. Computer Measurement Group, India 58 References
1. Ganesan et al. Empirical study of performance benefits of hardware-assisted virtualization, 2013.
2. Boutcher and Chandra. Does virtualization make disk scheduling passé?
3. Le et al. Understanding Performance Implications of Nested File Systems in a Virtualized Environment.
4. Jannen. Virtualize storage, not disks.
5. Yu. Xen PV Performance Status and Optimization Opportunities.
6. Barker and Shenoy. Empirical evaluation of latency-sensitive application performance in the cloud.
7. Xu et al. Bobtail: Avoiding long tails in the cloud.
8. Varadarajan et al. Resource-freeing attacks.
9. Bellard. QEMU, a fast and portable dynamic translator.
10. Kiszka. Using KVM as a real-time hypervisor.
11. Mogul. Brittle metrics in operating systems research.
12. Seltzer et al. The Case for Application-Specific Benchmarking.
  • 59. Computer Measurement Group, India 59 References
1. Menasce. Virtualization: Concepts, Applications, and Performance Modeling.
2. Benevenuto et al. Performance Models for Virtualized Applications.
3. Lu et al. Untangling Mixed Information to Calibrate Resource Utilization in Virtual Machines, 2011.
4. Wood. Profiling and Modeling Resource Usage of Virtualized Applications.
  • 60. Computer Measurement Group, India 60 BACKUP SLIDES
  • 61. Computer Measurement Group, India 61 Classification • OS-level virtualization : Does not run any intermediary hypervisor. Modify the OS to support namespaces for networking, processes and filesystem. • Paravirtualization : Guest OS is modified and is aware that it is running inside a hypervisor. • Full virtualization : Guest OS runs unmodified. Hypervisor emulates hardware devices.
  • 62. Computer Measurement Group, India 62 NUMA/SMP
• If you run a monster server VM with many vCPUs, you may have to worry about NUMA scaling. Depending on the NUMA ratio, there is a 30-40% higher cost (latency and throughput) in accessing remote memory.
• The hypervisor must be able to:
1. manually pin a vCPU to a core.
2. export the NUMA topology to the guest OS.
3. do automatic NUMA-aware scheduling of all guest VMs.
• VMWare introduced vNUMA in vSphere 5.
• On Xen, pin Dom0 to a core. In case of NUMA, put the frontend and backend drivers on the same core.
• KVM exports the NUMA topology to the VM but is still lagging on automatic scheduling.
  • 63. Computer Measurement Group, India 63 NUMA/SMP • Depending on NUMA ratio, 30-40% higher cost (latency and throughput) in accessing remote memory • Hypervisor must support ability to pin a vCPU to a core, and also allocate memory from specific NUMA node. • Hypervisor must export NUMA topology (ACPI tables) so guest OS can do its job. • Hypervisor should do automatic NUMA-aware scheduling of all guest VMs. • VMWare introduced vNUMA in vSphere 5. • On Xen, pin Dom0 to a core. In case of NUMA, put frontend and backend drivers on the same core. • KVM exports NUMA topology to VM but it is still lagging on automatic scheduling. • Cross-call overhead : On a SMP machine, when a semaphore is released by one thread, it issues a cross-call or inter-processor interrupt if the waiting threads are sleeping on another core. On a VM, the cross-call becomes a costly privileged op (Akther). Interrupt delivery may also trigger a cross-call.
  • 64. Computer Measurement Group, India 64 Nested CPU scheduling
• Each guest OS runs on one or more virtual CPUs. The hypervisor schedules virtual CPUs on its run queue, and then each guest OS decides which task to run on that virtual CPU.
• This introduces the lock preemption problem: a process in the guest OS may get scheduled out by the hypervisor while holding a spin lock, delaying other processes waiting for that spin lock.
• The guest OS would not schedule out a process holding a spin lock, but the hypervisor is unaware of processes within the guest OS.
• The solution is some form of co-scheduling or “gang scheduling”. VMWare actively seeks to reduce “skew” between multiple vCPUs of the same VM.
  • 65. Computer Measurement Group, India 65 Nested Page tables
• A page fault in the VM may occur because the hypervisor has not allocated RAM to the VM.
• Guest page table: Guest Virtual address -> Hypervisor Virtual.
• Hypervisor page table: Hypervisor Virtual -> Actual RAM.
• Earlier, hypervisors would maintain a “shadow page table” for each guest OS. This function has now been moved into hardware by both Intel and AMD; it is called “Nested page tables”.
• Nested page tables require a costly two-dimensional page walk: for each step that is resolved in the guest table, you have to look up the host table (see the count below).
• The overhead can be alleviated by using “huge pages” and per-VM tags in the TLB cache.
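A back-of-the-envelope count shows why the two-dimensional walk hurts; this is the standard arithmetic for 4-level x86-64 paging, written out as a sketch rather than vendor-measured data:

```python
guest_levels = 4   # guest page-table levels (x86-64)
host_levels  = 4   # nested/EPT page-table levels

# Each of the guest's page-table pointers (plus the final data address) is a
# guest-physical address that must itself be walked through the host tables,
# in addition to reading the guest table entries themselves.
nested_refs = (guest_levels + 1) * (host_levels + 1) - 1
print("native TLB-miss walk :", guest_levels, "memory references")
print("nested TLB-miss walk :", nested_refs, "memory references")   # 24
```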
  • 66. Computer Measurement Group, India 66 Memory overheads & solutions
Balloon driver: take back memory from the guest.
-- VMWare
-- KVM (see virtio_balloon.c in linux_kernel/drivers/virtio)
-- HyperV calls it Dynamic Memory
-- Xen Transcendent Memory
Memory deduplication:
-- Present in System/370 (Smith & Nair)
-- VMWare calls it Transparent Page Sharing (patented)
-- KVM uses KSM (which calls Linux madvise())
-- Xen uses KSM in HVM mode only.
  • 67. Computer Measurement Group, India 67 Quantifying isolation
• Deshane et al (2007) defined BenchVM to test isolation.
• Run normal VMs alongside an overloaded VM and test whether the normal VMs remain responsive.
• On the overloaded VM, you run various stress tests:
1. CPU stress test
2. Memory stress test: calloc and touch memory without free() (a toy version follows below)
3. Network: threaded UDP send and receive
4. Disk: IOzone
5. Fork bomb: test fast process creation and scheduling.
Their conclusion: Full virtualization provides better isolation than container-based virtualization. Their other results may be outdated due to advances in virtualization.
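As an illustration of the memory stress test in item 2, here is a toy Python version; it is my own sketch, not BenchVM code, and should only be run inside a disposable VM:

```python
import time

CHUNK_MB = 64
chunks = []    # held forever on purpose, so nothing is ever freed

def memory_stress(total_mb=512):
    """Allocate and touch memory in CHUNK_MB chunks without freeing it,
    forcing the guest (and eventually the host) to back real pages."""
    allocated = 0
    while allocated < total_mb:
        buf = bytearray(CHUNK_MB * 1024 * 1024)
        for i in range(0, len(buf), 4096):   # touch one byte per 4 KB page
            buf[i] = 1
        chunks.append(buf)
        allocated += CHUNK_MB
        time.sleep(0.1)                      # pace the allocations a little
    return allocated

if __name__ == "__main__":
    print("allocated and touched", memory_stress(), "MB")
```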
  • 68. Computer Measurement Group, India 68 VM exits are costly • Interrupt processing causes context switches between VM and hypervisor. • KVM EOI optimization : guest IDT (interrupt descriptor table) is shadowed. • VMWare detects cluster of instructions that can cause guest exits. • Use combination of polling and interrupt (NAPI)
  • 69. Computer Measurement Group, India 69 mClock • Disk capacity varies dynamically and cannot be statically allocated like CPU or RAM. • A proportional-sharing algorithm is needed to reserve disk capacity. • Gulati et al propose a dynamic algorithm which interleaves two schedulers and uses three tags with every IO request.
  • 70. Computer Measurement Group, India 70 Hadoop benchmark • VMWare : • HyperV (conducted on HDInsight – Microsoft's version of Hadoop) : • KVM: • Xen: (Netflix runs map-reduce on AWS)
  • 71. Computer Measurement Group, India 71 HPC/Scientific benchmark • VMWare paper: SPEC MPI and SPEC OMP. • Xen: Jackson et al (2010) ran NERSC benchmarks on Amazon EC2: six times slower than a Linux cluster and 20 times slower than a modern HPC system. The EC2 interconnect severely limits performance. They could not use processor-specific compiler options due to the heterogeneous mix of CPUs across nodes. • In Jun 2010, Amazon launched “Cluster Compute Nodes”, which are basically nodes running Xen in HVM mode connected via 10G Ethernet (no InfiniBand yet). • KVM and OpenVZ: Regola (2010) ran NPB on these nodes.
  • 72. Computer Measurement Group, India 72 Realtime benchmark • In order to minimize jitter and limit the worst-case latency, a realtime system must provide mechanisms for resource reservation, process preemption and prevention of priority inversion. • Soft realtime (VoIP) vs hard realtime: soft means 20ms jitter between packets is acceptable. • RT-Xen. • Kiszka: KVM – QEMU driver lock.
  • 73. Computer Measurement Group, India 73 Layered Queueing network (Xen)
Total response time R = R(vcpu) + R(dom0_cpu) + R(disk)
Response time = Demand / [1 - Utilization]:
R(vcpu) = D(vcpu) / [1 - U(vcpu)]
R(dom0_cpu) = D(dom0_cpu) / [1 - U(dom0_cpu)]
R(disk) = D(disk) / [1 - U(disk)]
Util = λ * D = Arrival rate * Demand
D(vm_cpu) = D(isol_cpu) * S(cpu) / P(vm), where S = slowdown, P = speedup
D(dom0_cpu) = D(vm_cpu) * Cost(dom0_vm) / P(dom0)
Cost(dom0_vm) = B(dom0_cpu) / B(vm_cpu), where B = busy time
Slowdown(cpu) = B(vm_cpu) / B(bare_cpu)
Slowdown(disk) = B(vm_disk) / B(bare_disk)