(CMP402) Amazon EC2 Instances Deep Dive

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
John Phillips, Principal Product Manager, Amazon EC2
October 7, 2015
CMP402
Amazon EC2 Instances Deep Dive
Delivering System Performance

Amazon Elastic Compute Cloud is Big
EC2: Instances, APIs, Networking, Purchase Options

Amazon EC2 Instances
[Diagram labels: Host Server, Hypervisor, Guest 1, Guest 2, Guest n]

Amazon EC2 Instances History
[Timeline, 2006 to 2016, instance types in order of appearance]
m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge
m2.xlarge, m2.4xlarge, m2.2xlarge, cc1.4xlarge, t1.micro, cg1.4xlarge
cc2.8xlarge, m1.medium, hi1.4xlarge, m3.xlarge, m3.2xlarge, hs1.8xlarge, cr1.8xlarge
c3.large, c3.xlarge, c3.2xlarge, c3.4xlarge, c3.8xlarge, g2.2xlarge
i2.xlarge, i2.2xlarge, i2.4xlarge, i2.8xlarge, m3.medium, m3.large
r3.large, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge
t2.micro, t2.small, t2.medium
c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge
d2.xlarge, d2.2xlarge, d2.4xlarge, d2.8xlarge, g2.8xlarge, t2.large
m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m4.10xlarge

What to Expect from the Session
• Defining system performance and how it is
characterized for different workloads
• How Amazon EC2 instances deliver performance
while providing flexibility and agility
• How to make the most of your EC2 instance experience
through the lens of several instance types

Hiring a Server
• Servers are hired to do jobs
• Performance is measured differently depending on the job

Defining Performance: Perspective Matters
• What performance means depends on your perspective:
– Response time
– Throughput
– Consistency
[Diagram labels: Workload; Application, System libraries, System calls, Kernel, Devices]

Simple Performance Model for Single Thread
• Using CPU: executing (in user mode)
• Not using CPU: waiting for turn on CPU, waiting for disk or
network I/O, thread locks, memory paging, or for more work.
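
The two states above can be observed from inside a program by comparing wall-clock time with the CPU time charged to the thread. The following is a minimal sketch, not part of the original talk; the helper names `measure`, `busy`, and `idle` are hypothetical, and the on-CPU figure here includes both user and kernel time for the calling thread.

```python
import time


def measure(fn):
    """Run fn once and return (wall, on_cpu, off_cpu) in seconds.

    on_cpu is the CPU time charged to the calling thread (user + kernel).
    off_cpu is everything else: waiting for a turn on the CPU, I/O,
    locks, paging, or simply sleeping.
    """
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()
    fn()
    on_cpu = time.thread_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, on_cpu, max(0.0, wall - on_cpu)


def busy():
    """CPU-bound work: time is mostly spent executing."""
    sum(i * i for i in range(200_000))


def idle():
    """Waiting work: time is mostly spent off the CPU."""
    time.sleep(0.05)
```

Calling `measure(busy)` should report most of the wall time as on-CPU, while `measure(idle)` should report almost all of it as off-CPU.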

Performance Factors
Resource           Performance factors                                              Key indicators
CPU                Sockets, number of cores, clock frequency, bursting capability   CPU utilization, run queue length
Memory             Memory capacity                                                  Free memory, anonymous paging, thread swapping
Network interface  Max bandwidth, packet rate                                       Receive throughput, transmit throughput over max bandwidth
Disks              Input / output operations per second, throughput                 Wait queue length, device utilization, device errors

Resource Utilization
• For a given level of performance, how efficiently are resources being used?
• Something at 100% utilization can’t accept any more work
• Low utilization can indicate that more resources are being purchased
than needed

Example: Web Application
• MediaWiki installed on Apache with 140 pages of content
• Load increased in intervals over time

• Memory stats

• Disk stats

• Network stats

• CPU stats

Instance Selection = Performance Tuning
• Picking an instance is tantamount to resource performance tuning
• Give back instances as easily as you can acquire new ones
• Find an ideal instance type and workload combination

Delivering Compute Performance with

CPU Instructions and Protection Levels
Kernel
Application
• CPU has at least two protection levels; on x86, the kernel normally
runs in ring 0 and applications in ring 3
• Privileged instructions can’t be executed in user mode, to protect
the system. Applications leverage system calls to the kernel.

Example: Web application system calls

x86 CPU Virtualization: Prior to Intel VT-x
VMM
Application
Kernel
PV
• Binary translation for privileged instructions
• Para-virtualization (PV)
• PV requires going through the VMM, adding latency
• Applications that are system call bound are most affected

x86 CPU Virtualization: After Intel VT-x
Kernel
Application
VMM
PV-HVM
• Hardware assisted virtualization (HVM)
• PV-HVM uses PV drivers opportunistically for operations that are
slow when emulated, e.g. network and block I/O

Time Keeping Explained
• Time keeping in an instance is deceptively hard
• gettimeofday(), clock_gettime(), QueryPerformanceCounter()
• The TSC
• CPU counter, accessible from userspace
• Requires calibration, vDSO
• Invariant on Sandy Bridge+ processors
• Xen pvclock; does not support vDSO
• On current generation instances, use TSC as clocksource
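
On Linux, the active and available clocksources are exposed through sysfs, so the last bullet can be checked from inside the instance. The sketch below is illustrative only; the function names `read_clocksource` and `recommend` are hypothetical, and it assumes the standard sysfs location shown in `CLOCKSOURCE_DIR`.

```python
from pathlib import Path

# Standard Linux sysfs location for the system clocksource.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current, available); (None, []) if sysfs is not readable."""
    base = Path(base)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


def recommend(current, available):
    """Turn the sysfs values into a short human-readable hint."""
    if current == "tsc":
        return "already using tsc"
    if "tsc" in available:
        return "tsc is available; consider switching from %s" % current
    return "tsc is not offered by this kernel"
```

Switching the clocksource is done by writing the new name to `current_clocksource` as root; the sketch only reads the values and leaves that decision to the operator.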

CPU Performance and Scheduling
• Hypervisor ensures every guest receives CPU time
• Fixed allocation
• Uncapped vs. capped
• Variable allocation
• Different schedulers can be used depending on the goal
• Fairness
• Response time / deadline
• Shares

Review: C4 Instances
Custom Intel E5-2666 v3 at 2.9 GHz
P-state and C-state controls
Model vCPU Memory (GiB) EBS (Mbps)
c4.large 2 3.75 500
c4.xlarge 4 7.5 750
c4.2xlarge 8 15 1,000
c4.4xlarge 16 30 2,000
c4.8xlarge 36 60 4,000

What’s new in C4: P-state and C-state control
• By entering deeper idle states, non-idle cores can achieve
up to 300 MHz higher clock frequencies
• But… deeper idle states require more time to exit, and may not
be appropriate for latency-sensitive workloads
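
The exit-time trade-off can be inspected on Linux through the cpuidle sysfs tree, which lists each idle state with its exit latency in microseconds. This is a rough sketch under the assumption that the instance exposes cpuidle information for cpu0 (not every instance type or kernel does); `list_cstates` and `states_within` are hypothetical helper names.

```python
from pathlib import Path

# Standard Linux cpuidle location for the first CPU.
CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpu0/cpuidle")


def list_cstates(cpuidle_dir=CPUIDLE_DIR):
    """Return [(name, exit_latency_us), ...] sorted by exit latency."""
    cpuidle_dir = Path(cpuidle_dir)
    states = []
    if not cpuidle_dir.is_dir():
        return states
    for state_dir in cpuidle_dir.glob("state[0-9]*"):
        try:
            name = (state_dir / "name").read_text().strip()
            latency_us = int((state_dir / "latency").read_text())
        except (OSError, ValueError):
            continue  # skip states that cannot be read or parsed
        states.append((name, latency_us))
    states.sort(key=lambda item: item[1])
    return states


def states_within(states, max_latency_us):
    """Names of idle states whose exit latency fits the given budget."""
    return [name for name, latency in states if latency <= max_latency_us]
```

A latency-sensitive workload might compare its own latency budget against this list before deciding how deep an idle state to allow.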

Tip: P-state control for AVX2
• If an application makes heavy use of AVX2 on all cores, the
processor may attempt to draw more power than it should
• Processor will transparently reduce frequency
• Frequent changes of CPU frequency can slow an application

Review: T2 Instances
• Lowest cost EC2 Instance at $0.013 per hour
• Burstable performance
• Fixed allocation enforced with CPU Credits
Model      vCPU  CPU Credits / Hour  Memory (GiB)  Storage
t2.micro   1     6                   1             EBS Only
t2.small   1     12                  2             EBS Only
t2.medium  2     24                  4             EBS Only
t2.large   2     36                  8             EBS Only

How Credits Work
• A CPU Credit provides the performance of a full CPU core for
one minute
• An instance earns CPU credits at a steady rate
• An instance consumes credits when active
• Credits expire (leak) after 24 hours
[Diagram labels: Baseline Rate, Burst Rate, Credit Balance]
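
The credit rules can be turned into a small back-of-the-envelope model. The sketch below uses the credits-per-hour figures from the T2 table; the function names are hypothetical, and two simplifications are assumed: the 24-hour expiry is approximated as a cap of 24 hours of earnings, and any initial launch credits are ignored.

```python
# Credits earned per hour, taken from the T2 table in this deck.
CREDITS_PER_HOUR = {
    "t2.micro": 6,
    "t2.small": 12,
    "t2.medium": 24,
    "t2.large": 36,
}


def baseline_core_fraction(instance_type):
    """Fraction of one full core sustainable forever (1 credit = 1 core-minute)."""
    return CREDITS_PER_HOUR[instance_type] / 60.0


def simulate_balance(instance_type, core_minutes_per_hour, start_balance=0.0):
    """Return the credit balance at the end of each hour.

    core_minutes_per_hour lists the demand in each hour, in full-core
    minutes. Assumption: expiry is modeled as a 24-hour earnings cap.
    """
    earn = CREDITS_PER_HOUR[instance_type]
    cap = earn * 24
    balance = float(start_balance)
    history = []
    for demand in core_minutes_per_hour:
        balance = min(cap, balance + earn)  # earn at a steady rate
        balance -= min(balance, demand)     # spend only what is available
        history.append(balance)
    return history
```

Under these assumptions a t2.micro sustains 10% of a core, and `simulate_balance("t2.micro", [0, 0, 12])` gives balances of 6, 12, and 6 credits at the end of the three hours.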

Tip: Monitor CPU credit balance

Monitoring CPU Performance in Guest
• Indicators that work is being done
• User time
• System time (kernel mode)
• Wait I/O, threads blocked on disk I/O
• Else, Idle
• What happens if OS is scheduled off the CPU?

Tip: How to interpret Steal Time
• Fixed CPU allocations can be offered through
CPU caps
• Steal time happens when CPU cap is enforced
• Leverage CloudWatch metrics
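
Inside a Linux guest, the user, system, I/O wait, idle, and steal counters all appear on the aggregate `cpu` line of `/proc/stat`. As a minimal sketch (helper names are hypothetical), two samples of that line can be compared to get the share of time spent in each state over the interval:

```python
# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of ticks."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))  # older kernels lack steal
    return dict(zip(FIELDS, values))


def percentages(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}
```

A typical use is to read the first line of `/proc/stat`, sleep for a few seconds, read it again, and watch the `steal` percentage; a sustained non-zero value suggests the guest wanted CPU time it was not given.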

Delivering I/O Performance with

I/O and Devices Virtualization
• Scheduling I/O requests between virtual devices and
shared physical hardware
• Split driver model
• Intel VT-d
• Direct pass through and IOMMU for dedicated devices
• Enhanced Networking

Split Driver Model
[Diagram: a driver domain (backend driver, device driver) and guest domains (frontend drivers) sit above the VMM, which provides virtual CPU, virtual memory, and CPU scheduling. Hardware underneath: physical CPU, physical memory, network device.]

Split Driver Model
• Each virtual device has two main components
• Communication ring buffer
• An event channel signaling activity in the ring buffer
• Data is transferred through shared pages
• Shared pages require inter-domain permissions, or granting
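
The ring-plus-event-channel idea can be illustrated with a toy model. This is not Xen's actual data structure or API; `SharedRing` and its methods are hypothetical names, and the `notify` callback stands in for the event channel.

```python
class SharedRing:
    """Toy fixed-size request ring with a notification hook.

    The frontend pushes requests and signals activity; the backend
    drains whatever is pending when it is notified.
    """

    def __init__(self, size, notify):
        self.slots = [None] * size
        self.size = size
        self.prod = 0         # next slot the frontend will fill
        self.cons = 0         # next slot the backend will read
        self.notify = notify  # stand-in for the event channel

    def push(self, request):
        """Frontend side: queue a request, or return False if the ring is full."""
        if self.prod - self.cons == self.size:
            return False
        self.slots[self.prod % self.size] = request
        self.prod += 1
        self.notify()
        return True

    def drain(self):
        """Backend side: consume every pending request in order."""
        pending = []
        while self.cons < self.prod:
            pending.append(self.slots[self.cons % self.size])
            self.cons += 1
        return pending
```

In the real split driver model the request slots describe shared pages rather than carrying the data itself, which is where granting comes in.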

Review: I2 Instances
16 vCPU: 3.2 TB SSD; 32 vCPU: 6.4 TB SSD
365K random read IOPS for 32 vCPU instance
Model       vCPU  Memory (GiB)  Storage (GB)  Read IOPS  Write IOPS
i2.xlarge   4     30.5          1 x 800 SSD   35,000     35,000
i2.2xlarge  8     61            2 x 800 SSD   75,000     75,000
i2.4xlarge  16    122           4 x 800 SSD   175,000    155,000
i2.8xlarge  32    244           8 x 800 SSD   365,000    315,000

Granting in pre-3.8.0 Kernels
• Requires “grant mapping” prior to 3.8.0
• Grant mappings are expensive operations due to TLB flushes
read(fd, buffer,…)

Granting in 3.8.0+ Kernels, Persistent and Indirect
• Grant mappings are set up in a pool once
• Data is copied in and out of the grant pool
[Diagram: read(fd, buffer…) copies to and from the grant pool]

Tip: Use 3.8+ kernel
• Amazon Linux 13.09 or later
• Ubuntu 14.04 or later
• RHEL7 or later
• Etc.
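
A quick way to apply this tip is to compare the running kernel release against 3.8. The sketch below uses hypothetical function names; note that the version number is only a proxy, since distribution kernels may backport or omit features.

```python
import platform
import re


def kernel_version(release=None):
    """Return (major, minor) parsed from a kernel release string."""
    if release is None:
        release = platform.release()  # e.g. "3.13.0-48-generic" on Linux
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def is_3_8_or_later(release=None):
    """True when the kernel release is 3.8 or newer."""
    return kernel_version(release) >= (3, 8)
```

For example, `is_3_8_or_later("2.6.32-504.el6.x86_64")` is false, while `is_3_8_or_later("3.13.0-48-generic")` is true.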

Event Handling
• Guest vCPUs are interrupted to process events.
• Pre-2.6.36 kernels: notifications went to a single virtual
hardware interrupt
• Post-2.6.36 kernels: allow instance to tell hypervisor to deliver
notification to a specific vCPU for balancing
• Check "dmesg" for the following text: "Xen HVM callback vector for
event delivery is enabled"
• Also, check that the irqbalance version is 1.0.7 or higher
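
Both checks can be scripted once the `dmesg` output and the irqbalance version string have been captured. This is a minimal sketch with hypothetical function names; how the version string is obtained (package manager, for instance) is left to the reader.

```python
import re

# Message quoted on the slide above.
CALLBACK_TEXT = "Xen HVM callback vector for event delivery is enabled"


def callback_vector_enabled(dmesg_text):
    """True if the kernel log reports per-vCPU event delivery."""
    return CALLBACK_TEXT in dmesg_text


def irqbalance_ok(version, minimum=(1, 0, 7)):
    """True if a dotted version string is at least `minimum`.

    Only the first three numeric components are compared, so package
    suffixes such as "-1ubuntu0.14.04" are ignored.
    """
    numbers = tuple(int(n) for n in re.findall(r"\d+", version)[:3])
    numbers += (0,) * (3 - len(numbers))
    return numbers >= minimum
```

The kernel log text can be captured with `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, which may require elevated privileges on some systems.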

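Split Driver Model: Networking
[Diagram: the application's sockets in the guest use a frontend driver, which is served by the backend driver and device driver in the driver domain. The VMM provides virtual CPU, virtual memory, and CPU scheduling. Hardware underneath: physical CPU, physical memory, network device.]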
Device Pass Through: Enhanced Networking
• SR-IOV eliminates need for driver domain
• Physical network device exposes virtual function to
instance
• Requires a specialized driver, which means:
• Your instance OS needs to know about it
• EC2 needs to be told your instance can use it

After Enhanced Networking
[Diagram: the application's sockets in the guest use a NIC driver attached to the SR-IOV network device. Remaining labels: frontend driver, backend driver, device driver; VMM with virtual CPU, virtual memory, and CPU scheduling; physical CPU and physical memory.]

Tip: Use Enhanced Networking
• Highest packets-per-second
• Lowest variance in latency
• Instance OS must support it
• Look for SR-IOV property of instance or image
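
From inside a Linux instance, one way to see which driver an interface is using is the output of `ethtool -i <interface>`. The sketch below parses that output; the function names are hypothetical, and the default driver list is an assumption (`ixgbevf` is the Intel virtual function driver commonly associated with SR-IOV on EC2 at the time of this talk).

```python
def parse_ethtool_info(text):
    """Parse 'key: value' lines from `ethtool -i` output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def uses_sriov_vf_driver(text, vf_drivers=("ixgbevf",)):
    """True if the reported driver is one of the assumed VF drivers."""
    return parse_ethtool_info(text).get("driver") in vf_drivers
```

An interface still on the paravirtual path typically reports `driver: vif` instead, in which case the function returns false.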

Instance Selection = Performance Tuning
• Find an instance type and workload combination
– Define performance
– Monitor resource utilization
– Make changes

Recap: Getting the Most Out of EC2 Instances
• PV-HVM
• Time keeping: use TSC
• C-state and P-state controls
• Monitor T2 CPU credits
• Persistent grants for I/O performance
• Event callbacks and IRQ balancing
• Enhanced Networking

Next steps
• Visit the EC2 Instance Documentation
• Come visit us in the Developer Chat to hear more

Remember to complete
your evaluations!
