© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Adam Boeglin, HPC Solutions Architect
Wednesday, May 3, 2023
Choosing the Right EC2 Instance and Applicable Use Cases
Understanding the factors that going into choosing an EC2 instance
Defining system performance and how it is characterized for different workloads
How Amazon EC2 instances deliver performance while providing flexibility and agility
How to make the most of your EC2 instance experience through the lens of several instance types
What to Expect from the Session
InstancesAPI
Networking
EC2EC2
Purchase options
Amazon Elastic Compute Cloud is Big
Host ServerHypervisor
Guest 1 Guest 2 Guest n
Amazon EC2 Instances
In the past
First launched in August 2006 M1 Instance
“One size fits all” M1
2006 2008 2010 2012 2014 2016
m1.small
m1.largem1.xlarge
c1.mediumc1.xlarge
m2.xlarge
m2.4xlargem2.2xlarge
cc1.4xlarge
t1.micro
cg1.4xlarge
cc2.8xlarge
m1.medium
hi1.4xlarge
m3.xlargem3.2xlarge
hs1.8xlarge
cr1.8xlarge
c3.largec3.xlarge
c3.2xlargec3.4xlargec3.8xlargeg2.2xlarge
i2.xlargei2.2xlargei2.4xlargei2.4xlarge
m3.mediumm3.large
r3.larger3.xlarger3.2xlarge
r3.4xlarger3.8xlarge
t2.microt2.smallt2.med
c4.largec4.xlargec4.2xlargec4.4xlargec4.8xlarge
d2.xlarged2.2xlarged2.4xlarged2.8xlarge
g2.8xlarge
t2.largem4.large
m4.xlargem4.2xlargem4.4xlarge
m4.10xlarge
Amazon EC2 Instances History
x1.32xlarge
t2.nano
Instance generation
c4.largeInstance family
Instance size
Choices and Flexibility
Choice of Processor Memory Storage Options Accelerated Graphics Burstable Performance
Servers are hired to do jobs Performance is measured differently depending on the job
Hiring a Server
?
Performance Factors
Resource Performance factors Key indicatorsCPU Sockets, number of cores, clock
frequency, bursting capabilityCPU utilization, run queue length
Memory Memory capacity Free memory, anonymous paging, thread swapping
Network interface
Max bandwidth, packet rate Receive throughput, transmit throughput over max bandwidth
Disks Input / output operations per second, throughput
Wait queue length, device utilization, device errors
Resource Utilization
For given performance, how efficiently are resources being used
Something at 100% utilization can’t accept any more work
Low utilization can indicate more resource is being purchased than needed
Example: Web Application MediaWiki installed on Apache with 140 pages of content Load increased in intervals over time
Example: Web Application Memory stats
Example: Web Application Disk stats
Example: Web Application Network stats
Example: Web Application CPU stats
“Launching new instances and running tests in parallel is easy…[when choosing an instance] there is no substitute for measuring the performance of your full application.”
- EC2 Documentation
How not to choose an EC2 instance
Brute Force Testing Ignoring Metrics Favoring old generation instances Guessing based on what you already have
EC2 Instance Families
General purpose
Computeoptimized
C3
Storage and IOoptimized
I2 G2
GPUenabled
Memoryoptimized
R3C4M4D2
X1
Give back instances as easily as you can acquire new ones Find an ideal instance type and workload combination EC2 Instance Pages provide “Use Case” Guidance With EBS, storage and instance size don’t need to be coupled
Instance Selection = Performance Tuning
Instance sizing
c4.8xlarge 2 - c4.4xlarge
≈
4 - c4.2xlarge
≈
8 - c4.xlarge
≈
Choosing the right size
Understand your unit of work Web request Database / Table Batch Process
What is that unit’s requirements? CPU threads Memory Constraints Disk & Network
What are it’s availability requirements?
CPU Instructions and Protection Levels
CPU has at least two protection levels. Privileged instructions can’t be executed in user mode to protect
system. Applications leverage system calls to the kernel.
Kernel
Application
VMM
Application
Kernel
PV
X86 CPU Virtualization: Prior to Intel VT-x
Binary translation for privileged instructions Para-virtualization (PV)
PV requires going through the VMM, adding latency Applications that are system call bound are most affected
KernelApplication
VMM
PV-HVM
X86 CPU Virtualization: After Intel VT-x
Hardware assisted virtualization (HVM) PV-HVM uses PV drivers opportunistically for operations that
are slow emulated: e.g., network and block I/O
Tip: Use HVM AMIs with EBS
Time Keeping Explained
Time keeping in an instance is deceptively hard gettimeofday(), clock_gettime(), QueryPerformanceCounter() The TSC
CPU counter, accessible from userspace Requires calibration, vDSO Invariant on Sandy Bridge+ processors
Xen pvclock; does not support vDSO On current generation instances, use TSC as clocksource
# cat /sys/devices/system/cl*/cl*/available_clocksourcexen tsc hpet acpi_pm # cat /sys/devices/system/cl*/cl*/current_clocksourcetsc
# vi /boot/grub/menu.list && reboot
# cat /proc/cmdlineroot=/dev/xvda1 ro clock source=tsc
Change with:
Tip: Use TSC as clocksource
Review: C4 Instances Custom Intel E5-2666 v3 at 2.9 GHz P-state and C-state controls
Model vCPU Memory (GiB) EBS (Mbps)c4.large 2 3.75 500c4.xlarge 4 7.5 750c4.2xlarge 8 15 1,000c4.4xlarge 16 30 2,000c4.8xlarge 36 60 4,000
Batch & HPC workloads, Game Servers, Ad Serving, & High Traffic Web Servers
What’s new in C4: P-state and C-state control
Intel Turbo Boost up to 3.5Ghz By entering deeper idle states, non-idle cores can achieve up to 300MHz
higher clock frequencies But… deeper idle states require more time to exit, may not be appropriate
for latency-sensitive workloads
Tip: P-state control for AVX2
If an application makes heavy use of AVX2 on all cores, the processor may attempt to draw more power than it should
Processor will transparently reduce frequency Frequent changes of CPU frequency can slow an application
Review: T2 Instances Lowest cost EC2 instance at $0.0065 per hour Burstable performance Fixed allocation enforced with CPU credits
Model vCPU Baseline CPU Credits / Hour
Memory (GiB)
Storage
t2.nano 1 5% 3 .5 EBS Onlyt2.micro 1 10% 6 1 EBS Onlyt2.small 1 20% 12 2 EBS Onlyt2.medium 2 40%** 24 4 EBS Onlyt2.large 2 60%** 36 8 EBS Only
General Purpose, Web Serving, Developer Environments, Small Databases
How Credits Work
A CPU credit provides the performance of a full CPU core for one minute
An instance earns CPU credits at a steady rate An instance consumes credits when active Credits expire (leak) after 24 hours
Baseline rate
Credit balance
Burstrate
Tip: Monitor CPU credit balance
Tip: How to Interpret Steal Time
Fixed CPU allocations of CPU can be offered through CPU caps Steal time happens when CPU cap is enforced Leverage CloudWatch metrics
Announced: X1 Instances
Largest memory instance with 2TB of DRAM Quad socket, Intel E7 processors with 128 vCPUs
Model vCPU Memory (GiB) Local Storage
x1.32xlarge 128 1952 2x 1920GB
In-Memory Databases, Big Data Processing, HPC Workloads
NUMA
Non-uniform memory access Each processor in a multi-CPU system has local memory that is
accessible through a fast interconnect Each processor can also access memory from other CPUs, but
local memory access is a lot faster than remote memory Performance is related to the number of CPU sockets and how they
are connected - Intel QuickPath Interconnect (QPI)
QPI
122GB 122GB
16 vCPU’s 16 vCPU’s
r3.8xlarge
QPI
QPI
QPIQPI
QPI
488GB
488GB
488GB
488GB
32 vCPU’s 32 vCPU’s
32 vCPU’s 32 vCPU’s
x1.32xlarge
Tip: Kernel Support for NUMA Balancing
An application will perform best when the threads of its processes are accessing memory on the same NUMA node.
NUMA balancing moves tasks closer to the memory they are accessing. This is all done automatically by the Linux kernel when automatic NUMA
balancing is active: version 3.8+ of the Linux kernel. Windows support for NUMA first appeared in the Enterprise and Data
Center SKUs of Windows Server 2003.
Review: I2 Instances 16 vCPU: 3.2 TB SSD; 32 vCPU: 6.4 TB SSD 365K random read IOPS for 32 vCPU instance
Model vCPU Memory (GiB)
Storage Read IOPS Write IOPS
i2.xlarge 4 30.5 1 x 800 SSD 35,000 35,000i2.2xlarge 8 61 2 x 800 SSD 75,000 75,000i2.4xlarge 16 122 4 x 800 SSD 175,000 155,000i2.8xlarge 32 244 8 x 800 SSD 365,000 315,000
NoSQL Databases, Clustered Databases, Online Transaction Processing (OLTP)
Hardware
Split Driver ModelDriver Domain Guest Domain Guest Domain
VMM
Frontend driver
Frontend driver
Backend driver
DeviceDriver
Physical CPU
Physical Memory
Network Device
Virtual CPU Virtual Memory
CPU Scheduling
Sockets
Application1
23
4
5
Granting in pre-3.8.0 Kernels
Requires “grant mapping” prior to 3.8.0 Grant mappings are expensive operations due to TLB flushes
SSD
Inter domain I/O:(1) Grant memory(2) Write to ring buffer(3) Signal event(4) Read ring buffer(5) Map grants(6) Read or write grants(7) Unmap grants
read(fd, buffer,…)
I/O domain Instance
Granting in 3.8.0+ Kernels, Persistent and Indirect
Grant mappings are setup in a pool once Data is copied in and out of the grant pool
SSD
read(fd, buffer…)
I/O domain Instance
Grant pool
Copy to and from grant pool
2009 – Longer ago than you think Avatar was the top movie in the theaters Facebook overtook MySpace in active users President Obama was sworn into office
The 2.6.32 Linux kernel was released
Tip: Use 3.8+ kernel
Amazon Linux 13.09 or later Ubuntu 14.04 or later RHEL/Centos 7 or later Etc.
Device Pass Through: Enhanced Networking
SR-IOV eliminates need for driver domain Physical network device exposes virtual function to instance Requires a specialized driver, which means:
Your instance OS needs to know about it EC2 needs to be told your instance can use it
Hardware
After Enhanced NetworkingDriver Domain Guest Domain Guest Domain
VMM
NIC Driver
Physical CPU
Physical Memory
SR-IOV Network Device
Virtual CPU Virtual Memory
CPU Scheduling
Sockets
Application1
2
3
NIC Driver
Elastic Network Adapter
Next Generation of Enhanced Networking Hardware Checksums Multi-Queue Support Receive Side Steering
20Gbps in a Placement Group New Open Source Amazon Network Driver
EBS Performance
Instance Size Matters Match your volume size and
type to your instance Use EBS Optimization if EBS
performance is important
Choose HVM AMI’s Time keeping: use TSC C state and P state controls Monitor T2 CPU credits Use a modern Linux kernel NUMA balancing Persistent grants for I/O performance Enhanced networking
Summary: Getting the Most Out of EC2 Instances
Bare metal performance goal, and in many scenarios already there History of eliminating hypervisor intermediation and driver domains
Hardware assisted virtualization Scheduling and granting efficiencies Device pass through
Virtualization Themes
Next steps
Visit the Amazon EC2 documentation Launch an instance and try your app!
Thank you!