(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads | AWS re:Invent 2014
"Not only did the 156,000+ core run (nicknamed the MegaRun) on Amazon EC2 break industry records for size, scale, and power, but it also delivered real-world results. The University of Southern California ran the high-performance computing job in the cloud to evaluate over 220,000 compounds and build a better organic solar cell. In this session, USC provides an update on the six promising compounds that we have found and is now synthesizing in laboratories for a clean energy project. We discuss the implementation of and lessons learned in running a cluster in eight AWS regions worldwide, with highlights from Cycle Computing's project Jupiter, a low-overhead cloud scheduler and workload manager. This session also looks at how the MegaRun was financially achievable using the Amazon EC2 Spot Instance market, including an in-depth discussion on leveraging Spot Instances, a strategy to deal with the variability of Spot pricing, and a template to avoid compromising workflow integrity, security, or management.
After a year of production workloads on AWS, HGST, a Western Digital Company, has zeroed in on understanding how to create on-demand clusters to maximize value on AWS. HGST will outline the company's successes in addressing the company's changes in operations, culture, and behavior to this new vision of on-demand clusters. In addition, the session will provide insights into leveraging Amazon EC2 Spot Instances to reduce costs and maximize value, while maintaining the needed flexibility, and agility that AWS is known for.andquot;
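The abstract's Spot strategy boils down to bidding across many regions under a hard price cap. Below is a minimal, hypothetical sketch of that idea using today's boto3 SDK (which postdates this 2014 talk); the region list, AMI IDs, instance type, instance count, and bid cap are all illustrative placeholders, not the values or tooling Cycle Computing actually used.

```python
# Hypothetical sketch: spread Spot requests across regions under a price cap.
import boto3

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]      # subset of the 8 regions
AMI_BY_REGION = {"us-east-1": "ami-11111111",          # placeholder AMI IDs
                 "us-west-2": "ami-22222222",
                 "eu-west-1": "ami-33333333"}
MAX_BID = "0.30"            # USD/hour cap keeps workflow cost predictable
INSTANCE_TYPE = "c3.8xlarge"

for region in REGIONS:
    ec2 = boto3.client("ec2", region_name=region)
    # Check recent pricing so we only bid where Spot is currently cheap.
    history = ec2.describe_spot_price_history(
        InstanceTypes=[INSTANCE_TYPE],
        ProductDescriptions=["Linux/UNIX"],
        MaxResults=1,
    )
    prices = history["SpotPriceHistory"]
    if not prices or float(prices[0]["SpotPrice"]) >= float(MAX_BID):
        continue  # skip regions where Spot already exceeds our cap
    ec2.request_spot_instances(
        SpotPrice=MAX_BID,
        InstanceCount=100,
        LaunchSpecification={
            "ImageId": AMI_BY_REGION[region],
            "InstanceType": INSTANCE_TYPE,
        },
    )
```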
"
Similar to (BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads | AWS re:Invent 2014
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
Similar to (BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads | AWS re:Invent 2014 (20)
WordPress Websites for Engineers: Elevate Your Brand
(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads | AWS re:Invent 2014
1. November 14, 2014 | Las Vegas, NV
Jason Stowe, Cycle Computing
Patrick Saris, USC
David Hinz, HGST
5. Jevons Paradox
•UK in the 1860s: “we need a fixed amount of steam power”
•People thought:
More efficient coal use = use less coal
•Jevons disagreed!
6. Jevons Paradox
•Jevons was contrarian:
Increasing efficiency in turning coal to steam, making the interface simpler to consume, radically increases demand.
7. Cloud helps capacity…
Fixed clusters are:
Too small when needed most,
Too large every other time…
But this work is hard to move: data scheduling, encryption, multi-AZ, security, etc.
Cycle powers access at scale
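Slide 7's elasticity argument can be made concrete with a small sketch: size the cluster to the queue rather than to a fixed footprint. The function name, jobs-per-node ratio, and thresholds below are illustrative assumptions, not Cycle's actual scheduler logic.

```python
# Minimal sketch of elasticity: scale node count with demand.
def desired_nodes(queued_jobs: int, jobs_per_node: int = 32,
                  min_nodes: int = 0, max_nodes: int = 16788) -> int:
    """Size the cluster to the queue; a fixed cluster returns a constant."""
    needed = -(-queued_jobs // jobs_per_node)   # ceiling division
    return max(min_nodes, min(needed, max_nodes))

print(desired_nodes(0))         # 0     -> an idle cloud cluster costs nothing
print(desired_nodes(500_000))   # 15625 -> burst capacity when needed most
```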
8. Cycle solutions help access
[Diagram: a drug designer's internal cluster (500 servers, 100% full) bursts molecule data through a data workflow and cloud orchestration layer to a 10,600-server cluster container in the cloud]
•40 years of drug design in 9 hours
•3 new compounds, $4,372 in Spot
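A quick back-of-envelope check on slide 8's numbers; this is a rough estimate that assumes all 10,600 servers ran the full 9 hours, ignoring ramp-up and ramp-down.

```python
# Sanity-check the implied Spot cost per server-hour.
servers, hours, cost = 10_600, 9, 4_372
server_hours = servers * hours                               # 95,400 server-hours
print(f"~${cost / server_hours:.3f} per server-hour on Spot")  # ~$0.046
```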
9. Thanks to cloud, people can:
•Ask the right questions
•Get better answers, faster
10. Record Scale, Enterprise Speed
•Very innovative work by:
–Patrick Saris, USC
–David Hinz, HGST
•Both will show the importance of:
–Asking the right question, regardless of scale
–Getting results faster to increase throughput
11. November 14, 2014 | Las Vegas, NV
Patrick Saris, University of Southern California
12. U.S. energy mix (chart):
•Fossil Fuels: 79%
•Nuclear: 10%
•Renewables: 11% (Biomass 5.6%, Hydroelectric 3.1%, Wind 2.0%, Solar 0.4%, Geothermal 0.3%)
Source: U.S. Energy Information Administration, Monthly Energy Review, Table 1.2
41. Production Cycle Deployment
•First live deployment 2008
[Architecture diagram: jobs and data flow from an internal HPC cluster and its file system (PBs), if an internal cluster exists, to an auto-scaling external HPC cluster environment; blob data lands in S3 behind a cloud filer, with Glacier as cold storage, and data movement is scheduled]
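The storage tiering in slide 41 (hot blob data in S3, cold data aged out to Glacier) maps naturally onto an S3 lifecycle rule. A minimal sketch using boto3, assuming a placeholder bucket name, prefix, and retention window rather than the deployment's real values:

```python
# Sketch: age simulation output out of S3 into Glacier cold storage.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-hpc-blob-data",          # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "age-results-to-cold-storage",
            "Filter": {"Prefix": "results/"},  # placeholder prefix
            "Status": "Enabled",
            # After 30 days (illustrative), move output to Glacier.
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```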
42. Metric | Count
Compute Hours of Work | 2,312,959 hours
Compute Days of Work | 96,373 days
Compute Years of Work | 264 years
Molecule Count | 205,000 materials
Run Time | < 18 hours
Max Scale (cores) | 156,314 cores across 8 regions
Max Scale (instances) | 16,788 instances
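The table's rows are internally consistent, which is worth verifying when numbers this large get quoted:

```python
# Derive the table's rows from the compute-hours figure.
hours = 2_312_959
print(hours / 24)          # ~96,373 compute days
print(hours / 24 / 365)    # ~264 compute years
print(156_314 / 16_788)    # ~9.3 cores per instance on average
```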
43. How did we do this?
[Diagram: a JUPITER distributed queue feeds data to auto-scaling execute nodes]
•Automated in 8 cloud regions, 4 continents, double resiliency
•14 nodes controlling 16,788 instances
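Jupiter itself is Cycle Computing's proprietary scheduler, but slide 43's pull model (execute nodes draining a distributed queue, so a lost node's work is simply redelivered) can be sketched generically. Here SQS stands in for the actual queue; the queue name and job handler are hypothetical.

```python
# Generic pull-model worker loop, illustrating the slide 43 pattern (not
# Jupiter's actual implementation).
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.create_queue(QueueName="example-molecule-jobs")["QueueUrl"]

def run_job(body: str) -> None:
    """Placeholder for the real work (e.g., simulating one molecule)."""
    print("processing", body)

while True:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)   # long poll
    for msg in resp.get("Messages", []):
        run_job(msg["Body"])
        # Delete only after success, so a failed node's job is redelivered.
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
```

Because nodes pull work rather than being pushed jobs, a small number of queue/controller nodes can drive a very large fleet, which is how 14 nodes could control 16,788 instances.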
59. Take advantage of efficiency
•Find more uses for this efficient, inexpensive compute
Please ask the right questions, get answers quickly
Go invent and discover!