(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads | AWS re:Invent 2014
"Not only did the 156,000+ core run (nicknamed the MegaRun) on Amazon EC2 break industry records for size, scale, and power, but it also delivered real-world results. The University of Southern California ran the high-performance computing job in the cloud to evaluate over 220,000 compounds and build a better organic solar cell. In this session, USC provides an update on the six promising compounds that we have found and is now synthesizing in laboratories for a clean energy project. We discuss the implementation of and lessons learned in running a cluster in eight AWS regions worldwide, with highlights from Cycle Computing's project Jupiter, a low-overhead cloud scheduler and workload manager. This session also looks at how the MegaRun was financially achievable using the Amazon EC2 Spot Instance market, including an in-depth discussion on leveraging Spot Instances, a strategy to deal with the variability of Spot pricing, and a template to avoid compromising workflow integrity, security, or management.
After a year of production workloads on AWS, HGST, a Western Digital Company, has zeroed in on understanding how to create on-demand clusters to maximize value on AWS. HGST will outline the company's successes in addressing the company's changes in operations, culture, and behavior to this new vision of on-demand clusters. In addition, the session will provide insights into leveraging Amazon EC2 Spot Instances to reduce costs and maximize value, while maintaining the needed flexibility, and agility that AWS is known for.andquot;
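The abstract's Spot strategy boils down to bidding across many regions under a hard price cap. Below is a minimal, hypothetical sketch of that idea using today's boto3 SDK (which postdates this 2014 talk); the region list, AMI IDs, instance type, instance count, and bid cap are all illustrative placeholders, not the values or tooling Cycle Computing actually used.

```python
# Hypothetical sketch: spread Spot requests across regions under a price cap.
import boto3

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]      # subset of the 8 regions
AMI_BY_REGION = {"us-east-1": "ami-11111111",          # placeholder AMI IDs
                 "us-west-2": "ami-22222222",
                 "eu-west-1": "ami-33333333"}
MAX_BID = "0.30"            # USD/hour cap keeps workflow cost predictable
INSTANCE_TYPE = "c3.8xlarge"

for region in REGIONS:
    ec2 = boto3.client("ec2", region_name=region)
    # Check recent pricing so we only bid where Spot is currently cheap.
    history = ec2.describe_spot_price_history(
        InstanceTypes=[INSTANCE_TYPE],
        ProductDescriptions=["Linux/UNIX"],
        MaxResults=1,
    )
    prices = history["SpotPriceHistory"]
    if not prices or float(prices[0]["SpotPrice"]) >= float(MAX_BID):
        continue  # skip regions where Spot already exceeds our cap
    ec2.request_spot_instances(
        SpotPrice=MAX_BID,
        InstanceCount=100,
        LaunchSpecification={
            "ImageId": AMI_BY_REGION[region],
            "InstanceType": INSTANCE_TYPE,
        },
    )
```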
"
Similar to (BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads | AWS re:Invent 2014
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
Similar to (BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads | AWS re:Invent 2014 (20)
WordPress Websites for Engineers: Elevate Your Brand
(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads | AWS re:Invent 2014
1. November 14, 2014 | Las Vegas, NV
Jason Stowe, Cycle Computing
Patrick Saris, USC
David Hinz, HGST
5. Jevons Paradox
•UK in the 1860s: “we need a fixed amount of steam power”
•People thought:
More efficient coal use = use less coal
•Jevons disagreed!
6. Jevons Paradox
•Jevons was contrarian:
Increasing efficiency in turning coal to steam, making the interface simpler to consume, radically increases demand.
7. Cloud helps capacity…
Fixed clusters are:
Too small when needed most,
Too large every other time…
But this work is hard to move: data scheduling, encryption, multi-AZ, security, etc.
Cycle powers access at scale
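Slide 7's elasticity argument can be made concrete with a small sketch: size the cluster to the queue rather than to a fixed footprint. The function name, jobs-per-node ratio, and thresholds below are illustrative assumptions, not Cycle's actual scheduler logic.

```python
# Minimal sketch of elasticity: scale node count with demand.
def desired_nodes(queued_jobs: int, jobs_per_node: int = 32,
                  min_nodes: int = 0, max_nodes: int = 16788) -> int:
    """Size the cluster to the queue; a fixed cluster returns a constant."""
    needed = -(-queued_jobs // jobs_per_node)   # ceiling division
    return max(min_nodes, min(needed, max_nodes))

print(desired_nodes(0))         # 0     -> an idle cloud cluster costs nothing
print(desired_nodes(500_000))   # 15625 -> burst capacity when needed most
```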
8. Cycle solutions help access
[Diagram: a drug designer's internal cluster (500 servers, 100% full) bursts molecule data through a data workflow and cloud orchestration layer to a 10,600-server cluster container in the cloud]
•40 years of drug design in 9 hours
•3 new compounds, $4,372 in Spot
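A quick back-of-envelope check on slide 8's numbers; this is a rough estimate that assumes all 10,600 servers ran the full 9 hours, ignoring ramp-up and ramp-down.

```python
# Sanity-check the implied Spot cost per server-hour.
servers, hours, cost = 10_600, 9, 4_372
server_hours = servers * hours                               # 95,400 server-hours
print(f"~${cost / server_hours:.3f} per server-hour on Spot")  # ~$0.046
```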
9. Thanks to cloud, people can:
•Ask the right questions
•Get better answers, faster
10. Record Scale, Enterprise Speed
•Very innovative work by:
–Patrick Saris, USC
–David Hinz, HGST
•Both will show the importance of:
–Asking the right question, regardless of scale
–Getting results faster to increase throughput
11. November 14, 2014 | Las Vegas, NV
Patrick Saris, University of Southern California
12. U.S. energy mix (chart):
•Fossil Fuels: 79%
•Nuclear: 10%
•Renewables: 11% (Biomass 5.6%, Hydroelectric 3.1%, Wind 2.0%, Solar 0.4%, Geothermal 0.3%)
Source: U.S. Energy Information Administration, Monthly Energy Review, Table 1.2
41. Production Cycle Deployment
•First live deployment 2008
[Architecture diagram: jobs and data flow from an internal HPC cluster and its file system (PBs), if an internal cluster exists, to an auto-scaling external HPC cluster environment; blob data lands in S3 behind a cloud filer, with Glacier as cold storage, and data movement is scheduled]
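The storage tiering in slide 41 (hot blob data in S3, cold data aged out to Glacier) maps naturally onto an S3 lifecycle rule. A minimal sketch using boto3, assuming a placeholder bucket name, prefix, and retention window rather than the deployment's real values:

```python
# Sketch: age simulation output out of S3 into Glacier cold storage.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-hpc-blob-data",          # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "age-results-to-cold-storage",
            "Filter": {"Prefix": "results/"},  # placeholder prefix
            "Status": "Enabled",
            # After 30 days (illustrative), move output to Glacier.
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```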
42. Metric | Count
Compute Hours of Work | 2,312,959 hours
Compute Days of Work | 96,373 days
Compute Years of Work | 264 years
Molecule Count | 205,000 materials
Run Time | < 18 hours
Max Scale (cores) | 156,314 cores across 8 regions
Max Scale (instances) | 16,788 instances
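The table's rows are internally consistent, which is worth verifying when numbers this large get quoted:

```python
# Derive the table's rows from the compute-hours figure.
hours = 2_312_959
print(hours / 24)          # ~96,373 compute days
print(hours / 24 / 365)    # ~264 compute years
print(156_314 / 16_788)    # ~9.3 cores per instance on average
```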
43. How did we do this?
[Diagram: a JUPITER distributed queue feeds data to auto-scaling execute nodes]
•Automated in 8 cloud regions, 4 continents, double resiliency
•14 nodes controlling 16,788 instances
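Jupiter itself is Cycle Computing's proprietary scheduler, but slide 43's pull model (execute nodes draining a distributed queue, so a lost node's work is simply redelivered) can be sketched generically. Here SQS stands in for the actual queue; the queue name and job handler are hypothetical.

```python
# Generic pull-model worker loop, illustrating the slide 43 pattern (not
# Jupiter's actual implementation).
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.create_queue(QueueName="example-molecule-jobs")["QueueUrl"]

def run_job(body: str) -> None:
    """Placeholder for the real work (e.g., simulating one molecule)."""
    print("processing", body)

while True:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)   # long poll
    for msg in resp.get("Messages", []):
        run_job(msg["Body"])
        # Delete only after success, so a failed node's job is redelivered.
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
```

Because nodes pull work rather than being pushed jobs, a small number of queue/controller nodes can drive a very large fleet, which is how 14 nodes could control 16,788 instances.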
59. Take advantage of efficiency
•Find more uses for this efficient, inexpensive compute
Please ask the right questions, get answers quickly
Go invent and discover!