HPC Resource Accounting: Progress Against Allocation — Lessons Learned
Ken Schumacher
LISA 2014 - Seattle, WA
12 November 2014
My Background in Batch System Accounting
In other words, "Why should I listen to this guy?"
• I've been at Fermilab since Jan 1997, nearly 18 years
- I started supporting a few specific experiments. My group managed Fermilab's central Unix cluster. I later moved to batch farms using LSF as the batch scheduler.
- I was also a developer on the team that developed the first prototype for the Gratia Grid Accounting System (used by OSG, the Open Science Grid).
• For the last 5 years I have been part of the HPPC Group administering several InfiniBand-based HPC clusters.
- I generate weekly resource accounting reports
- I work with Principal Investigators (PIs) to manage allocations
- I monitor our compliance with SLOs and Fermilab policies
Why call it Resource Accounting?
• First: Resources
- Wikipedia: "a source or supply from which benefit is produced"
- Compute clusters offering unique resources, designed around the needs of a particular group of researchers.
• LQCD Cluster - actually 4 CPU and 2 GPU sub-clusters
• Cosmology Cluster
• Accelerator Modeling Cluster
- Also offering shared on-line storage in our Lustre Clusters
- Access to offline storage service from the DMS (Data Movement and Storage) department
- And the ever-present staff as a resource. But accounting for staff is outside the scope of my presentation.
Why do I call it Resource Accounting?
• Next: Accounting
- Noun: a report or description of an event or experience; a detailed account of what has been achieved.
• The stakeholders that oversee and fund these collaborations and their research need to know several things
• More than just how their money was spent, but what it accomplished, in the form of:
• Availability/uptime of the computers, storage and services
• Usage of the offered resources by projects within the collaboration
• Papers and reports of progress on the research being conducted
• The usage reports allow for budgeting and planning for future projects, including new hardware acquisition
Current State of the Reporting System
• The reporting tools that we use today are a work in progress
- Over the last four years there has been a great improvement in the workflow of generating (automating) the weekly report
- The scope of the reporting has been revised as the requirements have expanded
- There is a significant list of changes and improvements still needed.
• I am here to share those things that became important (and useful) as the scope of our reports expanded.
- We now include additional types of resources (GPU and storage) as part of the allocation
- We added more detailed reporting of the usage by projects so we can adjust both quotas and batch submission priorities
Who are my customers?
• The HPPC department supports several Massively Parallel Compute Clusters used by different groups
- Theoretical particle physicists associated with Lattice Quantum Chromodynamics (LQCD)
• The users within this collaboration are from all over the world.
• The collaboration has compute resources at several institutions
- Astrophysicists at Fermilab using our Cosmology Cluster
- Fermilab scientists, engineers and software developers doing Accelerator Modeling on the AMR cluster
[Photos: Ds Cluster at FNAL; 10q Cluster at JLab]
Disclaimer: I am not a Theoretical Particle Physicist 
Why we need High Performance Computing (HPC) clusters
Discovered in the early 1970s, the theory of quantum chromodynamics (QCD) consists of equations that describe the strong force that causes quarks to clump together to form protons and other constituents of matter. For a long time, solving these equations was a struggle. But in the last decade, using powerful supercomputers, theorists have finally been able to solve the equations of QCD with high precision.
Lattice Quantum Chromodynamics Computing at FNAL
• Fermilab's LQCD Computing cluster is made up of a few sub-clusters based on similar configurations
- Sub-clusters of conventional CPU-based nodes
• Jpsi cluster - decommissioned May 2014; has been our standard for normalized core-hours since 2010. 856 nodes, dual-socket quad-core Opteron 2352 (2.1 GHz) on DDR InfiniBand fabric
• Ds cluster - 420 nodes with quad-socket eight-core Opteron 6128 (2.0 GHz) on QDR InfiniBand fabric. This is 13,440 cores.
• Bc cluster - 224 nodes with quad-socket eight-core Opteron 6320 (2.8 GHz) on QDR InfiniBand fabric. This is 7,168 cores.
• Pi0 cluster - 214 nodes with dual-socket eight-core Intel E5-2650v2 "Ivy Bridge" (2.6 GHz) on QDR InfiniBand fabric. This is 3,424 cores.
(continued . . .)
Lattice Quantum Chromodynamics Computing
• Fermilab's LQCD Computing cluster (. . . continued)
- Sub-clusters of nodes enhanced with GPU processors
• Dsg cluster - 76 nodes with quad-socket eight-core Opteron 6128 (2.0 GHz) and two NVIDIA Tesla M2050 GPUs each, on QDR InfiniBand fabric
• Pi0g cluster - 32 nodes with dual-socket eight-core Intel E5-2650v2 "Ivy Bridge" (2.6 GHz) and four NVIDIA Tesla K40m GPUs each, on QDR fabric
- On-line disk-based storage in a Lustre Cluster
• LQCD Lustre Cluster has 964 TB of on-line storage after our most recent expansion.
• Cosmology Lustre Cluster has 129 TB of on-line storage
- Tape-based storage in our SL8500 robotic tape libraries
• 1,617 LTO4 tapes (1,293.6 TB)
• 331 10KC tapes (1,655.0 TB)
Other Compute Resources within USQCD
• From the PY 2014-15 Call for Proposals
- Compute resources dedicated to Lattice QCD
• 71 M BG/Q core-hours at BNL
• 397 M Jpsi core-hours on clusters at FNAL and JLab
• 8.9 M GPU-hours on GPU clusters at FNAL and JLab
- Compute resource awards to USQCD from the DOE's INCITE program
• 100 M Cray XK7 core-hours at Oak Ridge OLCF
• 240 M BG/Q core-hours at Argonne ALCF
Allocations at FNAL/QCD
• The USQCD Allocation Committee allocates time on the clusters in units of 'normalized core-hours'.
• The program year is July 1 through June 30
• Three classes of allocations
- A - large allocations which "support calculations of benefit for the whole USQCD Collaboration and/or addressing critical scientific needs."
- B - medium allocations (<2.5 M c-h) "intended to support calculations in an early stage of development which address, or have the potential to address, scientific needs of the collaboration."
- C - small and/or short-term allocations, to explore/test/benchmark calculations with the potential to address scientific needs of the collaboration.
Lesson 1 - We needed a normalized "currency"
• The allocation is like a budget.
• We base allocations on normalized core-hours.
• Normalized core-hours are basically our currency.
• CPU performance and GPU performance are like apples and oranges.
• GPUs are designed for vector math and floating-point calculations
• Some simulations rely more heavily on floating point.
• Code compiled for GPUs cannot run on CPUs
• We had to develop new benchmarks for use with GPUs.
• So projects get separate allocations for CPU and GPU processing.
Normalizing the USQCD Clusters at Fermilab
• Our existing HPC clusters for LQCD (see the conversion sketch below)
» Ds cluster: factor of 1.33 nC-H, 13,440 cores, 17,875 nCores
» Bc cluster: factor of 1.48 nC-H, 7,168 cores, 10,609 nCores
» Pi0 cluster: factor of 3.14 nC-H, 3,424 cores, 10,751 nCores
» Ds GPU cluster: factor of 1.0 nC-H, 152 GPUs, 152 nGPUs
» Pi0 GPU cluster: factor of 2.2 nC-H, 128 GPUs, 280 nGPUs
• Storage: tape at 3K nC-H per TB, disk at 30K nC-H per TB
• PY July 1, 2014 through June 30, 2015 allocation
» 270,375,000 CPU nC-H
» 2,815,000 GPU nC-H
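
To make the currency concrete, here is a minimal sketch of the conversion arithmetic (my own illustration in Python, not the production scripts), using the factors and storage rates from this slide:

```python
# Normalization factors from the slide above (nC-H per raw core- or GPU-hour).
CPU_FACTORS = {"Ds": 1.33, "Bc": 1.48, "Pi0": 3.14}
GPU_FACTORS = {"Dsg": 1.0, "Pi0g": 2.2}

# Storage charges, in normalized core-hours per TB.
TAPE_NCH_PER_TB = 3_000
DISK_NCH_PER_TB = 30_000

def normalized_core_hours(cluster: str, raw_core_hours: float) -> float:
    """Convert raw core-hours on a CPU sub-cluster into Jpsi-normalized core-hours."""
    return CPU_FACTORS[cluster] * raw_core_hours

def storage_debit(tape_tb: float, disk_tb: float) -> float:
    """Charge storage use against the allocation, in normalized core-hours."""
    return tape_tb * TAPE_NCH_PER_TB + disk_tb * DISK_NCH_PER_TB

# Example: one 32-core Ds node running for 24 hours, plus the storage
# amounts from the sample allocation letter on the next slide.
print(normalized_core_hours("Ds", 32 * 24))    # 1021.44 nC-H
print(storage_debit(tape_tb=251, disk_tb=50))  # 2,253,000 nC-H
```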
Sample Allocation Notification Letter
Hello Professor,
I am setting the allocations and configuring accounts for the USQCD Scientific Program Committee allocations at Fermilab for the program year 2014-2015. I have you listed as the PI contact for the following allocation.
Flavor Physics from B, D, and K Mesons on the 2+1+1-Flavor HISQ Ensembles
49.28 M CPU hours, 825 K GPU hours, 50 TB disk and 251 TB tape
If this does not match your information, please let us know. We need two things from you, please:
1) Your choice of a project name
2) A list of the users allowed to submit jobs to this project name
FNAL Progress Against Allocation Reports
• We use a series of homegrown Perl and Python scripts to generate our weekly progress-against-allocation report
- A summary of who used each sub-cluster
- A listing of specific credits and debits for the week
- A YTD summary of CPU cluster usage
- A YTD summary of GPU cluster usage
• We can include debits/credits against the allocation for several reasons (explained later; a sketch of a ledger entry follows this list):
- Credits for reduced performance during load-shed events
- Debits for storage: long-term (tape) and on-line (disk)
- Credits for failed jobs (due to a systems failure)
- Debits for dedicated nodes set aside for a project
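
As a rough model of the ledger those scripts maintain (the field names and figures here are hypothetical, not the actual report schema), each debit or credit can be treated as a signed entry in normalized core-hours:

```python
from dataclasses import dataclass

@dataclass
class LedgerEntry:
    project: str
    nch: float    # signed: positive = debit against allocation, negative = credit
    reason: str   # e.g. "cpu usage", "quarterly tape usage", "load-shed credit"

def weekly_balance(entries: list[LedgerEntry]) -> dict[str, float]:
    """Net normalized core-hours charged to each project this week."""
    totals: dict[str, float] = {}
    for e in entries:
        totals[e.project] = totals.get(e.project, 0.0) + e.nch
    return totals

entries = [
    LedgerEntry("charmonium", 120_000, "cpu usage"),
    LedgerEntry("charmonium", -8_500, "credit: soft-failed jobs"),
    LedgerEntry("hisq", 753_000, "debit: quarterly tape usage"),
]
print(weekly_balance(entries))  # {'charmonium': 111500.0, 'hisq': 753000.0}
```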
Sample of a CPU Sub-cluster Weekly Detail Report 
Sample of a GPU Sub-cluster Weekly Detail Report
Sample of the Debits / Credits - Quarterly Tape Usage 
Part I - The Report Header
• The header describes where we are in the program year.
• It also provides some explanation of the numbers in the report itself.
Part II - Allocation and Usage by Project (Weekly)
• The left side of the weekly summary report shows the usage for just this week, summed across all the sub-clusters.
Part II - Allocation and Usage by Project (Allocation)
• The middle part of the weekly summary report shows the allocation granted to each project and the amount used program year to date (PYTD).
Part II - Allocation and Usage by Project (PYTD)
• The right side of the weekly summary report shows the usage for the program YTD, summed across all the sub-clusters.
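
Putting the three parts together, the summary can be pictured as one row per project (a hypothetical rendering; the real report's columns differ in detail):

```python
def report_row(project: str, week_nch: float, alloc_nch: float, pytd_nch: float) -> str:
    """One line of the weekly summary: this week | allocation | PYTD and % used."""
    pct = 100.0 * pytd_nch / alloc_nch if alloc_nch else 0.0
    return (f"{project:<12} week: {week_nch:>11,.0f}  "
            f"alloc: {alloc_nch:>13,.0f}  PYTD: {pytd_nch:>13,.0f} ({pct:5.1f}%)")

print(report_row("charmonium", 412_000, 49_280_000, 51_100_000))  # over allocation
```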
Lesson 2 - Adjustments to Batch Priorities (lower)
• Project charmonium has just gone over its allocation.
- I will go into the configuration files of the batch scheduler and change the priority for this one project (a sketch of this bookkeeping follows the list)
- The new priority is set to a negative number
- This causes any jobs that this project puts into the queue to wait until jobs from projects that still have allocation remaining have been allowed to run
• A sub-cluster that is not currently billable is configured so that all projects, regardless of their allocation, run at equal priority.
• Opportunistic running
- If there are nodes available, we allow these over-allocation projects, or a project that simply does not have an allocation, to run on what would otherwise be idle nodes.
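
The bookkeeping behind that manual step can be sketched as below; the actual scheduler configuration syntax is site-specific, so this only decides which projects to demote:

```python
def priority_overrides(alloc: dict[str, float], used: dict[str, float],
                       demoted_priority: int = -1000) -> dict[str, int]:
    """Return a negative priority for every project that has exhausted its
    allocation; projects with allocation remaining keep the default priority."""
    overrides = {}
    for project, granted in alloc.items():
        if used.get(project, 0.0) >= granted:
            overrides[project] = demoted_priority  # queues behind funded projects
    return overrides

print(priority_overrides({"charmonium": 10.0e6, "hisq": 49.28e6},
                         {"charmonium": 10.4e6, "hisq": 31.0e6}))
# {'charmonium': -1000} -> hand-edit the batch scheduler config to match
```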
Lesson 2 - Adjustments to Batch Priorities (increase)
• There are occasions where we may increase the priority for one or more projects.
- To meet a deadline for a paper
- To generate simulation data that is needed as input to upcoming simulations
• In some cases we may dedicate some number of nodes to a specific project.
Part III - Totals and Progress Against Allocation (Weekly)
• The left side of the weekly summary report shows the usage totals for just this week, summed across all the sub-clusters.
Part III - Totals and Progress Against Allocation (Allocation)
• The middle part of the weekly summary report shows the allocation granted and the amount used PYTD.
Part III - Totals and Progress Against Allocation (PYTD)
• The right side of the weekly summary report shows the usage totals for the program YTD, summed across all the sub-clusters.
Lesson 3 - Credits for failed jobs
• Occasionally a job will fail "softly".
• It does not crash, and it reports a successful completion, so it is billed against an allocation.
• When a soft failure is discovered, we will manually calculate a credit to the project to reimburse the previous charges.
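
A minimal sketch of that manual credit, assuming the charge model is simply wall-time × cores × normalization factor (the real procedure may differ):

```python
def soft_failure_credit(wall_hours: float, cores: int, factor: float) -> float:
    """Credit back everything a soft-failed job was charged, in nC-H."""
    return -(wall_hours * cores * factor)  # negative = credit on the ledger

# A 6-hour job on four 32-core Ds nodes (factor 1.33) that produced unusable output:
print(soft_failure_credit(wall_hours=6, cores=4 * 32, factor=1.33))  # -1021.44
```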
Lesson 4 - Credits for reduced performance
• We have had occasions during the summer where our cooling equipment could not keep up with the heat being generated
• Our Facilities group will notify us of a potential load-shed event
• During a load-shed event, some number of the nodes in our clusters are simply turned off.
• The nodes that remain on-line have their clock speeds reduced, so they run at a decreased load, and wall-time limits are increased
• All jobs during a load shed get a credit for the extra time used.
• Sorry, no sample: we have been able to avoid load-shed events since July 1.
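
The slide gives no formula, but one plausible version of the credit (an assumption on my part) refunds the fraction of wall time attributable to the throttled clocks:

```python
def load_shed_credit(charged_nch: float, normal_ghz: float, reduced_ghz: float) -> float:
    """Credit the extra wall time a job needed while clock speeds were reduced."""
    slowdown = normal_ghz / reduced_ghz           # e.g. 2.0 / 1.5 = 1.33x longer
    return -charged_nch * (1.0 - 1.0 / slowdown)  # refund the throttled fraction

# A job charged 10,000 nC-H while Ds nodes ran at 1.5 GHz instead of 2.0 GHz:
print(load_shed_credit(10_000, normal_ghz=2.0, reduced_ghz=1.5))  # -2500.0 nC-H
```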
Lesson 5 - Usefulness of burn rates
• The burn rates allow us to notify a PI that the project may be using its allocation at a rate that is too high or too low.
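
A sketch of the arithmetic, assuming "burn rate" means usage to date relative to the allocation pro-rated by elapsed program year (the notification thresholds are illustrative):

```python
from datetime import date

PY_START, PY_END = date(2014, 7, 1), date(2015, 6, 30)

def burn_rate(used_nch: float, alloc_nch: float, today: date) -> float:
    """Ratio of actual usage to pro-rated allocation.
    1.0 = exactly on pace; >1 = burning too fast; <1 = too slow."""
    elapsed = (today - PY_START).days / (PY_END - PY_START).days
    return (used_nch / alloc_nch) / elapsed

rate = burn_rate(used_nch=30e6, alloc_nch=49.28e6, today=date(2014, 11, 12))
if rate > 1.25 or rate < 0.5:  # illustrative thresholds for notifying the PI
    print(f"notify PI: burn rate {rate:.2f}")  # notify PI: burn rate 1.65
```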
Summary of Lessons Learned
• Normalized core-hours
- Using a standard unit across different facilities
- CPU vs GPU: these truly are "apples vs oranges"
- Charges for storage
• Adjustments to batch priorities for fair share
- Reduced priority for over-allocation or unallocated projects
- Increased priority or dedicated nodes where needed
• Charges for dedicated nodes
• Credits for failed jobs and for load-shed events
- Failed jobs that are not the user's fault
- Load-shed events that are driven by "Mother Nature"
• Usefulness of burn rates
Time for your questions.
And I appreciate your patience with my hearing loss. Please step to a microphone.
Feel free to find me in the Hallway Track. I am here all week.
