HPC Resource Accounting

HPC Resource Accounting: Progress Against Allocation - Lessons Learned
Ken Schumacher
LISA 2014 - Seattle, WA
12 November 2014

My Background in Batch System Accounting
In other words, "Why should I listen to this guy?"
• I've been at Fermilab since January 1997, nearly 18 years
- I started out supporting a few specific experiments. My group
managed Fermilab's central Unix cluster. I later moved to batch
farms using LSF as the batch scheduler.
- I was also a developer on the team that built the first
prototype of the Gratia Grid Accounting System (used by OSG,
the Open Science Grid).
• For the last 5 years I have been part of the HPPC Group
administering several InfiniBand-based HPC clusters.
- I generate weekly resource accounting reports
- I work with Principal Investigators (PIs) to manage allocations
- I monitor our compliance with SLOs and Fermilab policies
Ken Schumacher | HPC Resource Accounting: Progress Against Allocation - Lessons Learned | 12 November 2014

Why call it Resource Accounting?
• First: Resources
- Wikipedia: "A source or supply from which benefit is produced"
- Compute clusters offering unique resources, designed around the
needs of a particular group of researchers.
• LQCD Cluster - actually 4 CPU and 2 GPU sub-clusters
• Cosmology Cluster
• Accelerator Modeling Cluster
- Also offering shared on-line storage in our Lustre clusters
- Access to offline storage service from the DMS (Data Movement and
Storage) department
- And the ever-present staff as a resource. But accounting for staff
is outside the scope of my presentation.

Why do I call it Resource Accounting?
• Next: Accounting
- Noun: a report or description of an event or experience: "a detailed
account of what has been achieved."
• The stakeholders who oversee and fund these collaborations and
their research need to know several things
• Not just how their money was spent, but what it accomplished, in
the form of:
• Availability/uptime of the computers, storage and services
• Usage by projects within the collaboration of the resources offered
• Papers and reports of progress on research being conducted.
• The usage reports allow for budgeting and planning for future projects,
including new hardware acquisition

Current State of the Reporting System
• The reporting tools that we use today are a work in progress
- Over the last four years there has been a great improvement in the
workflow of generating (automating) the weekly report
- The scope of the reporting has been revised as the requirements
have expanded
- There is a significant list of changes and improvements still
needed.
• I am here to share those things that became important (and
useful) as the scope of our reports expanded.
- We now include additional types of resources (GPU and storage)
as part of the allocation
- We added more detailed reporting of the usage by projects so we
can adjust both quotas and batch submission priorities

Who are my customers?
• The HPPC department supports several Massively Parallel
Compute Clusters used by different groups
- Theoretical Particle Physicists associated with Lattice Quantum
Chromodynamics or LQCD.
• The users within this collaboration are from all over the world.
• The collaboration has compute resources at several institutions
- Astrophysicists at Fermilab using our Cosmology Cluster
- Fermilab scientists, engineers and software developers doing
Accelerator Modeling on the AMR cluster
Photos: Ds Cluster at FNAL; 10q Cluster at JLab

Disclaimer: I am not a Theoretical Particle Physicist

Why we need High Performance Computing (HPC) clusters
Discovered in the early 1970s, the theory of quantum chromodynamics (QCD)
consists of equations that describe the strong force that causes quarks to
clump together to form protons and other constituents of matter. For a long
time, solving these equations was a struggle. But in the last decade, using
powerful supercomputers, theorists are now able to finally solve the
equations of QCD with high precision.

Lattice Quantum Chromodynamics Computing at FNAL
• Fermilab's LQCD Computing cluster is made up of a few sub-clusters
based on similar configurations
- Sub-clusters of conventional CPU-based nodes
• Jpsi cluster - decommissioned May 2014, has been our standard for
normalized core-hours since 2010. 856 nodes, dual-socket quad-core
Opteron 2352 (2.1 GHz) on DDR InfiniBand fabric
• Ds cluster - 420 nodes with quad-socket eight-core Opteron 6128 (2.0
GHz) on QDR InfiniBand fabric. This is 13,440 cores.
• Bc cluster - 224 nodes with quad-socket eight-core Opteron 6320 (2.8 GHz)
on QDR InfiniBand fabric. This is 7,168 cores.
• Pi0 cluster - 214 nodes with dual-socket eight-core Intel E5-2650v2 "Ivy
Bridge" (2.6 GHz) on QDR InfiniBand fabric. This is 3,424 cores.
(continued . . .)
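The core counts quoted for each sub-cluster follow from nodes × sockets per node × cores per socket. The short Python check below reproduces them. The dictionary and function names are illustrative only, and the Jpsi total is derived here from its node description because the slide does not state it.

```python
# Sub-cluster shapes as listed on the slide:
# (nodes, sockets per node, cores per socket).
SUBCLUSTERS = {
    "Jpsi": (856, 2, 4),  # dual-socket quad-core (total not stated on the slide)
    "Ds": (420, 4, 8),    # quad-socket eight-core
    "Bc": (224, 4, 8),    # quad-socket eight-core
    "Pi0": (214, 2, 8),   # dual-socket eight-core
}


def total_cores(nodes: int, sockets: int, cores_per_socket: int) -> int:
    """Total physical cores in a homogeneous sub-cluster."""
    return nodes * sockets * cores_per_socket


CORE_COUNTS = {name: total_cores(*shape) for name, shape in SUBCLUSTERS.items()}

for name, cores in CORE_COUNTS.items():
    print(f"{name}: {cores:,} cores")
```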

Lattice Quantum Chromodynamics Computing
• Fermilab's LQCD Computing cluster (. . . continued)
- Sub-clusters of nodes enhanced with GPU processors
• Dsg cluster - 76 nodes with quad-socket eight-core Opteron 6128 (2.0
GHz), each with two NVIDIA Tesla M2050 GPUs, on QDR InfiniBand fabric.
• Pi0g cluster - 32 nodes with dual-socket eight-core Intel E5-2650v2 "Ivy
Bridge" (2.6 GHz), each with four NVIDIA Tesla K40m GPUs, on QDR fabric.
- On-line disk-based storage in a Lustre cluster
• LQCD Lustre Cluster has 964 TB of on-line storage after our most
recent expansion.
• Cosmology Lustre Cluster has 129 TB of on-line storage
- Tape-based storage in our SL8500 robotic tape libraries
• 1,617 LTO4 tapes (1,293.6 TB)
• 331 10KC tapes (1,655.0 TB)
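As a rough cross-check of the figures on this slide, the sketch below multiplies out the GPU counts and tape capacities. The per-cartridge capacities (0.8 TB for LTO4, 5 TB for 10KC) are assumptions inferred from the slide's totals rather than numbers stated in the talk. All names are illustrative.

```python
# GPU sub-clusters as listed on the slide: (nodes, GPUs per node).
GPU_SUBCLUSTERS = {"Dsg": (76, 2), "Pi0g": (32, 4)}

GPU_COUNTS = {
    name: nodes * gpus_per_node
    for name, (nodes, gpus_per_node) in GPU_SUBCLUSTERS.items()
}

# Tape counts from the slide; TB per cartridge is an ASSUMPTION
# inferred from the totals the slide quotes.
TAPE_COUNTS = {"LTO4": 1617, "10KC": 331}
ASSUMED_TB_PER_TAPE = {"LTO4": 0.8, "10KC": 5.0}

TAPE_TOTAL_TB = {
    kind: round(count * ASSUMED_TB_PER_TAPE[kind], 1)
    for kind, count in TAPE_COUNTS.items()
}

print(GPU_COUNTS)
print(TAPE_TOTAL_TB)
```

The GPU totals agree with the 152 and 128 GPUs quoted later for the Ds GPU and Pi0 GPU clusters.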

Other Compute Resources within USQCD
• From the PY 2014-15 Call for Proposals
- Compute resources dedicated to Lattice QCD
• 71 M BG/Q core-hours at BNL
• 397 M Jpsi core-hours on clusters at FNAL and JLab
• 8.9 M GPU-hours on GPU clusters at FNAL and JLab
- Compute resource awards to USQCD from the DOE's INCITE
program
• 100 M Cray XK7 core-hours at Oak Ridge OLCF
• 240 M BG/Q core-hours at Argonne ALCF

Allocations at FNAL/QCD
• USQCD Allocation Committee allocates time on the clusters in
units of 'normalized core hours'.
• Program year is July 1 through June 30
• Three classes of Allocations
- A - large allocations to "support calculations of benefit for
the whole USQCD Collaboration and/or addressing critical
scientific needs."
- B - medium allocations (<2.5M c-h) "intended to support
calculations in an early stage of development which address, or
have the potential to address, scientific needs of the
collaboration."
- C - small and/or short term allocations, to explore / test /
benchmark calculations with the potential to address scientific
needs of the collaboration.

Lesson 1 - We needed a normalized "currency"
• The allocation is like a budget.
• We base allocations on normalized core hours.
• Normalized core hours are basically our currency.
• CPU performance and GPU performance are like apples and
oranges.
• GPUs are designed for vector math or floating point calculations
• Some simulations rely more heavily on floating point.
• Code compiled for GPUs cannot run on CPUs
• We had to develop new benchmarks for use with GPUs.
• So projects will get separate allocations for CPU and GPU
processing.

Normalizing the USQCD Clusters at Fermilab
• Our existing HPC clusters for LQCD
» Ds cluster factor of 1.33 nC-H, 13,440 cores, 17,875 nCores
» Bc cluster factor of 1.48 nC-H, 7,168 cores, 10,609 nCores
» Pi0 cluster factor of 3.14 nC-H, 3,424 cores, 10,751 nCores
» Ds GPU cluster factor of 1.0 nC-H, 152 GPUs, 152 nGPUs
» Pi0 GPU cluster factor of 2.2 nC-H, 128 GPUs, 280 nGPUs
• Storage: Tape at 3K nC-H per TB, Disk at 30K nC-H per TB
• PY July 1, 2014 through June 30, 2015 allocation
» 270,375,000 CPU nC-H
» 2,815,000 GPU nC-H
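The nCores column appears to be the physical core count multiplied by the sub-cluster's normalization factor and rounded to a whole number. The sketch below works under that assumption and reproduces the three CPU rows and the Ds GPU row. The function and table names are illustrative. The Pi0 GPU row is not reproduced exactly: 128 × 2.2 is 281.6, while the slide lists 280, so the published factor is presumably rounded.

```python
# (physical units, normalization factor) as listed on the slide.
CPU_CLUSTERS = {
    "Ds": (13_440, 1.33),
    "Bc": (7_168, 1.48),
    "Pi0": (3_424, 3.14),
}
GPU_CLUSTERS = {
    "Ds GPU": (152, 1.0),
    "Pi0 GPU": (128, 2.2),  # slide lists 280 nGPUs; 128 * 2.2 = 281.6
}


def normalized_units(units: int, factor: float) -> int:
    """Scale physical cores (or GPUs) into normalized units.

    ASSUMPTION: normalized = physical * factor, rounded to nearest integer.
    """
    return round(units * factor)


N_CORES = {name: normalized_units(*row) for name, row in CPU_CLUSTERS.items()}

for name, n in N_CORES.items():
    print(f"{name}: {n:,} nCores")
```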

Sample Allocation Notification Letter
Hello Professor,
I am setting the Allocations and configuring accounts for the USQCD
Scientific Program Committee allocations at Fermilab for the program
year 2014-2015. I have you listed as the PI contact for the following
allocation.
Flavor Physics from B, D, and K Mesons on the 2+1+1-Flavor HISQ
Ensembles
49.28 M CPU hours, 825 K GPU hours, 50 TB disk and 251 TB tape
If this does not match your information, please let us know. We need two
things from you, please:
1) Your choice of a project name
2) A list of the users allowed to submit jobs to this project name
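To see what the storage in this letter costs in the normalized currency, the sketch below applies the storage rates quoted earlier in the deck (tape at 3K nC-H per TB, disk at 30K nC-H per TB) to the 50 TB of disk and 251 TB of tape. The function name is illustrative, and the deck does not say what period a storage charge covers, so the result is simply "per charge".

```python
# Storage rates quoted in the deck, in normalized core-hours per TB.
DISK_NCH_PER_TB = 30_000
TAPE_NCH_PER_TB = 3_000


def storage_debit(disk_tb: int, tape_tb: int) -> int:
    """nC-H debited for the given disk and tape footprint.

    ASSUMPTION: the charge is linear in TB; the billing period
    is not stated in the talk.
    """
    return disk_tb * DISK_NCH_PER_TB + tape_tb * TAPE_NCH_PER_TB


# Figures from the sample allocation letter.
LETTER_DEBIT = storage_debit(disk_tb=50, tape_tb=251)
print(f"{LETTER_DEBIT:,} nC-H")
```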

FNAL Progress Against Allocation Reports
• We are using a series of homegrown Perl and Python scripts
to generate our weekly progress against allocation report
- A summary of who used each sub-cluster
- A listing of specific credits and debits for the week
- A YTD summary of CPU cluster usage
- A YTD summary of GPU cluster usage
• We can include debits/credits against the allocation for several
reasons (explained later):
- Credits for reduced performance during load shed events
- Debits for storage: long-term (tapes) and on-line (disk)
- Credits for failed jobs (due to a systems failure)
- Debits for dedicated nodes set aside for a project
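One way to picture the debits and credits is as a small ledger added to each project's job usage. This is a minimal sketch and not the actual Fermilab scripts: the `Adjustment` record, the sign convention (positive is a debit, negative is a credit), the project names, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Adjustment:
    project: str
    reason: str
    nch: float  # positive = debit (charged), negative = credit (refunded)


def net_usage(job_usage_nch: float, adjustments: list, project: str) -> float:
    """Job usage plus debits minus credits for one project."""
    return job_usage_nch + sum(a.nch for a in adjustments if a.project == project)


# Hypothetical week for hypothetical projects "projA" and "projB".
ADJUSTMENTS = [
    Adjustment("projA", "tape storage", 3_000.0),
    Adjustment("projA", "dedicated nodes", 1_200.0),
    Adjustment("projA", "failed job (system fault)", -800.0),
    Adjustment("projB", "load-shed credit", -150.0),
]

print(net_usage(50_000.0, ADJUSTMENTS, "projA"))
```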

Sample of a CPU Sub-cluster Weekly Detail Report

Sample of GPU sub-cluster weekly detail report

Sample of the Debits / Credits - Quarterly Tape Usage


Part I - The Report Header
• The header describes where we are in the program year.
• It also provides some explanation of the numbers in the report
itself.

Part II - Allocation and Usage by Project (Weekly)
• The left side of the weekly summary report has the summary
usage for just this week but across all the sub-clusters.

Part II - Allocation and Usage by Project (Allocation)
• The middle part of the weekly summary report has the
allocation granted and used PYTD by project.

Part II - Allocation and Usage by Project (PYTD)
• The right side of the weekly summary report has the summary
usage for the Program YTD across all the sub-clusters.

Lesson 2 - Adjustments to Batch Priorities (lower)
• Project charmonium has just gone over its allocation.
- I will go into the configuration files of the batch scheduler and
change the priority for this one project
- The new priority is set to a negative number
- This causes any jobs that this project puts into the queue to wait
until the jobs of those projects that still have allocation remaining
are allowed to run
• A sub-cluster that is not currently billable is configured so that
all projects, regardless of their allocation, run at equal priority.
• Opportunistic running
- If there are nodes available, we allow these over-allocation
projects, or a project that simply does not have an allocation, to run
on what would otherwise be idle nodes.
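The priority rule described above can be sketched as a small decision function. This is only an illustration of the policy, not the scheduler configuration used at Fermilab: the priority values (0 and -1000), the function name, and the sample projects other than charmonium are all hypothetical.

```python
def batch_priority(allocation_nch: float, used_nch: float,
                   billable: bool = True,
                   normal: int = 0, penalty: int = -1000) -> int:
    """Return a scheduler priority for a project on one sub-cluster.

    - On a non-billable sub-cluster every project runs at equal priority.
    - A project with no allocation, or one that has used more than its
      allocation, gets a negative priority so it only runs opportunistically.
    The numeric values are HYPOTHETICAL placeholders.
    """
    if not billable:
        return normal
    if allocation_nch <= 0 or used_nch > allocation_nch:
        return penalty
    return normal


# Hypothetical usage figures: (allocation, used) in nC-H.
PROJECTS = {
    "charmonium": (1_000_000, 1_000_500),  # just over its allocation
    "projA": (2_000_000, 500_000),
    "unallocated": (0, 12_000),
}

# Higher priority runs first; over-allocation projects sort to the back.
QUEUE_ORDER = sorted(PROJECTS, key=lambda p: batch_priority(*PROJECTS[p]),
                     reverse=True)
print(QUEUE_ORDER)
```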

Lesson 2 - Adjustments to Batch Priorities (increase)
• There are occasions where we may increase the priority for
one or more projects.
- To meet a deadline for a paper
- To generate simulation data that is needed as inputs to upcoming
simulations that will run
• In some cases we may dedicate some number of nodes for a
specific project.

Part III - Totals and Progress Against Allocation (Weekly)
• The left side of the weekly summary report has the summary
usage for just this week but across all the sub-clusters.

Part III - Totals and Progress Against Allocation (Allocation)
• The middle part of the weekly summary report has the
allocation granted and used PYTD by project.

Part III - Totals and Progress Against Allocation (PYTD)
• The right side of the weekly summary report has the summary
usage for the Program YTD across all the sub-clusters.

Lesson 3 - Credits for failed jobs
• Occasionally a job will fail "softly".
• It does not crash and it reports a successful completion, so it is
billed against an allocation.
• When a soft failure is discovered, we manually calculate a
credit to the project to reimburse the previous charges.
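A manual credit of this kind amounts to reversing what the job was originally charged. The sketch below assumes a job's charge is cores × wall-clock hours × the sub-cluster's normalization factor; the talk does not spell out the exact formula, and the function names and job size are hypothetical.

```python
def job_charge(cores: int, wall_hours: float, factor: float) -> float:
    """Normalized core-hours charged for one job.

    ASSUMPTION: charge = cores * wall-clock hours * normalization factor.
    """
    return cores * wall_hours * factor


def soft_failure_credit(cores: int, wall_hours: float, factor: float) -> float:
    """Credit (negative nC-H) that reimburses a job found to have failed softly."""
    return -job_charge(cores, wall_hours, factor)


# Hypothetical job: 128 cores for 6 hours on Ds (factor 1.33 from the deck).
charge = job_charge(128, 6.0, 1.33)
credit = soft_failure_credit(128, 6.0, 1.33)
print(f"charged {charge:.2f} nC-H, credited {credit:.2f} nC-H")
```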

Lesson 4 - Credits for reduced performance
• We have had occasions during the summer when our cooling
equipment could not keep up with the heat being generated
• Our Facilities group will notify us of a potential load-shed event
• During a load-shed event, some number of the nodes in our
clusters are simply turned off.
• The nodes that remain on-line have their clock speeds reduced and
run at a decreased load, and wall-time limits are increased
• All jobs during a load-shed event get a credit for the extra time used.
• Sorry, no sample. We have been able to avoid load-shed
events since July 1.

Lesson 5 - Usefulness of burn rates
• The burn rates allow us to notify a PI that the project may be
using its allocation at a rate that is too high or low.
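A burn rate can be computed by comparing the fraction of the allocation used with the fraction of the program year elapsed. This is a minimal sketch assuming the July 1 through June 30 program year described earlier. The function names and the 0.8 and 1.2 warning thresholds are hypothetical, since the talk does not give the cutoffs used for "too high or low".

```python
from datetime import date

PY_START = date(2014, 7, 1)
PY_END = date(2015, 7, 1)  # day after June 30, so the span is the full year


def year_fraction(today: date) -> float:
    """Fraction of the program year that has elapsed."""
    return (today - PY_START).days / (PY_END - PY_START).days


def burn_ratio(used_nch: float, allocation_nch: float, today: date) -> float:
    """Ratio > 1 means the project is ahead of an even pace, < 1 behind."""
    elapsed = year_fraction(today)
    if elapsed <= 0:
        raise ValueError("program year has not started")
    return (used_nch / allocation_nch) / elapsed


def burn_status(ratio: float, low: float = 0.8, high: float = 1.2) -> str:
    """Classify a burn ratio; the thresholds are HYPOTHETICAL."""
    if ratio > high:
        return "too fast"
    if ratio < low:
        return "too slow"
    return "on pace"


# Hypothetical usage against the 49.28 M allocation from the sample letter.
ratio = burn_ratio(30e6, 49.28e6, date(2014, 11, 12))
print(f"{ratio:.2f} -> {burn_status(ratio)}")
```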

Summary of Lessons Learned
• Normalized core-hours
- Using a standard unit across different facilities
- CPU vs GPU: These truly are "apples vs oranges"
- Charges for storage
• Adjustments to batch priorities for fair share
- Reduced priority for over-allocation or unallocated projects
- Increased priority or dedicated nodes where needed
• Charges for dedicated nodes
• Credits for failed jobs and for load-shed events
- Failed jobs that are not the user's fault
- Load-shed events that are driven by "mother nature"
• Usefulness of burn rates

Time for your questions.
I appreciate your patience with my hearing loss. Please step up to a
microphone.
Feel free to find me in the Hallway Track. I am here all week.
