HDFS High Availability
Suresh Srinivas - Hortonworks
Aaron T. Myers - Cloudera
  
Overview
•  Part 1 – Suresh Srinivas (Hortonworks)
   − HDFS Availability and Reliability – what is the record?
   − HA Use Cases
   − HA Design
•  Part 2 – Aaron T. Myers (Cloudera)
   − NN HA Design Details
      ✓ Automatic failure detection and NN failover
      ✓ Client-NN connection failover
   − Operations and Admin of HA
   − Future Work
  
Availability, Reliability and Maintainability
Reliability = MTBF/(1 + MTBF)
•  Probability a system performs its functions without failure for a desired period of time
Maintainability = 1/(1 + MTTR)
•  Probability that a failed system can be restored within a given timeframe
Availability = MTTF/MTBF
•  Probability that a system is up when requested for use
•  Depends on both Reliability and Maintainability

Mean Time To Failure (MTTF): Average time a system operates before it fails
Mean Time To Repair/Restore (MTTR): Average time to repair failed system
Mean Time Between Failures (MTBF): Average time between successive failures = MTTR + MTTF
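
As a worked example of the definitions above, the sketch below computes availability as MTTF/(MTTF + MTTR). The function name and the sample figures (one failure per 1,000 hours, 0.5 or 0.25 hours to restore) are illustrative assumptions, not measurements from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / MTBF, where MTBF = MTTF + MTTR."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    mtbf_hours = mttf_hours + mttr_hours
    return mttf_hours / mtbf_hours


# Hypothetical figures: halving MTTR raises availability even though
# reliability (MTTF) is unchanged.
slow_restore = availability(mttf_hours=1000.0, mttr_hours=0.5)
fast_restore = availability(mttf_hours=1000.0, mttr_hours=0.25)
print(f"{slow_restore:.6f} -> {fast_restore:.6f}")
```

The point of the comparison is that a system can gain availability purely by restoring faster, which is the lever that failover targets.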
  
Current HDFS Availability & Data Integrity
•  Simple design for Higher Reliability
   − Storage: Rely on Native file system on the OS rather than use raw disk
   − Single NameNode master
      ✓ Entire file system state is in memory
   − DataNodes simply store and deliver blocks
      ✓ All sophisticated recovery mechanisms in NN
•  Fault Tolerance
   − Design assumes disks, nodes and racks fail
   − Multiple replicas of blocks
      ✓ Active monitoring and replication
      ✓ DNs actively monitor for block deletion and corruption
   − Restart/migrate the NameNode on failure
      ✓ Persistent state: multiple copies + checkpoints
      ✓ Functions as Cold Standby
   − Restart/replace the DNs on failure
   − DNs tolerate individual disk failures
  
How Well Did HDFS Work?
•  Data Reliability
   − Lost 19 out of 329 million blocks on 10 clusters with 20K nodes in 2009
   − Seven 9's of reliability
   − Related bugs fixed in Hadoop 0.20 and 0.21
•  NameNode Availability
   − 18-month study: 22 failures on 25 clusters - 0.58 failures per year per cluster
   − Only 8 would have benefited from HA failover!! (0.23 failures per cluster year)
   − NN is very reliable
      ✓ Resilient against overload caused by misbehaving apps
•  Maintainability
   − Large clusters see failure of one DataNode/day and more frequent disk failures
   − Maintenance once in 3 months to repair or replace DataNodes
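
The arithmetic behind the data reliability and NameNode failure-rate figures above can be checked with a short sketch. The input numbers are the ones quoted on the slide. The variable names and the "count the leading nines" approach are assumptions made for illustration.

```python
import math

# Figures quoted on the slide.
blocks_lost = 19
blocks_total = 329_000_000
nn_failures = 22
clusters = 25
study_years = 18 / 12  # 18-month study

loss_fraction = blocks_lost / blocks_total
# Number of leading nines in the survival fraction (1 - loss_fraction).
nines = math.floor(-math.log10(loss_fraction))

failures_per_cluster_year = nn_failures / (clusters * study_years)

print(f"block loss fraction: {loss_fraction:.2e}")
print(f"nines of reliability: {nines}")
print(f"NN failures per cluster-year: {failures_per_cluster_year:.2f}")
```

The failure rate works out to 22 / 37.5 cluster-years, which the slide quotes as 0.58.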
  
Why NameNode HA?
•  NameNode is highly reliable (high MTTF)
   − But Availability is not the same as Reliability
•  NameNode MTTR depends on
   − Restarting NameNode daemon on failure
      ✓ Operator restart – (failure detection + manual restore) time
      ✓ Automatic restart – 1-2 minutes
   − NameNode Startup time
      ✓ Small/medium cluster – 1-2 minutes
      ✓ Very large cluster – 5-15 minutes
•  Affects applications that have real time requirement
•  For higher HDFS Availability
   − Need redundant NameNode to eliminate SPOF
   − Need automatic failover to reduce MTTR and improve Maintainability
   − Need Hot standby to reduce MTTR for very large clusters
      ✓ Cold standby is sufficient for small clusters
  
NameNode HA – Initial Goals
•  Support for Active and a single Standby
   − Active and Standby with manual failover
      ✓ Standby could be cold/warm/hot
      ✓ Addresses downtime during upgrades – main cause of unavailability
   − Active and Standby with automatic failover
      ✓ Hot standby
      ✓ Addresses downtime during upgrades and other failures
•  Backward compatible configuration
•  Standby performs checkpointing
   − Secondary NameNode not needed
•  Management and monitoring tools
•  Design philosophy – choose data integrity over service availability
  
High Level Use Cases
•  Planned downtime
   − Upgrades
   − Config changes
   − Main reason for downtime
•  Unplanned downtime
   − Hardware failure
   − Server unresponsive
   − Software failures
   − Occurs infrequently

Supported failures
•  Single hardware failure
   − Double hardware failure not supported
•  Some software failures
   − Same software failure affects both active and standby
  
High Level Design
•  Service monitoring and leader election outside NN
   − Similar to industry standard HA frameworks
•  Parallel Block reports to both Active and Standby NN
•  Shared or non-shared NN file system state
•  Fencing of shared resources/data
   − DataNodes
   − Shared NN state (if any)
•  Client failover
   − Client side failover (based on configuration or ZooKeeper)
   − IP Failover
  
Design Considerations
•  Sharing state between Active and Hot Standby
   − File system state and Block locations
•  Automatic Failover
   − Monitoring Active NN and performing failover on failure
•  Making a NameNode active during startup
   − Reliable mechanism for choosing only one NN as active and the other as standby
•  Prevent data corruption on split brain
   − Shared Resource Fencing
      ✓ DataNodes and shared storage for NN metadata
   − NameNode Fencing
      ✓ when shared resource cannot be fenced
•  Client failover
   − Clients connect to the new Active NN during failover
  
Failover Control Outside NN
•  Similar to Industry Standard HA frameworks
•  HA daemon outside NameNode
   − Simpler to build
   − Immune to NN failures
•  Daemon manages resources
   − Resources – OS, HW, Network etc.
   − NameNode is just another resource
•  Performs
   − Active NN election during startup
   − Automatic Failover
   − Fencing
      ✓ Shared resources
      ✓ NameNode
[Diagram labels: Failover Controller; ZooKeeper; Resources; Shared Resources; Actions: start, stop, failover, monitor, …]


                                                                          	
  
Architecture
[Diagram labels: three ZK nodes (leader election); a Failover Controller (Active) and a Failover Controller (Standby), each with "Monitor Health" and "Cmds" links to its NN; NN Active and NN Standby connected by editlogs (fencing); Block Reports from three DNs]
First Phase – Hot Standby
[Diagram labels: NN Active and NN Standby connected by editlogs (Shared NFS storage), annotated "Needs to be HA"; Manual Failover; Block Reports and DN fencing from three DNs]
HA Design Details
  
Client Failover Design Details
•  Smart clients (client side failover)
   − Users use one logical URI, client selects correct NN to connect to
   − Clients know which operations are idempotent, therefore safe to retry on a failover
   − Clients have configurable failover/retry strategies
•  Current implementation
   − Client configured with the addresses of all NNs
•  Other implementations in the future (more later)
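
A minimal sketch of the client-side failover idea described above, not the actual HDFS client code. All names here (`FailoverClient`, `StandbyError`, the `call` signature) are hypothetical. The client holds the addresses of all NNs and moves to the next address on a standby response or connection error, retrying after a connection error only when the operation is idempotent.

```python
from typing import Callable, Sequence


class StandbyError(Exception):
    """Hypothetical error raised when a NN is in standby state."""


class FailoverClient:
    """Tries each configured NN address in turn (illustrative only)."""

    def __init__(self, addresses: Sequence[str], max_failovers: int = 3):
        if not addresses:
            raise ValueError("at least one NN address is required")
        self.addresses = list(addresses)
        self.max_failovers = max_failovers
        self.current = 0  # index of the NN believed to be active

    def invoke(self, call: Callable[[str], object], idempotent: bool):
        failovers = 0
        while True:
            address = self.addresses[self.current]
            try:
                return call(address)
            except StandbyError:
                # Assumed here: a standby rejects the request before
                # executing it, so any operation may be retried elsewhere.
                pass
            except ConnectionError:
                # The request may have run before the connection dropped,
                # so only idempotent calls are retried.
                if not idempotent:
                    raise
            failovers += 1
            if failovers > self.max_failovers:
                raise RuntimeError("no active NN found")
            self.current = (self.current + 1) % len(self.addresses)


def fake_rpc(address: str) -> str:
    # Stand-in for a real RPC: the first host is standby, the second active.
    if address == "host1.example.com:8020":
        raise StandbyError(address)
    return f"ok from {address}"


client = FailoverClient(["host1.example.com:8020", "host2.example.com:8020"])
result = client.invoke(fake_rpc, idempotent=True)
print(result)
```

The client remembers which address last succeeded (`current`), so later calls go straight to the NN it believes is active.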
  
Client Failover Configuration Example
  
...
<property>
  <name>dfs.namenode.rpc-address.name-service1.nn1</name>
  <value>host1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.name-service1.nn2</name>
  <value>host2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.name-service1.nn1</name>
  <value>host1.example.com:50070</value>
</property>
...



  
Automatic Failover Design Details
•  Automatic failover requires ZooKeeper
   − Not required for manual failover
   − ZK makes it easy to:
      ✓ Detect failure of the active NN
      ✓ Determine which NN should become the Active NN
•  On both NN machines, run another daemon
   − ZKFailoverController (ZooKeeper Failover Controller)
•  Each ZKFC is responsible for:
   − Health monitoring of its associated NameNode
   − ZK session management / ZK-based leader election
•  See HDFS-2185 and HADOOP-8206 for more details
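
The ZKFC responsibilities above can be illustrated with a toy model. This is not the ZKFailoverController implementation and does not talk to a real ZooKeeper ensemble: `FakeLock` is an in-memory stand-in (an assumption) for an ephemeral lock node, and `Zkfc`, `tick`, and `healthy` are hypothetical names. Fencing is deliberately left out of this sketch.

```python
from typing import Optional


class FakeLock:
    """In-memory stand-in for an ephemeral ZooKeeper lock node."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None

    def try_acquire(self, name: str) -> bool:
        if self.holder is None:
            self.holder = name
        return self.holder == name

    def release(self, name: str) -> None:
        # Models session expiry or a deliberate step-down.
        if self.holder == name:
            self.holder = None


class Zkfc:
    """Toy failover controller for one NameNode."""

    def __init__(self, name: str, lock: FakeLock) -> None:
        self.name = name
        self.lock = lock
        self.healthy = True
        self.state = "standby"

    def tick(self) -> None:
        """One monitoring round: check health, then join the election."""
        if not self.healthy:
            # An unhealthy NN gives up the lock so the peer can win.
            self.lock.release(self.name)
            self.state = "standby"
        elif self.lock.try_acquire(self.name):
            self.state = "active"
        else:
            self.state = "standby"


lock = FakeLock()
zkfc1, zkfc2 = Zkfc("nn1", lock), Zkfc("nn2", lock)
for z in (zkfc1, zkfc2):
    z.tick()
print(zkfc1.state, zkfc2.state)  # nn1 wins the first election

zkfc1.healthy = False  # simulate a failure of nn1
for z in (zkfc1, zkfc2):
    z.tick()
print(zkfc1.state, zkfc2.state)  # nn2 takes over
```

Because only the lock holder is ever marked active, the model never reports two active NNs at once.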
  
Automatic Failover Design Details (cont)
  
Ops/Admin: Shared Storage
•  To share NN state, need shared storage
   − Needs to be HA itself to avoid just shifting SPOF
   − Many come with IP fencing options
   − Recommended mount options:
      ✓ tcp,soft,intr,timeo=60,retrans=10
•  Still configure local edits dirs, but shared dir is special
•  Work is currently underway to do away with shared storage requirement (more later)
  
Ops/Admin: NN fencing
•  Critical for correctness that only one NN is active at a time
•  Out of the box
   − RPC to active NN to tell it to go to standby (graceful failover)
   − SSH to active NN and `kill -9` NN
•  Pluggable options
   − Many filers have protocols for IP-based fencing options
   − Many PDUs have protocols for IP-based plug-pulling (STONITH)
      ✓ Nuke the node from orbit. It's the only way to be sure.
•  Configure extra options if available to you
   − Will be tried in order during a failover event
   − Escalate the aggressiveness of the method
   − Fencing is critical for correctness of NN metadata
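
The "tried in order, escalating aggressiveness" behavior above can be sketched as follows. This is an illustration under stated assumptions, not Hadoop's fencing code: the `fence` function, the method names, and the three stub methods are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

FenceMethod = Tuple[str, Callable[[str], bool]]


def fence(target: str, methods: Sequence[FenceMethod]) -> str:
    """Try each fencing method in order; return the name of the first success.

    Raises RuntimeError if every method fails, so the caller never
    activates the standby while the old active may still be running.
    """
    for name, method in methods:
        try:
            if method(target):
                return name
        except Exception:
            # A method that errors out counts as a failed attempt.
            continue
    raise RuntimeError(f"unable to fence {target}; refusing to fail over")


# Hypothetical methods, ordered from least to most aggressive.
def graceful_rpc(target: str) -> bool:
    return False  # pretend the old active NN is unresponsive


def ssh_kill(target: str) -> bool:
    raise TimeoutError("ssh timed out")  # pretend the host is unreachable


def pdu_power_off(target: str) -> bool:
    return True  # pretend the power was cut


methods = [
    ("graceful-rpc", graceful_rpc),
    ("ssh-kill-9", ssh_kill),
    ("pdu-power-off", pdu_power_off),
]
used = fence("host1.example.com", methods)
print(used)
```

Raising when no method succeeds reflects the stated design philosophy of choosing data integrity over service availability.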
  
Ops/Admin: Automatic Failover
•  Deploy ZK as usual (3 or 5 nodes) or reuse existing ZK
   − ZK daemons have light resource requirement
   − OK to collocate 1 on each NN, many collocate 3rd on the YARN RM
   − Advisable to configure ZK daemons with dedicated disks for isolation
   − Fine to use the same ZK quorum as for HBase, etc.
•  Fencing methods still required
   − The ZKFC that wins the election is responsible for performing fencing
   − Fencing script(s) must be configured and work from the NNs
•  Admin commands which manually initiate failovers still work
   − But rather than coordinating the failover themselves, use the ZKFCs
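
As background for the "3 or 5 nodes" sizing above: a ZooKeeper ensemble stays available as long as a majority of its servers are up. The helper below is a small illustrative calculation, and its name is an assumption.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """A majority quorum of n servers survives floor((n - 1) / 2) failures."""
    if ensemble_size < 1:
        raise ValueError("ensemble must have at least one server")
    return (ensemble_size - 1) // 2


for n in (3, 4, 5):
    print(n, tolerated_failures(n))
```

A fourth server tolerates no more failures than three do, which is why odd ensemble sizes are the usual choice.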
  
Ops/Admin: Monitoring
•  New NN metrics
   − Size of pending DN message queues
   − Seconds since the standby NN last read from shared edit log
   − DN block report lag
   − All measurements of standby NN lag – monitor/alert on all of these
•  Monitor shared storage solution
   − Volumes fill up, disks go bad, etc.
   − Should configure paranoid edit log retention policy (default is 2)
•  Canary-based monitoring of HDFS a good idea
   − Pinging both NNs not sufficient
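
A canary check exercises the file system end to end instead of only pinging the NNs. The sketch below is one possible shape for such a check, assuming the caller supplies `write` and `read` functions from whatever HDFS client is in use. The function name, default path, and latency threshold are hypothetical.

```python
import time
from typing import Callable


def run_canary(write: Callable[[str, bytes], None],
               read: Callable[[str], bytes],
               path: str = "/tmp/hdfs-canary",
               max_seconds: float = 5.0) -> bool:
    """Write a small file, read it back, and check content and latency."""
    payload = str(time.time()).encode()
    started = time.monotonic()
    try:
        write(path, payload)
        echoed = read(path)
    except Exception:
        return False  # any client-side failure means the canary failed
    elapsed = time.monotonic() - started
    return echoed == payload and elapsed <= max_seconds


# In-memory stand-in for a file system client, for demonstration only.
store = {}
healthy = run_canary(store.__setitem__, store.__getitem__)
print(healthy)
```

A check like this fails when the service cannot complete a real write and read, even if both NN processes still answer a ping.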
  
Ops/Admin: Hardware
•  Active/Standby NNs should be on separate racks
•  Shared storage system should be on separate rack
•  Active/Standby NNs should have close to the same hardware
   − Same amount of RAM – need to store the same things
   − Same # of processors – need to serve same number of clients
•  All the same recommendations still apply for NN
   − ECC memory, 48GB
   − Several separate disks for NN metadata directories
   − Redundant disks for OS drives, probably RAID 5 or mirroring
   − Redundant power
  
Future Work
•  Other options to share NN metadata
   − Journal daemons with list of active JDs stored in ZK (HDFS-3092)
   − Journal daemons with quorum writes (HDFS-3077)
•  More advanced client failover/load shedding
   − Serve stale reads from the standby NN
   − Speculative RPC
   − Non-RPC clients (IP failover, DNS failover, proxy, etc.)
   − Less client-side configuration (ZK, custom DNS records, HDFS-3043)
•  Even Higher HA
   − Multiple standby NNs
  
Q&A
•  HA design: HDFS-1623
   − First released in Hadoop 2.0.0-alpha
•  Auto failover design: HDFS-3042 / -2185
   − First released in Hadoop 2.0.1-alpha
•  Community effort
  

More Related Content

What's hot

HDFS Namenode High Availability
HDFS Namenode High AvailabilityHDFS Namenode High Availability
HDFS Namenode High AvailabilityHortonworks
 
Tech Talk - Overview of Dash framework for building dashboards
Tech Talk - Overview of Dash framework for building dashboardsTech Talk - Overview of Dash framework for building dashboards
Tech Talk - Overview of Dash framework for building dashboardsAppsilon Data Science
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programPraveen Kumar Donta
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?DataWorks Summit
 
IA Générative et Graphes Neo4j : RAG time !
IA Générative et Graphes Neo4j : RAG time !IA Générative et Graphes Neo4j : RAG time !
IA Générative et Graphes Neo4j : RAG time !Neo4j
 
Analyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeAnalyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeDataWorks Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Edureka!
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance ImprovementBiju Nair
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionCeph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionKaran Singh
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Turning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data PlatformTurning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data PlatformCloudera, Inc.
 

What's hot (20)

HDFS Namenode High Availability
HDFS Namenode High AvailabilityHDFS Namenode High Availability
HDFS Namenode High Availability
 
Tech Talk - Overview of Dash framework for building dashboards
Tech Talk - Overview of Dash framework for building dashboardsTech Talk - Overview of Dash framework for building dashboards
Tech Talk - Overview of Dash framework for building dashboards
 
Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
IA Générative et Graphes Neo4j : RAG time !
IA Générative et Graphes Neo4j : RAG time !IA Générative et Graphes Neo4j : RAG time !
IA Générative et Graphes Neo4j : RAG time !
 
Analyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeAnalyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-time
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
PyTorch under the hood
PyTorch under the hoodPyTorch under the hood
PyTorch under the hood
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
Introduction to sqoop
Introduction to sqoopIntroduction to sqoop
Introduction to sqoop
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 
NVMe overview
NVMe overviewNVMe overview
NVMe overview
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake SolutionCeph Object Storage Performance Secrets and Ceph Data Lake Solution
Ceph Object Storage Performance Secrets and Ceph Data Lake Solution
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Turning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data PlatformTurning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data Platform
 

Viewers also liked

Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Edureka!
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationAdam Kawa
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopCloudera, Inc.
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdfEdureka!
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 
Hdfs ha using journal nodes
Hdfs ha using journal nodesHdfs ha using journal nodes
Hdfs ha using journal nodesEvans Ye
 
Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterEdureka!
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HAHortonworks
 
Learn Hadoop Administration
Learn Hadoop AdministrationLearn Hadoop Administration
Learn Hadoop AdministrationEdureka!
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopHortonworks
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Hortonworks
 
Hadoop Adminstration with Latest Release (2.0)
Hadoop Adminstration with Latest Release (2.0)Hadoop Adminstration with Latest Release (2.0)
Hadoop Adminstration with Latest Release (2.0)Edureka!
 
Redis memcached pdf
Redis memcached pdfRedis memcached pdf
Redis memcached pdfErin O'Neill
 
Hhm 3479 mq clustering and shared queues for high availability
Hhm 3479 mq clustering and shared queues for high availabilityHhm 3479 mq clustering and shared queues for high availability
Hhm 3479 mq clustering and shared queues for high availabilityPete Siddall
 
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)Adam Kawa
 
Introduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUGIntroduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUGAdam Kawa
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeArvind Prabhakar
 
Introduction to Hadoop Ecosystem
Introduction to Hadoop Ecosystem Introduction to Hadoop Ecosystem
HDFS HA Deep Dive

  • 1. HDFS High Availability
    Suresh Srinivas - Hortonworks
    Aaron T. Myers - Cloudera
  • 2. Overview
    •  Part 1 – Suresh Srinivas (Hortonworks)
       − HDFS Availability and Reliability – what is the record?
       − HA Use Cases
       − HA Design
    •  Part 2 – Aaron T. Myers (Cloudera)
       − NN HA Design Details
          ✓ Automatic failure detection and NN failover
          ✓ Client-NN connection failover
       − Operations and Admin of HA
       − Future Work
  • 3. Availability, Reliability and Maintainability
    Reliability = MTBF/(1 + MTBF)
    •  Probability that a system performs its functions without failure for a desired period of time
    Maintainability = 1/(1 + MTTR)
    •  Probability that a failed system can be restored within a given timeframe
    Availability = MTTF/MTBF
    •  Probability that a system is up when requested for use
    •  Depends on both Reliability and Maintainability

    Mean Time To Failure (MTTF): Average time the system operates before it fails
    Mean Time To Repair/Restore (MTTR): Average time to repair a failed system
    Mean Time Between Failures (MTBF): Average time between successive failures = MTTR + MTTF
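
    The availability formula on this slide can be worked through numerically. The following is a minimal sketch, assuming all times are expressed in hours; the function names and the sample figures (999 hours up, 1 hour to repair) are illustrative and do not come from the talk.

```python
def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures, as defined on the slide: MTTR + MTTF."""
    return mttf_hours + mttr_hours


def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / MTBF, the fraction of time the system is up."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Illustrative numbers only: 999 hours of uptime per failure, 1 hour to repair.
example = availability(999.0, 1.0)
print(f"availability = {example:.4f}")
```

    Because MTBF already includes the repair time, shrinking MTTR raises availability even when the failure rate itself does not change, which is the argument the later slides make for automatic failover.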
  • 4. Current HDFS Availability & Data Integrity
    •  Simple design for Higher Reliability
       − Storage: Rely on Native file system on the OS rather than use raw disk
       − Single NameNode master
          ✓ Entire file system state is in memory
       − DataNodes simply store and deliver blocks
          ✓ All sophisticated recovery mechanisms in NN
    •  Fault Tolerance
       − Design assumes disks, nodes and racks fail
       − Multiple replicas of blocks
          ✓ Active monitoring and replication
          ✓ DNs actively monitor for block deletion and corruption
       − Restart/migrate the NameNode on failure
          ✓ Persistent state: multiple copies + checkpoints
          ✓ Functions as Cold Standby
       − Restart/replace the DNs on failure
       − DNs tolerate individual disk failures
  • 5. How Well Did HDFS Work?
    •  Data Reliability
       − Lost 19 out of 329 Million blocks on 10 clusters with 20K nodes in 2009
       − Seven 9s of reliability
       − Related bugs fixed in 20 and 21.
    •  NameNode Availability
       − 18-month study: 22 failures on 25 clusters - 0.58 failures per year per cluster
       − Only 8 would have benefitted from HA failover!! (0.23 failures per cluster year)
       − NN is very reliable
          ✓ Resilient against overload caused by misbehaving apps
    •  Maintainability
       − Large clusters see failure of one DataNode/day and more frequent disk failures
       − Maintenance once in 3 months to repair or replace DataNodes
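
    The headline figures on this slide follow from simple arithmetic. The sketch below recomputes two of them from the raw counts quoted above; the function names are illustrative, and the "nines" measure is assumed here to be the negative base-10 logarithm of the fraction of blocks lost.

```python
import math


def failures_per_cluster_year(failures, clusters, study_months):
    """Failure rate normalised by the number of cluster-years observed."""
    cluster_years = clusters * study_months / 12.0
    return failures / cluster_years


def nines(lost, total):
    """Number of leading 9s in the survival fraction, as -log10(loss fraction)."""
    return -math.log10(lost / total)


# 22 NameNode failures across 25 clusters over an 18-month study.
rate = failures_per_cluster_year(22, 25, 18)
# 19 blocks lost out of 329 million.
block_nines = nines(19, 329_000_000)

print(f"{rate:.2f} failures per cluster-year, {block_nines:.1f} nines")
```

    The failure rate comes out just under 0.59 per cluster-year, in line with the 0.58 on the slide, and the block-loss fraction corresponds to a little over seven 9s.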
  • 6. Why NameNode HA?
    •  NameNode is highly reliable (high MTTF)
       − But Availability is not the same as Reliability
    •  NameNode MTTR depends on
       − Restarting the NameNode daemon on failure
          ✓ Operator restart – (failure detection + manual restore) time
          ✓ Automatic restart – 1-2 minutes
       − NameNode startup time
          ✓ Small/medium cluster – 1-2 minutes
          ✓ Very large cluster – 5-15 minutes
    •  Affects applications that have real-time requirements
    •  For higher HDFS Availability
       − Need a redundant NameNode to eliminate the SPOF
       − Need automatic failover to reduce MTTR and improve Maintainability
       − Need a hot standby to reduce MTTR for very large clusters
          ✓ Cold standby is sufficient for small clusters
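    A rough estimate shows how MTTR drives unplanned NameNode downtime. The sketch takes the 0.58 failures per cluster-year from slide 5 and assumes an MTTR of 15 minutes for a cold restart (the upper end of the startup time quoted for very large clusters) against an assumed 1 minute for failover to a hot standby. The 1-minute figure is an illustration, not a number from the talk.

    ```latex
    \begin{align*}
    \text{Expected downtime per cluster-year} &\approx \text{failure rate} \times \text{MTTR} \\
    \text{Cold restart:} \quad & 0.58 \times 15\,\text{min} = 8.7\,\text{min} \\
    \text{Hot standby (assumed):} \quad & 0.58 \times 1\,\text{min} = 0.58\,\text{min}
    \end{align*}
    ```

    This covers unplanned failures only and excludes planned downtime such as upgrades.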
  • 7. NameNode HA – Initial Goals
    •  Support for an Active and a single Standby
       − Active and Standby with manual failover
          ✓ Standby could be cold/warm/hot
          ✓ Addresses downtime during upgrades – the main cause of unavailability
       − Active and Standby with automatic failover
          ✓ Hot standby
          ✓ Addresses downtime during upgrades and other failures
    •  Backward-compatible configuration
    •  Standby performs checkpointing
       − Secondary NameNode not needed
    •  Management and monitoring tools
    •  Design philosophy – choose data integrity over service availability
  • 8. High Level Use Cases
    •  Planned downtime
       − Upgrades
       − Config changes
       − Main reason for downtime
    •  Unplanned downtime
       − Hardware failure
       − Server unresponsive
       − Software failures
       − Occurs infrequently
    Supported failures
    •  Single hardware failure
       − Double hardware failure not supported
    •  Some software failures
       − Same software failure affects both active and standby
  • 9. High Level Design
    •  Service monitoring and leader election outside the NN
       − Similar to industry-standard HA frameworks
    •  Parallel block reports to both Active and Standby NN
    •  Shared or non-shared NN file system state
    •  Fencing of shared resources/data
       − DataNodes
       − Shared NN state (if any)
    •  Client failover
       − Client-side failover (based on configuration or ZooKeeper)
       − IP failover
  • 10. Design Considerations
    •  Sharing state between Active and Hot Standby
       − File system state and block locations
    •  Automatic failover
       − Monitoring the Active NN and performing failover on failure
    •  Making a NameNode active during startup
       − Reliable mechanism for choosing only one NN as active and the other as standby
    •  Prevent data corruption on split brain
       − Shared resource fencing
          ✓ DataNodes and shared storage for NN metadata
       − NameNode fencing
          ✓ When a shared resource cannot be fenced
    •  Client failover
       − Clients connect to the new Active NN during failover
  • 11. Failover Control Outside NN
    •  Similar to industry-standard HA frameworks
    •  HA daemon outside the NameNode
       − Simpler to build
       − Immune to NN failures
    •  Daemon manages resources
       − Resources – OS, HW, network etc.
       − NameNode is just another resource
    •  Performs
       − Active NN election during startup
       − Automatic failover
       − Fencing
          ✓ Shared resources
          ✓ NameNode
    [Diagram: a Failover Controller, connected to ZooKeeper, applies actions (start, stop, failover, monitor, …) to resources and shared resources]
  • 12. Architecture
    [Diagram: three ZK nodes provide leader election for two Failover Controllers (Active and Standby). Each controller sends commands to its NN and monitors its health. The Active NN and Standby NN share editlogs (with fencing). The DNs send block reports to both NNs.]
  • 13. First Phase – Hot Standby
    [Diagram: the Active NN and Standby NN share editlogs on shared NFS storage, which itself needs to be HA. Failover is manual. The DNs send block reports to both NNs, with DN fencing.]
  • 15. Client Failover Design Details
    •  Smart clients (client-side failover)
       − Users use one logical URI; the client selects the correct NN to connect to
       − Clients know which operations are idempotent and therefore safe to retry on a failover
       − Clients have configurable failover/retry strategies
    •  Current implementation
       − Client configured with the addresses of all NNs
    •  Other implementations in the future (more later)
  • 16. Client Failover Configuration Example
    ...
    <property>
      <name>dfs.namenode.rpc-address.name-service1.nn1</name>
      <value>host1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.name-service1.nn2</name>
      <value>host2.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.http-address.name-service1.nn1</name>
      <value>host1.example.com:50070</value>
    </property>
    ...
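    The slide elides the properties around the NameNode addresses. A minimal sketch of what they might look like follows, assuming the property keys documented for later Hadoop 2.x releases (early alpha releases may use different keys) and reusing the slide's `name-service1`, `nn1` and `nn2` identifiers. It is an illustration, not the configuration shown in the talk.

    ```xml
    <!-- hdfs-site.xml (sketch; key names assume later Hadoop 2.x documentation) -->
    <property>
      <!-- Logical name that clients use instead of a single NN host -->
      <name>dfs.nameservices</name>
      <value>name-service1</value>
    </property>
    <property>
      <!-- IDs of the NameNodes that make up the nameservice -->
      <name>dfs.ha.namenodes.name-service1</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <!-- Client-side class that tries the configured NNs in turn -->
      <name>dfs.client.failover.proxy.provider.name-service1</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>

    <!-- core-site.xml: point the default file system at the logical URI -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://name-service1</value>
    </property>
    ```

    With settings of this kind, users address the cluster as `hdfs://name-service1` and the client decides which configured NameNode to contact.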
  • 17. Automatic Failover Design Details
    •  Automatic failover requires ZooKeeper
       − Not required for manual failover
       − ZK makes it easy to:
          ✓ Detect failure of the active NN
          ✓ Determine which NN should become the Active NN
    •  On both NN machines, run another daemon
       − ZKFailoverController (ZooKeeper Failover Controller)
    •  Each ZKFC is responsible for:
       − Health monitoring of its associated NameNode
       − ZK session management / ZK-based leader election
    •  See HDFS-2185 and HADOOP-8206 for more details
  • 18. Automatic Failover Design Details (cont)
  • 19. Ops/Admin: Shared Storage
    •  To share NN state, need shared storage
       − Needs to be HA itself to avoid just shifting the SPOF
       − Many come with IP fencing options
       − Recommended mount options:
          ✓ tcp,soft,intr,timeo=60,retrans=10
    •  Still configure local edits dirs, but the shared dir is special
    •  Work is currently underway to do away with the shared storage requirement (more later)
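    As a sketch of how local and shared edits directories might be declared side by side, the fragment below uses the `dfs.namenode.edits.dir` and `dfs.namenode.shared.edits.dir` keys from later Hadoop 2.x documentation. Both paths are hypothetical, and the shared one is assumed to be an NFS mount of the HA filer.

    ```xml
    <!-- hdfs-site.xml (sketch; both paths are hypothetical) -->
    <property>
      <!-- Local edits directory, still configured on each NN -->
      <name>dfs.namenode.edits.dir</name>
      <value>file:///data/1/dfs/nn</value>
    </property>
    <property>
      <!-- Shared edits directory on the NFS mount: written by the active NN, read by the standby -->
      <name>dfs.namenode.shared.edits.dir</name>
      <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
    </property>
    ```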
  • 20. Ops/Admin: NN Fencing
    •  Critical for correctness that only one NN is active at a time
    •  Out of the box
       − RPC to the active NN to tell it to go to standby (graceful failover)
       − SSH to the active NN and `kill -9` the NN
    •  Pluggable options
       − Many filers have protocols for IP-based fencing options
       − Many PDUs have protocols for IP-based plug-pulling (STONITH)
          ✓ Nuke the node from orbit. It's the only way to be sure.
    •  Configure extra options if available to you
       − Will be tried in order during a failover event
       − Escalate the aggressiveness of the method
       − Fencing is critical for correctness of NN metadata
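    The ordered, escalating fencing list described on this slide might be configured roughly as follows. The sketch assumes the `dfs.ha.fencing.methods` key and the built-in `sshfence` and `shell(...)` methods from later Hadoop 2.x documentation. The script path and key file location are hypothetical placeholders.

    ```xml
    <!-- hdfs-site.xml (sketch; script path and key file are hypothetical) -->
    <property>
      <!-- One method per line, tried in order until one succeeds -->
      <name>dfs.ha.fencing.methods</name>
      <value>sshfence
    shell(/path/to/fence-filer.sh)</value>
    </property>
    <property>
      <!-- Private key used by sshfence to log in to the other NN host -->
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/hdfs/.ssh/id_rsa</value>
    </property>
    ```

    Listing the gentler SSH method first and a filer- or PDU-level script second mirrors the advice to escalate the aggressiveness of the method.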
  • 21. Ops/Admin: Automatic Failover
    •  Deploy ZK as usual (3 or 5 nodes) or reuse an existing ZK
       − ZK daemons have light resource requirements
       − OK to collocate one on each NN; many collocate the third on the YARN RM
       − Advisable to configure ZK daemons with dedicated disks for isolation
       − Fine to use the same ZK quorum as for HBase, etc.
    •  Fencing methods still required
       − The ZKFC that wins the election is responsible for performing fencing
       − Fencing script(s) must be configured and work from the NNs
    •  Admin commands which manually initiate failovers still work
       − But rather than coordinating the failover themselves, they use the ZKFCs
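    A minimal sketch of the settings that switch on ZKFC-driven failover is shown below, assuming the `dfs.ha.automatic-failover.enabled` and `ha.zookeeper.quorum` keys from later Hadoop 2.x documentation. The three ZooKeeper hostnames are hypothetical.

    ```xml
    <!-- hdfs-site.xml (sketch) -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>

    <!-- core-site.xml (sketch; hostnames are hypothetical) -->
    <property>
      <!-- ZooKeeper ensemble used by the ZKFCs for leader election -->
      <name>ha.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>
    ```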
  • 22. Ops/Admin: Monitoring
    •  New NN metrics
       − Size of pending DN message queues
       − Seconds since the standby NN last read from the shared edit log
       − DN block report lag
       − All are measurements of standby NN lag – monitor/alert on all of these
    •  Monitor the shared storage solution
       − Volumes fill up, disks go bad, etc.
       − Should configure a paranoid edit log retention policy (default is 2)
    •  Canary-based monitoring of HDFS is a good idea
       − Pinging both NNs is not sufficient
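    The slide does not name the retention setting. One plausible reading, offered here as an assumption, is that "default is 2" refers to `dfs.namenode.num.checkpoints.retained`, which controls how many checkpoints (and the edits needed to replay from them) the NN keeps. A more conservative policy might look like this, with both values chosen purely for illustration:

    ```xml
    <!-- hdfs-site.xml (sketch; values are illustrative, not recommendations from the talk) -->
    <property>
      <!-- Number of image checkpoints to retain -->
      <name>dfs.namenode.num.checkpoints.retained</name>
      <value>4</value>
    </property>
    <property>
      <!-- Extra edit transactions kept beyond what is strictly needed for a restart -->
      <name>dfs.namenode.num.extra.edits.retained</name>
      <value>2000000</value>
    </property>
    ```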
  • 23. Ops/Admin: Hardware
    •  Active/Standby NNs should be on separate racks
    •  Shared storage system should be on a separate rack
    •  Active/Standby NNs should have close to the same hardware
       − Same amount of RAM – need to store the same things
       − Same number of processors – need to serve the same number of clients
    •  All the same recommendations still apply for the NN
       − ECC memory, 48GB
       − Several separate disks for NN metadata directories
       − Redundant disks for OS drives, probably RAID 5 or mirroring
       − Redundant power
  • 24. Future Work
    •  Other options to share NN metadata
       − Journal daemons with list of active JDs stored in ZK (HDFS-3092)
       − Journal daemons with quorum writes (HDFS-3077)
    •  More advanced client failover/load shedding
       − Serve stale reads from the standby NN
       − Speculative RPC
       − Non-RPC clients (IP failover, DNS failover, proxy, etc.)
       − Less client-side configuration (ZK, custom DNS records, HDFS-3043)
    •  Even higher HA
       − Multiple standby NNs
  • 25. QA
    •  HA design: HDFS-1623
       − First released in Hadoop 2.0.0-alpha
    •  Auto failover design: HDFS-3042 / -2185
       − First released in Hadoop 2.0.1-alpha
    •  Community effort