Server Roles – Name Nodes
• Job Manager for YARN data-processing framework
– Heartbeats from data nodes
– 10th heartbeat is a block report from which it generates
– Checks in every hour to mirror metadata / block map
– Not a hot-spare – requires manual fail-over
• High Availability (HA) can be added in some
– Results in a dedicated HA node that acts as a witness to the Name Node cluster
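The heartbeat bullets above can be illustrated with a toy model. This is a minimal sketch of the behaviour as the slide states it (every 10th heartbeat from a data node carries a block report), not the actual HDFS implementation. The names NameNodeSketch and receive_heartbeat are hypothetical, and the idea that a block report refreshes a block-to-node map is an assumption, since the original bullet is cut off before saying what the report generates.

```python
from dataclasses import dataclass, field

# Per the slide: every 10th heartbeat is a block report.
BLOCK_REPORT_EVERY = 10


@dataclass
class NameNodeSketch:
    """Toy model of a Name Node tracking heartbeats (illustrative only)."""
    heartbeats: dict = field(default_factory=dict)  # data node -> heartbeat count
    block_map: dict = field(default_factory=dict)   # block id -> set of data nodes

    def receive_heartbeat(self, node, blocks):
        """Record one heartbeat; return True if it was handled as a block report."""
        count = self.heartbeats.get(node, 0) + 1
        self.heartbeats[node] = count
        if count % BLOCK_REPORT_EVERY != 0:
            return False
        # Assumption: a block report replaces what the map knows about this node.
        for holders in self.block_map.values():
            holders.discard(node)
        for block in blocks:
            self.block_map.setdefault(block, set()).add(node)
        return True


# Usage: only the 10th heartbeat updates the block map.
nn = NameNodeSketch()
handled = [nn.receive_heartbeat("dn1", ["blk_1", "blk_2"]) for _ in range(10)]
```

After the ten calls, handled holds nine False values followed by one True, and the block map lists dn1 as a holder of both blocks.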
Server Roles – Edge Nodes
Server Roles – Data Node
› More services (Impala/Spark) require more memory
– Many local disks (JBOD / non-RAID mode)
› SFF (2.5”) for performance-based workloads
› LFF (3.5”) for capacity-centric workloads
– CPUs – legacy recommendation of 1:1 core:spindle ratio
› SSDs, faster HDD (10K+), and in-memory workloads make this less of an issue
› 10- and 12-core CPUs are the best-practice default
Hadoop Cluster Deployment – Installation Best Practices
• Use pre-built, assembled & cabled racks from vendor
• Automated deployment tools (e.g., Open Crowbar)
• Purchase nodes in standard-size groups, not as single nodes, for easy capacity growth and ordering
– Common increments are ½ or full rack for easy deployment and sizing
• For each type of hardware, purchase spare components to keep on site for easy, rapid repair
• HDFS protects data by replicating it between nodes. The default Replication Factor is 3, but it is configurable.
• HDFS Raw Capacity = Number of Compute Nodes x Number of Drives x Capacity of Drives
• HDFS Usable Capacity = HDFS Raw Capacity/Replication Factor
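The two capacity formulas above, combined with the advice to buy nodes in fixed increments, can be sketched as a small sizing helper. This is a minimal sketch: the function names, the 12 x 4 TB drive layout, and the increment of 10 nodes are hypothetical values chosen for illustration, not figures from this deck.

```python
import math


def hdfs_raw_capacity_tb(nodes, drives_per_node, drive_tb):
    """HDFS Raw Capacity = nodes x drives per node x capacity per drive."""
    return nodes * drives_per_node * drive_tb


def hdfs_usable_capacity_tb(raw_tb, replication_factor=3):
    """HDFS Usable Capacity = raw capacity / replication factor."""
    return raw_tb / replication_factor


def nodes_needed(target_usable_tb, drives_per_node, drive_tb,
                 replication_factor=3, increment=1):
    """Nodes required for a usable-capacity target, rounded up to a purchase increment."""
    raw_needed = target_usable_tb * replication_factor
    nodes = math.ceil(raw_needed / (drives_per_node * drive_tb))
    return math.ceil(nodes / increment) * increment


# Hypothetical example: 10 nodes, each with 12 drives of 4 TB.
raw = hdfs_raw_capacity_tb(10, 12, 4)             # 480 TB raw
usable = hdfs_usable_capacity_tb(raw)             # 160.0 TB usable at RF 3
exact = nodes_needed(200, 12, 4)                  # 13 nodes for 200 TB usable
rounded = nodes_needed(200, 12, 4, increment=10)  # 20 nodes in groups of 10
```

Rounding up to the purchase increment shows how a 200 TB target that strictly needs 13 nodes becomes a 20-node order when nodes are bought in groups of 10.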
Big Data Networking Best Practices
• Traditional Ethernet is used since it’s affordable and already prevalent.
• 1GbE networking was used in early drafts of the solution, but now that costs have dropped, 10GbE is the much more efficient choice.
• Multiple ports are teamed for both redundancy and throughput. LACP and software bonding are the most common methods.
• IPv4 is most widely used. IPv6 has limited support at the OS and Hadoop level.
Attributes of a Good Switch for Big Data
• Non-blocking backplane
• Deep per-port packet buffers (shared buffers do not work well). During the sort/shuffle phases of map/reduce operations, network traffic is so chaotic that it can saturate any and all shared buffers, impacting multiple hosts’ network performance.
• Good choices:
Dell Points of Integration
• VLT / VRRP is a very affordable way to team switches both at the ToR and the aggregation tiers.
This makes the Dell Networking Force10 switches a great choice.
• Active Fabric Manager
– Speeds up the creation and administration of the required VLT / VRRP configuration on the switches.
– Helps with capacity planning as customers scale
Big Data Networking Futures
• 40GbE onboard LOMs will begin to be used for high-volume clusters. The cost:benefit ratio isn’t there yet.
• As HPC and Big Data converge, we’ll start to see the use of InfiniBand (IB) for node-to-node connectivity.
• In-memory (Spark / Impala) workloads are reducing the bottlenecks that used to exist at the disk; the bottlenecks now move to the processor and network. Expect customers to look to increase core counts and network speed to overcome this.
Etu+Dell = complete Hadoop/Big Data solution provider
Best-of-breed solutions for Big Data
Dell Professional Services for Big Data
Installation and configuration service
Complete end-to-end implementation
Discover, Plan, Implement, Investigate
1. Integrate
2. Store
Toad Data Point
Desktop – integrate, cleanse
Cloud – integrate, correlate
Stock market data
Dell Statistica Big Data
Desktop – crawl, save
• Speed Improvements in Map / Reduce
• More in-memory workloads
– Possible move to Spark to replace Map/Reduce
• Virtualized Hadoop
– VMware Big Data Extensions
– OpenStack Sahara
– Microsoft HDInsight (Hortonworks)
Dell In-Memory Appliance for Cloudera Enterprise
Configurations at a glance
16 Node Cluster
PowerEdge R720 - 4 Infrastructure Nodes
PowerEdge R720XD - 12 Data Nodes with
Dell Rack 42U
~528TB (disk raw space)
8 Node Cluster
PowerEdge R720 - 4 Infrastructure Nodes
PowerEdge R720XD - 4 Data Nodes with
Dell Rack 42U
~176TB (disk raw space)
24 Node Cluster
PowerEdge R720 - 4 Infrastructure Nodes
PowerEdge R720XD - 20 Data Nodes with
Dell Rack 42U
~880TB (disk raw space)
Expansion Unit - PowerEdge R720XD - 4 Data Nodes with ProSupport, Cloudera Enterprise, Scales in
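The three configurations above can be cross-checked with a short calculation. This is a sketch using only the node counts and approximate raw-capacity figures listed in the configurations. The usable-capacity column applies the default HDFS Replication Factor of 3 mentioned earlier in the deck, which is an assumption here, since the appliance's actual replication setting is not stated. The names CONFIGS and summarize are hypothetical.

```python
# Figures copied from the configurations above; raw capacities are approximate (~).
CONFIGS = {
    "8 Node Cluster": {"infra_nodes": 4, "data_nodes": 4, "raw_tb": 176},
    "16 Node Cluster": {"infra_nodes": 4, "data_nodes": 12, "raw_tb": 528},
    "24 Node Cluster": {"infra_nodes": 4, "data_nodes": 20, "raw_tb": 880},
}

# Assumption: the default HDFS replication factor applies to the appliance.
REPLICATION_FACTOR = 3


def summarize(configs, replication_factor):
    """Return (name, total nodes, raw TB per data node, usable TB) per configuration."""
    rows = []
    for name, cfg in configs.items():
        total_nodes = cfg["infra_nodes"] + cfg["data_nodes"]
        raw_per_node = cfg["raw_tb"] / cfg["data_nodes"]
        usable_tb = round(cfg["raw_tb"] / replication_factor, 1)
        rows.append((name, total_nodes, raw_per_node, usable_tb))
    return rows


rows = summarize(CONFIGS, REPLICATION_FACTOR)
for name, total_nodes, raw_per_node, usable_tb in rows:
    print(f"{name}: {total_nodes} nodes, {raw_per_node:.0f} TB raw per data node, "
          f"~{usable_tb} TB usable")
```

Each configuration works out to about 44 TB of raw disk per data node, so the listed capacities scale linearly with the number of data nodes. Under the assumed replication factor, usable capacity is roughly 58.7 TB, 176 TB, and 293.3 TB respectively.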