Practice of large Hadoop cluster in China Mobile
Speaker: Duan Yunfeng, Pan Yuxuan
China Mobile Communications Corporation(CMCC)
2018-6
Big Data in China Mobile
Data in China Mobile
• 900 million customers
• 3 million base stations
• 200 million IoT connections
• 100PB data generated per day
Hadoop in China Mobile
• One large centralized cluster, the CBA cluster, with 1,600 nodes
• Several smaller compute clusters, one in each province
• 1+N architecture
• 15,000 Hadoop nodes in all
Outline
01 Introduction of the CBA Cluster
02 Experience in Construction
03 Future Works
About the CBA Project
Centralized Business Analysis (CBA): a BI system based on an enterprise data warehouse, serving the Group Company, branch companies and subsidiary companies.
Data sources:
• 2/3/4G and WLAN logs
• Detailed data traffic
• Customer information
• Service records
• Web crawler
• …
Applications:
• Finance, Security, Tourism, Traffic, Advertisement, Healthcare
• On-line behavior analysis
• Internet opinion analysis
• Customer portraits
Brief History of the CBA Project
Stage 1 (2016.8)
• 600 nodes in total
• Max Hadoop cluster: 400 nodes
Stage 2 (2017.10)
• 2,400 nodes in total
• Max Hadoop cluster: 1,600 nodes
Stage 3 (2018.12)
• 21,000 nodes in total
• Max Hadoop cluster: 14,000 nodes
Current CBA Cluster
Largest Hadoop cluster: 1,600+ nodes
380 million HDFS files, total capacity: 62.38 PB
2 PB of input data per day, 20,000 jobs per day
14 million files written into HDFS by Flume per day
[Architecture diagram] Components and node counts, spanning the branch companies (B DMZ / B CORE) and the Group Company (H DMZ / H CORE): FTP (20), FTP (4), Gateway, Flume (90), Flume (18), Hadoop Cluster (1584+17), HBase 1 (222+7), HBase 2 (222+3), Crawler (30), Kafka (6), Spark Streaming (14), Application (10 / 20 / 20 / 25), Business System
BigData Platform
Hadoop 2.8.2
Hive 1.2.2
Spark 2.2.0
HBase 1.2.6
Flume 1.6.0
Ambari(HControl) 2.1.1
Outline
01 Introduction of the CBA Cluster
02 Experience in Construction
03 Future Works
Challenge
• Large amount of data, massive data types
• Build the whole cluster from nothing in a few months
Deployment -> Test -> Production
Tuning areas: Ambari Tuning, LDAP Tuning, HDFS Tuning, Flume Tuning, Application Tuning, Operation Management
Highlights
Data collection: filtering, normalization and encryption
• SQL-based Flume Interceptor
• Easily extended to various data sources
HDFS tuning
• Use NameNode Federation to scale the namespace horizontally
• Use FairCallQueue to reduce RPC latency
Cluster deployment and maintenance
• Ambari tuning
• Cluster maintenance with AI
Flume Background
Data flows from the Gateway cluster to the Collector cluster. Operations Flume performs before sending data to HDFS:
• Decompression
• Filtering by certain fields
• Normalization
• Encryption
Problems:
• Performance: 50 MB/s per node, 400 nodes needed in all
• Unstable, GC overhead
• Logic hard-coded in the Flume Interceptor
Flume: SQL-based Interceptor
Before: filtering, normalization and encryption were each done in a separate interceptor, and every interceptor repeated its own deserialize / process / serialize steps (Source -> Filter Interceptor -> Normalization Interceptor -> Encryption Interceptor -> Channel).
After: implement all of the logic in one SQL-based interceptor (Source -> SQL Interceptor -> Channel).
• Use UDFs to implement the specific logic
• Use Hive to parse the SQL and get the query execution plan
• Execute the Operator code (FilterOperator -> SelectOperator -> SinkOperator) inside the Flume interceptor
• Override SinkOperator to convert the record object into a Flume record
Example SQL:
Select
  c1_2, c3, c4_7,
  sm4(normalizePhoneNum(c8), strtolong(c20)),
  c9_19, strtodate(c20),
  strtodate(c21), c22_200
from event
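For context, a minimal sketch of the standard Flume Interceptor contract such a SQL-based interceptor would plug into. The `SqlInterceptor` class name, the `sql` config key and the commented-out plan execution are illustrative assumptions; only the `org.apache.flume.interceptor.Interceptor` API shown here is Flume's real interface.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

/**
 * Skeleton of a SQL-driven Flume interceptor. The real CBA implementation
 * compiles the configured SQL with Hive and runs the resulting operator
 * pipeline; that part is only indicated by comments here.
 */
public class SqlInterceptor implements Interceptor {

  private final String sql; // SQL configured for this interceptor

  private SqlInterceptor(String sql) {
    this.sql = sql;
  }

  @Override
  public void initialize() {
    // Compile this.sql once and keep the operator pipeline:
    // FilterOperator -> SelectOperator -> custom SinkOperator.
  }

  @Override
  public Event intercept(Event event) {
    byte[] body = event.getBody();
    // Deserialize the record, push it through the compiled operators,
    // then serialize the result back into the event. Returning null
    // drops the event, which is how filtering is expressed in Flume.
    event.setBody(body); // placeholder: unchanged pass-through
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> out = new ArrayList<>(events.size());
    for (Event e : events) {
      Event intercepted = intercept(e);
      if (intercepted != null) { // null means the record was filtered out
        out.add(intercepted);
      }
    }
    return out;
  }

  @Override
  public void close() {
    // Release Hive/operator resources.
  }

  /** Builder used by Flume to create the interceptor from the agent config. */
  public static class Builder implements Interceptor.Builder {
    private String sql;

    @Override
    public void configure(Context context) {
      sql = context.getString("sql"); // hypothetical config key
    }

    @Override
    public Interceptor build() {
      return new SqlInterceptor(sql);
    }
  }
}
```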
Flume—SerDe
Serialization/Deserialization
• Use LazySimpleSerDe from Hive
• Merge fields during serialization
Not all of the columns need to be normalized or encrypted, so adjacent unprocessed fields are merged and carried through as a single field instead of being deserialized and serialized one by one.
• Reduces CPU time and the number of Java objects created
• 2x performance improvement
Flume—tuning
• Use ConcurrentLinkedDeque instead of LinkedBlockingDeque in MemoryChannel
• Reduce memory consumption by adjusting the channel capacity
agent.channels.c1.capacity=2000000
• Adjust the MemoryChannel keep-alive parameter to tolerate HDFS performance variation
agent.channels.c1.keep-alive=24
• Reduce the number of open HDFS files by adjusting HDFSEventSink hdfs.idleTimeout
agent.sinks.s1.hdfs.idleTimeout=600
• Improve JVM performance by adding the option -XX:+UseLargePages
Performance improvement: 50 MB/s -> 790 MB/s per node
Flume nodes reduced: 400 nodes -> 90 nodes
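Pulled together, the tuned parameters above would sit in a single agent configuration roughly like the sketch below. The agent, source, channel and sink names, the Avro source and the HDFS path are illustrative assumptions; only the capacity, keep-alive and idleTimeout values come from the slide.

```properties
# Illustrative Flume agent layout; only the tuned values come from the slide.
agent.sources  = r1
agent.channels = c1
agent.sinks    = s1

# Assumed Avro source receiving records from the gateway collectors
agent.sources.r1.type = avro
agent.sources.r1.bind = 0.0.0.0
agent.sources.r1.port = 4141
agent.sources.r1.channels = c1
# Wire in the SQL interceptor sketched earlier (class name is hypothetical)
agent.sources.r1.interceptors = i1
agent.sources.r1.interceptors.i1.type = com.example.SqlInterceptor$Builder
agent.sources.r1.interceptors.i1.sql = select c1_2, c3 from event

# MemoryChannel: larger capacity, longer keep-alive to ride out slow HDFS
agent.channels.c1.type = memory
agent.channels.c1.capacity = 2000000
agent.channels.c1.keep-alive = 24

# HDFS sink: close idle files after 600 s to cap the number of open files
agent.sinks.s1.type = hdfs
agent.sinks.s1.channel = c1
agent.sinks.s1.hdfs.path = /data/flume/%Y%m%d
agent.sinks.s1.hdfs.idleTimeout = 600

# JVM option from the slide, set in flume-env.sh:
#   JAVA_OPTS="$JAVA_OPTS -XX:+UseLargePages"
```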
Challenge of HDFS
Too many files in the NameNode
• More than 300 million files in the namespace
• NN memory usage over 150 GB, with 180 GB configured
NN RPC performance becomes the bottleneck
• Processes 30 million RPC calls per hour
• RPCs accumulate in the call queue; RPC response time over 10 s
NameNode HA failures
• Deadlock during HA failover under high concurrency (code bug)
• Had to restart everything when the active NN failed
Too many HDFS failures
HDFS (NameNode) repeatedly went down, HA did not work, and downtime exceeded 2 hours each time:
• NameNode JVM GC timeouts
• Too many HDFS audit logs filling the disk
• Network failures
• RPC pressure too high
• ......
Optimize the NameSpace
• Scale the namespace: Federation grew from 2 to 5 NameSpaces
  Before: NS1 (Hive/Flume/Apps), NS2 (Apps)
  After: NS1 (Hive), NS2 (Flume), NS3 (Apps), NS4 (Apps), NS5 (Apps)
• YARN log files exceeded 100 million
  Introduced the Ambari Logsearch tool to manage YARN logs; no need to keep these logs for a long time
• NameNode memory usage: 160 GB -> 90 GB
FairCallQueue
Challenge:
• Most RPC calls come from batch-job users
• Flume tasks require low-latency access to HDFS
• Flume tasks need higher priority when accessing HDFS
Solution: use FairCallQueue (https://issues.apache.org/jira/browse/HADOOP-9640)
• RPC calls are placed into multiple priority queues (queue0–queue3) and a multiplexer takes calls from them according to priority
• Massive, latency-insensitive batch-job RPCs -> low-priority queues
• Few, latency-sensitive RPCs -> high-priority queue
• Latency of RPCs from Flume: more than 10 s -> less than 0.5 s
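Enabling FairCallQueue is a NameNode-side configuration change. A minimal sketch, assuming 8020 is the NameNode RPC port; the `ipc.<port>.callqueue.impl` key follows the HADOOP-9640 / FairCallQueue documentation for Hadoop 2.x, but exact scheduler key names vary between releases, so verify against your version.

```xml
<!-- core-site.xml on the NameNode; 8020 is the assumed NameNode RPC port -->
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<!-- The number of priority levels and the decay-based scheduler are tuned via
     the other ipc.8020.* properties; key names differ across Hadoop 2.x releases. -->
```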
NameNode GC algorithm
GC algorithm in the NameNode JVM options: CMS GC -> G1GC
Before (CMS):
-XX:+UseConcMarkSweepGC
-XX:ParallelGCThreads=8
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
After (G1):
-XX:+UseG1GC
-XX:ParallelGCThreads=20
-XX:ConcGCThreads=20
-XX:MaxGCPauseMillis=5000
• GC time reduced from 15 ms to 2 ms
• Long GC pauses (Concurrent Mode Failure) no longer occur
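In practice these flags go into the NameNode's JVM options, typically via HADOOP_NAMENODE_OPTS in hadoop-env.sh. A sketch; the 180 GB heap follows the configured NN heap mentioned earlier, and the exact option set is the slide's G1 list.

```sh
# hadoop-env.sh: switch the NameNode from CMS to G1 (heap size is illustrative)
export HADOOP_NAMENODE_OPTS="-Xms180g -Xmx180g \
  -XX:+UseG1GC -XX:ParallelGCThreads=20 -XX:ConcGCThreads=20 \
  -XX:MaxGCPauseMillis=5000 ${HADOOP_NAMENODE_OPTS}"
```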
Block Placement
Problem:
• The number of nodes on each rack may differ
• Nodes on smaller racks receive more block replicas
• Those nodes run out of space sooner and carry a higher load
Analysis:
• The default block placement policy is used in the cluster
• Placement of the 1st replica is balanced
• Placement of the 2nd/3rd replicas is imbalanced when the racks themselves are unbalanced
WeightedRackBlockPlacementPolicy
Solution (https://issues.apache.org/jira/browse/HDFS-13279):
• Implement a new BlockPlacementPolicy: WeightedRackBlockPlacementPolicy
• Calculate the probability of block placement on each rack according to the number of nodes on that rack
• Calculate the weight of each rack from that probability: the higher the probability, the lower the weight
• Adjust the block placement according to the weights
Results (closer to 1 is better):
Total Nodes | Racks | Nodes per Normal Rack | Nodes per Small Rack | Without WeightedRackBlockPlacement | With WeightedRackBlockPlacement
35   | 3  | 15 | 5 | 1.334 | 1.103
50   | 4  | 15 | 5 | 1.189 | 1.023
65   | 5  | 15 | 5 | 1.114 | 1.035
140  | 10 | 15 | 5 | 1.087 | 1.014
95   | 4  | 30 | 5 | 1.247 | 1.030
155  | 4  | 50 | 5 | 1.288 | 1.030
455  | 10 | 50 | 5 | 1.108 | 1.014
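As an illustration of the weighting idea only (not the HDFS-13279 code), the toy sketch below picks a remote rack with probability proportional to its node count, so that on average each node receives the same number of replicas regardless of rack size. A real custom policy is plugged into HDFS via dfs.block.replicator.classname in hdfs-site.xml.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

/**
 * Toy illustration of weighted rack selection: racks are chosen with a
 * probability proportional to their node count, evening out per-node replica
 * counts between small and large racks. This is only a sketch of the idea
 * behind WeightedRackBlockPlacementPolicy, not the actual implementation.
 */
public class WeightedRackChooser {

  private final Map<String, Integer> nodesPerRack = new LinkedHashMap<>();
  private final Random random = new Random();

  public void addRack(String rack, int nodeCount) {
    nodesPerRack.put(rack, nodeCount);
  }

  /** Pick a rack, excluding the rack that already holds the first replica. */
  public String chooseRemoteRack(String excludedRack) {
    int totalWeight = 0;
    for (Map.Entry<String, Integer> e : nodesPerRack.entrySet()) {
      if (!e.getKey().equals(excludedRack)) {
        totalWeight += e.getValue();        // weight = node count of the rack
      }
    }
    int pick = random.nextInt(totalWeight); // uniform in [0, totalWeight)
    for (Map.Entry<String, Integer> e : nodesPerRack.entrySet()) {
      if (e.getKey().equals(excludedRack)) {
        continue;
      }
      pick -= e.getValue();
      if (pick < 0) {
        return e.getKey();                  // landed inside this rack's weight range
      }
    }
    throw new IllegalStateException("no rack available");
  }

  public static void main(String[] args) {
    WeightedRackChooser chooser = new WeightedRackChooser();
    chooser.addRack("rack1", 15);
    chooser.addRack("rack2", 15);
    chooser.addRack("rack3", 5); // small rack: chosen ~3x less often than the big racks
    System.out.println(chooser.chooseRemoteRack("rack1"));
  }
}
```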
Other tuning
• Set the MR property mapreduce.fileoutputcommitter.algorithm.version=2, which reduces RPC calls to the NN (https://issues.apache.org/jira/browse/MAPREDUCE-4815)
  For certain big jobs, RPC calls are reduced by 40%
• Use Tez instead of MR in Hive: hive.execution.engine=tez
• Merge tasks in the ETL process to reduce HDFS access
Operation Management
NameSpace quota
• Set the namespace quota at the root of each NS to 300 million
Estimate application RPC production
• Count the RPCs generated by an application in the DEV environment before it enters production
• Based on the hdfs-audit log
Limit HEAVY RPCs
• Heavy RPC: recursive operations (delete, getContentSummary) on a huge directory
• Implemented in Ranger
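For the RPC estimation step, the hdfs-audit log can simply be tallied per operation. A minimal sketch: the log path is an assumption, and the `cmd=` parsing follows the default tab-separated FSNamesystem audit-log format.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Count hdfs-audit log entries per command, e.g. to estimate an app's RPC profile in DEV. */
public class AuditRpcCounter {
  public static void main(String[] args) throws IOException {
    String path = args.length > 0 ? args[0] : "/var/log/hadoop/hdfs/hdfs-audit.log"; // assumed path
    Map<String, Long> perCmd = new TreeMap<>();
    try (Stream<String> lines = Files.lines(Paths.get(path))) {
      lines.forEach(line -> {
        // Audit lines look like: allowed=true  ugi=...  ip=...  cmd=open  src=...  (tab-separated)
        int i = line.indexOf("cmd=");
        if (i < 0) {
          return;
        }
        int end = line.indexOf('\t', i);
        if (end < 0) {
          end = line.indexOf(' ', i);
        }
        String cmd = line.substring(i + 4, end < 0 ? line.length() : end);
        perCmd.merge(cmd, 1L, Long::sum);
      });
    }
    perCmd.forEach((cmd, count) -> System.out.println(cmd + "\t" + count));
  }
}
```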
Ambari tuning
Challenge: with the cluster grown to 1,600 nodes, the Ambari service became very slow
• Parameter tuning
• Apply patches
• Code improvements
New features supported:
• NameNode Federation deployment
• High availability of the Ambari Server
LDAP Tuning for a Kerberized Cluster
Problem: too many connections
• Connections to the LDAP server exceeded 7,000; latency on the LDAP server node exceeded 8 s
Solutions
• Use NSCD to cache user information locally
• Support multiple LDAP servers in Hadoop: HDFS group lookups go through a MultiLdapGroupsMapping that load-balances across LDAP Server1:389 ... LDAP ServerN:389
<property>
<name>hadoop.security.group.mapping.ldap.url</name>
<value>ldap1:389,ldap2:389,ldap3:389</value>
</property>
Connections on the LDAP server: 7,000 -> 700
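A rough sketch of what such a MultiLdapGroupsMapping could look like, assuming it delegates to one stock LdapGroupsMapping per server and fails over between them. The class below is illustrative, not CMCC's actual implementation; only the GroupMappingServiceProvider contract and the hadoop.security.group.mapping.ldap.url key are Hadoop's real API.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.GroupMappingServiceProvider;
import org.apache.hadoop.security.LdapGroupsMapping;

/**
 * Illustrative group mapping that spreads lookups over several LDAP servers.
 * Configure hadoop.security.group.mapping=<this class> and
 * hadoop.security.group.mapping.ldap.url=ldap1:389,ldap2:389,...
 */
public class MultiLdapGroupsMapping implements GroupMappingServiceProvider, Configurable {

  private Configuration conf;
  private final List<LdapGroupsMapping> delegates = new ArrayList<>();
  private int next = 0;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    delegates.clear();
    for (String url : conf.getTrimmedStrings("hadoop.security.group.mapping.ldap.url")) {
      Configuration perServer = new Configuration(conf);
      perServer.set("hadoop.security.group.mapping.ldap.url", url); // one server per delegate
      LdapGroupsMapping mapping = new LdapGroupsMapping();
      mapping.setConf(perServer);
      delegates.add(mapping);
    }
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public List<String> getGroups(String user) throws IOException {
    IOException last = null;
    // Round-robin start point, then fail over to the remaining servers.
    for (int i = 0; i < delegates.size(); i++) {
      LdapGroupsMapping mapping = delegates.get((next + i) % delegates.size());
      try {
        List<String> groups = mapping.getGroups(user);
        next = (next + 1) % delegates.size();
        return groups;
      } catch (IOException e) {
        last = e;
      }
    }
    throw last != null ? last : new IOException("no LDAP server configured");
  }

  @Override
  public void cacheGroupsRefresh() throws IOException {
    // No local cache is maintained here; nothing to refresh.
  }

  @Override
  public void cacheGroupsAdd(List<String> groups) throws IOException {
    // No-op: this provider does not maintain its own cache.
  }
}
```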
HSmart
An intelligent operation management tool that helps users optimize the Hadoop cluster and the jobs running on it.
Cluster Health Inspection
• Score the cluster status
• Suggestions for problems
Cluster Resource Prediction
• 35 selected cluster metrics
• Predict future resource consumption with an LSTM algorithm
Job Tuning
• Collect and analyze job logs, counters and metrics
• Provide tuning suggestions for jobs
• Inspired by Dr. Elephant @ LinkedIn
Outline
01 Introduction of the CBA Cluster
02 Experience in Construction
03 Future Works
Challenges in the Future
Cluster scale
• Growing very fast: 1,600 -> 14,000+ nodes
• HDFS Federation limits? YARN cluster limits? Ambari limits?
Multiple sub-clusters
• Single sub-cluster: 3,000 to 5,000 nodes
• RouterBasedFederation / YarnFederation across sub-clusters, each sub-cluster running its own Ambari, YARN and HDFS Federation (NS1–NS4)
Balance among NameSpaces
• Data is currently divided among NSs by business
• Large load differences among NSs
• Different load types on NameNodes: some NameNodes have more files but fewer RPC requests
• A balancer to move data between NSs
Summary
Challenges in the construction of a large Hadoop cluster
• Flume, HDFS, Ambari, LDAP
• Not only follow the community, but also add our own work
The NameNode is the most difficult bottleneck
• Namespace size, RPC performance
• Extend NSs, parameter tuning, FairCallQueue, operation management
Large cluster maintenance
• Introduce AI into cluster maintenance
Challenge in the future: 1,600 nodes -> 14,000 nodes
Thank you!
Contact e-mail: panyuxuan@cmss.chinamobile.com
Appendix
Code example: compile a SQL statement and get the operator list from the Hive Driver.
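The original appendix code was only an image, so here is a rough reconstruction of the idea as a hedged sketch, assuming Hive 1.2.x (Driver.compile, QueryPlan.getRootTasks, MapWork.getAliasToWork); exact class and method names should be checked against your Hive version, and the queried table is assumed to exist in the metastore.

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.QueryPlan;
import org.apache.hadoop.hive.ql.exec.Operator;
import org.apache.hadoop.hive.ql.exec.Task;
import org.apache.hadoop.hive.ql.plan.MapredWork;
import org.apache.hadoop.hive.ql.session.SessionState;

/** Sketch: compile a SQL statement with the Hive Driver and walk the operator tree. */
public class CompileSqlExample {

  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    // Keep the plan as a MapReduce task so the operator tree is easy to reach.
    conf.set("hive.fetch.task.conversion", "none");
    SessionState.start(conf);

    Driver driver = new Driver(conf);
    // The 'event' table is assumed to exist in the metastore.
    int rc = driver.compile("select c3 from event where c1 = '1'");
    if (rc != 0) {
      throw new IllegalStateException("compile failed, rc=" + rc);
    }

    QueryPlan plan = driver.getPlan();
    for (Task<?> task : plan.getRootTasks()) {
      if (task.getWork() instanceof MapredWork) {
        MapredWork work = (MapredWork) task.getWork();
        // One root operator (TableScan) per table alias in the map work.
        for (Operator<?> root : work.getMapWork().getAliasToWork().values()) {
          printOperators(root, 0);
        }
      }
    }
  }

  /** Depth-first walk over the operator pipeline (TableScan -> Filter -> Select -> ...). */
  private static void printOperators(Operator<?> op, int depth) {
    StringBuilder indent = new StringBuilder();
    for (int i = 0; i < depth; i++) {
      indent.append("  ");
    }
    System.out.println(indent + op.getClass().getSimpleName());
    if (op.getChildOperators() != null) {
      for (Operator<?> child : op.getChildOperators()) {
        printOperators(child, depth + 1);
      }
    }
  }
}
```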

More Related Content

What's hot

Postgresql 12 streaming replication hol
Postgresql 12 streaming replication holPostgresql 12 streaming replication hol
Postgresql 12 streaming replication holVijay Kumar N
 
Hyper-V を Windows PowerShell から管理する
Hyper-V を Windows PowerShell から管理するHyper-V を Windows PowerShell から管理する
Hyper-V を Windows PowerShell から管理するjunichi anno
 
Message Queue 가용성, 신뢰성을 위한 RabbitMQ Server, Client 구성
Message Queue 가용성, 신뢰성을 위한 RabbitMQ Server, Client 구성Message Queue 가용성, 신뢰성을 위한 RabbitMQ Server, Client 구성
Message Queue 가용성, 신뢰성을 위한 RabbitMQ Server, Client 구성Yoonjeong Kwon
 
MongoDB: How it Works
MongoDB: How it WorksMongoDB: How it Works
MongoDB: How it WorksMike Dirolf
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application ResourcesDataWorks Summit
 
Linux の hugepage の開発動向
Linux の hugepage の開発動向Linux の hugepage の開発動向
Linux の hugepage の開発動向Naoya Horiguchi
 
MariaDB Galera Cluster - Simple, Transparent, Highly Available
MariaDB Galera Cluster - Simple, Transparent, Highly AvailableMariaDB Galera Cluster - Simple, Transparent, Highly Available
MariaDB Galera Cluster - Simple, Transparent, Highly AvailableMariaDB Corporation
 
Web hdfs and httpfs
Web hdfs and httpfsWeb hdfs and httpfs
Web hdfs and httpfswchevreuil
 
検証環境をGoBGPで極力仮想化してみた
検証環境をGoBGPで極力仮想化してみた検証環境をGoBGPで極力仮想化してみた
検証環境をGoBGPで極力仮想化してみたToshiya Mabuchi
 
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...Severalnines
 
Sizing Your MongoDB Cluster
Sizing Your MongoDB ClusterSizing Your MongoDB Cluster
Sizing Your MongoDB ClusterMongoDB
 
Dockerイメージ管理の内部構造
Dockerイメージ管理の内部構造Dockerイメージ管理の内部構造
Dockerイメージ管理の内部構造Etsuji Nakai
 
03 spark rdd operations
03 spark rdd operations03 spark rdd operations
03 spark rdd operationsVenkat Datla
 
HDFSネームノードのHAについて #hcj13w
HDFSネームノードのHAについて #hcj13wHDFSネームノードのHAについて #hcj13w
HDFSネームノードのHAについて #hcj13wCloudera Japan
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 
NY Meetup: Scaling MariaDB with Maxscale
NY Meetup: Scaling MariaDB with MaxscaleNY Meetup: Scaling MariaDB with Maxscale
NY Meetup: Scaling MariaDB with MaxscaleWagner Bianchi
 
OpenStack High Availability
OpenStack High AvailabilityOpenStack High Availability
OpenStack High AvailabilityJakub Pavlik
 
Intro ProxySQL
Intro ProxySQLIntro ProxySQL
Intro ProxySQLI Goo Lee
 

What's hot (20)

Postgresql 12 streaming replication hol
Postgresql 12 streaming replication holPostgresql 12 streaming replication hol
Postgresql 12 streaming replication hol
 
Hyper-V を Windows PowerShell から管理する
Hyper-V を Windows PowerShell から管理するHyper-V を Windows PowerShell から管理する
Hyper-V を Windows PowerShell から管理する
 
Message Queue 가용성, 신뢰성을 위한 RabbitMQ Server, Client 구성
Message Queue 가용성, 신뢰성을 위한 RabbitMQ Server, Client 구성Message Queue 가용성, 신뢰성을 위한 RabbitMQ Server, Client 구성
Message Queue 가용성, 신뢰성을 위한 RabbitMQ Server, Client 구성
 
MongoDB: How it Works
MongoDB: How it WorksMongoDB: How it Works
MongoDB: How it Works
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resources
 
Hadoop入門
Hadoop入門Hadoop入門
Hadoop入門
 
Linux の hugepage の開発動向
Linux の hugepage の開発動向Linux の hugepage の開発動向
Linux の hugepage の開発動向
 
MariaDB Galera Cluster - Simple, Transparent, Highly Available
MariaDB Galera Cluster - Simple, Transparent, Highly AvailableMariaDB Galera Cluster - Simple, Transparent, Highly Available
MariaDB Galera Cluster - Simple, Transparent, Highly Available
 
Web hdfs and httpfs
Web hdfs and httpfsWeb hdfs and httpfs
Web hdfs and httpfs
 
Introduction to SLURM
 Introduction to SLURM Introduction to SLURM
Introduction to SLURM
 
検証環境をGoBGPで極力仮想化してみた
検証環境をGoBGPで極力仮想化してみた検証環境をGoBGPで極力仮想化してみた
検証環境をGoBGPで極力仮想化してみた
 
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...
MySQL Load Balancers - Maxscale, ProxySQL, HAProxy, MySQL Router & nginx - A ...
 
Sizing Your MongoDB Cluster
Sizing Your MongoDB ClusterSizing Your MongoDB Cluster
Sizing Your MongoDB Cluster
 
Dockerイメージ管理の内部構造
Dockerイメージ管理の内部構造Dockerイメージ管理の内部構造
Dockerイメージ管理の内部構造
 
03 spark rdd operations
03 spark rdd operations03 spark rdd operations
03 spark rdd operations
 
HDFSネームノードのHAについて #hcj13w
HDFSネームノードのHAについて #hcj13wHDFSネームノードのHAについて #hcj13w
HDFSネームノードのHAについて #hcj13w
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
NY Meetup: Scaling MariaDB with Maxscale
NY Meetup: Scaling MariaDB with MaxscaleNY Meetup: Scaling MariaDB with Maxscale
NY Meetup: Scaling MariaDB with Maxscale
 
OpenStack High Availability
OpenStack High AvailabilityOpenStack High Availability
OpenStack High Availability
 
Intro ProxySQL
Intro ProxySQLIntro ProxySQL
Intro ProxySQL
 

Similar to Practice of large Hadoop cluster in China Mobile

FreeSWITCH as a Microservice
FreeSWITCH as a MicroserviceFreeSWITCH as a Microservice
FreeSWITCH as a MicroserviceEvan McGee
 
Building high performance microservices in finance with Apache Thrift
Building high performance microservices in finance with Apache ThriftBuilding high performance microservices in finance with Apache Thrift
Building high performance microservices in finance with Apache ThriftRX-M Enterprises LLC
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application Apache Apex
 
Scaling paypal workloads with oracle rac ss
Scaling paypal workloads with oracle rac ssScaling paypal workloads with oracle rac ss
Scaling paypal workloads with oracle rac ssAnil Nair
 
How to run a bank on Apache CloudStack
How to run a bank on Apache CloudStackHow to run a bank on Apache CloudStack
How to run a bank on Apache CloudStackgjdevos
 
NGINX Plus R20 Webinar
NGINX Plus R20 WebinarNGINX Plus R20 Webinar
NGINX Plus R20 WebinarNGINX, Inc.
 
Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform BenchmarkSunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform BenchmarkShay Hassidim
 
Scale Your Load Balancer from 0 to 1 million TPS on Azure
Scale Your Load Balancer from 0 to 1 million TPS on AzureScale Your Load Balancer from 0 to 1 million TPS on Azure
Scale Your Load Balancer from 0 to 1 million TPS on AzureAvi Networks
 
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWS
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWSArquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWS
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWSAmazon Web Services LATAM
 
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022HostedbyConfluent
 
Everything You Need to Know About Sharding
Everything You Need to Know About ShardingEverything You Need to Know About Sharding
Everything You Need to Know About ShardingMongoDB
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Improving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceImproving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceMing Ma
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at ScaleSean Zhong
 
IRATI: an open source RINA implementation for Linux/OS
IRATI: an open source RINA implementation for Linux/OSIRATI: an open source RINA implementation for Linux/OS
IRATI: an open source RINA implementation for Linux/OSICT PRISTINE
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineData Con LA
 
Multi-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation StrategiesMulti-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation StrategiesSagi Brody
 
Example of One of my Desgins for Cyber &Networking Solutions for Customers ...
Example of One  of my Desgins  for Cyber &Networking Solutions for Customers ...Example of One  of my Desgins  for Cyber &Networking Solutions for Customers ...
Example of One of my Desgins for Cyber &Networking Solutions for Customers ...chen sheffer
 

Similar to Practice of large Hadoop cluster in China Mobile (20)

FreeSWITCH as a Microservice
FreeSWITCH as a MicroserviceFreeSWITCH as a Microservice
FreeSWITCH as a Microservice
 
Building high performance microservices in finance with Apache Thrift
Building high performance microservices in finance with Apache ThriftBuilding high performance microservices in finance with Apache Thrift
Building high performance microservices in finance with Apache Thrift
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
Scaling paypal workloads with oracle rac ss
Scaling paypal workloads with oracle rac ssScaling paypal workloads with oracle rac ss
Scaling paypal workloads with oracle rac ss
 
How to run a bank on Apache CloudStack
How to run a bank on Apache CloudStackHow to run a bank on Apache CloudStack
How to run a bank on Apache CloudStack
 
NGINX Plus R20 Webinar
NGINX Plus R20 WebinarNGINX Plus R20 Webinar
NGINX Plus R20 Webinar
 
Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform BenchmarkSunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
 
Scale Your Load Balancer from 0 to 1 million TPS on Azure
Scale Your Load Balancer from 0 to 1 million TPS on AzureScale Your Load Balancer from 0 to 1 million TPS on Azure
Scale Your Load Balancer from 0 to 1 million TPS on Azure
 
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWS
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWSArquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWS
Arquitetura Hibrida - Integrando seu Data Center com a Nuvem da AWS
 
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022
 
Everything You Need to Know About Sharding
Everything You Need to Know About ShardingEverything You Need to Know About Sharding
Everything You Need to Know About Sharding
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Improving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceImproving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of Service
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
 
IRATI: an open source RINA implementation for Linux/OS
IRATI: an open source RINA implementation for Linux/OSIRATI: an open source RINA implementation for Linux/OS
IRATI: an open source RINA implementation for Linux/OS
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
 
Multi-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation StrategiesMulti-Layer DDoS Mitigation Strategies
Multi-Layer DDoS Mitigation Strategies
 
Example of One of my Desgins for Cyber &Networking Solutions for Customers ...
Example of One  of my Desgins  for Cyber &Networking Solutions for Customers ...Example of One  of my Desgins  for Cyber &Networking Solutions for Customers ...
Example of One of my Desgins for Cyber &Networking Solutions for Customers ...
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sectoritnewsafrica
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 

Recently uploaded (20)

All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 

Practice of large Hadoop cluster in China Mobile

  • 1. Practice of large Hadoop cluster in China Mobile Speaker: Duan Yunfeng, Pan Yuxuan China Mobile Communications Corporation(CMCC) 2018-6
  • 2. Big Data in China Mobile Data in China Mobile • 900 million customers • 3 million base stations • 200 million IoT connections • 100PB data generated per day Hadoop in China Mobile • A big centralized control cluster, CBA cluster, 1600 nodes • Several small compute cluster in each province • 1+N architecture • 15, 000 Hadoop nodes in all
  • 3. Outline 02 Experience in Construction 01 Introduction of CBA ClusterC 目 录 ONTENTS 03 Future works
  • 4. About CBA Project Data Source: • 2/3/4G, WLAN log • Detailed data traffic • Customer information • Service Record • Web Crawler • … Applications: • In Finance, Security, Tourism, Traffic, Advertisement, Healthcare • On-line behavior analysis • Internet Opinion analysis • Customer portrait Centralized Business Analysis (CBA) BI system based on Enterprise data warehouse for Group Company, branch companies and subsidiary companies
  • 5. Brief History of CBA Project Stage 1 • 2016.8 • 600 nodes totally • Max Hadoop: 400 nodes Stage 2 • 2017.10 • 2,400 nodes totally • Max Hadoop: 1600 nodes Stage 3 • 2018.12 • 21,000 nodes totally • Max Hadoop: 14,000 nodes … … … … … …
  • 6. Current CBA Cluster FTP (20) Flume (90) Hadoop Cluster (1584+17) HBase 2 (222+3) Application (10) Application (20) Application (20) Business System HBase 1 (222+7) Gateway Crawler (30) Application (25) Flume (18) Kafka (6) Spark streaming (14) H CORE Branch Companies H DMZ B DMZ B CORE FTP (4) Gateway Largest Hadoop Cluster: 1600+ nodes 380 million HDFS files,total capacity: 62.38 PB 2PB input data per day, 20000 jobs per day 14 million files into HDFS by Flume per day Group Company
  • 7. BigData Platform Hadoop 2.8.2 Hive 1.2.2 Spark 2.2.0 HBase 1.2.6 Flume 1.6.0 Ambari(HControl) 2.1.1
  • 8. Outline 02 Experience in Construction 01 Introduction of CBA ClusterC 目 录 ONTENTS 03 Future works
  • 9. Challenge • Large amount of data, massive data types • Build the whole cluster in several months from nothing Deployment Test Product Ambari Tuning LDAP Tuning HDFS Tuning Flume Tuning Operation Mangement Application Tuning
  • 10. Highlights data collector, filtering, normalization and encryption • SQL Based Flume Interceptor • easy extended for various data sources HDFS Turning • using NN Federation to scale NS horizontally • using fair callqueue to reduce RPC latency cluster deployment and maintenance • Ambari turning • cluster maintenance with AI
  • 11. Flume Background Operations Flume do before sending data to HDFS: Decompression Filter by certain fields Normalization Encryption Problem: • Performance 50MB/s per node, need 400 nodes in all • Unstable, GC overhead • Hard code in Interceptor of Flume Gateway Cluster Collect or Cluster
  • 12. Flume: SQL based Interceptor Filtering, Normalization, Encryption in each Interceptor Implement logic in one SQL based Interceptor Use UDF to implement certain logic Use Hive to parse SQL and get Query Execution Plan Select c1_2, c3, c4_7, sm4(normalizePhoneNum(c8), strtolong(c20)), c9_19, strtodate(c20), strtodate(c21), c22_200 from event FilterOperat or SelectOperat or SinkOperator Execute Operator code in Flume interceptor Overwrite SinkOperator, convert record object to flume record Source Deser ialize Process Seriali ze SQL Inteceptor Channel Source Deser ialize Process Seriali ze Filter Inteceptor Deser ialize Process Seriali ze Encryption Inteceptor Deser ialize Process Seriali ze Normalization Inteceptor Channel
  • 13. Flume—SerDe  Serialization/Deserialization • Use LazySimpleSerDe From Hive • Merge fields in Serialization before after Not all of the column need to be normalized or encrypted. Merge unprocessed fields Reduce CPU spent and JAVA Objects created 2X Performance improvement Deserialize Deserialize Serialize Serialize
  • 14. Flume—tuning • Use ConcurrentLinkedDeque instead of LinkedBlockingDeque in MemoryChannel • Reduce memory consume by adjusting Channel Capacity agent.channels.c1.capacity=2000000 • Adjust MemoryChannel Keep-alive parameter to handle HDFS performance variation agent.channels.c1.keep-alive=24 • Reduce HDFS files to write by adjusting HDFSEventSink hdfs.idleTimeout agent.sinks.s1.hdfs.idleTimeout=600 • Improve JVM performance by adding option -XX:+UseLargePages Performance improvement: 50MB/s ->790MB/s per node Flume nodes reduced: 400 nodes -> 90 nodes
  • 15. Challenge of HDFS Too many files in NameNode  More than 300 million files in Namespace.  NN memory usage over 150G, configured 180G NN RPC performance becomes the bottleneck  Process 30 million RPC calls per hour  RPC accumulates in callQueue, RPC response over 10s Namenode-HA failure  Dead lock when HA in high concurrency situations(code bug)  Have to restart all when active NN fails
  • 16. Too many HDFS failures HDFS(namenode) always goes down, HA does not work, downtime over 2 hours each time • Namenode JVM GC overtime • Too many HDFS-Audit log and disk is full • Network failure • RPC pressure too high ......
  • 17. Optimize the NameSpace NS1 Hive/Flume/App s NS2 Apps NS1 Hive NS2 Flume NS3 Apps NS4 Apps NS5 Apps • Scale the NameSpace Federation with 2 NS -> 5 NS • YARN log files over 100 million Introduce Ambari Logsearch tool to manage YARN log, no need to save these logs for a long time. • NameNode memory usage : 160G -> 90G
  • 18. FairCallQueue Challenge: • Most RPC call from batch job users • Flume task requires low latency of HDFS • Flume Task needs higher priority of accessing HDFS queue0 queue1 queue2 queue3 rpc Priority= 1 FairCallQueue Multi plexe r rpc take • Use FairCallQueue • Massive, not sensitive to latency, batch job RPC -> Low priority RPC queue • Few, sensitive to latency RPC -> High priority queue • Latency of RPC from Flume: More than 10s ->less than 0.5shttps://issues.apache.org/jira/browse/HADOOP-9640
  • 19. Namenode GC algorithm -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC - XX:CMSInitiatingOccupancyFraction=7 0 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseG1GC -XX:ParallelGCThreads=20 -XX:ConcGCThreads=20 -XX:MaxGCPauseMillis=5000 GC algorithm in Namenode JVM options: CMS GC -> G1GC • GC time reduce from 15ms to 2ms • Long time GC suspend: Concurrent Mode Failure no longer occur
  • 20. Block Placement rack1 rack2 rack3 Problem: • Number of Nodes on each Rack may be different • Nodes on smaller rack have received more block replicas • Run out of space , load is higher rack4 Analysis: • Default block placement used in cluster • Placement of the 1st replica is balanced • Placement of the 2nd/3rdb block replica is imbalanced when racks are not balanced
  • 21. WeightedRackBlockPlacementPolicy TotalNode s Racks Nodes Per NormalRack Nodes Per SmallRack Without WeightedRack BlockPlacemen t With WeightedRack BlockPlacement (better close to 1) 35 3 15 5 1.334 1.103 50 4 15 5 1.189 1.023 65 5 15 5 1.114 1.035 140 10 15 5 1.087 1.014 95 4 30 5 1.247 1.030 155 4 50 5 1.288 1.030 455 10 50 5 1.108 1.014 Solution: • Implement new BlockPlacementPolicy: WeightedRackBlockPlacementPolic y • Calculate the probability of block placement on each rack, according to number of nodes on each rack • Calculate the weight of each according to the probability. Higher probability with lower weight • Adjust the block placement by https://issues.apache.org/jira/browse/HDFS-13279
  • 22. Other tuning • Set MR property mapreduce.fileoutputcommitter.algorithm.version=2, which reduce rpc calls to the NN For certain big job, RPC calls are reduced by 40% • Use TEZ instead of MR in Hive. hive.execution.engine=tez • Merge tasks in ETL process, reduce HDFS access https://issues.apache.org/jira/browse/MAPREDUCE-4815
  • 23. Operation Management NameSpace Quota • Set NameSpace Quota of the root of each NS to 300 million Estimate application RPC producing • Count RPC generated by one Application in DEV environment before it enters production. • According to hdfs-audit Limit the HEAVY RPC • Heavy RPC: Recursive operation(delete, getContentSummary) to a huge directory • Implements in Ranger
  • 24. Ambari tuning Challenge: Nodes in Cluster up to 1600, Ambari service becomes very slow Parameters tuning Apply patches Code Improvement New features support: • Support NameNode Federation deployment • High Available of Ambari Server
  • 25. LDAP Tuning for Kerberized Cluster Solutions • Use NSCD to cache user information on local • Support multi LDAP Server in Hadoop Connections on LDAP Server: 7000 -> 700 HDFS groupsHDFS groups LDAP Server1:389 LDAP Server2: 389 MultiLdap GroupsMapping LDAP Server N :389 负载均衡 <property> <name>hadoop.security.group.mapping.ldap.url</name> <value>ldap1:389,ldap2:389,ldap3:389</value> </property> Too many connections Connections to LDAP Server over 7000, latency on LDAP Server node over 8s Load balance
  • 26. HSmart An intelligent Operation management tool helps user to optimize Hadoop cluster and jobs running on the cluster.  Cluster Health Inspection • Score the cluster status • Suggestion for problems  Cluster Resource Prediction  Job Tuning • 35 selected cluster metrics • Predict future resource consumption by LSTM Algorithm • Collect and analyze job log, counters and metrics • Provide tuning suggestion for jobs • Referred to Dr. Elephant@LinkedIn
  • 27. Outline 02 Experience in Construction 01 Introduction of CBA ClusterC 目 录 ONTENTS 03 Future works
  • 28. Challenge in Future  Cluster Scale Growing very fast: 1600 ->14000+ nodes HDFS-Federation limitation? Yarn cluster limitation? Ambari limitation?  Multi Sub-clusters  Single Sub-cluster: 3000 to 5000 nodes  RouterBasedFederation/ YarnFederation  Balance among Namespaces Data divided to NSs by business currently Large load difference among NSs Different Load type on Namenode: Some Namenode has more files but less RPC requests  Balancer to move data from NSs 5000? 4000? 3000? Sub Cluster HDFS Federation YARN NS1 NS2 NS3 NS4 Sub Cluster HDFS Federation YARN NS1 NS2 NS3 NS4 Sub Cluster HDFS Federation YARN NS1 NS2 NS3 NS4 RouterBasedFederatio n YarnFederation balance r balancerAmbari Ambari Ambari
  • 29. Summary Challenge in construction of Hadoop Cluster • Flume、HDFS、Ambari、LDAP • not only follow the community,but also add self work Namenode is the most difficult bottleneck • NS space、RPC performance • Extend NSs、parameter tuning、FairCallQueue、Operation Management Large cluster maintenance • Introduce AI into cluster maintenance Challenge in future: 1600 nodes -> 14000 nodes
  • 30. Thank you! Contact e-mail: panyuxuan@cmss.chinamobile.com
  • 31. Appendix
Code example: compile a SQL statement and get the operator list from the Hive Driver.
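The original slide shows the code as an image, so here is a minimal sketch of the same idea against Hive 1.2-era APIs: start a session, compile a statement with the Hive Driver, and walk the root tasks of the resulting QueryPlan. The example query and the event table are placeholders (the table and any UDFs must already be registered in the session), and the way the operator chain is pulled out of each task's work object differs between Hive versions, so treat this as illustrative rather than our exact implementation.

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.ql.Driver;
    import org.apache.hadoop.hive.ql.QueryPlan;
    import org.apache.hadoop.hive.ql.exec.Task;
    import org.apache.hadoop.hive.ql.session.SessionState;

    public class CompileSqlExample {
      public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        SessionState.start(conf);                 // a Hive session is required for compilation

        Driver driver = new Driver(conf);
        // Compile only; nothing is submitted to the cluster.
        int rc = driver.compile("select c1, c3 from event where c2 = '1'");
        if (rc != 0) {
          throw new IllegalStateException("compile failed, return code " + rc);
        }

        QueryPlan plan = driver.getPlan();
        // The FilterOperator / SelectOperator chain hangs off each task's work object.
        for (Task<?> task : plan.getRootTasks()) {
          System.out.println(task.getClass().getSimpleName() + " -> " + task.getWork());
        }
      }
    }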

Editor's Notes

  1. Good morning, ladies and gentlemen. My name is Yuxuan Pan, a software engineer working on Big Data, Hadoop and data management at China Mobile Communications Corporation (hereinafter referred to as CMCC). Today I’m very delighted to have the opportunity to make this presentation and share the practice of a large Hadoop cluster in CMCC.
  2. CMCC is now the leading mobile communications corporation in China, and we have nearly 900 million active users in total, 3 million base stations, and 200 million Internet of Things (IoT) connections. All of these generate 100 petabytes of data per day. In CMCC, the Big Data Platform is a "One plus N" architecture. N refers to the distributed Hadoop clusters of branch companies, built for each province’s unique requirements. One means that we have built a centralized Hadoop cluster on a big scale, with more than 1,600 nodes, and we keep collecting data from dozens of distributed clusters to analyze for our business. In total we now have about 15 thousand Hadoop nodes in our Big Data Platform.
  3. I’ve divided my presentation into three parts. First, I will introduce the Centralized Business Analysis cluster of CMCC. Second, I’ll share the experience we learned from the construction of a large Hadoop cluster. In the final part, I’ll discuss the future work and the challenges for the CBA cluster.
  4. The CBA system is a BI system based on the enterprise data warehouse for the Group Company, branch companies and subsidiary companies. Here is the architecture of this system. The data sources shown below contain massive data types, including both structured and unstructured data: 2/3/4G and WLAN logs, detailed data traffic, customer information, service records, web crawler data, and so on. We standardize the data model and build a universal collection system. After collection, the data is loaded into the master data warehouse for further analysis. The analysis platform runs on x86 physical machines, and the upper layer of data analysis is called the open application platform. In this layer we have various applications in different fields, such as finance, security, tourism, traffic, advertisement and healthcare. For instance, we can analyze users’ online behavior, Internet opinion, and customer portraits from the collected data. Moreover, we offer a call record inquiry service for all CMCC customers.
  5. You can see the brief history of the Centralized Business Analysis project on this page. The project started in 2015, and the first stage was finished in August 2016, with 600 physical nodes in total and at most 400 nodes for Hadoop. That was not a large cluster and could only handle the 2/3G log data source. After stage 2, finished in October 2017, the total grew from 600 to 2,400 nodes, and the biggest Hadoop cluster reached 1,600 nodes. I’m going to speak mostly about the cluster in stage 2, because we are still working on stage 3, which aims at 21,000 nodes in total and a maximum of 14,000 Hadoop nodes across two datacenters.
  6. Here are some statistics about the system: the total HDFS capacity is over 62 PB with about 380 million HDFS files. Meanwhile, the cluster takes in 2 petabytes of data every day and runs around 20,000 jobs on YARN. We use Flume to collect data from the gateway in each province, and it writes 14 million files into HDFS every day. The graph below shows the components and deployment of the project. There are 90 Flume nodes for data collection, and the large Hadoop cluster used for data analysis has more than 1,600 nodes. The HBase cluster used for customer Internet detail record inquiries has more than 200 nodes, and we run two active-active HBase clusters for disaster recovery based on HBase replication.
  7. The biggest challenges we face are the large amount of data and the massive number of data types. We had to build the whole cluster in several months from nothing! So from the deployment stage and the test stage to the production stage, we took a fair amount of detours and hit several turning points.
  8. This page highlights what we‘ve done in this project. We concentrate on the data collector with a SQL-based Flume interceptor, HDFS tuning, and cluster deployment and maintenance.
  9. Let's talk about Flume first. We need Flume to do some operations before sending data to HDFS, mainly for security reasons. The data from the gateway is compressed with gzip, so Flume has to decompress it in memory first, and then each line has to be filtered by certain fields based on the different data models. The next step is field normalization, which means we apply rules to different fields to standardize the data. The last important action is encryption. Why do we need encryption here? The reason is that the original data contains some privacy fields, and the most critical one is the phone number, so we must encrypt this private data before transferring it to HDFS, and of course all of these operations must be done in memory. We chose Flume to do all of these things, but several problems occurred when we hard-coded all the processing logic in the Flume interceptor. Performance is the biggest challenge: throughput is only 50 megabytes per second per node, so we would need 400 nodes in all to satisfy the data ingestion requirements. The other problem is that the Flume agent becomes unstable with GC overhead because of the in-memory data processing.
  10. Previously, we did different things in different interceptors: one for filtering, one for normalization and one for encryption, chained in series. That is actually the traditional way to use Flume interceptors. But in the standard Flume interface, the input and output of an interceptor are Events, and an Event is just a byte array, so every interceptor needs to deserialize the Event to a String first, process the string, and then serialize it back to an Event. As a result, each record is serialized and deserialized three times along the whole processing chain, which costs a lot of CPU and produces many intermediate objects that cause GC overhead. Now let’s talk about the SQL-based interceptor. In my opinion, SQL is the best way to process structured data, and we can use UDFs, that is user-defined functions, to implement specific logic, so all the processing logic can be expressed in one SQL statement. We use the Hive SQL engine to parse the SQL and get the query execution plan. Hive produces the operator chain shown below: the FilterOperator processes the WHERE clause of the SQL for data filtering, the SelectOperator processes the SELECT clause including expressions and function calls, and the SinkOperator is responsible for converting the record object to a Flume Event. As we know, Hive can compile SQL into MapReduce jobs, but in this case we only use the Hive parser to get the operator list, which can be executed locally, so there is no dependency on HDFS or MapReduce.
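To make the shape of this concrete, here is a stripped-down sketch of such an interceptor. Only the Flume Interceptor and Builder interfaces are real; the OperatorChain type, its process method, the compile helper and the "sql" config key are placeholders standing in for our Hive-derived operator pipeline.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.interceptor.Interceptor;

    public class SqlInterceptor implements Interceptor {

      /** Placeholder for the FilterOperator -> SelectOperator -> SinkOperator chain built from the SQL. */
      interface OperatorChain {
        byte[] process(byte[] record);   // returns null when the record is filtered out
      }

      private final OperatorChain chain;

      SqlInterceptor(OperatorChain chain) {
        this.chain = chain;
      }

      @Override public void initialize() { }

      @Override
      public Event intercept(Event event) {
        byte[] out = chain.process(event.getBody());
        return out == null ? null : EventBuilder.withBody(out);   // null drops the event
      }

      @Override
      public List<Event> intercept(List<Event> events) {
        List<Event> result = new ArrayList<>(events.size());
        for (Event e : events) {
          Event processed = intercept(e);
          if (processed != null) {
            result.add(processed);
          }
        }
        return result;
      }

      @Override public void close() { }

      /** Flume builds interceptors through this factory; the SQL would come from the agent config. */
      public static class Builder implements Interceptor.Builder {
        private OperatorChain chain;

        @Override
        public void configure(Context context) {
          String sql = context.getString("sql");   // "sql" is a hypothetical config key
          this.chain = compile(sql);               // hypothetical: parse with Hive, build operators
        }

        @Override
        public Interceptor build() {
          return new SqlInterceptor(chain);
        }

        private OperatorChain compile(String sql) {
          // Placeholder: in the real implementation Hive's parser produces the operator list here.
          return record -> record;
        }
      }
    }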
  11. Now let’s look at serialization and deserialization in Flume. In the past, we converted the byte array to a String directly and used String.split to get column values, which spent more than 80% of the CPU time on serialization and deserialization. After we introduced the Hive-based interceptor, we naturally started to use LazySimpleSerDe from Hive. LazySimpleSerDe has several advantages: instead of treating all columns as Strings, it outputs typed columns and creates objects lazily, which gives better performance. But LazySimpleSerDe alone is not enough for us. Most of our records have about 200 columns, and not all of them need to be normalized or encrypted, so we also developed a new SerDe based on LazySimpleSerDe that merges the unprocessed fields. Look at the graph below: to process column 7 and column 12, we only need to locate the seventh and the twelfth columns by delimiter order. Columns 1 to 6 and columns 8 to 11 are treated as a single merged field, and serialization and deserialization are applied only to columns 7 and 12. With this we roughly doubled the processing speed.
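The merged-field idea itself can be shown with a toy example in plain Java (this is only an illustration, not our SerDe, which is built on top of Hive's LazySimpleSerDe): locate the delimiters of the columns that need work, rewrite just those, and copy everything in between through as opaque blocks.

    import java.util.function.UnaryOperator;

    public class MergedColumnDemo {

      /**
       * Rewrites only the given 1-based columns (sorted ascending) of a delimited record.
       * Untouched spans are copied through as whole blocks, without per-column objects.
       */
      static String rewriteColumns(String line, char delim, int[] targets,
                                   UnaryOperator<String> transform) {
        StringBuilder out = new StringBuilder(line.length());
        int col = 1;          // current 1-based column number
        int fieldStart = 0;   // start index of the current column
        int copiedUpTo = 0;   // everything before this index is already in 'out'
        int t = 0;            // position in the 'targets' array
        for (int i = 0; i <= line.length(); i++) {
          if (i == line.length() || line.charAt(i) == delim) {
            if (t < targets.length && col == targets[t]) {
              out.append(line, copiedUpTo, fieldStart);                    // flush the merged block
              out.append(transform.apply(line.substring(fieldStart, i)));  // rewrite this column only
              copiedUpTo = i;
              t++;
            }
            fieldStart = i + 1;
            col++;
          }
        }
        out.append(line, copiedUpTo, line.length());                       // trailing merged block
        return out.toString();
      }

      public static void main(String[] args) {
        String record = "a|b|c|d|e|f|13800000000|h|i|j|k|20180601";
        // "Encrypt" column 7 and "normalize" column 12; here we just mark them for demonstration.
        System.out.println(rewriteColumns(record, '|', new int[] {7, 12}, f -> "<" + f + ">"));
      }
    }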
  12. Here are some other tunings for Flume. First, we use ConcurrentLinkedDeque instead of LinkedBlockingDeque in Flume’s MemoryChannel. LinkedBlockingDeque has a lock inside, which limits the throughput of Flume, while ConcurrentLinkedDeque has no lock and is based on compare-and-set (CAS). This adjustment improves throughput by 2 to 3 times in our scenario. The second tuning is the Flume channel capacity, which is the number of Events that can be stored in the memory channel. In general, the throughput of the Flume sink should be bigger than that of the Flume source to prevent a data backlog in the memory channel. Usually the channel is empty, so if a network fluctuation happens, the channel smooths out the input and output. But the channel capacity is not a case of "the bigger, the better": when the channel size grows beyond 10 gigabytes it causes heavy GC overhead. In our experience the channel should stay under 5 gigabytes, so we set the parameter to 2 million in our case. Two other parameters also need to be adjusted: the memory channel keep-alive and the HDFS sink idleTimeout. The keep-alive makes the source wait for a moment when the channel is full instead of reporting errors immediately; we changed the value from 3 seconds to 24 seconds in our case, keeping it below the upstream retransmission timeout. The hdfs idleTimeout makes Flume close a file if there is no input during the whole idle period, but the default of 60 seconds produced many small files in HDFS, so after several rounds of tests we adjusted the value to 600 seconds, which reduced small files by 70%. We also improved JVM performance by adding the option -XX:+UseLargePages to speed up memory addressing for a large JVM heap. Through all of these tunings, the throughput of the Flume data collector improved significantly from 50 megabytes per second to 790 megabytes per second, and the number of physical machines used for data collection dropped from 400 to 90. The Flume tunings saved us a lot of cost.
  13. Next, let‘s talk about the challenges of HDFS. The biggest challenge is the stability of the NameNode. We all know that as the number of HDFS files grows, the memory consumed by the NameNode grows as well. We have 300 million files in one namespace, and NameNode memory usage is over 150 gigabytes. A large heap causes long stop-the-world GC pauses, and so many files also bring more RPC calls from application clients, so NameNode RPC performance becomes the bottleneck of the whole system. One NameNode has to process 30 million RPC calls per hour, and when the cluster is busy, the RPC response time exceeds 10 seconds because calls accumulate in the call queue. The worst thing is that the heavy RPC load can finally cause the active NameNode to shut down. Although we have a high-availability NameNode setup, a code bug causes a deadlock during NameNode failover in high-concurrency situations, so the standby NameNode cannot promote itself to the active state. In the end, we had to restart all NameNodes whenever the active NameNode failed.
  14. The NameNode kept going down under heavy pressure, and each restart takes about 2 hours, because a starting NameNode first loads the filesystem metadata into memory and then processes the block reports from thousands of DataNodes. These actions are time-consuming in a large HDFS cluster, and more than 2 hours of downtime in a production cluster is unacceptable. So HDFS stability was a serious problem we needed to solve. We use this picture of Jackie Chan to express our feeling.
  15. In the past, we used HDFS Federation with 2 NameSpaces: NS1 for Hive/Flume/Apps and NS2 only for Apps. That division could no longer handle the load, so we scaled the NameSpaces from 2 to 5: one for Hive, one for Flume and three for Apps. In this way the pressure on each NameSpace is reduced, and a failed NameSpace does not affect the others. We also found over 100 million YARN log files in HDFS. Previously we stored YARN logs in HDFS for ten days, which occupied a large amount of HDFS resources, so we introduced a log tool in Apache Ambari called Logsearch and now write YARN logs to Apache Solr instead of HDFS. This way we don‘t need to keep YARN logs for a long time and can use Logsearch for further analysis of the logs. By splitting the NameSpaces and removing the YARN logs from HDFS, the memory used by the NameNode is nearly halved, from 160 gigabytes to 90 gigabytes.
  16. The next thing I want to introduce is FairCallQueue. Let‘s revisit the RPC call queue in the old versions of Hadoop first. It is a single linear queue: when the queue is congested with RPC calls, the NameNode is overloaded and fails to respond, RPC latency becomes very high, and that has a significant impact on data collection and may cause Hadoop job failures. Our cluster is a multi-tenant one, and we found that most RPC calls come from batch-job users, while Flume tasks require low HDFS latency to avoid losing data, so the Flume user needs higher priority when accessing HDFS. We use FairCallQueue, which has four queues with different priorities. Flume tasks, which issue few RPC calls, are placed in the high-priority queue; batch jobs, which issue massive numbers of RPC calls, are placed in the low-priority queue. With the help of FairCallQueue, the RPC latency seen by Flume decreases from 10 seconds to half a second. If you are interested in FairCallQueue, please refer to the JIRA link below for more details.
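For reference, FairCallQueue is enabled per NameNode RPC port in core-site.xml. A minimal snippet is shown below, in the same property style used on slide 25; 8020 stands for our NameNode RPC port, and the additional ipc.<port>.* keys that control the number of priority levels and client back-off have names that vary between Hadoop releases, so check the documentation for your version.

    <!-- core-site.xml on the NameNode; 8020 is the NameNode RPC port in this example -->
    <property>
      <name>ipc.8020.callqueue.impl</name>
      <value>org.apache.hadoop.ipc.FairCallQueue</value>
    </property>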
  17. In the past, we used the Concurrent Mark Sweep (CMS) collector for the NameNode. Most of the time it worked well, but as the memory usage of the NameNode grew to almost 200 gigabytes, full GC time increased and so did the CPU cost. Sometimes we hit a concurrent mode failure during full GC: object allocation was faster than garbage collection, so the old generation of the JVM heap filled up before the CMS collection finished. This is called a Concurrent Mode Failure in CMS GC. To avoid this kind of failure and shorten GC pauses, we compared CMS with the newer G1 collector. G1 performs better than CMS at avoiding long GC pauses. Taking the new generation as an example, you can see GC time drop from 15 milliseconds to 2 milliseconds with JDK 8.
  18. There is another problem we met, about block placement. As is well known, each block in HDFS has three replicas for fault tolerance, and ideally all blocks are balanced across racks with the help of the rack-awareness policy. The problem we met is that the number of nodes on each rack may differ, so nodes on the smaller racks receive more blocks, which leads to them running out of space and carrying a heavy load. We analyzed this and found that with the default block placement policy, placement of the first replica is balanced, but placement of the second and third replicas becomes imbalanced when the racks are not balanced.
  19. To solve the imbalance of block placement across racks, we implemented a new BlockPlacementPolicy called WeightedRackBlockPlacementPolicy. This policy calculates the probability of block placement on each rack according to the number of nodes on that rack, then calculates a weight for each rack according to the probability, where a higher probability corresponds to a lower weight. Finally, it adjusts the block placement by rack weight. As a result, block placement is much more balanced across racks in our cluster. If you are interested in this, please refer to the JIRA link here.
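The weighting itself is simple arithmetic. The toy sketch below shows one plausible way to turn per-rack node counts into selection weights and pick a rack by weighted random choice; it only illustrates the idea described on the slide and is not the code attached to HDFS-13279.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Random;

    public class WeightedRackChooser {

      /** Weight each rack by its share of the cluster's nodes, so small racks are picked less often. */
      static Map<String, Double> rackWeights(Map<String, Integer> nodesPerRack) {
        int total = nodesPerRack.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> weights = new LinkedHashMap<>();
        nodesPerRack.forEach((rack, nodes) -> weights.put(rack, nodes / (double) total));
        return weights;
      }

      /** Weighted random selection of a rack. */
      static String chooseRack(Map<String, Double> weights, Random rnd) {
        double r = rnd.nextDouble();
        double acc = 0.0;
        String last = null;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
          acc += e.getValue();
          last = e.getKey();
          if (r < acc) {
            return last;
          }
        }
        return last;   // guard against floating-point rounding
      }

      public static void main(String[] args) {
        // Two normal racks with 15 nodes each and one small rack with 5 nodes, as in the table on slide 21.
        Map<String, Integer> racks = new LinkedHashMap<>();
        racks.put("/rack1", 15);
        racks.put("/rack2", 15);
        racks.put("/rack3", 5);
        Map<String, Double> weights = rackWeights(racks);

        Random rnd = new Random(42);
        int[] hits = new int[3];
        for (int i = 0; i < 100_000; i++) {
          String rack = chooseRack(weights, rnd);
          hits["/rack1".equals(rack) ? 0 : "/rack2".equals(rack) ? 1 : 2]++;
        }
        System.out.printf("rack1=%d rack2=%d rack3=%d%n", hits[0], hits[1], hits[2]);
      }
    }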
  20. We have some other tunings for Hadoop. First, we changed the algorithm version of the MapReduce FileOutputCommitter from the default value 1 to 2. This helps a lot when a big job generates many files to commit, and there are many such jobs in our cluster. In the default algorithm version 1, the commit is single-threaded and waits until all tasks of a job have completed before starting, which means such a job needs a lot of extra rename operations after all tasks finish and suffers from this performance issue. Version 2 changes the behavior of the job commit: the rename is done in the commit of each task, and the job commit only needs to rename a directory to the job output. The new algorithm parallelizes the work and reduces the output commit time for large jobs; performance increased by 40%, as shown in the JIRA link below. For Hive jobs, we changed the execution engine from MR to Tez. Tez can be thought of as a more flexible and powerful successor to the MapReduce framework; it is based on directed acyclic graphs (DAGs) and performs better than MR. We also merge tasks in the ETL process to reduce HDFS access.
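For reference, these are the two property names involved; whether you set them cluster-wide or per job and session is up to you. The snippet follows the property style already used on slide 25.

    <!-- mapred-site.xml, or per job via -D on the command line -->
    <property>
      <name>mapreduce.fileoutputcommitter.algorithm.version</name>
      <value>2</value>
    </property>

    <!-- hive-site.xml, or per session with: set hive.execution.engine=tez; -->
    <property>
      <name>hive.execution.engine</name>
      <value>tez</value>
    </property>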
  21. Besides tuning parameters and making code improvements, we also need some operation management for HDFS. First of all, it is better to set a NameSpace quota on each HDFS NameSpace. If we don’t, NameNode memory usage will grow along with the file count and eventually cause a NameNode shutdown, so we limit the quota on the root path of each NameSpace to keep the NameNode from running out of memory; in our cluster each NameSpace quota is 300 million. Second, we developed models to estimate how many RPC calls an application produces. With these models, which are based on the hdfs-audit log, we can count the RPC calls generated by an application in the development environment before pushing it to production. Another piece of operation management is limiting heavy RPC. What is a heavy RPC? A recursive operation such as delete or getContentSummary on a huge directory. This kind of RPC call locks the FSNamesystem for a long time and increases NameNode RPC latency, so we stay vigilant about such operations and implement the restriction in Apache Ranger.
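A toy version of the audit-based counting looks like the sketch below: it simply tallies cmd= per ugi= from hdfs-audit.log lines. The exact audit line layout can vary with version and security settings, so the parsing is illustrative only.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Stream;

    public class AuditRpcCounter {
      // Typical audit line: "... FSNamesystem.audit: allowed=true ugi=appuser (auth:KERBEROS) ip=/10.0.0.1 cmd=getfileinfo src=/warehouse/t dst=null ..."
      private static final Pattern P = Pattern.compile("ugi=(\\S+).*?cmd=(\\S+)");

      public static void main(String[] args) throws IOException {
        Map<String, Long> callsPerUserAndCmd = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {   // path to hdfs-audit.log
          lines.forEach(line -> {
            Matcher m = P.matcher(line);
            if (m.find()) {
              callsPerUserAndCmd.merge(m.group(1) + " " + m.group(2), 1L, Long::sum);
            }
          });
        }
        // Print the 20 heaviest user/command combinations.
        callsPerUserAndCmd.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(20)
            .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
      }
    }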
  22. Now let’s talk about Ambari. We use Apache Ambari for cluster deployment and management, and we improved it with new features such as support for NameNode Federation deployment and high availability of the Ambari Server. Here is an example of 2 federated NameSpaces deployed with Ambari. The biggest challenge we faced is that the Ambari web UI becomes very slow when the cluster grows to 1,600 nodes, which gives a poor user experience to the cluster operation and maintenance engineers. So we did some performance tuning, as shown below. Besides tuning parameters, we referred to the community and worked on several patches to improve performance. In the end, the web page response time of the Ambari server dropped to a level that satisfies our requirements.
  23. The next tuning I want to talk about is LDAP, which is short for Lightweight Directory Access Protocol. Why do we need LDAP in Hadoop? First of all, our cluster enables Kerberos for authentication. To run MR jobs in a Kerberized cluster, a local Linux user must exist on every NodeManager before the MR tasks run, which is inconvenient in a multi-tenant cluster where users are frequently added or deleted. LDAP helps us simplify user management, and Hadoop supports it by switching the group mapping implementation to LdapGroupsMapping. But with massive numbers of users running jobs in the cluster, LDAP performance suffers a lot: the number of connections to the LDAP server exceeded 7,000, and the latency on the LDAP server node exceeded 8 seconds. After analyzing LDAP, we found that every user and group lookup sets up a connection to the LDAP server. We have two solutions for this. First, we use NSCD, the name service cache daemon, a cache service that caches LDAP users locally, so apart from cache misses there is no need to set up a connection to the LDAP server every time. The other improvement is that we support multiple LDAP servers in Hadoop, distributing connections across different servers with a round-robin policy for load balancing. With the help of NSCD and MultiLdapGroupsMapping, the connections to the server dropped from 7,000 to 700, and it performs well for massive numbers of users in our cluster.
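Our MultiLdapGroupsMapping is not published, but its shape can be sketched as a GroupMappingServiceProvider that round-robins over several stock LdapGroupsMapping instances, one per server URL. The class below is a simplified illustration (no failover or retry handling), assuming hadoop.security.group.mapping.ldap.url carries a comma-separated server list as on slide 25.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.GroupMappingServiceProvider;
    import org.apache.hadoop.security.LdapGroupsMapping;

    public class MultiLdapGroupsMapping implements GroupMappingServiceProvider, Configurable {

      private final List<LdapGroupsMapping> delegates = new ArrayList<>();
      private final AtomicInteger next = new AtomicInteger();
      private Configuration conf;

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
        // hadoop.security.group.mapping.ldap.url holds a comma-separated server list in this setup.
        for (String url : conf.getTrimmedStrings(LdapGroupsMapping.LDAP_URL_KEY)) {
          Configuration c = new Configuration(conf);
          c.set(LdapGroupsMapping.LDAP_URL_KEY, url);   // each delegate talks to a single server
          LdapGroupsMapping mapping = new LdapGroupsMapping();
          mapping.setConf(c);
          delegates.add(mapping);
        }
      }

      @Override
      public Configuration getConf() {
        return conf;
      }

      @Override
      public List<String> getGroups(String user) throws IOException {
        // Round-robin across servers to spread the connection load.
        int i = Math.floorMod(next.getAndIncrement(), delegates.size());
        return delegates.get(i).getGroups(user);
      }

      @Override
      public void cacheGroupsRefresh() throws IOException {
        for (LdapGroupsMapping m : delegates) {
          m.cacheGroupsRefresh();
        }
      }

      @Override
      public void cacheGroupsAdd(List<String> groups) throws IOException {
        for (LdapGroupsMapping m : delegates) {
          m.cacheGroupsAdd(groups);
        }
      }
    }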
  24. After building the large Hadoop cluster, we also need a large amount of human resources for cluster operation and maintenance. We developed a tool called HSmart, an intelligent operation management tool that helps us maintain and optimize the Hadoop cluster. There is a knowledge base inside the tool so we can quickly find a solution when a cluster failure occurs, and the tool can automatically analyze Hadoop jobs and give suggestions for optimizing them. With the Cluster Health Inspection feature, we configure the points we want to check, such as weak passwords and disk/CPU/memory usage. After scanning the cluster with the configured policy, we get a health score for the cluster as well as suggestions for the problems found, which helps us inspect the cluster automatically. The next feature is Cluster Resource Prediction, which predicts future resource consumption from historical resource usage using the LSTM algorithm (LSTM is short for Long Short-Term Memory, a deep learning algorithm). With the prediction we can spot problems and take measures in time. The last example is Job Tuning: HSmart collects and analyzes job logs, counters and metrics, then provides tuning suggestions for jobs in the cluster, such as mapper memory usage and mapper GC. This is based on Dr. Elephant from LinkedIn.
  25. The last part of today‘s topic is the future work for this project. The number of nodes will grow from 1,600 to 14,000 by the end of this year, which is really fast, so we will face the limits of HDFS Federation, of the YARN cluster, of Ambari and so on. It is impossible to handle this scale with one whole big cluster, so we will deploy multiple sub-clusters instead, each with 3,000 to 5,000 nodes. We will then use RouterBasedFederation to maintain the different HDFS sub-clusters and YarnFederation to maintain the different YARN sub-clusters. Data will be placed in different NameSpaces, and we also need to develop a new balancer to move data between NameSpaces.
  26. I’d like to finish with a short summary of today‘s presentation. First of all, the challenges we faced were in the construction of a large Hadoop cluster, mainly around Flume, HDFS, Ambari and LDAP, and we not only followed the community but also added our own work. Meanwhile, the NameNode is the most difficult bottleneck because of RPC performance. To respond to these challenges, we extended the NameSpaces from 2 to 5, tuned parameters, used FairCallQueue to reduce data collection latency, and added some operation management. For large cluster maintenance, we introduced AI into the cluster maintenance tool, which helps us reduce the human resource investment. In the future there may be more challenges, but we are going to take them on, as always, and I think we will have more practices to share.
  27. That's all for my presentation today. Thank you for listening. If you have any questions, I’d be glad to answer them now.
  28. Here is a code example for parsing SQL and getting the operator list from the Hive Driver. It’s very simple.