2. Who am I
• Senior Software Engineer at SK Telecom, South Korea’s largest
wireless communications provider
• Work on commercial products (~ ’15)
- Worked with Hadoop DW
- Worked with IaaS (OpenStack)
- Worked with PaaS (CloudFoundry)
• Mail to: jerryjung@sk.com
3. Table of Contents
1. Big Data in SK Telecom
2. Benefit of Spark
3. Spark Real Workload
Real-Time Network Analytics
4. Ongoing R&D
4. Big Data in SKT in a Nutshell
✓ Data Size
- Currently collecting 250 TB/day
✓ Big Data Management Infrastructure
- Hadoop cluster (1400+ nodes); migrated from
MPP RDBMS
✓ Use cases
- Real-Time Analytics of Base Stations
- Network Enterprise DW
✓ Ongoing R&D
- SKT Hadoop DW Appliance with H/W acceleration
5. SKT Hadoop Infrastructure
Operating a Hadoop cluster of over 1400 nodes (30 PB+)
• Optimized configuration
• Fault-tolerant and effective resource management system
[Diagram: SKT Hadoop Infra. Labels: Data Collector (data collection & pre-processing, ~250 TB/day); Main Cluster; Analysis; R&D Cluster; Service Logic Repository; Service Cluster (400+ node); services Marketing, NW Analytics, VoC. Other node counts shown: 500+, 400+, 100+. Flow labels: Data Feeding (twice), Develop., Commercialize.]
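The headline figures on these slides (~250 TB/day, 1400+ nodes, 30 PB+) can be related with some back-of-the-envelope arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 10^12 bytes), treats the "+" figures as exact, spreads the daily load evenly over 24 hours, and ignores HDFS replication and compression, none of which the slides specify.

```python
# Figures quoted on the slides (lower bounds where the slide says "+").
DAILY_INGEST_TB = 250      # "~250 TB/day"
CLUSTER_NODES = 1400       # "1400+ nodes"
CLUSTER_CAPACITY_PB = 30   # "30 PB+"

# Assumption: decimal units, not stated on the slides.
TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 24 * 60 * 60


def average_ingest_gb_per_s(daily_tb):
    """Average ingest rate in GB/s if the daily load were spread evenly."""
    return daily_tb * TB / SECONDS_PER_DAY / 10**9


def capacity_per_node_tb(capacity_pb, nodes):
    """Average capacity per node in TB."""
    return capacity_pb * PB / nodes / TB


def days_to_fill(capacity_pb, daily_tb):
    """Days of raw input that fit, ignoring replication and compression."""
    return capacity_pb * PB / (daily_tb * TB)


if __name__ == "__main__":
    print(f"average ingest: {average_ingest_gb_per_s(DAILY_INGEST_TB):.2f} GB/s")
    print(f"capacity per node: {capacity_per_node_tb(CLUSTER_CAPACITY_PB, CLUSTER_NODES):.1f} TB")
    print(f"days to fill (raw): {days_to_fill(CLUSTER_CAPACITY_PB, DAILY_INGEST_TB):.0f}")
```

Under these assumptions the average ingest is roughly 2.9 GB/s, each node holds a little over 21 TB on average, and the quoted capacity corresponds to about 120 days of raw input. With replication, the retention window would be proportionally shorter.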
6. SKT Big Data Reference Architecture
Designed to handle both real-time and batch data processing, as well as high-level
analysis, using Spark as a core technology
[Diagram, by layer]
• Interface Layer: Flume, Kafka
• Batch Layer: HDFS (Data Mart), Oozie (workflow), Hive (ETL), Spark (ETL)
• Real-Time Layer: NoSQL, Elasticsearch, HDFS
• Analytics Layer: Spark SQL, Spark MLlib, Spark GraphX, SparkR
• Data Service Layer: BI, Legacy App
• YARN (Unified Resource Manager)
• Spark Streaming
• H/W Accelerator (SSD, FPGA)
• Cluster Manager: Ambari
【 Components 】 (numbered 1 to 3 in the diagram)
- Batch Processing Layer: Hadoop EDW
- Real-Time Processing Layer: Real-Time Analysis
- Analytics Layer
7. Benefit of Spark
Why SKT uses Spark …
Spark gives us gains in processing speed and lets us implement a variety of big
data applications quickly and easily
▪ Support for Event Stream Processing
▪ Fast Data Queries in Real Time
▪ Improved Programmer Productivity
▪ Fast Batch Processing of Large Data Sets
8. Use cases: Summary
Network Enterprise DW
• End-to-end network quality assurance and
timely fault analysis
APOLLO (Network analytics)
• Real-time analysis of the radio access network
to improve operational efficiency
10. Use case 2: Hadoop based Enterprise DW
Hadoop DW can handle telco big data with scalability & cost efficiency
【 Legacy IT Infrastructure 】
“High-price, High-performance Proprietary IT Infrastructure System”
- Data: Structured Data; Scale-up Structure (Terabyte)
- H/W: High Performance H/W (MPP, Fabric Switch, etc.)
- S/W: Proprietary S/W (RDBMS, etc.)
【 SKT DW Infrastructure 】
“Hadoop S/W and Commodity H/W Based Cost-effective IT Infrastructure System”
- Data: Structured/Un-structured Data; Scale-out Structure (Petabyte, Exabyte)
- H/W: Commodity H/W (x86 Server)
- S/W: Hadoop Architecture; SQL on Hadoop; Transaction/Batch Processing (SQL); Hadoop File System
※ MPP: Massively Parallel Processing, SAN: Storage Area Network, NAS: Network Attached Storage, RDBMS: Relational DB Management System
11. Use case 2: Network Enterprise DW
Data scientists need a unified platform that collects data from all network equipment
for management and analysis
[ Current ] Siloed Data & IT Management
[Diagram: NMS #1 … NMS #N-1, each with its own DBMS, across Access NW, Core NW, and Transport]
[ Goal (4Q, 2015) ] Network Enterprise DW
[Diagram: Legacy NMS #1 … #N and New NMS #1 … #N, Hadoop DW, Legacy DW, BI & Analytics]
Expected advantages
• Unification of 130+ legacy DBMSs, each of which served a separate network
monitoring system, enabling thorough analysis across the entire network
• Quick and accurate identification of the root causes of network failures
12. Use case 2: Network Enterprise DW
Network EDW is a Hadoop-based data warehouse, built on Spark, for various
network statistics and raw data
User Benefits
• End-to-end quality assurance and fault analysis
• Reduces analysis lead time
(days → minutes)
• Saves TCO (1/5 less than legacy DW)
Hadoop DW
• Spark SQL functions and query
optimizer
• Bulk loading and timely processing of
large data
• SSD caching applied for
performance enhancement
[Diagram: EMS sources for Access, Core, Transport, and IP, plus T-Pani; ETL; ODS; Hadoop DW (DW Data, Data Mart, SQL on Hadoop with Spark SQL); H/W Accelerator with SSD Caching; MQE*; Analytics SQL; BI]
* MQE (Meta Query Engine): integrated querying across heterogeneous databases, including Hadoop.
13. Use case 2: Meta Query Engine
https://github.com/bitnine-oss/octopus
Features
1. Subset of ANSI-SQL
2. Queries on multiple databases,
including Spark SQL and Oracle
3. SQL-based authorization
4. User authentication
5. Unified schema view
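To make "queries on multiple databases" and "unified schema view" concrete, here is a toy illustration. It does not use Octopus or its API. Instead it attaches two independent in-memory SQLite databases to one connection, so a single SQL statement can join tables that live in different stores. The schema name `legacy`, the table and column names, and all sample values are invented for the example.

```python
import sqlite3


def build_demo_connection():
    """Create one connection that sees two separate databases."""
    conn = sqlite3.connect(":memory:")
    # A second, independent in-memory database under its own schema name.
    conn.execute("ATTACH DATABASE ':memory:' AS legacy")
    # Hypothetical tables: statistics in one store, reference data in the other.
    conn.execute("CREATE TABLE main.cell_stats (cell_id INTEGER, drop_rate REAL)")
    conn.execute("CREATE TABLE legacy.cell_info (cell_id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO main.cell_stats VALUES (?, ?)",
        [(1, 0.02), (2, 0.10), (3, 0.05)],
    )
    conn.executemany(
        "INSERT INTO legacy.cell_info VALUES (?, ?)",
        [(1, "Seoul"), (2, "Busan"), (3, "Seoul")],
    )
    return conn


def worst_drop_rate_per_region(conn):
    """One query spanning both databases through schema-qualified names."""
    return conn.execute(
        """
        SELECT i.region, MAX(s.drop_rate)
        FROM main.cell_stats AS s
        JOIN legacy.cell_info AS i ON s.cell_id = i.cell_id
        GROUP BY i.region
        ORDER BY i.region
        """
    ).fetchall()


if __name__ == "__main__":
    for region, rate in worst_drop_rate_per_region(build_demo_connection()):
        print(region, rate)
```

A meta query engine plays a similar role at a larger scale: the client writes one query against one schema namespace, and the engine decides which backend serves each table.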
14. Use case 2: Requirements & Challenges
Requirements
• Timely Processing - ETL
• Integrated BI Tools
• Quick Response
[Diagram: NW EDW # 96. Components shown: WEB, BI, MDS (#1), MQE (#1), Octopus, Meta Store, HA Proxy, Thrift Server #1 and #2, Spark SQL, Spark (ETL), YARN, HDFS. Callouts numbered 1 to 4.]
16. Use case 2: BI Integration
Configuration
spark.sql.thriftServer.incrementalCollect true
spark.driver.maxResultSize 10g
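Both properties concern large result sets returned to BI tools through the Thrift server. `spark.sql.thriftServer.incrementalCollect` makes the server fetch results incrementally instead of collecting them all at once, and `spark.driver.maxResultSize` caps the total size of serialized results per action. One way to pass them, sketched here as an assumption about deployment rather than a description of SKT's setup, is as `--conf` arguments to `sbin/start-thriftserver.sh`, which accepts `spark-submit` options. The helper names below are hypothetical.

```python
# The two properties shown on the slide.
THRIFT_SERVER_CONF = {
    "spark.sql.thriftServer.incrementalCollect": "true",
    "spark.driver.maxResultSize": "10g",
}


def to_conf_args(conf):
    """Render properties as spark-submit style --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def to_defaults_file(conf):
    """Render properties as spark-defaults.conf lines (key, space, value)."""
    return "".join(f"{key} {value}\n" for key, value in conf.items())


def thrift_server_command(conf, script="sbin/start-thriftserver.sh"):
    """Build the argument list for launching the Thrift server (not executed here)."""
    return [script] + to_conf_args(conf)


if __name__ == "__main__":
    print(" ".join(thrift_server_command(THRIFT_SERVER_CONF)))
    print(to_defaults_file(THRIFT_SERVER_CONF), end="")
```

The same key-value pairs can instead go into `conf/spark-defaults.conf`, in which case the launch command needs no extra arguments.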
17. Use case 2: Patches
SPARK-7792 - HiveContext registerTempTable not thread safe
SPARK-7936 - Add configuration for initial size and limit of hash for aggregation
SPARK-8153 - Add configuration for disabling partial aggregation in runtime
SPARK-8285 - CombineSum should be calculated as unlimited decimal first
SPARK-8312 - Populate statistics info of hive tables if it's needed to be
SPARK-8333 - Spark failed to delete temp directory created by HiveContext
SPARK-8334 - Binary logical plan should provide more realistic statistics
SPARK-8357 - Memory leakage on unsafe aggregation path with empty input
SPARK-8420 - Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0
SPARK-8552 - Using incorrect database in multiple sessions
SPARK-8707 - RDD#toDebugString fails if any cached RDD has invalid partitions
SPARK-8826 - Fix ClassCastException in GeneratedAggregate
SPARK-9685 - Unspported dataType: char(X) in Hive
SPARK-10151 - Support invocation of hive macro
SPARK-10152 - Support Init script for hive-thriftserver
SPARK-10679 - javax.jdo.JDOFatalUserException in executor
SPARK-10684 - StructType.interpretedOrdering need not to be serialised
Open Issues
SPARK-10216 - Avoid creating empty files during overwrite into Hive table with group by query