1. Kafka & Hadoop in Rakuten
Apr 21st, 2021
Yongduck Lee
Cloud Platform Dept.
Rakuten Group, Inc.
2. What is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform used by
thousands of companies for high-performance data pipelines, streaming
analytics, data integration, and mission-critical applications.
• Unified platform for handling real-time data feeds
• High-throughput to support high volume event streams
• Graceful handling of large data backlogs
• Low-latency delivery to handle more traditional messaging use cases
• Fault tolerance in the presence of machine failures
• Does not use an in-process cache of the data (relies on the OS page cache instead)
https://kafka.apache.org
3. What is Elasticsearch?
Elasticsearch is a distributed, RESTful search and analytics engine capable
of addressing a growing number of use cases. As the heart of the Elastic
Stack, it centrally stores your data for lightning-fast search, fine-tuned
relevancy, and powerful analytics that scale with ease.
https://www.elastic.co/elasticsearch/
Diagram (labels only): Client; Master Nodes; Data Nodes (primary / replica); ML Nodes; Coordinating Nodes; Transform Nodes; Remote Cluster Nodes; Clusters A, B, and C
4. What is Hadoop?
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather
than rely on hardware to deliver high-availability, the library itself is
designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which may
be prone to failures.
https://hadoop.apache.org
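The "simple programming models" mentioned above are typified by MapReduce. The following is a minimal in-memory sketch of the map, shuffle, and reduce steps for a word count; it only illustrates the model and does not use Hadoop itself. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

On a real cluster the map and reduce functions run in parallel on the nodes that hold the data, and the framework handles the shuffle and any task failures.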
5. Data Pipeline Concept
Data Provider → Data Collection → Data Wrangling → Data Process & Analysis → Visualization
• Data investigation
• Reporting
• Historical data
• Near time consumers
Realtime data (5-15sec)
• Realtime dashboards
• Traffic anomalies
• Initial research
• Recent data
• Real-Time Collection
• CDC
• Full Dump Data
System / Network
• Application logs
• Access logs
• Transaction logs
• OS logs / Network traffic
User Behaviors
• Purchase
• Page View
• Click
• RQ/LDTime
• Geo Location
• Review
• Product Search
Service
Platform
Event / Product / Profile Info
• Email
• Campaign
• Questionnaires
• Product/Item
• Demography
Product
• Enrichment
• Normalization
• Cleaning
Data Users (Actors)
• INFRA
• RCMD
• RANKING
• Log Management
• Data Analysis
• …
Near Real-Time or Batch
Heterogeneous Data
- Unstructured data
- Semi-structured data
- Structured data
- …
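The Data Wrangling stage on this slide lists enrichment, normalization, and cleaning. As a rough illustration of what those three steps could look like for a single user-behavior event, here is a minimal sketch. The field names (`user_id`, `ts`, `action`, `ip`) and the IP-prefix lookup table are hypothetical and not taken from the slides.

```python
from datetime import datetime, timezone

# Hypothetical enrichment table (documentation IP ranges), not from the slides.
COUNTRY_BY_PREFIX = {"203.0.113.": "JP", "198.51.100.": "US"}
REQUIRED = ("user_id", "ts")


def clean(event):
    """Cleaning: trim string values and drop events missing required fields."""
    out = {k: v.strip() if isinstance(v, str) else v for k, v in event.items()}
    if any(out.get(key) in (None, "") for key in REQUIRED):
        return None
    return out


def normalize(event):
    """Normalization: lower-case the action, convert epoch seconds to ISO 8601 UTC."""
    out = dict(event)
    out["action"] = str(out.get("action", "unknown")).lower()
    out["ts"] = datetime.fromtimestamp(int(out["ts"]), tz=timezone.utc).isoformat()
    return out


def enrich(event):
    """Enrichment: add a country code looked up from the client IP prefix."""
    out = dict(event)
    ip = out.get("ip", "")
    out["country"] = next(
        (code for prefix, code in COUNTRY_BY_PREFIX.items() if ip.startswith(prefix)),
        "unknown",
    )
    return out


def wrangle(events):
    """Apply cleaning, normalization, and enrichment; skip rejected events."""
    for raw in events:
        cleaned = clean(raw)
        if cleaned is not None:
            yield enrich(normalize(cleaned))
```

In a near-real-time pipeline the same functions would typically run inside a stream processor rather than over an in-memory list.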
6. Data Pipeline Concept
Sub-second / interactive investigation of data as time series.
Complex analytics, data processing, AI, etc. over large datasets. May take from
seconds to days to run, depending on workload and processing framework.
7. Kafka in Rakuten
We have provided a Kafka service from Kafka 0.8 through 2.4 with PLAINTEXT, SASL_PLAINTEXT, and
SASL_SSL listeners, handling around 1.3 million messages/sec (10 GB/sec in/out) at peak time on a normal day.
During the 2021 Super Sale, we handled more than 2.5 times that volume of messages and traffic.
62 Kafka clusters (7,440 cores, 21 TB memory, 4,972 topics) as of 05/Mar/2021 22:45
77 Kafka clusters (7,904 cores, 22 TB memory, 5,091 topics) as of 08/Apr/2021
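Kafka clients choose among these listener types with the `security.protocol` property (Kafka's name for the TLS-protected SASL listener is `SASL_SSL`). The sketch below builds the client properties for each case. It assumes Kerberos (GSSAPI) as the SASL mechanism, which the Kerberos and realm items elsewhere in this deck suggest but this slide does not state. The function name, the default service name, and the broker addresses are hypothetical.

```python
ALLOWED_PROTOCOLS = {"PLAINTEXT", "SASL_PLAINTEXT", "SASL_SSL"}


def client_properties(bootstrap_servers, protocol="SASL_SSL", service_name="kafka"):
    """Return Kafka client properties for the given listener security protocol."""
    if protocol not in ALLOWED_PROTOCOLS:
        raise ValueError(f"unsupported security.protocol: {protocol}")
    props = {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": protocol,
    }
    if protocol.startswith("SASL_"):
        # Assumption: Kerberos authentication; other SASL mechanisms exist.
        props["sasl.mechanism"] = "GSSAPI"
        # Must match the service name in the brokers' Kerberos principal.
        props["sasl.kerberos.service.name"] = service_name
    return props


if __name__ == "__main__":
    # Hypothetical broker address, for illustration only.
    for key, value in client_properties("broker1.example.com:9093").items():
        print(f"{key}={value}")
```

The `sasl.kerberos.service.name` value is the setting that has to agree between client and cluster, which is why clusters with different service names are harder to bridge.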
8. Kafka in Rakuten
NA EU JP
69
4
4
Near real-time one-way mirroring
Cross-DC Active/Active | Active/Hot Standby Kafka using MirrorMaker 2 + Kafka Connect
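For reference, one-way mirroring with MirrorMaker 2 is driven by a small properties file that names the clusters and enables a single replication flow. The sketch below renders such a file. The property keys are standard MirrorMaker 2 settings; the cluster aliases, hostnames, and the helper name are hypothetical, and the actual Rakuten configuration is not shown in the slides.

```python
def mm2_properties(source, target, source_servers, target_servers, topics=".*"):
    """Render a MirrorMaker 2 properties file for a one-way source->target flow."""
    lines = [
        f"clusters = {source}, {target}",
        f"{source}.bootstrap.servers = {source_servers}",
        f"{target}.bootstrap.servers = {target_servers}",
        # Enable replication in one direction only.
        f"{source}->{target}.enabled = true",
        f"{source}->{target}.topics = {topics}",
        f"{target}->{source}.enabled = false",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical aliases and hosts, for illustration only.
    print(mm2_properties("jp", "eu", "jp-kafka.example.com:9092", "eu-kafka.example.com:9092"))
```

An Active/Active setup would enable both directions instead of disabling the reverse flow.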
9. Elasticsearch in Rakuten
We have provided an ES service from 2.x through 7.x with Basic and Commercial subscriptions, indexing
hundreds of thousands of docs/sec for near-real-time log management and monitoring and for user behavior and KPI
analysis. During the 2021 Super Sale, we handled more than twice the usual docs and traffic.
47 ES clusters (5,960 cores, 6.4 TB memory, 71 TB indices)
10. Hadoop in Rakuten
Nodes: 1K | Vcores: 72K | Memory (RAM): 442 TB | Disk: 130 PB (as of 08/Apr/2021)
We provide HDP2 and HDP3 clusters in the JP/EU/US regions. Our use case is aggressive multi-tenancy: tenants
use the clusters as a data lake, for data analysis, for backup and archiving, and so on. CPU-intensive,
memory-intensive, and disk-intensive workloads all run on the clusters at the same time, yet we provide a stable,
high-performance service, drawing on Hadoop administration experience that dates back to the first generation of
Apache Hadoop.
12. Challenge on Kafka
Challenge: Mirroring throughput between regions or zones
- Temporary network failures
- High latency
- Location of MirrorMaker (pros and cons)
Approach: Parallelism
- Increase the number of partitions
- Relocate MirrorMaker to the source side and increase producer parallelism
Challenge: Instability or broken clusters
- High load during rebalancing or recovery
- Rack awareness
- Major/minor upgrades or patching
Approach: Scale out rather than scale up
- Reduce the amount of data replicated during recovery or rebalancing by using
small servers with properly sized disk, CPU, and memory for Java/Scala
Challenge: JDK and cross-realm issues
- Consuming and producing between clusters with different realms or service names
- JDK specification for Kerberos authentication
Approach: Global KDC and a proper streaming solution
- Use a streaming framework (Spark, Flink, and so on)
- Use middleware that supports different service names and versions (NiFi)
- Use a global KDC and one realm for Kafka clusters
Challenge: OOM on brokers or ZooKeeper
- Many consumers or producers
- Large messages
- Znode creation
Approach: Confirm the use case and use a dedicated ZooKeeper
- Guide users through proactive, professional consultation
- Authorization on ZooKeeper nodes
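This slide lists increasing the partition count as one way to raise mirroring parallelism. A commonly used rule of thumb (not taken from the slides) sizes the partition count from the target throughput and the measured per-partition throughput on the producer and consumer sides. The sketch below applies that rule; the function name and all numbers are hypothetical and would need to be measured on a real cluster.

```python
import math


def min_partitions(target_mb_s, producer_mb_s_per_partition, consumer_mb_s_per_partition):
    """Estimate the minimum partition count needed to reach a target throughput.

    Takes the larger of the producer-side and consumer-side requirements,
    since the slower side is the bottleneck.
    """
    rates = (target_mb_s, producer_mb_s_per_partition, consumer_mb_s_per_partition)
    if min(rates) <= 0:
        raise ValueError("all throughput values must be positive")
    by_producer = math.ceil(target_mb_s / producer_mb_s_per_partition)
    by_consumer = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(by_producer, by_consumer)


if __name__ == "__main__":
    # Hypothetical figures: 500 MB/s target, 10 MB/s per partition when
    # producing, 25 MB/s per partition when consuming.
    print(min_partitions(500, 10, 25))
```

More partitions also mean more metadata and more open files per broker, so the estimate is a lower bound to validate rather than a target to exceed freely.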
13. Challenge on Elasticsearch
Mixed indexing and query patterns
• Docs/sec (1K to 100K docs/sec)
• Size per index (1 GB/hour to 1 TB/hour)
• Short- or long-term queries
Unbalanced shard distribution
• Total number of shards per node
• Balance of high- and low-load shards per node
Too many indices and shards
• Long retention
• Many shards per index for load distribution
Arbitrary document indexing on ES
• Arbitrary number of JSON fields
• Invalid data that does not match the field's data type
• Too many JSON fields per document
Fast queries during heavy indexing load.
OOM on data nodes and coordinating nodes.
Hard to scale out for only the high-load indices.
……
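One common way to limit the "arbitrary docs" problems listed on this slide is an index template that caps the number of mapped fields, stops unknown fields from being added to the mapping, and tolerates values that do not match the mapped type. The sketch below builds such a template body in the legacy `_template` shape. The setting names are real Elasticsearch index settings; the helper name, index pattern, and field list are hypothetical, and the slides do not say whether this approach was used.

```python
import json


def guarded_template(pattern, known_fields, field_limit=1000):
    """Build an index template body that constrains dynamic JSON fields."""
    return {
        "index_patterns": [pattern],
        "settings": {
            # Reject mapping updates beyond this many fields (ES default is 1000).
            "index.mapping.total_fields.limit": field_limit,
            # Skip malformed values instead of rejecting the whole document
            # (applies only to field types that support it).
            "index.mapping.ignore_malformed": True,
        },
        "mappings": {
            # Unknown fields stay in _source but are not added to the mapping.
            "dynamic": False,
            "properties": {name: {"type": t} for name, t in known_fields.items()},
        },
    }


if __name__ == "__main__":
    # Hypothetical pattern and fields, for illustration only.
    body = guarded_template("access-log-*", {"status": "integer", "path": "keyword"})
    print(json.dumps(body, indent=2))
```

Setting `dynamic` to `"strict"` instead would reject documents with unknown fields outright, which is stricter but pushes failures back to the producers.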
14. Challenge on Elasticsearch
Diagram (labels only): Client; Coordinating Nodes; Master Nodes; Data Nodes in Hot and Cold tiers, arranged in HL, ML, and LL groups; Routing; SEH; Template; IDX; Move/Merge/READ-ONLY
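The Hot/Cold tiers and the Move/Merge/READ-ONLY label on this slide correspond to a familiar pattern: relocate an aging index to cold nodes with shard allocation filtering, force-merge it, and block writes. The sketch below lists the REST calls such a migration could issue, in the order the slide label gives. The settings and endpoints are real Elasticsearch APIs; the node attribute name `box_type`, the index name, and the helper name are assumptions, and no HTTP request is actually sent.

```python
def cold_migration_steps(index, attr="box_type"):
    """Return (method, path, body) tuples that move an index to the cold tier.

    Assumes data nodes are tagged with a custom attribute, for example
    node.attr.box_type: hot or cold, in elasticsearch.yml.
    """
    return [
        # Move: require shards to live on nodes tagged as cold.
        ("PUT", f"/{index}/_settings",
         {f"index.routing.allocation.require.{attr}": "cold"}),
        # Merge: reduce each shard to a single segment.
        ("POST", f"/{index}/_forcemerge?max_num_segments=1", None),
        # Read-only: block further writes to the index.
        ("PUT", f"/{index}/_settings", {"index.blocks.write": True}),
    ]


if __name__ == "__main__":
    # Hypothetical index name, for illustration only.
    for method, path, body in cold_migration_steps("access-log-2021.04.01"):
        print(method, path, body)
```

Each step would normally wait for the previous one to finish, for example until shard relocation completes, before continuing.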
15. Challenge on Elasticsearch
Diagram (labels only): Client; Coordinating Nodes; Master Nodes; Data Nodes in Hot and Cold tiers, arranged in HL, ML, and LL groups; SEH; Template; IDX; Move/READ-ONLY
16. Challenge on Hadoop
Aggressive multi-tenancy on a single big cluster
- Job pending or execution delays
- NameNode slowdown
- ZooKeeper timeouts
- NameNode heap
- Localization issues
- Large number of files
High performance and low cost
- CPU-intensive
- Memory-intensive
- Disk-intensive
Preemption
Federation
ZooKeeper Separation
Continuous Balancing
Dedicated Nodes with Labeling
Heterogeneous
Proper Node Design Based on Needs
Utilizing SSD & HDD
On-Premise
Training Course
NameNode RPC QoS
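The slide names NameNode RPC QoS without saying how it is implemented. One mechanism Hadoop provides for this is the FairCallQueue, which schedules RPC calls by caller so that one heavy tenant cannot starve the others. The sketch below renders the relevant properties as a Hadoop-style XML fragment. The property names follow the Hadoop FairCallQueue documentation; the use of port 8020, the helper name, and the assumption that this is what the slide refers to are mine.

```python
import xml.etree.ElementTree as ET


def fair_call_queue_config(port=8020, priority_levels=4):
    """Render core-site.xml properties that enable FairCallQueue on one RPC port."""
    props = {
        f"ipc.{port}.callqueue.impl": "org.apache.hadoop.ipc.FairCallQueue",
        f"ipc.{port}.scheduler.impl": "org.apache.hadoop.ipc.DecayRpcScheduler",
        f"ipc.{port}.scheduler.priority.levels": str(priority_levels),
        # Ask clients to back off when the queues are full.
        f"ipc.{port}.backoff.enable": "true",
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(fair_call_queue_config())
```

The port must be the NameNode's client RPC port on the cluster in question; 8020 is only a common default.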
17. Future Challenges
Self-Service
• Self-Operation
• Data Profiling & Governance
• Broker-Level Administration
• Active-Active Mirroring
Next Generation
• Kafka vs ???
• Elasticsearch vs ???
Return to Apache Hadoop
• HDP Subscription Policy
• Ambari to Chef or Ansible
• Rakuten Distribution Hadoop
Containerization
• Service Discovery
• Persistent Storage or Local Storage
• Physical vs Logical Separation