SlideShare a Scribd company logo
1 of 18
Kafka & Hadoop in Rakuten
Apr 21st, 2021
Yongduck Lee
Cloud Platform. Dept.
Rakuten Group, Inc.
2
What is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform used by
thousands of companies for high-performance data pipelines, streaming
analytics, data integration, and mission-critical applications.
• Unified platform for handling real-time data feeds
• High-throughput to support high volume event streams
• Graceful dealing with large data backlogs
• Low-latency delivery to handle more traditional messaging
use-cases.
• Fault-tolerance in the presence of machine failures
• Not use in-process cache of the data
https://kafka.apache.org
3
What is Elasticsearch?
Elasticsearch is a distributed, RESTful search and analytics engine capable
of addressing a growing number of use cases. As the heart of the Elastic
Stack, it centrally stores your data for lightning-fast search, fine-tuned
relevancy, and powerful analytics that scale with ease.
https://www.elastic.co/elasticsear
ch/
primary replica
Data Nodes
Master Nodes
ML Nodes
Coordinating Nodes
Transform Nodes
Remote Cluster
Nodes
Cluster A
Cluster B
Cluster C
Client
4
What is Hadoop?
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models.
https://hadoop.apache.org
It is designed to scale up from single
servers to thousands of machines, each
offering local computation and storage.
Rather than rely on hardware to deliver
high-availability, the library itself is
designed to detect and handle failures at
the application layer, so delivering a
highly-available service on top of a cluster
of computers, each of which may be
prone to failures.
5
Data Pipeline Concept
Data
Provider
Data Collection Data Wrangling Data Process &
Analysis
Visualization
• Data investigation
• Reporting
• Historical data
• Near time consumers
Realtime data (5-15sec)
• Realtime dashboards
• Traffic anomalies
• Initial research
• Recent data
• Real-Time Collection
• CDC
• Full Dump Data
System /Network
• Application logs
• Access logs
• Transactions logs
• OS Logs/ Network Traffic
User Behaviors
• Purchase
• Page View
• Click
• RQ/LDTime
• Geo Location
• Review
• Product Search
Service
Platform
Event / Product / Profile Info
• Email
• Campaign
• Questionnaires
• Product/Item
• Demography
Product
• Enrichment
• Normalization
• Cleaning
Data
Users
(Actors)
Heterogenous Data
INFRA
RCMD
RANKING
Log
Management
Data
Analysis
….
Near Real-Time Or Batch
- Unstructured data
- Semi-structured data
- Structured data
……
6
Data Pipeline Concept
Sub second / interactive investigation of
data as time series
Complex analytics, data processing, AI
etc over large datasets.
May take from seconds to days to run
depending on workload and processing
framework
7
Kafka in Rakuten
We have been providing Kafka Service from Kafka 0.8 to 2.4 with PLAINTEXT, SASL_PLANTEXT, and
SASL_TLS, Handling around 1.3 Million Message/sec ( 10 GB/sec IN/OUT) around peak time at normal date.
At 2021 Super Sale, we handled more than 2.5 times messages and traffics.
62 Kafka Clusters (7440 Core, 21TB Mem, 4972
Topics)
5th/Mar/2021
22:45 PM
77 Kafka Clusters (7904 Core, 22TB Mem, 5091
Topics)
08/Apr/2021
7.440 K
8
Kafka in Rakuten
NA EU JP
69
4
4
Near Real-Time One-way Mirroring
Cross-DC Active/Active | Active/Hot Standby Kafka
using MirrorMaker2 + KafkaConnect
9
Elasticsearch in Rakuten
We have been providing ES Service from 2.X to 7.X with Basic & Commercial Subscriptions, indexing
hundreds of thousands doc/sec for near-real time log management & monitoring and user behavior & KPI
analysis. At 2021 Super Sale, we handled more than 2 times docs and traffics.
47 ES Clusters (5960 Core, 6.4TB Mem, 71TB
Indices)
10
Hadoop In Rakuten
Vcore Mem Disk
72K 442TB 130 PB
RAM
Nodes
1K
08/Apr/2021
We are providing HDP2 & HDP3 Clusters in JP/EU/US regions. Our use case is very aggressive multi-tenants
who are using as data lake/data analysis/backup & archiving, etc. All CPU-intensive, Memory-intensive,
Disk-intensive use-case are running on clusters at the same time but we are providing high stability and
performance service with rich experiences on Hadoop administration from the 1st generation of Apache
Hadoop.
11
Hadoop in Rakuten
12
Challenge on Kafka
Mirroring Throughput between Region or Zone
- Temporary network failure.
- High Latency
- Location of MirrorMaker Pros & Cons
Instability or cluster broken
- High Load during Rebalancing or Recovery.
- Rack-awareness
- Major/Minor Upgrade or Patching
JDK & Cross-Realm Issue
- Consuming & Producing between Cluster
with different Realm or Service Name
- JDK Specification about Kerberos Authentication
OOM on Brokers or Zookeeper
- Many Consumer or Producer
- Large size of message
- Z-node creation
- Increase # of partitions
- Relocate Mirror Maker on Source Side and increase
Producer Parallelism
Parallelism
- Reduce size of data which will be replicated during
recovery or rebalancing by small servers with proper
size of DISK, CPU, and Mem for java/scalar
Scale-Out than Scale-Up
- Use Streaming Framework (Spark, Flink, and so on)
- Use Middleware which are supporting different
service name and Version.(NiFi)
- Use Global KDC and one Realm for Kafka Clusters
Global KDC and Proper Streaming Solution
- Guide users by proactive consultation as
professional.
- Authorization on ZK nodes
Confirm Use-Case and Dedicated ZK
13
Challenge on Elasticsearch
Mixed Indexing Query Pattern
• Doc/sec (100K doc/sec ~ 1K doc/sec)
• Size per index (1TB/hour~1GB/hour)
• Short- or Long-term query
Unbalanced Shard distribution
• total # of shard per nodes
• balance of high or low loads of shards per nodes
Too many Indices and shards
• long retention
• Many shards on index for load distribution.
Arbitrary Docs indexing on ES
• Arbitrary # of Json Field.
• Invalid data which are not matched with Data Type
• Too many Json Field in doc.
Fast Query in the middle of High load of indexing.
OOM on Data Nodes and Coordinating Nodes.
Hard to scale out only for High load index.
……
14
Challenge on Elasticsearch
Hot
Cold
Data Nodes
Master Nodes
Coordinating Nodes
Client
Coordinating Nodes
Hot
Cold
Hot
Cold HL
Group
ML
Group
LL
Group
Routing
SEH
Template
IDX
Move/Merge/READ-ONLY
15
Challenge on Elasticsearch
Hot
Cold
Data Nodes
Master Nodes
Coordinating Nodes
Client
Coordinating Nodes
Hot
Cold
Hot
Cold HL
Group
ML
Group
LL
Group
SEH
Template
IDX
Move/READ-ONLY
16
Challenge on Hadoop
Aggressive Multi-tenant on Big box of Cluster
- Job Pending or Execution Delay
- NameNode Slowdown
- Zookeeper Timeout
- NameNode Heap
- Localization Issues
- Large # of Files
High Performance & Low Cost
- CPU-Intensive
- Memory Intensive
- Disk-Intensive
Preemption
Federation
Zookeeper Separation
Continuous
balancing
Dedicated Node
with Labeling
Heterogenous
Proper Node Design
Based on Needs
Utilizing SSD & HDD
On-Premise
Training Course
NameNode RPC QoS
17
Future Challenges
Self-Service
• Self-Operation
• Data Profiling &
Governance
• Broker Level Administration
• Active-Active Mirroring
Next Generation
• Kafka vs ???
• Elasticsearch vs ???
Return To Apache Hadoop
• HDP Subscription Policy
• Ambari to Chef or Ansible
• Rakuten Distribution
Hadoop
Containerization
• Service Discovery
• Persistent Storage or Local
Storage
• Physical vs Logical
Separation
Kafka & Hadoop in Rakuten

More Related Content

What's hot

Travel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoTravel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoRakuten Group, Inc.
 
Making Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdfMaking Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdfRakuten Group, Inc.
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafkaconfluent
 
Cloud Migration Paths: Kubernetes, IaaS, or DBaaS
Cloud Migration Paths: Kubernetes, IaaS, or DBaaSCloud Migration Paths: Kubernetes, IaaS, or DBaaS
Cloud Migration Paths: Kubernetes, IaaS, or DBaaSEDB
 
5 Factors When Selecting a High Performance, Low Latency Database
5 Factors When Selecting a High Performance, Low Latency Database5 Factors When Selecting a High Performance, Low Latency Database
5 Factors When Selecting a High Performance, Low Latency DatabaseScyllaDB
 
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)NTT DATA Technology & Innovation
 
Databricksを初めて使う人に向けて.pptx
Databricksを初めて使う人に向けて.pptxDatabricksを初めて使う人に向けて.pptx
Databricksを初めて使う人に向けて.pptxotato
 
楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャー楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャーRakuten Group, Inc.
 
ベアメタルで実現するSpark&Trino on K8sなデータ基盤
ベアメタルで実現するSpark&Trino on K8sなデータ基盤ベアメタルで実現するSpark&Trino on K8sなデータ基盤
ベアメタルで実現するSpark&Trino on K8sなデータ基盤MicroAd, Inc.(Engineer)
 
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...NTT DATA Technology & Innovation
 
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)NTT DATA Technology & Innovation
 
ビジネスリテラシーとしての統計 ビッグデータと統計の活用
ビジネスリテラシーとしての統計 ビッグデータと統計の活用ビジネスリテラシーとしての統計 ビッグデータと統計の活用
ビジネスリテラシーとしての統計 ビッグデータと統計の活用Rakuten Group, Inc.
 
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model ServingKubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model ServingTheofilos Papapanagiotou
 
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)NTT DATA Technology & Innovation
 
ビッグデータ処理データベースの全体像と使い分け
ビッグデータ処理データベースの全体像と使い分けビッグデータ処理データベースの全体像と使い分け
ビッグデータ処理データベースの全体像と使い分けRecruit Technologies
 
SQL Server 使いのための Azure Synapse Analytics - Spark 入門
SQL Server 使いのための Azure Synapse Analytics - Spark 入門SQL Server 使いのための Azure Synapse Analytics - Spark 入門
SQL Server 使いのための Azure Synapse Analytics - Spark 入門Daiyu Hatakeyama
 
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)NTT DATA Technology & Innovation
 
性能問題を起こしにくい 強いDBシステムの作り方(Ver. 2018.9)
性能問題を起こしにくい 強いDBシステムの作り方(Ver. 2018.9)性能問題を起こしにくい 強いDBシステムの作り方(Ver. 2018.9)
性能問題を起こしにくい 強いDBシステムの作り方(Ver. 2018.9)Tomoyuki Oota
 

What's hot (20)

Travel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoTravel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech info
 
Making Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdfMaking Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdf
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafka
 
Cloud Migration Paths: Kubernetes, IaaS, or DBaaS
Cloud Migration Paths: Kubernetes, IaaS, or DBaaSCloud Migration Paths: Kubernetes, IaaS, or DBaaS
Cloud Migration Paths: Kubernetes, IaaS, or DBaaS
 
5 Factors When Selecting a High Performance, Low Latency Database
5 Factors When Selecting a High Performance, Low Latency Database5 Factors When Selecting a High Performance, Low Latency Database
5 Factors When Selecting a High Performance, Low Latency Database
 
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)
大量のデータ処理や分析に使えるOSS Apache Spark入門(Open Source Conference 2021 Online/Kyoto 発表資料)
 
Databricksを初めて使う人に向けて.pptx
Databricksを初めて使う人に向けて.pptxDatabricksを初めて使う人に向けて.pptx
Databricksを初めて使う人に向けて.pptx
 
楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャー楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャー
 
ベアメタルで実現するSpark&Trino on K8sなデータ基盤
ベアメタルで実現するSpark&Trino on K8sなデータ基盤ベアメタルで実現するSpark&Trino on K8sなデータ基盤
ベアメタルで実現するSpark&Trino on K8sなデータ基盤
 
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...
トランザクション処理可能な分散DB 「YugabyteDB」入門(Open Source Conference 2022 Online/Fukuoka 発...
 
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
 
Rakuten's Private Cloud
Rakuten's Private CloudRakuten's Private Cloud
Rakuten's Private Cloud
 
ビジネスリテラシーとしての統計 ビッグデータと統計の活用
ビジネスリテラシーとしての統計 ビッグデータと統計の活用ビジネスリテラシーとしての統計 ビッグデータと統計の活用
ビジネスリテラシーとしての統計 ビッグデータと統計の活用
 
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model ServingKubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
 
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
 
ビッグデータ処理データベースの全体像と使い分け
ビッグデータ処理データベースの全体像と使い分けビッグデータ処理データベースの全体像と使い分け
ビッグデータ処理データベースの全体像と使い分け
 
SQL Server 使いのための Azure Synapse Analytics - Spark 入門
SQL Server 使いのための Azure Synapse Analytics - Spark 入門SQL Server 使いのための Azure Synapse Analytics - Spark 入門
SQL Server 使いのための Azure Synapse Analytics - Spark 入門
 
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)
 
性能問題を起こしにくい 強いDBシステムの作り方(Ver. 2018.9)
性能問題を起こしにくい 強いDBシステムの作り方(Ver. 2018.9)性能問題を起こしにくい 強いDBシステムの作り方(Ver. 2018.9)
性能問題を起こしにくい 強いDBシステムの作り方(Ver. 2018.9)
 
Azure Data Engineering.pptx
Azure Data Engineering.pptxAzure Data Engineering.pptx
Azure Data Engineering.pptx
 

Similar to Kafka & Hadoop in Rakuten

Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networkspbelko82
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeongYousun Jeong
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsYousun Jeong
 
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Anthony Baker
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Community
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...DataStax
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...Alluxio, Inc.
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Alluxio, Inc.
 
Modernizing upstream workflows with aws storage - john mallory
Modernizing upstream workflows with aws storage -  john malloryModernizing upstream workflows with aws storage -  john mallory
Modernizing upstream workflows with aws storage - john malloryAmazon Web Services
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problemsAbhishek Gupta
 

Similar to Kafka & Hadoop in Rakuten (20)

Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CIT
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS
 
Modernizing upstream workflows with aws storage - john mallory
Modernizing upstream workflows with aws storage -  john malloryModernizing upstream workflows with aws storage -  john mallory
Modernizing upstream workflows with aws storage - john mallory
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problems
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 

More from Rakuten Group, Inc.

コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話Rakuten Group, Inc.
 
楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のり楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のりRakuten Group, Inc.
 
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...Rakuten Group, Inc.
 
DataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組みDataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組みRakuten Group, Inc.
 
大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開Rakuten Group, Inc.
 
Supporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdfSupporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdfRakuten Group, Inc.
 
How We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdfHow We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdfRakuten Group, Inc.
 
100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情Rakuten Group, Inc.
 
社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャー社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャーRakuten Group, Inc.
 
楽天サービスとインフラ部隊
楽天サービスとインフラ部隊楽天サービスとインフラ部隊
楽天サービスとインフラ部隊Rakuten Group, Inc.
 
Unclouding Container Challenges
 Unclouding  Container Challenges Unclouding  Container Challenges
Unclouding Container ChallengesRakuten Group, Inc.
 
Functional Programming in Pattern-Match-Oriented Programming Style <Programmi...
Functional Programming in Pattern-Match-Oriented Programming Style <Programmi...Functional Programming in Pattern-Match-Oriented Programming Style <Programmi...
Functional Programming in Pattern-Match-Oriented Programming Style <Programmi...Rakuten Group, Inc.
 
アジャイル開発とメトリクス
アジャイル開発とメトリクスアジャイル開発とメトリクス
アジャイル開発とメトリクスRakuten Group, Inc.
 
Introduction of Rakuten Commerce QA Night#2
Introduction of Rakuten Commerce QA Night#2Introduction of Rakuten Commerce QA Night#2
Introduction of Rakuten Commerce QA Night#2Rakuten Group, Inc.
 
Improve test automation operation
Improve test automation operationImprove test automation operation
Improve test automation operationRakuten Group, Inc.
 

More from Rakuten Group, Inc. (19)

コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
 
楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のり楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のり
 
What Makes Software Green?
What Makes Software Green?What Makes Software Green?
What Makes Software Green?
 
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
 
DataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組みDataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組み
 
大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開
 
Supporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdfSupporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdf
 
How We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdfHow We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdf
 
OWASPTop10_Introduction
OWASPTop10_IntroductionOWASPTop10_Introduction
OWASPTop10_Introduction
 
100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情
 
社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャー社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャー
 
楽天サービスとインフラ部隊
楽天サービスとインフラ部隊楽天サービスとインフラ部隊
楽天サービスとインフラ部隊
 
Rakuten Platform
Rakuten PlatformRakuten Platform
Rakuten Platform
 
Unclouding Container Challenges
 Unclouding  Container Challenges Unclouding  Container Challenges
Unclouding Container Challenges
 
Functional Programming in Pattern-Match-Oriented Programming Style <Programmi...
Functional Programming in Pattern-Match-Oriented Programming Style <Programmi...Functional Programming in Pattern-Match-Oriented Programming Style <Programmi...
Functional Programming in Pattern-Match-Oriented Programming Style <Programmi...
 
アジャイル開発とメトリクス
アジャイル開発とメトリクスアジャイル開発とメトリクス
アジャイル開発とメトリクス
 
AR/SLAM and IoT
AR/SLAM and IoTAR/SLAM and IoT
AR/SLAM and IoT
 
Introduction of Rakuten Commerce QA Night#2
Introduction of Rakuten Commerce QA Night#2Introduction of Rakuten Commerce QA Night#2
Introduction of Rakuten Commerce QA Night#2
 
Improve test automation operation
Improve test automation operationImprove test automation operation
Improve test automation operation
 

Recently uploaded

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 

Recently uploaded (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

Kafka & Hadoop in Rakuten

  • 1. Kafka & Hadoop in Rakuten Apr 21st, 2021 Yongduck Lee Cloud Platform. Dept. Rakuten Group, Inc.
  • 2. 2 What is Apache Kafka? Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. • Unified platform for handling real-time data feeds • High-throughput to support high volume event streams • Graceful dealing with large data backlogs • Low-latency delivery to handle more traditional messaging use-cases. • Fault-tolerance in the presence of machine failures • Not use in-process cache of the data https://kafka.apache.org
  • 3. 3 What is Elasticsearch? Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data for lightning-fast search, fine-tuned relevancy, and powerful analytics that scale with ease. https://www.elastic.co/elasticsear ch/ primary replica Data Nodes Master Nodes ML Nodes Coordinating Nodes Transform Nodes Remote Cluster Nodes Cluster A Cluster B Cluster C Client
  • 4. 4 What is Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. https://hadoop.apache.org It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
  • 5. 5 Data Pipeline Concept Data Provider Data Collection Data Wrangling Data Process & Analysis Visualization • Data investigation • Reporting • Historical data • Near time consumers Realtime data (5-15sec) • Realtime dashboards • Traffic anomalies • Initial research • Recent data • Real-Time Collection • CDC • Full Dump Data System /Network • Application logs • Access logs • Transactions logs • OS Logs/ Network Traffic User Behaviors • Purchase • Page View • Click • RQ/LDTime • Geo Location • Review • Product Search Service Platform Event / Product / Profile Info • Email • Campaign • Questionnaires • Product/Item • Demography Product • Enrichment • Normalization • Cleaning Data Users (Actors) Heterogenous Data INFRA RCMD RANKING Log Management Data Analysis …. Near Real-Time Or Batch - Unstructured data - Semi-structured data - Structured data ……
  • 6. 6 Data Pipeline Concept Sub second / interactive investigation of data as time series Complex analytics, data processing, AI etc over large datasets. May take from seconds to days to run depending on workload and processing framework
  • 7. 7 Kafka in Rakuten We have been providing Kafka Service from Kafka 0.8 to 2.4 with PLAINTEXT, SASL_PLANTEXT, and SASL_TLS, Handling around 1.3 Million Message/sec ( 10 GB/sec IN/OUT) around peak time at normal date. At 2021 Super Sale, we handled more than 2.5 times messages and traffics. 62 Kafka Clusters (7440 Core, 21TB Mem, 4972 Topics) 5th/Mar/2021 22:45 PM 77 Kafka Clusters (7904 Core, 22TB Mem, 5091 Topics) 08/Apr/2021 7.440 K
  • 8. 8 Kafka in Rakuten NA EU JP 69 4 4 Near Real-Time One-way Mirroring Cross-DC Active/Active | Active/Hot Standby Kafka using MirrorMaker2 + KafkaConnect
  • 9. 9 Elasticsearch in Rakuten We have been providing ES Service from 2.X to 7.X with Basic & Commercial Subscriptions, indexing hundreds of thousands doc/sec for near-real time log management & monitoring and user behavior & KPI analysis. At 2021 Super Sale, we handled more than 2 times docs and traffics. 47 ES Clusters (5960 Core, 6.4TB Mem, 71TB Indices)
  • 10. 10 Hadoop In Rakuten Vcore Mem Disk 72K 442TB 130 PB RAM Nodes 1K 08/Apr/2021 We are providing HDP2 & HDP3 Clusters in JP/EU/US regions. Our use case is very aggressive multi-tenants who are using as data lake/data analysis/backup & archiving, etc. All CPU-intensive, Memory-intensive, Disk-intensive use-case are running on clusters at the same time but we are providing high stability and performance service with rich experiences on Hadoop administration from the 1st generation of Apache Hadoop.
  • 12. 12 Challenge on Kafka Mirroring Throughput between Region or Zone - Temporary network failure. - High Latency - Location of MirrorMaker Pros & Cons Instability or cluster broken - High Load during Rebalancing or Recovery. - Rack-awareness - Major/Minor Upgrade or Patching JDK & Cross-Realm Issue - Consuming & Producing between Cluster with different Realm or Service Name - JDK Specification about Kerberos Authentication OOM on Brokers or Zookeeper - Many Consumer or Producer - Large size of message - Z-node creation - Increase # of partitions - Relocate Mirror Maker on Source Side and increase Producer Parallelism Parallelism - Reduce size of data which will be replicated during recovery or rebalancing by small servers with proper size of DISK, CPU, and Mem for java/scalar Scale-Out than Scale-Up - Use Streaming Framework (Spark, Flink, and so on) - Use Middleware which are supporting different service name and Version.(NiFi) - Use Global KDC and one Realm for Kafka Clusters Global KDC and Proper Streaming Solution - Guide users by proactive consultation as professional. - Authorization on ZK nodes Confirm Use-Case and Dedicated ZK
  • 13. 13 Challenge on Elasticsearch Mixed Indexing Query Pattern • Doc/sec (100K doc/sec ~ 1K doc/sec) • Size per index (1TB/hour~1GB/hour) • Short- or Long-term query Unbalanced Shard distribution • total # of shard per nodes • balance of high or low loads of shards per nodes Too many Indices and shards • long retention • Many shards on index for load distribution. Arbitrary Docs indexing on ES • Arbitrary # of Json Field. • Invalid data which are not matched with Data Type • Too many Json Field in doc. Fast Query in the middle of High load of indexing. OOM on Data Nodes and Coordinating Nodes. Hard to scale out only for High load index. ……
  • 14. 14 Challenge on Elasticsearch Hot Cold Data Nodes Master Nodes Coordinating Nodes Client Coordinating Nodes Hot Cold Hot Cold HL Group ML Group LL Group Routing SEH Template IDX Move/Merge/READ-ONLY
  • 15. 15 Challenge on Elasticsearch Hot Cold Data Nodes Master Nodes Coordinating Nodes Client Coordinating Nodes Hot Cold Hot Cold HL Group ML Group LL Group SEH Template IDX Move/READ-ONLY
  • 16. 16 Challenge on Hadoop Aggressive Multi-tenant on Big box of Cluster - Job Pending or Execution Delay - NameNode Slowdown - Zookeeper Timeout - NameNode Heap - Localization Issues - Large # of Files High Performance & Low Cost - CPU-Intensive - Memory Intensive - Disk-Intensive Preemption Federation Zookeeper Separation Continuous balancing Dedicated Node with Labeling Heterogenous Proper Node Design Based on Needs Utilizing SSD & HDD On-Premise Training Course NameNode RPC QoS
  • 17. 17 Future Challenges Self-Service • Self-Operation • Data Profiling & Governance • Broker Level Administration • Active-Active Mirroring Next Generation • Kafka vs ??? • Elasticsearch vs ??? Return To Apache Hadoop • HDP Subscription Policy • Ambari to Chef or Ansible • Rakuten Distribution Hadoop Containerization • Service Discovery • Persistent Storage or Local Storage • Physical vs Logical Separation