Roman Nikitchenko, 10.05.2015
BIG DATA: FROM MAMMOTH TO ELEPHANT
MAMMOTH
The only real truth we know
about them is their remains. Do
you feel your enterprise data
infrastructure is going this way?
Come and see in the nearest
data center...
2
TWO YEARS AGO
● Our exciting, highly scalable realtime
BIG DATA solution with a broad
technology stack in production.
3
This is our
PRESENT DAY
.. yet is powered by
4
OUR INITIAL STATE: TOP VIEW
[Diagram] Components, from inbound to outbound:
● Healthcare providers data: labs, cares ...
● Inbound storage
● SQL DB: processed inbound data (two instances)
● SQL DB: application data, with CLIENT APPLICATIONS (four shown)
● SQL DB: outbound information
● Outbound storage
● Mostly insurance companies
5
OUR INITIAL STATE: TOP VIEW
[Same diagram, annotated]
● Inbound data archives (pretty short cycle).
● One SQL DB per application.
● Huge amount of data. Serious amount of duplicates.
● How about retention and data issues investigation?
6
OUR INITIAL STATE: TOP VIEW
YELLOW ALARMS
● Outbound flow is slow because of RDBMS processing.
● Inbound data retention cycle is short, so investigating data over a prolonged period is hard.
● Overall a huge number of SQL databases, high operational complexity.
● One application DB per service client makes inter-application analytics and monitoring extremely hard.
7
8
BIG DATA
Better ways to store huge data
volumes: cheaper, safer and easier.
WHAT TO RUN FOR?
MORE STORAGE
9
BIG DATA
WHAT TO RUN FOR?
Scalable effective distributed
processing models to open new
opportunities like machine
learning.
MORE POWER
10
BIG DATA
WHAT TO RUN FOR?
More flexible data
structures closer
to subject area
and real world.
11
RDBMS LIMITS
● Good for anything
● Not so good for
anything in
particular
OUR MAIN ENEMY WAS ...
12
WHY SQL IS EVIL
MASSIVE ANALYSIS
Is about massive access to your data objects
[Diagram]
● Your database
● Transformation from database structure into object structure
● Subject area objects data (four partitions), each with its own Processing step: distributed parallel processing
● Distributed processing results to be joined
● Effective results collection
13
RDBMS LIMITS
SUBJECT AREA OBJECT COLLECTION
When you move to massive processing, collecting objects gets too complex. Think about scanning the data of 100,000,000 people.

Address ID  City      Street
1           New York  1020, Blue lake
2           Atlanta   203, Bricks av.
3           Seattle   120, Green drv.

FirstName  LastName  Address  Payer
John       Smith     1        2
Kate       Davis     2        1
Samuel     Brown     3        2

Payer ID  Name       State
1         SaferLife  GA
2         YourGuard  CA

Collected object:
Kate Davis,
Atlanta 203, Bricks av.
SaferLife, GA
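A minimal sketch of what collecting one subject-area object involves: every patient row needs a lookup in two more tables. The dictionaries below are plain-Python stand-ins for the three tables on this slide, and `collect_patient` is a hypothetical helper name, not something from the original system.

```python
# Toy stand-ins for the three relational tables on the slide.
addresses = {
    1: {"city": "New York", "street": "1020, Blue lake"},
    2: {"city": "Atlanta", "street": "203, Bricks av."},
    3: {"city": "Seattle", "street": "120, Green drv."},
}
payers = {
    1: {"name": "SaferLife", "state": "GA"},
    2: {"name": "YourGuard", "state": "CA"},
}
patients = [
    {"first": "John", "last": "Smith", "address": 1, "payer": 2},
    {"first": "Kate", "last": "Davis", "address": 2, "payer": 1},
    {"first": "Samuel", "last": "Brown", "address": 3, "payer": 2},
]


def collect_patient(row):
    """Assemble one subject-area object by resolving both foreign keys."""
    address = addresses[row["address"]]
    payer = payers[row["payer"]]
    return {
        "name": f'{row["first"]} {row["last"]}',
        "address": f'{address["city"]} {address["street"]}',
        "payer": f'{payer["name"]}, {payer["state"]}',
    }


# Two extra lookups per patient; a scan over N patients costs 2 * N lookups.
objects = [collect_patient(row) for row in patients]
print(objects[1])
```

With three rows this is trivial; the point of the slide is that the same per-row resolution has to happen for every one of the 100,000,000 people in a massive scan.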
14
RDBMS LIMITS
TABLE STRUCTURE MODIFICATION
Let it be the Patient table ...

FirstName  LastName  Address  Payer
John       Smith     1        2
Kate       Davis     2        1
Samuel     Brown     3        2

And now let us add a new "Birthday" column, so the columns become FirstName, LastName, Address, Payer, Birthday. Easy as pie!

ALTER TABLE Patient ADD Birthday ...

Now let's do this on a MySQL table with 2,000,000,000 rows in production. What do you do if your table grows further?
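For contrast, here is a minimal sketch of a schema-less row store in which each row is a sparse map from column name to value. It is a plain-Python illustration of the idea, not the HBase API; the row keys, the `put` helper and the birthday value are all made up for the example.

```python
# Each row is a sparse dict: a column exists only in rows that have a value for it.
table = {
    "smith-john": {"FirstName": "John", "LastName": "Smith"},
    "davis-kate": {"FirstName": "Kate", "LastName": "Davis"},
    "brown-samuel": {"FirstName": "Samuel", "LastName": "Brown"},
}


def put(table, row_key, column, value):
    """Write one cell; no table-wide schema change is involved."""
    table.setdefault(row_key, {})[column] = value


# "Adding the Birthday column" touches only the rows that actually get a value.
put(table, "davis-kate", "Birthday", "1980-01-01")  # illustrative value

rows_with_birthday = [key for key, row in table.items() if "Birthday" in row]
print(rows_with_birthday)
```

The other rows are never rewritten, which is the property that makes this kind of change cheap compared with altering a very large relational table.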
15
ANY RELATIONAL DATA MODEL
SOONER OR LATER
16
RDBMS LIMITS
HOW TO SCALE?
[Diagram] Your SQL database is split into four shards, each with its own Processing step.
● How to partition data? What to do when a new shard is added?
● Need another cluster for processing?
● Distributed processing results have to be joined.
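The question "what to do when a new shard is added?" can be made concrete with a small sketch. It assumes the simplest possible placement rule, key modulo shard count, which is an illustration and not necessarily what any particular sharded SQL setup uses; `shard_for` is a hypothetical name.

```python
def shard_for(key: int, shard_count: int) -> int:
    """Naive placement: shard index is the key modulo the number of shards."""
    return key % shard_count


keys = range(1000)

# Count the keys whose shard changes when a fifth shard is added.
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved} of {len(keys)} keys change shard when going from 4 to 5 shards")
```

Under this rule most of the data has to be relocated for a single added shard, which is why manual sharding of a relational database becomes an operational problem of its own.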
17
If you need to store a plain-text log,
a collection of objects for a long
time, or current user session
attributes, do you really need
SQL?
18
OUR INITIAL BIG PLAN WAS
[Diagram] Three SQL DBs (application data) feed a cross-application data storage through ETL. Load types shown: small realtime requests; batch analytic and reporting load.
● One-time ETL as initial step and backup strategy.
● Full migration to Apache HBase.
● As a transition period solution: realtime synchronization.
19
Why Hadoop (initially 1.x)?
● OPEN SOURCE framework for big data. Both distributed storage and processing.
● Provides RELIABILITY and fault tolerance by SOFTWARE design (for example, a file system with replication factor 3 as the default one).
● Horizontal scalability from a single computer up to thousands of nodes.
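To illustrate what "reliability by software design" buys, here is a toy sketch of replication factor 3: every block is stored on three distinct nodes, so losing one node leaves at least two copies. The round-robin placement and all names are assumptions made for the example; this is not HDFS's actual block placement policy.

```python
REPLICATION_FACTOR = 3


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Round-robin toy placement: each block gets `replication` distinct nodes."""
    placement = {}
    for index, block in enumerate(block_ids):
        placement[block] = [
            nodes[(index + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blk_a", "blk_b", "blk_c", "blk_d", "blk_e"], nodes)

# Simulate losing one node and count the replicas that survive for each block.
failed = "node2"
surviving = {
    block: [node for node in holders if node != failed]
    for block, holders in placement.items()
}
print(min(len(holders) for holders in surviving.values()))
```

No special hardware is involved: the redundancy comes entirely from how the software spreads copies across ordinary machines.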
20
The world's first DATA OS
A 10,000-node computer...
Can start in production from just 4 servers, 1 of
which is for management and coordination.
A single server is enough for a development
environment.
21
HBase motivation
WHY
LATENCY, SPEED AND ALL
HADOOP PROPERTIES
22
WHY YET ?
[Diagram] Stack on each of the four nodes shown:
Layer                   Component
Database                Region server
Distributed processing  TaskTracker
File system             DataNode
Hardware                Node
● Good both for OLTP and batch load.
● Natural scaling and reliability with Hadoop.
● Data processing locality, natural sharding with regions.
● Coordination with ZooKeeper.
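"Natural sharding with regions" can be sketched as follows: the table is kept sorted by row key, and each region covers a contiguous key range served by one region server. The start keys and server names below are invented for the example, and the lookup is a plain-Python illustration rather than HBase client code.

```python
import bisect

# Sorted region start keys; region i serves keys in [start_keys[i], start_keys[i + 1]).
region_start_keys = ["", "g", "n", "t"]
region_servers = ["rs1", "rs2", "rs3", "rs4"]


def locate(row_key: str) -> str:
    """Return the server of the region whose key range contains row_key."""
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_servers[index]


print(locate("davis"), locate("smith"), locate("zed"))
```

Because ranges are contiguous, a region that grows too large can be split in two without touching keys that live in other regions.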
23
ZooKeeper
Because coordinating distributed systems is a Zoo.
● Quorum based service for
fast distributed system
coordination.
● Came into our stack with
Apache HBase, where it was
needed for coordination.
It is now part of the core Hadoop
infrastructure.
● We also use it for our own
applications.
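"Quorum based" means the service keeps working as long as a strict majority of its servers can talk to each other. A short sketch of the arithmetic, with hypothetical function names:

```python
def quorum_size(ensemble: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble // 2 + 1


def tolerated_failures(ensemble: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return ensemble - quorum_size(ensemble)


for servers in (3, 4, 5):
    print(servers, quorum_size(servers), tolerated_failures(servers))
```

An ensemble of 4 tolerates no more failures than an ensemble of 3, which is why odd ensemble sizes are the usual choice.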
24
Finally we went into
initial production with HADOOP 2.0
RESOURCE MANAGEMENT
DISTRIBUTED PROCESSING
FILE SYSTEM
COORDINATION
HADOOP
2.x CORE
25
Real initial approach
[Diagram] Stack on each of the four nodes shown:
Layer                                  Component
Database                               Region server
Distributed processing & coordination
Resource management                    NodeManager
File system                            DataNode
Hardware                               Node
● ZooKeeper instances are distributed across the cluster.
● MapReduce is not a service in Hadoop 2.x, just a YARN application.
26
FIRST REAL RESULT
CLOSE BUT NOT EXACT PLAN
[Diagram] Three SQL DBs (application data) feed the cross-application data storage through ETL. Load types shown: small realtime requests; batch analytic and reporting load.
Daily ETL. It satisfied our daily reporting needs with a major offload of the SQL infrastructure. Direct profit: massive processing is much faster and can handle inter-application data.
DO NOT WEAR ROSE-COLORED GLASSES
27
APPROACH WE HAVE FIXED MUCH LATER
[Diagram, two panels]
● Panel 1: SQL server JOIN over Table1, Table2, Table3, Table4; one ETL stream; bulk load into BIG DATA shards.
● Panel 2: SQL server JOIN over Table1, Table2, Table3, Table4; four ETL streams; bulk load into BIG DATA shards.
28
Hadoop: DON'T DO IT YOURSELF
Because of a number of factors, starting
with our distributed team's support
needs, we have selected
29
HADOOP as INFRASTRUCTURE
[Figure: picture equation, "x MAX", "+", "=", with "BIG DATA" labels; images not captured]
30
WHERE TO GO FROM HERE?
31
The admission of
temporary residents into
Canada is a privilege, not
a right.
http://www.cic.gc.ca/
SEARCH /
SECONDARY
INDICES
32
NO SEARCH OUT OF
THE BOX OTHER THAN
LINEAR SCAN OVER
THE TABLE AND
FILTERS.
SEARCH /
SECONDARY
INDICES
The same turned out to apply
to secondary indices in HBase.
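The difference between "linear scan with filters" and a secondary index can be sketched in a few lines. This is a plain-Python illustration with made-up rows; `scan_with_filter` and `build_index` are hypothetical helpers, not HBase calls.

```python
from collections import defaultdict

rows = {
    "row1": {"city": "New York"},
    "row2": {"city": "Atlanta"},
    "row3": {"city": "Seattle"},
    "row4": {"city": "Atlanta"},
}


def scan_with_filter(rows, column, value):
    """Linear scan: every row is visited, matching or not."""
    return [key for key, row in rows.items() if row.get(column) == value]


def build_index(rows, column):
    """Secondary index: column value -> row keys, maintained outside the table."""
    index = defaultdict(list)
    for key, row in rows.items():
        index[row[column]].append(key)
    return index


city_index = build_index(rows, "city")
print(scan_with_filter(rows, "city", "Atlanta"), city_index["Atlanta"])
```

Both calls return the same row keys, but the scan's cost grows with the table while the index lookup does not. The price is that the index has to be kept in sync with every data change.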
33
SEARCH / SECONDARY INDICES
HOW WE MADE IT
HBase
handles user
data changes
Indexes are
built on SOLR
NGData Lily indexer
transforms data
changes into SOLR
index updates
34
HBase: Data and search integration
Search and indexing together
● Client: data update. User just puts (or deletes) data.
● REPLICATION: the Lily HBase NRT indexer translates data changes into SOLR index updates.
● SOLR cloud provides real indexing. Search requests (HTTP) go in, search responses come back.
● Apache ZooKeeper does all coordination.
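The flow on this slide, where the client only writes data and a separate component turns each change into an index update, can be sketched with a toy store and a toy indexer. Everything here is an in-memory stand-in with hypothetical class names; it does not use HBase, Lily or SOLR APIs.

```python
from collections import defaultdict


class Store:
    """Toy key-value store that reports every change to registered listeners."""

    def __init__(self):
        self.rows = {}
        self.listeners = []

    def put(self, key, text):
        if key in self.rows:
            self.delete(key)  # an overwrite is reported as delete + put
        self.rows[key] = text
        for listener in self.listeners:
            listener("put", key, text)

    def delete(self, key):
        text = self.rows.pop(key)
        for listener in self.listeners:
            listener("delete", key, text)


class Indexer:
    """Turns change events into updates of a word -> row keys index."""

    def __init__(self):
        self.index = defaultdict(set)

    def on_change(self, kind, key, text):
        for word in text.lower().split():
            if kind == "put":
                self.index[word].add(key)
            else:
                self.index[word].discard(key)

    def search(self, word):
        return sorted(self.index[word.lower()])


store, indexer = Store(), Indexer()
store.listeners.append(indexer.on_change)
store.put("p1", "Kate Davis Atlanta")
store.put("p2", "John Smith Atlanta")
store.delete("p2")
print(indexer.search("atlanta"))
```

The writer never calls the indexer directly, so the search side can be added, rebuilt or replaced without changing client code.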
35
● Kafka is a high throughput distributed
messaging system.
● Allows true realtime system reaction
through publish-subscribe approach.
● New services can subscribe to data
events stream.
GOING REALTIME
Batch load
Realtime load
New
data
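The publish-subscribe idea can be sketched with a toy in-memory broker. It delivers events synchronously and keeps nothing on disk, so it only illustrates the subscription model, not Kafka itself; the topic name and handlers are made up for the example.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory topic broker; subscribers are plain callables."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)


broker = Broker()
batch_buffer, alerts = [], []


def collect_urgent(event):
    """Realtime reaction: keep only the urgent events."""
    if event["urgent"]:
        alerts.append(event)


# Two independent services react to the same stream of new data.
broker.subscribe("new-data", batch_buffer.append)
broker.subscribe("new-data", collect_urgent)

broker.publish("new-data", {"id": 1, "urgent": False})
broker.publish("new-data", {"id": 2, "urgent": True})
print(len(batch_buffer), [event["id"] for event in alerts])
```

The publisher does not know who consumes its events, so a new service is added by subscribing rather than by changing the producer.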
36
● Kafka can be separated
from the Hadoop infrastructure
or have a backup cluster.
● Data publishers can switch
to another cluster.
● Subscribers (including
Spark on Hadoop) keep 2
places of subscription.
● So now you are free to put
a Kafka cluster in
maintenance or back up
subscribers.
GOING REALTIME
GENTLY
MAINTENANCE
37
This is our
PRESENT DAY
.. yet is powered by
38
SO WHERE ARE
WE GOING?
39
REACTIVE MANIFESTO OVER BIG DATA
MOTIVATION
… users expect millisecond response times and 100%
uptime. Data is measured in Petabytes. Today's demands
are simply not met by yesterday’s software architectures.
40
REACTIVE MANIFESTO OVER BIG DATA
… we want systems that are
Responsive, Resilient, Elastic
and Message Driven. We call
these Reactive Systems. http://www.reactivemanifesto.org/
41
REACTIVE MANIFESTO OVER BIG DATA
Responsiveness is
the cornerstone of
usability and utility,
but more than that,
responsiveness
means that
problems may be
detected quickly and
dealt with effectively.
RESPONSIVE
42
REACTIVE MANIFESTO OVER BIG DATA
RESILIENT
The system stays
responsive in the
face of failure.
… The client of a
component is not
burdened with
handling its failures.
All services here are located through ZooKeeper,
which is quorum based, so resilience is achieved.
43
REACTIVE MANIFESTO OVER BIG DATA
ELASTIC
Reactive Systems
can react to changes
in the input rate by
increasing or
decreasing the
resources allocated
to service these
inputs.
● Both HDFS and HBase
allow dynamic node
addition / removal.
● YARN already handles
most resource allocation
work and makes progress.
44
REACTIVE MANIFESTO OVER BIG DATA
Reactive Systems rely
on asynchronous
message-passing to
establish a boundary
between components
that ensures loose
coupling.
MESSAGE
DRIVEN
Asynchronous
messages from
applications
Any application can
subscribe, not only
Hadoop services
45
LESSONS LEARNED
● No transition in one step. You
enter the Big Data world step by step.
● Change your mind first. Stop
thinking in the old style. Do not
simply try to map your existing
approaches.
● No silver bullet. Don't ruin your
existing infrastructure. Extend it.
NoSQL is not always good, and
some cases really should stay
on SQL. Use the right tool.
● As you progress, you pay more
attention to operations and
reactive system properties.
46
QUESTION?
47

More Related Content

What's hot

Treasure Data From MySQL to Redshift
Treasure Data  From MySQL to RedshiftTreasure Data  From MySQL to Redshift
Treasure Data From MySQL to RedshiftTreasure Data, Inc.
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Sumeet Singh
 
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingSpark Summit
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid Matt Sarrel
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidCharles Allen
 
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Charles Allen
 
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Spark Summit
 
Data ingestion
Data ingestionData ingestion
Data ingestionnitheeshe2
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Databricks
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium confluent
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015NoSQLmatters
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid Imply
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Spark Summit
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 

What's hot (20)

Treasure Data From MySQL to Redshift
Treasure Data  From MySQL to RedshiftTreasure Data  From MySQL to Redshift
Treasure Data From MySQL to Redshift
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
 
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
 
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
 
Data ingestion
Data ingestionData ingestion
Data ingestion
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Scaling HDFS at Xiaomi
Scaling HDFS at XiaomiScaling HDFS at Xiaomi
Scaling HDFS at Xiaomi
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 

Viewers also liked

Spring Boot. Boot up your development. JEEConf 2015
Spring Boot. Boot up your development. JEEConf 2015Spring Boot. Boot up your development. JEEConf 2015
Spring Boot. Boot up your development. JEEConf 2015Strannik_2013
 
Web application I have always dreamt of
Web application I have always dreamt ofWeb application I have always dreamt of
Web application I have always dreamt ofVictor_Cr
 
JEE Conf 2015: Less JS!
JEE Conf 2015: Less JS!JEE Conf 2015: Less JS!
JEE Conf 2015: Less JS!_Dewy_
 
Generics Past, Present and Future
Generics Past, Present and FutureGenerics Past, Present and Future
Generics Past, Present and FutureRichardWarburton
 
Statis code analysis
Statis code analysisStatis code analysis
Statis code analysischashnikov
 
Spring cloud for microservices architecture
Spring cloud for microservices architectureSpring cloud for microservices architecture
Spring cloud for microservices architectureIgor Khotin
 
Do we need JMS in 21st century?
Do we need JMS in 21st century?Do we need JMS in 21st century?
Do we need JMS in 21st century?Mikalai Alimenkou
 

Viewers also liked (10)

Spring Boot. Boot up your development. JEEConf 2015
Spring Boot. Boot up your development. JEEConf 2015Spring Boot. Boot up your development. JEEConf 2015
Spring Boot. Boot up your development. JEEConf 2015
 
X text
X textX text
X text
 
Web application I have always dreamt of
Web application I have always dreamt ofWeb application I have always dreamt of
Web application I have always dreamt of
 
JEE Conf 2015: Less JS!
JEE Conf 2015: Less JS!JEE Conf 2015: Less JS!
JEE Conf 2015: Less JS!
 
Generics Past, Present and Future
Generics Past, Present and FutureGenerics Past, Present and Future
Generics Past, Present and Future
 
Statis code analysis
Statis code analysisStatis code analysis
Statis code analysis
 
Spring data jee conf
Spring data jee confSpring data jee conf
Spring data jee conf
 
Scala Rock-Painting
Scala Rock-PaintingScala Rock-Painting
Scala Rock-Painting
 
Spring cloud for microservices architecture
Spring cloud for microservices architectureSpring cloud for microservices architecture
Spring cloud for microservices architecture
 
Do we need JMS in 21st century?
Do we need JMS in 21st century?Do we need JMS in 21st century?
Do we need JMS in 21st century?
 

Similar to BIG DATA: From mammoth to elephant

Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!Roman Nikitchenko
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...AboutYouGmbH
 
Big data: current technology scope.
Big data: current technology scope.Big data: current technology scope.
Big data: current technology scope.Roman Nikitchenko
 
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...Amazon Web Services
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Sid Anand
 
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.GeeksLab Odessa
 
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...Athens Big Data
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database RoundtableEric Kavanagh
 
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTDataHadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTDataCloudera, Inc.
 
Database Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big DataDatabase Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big Dataexponential-inc
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineData Con LA
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationMongoDB
 
Google take on heterogeneous data base replication
Google take on heterogeneous data base replication Google take on heterogeneous data base replication
Google take on heterogeneous data base replication Svetlin Stanchev
 
Big_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperBig_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperScott Gray
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase DataWorks Summit
 
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...Mydbops
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 

Similar to BIG DATA: From mammoth to elephant (20)

Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Big data: current technology scope.
Big data: current technology scope.Big data: current technology scope.
Big data: current technology scope.
 
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)
 
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.
Java/Scala Lab: Роман Никитченко - Big Data - Big Pitfalls.
 
Big Data - Big Pitfalls.
Big Data - Big Pitfalls.Big Data - Big Pitfalls.
Big Data - Big Pitfalls.
 
Ibm db2 big sql
Ibm db2 big sqlIbm db2 big sql
Ibm db2 big sql
 
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
 
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTDataHadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
 
Database Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big DataDatabase Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big Data
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital Transformation
 
Google take on heterogeneous data base replication
Google take on heterogeneous data base replication Google take on heterogeneous data base replication
Google take on heterogeneous data base replication
 
Big_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperBig_SQL_3.0_Whitepaper
Big_SQL_3.0_Whitepaper
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
 
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 

More from Roman Nikitchenko

Public presentations for software engineers
Public presentations for software engineersPublic presentations for software engineers
Public presentations for software engineersRoman Nikitchenko
 
Big data & frameworks: no book for you anymore.
Big data & frameworks: no book for you anymore.Big data & frameworks: no book for you anymore.
Big data & frameworks: no book for you anymore.Roman Nikitchenko
 
Elephant grooming: quality with Hadoop
Elephant grooming: quality with HadoopElephant grooming: quality with Hadoop
Elephant grooming: quality with HadoopRoman Nikitchenko
 
HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.Roman Nikitchenko
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
HBase, dances on the elephant back.
HBase, dances on the elephant back.HBase, dances on the elephant back.
HBase, dances on the elephant back.Roman Nikitchenko
 

BIG DATA: From mammoth to elephant

  • 1. Roman Nikitchenko, 10.05.2015 BIG DATA: FROM MAMMOTH TO ELEPHANT
  • 2. MAMMOTH. The only real truth we know about them is their remains. Do you feel your enterprise data infrastructure is going this way? Come and see in the nearest data center...
  • 3. TWO YEARS AGO ● Our exciting high scalability realtime BIG DATA solution with a broad technology stack in production.
  • 4. This is our PRESENT DAY .. yet is powered by
  • 5. OUR INITIAL STATE: TOP VIEW. [Diagram labels: Inbound (healthcare providers data: labs, cares ...); storage; SQL DB (processed inbound data); CLIENT APPLICATIONS; SQL DB (application data); SQL DB (outbound information); storage; Outbound (mostly insurance companies).]
  • 6. OUR INITIAL STATE: TOP VIEW (same diagram, annotated). Inbound data archives (pretty short cycle). One SQL DB per application. Huge amount of data. Serious amount of duplicates. How about retention and data issues investigation?
  • 7. OUR INITIAL STATE: TOP VIEW. YELLOW ALARMS (same diagram, annotated). Outbound flow is slow because of RDBMS processing. Inbound data retention cycle is short, so investigating data over a prolonged period is hard. Overall huge number of SQL databases, high operational complexity. One application DB per service client makes inter-application analytics and monitoring extremely hard.
  • 9. BIG DATA: WHAT TO RUN FOR? MORE STORAGE. Better ways to store huge data volumes: cheaper, safer and easier.
  • 10. BIG DATA: WHAT TO RUN FOR? MORE POWER. Scalable, effective distributed processing models to open new opportunities like machine learning.
  • 11. BIG DATA: WHAT TO RUN FOR? More flexible data structures, closer to the subject area and the real world.
  • 12. OUR MAIN ENEMY WAS ... RDBMS LIMITS ● Good for anything ● Not so good for anything in particular
  • 13. WHY SQL IS EVIL. MASSIVE ANALYSIS is about massive access to your data objects. [Diagram labels: Your database; Transformation from database structure into object structure; Subject area objects data (×4); Distributed parallel processing (Processing ×4); Distributed processing results to be joined; Effective results collection.]
  • 14. RDBMS LIMITS: SUBJECT AREA OBJECT COLLECTION. When you go to massive processing, object collection gets too complex. Think about a 100.000.000 people data scan.
      Rows to collect: FirstName | LastName | Address | Payer
        John | Smith | 1 | 2
        Kate | Davis | 2 | 1
        Samuel | Brown | 3 | 2
      Address: ID | City | Street
        1 | New York | 1020, Blue lake
        2 | Atlanta | 203, Bricks av.
        3 | Seattle | 120, Green drv.
      Payer: ID | Name | State
        1 | SaferLife | GA
        2 | YourGuard | CA
      Collected object: Kate Davis; Atlanta, 203, Bricks av.; SaferLife, GA
  • 15. RDBMS LIMITS: TABLE STRUCTURE MODIFICATION. And now let us add a new «Birthday» column to the table (FirstName, LastName, Address, Payer), so it becomes (FirstName, LastName, Address, Payer, Birthday). Easy as pie! Let it be Patients table ... ALTER TABLE Patient ADD Birthday ... Let's do this with a 2.000.000.000-row MySQL table in production. What to do if your table grows further?
  • 16. ANY RELATIONAL DATA MODEL SOONER OR LATER
  • 17. RDBMS LIMITS: HOW TO SCALE? How to partition data? What to do when a new shard is added? Need another cluster for processing? [Diagram labels: Your SQL database; Shard (×4); Processing (×4); Distributed processing results to be joined.]
  • 18. If you need to store a plain text log, a collection of objects for a long time, or current user session attributes, do you really need SQL?
  • 19. OUR INITIAL BIG PLAN WAS ● One-time ETL as an initial step and backup strategy. ● Full migration to Apache HBase. ● As a transition period solution: realtime synchronization. [Diagram labels: SQL DB, application data (×3); ETL (×3); Cross-application data storage; Small realtime requests; Batch analytic and reporting load.]
  • 20. Why Hadoop (initially 1.x)? ● OPEN SOURCE framework for big data. Both distributed storage and processing. ● Provides RELIABILITY and fault tolerance by SOFTWARE design (for example, a file system with replication factor 3 as the default). ● Horizontal scalability from a single computer up to thousands of nodes.
  • 21. The world's first ever DATA OS. A 10.000-node computer... Can start in production from just 4 servers, 1 of them for management and coordination. A single server is enough for a development environment.
  • 22. HBase motivation. WHY: LATENCY, SPEED AND ALL HADOOP PROPERTIES
  • 23. WHY YET? ● Good both for OLTP and batch load. ● Natural scaling and reliability with Hadoop. ● Data processing locality, natural sharding with regions. ● Coordination with ZooKeeper. [Diagram labels, per node (×4): Hardware (Node); File system (DataNode); Distributed processing (TaskTracker); Database (Region server).]
  • 24. ZooKeeper: because coordinating distributed systems is a Zoo. ● Quorum-based service for fast distributed system coordination. ● Came into our stack with Apache HBase, where it was needed for coordination. Now it is part of the core Hadoop infrastructure. ● We also use it for our own applications.
  • 25. Finally we went into initial production with HADOOP 2.0. HADOOP 2.x CORE: RESOURCE MANAGEMENT, DISTRIBUTED PROCESSING, FILE SYSTEM, COORDINATION.
  • 26. Real initial approach ● ZooKeeper instances are distributed among the cluster. ● MapReduce is not a service in Hadoop 2.x, just a YARN application. [Diagram labels, per node (×4): Hardware (Node); File system (DataNode); Resource management (NodeManager); Distributed processing & coordination; Database (Region server).]
  • 27. FIRST REAL RESULT: CLOSE BUT NOT EXACT PLAN. Daily ETL. Satisfied our daily reporting needs with major SQL infrastructure offload. Direct profit: massive processing is much faster and can handle inter-application data. DO NOT WEAR PINK GLASSES. [Diagram labels: SQL DB, application data (×3); ETL (×3); Cross-application data storage; Small realtime requests; Batch analytic and reporting load.]
  • 28. APPROACH WE HAVE FIXED MUCH LATER. [Diagram labels: SQL server with a JOIN over Table1, Table2, Table3, Table4 (×2); ETL stream (×6); Bulk load (×2); BIG DATA shard (×6).]
  • 29. Hadoop: DON'T DO IT YOURSELF. Because of a number of factors, starting with our distributed team's support needs, we have selected
  • 31. WHERE TO GO FROM HERE?
  • 32. SEARCH / SECONDARY INDICES. The admission of temporary residents into Canada is a privilege, not a right. http://www.cic.gc.ca/
  • 33. SEARCH / SECONDARY INDICES. NO SEARCH OUT OF THE BOX OTHER THAN LINEAR SCAN OVER THE TABLE AND FILTERS. The same turned out to apply to secondary indices in HBase.
  • 34. SEARCH / SECONDARY INDICES: HOW WE MADE IT ● HBase handles user data changes. ● NGData Lily indexer transforms data changes into SOLR index updates. ● Indexes are built on SOLR.
  • 35. HBase: Data and search integration. Search and indexing together. ● Client: user just puts (or deletes) data (data update). ● Lily HBase NRT indexer (fed by REPLICATION): translates data changes into SOLR index updates. ● SOLR cloud: provides real indexing; takes search requests (HTTP) and returns search responses. ● Apache ZooKeeper does all coordination.
  • 36. GOING REALTIME ● Kafka is a high-throughput distributed messaging system. ● Allows true realtime system reaction through a publish-subscribe approach. ● New services can subscribe to the data events stream. [Diagram labels: New data; Batch load; Realtime load.]
  • 37. GOING REALTIME GENTLY: MAINTENANCE ● Kafka can be separated from the Hadoop infrastructure or have a backup cluster. ● Data publishers can switch to another cluster. ● Subscribers (including Spark on Hadoop) keep 2 places of subscription. ● So now you are free to put the Kafka cluster in maintenance or backup subscribers.
  • 38. This is our PRESENT DAY .. yet is powered by
  • 39. SO WHERE ARE WE GOING?
  • 40. REACTIVE MANIFESTO OVER BIG DATA: MOTIVATION. … users expect millisecond response times and 100% uptime. Data is measured in Petabytes. Today's demands are simply not met by yesterday's software architectures.
  • 41. REACTIVE MANIFESTO OVER BIG DATA. … we want systems that are Responsive, Resilient, Elastic and Message Driven. We call these Reactive Systems. http://www.reactivemanifesto.org/
  • 42. REACTIVE MANIFESTO OVER BIG DATA: RESPONSIVE. Responsiveness is the cornerstone of usability and utility, but more than that, responsiveness means that problems may be detected quickly and dealt with effectively.
  • 43. REACTIVE MANIFESTO OVER BIG DATA: RESILIENT. The system stays responsive in the face of failure. … The client of a component is not burdened with handling its failures. All services here are located through ZooKeeper, which is quorum based, so resilience is achieved.
  • 44. REACTIVE MANIFESTO OVER BIG DATA: ELASTIC. Reactive Systems can react to changes in the input rate by increasing or decreasing the resources allocated to service these inputs. Both HDFS and HBase allow dynamic node addition / removal. YARN already handles most resource allocation work and makes progress.
  • 45. REACTIVE MANIFESTO OVER BIG DATA: MESSAGE DRIVEN. Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling. Asynchronous messages from applications. Any application can subscribe, not only Hadoop services.
  • 46. LESSONS LEARNED ● No transition in one step. You enter the Big Data world step by step. ● Change your mind first. You should stop thinking in the old style. Do not try simply to map your existing approaches. ● No silver bullet. Don't ruin your existing infrastructure. Extend it. NoSQL is not always good and some cases really should be kept on SQL. Use the right tool. ● As you progress, you pay more attention to operations and reactive system properties.
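
A minimal sketch of the "subject area object collection" from slide 14, assuming the three tables are small enough to hold in memory as Python dicts. The `PatientObject` class and `collect_objects` function are hypothetical names introduced for illustration and are not from the deck. The point it shows is that every collected object needs one lookup per foreign key, which is the cost that grows painful on a 100.000.000-row scan.

```python
from dataclasses import dataclass

# Toy copies of the tables from slide 14, keyed by primary key.
ADDRESSES = {
    1: ("New York", "1020, Blue lake"),
    2: ("Atlanta", "203, Bricks av."),
    3: ("Seattle", "120, Green drv."),
}
PAYERS = {1: ("SaferLife", "GA"), 2: ("YourGuard", "CA")}
# (FirstName, LastName, Address id, Payer id)
PATIENTS = [
    ("John", "Smith", 1, 2),
    ("Kate", "Davis", 2, 1),
    ("Samuel", "Brown", 3, 2),
]


@dataclass(frozen=True)
class PatientObject:
    """Hypothetical denormalized 'subject area object'."""
    name: str
    city: str
    street: str
    payer: str
    payer_state: str


def collect_objects(patients, addresses, payers):
    """Resolve both foreign keys for every patient row."""
    for first, last, address_id, payer_id in patients:
        city, street = addresses[address_id]
        payer, state = payers[payer_id]
        yield PatientObject(f"{first} {last}", city, street, payer, state)


objects = list(collect_objects(PATIENTS, ADDRESSES, PAYERS))
print(objects[1])
```

Storing the already collected object as one record is what lets a distributed scan process it without the joins.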
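
Slide 17 asks what to do when a new shard is added. The sketch below assumes the simplest possible scheme, a stable hash of the key taken modulo the shard count; the deck does not say which scheme was actually used, and `shard_for` and `moved_fraction` are hypothetical helper names. Under that assumption it measures how many keys would have to move when going from 4 to 5 shards.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard: stable hash modulo the number of shards."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


keys = [f"patient-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, 4, 5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With modulo placement a key stays put only when its hash gives the same remainder for 4 and for 5, which happens for about one key in five, so roughly four fifths of the data is relocated. Range-based partitioning, such as the HBase regions mentioned on slide 23, avoids this kind of wholesale reshuffle.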
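
Slides 24 and 43 describe ZooKeeper as quorum based. As a small worked example of what that implies for sizing, the following sketch computes the majority quorum for an ensemble and the number of server failures it can survive. The function names are illustrative only and not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5):
    print(f"ensemble={n} quorum={quorum_size(n)} "
          f"tolerates={tolerated_failures(n)} failure(s)")
```

An ensemble of 4 tolerates no more failures than an ensemble of 3, which is why odd ensemble sizes are the usual choice.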
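
The data and search integration on slides 34 and 35 (the client only puts or deletes data, and an indexer turns the change stream into index updates) can be illustrated with a toy in-memory model. This is a sketch under stated assumptions: `ToyIndex` is a hypothetical stand-in for the search side and uses none of the HBase, Lily or SOLR APIs, and the change tuples are an invented format.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical inverted index updated from a stream of data changes."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> row keys
        self._terms_by_row = {}            # row key -> terms currently indexed

    def apply(self, change):
        """Apply one ('put' | 'delete', row_key, text) change event."""
        op, row_key, text = change
        # Drop whatever was indexed for this row before.
        for term in self._terms_by_row.pop(row_key, set()):
            self._postings[term].discard(row_key)
        if op == "put":
            terms = set(text.lower().split())
            self._terms_by_row[row_key] = terms
            for term in terms:
                self._postings[term].add(row_key)

    def search(self, term):
        return sorted(self._postings.get(term.lower(), set()))


# The writer only emits changes; it never talks to the index directly.
changes = [
    ("put", "row1", "Kate Davis Atlanta"),
    ("put", "row2", "John Smith New York"),
    ("put", "row1", "Kate Davis Seattle"),  # update replaces the old terms
    ("delete", "row2", None),
]
index = ToyIndex()
for change in changes:
    index.apply(change)

print(index.search("seattle"), index.search("atlanta"), index.search("smith"))
```

Keeping the index behind a change stream is what lets the data store stay unaware of search while the index still follows every update and delete.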
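
The publish-subscribe approach from slides 36 and 45, where new services can subscribe to the data events stream, can be sketched as an append-only log in which each subscriber keeps its own read offset. `Topic` is a hypothetical in-memory class written for illustration; it is not the Kafka client API, and the subscriber names are invented.

```python
class Topic:
    """Hypothetical append-only event log with per-subscriber offsets."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def publish(self, event):
        # Publishers only append; they do not know who is subscribed.
        self._log.append(event)

    def subscribe(self, name, from_beginning=True):
        self._offsets[name] = 0 if from_beginning else len(self._log)

    def poll(self, name):
        """Return events this subscriber has not seen yet and advance its offset."""
        start = self._offsets[name]
        events = self._log[start:]
        self._offsets[name] = len(self._log)
        return events


topic = Topic()
topic.subscribe("batch-loader")
topic.publish({"patient": "Kate Davis", "event": "lab-result"})
topic.subscribe("alerting")  # a new service joins later and replays the log
topic.publish({"patient": "John Smith", "event": "admission"})

loader_events = topic.poll("batch-loader")
alert_events = topic.poll("alerting")
print(len(loader_events), len(alert_events))
```

Because each subscriber advances independently, adding the second service required no change to the publisher or to the first subscriber.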