You run your SQL-centric infrastructure for 10 years and slowly start to notice that you can't go on this way anymore: everything is getting too expensive, and your business requires things that are simply impossible without radical changes.
This is exactly the situation we were in two years ago, so we would like to share our experience:
- Why and how did we come to Big Data?
- Why did we choose the Apache stack and Hadoop?
- What is already done and what is still to do?
- What lessons have we learned?
- Hadoop and relational databases: fight or synergy?
- A reactive Big Data manifesto.
2. MAMMOTH
The only real truth we know about them is their remains. Do you feel your enterprise data infrastructure is going the same way? Come and see in the nearest data center...
3. TWO YEARS AGO
● Our exciting, highly scalable, realtime BIG DATA solution with a broad technology stack in production.
5. OUR INITIAL STATE: TOP VIEW
[Diagram: healthcare providers' data (labs, cares ...) arrives into inbound storage and SQL DBs holding processed inbound data; each of the client applications works with its own SQL DB of application data; outbound information is prepared in SQL DBs and outbound storage and goes mostly to insurance companies.]
6. OUR INITIAL STATE: TOP VIEW
[The same top-view diagram, annotated:]
● Inbound data archives (pretty short cycle).
● One SQL DB per application.
● Huge amount of data, with a serious amount of duplicates.
● How about retention and data issue investigation?
7. OUR INITIAL STATE: TOP VIEW (YELLOW ALARMS)
[The same top-view diagram, with the alarms marked:]
● The outbound flow is slow because of RDBMS processing.
● The inbound data retention cycle is short, so investigating data over long periods is hard.
● Overall, a huge number of SQL databases and high operational complexity.
● One application DB per service client makes inter-application analytics and monitoring extremely hard.
9. BIG DATA: WHAT TO RUN FOR?
MORE STORAGE
Better ways to store huge data volumes: cheaper, safer and easier.
10. BIG DATA: WHAT TO RUN FOR?
MORE POWER
Scalable, effective distributed processing models that open new opportunities such as machine learning.
11. BIG DATA: WHAT TO RUN FOR?
More flexible data structures, closer to the subject area and the real world.
12. RDBMS LIMITS
OUR MAIN ENEMY WAS ...
● Good for anything.
● Not so good for anything in particular.
13. WHY SQL IS EVIL
MASSIVE ANALYSIS is about massive access to your data objects.
[Diagram: your database holds the subject area objects' data; each processing task needs a transformation from the database structure into the object structure, distributed parallel processing over the objects, and effective collection of the distributed processing results to be joined.]
14. RDBMS LIMITS
SUBJECT AREA OBJECT COLLECTION
When you go for massive processing, object collection gets too complex. Think about a data scan over 100,000,000 people (a code sketch of such a collection follows below).

Address:
| Address ID | City     | Street          |
| 1          | New York | 1020, Blue lake |
| 2          | Atlanta  | 203, Bricks av. |
| 3          | Seattle  | 120, Green drv. |

Patient:
| FirstName | LastName | Address | Payer |
| John      | Smith    | 1       | 2     |
| Kate      | Davis    | 2       | 1     |
| Samuel    | Brown    | 3       | 2     |

Payer:
| Payer ID | Name      | State |
| 1        | SaferLife | GA    |
| 2        | YourGuard | CA    |

Collected object: Kate Davis, Atlanta, 203, Bricks av., SaferLife, GA
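To make the cost of this per-object collection concrete, here is a minimal JDBC sketch that assembles one subject-area object per patient via joins. The table and field names come from the tables above, but the exact column names (AddressID, PayerID), the JDBC URL and the credentials are illustrative assumptions, not the deck's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PatientObjectCollector {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection string; any JDBC-compliant RDBMS would do.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://db-host/clinic", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                 // One join per referenced table; with 100,000,000 patients this
                 // fan-out is exactly the expensive part the slide talks about.
                 "SELECT p.FirstName, p.LastName, a.City, a.Street, pay.Name, pay.State " +
                 "FROM Patient p " +
                 "JOIN Address a ON a.AddressID = p.Address " +
                 "JOIN Payer pay ON pay.PayerID = p.Payer");
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                // Re-assemble the flat rows back into one subject-area object.
                System.out.printf("%s %s, %s %s, %s, %s%n",
                    rs.getString("FirstName"), rs.getString("LastName"),
                    rs.getString("City"), rs.getString("Street"),
                    rs.getString("Name"), rs.getString("State"));
            }
        }
    }
}
```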
15. RDBMS LIMITS
TABLE STRUCTURE MODIFICATION
Let it be a Patients table ...

| FirstName | LastName | Address | Payer |
| John      | Smith    | 1       | 2     |
| Kate      | Davis    | 2       | 1     |
| Samuel    | Brown    | 3       | 2     |

And now let us add a new «Birthday» column. Easy as pie!
ALTER TABLE Patient ADD Birthday ...
Target structure: FirstName, LastName, Address, Payer, Birthday.
Now do this with a 2,000,000,000-row MySQL table in production. What do you do if your table grows further? (A schema-free alternative is sketched below.)
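For contrast, here is a minimal sketch of how the same change looks in HBase, where a column appears simply by being written. It assumes the HBase 1.x+ Java client; the table name "patient", the column family "d" and the row key format are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AddBirthday {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table patients = conn.getTable(TableName.valueOf("patient"))) {
            // No ALTER TABLE: a new column qualifier exists the moment it is written,
            // and rows that never receive it simply do not store it.
            Put put = new Put(Bytes.toBytes("patient#kate.davis"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("Birthday"),
                          Bytes.toBytes("1985-04-12"));
            patients.put(put);
        }
    }
}
```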
18. If you need to store a plain text log, a collection of objects for a long time, or current user session attributes, do you really need SQL?
19. OUR INITIAL BIG PLAN WAS
[Diagram: per-application SQL DBs with application data feed a cross-application data storage through ETL; small realtime requests stay on the SQL DBs, while the batch analytic and reporting load goes to the cross-application storage.]
● One-time ETL as the initial step and the backup strategy.
● Full migration to Apache HBase.
● Realtime synchronization as a transition-period solution.
20. Why Hadoop (initially 1.x)?
● An OPEN SOURCE framework for big data: both distributed storage and processing.
● Provides RELIABILITY and fault tolerance by SOFTWARE design (for example, a file system with a default replication factor of 3).
● Horizontal scalability from a single computer up to thousands of nodes.
(A small HDFS write sketch follows below.)
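As a small illustration of "reliability by software design", here is a minimal sketch of writing a file into HDFS with the standard FileSystem API. The namenode address, path and file contents are placeholders; block replication (3 copies by default) happens behind this call according to the cluster configuration.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster address
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/inbound/labs.csv"))) {
            // The client just writes bytes; HDFS splits them into blocks and
            // replicates each block across DataNodes.
            out.write("patient,lab,result\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```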
21. The world's first ever DATA OS: a 10,000-node computer...
You can start in production from just 4 servers, 1 of them for management and coordination. A single server is enough for a development environment.
23. WHY HBASE?
[Diagram: every cluster node runs a Region server (database) over a DataNode (file system) and a TaskTracker (distributed processing) on the same hardware.]
● Good both for OLTP and batch load (a point-read sketch follows below).
● Natural scaling and reliability with Hadoop.
● Data processing locality, natural sharding with regions.
● Coordination with ZooKeeper.
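A minimal sketch of the OLTP side of that claim: a point read by row key with the HBase 1.x+ Java client. The table name, column family and row key are the same illustrative assumptions as in the earlier Put sketch.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PointRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table patients = conn.getTable(TableName.valueOf("patient"))) {
            // The client locates the region holding this key (via ZooKeeper and the
            // meta table) and talks directly to that region server.
            Result row = patients.get(new Get(Bytes.toBytes("patient#kate.davis")));
            byte[] lastName = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("LastName"));
            System.out.println(Bytes.toString(lastName));
        }
    }
}
```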
24. ZooKeeper
Because coordinating distributed systems is a zoo.
● A quorum-based service for fast distributed system coordination.
● Came into our stack with Apache HBase, where it was needed for coordination. Now it is part of the core Hadoop infrastructure.
● We also use it for our own applications (see the registration sketch below).
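One way an application of ours could use it directly is plain service registration with an ephemeral node. This is a minimal sketch with the standard ZooKeeper Java client; the quorum connection string, paths and payload are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RegisterService {
    public static void main(String[] args) throws Exception {
        // Placeholder quorum connection string.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {});
        // An ephemeral node disappears automatically when this session dies,
        // so other components always see the live set of service instances.
        zk.create("/services/etl-worker-",
                  "host1:8080".getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL_SEQUENTIAL);
        Thread.sleep(Long.MAX_VALUE); // keep the session (and the registration) alive
    }
}
```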
25. Finally, we went into initial production with HADOOP 2.0.
HADOOP 2.x CORE:
● FILE SYSTEM
● RESOURCE MANAGEMENT
● DISTRIBUTED PROCESSING
● COORDINATION
26. Real initial approach
[Diagram: every cluster node runs a Region server (database), a DataNode (file system) and a NodeManager (resource management) on the same hardware; distributed processing and coordination sit on top.]
● ZooKeeper instances are distributed among the cluster nodes.
● MapReduce is not a service in Hadoop 2.x, just a YARN application.
27. FIRST REAL RESULT
CLOSE, BUT NOT EXACTLY THE PLAN. DO NOT WEAR PINK GLASSES.
[The same diagram as the initial big plan: per-application SQL DBs feed a cross-application data storage through ETL; small realtime requests stay on SQL, while the batch analytic and reporting load moves to the new storage.]
Daily ETL. It satisfied our daily reporting needs with a major offload of the SQL infrastructure. Direct profit: massive processing is much faster and can handle inter-application data.
28. THE APPROACH WE FIXED MUCH LATER
[Diagram: in one variant the SQL server JOINs Table1..Table4 and emits a single ETL stream; in the other, separate per-table ETL streams are bulk-loaded in parallel into BIG DATA shards.]
29. Hadoop: DON'T DO IT YOURSELF
Because of a number of factors, starting from our distributed team support needs, we selected ...
32. SEARCH / SECONDARY INDICES
"The admission of temporary residents into Canada is a privilege, not a right." http://www.cic.gc.ca/
33. SEARCH / SECONDARY INDICES
NO SEARCH OUT OF THE BOX OTHER THAN A LINEAR SCAN OVER THE TABLE WITH FILTERS (sketched below).
The same turned out to be true for secondary indices in HBase.
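What that out-of-the-box "search" looks like in practice: a full-table scan with a server-side filter, sketched with the HBase 1.x Java client. Table, family and qualifier names are the same illustrative assumptions as before.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class LinearSearch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table patients = conn.getTable(TableName.valueOf("patient"))) {
            // "Search" without an index: every region still reads every row;
            // the filter only reduces what is shipped back to the client.
            Scan scan = new Scan();
            scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("d"), Bytes.toBytes("LastName"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes("Davis")));
            try (ResultScanner scanner = patients.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```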
34. SEARCH / SECONDARY INDICES: HOW WE MADE IT
● HBase handles user data changes.
● The NGData Lily indexer transforms data changes into SOLR index updates.
● Indexes are built on SOLR.
35. HBase: data and search integration
[Diagram: the client just puts (or deletes) data in HBase; the Lily HBase NRT indexer receives the changes through replication and translates them into SOLR index updates; the SOLR cloud provides the real indexing and serves search requests over HTTP; Apache ZooKeeper does all the coordination. Search and indexing live together. A query sketch follows below.]
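On the query side, applications then search SOLR directly over HTTP. A minimal SolrJ sketch, assuming SolrJ 6+; the SOLR URL, collection name and field names are illustrative placeholders, not our actual schema.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class PatientSearch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://solr-host:8983/solr/patients").build()) {
            // The index is maintained by the Lily indexer; the application only queries it.
            SolrQuery query = new SolrQuery("lastName:Davis");
            query.setRows(10);
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                // The document carries the HBase row key, so the full record
                // can be fetched back from HBase with a point Get.
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}
```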
36. GOING REALTIME
● Kafka is a high-throughput distributed messaging system.
● It allows a true realtime system reaction through the publish-subscribe approach.
● New services can subscribe to the data events stream (a producer sketch follows below).
[Diagram: new data feeds both the batch load path and the realtime load path.]
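A minimal sketch of the publishing side with the Kafka Java client. The broker addresses, topic name and event payload are illustrative placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PublishDataEvent {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092"); // placeholder brokers
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every inbound data change is published as an event; any number of
            // services can later subscribe without touching the producer.
            producer.send(new ProducerRecord<>("inbound-data-events",
                    "patient#kate.davis", "{\"change\":\"lab-result-added\"}"));
        }
    }
}
```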
37. GOING REALTIME GENTLY: MAINTENANCE
● Kafka can be separated from the Hadoop infrastructure or have a backup cluster.
● Data publishers can switch to another cluster.
● Subscribers (including Spark on Hadoop) keep 2 places of subscription.
● So now you are free to put a Kafka cluster into maintenance or to back up subscribers (a consumer sketch follows below).
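And the subscribing side, sketched with the Kafka Java consumer (0.9+ consumer API assumed; group id, topic and broker list are placeholders). In this setup, switching a subscriber to the backup cluster amounts to pointing bootstrap.servers at the other cluster.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SubscribeToDataEvents {
    public static void main(String[] args) {
        Properties props = new Properties();
        // To move to the backup cluster, a subscriber only has to change this list.
        props.put("bootstrap.servers", "kafka-backup1:9092,kafka-backup2:9092");
        props.put("group.id", "reporting-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("inbound-data-events"));
            while (true) {
                // Poll the brokers and react to each data event as it arrives.
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.key() + " -> " + record.value());
                }
            }
        }
    }
}
```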
40. REACTIVE MANIFESTO OVER BIG DATA
MOTIVATION
... users expect millisecond response times and 100% uptime. Data is measured in Petabytes. Today's demands are simply not met by yesterday's software architectures.
41. REACTIVE MANIFESTO OVER BIG DATA
... we want systems that are Responsive, Resilient, Elastic and Message Driven. We call these Reactive Systems. http://www.reactivemanifesto.org/
42. REACTIVE MANIFESTO OVER BIG DATA
RESPONSIVE
Responsiveness is the cornerstone of usability and utility, but more than that, responsiveness means that problems may be detected quickly and dealt with effectively.
43. REACTIVE MANIFESTO OVER BIG DATA
RESILIENT
The system stays responsive in the face of failure. ... The client of a component is not burdened with handling its failures.
All services here are located through ZooKeeper, which is quorum based, so resilience is achieved.
44. REACTIVE MANIFESTO OVER BIG DATA
ELASTIC
Reactive Systems can react to changes in the input rate by increasing or decreasing the resources allocated to service these inputs.
● Both HDFS and HBase allow dynamic node addition / removal.
● YARN already handles most of the resource allocation work and keeps making progress.
45. REACTIVE MANIFESTO OVER BIG DATA
MESSAGE DRIVEN
Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling.
● Asynchronous messages from applications.
● Any application can subscribe, not only Hadoop services.
46. LESSONS LEARNED
● No transition in one step. You enter the Big Data world step by step.
● Change your mind first. You should stop thinking in the old style. Do not try to simply map your existing approaches.
● No silver bullet. Don't ruin your existing infrastructure; extend it. NoSQL is not always good, and some cases really should stay on SQL. Use the right tool.
● As you progress, you pay more attention to operations and to reactive system properties.