This document provides an agenda and overview for a quick start lab on JBoss Data Grid (JDG). The agenda includes an introduction to big data and NoSQL terminology, how developers scale out applications, consistent hashing, what a data grid is, a demonstration, JDG/Infinispan features, and a question and answer section. It describes key concepts like the CAP theorem, consistent hashing algorithms, different JDG cache topologies (local, invalidation, replicated, distributed), and how to tune hashing in JDG.
1. ROME 27-28 March 2015 – Ugo Landini
Quick Start Lab
JBoss Data Grid
Ugo Landini
Senior Solution Architect
ugol@redhat.com
March 26th 2015
2. Quick Start Lab - JBoss Data Grid2
• Big Data & NoSQL: super quick introduction to terminology
• What developers do to scale out
• Consistent Hashing
• What’s a Data Grid
• DEMO
• Infinispan/JDG features
• Q&A
Agenda
4. Quick Start Lab - JBoss Data Grid4
new generation of
technologies ... designed to
economically extract value
from very large volumes of a
wide variety of data, by
enabling high velocity
capture, discovery and/or
analysis
IDC, 2012
Big Data
5. Quick Start Lab - JBoss Data Grid5
Not Only SQL
Just an alternative to
RDBMS
NoSQL
6. Quick Start Lab - JBoss Data Grid6
K/V Store
Document Store
Column based DB
Graph DB
XML, Object DB, Multidimensional, Grid/Cloud, …
see map on https://451research.com/images/Marketing/dataplatformsmapoctober2014.pdf
NoSQL
9. Quick Start Lab - JBoss Data Grid9
•Very hard to categorise in a systematic way
•Many nuances
•Many cases of “Evolutionary Convergence”
•i.e. products evolve similar features because they have to adapt to
similar environments
NoSQL
11. Quick Start Lab - JBoss Data Grid11
•Brewer’s Theorem (conjectured in 2000, proven in 2002)
•Three guarantees of a Distributed System
•Consistency
•Availability
•Partition Tolerance
CAP Theorem
12. Quick Start Lab - JBoss Data Grid12
All nodes see the same data at the same time
Consistency
13. Quick Start Lab - JBoss Data Grid13
A guarantee that every request receives a response
about whether it succeeded or failed
Availability
14. Quick Start Lab - JBoss Data Grid14
The system continues to operate despite arbitrary
message loss or failure of part of the system
Partition Tolerance
16. Quick Start Lab - JBoss Data Grid16
Consistency: Transactions
Availability: Redundancy
Partition Tolerance: Scaleout
CAP: Popular Version
17. Quick Start Lab - JBoss Data Grid17
Consistency: Transactions
Availability: Redundancy
Partition Tolerance: Scaleout
NO GO
CAP: Popular Version
18. Quick Start Lab - JBoss Data Grid18
Consistency: Transactions
Availability: Redundancy
Partition Tolerance: Scaleout
RDBMS
CAP: Popular Version
19. Quick Start Lab - JBoss Data Grid19
Consistency: Transactions
Availability: Redundancy
Partition Tolerance: Scaleout
NoSQL
CAP: Popular Version
20. Quick Start Lab - JBoss Data Grid20
Brewer wrote an essay in 2012 to clarify some of the
CAP implications
http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
CAP: Modern Version
21. Quick Start Lab - JBoss Data Grid21
The "two out of three" concept can be misleading or
misapplied and it should be considered as a tautology
Many vendors used the CAP theorem just as an excuse to
sacrifice Consistency
CAP: Modern Version
22. Quick Start Lab - JBoss Data Grid22
Partitions are rare, so there is little reason to forfeit C or
A when the system is not partitioned
The choice between C and A can occur many times
within the same system at very fine granularity
CAP: Modern Version
23. Quick Start Lab - JBoss Data Grid23
Different decisions about C and A:
•for different operations
•for different data
•at different moments
CAP: Modern Version
24. Quick Start Lab - JBoss Data Grid24
Finally, C, A and P are more continuous than binary:
•A is obviously continuous
•Many levels of Consistency (think of isolation levels in a
classic DB)
•Even Partitions have nuances, including disagreement
within the system about whether a partition exists
CAP: Modern Version
28. 28
[Diagram, slides 28 to 31: Virtual Machine 1 (Client 1, Cache 1) and Virtual Machine 2 (Client 2, Cache 2) in front of a shared RDBMS]
1. Client 1 reads A
2. Client 1 writes A to Cache 1
3. Client 2 writes A2 to RDBMS
4. Client 1 reads A from Cache 1
First try at distributed caching
32. 32
Distributed Caching on many nodes
What about dirty reads? (i.e. how to cope with multiple
writes, invalidation, etc.)
First try at distributed caching
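The four steps of the first try can be replayed in a few lines. This is a minimal sketch, not JDG code: `LocalCache` and `write_to_db` are hypothetical names, and plain dictionaries stand in for the RDBMS and the two per-VM caches.

```python
class LocalCache:
    """A per-VM cache that reads through to a shared database (a plain dict here)."""

    def __init__(self, db):
        self.db = db
        self.entries = {}

    def read(self, key):
        # On a miss, load from the database and keep a local copy.
        if key not in self.entries:
            self.entries[key] = self.db[key]
        return self.entries[key]


def write_to_db(db, key, value):
    # Client 2 bypasses Cache 1 entirely: nothing tells Cache 1 about the change.
    db[key] = value


rdbms = {"A": "A"}
cache1 = LocalCache(rdbms)
cache2 = LocalCache(rdbms)

first_read = cache1.read("A")    # steps 1-2: Client 1 reads A and caches it
write_to_db(rdbms, "A", "A2")    # step 3: Client 2 writes A2 to the RDBMS
second_read = cache1.read("A")   # step 4: Client 1 still sees the old A
```

After step 4 `second_read` is still the old value while the RDBMS holds `A2`, which is the dirty read the slide asks about.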
33. 33
[Diagram: Virtual Machine 1 (Client 1, Cache 1) and Virtual Machine 2 (Client 2, Cache 2) in front of a shared RDBMS]
1. Client 2 writes A2 to RDBMS
Second try at distributed caching
36. 36
[Diagram: same layout as slide 33]
1. Client 1 reads A2 from Cache 1
Second try at distributed caching
37. Quick Start Lab - JBoss Data Grid37
New Cache topology
Startup time
State transfers
Incompatible JVM tunings
GCs
Non Java clients
Second try at distributed caching
39. Quick Start Lab - JBoss Data Grid39
Hashing Wheel: a mathematical “wheel” on which you
hash Ks (keys) and Ns (nodes).
The relative position of Ks and Ns determines which
Node is the “owner” of that particular K in a topology
Consistent Hashing
41. Quick Start Lab - JBoss Data Grid41
Ns (nodes) on the “wheel” partition the hash space into
segments
Every segment contains a range of Ks
Consistent Hashing
45. Quick Start Lab - JBoss Data Grid45
Going clockwise from the K:
•the first N is the owner
•next N is the replica
•next next N could be another replica, and so on
Consistent Hashing
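The clockwise walk can be sketched as follows. This is a textbook hashing wheel under stated assumptions, not the JDG implementation: MD5 is an arbitrary choice of hash function, and `HashWheel`, `wheel_position` and the node names are hypothetical.

```python
import bisect
import hashlib


def wheel_position(name):
    """Hash a key or a node name to a position on the wheel (MD5 is an assumption)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashWheel:
    def __init__(self, nodes):
        # Nodes sorted by their position on the wheel.
        self.ring = sorted((wheel_position(n), n) for n in nodes)
        self.positions = [p for p, _ in self.ring]

    def owners(self, key, copies=2):
        """Walk clockwise from the key: the first node is the owner, the next ones are replicas."""
        start = bisect.bisect_right(self.positions, wheel_position(key))
        count = min(copies, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


wheel = HashWheel(["node-a", "node-b", "node-c", "node-d"])
owner, replica = wheel.owners("customer:42")
print(owner, replica)
```

The modulo on the index is what closes the wheel: a key that hashes past the last node wraps around to the first one.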
51. Quick Start Lab - JBoss Data Grid51
The real CH algorithm implemented in JDG is slightly
different
CH is optimized to minimize state transfer (i.e. number
of keys moving when a node dies or a new one joins the
cluster)
Consistent Hashing
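To see why a hashing wheel limits state transfer, the sketch below counts how many keys change owner when a fifth node joins, and compares that with naive modulo hashing. It assumes a plain textbook wheel with MD5 positions, not the optimized algorithm JDG uses, and all names are hypothetical.

```python
import bisect
import hashlib


def position(name):
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


def wheel_owner(key, nodes):
    """Owner on a hashing wheel: the first node clockwise from the key."""
    ring = sorted((position(n), n) for n in nodes)
    index = bisect.bisect_right([p for p, _ in ring], position(key))
    return ring[index % len(ring)][1]


def modulo_owner(key, nodes):
    """Naive alternative: hash modulo the number of nodes."""
    return nodes[position(key) % len(nodes)]


def moved_keys(owner_fn, keys, before, after):
    return [k for k in keys if owner_fn(k, before) != owner_fn(k, after)]


keys = [f"key-{i}" for i in range(1000)]
before = ["node-a", "node-b", "node-c", "node-d"]
after = before + ["node-e"]

moved_on_wheel = moved_keys(wheel_owner, keys, before, after)
moved_with_modulo = moved_keys(modulo_owner, keys, before, after)
print(len(moved_on_wheel), len(moved_with_modulo))
```

On the wheel, every key that moves goes to the new node and no key is shuffled between the existing nodes. With modulo hashing most keys typically change owner.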
53. Quick Start Lab - JBoss Data Grid53
Distributed Memory Storage Engine
Networked Memory
A Distributed Cache “on steroids”
A Transactional NoSQL
What’s a Data Grid?
54. Quick Start Lab - JBoss Data Grid54
•Key/Value storage
•Search Engine (from K/V to Document storage)
•Linear Scalability, Elasticity and Fault tolerance
•Thanks to CH
•Memory based
•Persistence engines are optional
What’s a Data Grid?
55. Quick Start Lab - JBoss Data Grid55
•Different Topologies
•Querying
•Task Execution & Map/Reduce
•Partition Handling
•Data Affinity (to squeeze every bit of
performance)
Data Grid > Distributed Caching
65. Quick Start Lab - JBoss Data Grid65
•Replicated:
•“Small” set of data with high % of reads vs
writes
•Distributed:
•“Big” set of data: linear scaling
•You need M/R & Distexec
How do I choose?
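A rough back-of-the-envelope calculation makes the trade-off concrete. The sketch assumes that a replicated cache keeps the full data set on every node and that a distributed cache keeps a fixed number of copies of each entry (the `copies` parameter, an assumption not stated on the slide), spread evenly across the cluster.

```python
def per_node_size(data_size, nodes, mode, copies=2):
    """Approximate amount of data each node holds, in the same unit as data_size."""
    if mode == "replicated":
        return data_size  # every node holds everything
    if mode == "distributed":
        return data_size * min(copies, nodes) / nodes
    raise ValueError(f"unknown mode: {mode}")


for cluster_size in (2, 5, 10):
    print(
        cluster_size,
        per_node_size(100, cluster_size, "replicated"),
        per_node_size(100, cluster_size, "distributed"),
    )
```

Adding nodes never reduces the per-node footprint of a replicated cache, while in distributed mode it shrinks as the cluster grows.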
66. Quick Start Lab - JBoss Data Grid66
•You can have different Cache configurations
in the same CacheManager
•Mix and match Replicated and Distributed as
needed
JDG Cache Topologies (Cluster modes)
67. Quick Start Lab - JBoss Data Grid67
•Default hashing (Distributed mode):
MurmurHash3.
•It’s a simple, standard hash function:
•you can replace it, e.g. if your
key already identifies a partitioning criterion
Tuning your hashing
68. Quick Start Lab - JBoss Data Grid68
•Can be “fine tuned” in 4 different ways:
•Server Hinting
•Virtual Servers
•Grouping
•Key Affinity
Tuning your hashing
69. Quick Start Lab - JBoss Data Grid69
•A triple (site, rack, server)
•You increase availability by preventing replicas
from ending up on the same (site, rack, server) as the
master
Server Hinting
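The idea can be illustrated with a small replica picker. This is only a sketch of the principle, assuming a "prefer a different site, then a different rack, then a different server" order. `Node` and `pick_replica` are hypothetical names, not JDG APIs.

```python
from collections import namedtuple

# Each node carries the (site, rack, server) triple from the slide.
Node = namedtuple("Node", "name site rack server")


def pick_replica(master, candidates):
    """Return the first candidate that is as far from the master as possible.

    Candidates are assumed to be listed in wheel order.
    """
    others = [c for c in candidates if c.name != master.name]
    checks = [
        lambda c: c.site != master.site,      # best: another site
        lambda c: c.rack != master.rack,      # then: another rack
        lambda c: c.server != master.server,  # then: another physical server
        lambda c: True,                       # last resort: any other node
    ]
    for differs in checks:
        for candidate in others:
            if differs(candidate):
                return candidate
    return None


master = Node("n1", "rome", "rack1", "host1")
candidates = [
    Node("n2", "rome", "rack1", "host1"),
    Node("n3", "rome", "rack2", "host2"),
    Node("n4", "milan", "rack1", "host9"),
]
print(pick_replica(master, candidates).name)
```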
70. Quick Start Lab - JBoss Data Grid70
•Number of “segments” into which the
cluster is partitioned
•Improve the node distribution on the
hashing wheel to have a better distribution
of keys
•Default: 60
Virtual Servers
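The effect of placing each node on the wheel several times can be explored with the sketch below. It uses a generic virtual-node scheme with MD5 positions as an assumption, which is not how JDG computes its segments, and the helper names are hypothetical.

```python
import bisect
import hashlib
from collections import Counter


def position(name):
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


def build_wheel(nodes, virtual_servers):
    """Place every node on the wheel `virtual_servers` times."""
    return sorted(
        (position(f"{node}#{i}"), node)
        for node in nodes
        for i in range(virtual_servers)
    )


def key_counts(wheel, keys):
    """Count how many keys each physical node owns."""
    positions = [p for p, _ in wheel]
    counts = Counter()
    for key in keys:
        index = bisect.bisect_right(positions, position(key)) % len(wheel)
        counts[wheel[index][1]] += 1
    return counts


nodes = ["node-a", "node-b", "node-c"]
keys = [f"key-{i}" for i in range(3000)]
for virtual_servers in (1, 60):
    print(virtual_servers, dict(key_counts(build_wheel(nodes, virtual_servers), keys)))
```

With a single position per node the split depends entirely on where three hashes happen to land. With many positions per node the shares tend to even out.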
71. Quick Start Lab - JBoss Data Grid71
•Data colocation
•A cache node contains K but also other
relevant data related to K
•Example: a customer and their bank
transactions
•You just have to define a group, JDG will
colocate all data of the same group in the
same node
Grouping
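Colocation follows from hashing the group instead of the whole key. The sketch below assumes a hypothetical key convention (`<customer id>/<entry>`) and uses simple modulo placement for brevity. It does not use the JDG Grouping API.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]


def node_for(token):
    """Place a token on a node (modulo hashing, chosen only to keep the sketch short)."""
    digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
    return NODES[digest % len(NODES)]


def group_of(key):
    """Hypothetical convention: keys look like '<customer id>/<entry>'."""
    return key.split("/", 1)[0]


def grouped_node_for(key):
    # Hash the group, not the key: every entry of a group lands on the same node.
    return node_for(group_of(key))


keys = ["customer-7/profile", "customer-7/tx-1", "customer-7/tx-2"]
placement = {key: grouped_node_for(key) for key in keys}
print(placement)
```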
72. Quick Start Lab - JBoss Data Grid72
•Like Grouping, but from another perspective:
•You just ask for a key that is guaranteed to
hash to a given node
•Grouping/Affinity are your best friends if you
want to reach JDG Nirvana!
Key Affinity
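One simple way to obtain such a key is to generate candidates until one hashes to the wanted node. This is a sketch under that assumption, with hypothetical names and modulo placement. It is not the JDG key affinity service.

```python
import hashlib
import itertools

NODES = ["node-a", "node-b", "node-c"]


def node_for(key):
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return NODES[digest % len(NODES)]


def key_for_node(target, prefix="session"):
    """Try candidate keys until one hashes to the target node."""
    if target not in NODES:
        raise ValueError(f"unknown node: {target}")
    for counter in itertools.count():
        candidate = f"{prefix}-{counter}"
        if node_for(candidate) == target:
            return candidate


local_key = key_for_node("node-b")
print(local_key, node_for(local_key))
```

An application instance running next to `node-b` can store its data under `local_key` and later find it on the local node.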
73. Quick Start Lab - JBoss Data Grid73
•All data needed by a node of your application are local,
at the distance of a single Java method call
JDG Nirvana
75. Quick Start Lab - JBoss Data Grid75
•Small self-contained projects that can be used to
simply explain JDG to customers
•https://github.com/redhat-italy/jdg-quickstarts
JDG Quickstarts
77. Quick Start Lab - JBoss Data Grid77
•If JDG detects a split brain, partitions enter
degraded mode
•A degraded partition can read/write ONLY
fully owned keys
•A partition fully owns a key if it contains the
master and replica nodes for that key
•You’ll get an AvailabilityException for other
keys
Partition Handling
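The "fully owned keys" rule can be modelled in a few lines. This is a toy model with hypothetical names (`Partition`, `AvailabilityError`). It mimics the behaviour described on the slide and is not JDG code.

```python
class AvailabilityError(Exception):
    """Stand-in for the AvailabilityException named on the slide."""


class Partition:
    def __init__(self, members, owners_by_key, degraded=False):
        self.members = set(members)          # nodes reachable in this partition
        self.owners_by_key = owners_by_key   # key -> [master, replica, ...]
        self.degraded = degraded
        self.data = {}

    def fully_owns(self, key):
        # Fully owned: the master and all replicas are inside this partition.
        return set(self.owners_by_key[key]) <= self.members

    def _check(self, key):
        if self.degraded and not self.fully_owns(key):
            raise AvailabilityError(key)

    def put(self, key, value):
        self._check(key)
        self.data[key] = value

    def get(self, key):
        self._check(key)
        return self.data.get(key)


owners = {"k1": ["a", "b"], "k2": ["b", "c"]}
partition = Partition(["a", "b"], owners, degraded=True)
partition.put("k1", "ok")            # master and replica are both reachable
try:
    partition.put("k2", "rejected")  # node "c" is on the other side of the split
except AvailabilityError as error:
    print("unavailable:", error)
```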
78. Quick Start Lab - JBoss Data Grid78
•Cache Store
•Not only in memory!
•Write through & write behind (ACK sync or
async)
•Pluggable “drivers”
•File System, JPA, LevelDB (supported)
•MongoDB, Cassandra, BerkeleyDB, etc.
(community)
Persistence
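The difference between write through and write behind comes down to when the store is updated relative to the acknowledgement. The sketch below uses a dictionary as the store and an explicit `flush()` in place of a background writer. `StoreBackedCache` is a hypothetical name, not a JDG class.

```python
class StoreBackedCache:
    def __init__(self, store, write_behind=False):
        self.store = store
        self.write_behind = write_behind
        self.memory = {}
        self.pending = []

    def put(self, key, value):
        self.memory[key] = value
        if self.write_behind:
            # Write behind: acknowledge now, update the store later (async ACK).
            self.pending.append((key, value))
        else:
            # Write through: the store is updated before put() returns (sync ACK).
            self.store[key] = value

    def flush(self):
        """Apply queued writes; a real write-behind store would do this in the background."""
        while self.pending:
            key, value = self.pending.pop(0)
            self.store[key] = value


disk_sync, disk_async = {}, {}
through = StoreBackedCache(disk_sync)
behind = StoreBackedCache(disk_async, write_behind=True)
through.put("a", 1)
behind.put("a", 1)
print(disk_sync, disk_async)  # the async store is still empty at this point
behind.flush()
```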
79. Quick Start Lab - JBoss Data Grid79
•To avoid Out Of Memory
•Entries can be “passivated” to disk (you’ll need a
CacheStore)
Eviction
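Eviction with passivation can be sketched as a bounded in-memory map that spills to a store. The least-recently-used policy is an assumption (the slide does not name one), and `EvictingCache` is a hypothetical class that expects `max_entries` to be at least 1.

```python
from collections import OrderedDict


class EvictingCache:
    def __init__(self, max_entries, store):
        self.max_entries = max_entries
        self.store = store            # plays the role of the CacheStore
        self.memory = OrderedDict()   # oldest access first

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.max_entries:
            old_key, old_value = self.memory.popitem(last=False)
            self.store[old_key] = old_value   # passivation: evicted, not lost

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.store:
            value = self.store.pop(key)       # activation: bring it back in memory
            self.put(key, value)
            return value
        return None


store = {}
cache = EvictingCache(2, store)
for name, value in (("a", 1), ("b", 2), ("c", 3)):
    cache.put(name, value)
print(list(cache.memory), store)
```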
81. Quick Start Lab - JBoss Data Grid81
•You assign a lifespan or a max idle time to a
key
•The key will then be automatically removed
after that time
•You don’t need to write “garbage cleanup”
code
Expiry
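Lifespan and max idle time differ only in which timestamp they are measured from. The sketch below checks both lazily on access, which is a simplification, and takes an injectable clock so the behaviour can be tested. `ExpiringCache` is a hypothetical name.

```python
import time


class ExpiringCache:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}  # key -> [value, created, last_used, lifespan, max_idle]

    def put(self, key, value, lifespan=None, max_idle=None):
        now = self.clock()
        self.entries[key] = [value, now, now, lifespan, max_idle]

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, created, last_used, lifespan, max_idle = entry
        now = self.clock()
        too_old = lifespan is not None and now - created >= lifespan      # counted from creation
        too_idle = max_idle is not None and now - last_used >= max_idle   # counted from last access
        if too_old or too_idle:
            del self.entries[key]
            return None
        entry[2] = now  # a successful read resets the idle timer
        return value


now = [0.0]
cache = ExpiringCache(clock=lambda: now[0])
cache.put("session", "data", lifespan=10, max_idle=3)
now[0] = 2.0
print(cache.get("session"))  # still there: read 2 seconds after creation
```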
83. Quick Start Lab - JBoss Data Grid83
•Both avoid Out Of Memory
•“Evicted” data can be maintained in the Grid
with Passivation
•Eviction is a Cache configuration
•Expiration is a Key configuration
•Expiration could be a business requirement
•Eviction is a system feature
Eviction/Expiry: differences
84. Quick Start Lab - JBoss Data Grid84
•JDG has full support for transactions
•Local Transactions
•Global Transactions (XA): when running inside an
application server, JDG automatically uses its TX Manager
•Batching API
Transactions
85. Quick Start Lab - JBoss Data Grid85
•Cache/CacheManager events
•Topology changes
•Entries being added, removed, modified
•Cluster listeners
Listener/Notifications
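Entry notifications follow the usual observer pattern. The sketch below covers only the added, modified and removed events for a single local map. `ObservableCache` and the event names are hypothetical and do not reproduce the JDG listener annotations.

```python
class ObservableCache:
    def __init__(self):
        self.data = {}
        self.listeners = []

    def add_listener(self, callback):
        """Register a callable taking (event, key, value)."""
        self.listeners.append(callback)

    def _notify(self, event, key, value):
        for callback in self.listeners:
            callback(event, key, value)

    def put(self, key, value):
        event = "modified" if key in self.data else "added"
        self.data[key] = value
        self._notify(event, key, value)

    def remove(self, key):
        if key in self.data:
            value = self.data.pop(key)
            self._notify("removed", key, value)


events = []
cache = ObservableCache()
cache.add_listener(lambda event, key, value: events.append((event, key)))
cache.put("a", 1)
cache.put("a", 2)
cache.remove("a")
print(events)
```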
86. Quick Start Lab - JBoss Data Grid86
•Infinispan-query module
•Hibernate Search & Lucene
•Querying via DSL
•Lucene indexes can be kept in memory, on
disk or in the grid
Querying the grid
87. Quick Start Lab - JBoss Data Grid87
•With M/R you can implement global distributed
operations on the grid
•Each node works on its data (Map)
•Results are later aggregated (Reduce)
Map/Reduce
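The two phases can be shown with a word count over data that is already split by node. This is a single-process sketch with hypothetical names and sample data. In a real grid the map phase would run on each node against its local entries.

```python
from collections import Counter
from functools import reduce


def map_on_node(local_entries):
    """Map phase: each node counts the words in its own entries."""
    counts = Counter()
    for text in local_entries.values():
        counts.update(text.split())
    return counts


def reduce_results(partials):
    """Reduce phase: merge the per-node results into one."""
    return reduce(lambda left, right: left + right, partials, Counter())


# Hypothetical grid content, keyed by the node that owns each entry.
grid = {
    "node-a": {"k1": "data grid", "k2": "grid cache"},
    "node-b": {"k3": "cache grid"},
}
totals = reduce_results(map_on_node(entries) for entries in grid.values())
print(totals)
```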
90. Quick Start Lab - JBoss Data Grid90
•JDG 7 will implement the HDFS API
•So it will be able to act as a super fast Hadoop
store
Hadoop, coming soon…
91. Quick Start Lab - JBoss Data Grid91
•With Distexec you can submit “tasks” to the
Grid
•The task can be executed on each node or
on a subset of the nodes
•The task can modify data in the Grid
Distributed Execution (Distexec)
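A task submitted to every node, or to a subset, can be mimicked with a thread pool in which each thread plays one node. The `submit` helper and the sample task are hypothetical, and the dictionaries stand in for each node's local data.

```python
from concurrent.futures import ThreadPoolExecutor


def submit(task, grid, nodes=None):
    """Run `task` against the local data of every node, or of the given subset."""
    targets = list(grid) if nodes is None else list(nodes)
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        futures = {name: pool.submit(task, grid[name]) for name in targets}
        return {name: future.result() for name, future in futures.items()}


def double_all(local_entries):
    """A task that modifies data in the grid and reports how many entries it touched."""
    for key in local_entries:
        local_entries[key] *= 2
    return len(local_entries)


grid = {"node-a": {"x": 1, "y": 2}, "node-b": {"z": 3}}
everywhere = submit(double_all, grid)
only_a = submit(double_all, grid, nodes=["node-a"])
print(everywhere, only_a, grid)
```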
93. Quick Start Lab - JBoss Data Grid93
•“Follow the Sun” architectures
•Many different clusters that can be kept in
sync
Cross Site Replication
94. Quick Start Lab - JBoss Data Grid94
•JSR-107
•Java Temporary Caching API
•Confirmed in January 2015
•In roadmap for JDG 6.5
•JSR-347
•Data Grids for the Java Platform
•JSR Retired in January 2015
Standard APIs
95. Quick Start Lab - JBoss Data Grid95
•Command Line Console
•JMX
•JON Plugin
Management Tooling
96. Quick Start Lab - JBoss Data Grid96
•User Authentication
•SASL
•Role Based Access Control (RBAC)
•Users, Roles and mapping between roles and
operations on Cache / Cache-Manager
•Node Authentication & Authorisation
•Encrypted communication between nodes
Data Security
97. Quick Start Lab - JBoss Data Grid97
•Library mode
•Embedded in your JVM
•C/S mode
•REST
•Memcached
•Hot Rod
Embedded vs Client/Server
98. Quick Start Lab - JBoss Data Grid98
Embedded vs Client/Server
99. Quick Start Lab - JBoss Data Grid99
Mode | Protocol | Client Libs | Smart Routing | Load Balancing/Failover | TX | Listeners | M/R | Dist | Querying | Separated Cluster
Library mode | inVM | N/A | Yes | Dynamic | Yes | Yes | Yes | Yes | Yes | No
REST | Text | HTTP | No | Any HTTP load balancer | No | No | No | No | No | Yes
Memcached | Text | Many | No | Predefined server list | No | No | No | No | No | Yes
Hot Rod | Binary | Java/Python/C++ | Yes | Dynamic | Local w MVCC | Yes (6.4) | No | No | Yes (6.3) | Yes
Protocol Comparison