This document is a compressed introduction to Hadoop, SQL-on-Hadoop, and NoSQL technologies. It opens with welcoming remarks, then gives short overviews of key concepts: Hadoop origins and architecture, HDFS, YARN, MapReduce, Hive, and HBase. It also includes quick demos and encourages questions from the audience.
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
1. Cmprssd Intrduction To
Hadoop, SQL-on-Hadoop, NoSQL
@arsenyspb
Arseny.Chernov@Dell.com
Singapore University of Technology & Design
2016-11-09
2. Thank You For Inviting!
My special kind regards to:
Professor Meihui Zhang
Associate Director Hou Liang Seah
Industry Outreach Manager Robin Soo
3. 🤔 What am I supposed to do?..
Please raise hand if you…
…want to learn about modern data analytics ?..
…are OK if I use words like “Java” or “Command Line” or “Port”?..
…got enough kopi / teh / red bull for next 1 hour?..
…have hands-on experience with Hadoop, Spark, Hive?..
16. YARN In A 🌰 Nutshell
Yet Another Resource Negotiator
17. Legacy SQL Is All Structured
Traditional SQL databases: structured, Schema-on-Write

row key   color   shape    timestamp
first     red     square   HH:MM:SS
second    blue    round    HH:MM:SS

1. Create schema on file or block storage
2. Load data
3. Query data: select ROW KEY, COLOR from … where
Can’t add data before the schema is created.
To change the schema, the entire table must be dropped and re-loaded.
Dropping a TB-size table with Foreign Keys can take days.
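The schema-on-write workflow above can be sketched with SQLite standing in for a traditional RDBMS (table and column names here are illustrative):

```python
import sqlite3

# Schema-on-write: data cannot be loaded before the schema exists,
# and every INSERT is validated against that schema.
con = sqlite3.connect(":memory:")

# 1) Inserting before the schema is created fails.
try:
    con.execute("INSERT INTO shapes VALUES ('first', 'red', 'square')")
except sqlite3.OperationalError as e:
    print("rejected:", e)          # no such table: shapes

# 2) Create the schema, then load data.
con.execute("CREATE TABLE shapes (row_key TEXT, color TEXT, shape TEXT)")
con.execute("INSERT INTO shapes VALUES ('first', 'red', 'square')")
con.execute("INSERT INTO shapes VALUES ('second', 'blue', 'round')")

# 3) Query data.
rows = con.execute("SELECT row_key, color FROM shapes").fetchall()
print(rows)                        # [('first', 'red'), ('second', 'blue')]
```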
18. MapReduce In Color
Unstructured Schema-on-Read query over file.csv & other.txt

1. Load data straight from HDFS
2. Query data:
   - map
   - shuffle
   - reduce
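The three steps can be sketched in pure Python (a toy word count over in-memory "files", not actual Hadoop MapReduce):

```python
from itertools import groupby
from operator import itemgetter

# Stand-ins for HDFS blocks of unstructured text
docs = ["red square", "blue round", "red round"]

# map: emit (key, value) pairs — here, (word, 1)
mapped = [(word, 1) for doc in docs for word in doc.split()]

# shuffle: sort and group all values by key (Hadoop partitions and sorts here)
shuffled = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))

# reduce: aggregate each key's values
counts = {word: sum(v for _, v in pairs) for word, pairs in shuffled}
print(counts)  # {'blue': 1, 'red': 2, 'round': 2, 'square': 1}
```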
22. Hive In A 🌰 Nutshell
SQL interface to MapReduce Jobs
23. Relational DB
Relational DBs and SQL were conceived to:
– Remove repeated data, replacing it with tabular structures & relationships
  ▪ Provide an efficient & robust structure for data storage
– Exploit the regular structure with a declarative query language
  ▪ Structured Query Language (SQL)
DRY – Don’t Repeat Yourself
24. What Hive Is…
A SQL-like processing capability built on Hadoop
Enables easy data summarisation, ad-hoc reporting and querying, and
analysis of large volumes of data
Built on HiveQL (HQL), a SQL-like query language
– Statements run as MapReduce jobs
– Also allows MapReduce programmers to plug in custom mappers and reducers
• Works with plain text, HBase, ORC, Parquet and other formats
• Metadata is stored in a relational metastore (commonly MySQL)
25. Hive Schemas
Hive is schema-on-read
– Schema is only enforced when the data is read (at query time)
– Allows greater flexibility: the same data can be read using multiple schemas
Contrast with RDBMSes, which are schema-on-write
– Schema is enforced when the data is loaded
– Speeds up queries at the expense of load times
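Schema-on-read can be sketched in Python: the same raw file is parsed under two different schemas at query time (file contents and field names here are illustrative):

```python
import csv, io

# Schema-on-read: the raw file is loaded as-is; structure is applied
# only at query time, so the same bytes can serve multiple schemas.
raw = "first,red,square\nsecond,blue,round\n"   # stand-in for file.csv on HDFS

# Schema A: (row_key, color, shape)
schema_a = [dict(zip(("row_key", "color", "shape"), row))
            for row in csv.reader(io.StringIO(raw))]

# Schema B: the very same data read as an id plus a bag of attributes
schema_b = [{"id": row[0], "attributes": row[1:]}
            for row in csv.reader(io.StringIO(raw))]

print(schema_a[0]["color"])       # red
print(schema_b[0]["attributes"])  # ['red', 'square']
```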
27. What Hive Is Not…
Hive, like Hadoop, is designed for batch processing of large datasets
Not a real-time system, not fully SQL-92 compliant
– “Sibling” solutions like Tez, Impala and HAWQ offer more compliance
Latency and throughput are both high compared to a traditional RDBMS
– Even when dealing with relatively small data (<100 MB)
29. HBase In A 🌰 Nutshell
A non-relational, distributed database on HDFS, modeled after Google’s BigTable
30. ACID Is a Business Requirement for RDBMSs
Traditional DBs have excellent support for ACID transactions
– Atomic: All write operations succeed, or nothing is written
– Consistent: Integrity rules are guaranteed at commit
– Isolated: It appears to the user as if only one process executes at a
time. (Two concurrent transactions will not see one another’s changes
while “in flight”.)
– Durable: Updates made in a committed transaction are visible to
future transactions. (The effects of a process do not get lost if the
system crashes.)
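Atomicity, the “A” in ACID, can be sketched with SQLite standing in for a traditional DB: a transfer that fails mid-way rolls back entirely (account names and amounts are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100), ("bob", 0)])
con.commit()

try:
    with con:  # one transaction: commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 50 "
                    "WHERE name = 'alice'")
        raise RuntimeError("crash before the matching credit")  # simulated failure
except RuntimeError:
    pass

# The debit to alice was rolled back — all writes succeed, or nothing is written
balances = dict(con.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```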
31. Scale RDBMS?..
RDBMS is a bad fit for huge-scale, online applications
How to do sharding?..
Unlimited scaling up?..
Maybe give up on Joins for latency and do Master-Slave?..
Big Data describes the problem; Not Only SQL defines the general
approach to the solution:
– Emphasis on scale, distributed processing, use of commodity hardware
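A minimal sketch of the sharding question above: hash-based routing of row keys to shards (shard names and keys are hypothetical; real systems also handle re-sharding and replicas). This routing is exactly why cross-shard Joins become expensive scatter/gather operations:

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical nodes

def shard_for(row_key: str) -> str:
    # Stable hash of the key, so every client routes a key the same way
    h = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

for key in ("user:1", "user:2", "user:3"):
    print(key, "->", shard_for(key))
```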
32. Business Needs for “Not Only SQL”
Not Only SQL DBs evolved from web-scale use-cases
– Google, Amazon, Facebook, Twitter, Yahoo, …
  ▪ “Google Cache” = entire page saved into a cell of a BigTable database
  ▪ Columnar layout preferred
  ▪ Bloom filters reduce disk lookups for non-existent rows or columns,
speeding up database query operations
– Requirement for massive scale, which relational fits badly
  ▪ Queries relatively simple
  ▪ Direct interaction with online customers
– Cost-effective, dynamic horizontal scaling required
  ▪ Many nodes based on inexpensive (commodity) hardware
  ▪ Must manage frequent node failures & addition of nodes at any time
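The filter trick mentioned above can be sketched as a minimal Bloom filter: it answers “definitely absent” or “maybe present”, so a store can skip the disk read for rows that were never written (sizes and keys here are illustrative, not production-tuned):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, hashes=3):
        self.size, self.hashes, self.bits = size_bits, hashes, 0

    def _positions(self, key):
        # Derive k bit positions from salted hashes of the key
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False -> definitely absent; True -> present or false positive
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-first")
print(bf.might_contain("row-first"))    # True — no false negatives
print(bf.might_contain("row-missing"))  # False with high probability: skip the disk read
```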
34. Reminder: The CAP Theorem (pick 2, not 3)
Consistency – “Once a writer has written, all readers will see that
write.” Single Version of Truth?
Availability – “The system is available to serve 100% of requests and
complete them successfully.” No SPOF?..
Partition tolerance – “The system can continue to operate in the
presence of network partitions.” Replicas?..
35. Eventually Consistent vs. ACID
An artificial acronym you may see is BASE
– Basically Available
  ▪ The system seems to work all the time
– Soft State
  ▪ Not wholly consistent all the time, but…
– Eventual Consistency
  ▪ After a period with no updates, a given dataset will be consistent
Resulting systems are characterized as “eventually consistent”
– e.g. overbooking an airline or hotel and passing the risk to the customer
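Eventual consistency can be sketched with three in-memory dicts standing in for replicas and a `sync()` stand-in for background anti-entropy replication (all names here are illustrative):

```python
replicas = [{}, {}, {}]          # three copies of the same dataset

def write(key, value):
    # Acknowledged after landing on one replica — fast and available,
    # but the other replicas are now stale (soft state)
    replicas[0][key] = value

def sync():
    # Background replication: merge every replica's view, push it everywhere
    merged = {}
    for r in replicas:
        merged.update(r)
    for r in replicas:
        r.update(merged)

write("balance", 100)
stale = replicas[2].get("balance")   # a reader may still see the old state
sync()                               # after a period with no updates…
fresh = replicas[2].get("balance")   # …every replica converges
print(stale, fresh)  # None 100
```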
36. Non-relational distributed database
• HBase is a database: it has a schema, but it’s non-relational

row key   column family “color”                       column family “shape”
first     “red”: #F00, “blue”: #00F, “yellow”: #F0F   “square”:
second                                                “round”:, “size”: XXL

1.) Create column families
2.) Load data; multiples of rows form region files on HDFS
3.) Query data:
hbase> get “first”, “color”:”yellow”
COLUMN CELL
yellow timestamp=1295774833226, value=“#F0F”
hbase> get “second”, “shape”:”size”
COLUMN CELL
size timestamp=1295723467122, value=“XXL”
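The column-family layout above can be sketched as nested Python dicts — a toy model of the data model, not the real HBase client API (the `get` helper is hypothetical):

```python
# row key -> column family -> qualifier -> cell value.
# Rows in the same table may carry entirely different qualifiers.
table = {
    "first":  {"color": {"red": "#F00", "blue": "#00F", "yellow": "#F0F"},
               "shape": {"square": ""}},
    "second": {"shape": {"round": "", "size": "XXL"}},
}

def get(row_key, family, qualifier):
    # Mirrors the shell's: hbase> get "row", "family:qualifier"
    return table[row_key][family][qualifier]

print(get("first", "color", "yellow"))  # #F0F
print(get("second", "shape", "size"))   # XXL
```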
Expect students from a DB background to be comfortable here. Expect they will become uncomfortable when we get to CAP/BASE.
Each data-block is read from one of the data-nodes that holds it (assuming it is replicated multiple times). The NameNode tries to assign the read to the least busy data-node.
Note: the ‘client’ is whatever code that is reading the data from hdfs. It could be anything: a web app, Spring batch, Spring integration, a HAWQ query, anything. Typically it is running on one of the nodes in the Hadoop cluster rather than externally to the cluster.
Good deep background on this can be found at http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
The # of writes to data nodes depends on the replication factor, of course. The NameNode returns as many (data-node/data-block) pairs as the replication factor.
The first write is ‘on rack’, or at least as close to the client as possible. The other two writes are ‘off rack’ – on a different rack from the first write.
We believe it is the ResourceManager that takes care of copying the job jars and config.
A confusing aspect of this slide is that there is a ‘Node Manager Node’ that spawns MRAppMaster AND a ‘Node Manager Node’ that launches a YARN child. From Hadoop: The Definitive Guide (pp. 197–198), what this seems to be saying is that all data nodes also run a NodeManager daemon process. That process can be contacted by the ResourceManager to launch an MR job, which creates an MRAppMaster internally. While managing the job, the MRAppMaster can contact another NodeManager elsewhere in the cluster to spawn a YARN child, which runs either a Mapper task or a Reducer task.
Note that Pig is schema-on-read too. So is MapReduce.
Hierarchical and Network DBs actually predate Relational.
Most of these companies were small startups, did not start with the resources necessary to buy big iron.
Maybe they don't need consistent data ever for some datasets!
Examples, based on previous:
- The bank has decided that it is OK to allow deposits and withdrawals during a partition failure. When the system comes back up we will reconcile, see how much we lost, and book it as the cost of high availability. My online banking site shows pending deposits and withdrawals (soft state); even the bank can’t tell me at a precise moment in time what is in my account!
- An airline or hotel may decide to overbook and pass the cost of inconsistent state on to the customer (ever happen to you?)
For deep background, a student suggested “Building on Quicksand”: http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf