Cmprssd Intrduction To
Hadoop, SQL-on-Hadoop, NoSQL
@arsenyspb
Arseny.Chernov@Dell.com
Singapore University of Technology & Design
2016-11-09
Thank You For Inviting!
My special kind regards to:
Professor Meihui Zhang
Associate Director Hou Liang Seah
Industry Outreach Manager Robin Soo
🤔 What am I supposed to do?..
Please raise hand if you…
…want to learn about modern data analytics ?..
…are OK if I use words like “Java” or “Command Line” or “Port”?..
…got enough kopi / teh / red bull for next 1 hour?..
…have hands-on experience with Hadoop, Spark, Hive?..
Shameless Self-Intro
Hi, My Name Is Arseny, And I’m…
Hadoop In A 🌰 Nutshell
It All Started At Google
[Timeline figure: 1998 to 2016]
Hadoop is Google’s Tech in Open Source
[Timeline figure: 2003, 2004, 2006]
Hadoop Originates From Hyperscale Approach
However, in 2016 big data & Hadoop don’t need a hyperscale datacenter
Page10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Closer Look, i.e. Hortonworks Data Platform (HDP)
[Diagram: Hortonworks Data Platform 2.3]
YARN: Data Operating System, running on nodes 1 to N
GOVERNANCE & INTEGRATION
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
– Data Lifecycle & Governance: Falcon, Atlas
DATA ACCESS
– Batch: MapReduce
– Script: Pig
– Search: Solr
– SQL: Hive
– NoSQL: HBase, Accumulo, Phoenix
– Stream: Storm
– In-memory: Spark
– Others: ISV Engines
– Tez, Slider
DATA MANAGEMENT
– HDFS (Hadoop Distributed File System)
SECURITY
– Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption
OPERATIONS
– Provisioning, Managing & Monitoring: Ambari, Cloudbreak, Zookeeper
– Scheduling: Oozie
We will “compress” all these topics into the next hour
Quick demo
HDFS In A 🌰 Nutshell
Hadoop Distributed File System
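© 2015 Pivotal Software, Inc. All rights reserved.
Reading Data From HDFS
[Diagram: a client node (client JVM running the HDFS client, DistributedFileSystem and FSDataInputStream), a NameNode and three DataNodes, each in its own JVM]
1: open (HDFS client → DistributedFileSystem)
2: request file block locations (DistributedFileSystem → NameNode)
3: read (HDFS client → FSDataInputStream)
4: read from block (FSDataInputStream → first DataNode)
5: read from block (FSDataInputStream → next DataNode)
6: close (HDFS client → FSDataInputStream)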
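The read sequence above can be mimicked with a toy, in-memory model. This is only a sketch under stated assumptions: `ToyNameNode`, `ToyDataNode` and `read_file` are hypothetical names, not the Hadoop client API. It shows one idea from the diagram: the NameNode hands out block locations, while the bytes themselves come from the DataNodes.

```python
class ToyDataNode:
    """Holds block contents keyed by block id (stands in for a DataNode)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


class ToyNameNode:
    """Keeps only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.files = {}  # path -> list of (block_id, [datanodes holding it])

    def get_block_locations(self, path):
        return self.files[path]


def read_file(namenode, path):
    """Ask the NameNode for locations, then read each block from a DataNode."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):  # step 2
        data += replicas[0].read_block(block_id)  # steps 4 and 5
    return data


# Hypothetical two-block file spread over two DataNodes.
dn1, dn2 = ToyDataNode("dn1"), ToyDataNode("dn2")
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"hadoop"
nn = ToyNameNode()
nn.files["/demo/file.txt"] = [("blk_1", [dn1]), ("blk_2", [dn2])]

content = read_file(nn, "/demo/file.txt")
```

The NameNode never touches file data in this model, which is why a single metadata server can front many DataNodes.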
Writing Data to HDFS
[Diagram: a client node (client JVM running the HDFS client, DistributedFileSystem, FSDataOutputStream and DataStreamer), a NameNode and three DataNodes, each in its own JVM; the diagram shows 3x replication]
1: create (HDFS client → DistributedFileSystem)
2: create (DistributedFileSystem → NameNode)
3a: write (HDFS client → FSDataOutputStream)
3b: request allocation, as new blocks are required (DataStreamer → NameNode)
3c: three data-node, data-block pairs returned
4a, 4b, 4c: write packet (to the first, second and third DataNode in the pipeline)
5a, 5b, 5c: ack packet (back along the pipeline, from the last DataNode towards the client)
6: close
7: complete (→ NameNode)
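The write pipeline with 3x replication can be sketched as a chain of toy DataNodes, each storing a packet, forwarding it downstream and passing acknowledgements back. The class and function names are hypothetical and nothing here is the real HDFS protocol; it only illustrates the "write packet" and "ack packet" steps.

```python
class ToyDataNode:
    def __init__(self, name):
        self.name = name
        self.packets = []

    def write_packet(self, packet, downstream):
        """Store the packet, forward it down the pipeline, return the acks."""
        self.packets.append(packet)
        acks = [self.name]
        if downstream:
            # Acks from nodes further down the pipeline arrive first.
            acks = downstream[0].write_packet(packet, downstream[1:]) + acks
        return acks


def write_block(data, pipeline, packet_size=4):
    """Split data into packets and push each one through the pipeline."""
    acks = []
    for i in range(0, len(data), packet_size):
        packet = data[i:i + packet_size]
        acks = pipeline[0].write_packet(packet, pipeline[1:])
        # A packet counts as written only once every replica has acked it.
        assert len(acks) == len(pipeline)
    return acks


pipeline = [ToyDataNode("dn1"), ToyDataNode("dn2"), ToyDataNode("dn3")]
last_acks = write_block(b"0123456789", pipeline)
```

The client sends each packet only to the first DataNode; replication happens node to node, so client bandwidth does not grow with the replication factor.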
Quick demo
YARN In A 🌰 Nutshell
Yet Another Resource Negotiator
Traditional SQL databases: structured Schema-on-Write
Legacy SQL Is All Structured
row keys | color | shape | timestamp
first | red | square | HH:MM:SS
second | blue | round | HH:MM:SS
......
1 create schema on file or block storage
2 load data
3 query data
select ROW KEY, COLOR from … where
Can’t add data before the schema is created.
To change the schema, drop and reload the entire table.
Dropping a TB-size table with foreign keys could take days.
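The three schema-on-write steps can be tried with Python's built-in `sqlite3` module. This is a minimal sketch: the table name `shapes`, the column names and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. Create the schema first.
conn.execute(
    "CREATE TABLE shapes (row_key TEXT PRIMARY KEY, color TEXT, shape TEXT, ts TEXT)"
)

# 2. Load data that fits the schema.
conn.executemany(
    "INSERT INTO shapes VALUES (?, ?, ?, ?)",
    [("first", "red", "square", "10:00:00"), ("second", "blue", "round", "10:00:01")],
)

# 3. Query data.
rows = conn.execute(
    "SELECT row_key, color FROM shapes WHERE shape = 'round'"
).fetchall()

# A row with an extra field does not fit the schema and is rejected at write time.
try:
    conn.execute(
        "INSERT INTO shapes VALUES (?, ?, ?, ?, ?)",
        ("third", "green", "oval", "10:00:02", "XXL"),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True
```

The rejected insert is the point of the slide: the schema is enforced when data is written, not when it is read.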
file.csv & other.txt
Unstructured Schema-on-Read Query
MapReduce In Color
1 load data straight from HDFS
2 query data
- map
- shuffle
- reduce
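The map, shuffle and reduce phases can be sketched in plain Python on a few CSV lines. It runs in a single process and involves no Hadoop API; the sample lines and function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Raw lines as they might sit in a file; no schema was declared up front.
raw_lines = [
    "first,red,square",
    "second,blue,round",
    "third,red,round",
]


def map_phase(line):
    """Apply a schema at read time: split the line and emit (color, 1)."""
    _row_key, color, _shape = line.split(",")
    yield color, 1


def shuffle_phase(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values of one key into a single result."""
    return key, sum(values)


mapped = [pair for line in raw_lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
```

On a cluster the map calls run in parallel next to the data blocks and the shuffle moves data over the network; the logic per phase stays the same.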
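MapReduce In Process Diagram
Starting Job – MapReduce v2.0
[Diagram: a client node (client JVM running the MapReduce program and Job), a node labelled “Jobtracker Node” running the ResourceManager JVM, a Node Manager node running MRAppMaster, a Node Manager node running a child JVM (YARN child, Mapper or Reducer), and a shared file system (e.g. HDFS)]
1: initiate job
2: request new application (client → ResourceManager)
3: copy job jars, config (client → shared file system)
4: submit job (client → ResourceManager)
5a: start container (ResourceManager → Node Manager)
5b: launch (Node Manager → MRAppMaster)
5c: initialize job
6: determine input splits (from the shared file system)
7a: allocate task resources (MRAppMaster → ResourceManager)
7b: start container (MRAppMaster → Node Manager)
8: launch (Node Manager → child JVM)
9: retrieve job jars, data (from the shared file system)
10: run (Mapper or Reducer)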
Quick demo
Hive In A 🌰 Nutshell
SQL interface to MapReduce Jobs
Relational DB
• Relational DB and SQL conceived to
– Remove repeated data, replace with tabular structure & relationships
▪ Provide efficient & robust structure for data storage
• Exploit regular structure with declarative query language
– Structured Query Language
DRY – Don’t Repeat Yourself
What Hive Is…
• A SQL-like processing capability based on Hadoop
• Enables easy data summarisation, ad-hoc reporting and querying, and analysis of large volumes of data
• Built on HiveQL (HQL), a SQL-like query language
– Statements run as MapReduce jobs
– Also allows MapReduce programmers to plug in custom mappers and reducers
• Works with plain text, HBase, ORC, Parquet and other formats
• Metadata is stored in the Hive Metastore, backed here by MySQL
Hive Schemas
• Hive is schema-on-read
– Schema is only enforced when the data is read (at query time)
– Allows greater flexibility: same data can be read using multiple schemas
• Contrast with RDBMSes, which are schema-on-write
– Schema is enforced when the data is loaded
– Speeds up queries at the expense of load times
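Schema-on-read can be illustrated without Hive at all: the same stored text is interpreted through two different column lists at query time. `read_with_schema` and both schemas below are hypothetical and only demonstrate the idea.

```python
import csv
import io

# The stored data: plain text, written without any schema.
raw = "first,red,square\nsecond,blue,round\n"


def read_with_schema(text, columns):
    """Attach column names only at read time; the stored text is untouched."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(columns, row)) for row in reader]


# Two readers, two schemas, one copy of the data.
as_shapes = read_with_schema(raw, ["row_key", "color", "shape"])
as_pairs = read_with_schema(raw, ["id", "attr"])  # zip drops unnamed trailing fields
```

Loading stays cheap because nothing is validated or converted on write; the parsing cost is paid on every read instead.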
Hive Architecture
Hive Metastore + MySQL
What Hive Is Not…
• Hive, like Hadoop, is designed for batch processing of large datasets
• Not a real-time system, not fully SQL-92 compliant
– “Sibling” solutions like Tez, Impala and HAWQ offer more compliance
• Latency and throughput are both high compared to a traditional RDBMS
– Even when dealing with relatively small data (<100 MB)
Quick demo
HBASE In A 🌰 Nutshell
Non-relational distributed database
ACID is a Business Requirement for RDBMSs
• Traditional DBs have excellent support for ACID transactions
– Atomic: All write operations succeed, or nothing is written
– Consistent: Integrity rules guaranteed at commit
– Isolated: It appears to the user as if only one process executes at a time. (Two concurrent transactions will not see one another’s changes while “in flight”.)
– Durable: The updates made to the database in a committed transaction will be visible to future transactions. (Effects of a process do not get lost if the system crashes.)
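Atomicity and consistency can be observed with Python's built-in `sqlite3` module. In this sketch the `accounts` table, the `CHECK` rule and the `transfer` helper are invented for illustration; a transfer that would break the integrity rule is rolled back as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates commit
failed = transfer(conn, "alice", "bob", 500)  # debit violates CHECK, credit is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer the credit to `bob` has already executed when the debit is rejected, yet it never becomes visible: that is the "all or nothing" guarantee.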
Scale RDBMS?..
• RDBMS is a bad fit for huge-scale, online applications
• How to do Sharding?..
• Unlimited but Scaling up?..
• Maybe give up on Joins for latency and do Master-Slave?..
• Big Data describes the problem, Not Only SQL defines the general approach to the solution:
– Emphasis on scale, distributed processing, use of commodity hardware
Business Needs for “Not Only SQL”
• Not Only SQL DBs evolved from web-scale use-cases
– Google, Amazon, Facebook, Twitter, Yahoo, …
▪ “Google Cache” = entire page saved into a cell of a BigTable database
▪ Columnar layout preferred
▪ Bloom filters reduce disk lookups for non-existent rows or columns, which speeds up database queries
– Requirement for massive scale, relational fits badly
▪ Queries relatively simple
▪ Direct interaction with online customers
– Cost-effective, dynamic horizontal scaling required
▪ Many nodes based on inexpensive (commodity) hardware
▪ Must manage frequent node failures & addition of nodes at any time
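A minimal sketch of a Bloom filter, the kind of filter that Bigtable-style stores such as HBase can use to skip disk lookups for keys that are definitely absent. The sizes, the SHA-256-based hashing and the class itself are illustrative assumptions, not taken from any particular database.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for row_key in ("first", "second"):
    bf.add(row_key)
```

A store would consult `might_contain` before touching disk: a `False` answer is always correct, so the lookup can be skipped, while a `True` answer may occasionally be a false positive and still needs the real read.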
🤔 But how to build such DB?..
Reminder: The CAP Theorem (2 not 3)
Consistency: “Once a writer has written, all readers will see that write” (Single Version of Truth?)
Availability: “System is available to serve 100% of requests and complete them successfully.” (No SPOF?..)
Partition tolerance: “A system can continue to operate in the presence of network partitions” (Replicas?..)
Eventually Consistent vs. ACID
• An artificial acronym you may see is BASE
– Basically Available
▪ System seems to work all the time
– Soft State
▪ Not wholly consistent all the time, but…
– Eventual Consistency
▪ After a period with no updates, a given dataset will be consistent
• Resulting systems characterized as “eventually consistent”
– Overbooking an airline or hotel and passing risk to customer
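Eventual consistency can be sketched with a toy replicated store in which a write lands on one replica and reaches the others only later. `EventuallyConsistentStore` and its `sync` step are hypothetical simplifications; real systems propagate updates in the background and must also resolve conflicts.

```python
class Replica:
    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Writes land on one replica and reach the others only on sync()."""

    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []  # (key, value) updates not yet propagated

    def write(self, key, value):
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].data.get(key)

    def sync(self):
        """Propagate pending updates, in order, to every other replica."""
        for key, value in self.pending:
            for r in self.replicas[1:]:
                r.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("seat_12A", "booked")
stale = store.read("seat_12A", replica=2)  # update has not arrived yet
store.sync()
fresh = store.read("seat_12A", replica=2)  # replicas have converged
```

Between `write` and `sync` a reader on another replica still sees the seat as free, which is the overbooking risk mentioned above.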
Non-relational distributed database
• HBase is a database: has a schema, but it’s non-relational
row keys | column family “color” | column family “shape”
first | “red”: #F00, “blue”: #00F, “yellow”: #FF0 | “square”:
second | | “round”: , “size”: XXL
1.) Create column families
2.) Load data, multiples of rows form region files on HDFS
3.) Query data
hbase> get “first”, “color”:“yellow”
COLUMN CELL
yellow timestamp=1295774833226, value=“#FF0”
hbase> get “second”, “shape”:“size”
COLUMN CELL
size timestamp=1295723467122, value=“XXL”
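The data model above (row key, then column family, then column, then a timestamped value) can be approximated with nested dictionaries. `ToyTable` is a hypothetical stand-in, not the HBase client API: only the column families are fixed up front, and each row may carry different columns.

```python
import time


class ToyTable:
    """row key -> column family -> column qualifier -> (timestamp, value)."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = (int(time.time() * 1000), value)  # millisecond timestamp
        self.rows.setdefault(row, {}).setdefault(family, {})[qualifier] = cell

    def get(self, row, family, qualifier):
        return self.rows[row][family][qualifier]


table = ToyTable(["color", "shape"])          # 1) create column families
table.put("first", "color", "red", "#F00")    # 2) load data
table.put("second", "shape", "size", "XXL")   # a column only this row has
ts, value = table.get("second", "shape", "size")  # 3) query data
```

Adding the `size` column to one row needed no schema change, whereas writing to an undeclared column family is refused.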
Column Oriented Storage
HBase Architecture, Read & Write
[Diagram: clients, Zookeeper and a RegionServer on top of HDFS]
Clients:
– HBase API client
– SQL JDBC client via Apache Phoenix (HBase client)
– SQL ODBC client via Pivotal HAWQ PXF (HBase client)
Write path:
(1) Put/Delete
(2.0) Write to WAL (Write-Ahead Log)
(2.1) Write to MemStore
(3) Flush to HDFS (HFile)
Read path:
(4) Get/Scan read request
Annotations: Sequential HDFS Write & L2 Read; Adaptive Pre Fetch & L2 Reads; Sequential Writes; Client RAM Pre-Fetch; Memstore = Eventual Consistency
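The numbered write and read steps of the diagram can be sketched as a toy region server: every put goes to a write-ahead log and an in-memory store, and the store is flushed to an immutable file once it grows large enough. `ToyRegionServer` and the flush threshold are illustrative assumptions; real HBase adds compactions, block caches and WAL replay on recovery.

```python
class ToyRegionServer:
    def __init__(self, flush_threshold=2):
        self.wal = []        # append-only log, written first
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # immutable flushed snapshots, newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))  # (2.0) write to WAL
        self.memstore[key] = value     # (2.1) write to MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """(3) Persist the MemStore as a sorted, immutable file and clear it."""
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        """(4) Check the MemStore first, then flushed files, newest first."""
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            if key in hfile:
                return hfile[key]
        return None


rs = ToyRegionServer()
rs.put("first", "red")
rs.put("second", "blue")   # reaches the threshold and triggers a flush
rs.put("first", "green")   # newer value stays in the MemStore
```

Both the log append and the flush are sequential writes, which is why this layout suits a file system like HDFS that does not support in-place updates.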
HBase namespace layout
From “Hbase Definitive Guide”
http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2
Compression (HBase and others)
Q&A?..
http://bit.ly/isilonhbase
@arsenyspb
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

  • 1. Cmprssd Intrduction To Hadoop, SQL-on-Hadoop, NoSQL @arsenyspb Arseny.Chernov@Dell.com Singapore University of Technology & Design 2016-11-09
  • 2. Thank You For Inviting! My special kind regards to: Professor Meihui Zhang Associate Director Hou Liang Seah Industry Outreach Manager Robin Soo
  • 3. 🤔 What am I supposed to do?.. Please raise hand if you… …want to learn about modern data analytics ?.. …are OK if I use words like “Java” or “Command Line” or “Port”?.. …got enough kopi / teh / red bull for next 1 hour?.. …have hands-on experience with Hadoop, Spark, Hive?..
  • 5. 5 Hi, My Name Is Arseny, And I’m…
  • 6. Hadoop In A 🌰 Nutshell
  • 8. 8 2003 2004 2006 Hadoop is Google’s Tech in Open Source 2006
  • 9. 9 Hadoop Originates From Hyperscale Approach However, in 2016 big data & Hadoop don’t need a hyperscale datacenter
  • 10. Page10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Closer Look, i.e. Hortonworks Data Platform (HDP) YARN : Data Operating System DATA ACCESS SECURITY GOVERNANCE & INTEGRATION OPERATIONS 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° N Administration Authentication Authorization Auditing Data Protection Ranger Knox Atlas HDFS EncryptionData Workflow Sqoop Flume Kafka NFS WebHDFS Provisioning, Managing, & Monitoring Ambari Cloudbreak Zookeeper Scheduling Oozie Batch MapReduce Script Pig Search Solr SQL Hive NoSQL HBase Accumulo Phoenix Stream Storm In-memory Spark Others ISV Engines TezTez Tez Slider Slider HDFS Hadoop Distributed File System DATA MANAGEMENT Hortonworks Data Platform 2.3 Data Lifecycle & Governance Falcon Atlas We will “compress” all these topics during next 1 hour
  • 12. HDFS In A 🌰 Nutshell Hadoop Distributed File System
  • 13. 13© 2015 Pivotal Software, Inc. All rights reserved. Reading Data From HDFS Client Node Client JVM Distributed FileSystem HDFS Client 1: open FSData InputStream namenode JVM NameNode datanode JVM DataNode datanode JVM DataNode datanode JVM DataNode 2: Request file block locations 3: read 6: close 4: read from block 5: read from block
  • 14. 14© 2015 Pivotal Software, Inc. All rights reserved. Writing Data to HDFS Client Node Client JVM Distributed FileSystem HDFS Client 1: create FSDataOutputStr eam namenode JVM NameNode datanode JVM DataNode datanode JVM DataNode datanode JVM DataNode 2: create 3a: write 6: close 4a: write packet 5c: ack packet 4b: write packet 4c: write packet 5b: ack packet 5a: ack packet 7: complete DataStreamer 3b: Request allocation (as new blocks required) 3c: Three data-node, data-block pairs returned Diagram shows 3x replication
  • 16. YARN In A 🌰 Nutshell Yet Another Resource Negotiator
  • 17. 17 Traditional SQL databases: structured Schema-on-Write Legacy SQL Is All Structured row keys color shape timestamp row row row ...... first red square HH:MM:SS second blue round HH:MM:SS 1 create schema on file or block storage 2 load data 3 query data select ROW KEY, COLOR from … where Can’t add data before the schema is created. To change schema, drop and re-loaded entire table. A drop of TB-size table with Foreign Keys could last days.
  • 18. 18 file.csv & other.txt Unstructured Schema-on-Read Query MapReduce In Color 1 load data straight from HDFS 2 query data - map - shuffle - reduce
  • 20. 20© 2015 Pivotal Software, Inc. All rights reserved. Starting Job – MapReduce v2.0 Client Node Client JVM Job MapReduce program Jobtracker Node 1: initiate job 2: request new application 3: copy job jars, config 4: submit job 9: retrieve job jars, data Node Manager Node JVM Node manager Child JVM YARN child Mapper or Reducer 10: run Shared File-System (e.g. HDFS) 6: determine input splits 7b: start container Node Manager Node JVM MRApp Master Node Manager 5b: launch 5c: initialize job 5a: start container 7a: allocate task resources 8: launch JVM ResourceManager
  • 22. Hive In A 🌰 Nutshell SQL interface to MapReduce Jobs
  • 23. 23 Relational DB  Relational DB and SQL conceived to – Remove repeated data, replace with tabular structure & relationships ▪ Provide efficient & robust structure for data storage  Exploit regular structure with declarative query language –Structured Query Language DRY – Don’t Repeat Yourself
  • 24. 24 What Hive Is…  A SQL-like processing capability based on Hadoop  Enables easy data summarisation, ad-hoc reporting and querying, and analysis of large volumes of data  Built on HQL, a SQL-like query language – Statements run as mapreduce jobs – Also allows mapreduce programmers to plugin custom mappers and reducers • Works with Plain text, Hbase, ORC, Parquet and others formats • Metadata is stored in MySQL
  • 25. Hive Schemas
      • Hive is schema-on-read
        – Schema is only enforced when the data is read (at query time)
        – Allows greater flexibility: same data can be read using multiple schemas
      • Contrast with RDBMSes, which are schema-on-write
        – Schema is enforced when the data is loaded
        – Speeds up queries at the expense of load times
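A minimal sketch of schema-on-read, assuming a small in-memory string stands in for a raw file on HDFS. The helper `read_with_schema` and the column names are hypothetical; the point is only that the same bytes can be read through two different schemas, and that nothing is checked until read time.

```python
import csv
import io

# Hypothetical raw file contents; nothing was validated when it was stored.
raw = "first,red,square,10:15:00\nsecond,blue,round,10:16:30\n"

def read_with_schema(text, columns, casts=None):
    """Apply column names (and optional per-column casts) while reading."""
    casts = casts or {}
    rows = []
    for record in csv.reader(io.StringIO(text)):
        # zip stops at the shorter sequence, so a narrow schema
        # simply ignores the trailing fields.
        row = dict(zip(columns, record))
        for name, cast in casts.items():
            row[name] = cast(row[name])
        rows.append(row)
    return rows

# Schema A: all four fields, everything kept as text.
full = read_with_schema(raw, ["row_key", "color", "shape", "ts"])

# Schema B: same data, only two fields, color normalised at read time.
narrow = read_with_schema(raw, ["id", "colour"], casts={"colour": str.upper})

print(full[0])
print(narrow[0])
```

A schema-on-write system would instead reject or convert the rows at load time, which is where the load-time cost mentioned on the slide comes from.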
  • 27. What Hive Is Not…
      • Hive, like Hadoop, is designed for batch processing of large datasets
      • Not a real-time system, not fully SQL-92 compliant
        – “Sibling” solutions like Tez, Impala and HAWQ offer more compliance
      • Latency and throughput are both high compared to a traditional RDBMS
        – Even when dealing with relatively small data (<100 MB)
  • 29. HBase In A 🌰 Nutshell Non-relational distributed database
  • 30. ACID is a Business Requirement for RDBMS
      • Traditional DBs have excellent support for ACID transactions
        – Atomic: All write operations succeed, or nothing is written
        – Consistent: Integrity rules guaranteed at commit
        – Isolation: It appears to the user as if only one process executes at a time. (Two concurrent transactions will not see one another’s changes while “in flight”.)
        – Durable: The updates made to the database in a committed transaction will be visible to future transactions. (Effects of a process do not get lost if the system crashes.)
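Atomicity and consistency can be illustrated with SQLite from Python's standard library. This is a sketch under assumptions: the `accounts` table, its CHECK constraint and the `transfer` helper are invented for the example and are not from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # integrity rule
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to bob has already been executed when the debit violates the constraint; the rollback undoes it, so no partial write is left behind.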
  • 31. Scale RDBMS?..
      • RDBMS is a bad fit for huge-scale, online applications
      • How to do Sharding?..
      • Unlimited but Scaling up?..
      • Maybe give up on Joins for latency and do Master-Slave?..
      • Big Data describes the problem, Not Only SQL defines the general approach to the solution:
        – Emphasis on scale, distributed processing, use of commodity hardware
  • 32. Business Needs for “Not Only SQL”
      • Not Only SQL DBs evolved from web-scale use-cases
        – Google, Amazon, Facebook, Twitter, Yahoo, …
          ▪ “Google Cache” = Entire page saved into a cell of a BigTable database
          ▪ Columnar layout preferred
          ▪ Filters that reduce disk lookups for non-existent rows or columns increase the performance of a database query operation
        – Requirement for massive scale, relational fits badly
          ▪ Queries relatively simple
          ▪ Direct interaction with online customers
        – Cost-effective, dynamic horizontal scaling required
          ▪ Many nodes based on inexpensive (commodity) hardware
          ▪ Must manage frequent node failures & addition of nodes at any time
  • 33. 🤔 But how to build such DB?..
  • 34. Reminder: The CAP Theorem (2 not 3)
      • Consistency: “Once a writer has written, all readers will see that write” (Single Version of Truth?)
      • Availability: “System is Available to serve 100% of requests and complete them successfully.” (No SPOF?..)
      • Partition tolerance: “A system can continue to operate in the presence of network partitions” (Replicas?..)
  • 35. Eventually Consistent vs. ACID
      • An artificial acronym you may see is BASE
        – Basically Available
          ▪ System seems to work all the time
        – Soft State
          ▪ Not wholly consistent all the time, but…
        – Eventual Consistency
          ▪ After a period with no updates, a given dataset will be consistent
      • Resulting systems characterized as “eventually consistent”
        – Overbooking an airline or hotel and passing risk to customer
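The “soft state, then eventual consistency” behaviour can be sketched with two toy replicas. The `Replica` class, the `anti_entropy` function and the last-write-wins rule keyed on a version number are assumptions chosen for brevity; real systems use a variety of reconciliation strategies.

```python
class Replica:
    """One copy of the data; accepts writes without asking its peers."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self.data.get(key)
        # Last-write-wins: keep the entry with the highest version.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(replicas):
    """Background sync: every replica offers its entries to every other."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)


a, b = Replica(), Replica()
a.write("seat_12A", "alice", version=1)  # booking lands on replica a
b.write("seat_12A", "bob", version=2)    # later booking lands on replica b

before = (a.read("seat_12A"), b.read("seat_12A"))  # replicas disagree
anti_entropy([a, b])
after = (a.read("seat_12A"), b.read("seat_12A"))   # replicas converge
print(before, after)
```

Between the two writes and the sync, readers of different replicas see different answers; once updates stop and the sync has run, both replicas agree, and one customer has been overbooked.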
  • 36. Non-relational distributed database
      • HBase is a database: has a schema, but it’s non-relational
      Example table (row key, column family “color”, column family “shape”):
        first | “red”: #F00, “blue”: #00F, “yellow”: #F0F | “square”:
        second | | “round”: , “size”: XXL
      1.) Create column families
      2.) Load data, multiples of rows form region files on HDFS
      3.) Query data
        hbase>get “first”, “color”:”yellow”
        COLUMN CELL
        yellow timestamp=1295774833226, value=“#F0F”
        hbase>get “second”, “shape”:”size”
        COLUMN CELL
        size timestamp=1295723467122, value=“XXL”
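The row key, column family, column and timestamp layout on this slide can be modelled as nested dictionaries. This is a conceptual sketch only, not the HBase client API; the `put` and `get` helpers are hypothetical, and the extra older “XL” cell is invented to show versioning.

```python
# row key -> column family -> column qualifier -> [(timestamp, value), ...]
table = {}

def put(row, family, qualifier, value, timestamp):
    """Store a new timestamped version of one cell."""
    versions = (
        table.setdefault(row, {})
        .setdefault(family, {})
        .setdefault(qualifier, [])
    )
    versions.append((timestamp, value))
    versions.sort(reverse=True)  # newest version first

def get(row, family, qualifier):
    """Return the newest (timestamp, value) of a cell, or None if absent."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, [])
    return versions[0] if versions else None

# Cells taken from the slide
put("first", "color", "yellow", "#F0F", 1295774833226)
put("second", "shape", "size", "XXL", 1295723467122)
# Hypothetical older version of the same cell
put("second", "shape", "size", "XL", 1295723460000)

print(get("first", "color", "yellow"))
print(get("second", "shape", "size"))
print(get("first", "shape", "size"))
```

Rows do not need the same columns: “first” has no “size” cell at all, and a lookup for it simply finds nothing rather than a NULL in a fixed column.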
  • 38. HBase Architecture, Read & Write
      Clients: SQL ODBC client (Pivotal HAWQ PXF, HBase client); SQL JDBC client (Apache Phoenix, HBase client); HBase API client
      Components: ZooKeeper; RegionServer with Write-Ahead Log (WAL), MemStore and HFile
      Write path: (1) Put/Delete; (2.0) Write to WAL; (2.1) Write to MemStore; (3) Flush to HDFS
      Read path: (4) Get/Scan read request
      Diagram labels: Sequential HDFS Write & L2 Read; Adaptive Pre-Fetch & L2 Reads; Sequential Writes; Client RAM Pre-Fetch; MemStore = Eventual Consistency
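The numbered write and read steps on this slide can be sketched as a toy class. It is a simplified single-process model, not HBase code: `RegionSketch`, its `flush_threshold` parameter and the use of plain dictionaries for HFiles are assumptions for illustration.

```python
class RegionSketch:
    """Toy model of the write path: WAL, then MemStore, then flush."""

    def __init__(self, flush_threshold=2):
        self.wal = []        # append-only log, written first for recovery
        self.memstore = {}   # recent writes held in memory
        self.hfiles = []     # immutable flushed files, newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))  # (2.0) write to WAL
        self.memstore[key] = value     # (2.1) write to MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()               # (3) flush to HDFS

    def flush(self):
        # Persist the MemStore as a sorted, immutable file and start afresh.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        # (4) read: check memory first, then files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            if key in hfile:
                return hfile[key]
        return None


region = RegionSketch(flush_threshold=2)
region.put("first", "red")
region.put("second", "blue")   # second entry triggers a flush
region.put("first", "green")   # newer value stays in the MemStore

print(region.get("first"), region.get("second"))
print(len(region.wal), region.hfiles)
```

Writing to the log before memory is what lets a restarted server replay unflushed edits, and flushing whole sorted files is what keeps the disk writes sequential.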
  • 40. Compression (HBase and others). From “Hbase Definitive Guide” http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2

Editor's Notes

  1. Expect students from a DB background to be comfortable here. Expect they will become uncomfortable when we get to CAP/BASE.
  2. Each data-block is read from one of the data-nodes that holds it (assuming it is replicated multiple times). The NameNode tries to assign the read to the least busy data-node. Note: the ‘client’ is whatever code is reading the data from HDFS. It could be anything: a web app, Spring batch, Spring integration, a HAWQ query, anything. Typically it is running on one of the nodes in the Hadoop cluster rather than externally to the cluster. Good deep background on this can be found at http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
  3. The # of writes to data nodes depends on the replication factor, of course. NameNode returns as many (data-node/data-block) pairs as the replication factor requires. The first write is ‘on rack’, or at least as close to the client as possible. The next two writes are ‘off rack’, on a different rack from the first write.
  4. We believe it is the resource manager that takes care of the copying of the job jars and config. A confusing aspect of this slide is that there is a ‘Node Manager Node’ that spawns MRAppMaster AND a “Node Manager Node” that launches a YARN child. From Hadoop the Definitive Guide (p197 & 198), it seems what this is saying is that all data nodes will also have a NodeManager daemon process running. That process could be contacted by the ResourceManager to launch a MR job, which would create an MRAppMaster internally. During the management of the job, the MRAppMaster could contact another NodeManager elsewhere in the cluster to spawn a YARN child, which would run either a Mapper task or a Reducer task.
  5. Note that Pig is schema on read too. So is Map Reduce.
  6. Expect students from a DB background to be comfortable here. Expect they will become uncomfortable when we get to CAP/BASE.
  7. Hierarchical and Network DBs actually predate Relational. Most of these companies were small startups, did not start with the resources necessary to buy big iron.
  8. Hierarchical and Network DBs actually predate Relational. Most of these companies were small startups, did not start with the resources necessary to buy big iron.
  9. Maybe they don't need consistent data ever for some datasets! Examples, based on previous: - The bank has decided that it is ok to allow deposits and withdrawals during partition failure. When the system comes back up we will reconcile, see how much we lost, and book it as the cost of high availability. My online banking site shows pending deposits and withdrawals (soft state), even they can’t tell me at a precise moment in time what is in my account! - An airline or hotel may decide to overbook and pass the cost of inconsistent state on to the customer (ever happen to you?) For deep background, a student suggested “Building on Quicksand”: http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf
  10. Moby is integrated with CDH 5.1.2 and 5.1.3 and Ambari 1.5.1, 1.6