3. MOTIVATION
"There were 5 hexabytes of information created between the
dawn of civilization through 2003, but that much information
is now created every 2 days, and the pace is increasing.โ
Eric Schmidt, Google, August 2010
4. WHAT PEOPLE THOUGHT
ABOUT BIG DATA
"Big Data is a shorthand label that typically means applying
the tools of artificial intelligence, like machine learning, to
vast new troves of data beyond that captured in standard
databases. The new data sources include Web-browsing data
trails, social network communications, sensor data and
surveillance data."
"A process that has the potential to transform everything"
NY Times, August 2012
5. WHAT THEY
THOUGHT LATER
"The 'Big' there is purely marketing… This is about you
buying big expensive servers and whatnot."
"Good data is better than big data"
"Big data is bullshit", 2013
Quotes from Harper Reed,
tech guru for the Obama re-election
campaign in 2012
7. HOW MUCH IT WILL
CHANGE THE WORLD?
We have a long history of successes in predictions…
"I predict the internet will […]
catastrophically collapse in 1996"
Robert Metcalfe, inventor of Ethernet, 1995
And the world will end in 2012…
8. THERE IS NO WAY TO
PREDICT THE FUTURE
"Spam will be solved in 2 years"
Bill Gates, 2004
"There is no chance the iPhone will get any
significant market share"
Steve Ballmer, 2007
10. CAN YOU DO THE SAME
IN THE "BIG WORLD"?
The communication paths in the "big world" (diagram):
"What about this technology?"
(steep learning curve)
11. A REAL WORLD
EXAMPLE
"Sarah, today I'm gonna run some wonderful
Spark applications on my new Big Data cluster
on Amazon EC2!"
"Oh, I thought that Amazon was just selling shoes!"
12. WELCOME BACK TO
1946
ENIAC: one of the first
electronic digital
computers (1946).
It occupied 180 m².
A Big Data cluster (today):
we need new languages and
abstractions to develop on top of it
(more powerful than assembly!).
15. VARIETY
Need to process:
• digital images
• video recordings
• free text
…and extract useful
information.
How?
16. VVV: THAT IS BIG
DATA
Traditional systems are not
suitable for data characterized by
high:
• Volume: TB, PB, EB, ZB …
• Velocity: GB/s, TB/s …
• Variety: unstructured or semi-
structured
Big Data Systems have been
created for these purposes.
17. WHAT'S WRONG WITH
TRADITIONAL SYSTEMS?
Oracle databases can host tables with more than 200TB of
data (some of them are in Italy).
Suppose you want to run a query like:
select type, count(*)
from events
group by type
How long will you wait?
Even if you reach 10 GB/s of read
speed from disks (with multiple
SSDs)…
you will wait more than 5 hours!
(if the instance doesn't crash before…)
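A back-of-the-envelope check (assuming the whole 200 TB table must be scanned once): 200 TB / 10 GB/s = 20,000 s ≈ 5.6 hours.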
18. BIG DATA SYSTEMS
ARE DISTRIBUTED
Big Data systems are a collection of software applications
installed on different machines.
Each application can be used as if it were installed on a
single machine.
(Clusters range from a few nodes to more than 10,000.)
19. GETTING STARTED
Do not try to install all the software by yourself! You'll
drive yourself crazy!
Get a "platform including Hadoop" as a virtual machine from
one of the Hadoop vendors.
(Many) applications in the VM run in
"pseudo-distributed mode".
21. HISTORY: GOOGLE
FILE SYSTEM
In 2003, Google published a paper about a new distributed file
system called Google File System (GFS).
http://static.googleusercontent.com/media/research.google.com/it//archive/gfs-sosp2003.pdf
Their largest cluster:
• Was composed of more than 1000 nodes
• Could store more than 100 TB (it was 2003!)
• Could be accessed by hundreds of concurrent clients
Its main purpose was serving the Google search engine.
But… how?
22. HISTORY:
MAPREDUCE
In 2004, Google published another paper about a new batch
processing framework, called MapReduce.
http://static.googleusercontent.com/media/research.google.com/it//archive/mapreduce-osdi04.pdf
MapReduce:
• Was a parallel processing framework
• Was perfectly integrated with GFS
MapReduce was used for updating indexes in the Google
search engine.
23. HISTORY: HADOOP
In 2005, Doug Cutting (Yahoo!) created the basis of the Big
Data movement: Hadoop.
Originally, Hadoop was composed of:
1. HDFS: the Hadoop Distributed File System, "inspired by"
GFS
2. MapReduce: a parallel processing framework "inspired
by" Google MapReduce
"inspired by" = "a copy of"
… but with an open source license (starting from 2009).
Hadoop was the
name of his sonโs toy
elephant.
24. HDFS: INTERNALS
A master/slave architecture:
• Master: takes care of directories and file block locations
• Slaves: store data blocks (128 MB each), with a replication factor of 3
25. HDFS: LOGICAL VIEW
An HDFS cluster appears logically as a normal POSIX file
system (though not fully POSIX compliant):
• Clients are distribution-unaware (e.g. shell, Hue)
• Allows creation of:
• Files and directories
• ACLs (users and groups)
• Read/Write/AccessChild permissions
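Because clients are distribution-unaware, code reaches HDFS through the same Hadoop FileSystem abstraction used for local files. A minimal sketch in Scala (the /user/demo directory and the file name are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()       // reads core-site.xml / hdfs-site.xml
    val fs = FileSystem.get(conf)        // HDFS if configured, local FS otherwise

    val dir = new Path("/user/demo")     // hypothetical directory
    fs.mkdirs(dir)

    val out = fs.create(new Path(dir, "hello.txt"))  // create (and overwrite) a file
    out.writeBytes("hello HDFS\n")
    out.close()

    fs.listStatus(dir).foreach(s => println(s.getPath))  // list the directory
    fs.close()
  }
}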
26. MAPREDUCE
Algorithm:
1. Map: data is taken from HDFS and transformed
2. Shuffling: data is split and reorganized among nodes
3. Reduce: data is summarized and written back to HDFS
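For example, counting words in the input "a b a": the map step emits (a,1), (b,1), (a,1); shuffling groups the pairs by key into a: [1,1] and b: [1]; the reduce step sums each group and writes (a,2), (b,1) back to HDFS.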
[Architecture diagram: the master runs the HDFS NameNode and the
MapReduce JobTracker; each slave runs an HDFS DataNode and a
MapReduce TaskTracker. The JobTracker delegates work to the
TaskTrackers and aggregates results; shuffling uses temporary
files; map tasks exploit data locality.]
28. MAPREDUCE:
SOFTWARE
The "mapper":
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Called once per input record: tokenize the line and
  // emit a (word, 1) pair for every token.
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
29. MAPREDUCE:
SOFTWARE
The "reducer":
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  // Called once per key: sums all the 1s emitted by the
  // mappers for this word and writes the total.
  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
30. MAPREDUCE:
SOFTWARE
The "main" class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Configuration is usually pre-filled
from an additional XML file
(hadoop-site.xml).
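Assuming the three classes are packaged in a jar (wordcount.jar below is a placeholder name), the job is submitted to the cluster with:
hadoop jar wordcount.jar WordCount <input dir> <output dir>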
31. MAPREDUCE
Considerations:
• Can run on more than 10 thousand machines
• Linear scalability (theoretical/commercial feature):
• 100 nodes: 1 PB in 2 hours → 200 nodes: 1 PB in 1 hour
• 100 nodes: 1 PB in 2 hours → 200 nodes: 2 PB in 2 hours
• Programming model:
• You can do more than word count (samples follow)
• Complex data pipelines require more than 1 MapReduce
step
• Difficult to write programs as MapReduce jobs (a brand
new way of writing algorithms)
• Difficult to maintain code (and to reverse engineer)
32. FIRST IMPROVEMENT:
PIG
MapReduce jobs are difficult to write.
Complex pipelines require multiple jobs.
In 2006, people at Yahoo! research started working on a new
language to simplify the creation of MapReduce jobs.
They created Pig.
Pig Latin, its language, is procedural.
It is still used by data scientists at Yahoo!,
and worldwide.
33. PIG LATIN
The word count in Pig:
lines = LOAD '/tmp/input-file' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- keep only word tokens (drops whitespace and punctuation)
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups
GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/output-file';
From sources like this,
Pig creates one or more
MapReduce jobs.
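Saved for example as wordcount.pig (a placeholder name), the script can then be submitted with the pig command line tool, which compiles it into the MapReduce jobs above: pig wordcount.pig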
34. PIG
Considerations:
• Procedural language, easier than MapReduce style
• Can build directed acyclic graphs (DAGs) of MapReduce
steps
• One step has the same scalability as its MapReduce counterpart
• Compatible with many languages for writing user
defined functions (UDFs): Java, Python, Ruby, …
• Thanks to UDFs, you can also handle unstructured data
• It's another language to learn
• Stack Overflow cannot help you in case of bugs!
35. SECOND
IMPROVEMENT: HIVE
In 2009, the Facebook Data Infrastructure Team created Hive:
"an open-source data warehousing solution on top of Hadoop".
Hive brings SQL to Hadoop (queries are translated to MapReduce):
• You can define the structure of Hadoop files (tables, with
columns and data types) and save it in the Hive Metastore
• You can query tables with HiveQL, a dialect of SQL
Limitations:
• Joins only with equality conditions
• Subqueries with limitations
• Limitations depend on the version… check the docs
36. HIVE: SAMPLE
QUERIES
An example of Hive query (you have seen it before):
select type, count(*)
from events
group by type
Another query:
select o.product, u.country
from order o
join user u
on o.user_id = u.id
Can be executed on many PB of
data (given an appropriate
number of machines).
How can you translate it into
MapReduce?
How many MR steps?
Order and User are folders with text files in
HDFS. Hive considers them as tables.
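A hedged sketch of one common translation (not necessarily Hive's exact plan): the group-by fits in a single MR job, with the map emitting (type, 1) and the reduce summing per type. The join typically becomes a reduce-side join: mappers tag each record with its source table, records are shuffled on user_id, and reducers combine matching order and user rows. So roughly one MR step each, plus more steps for any further grouping or ordering.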
38. MULTIPLE MR JOBS
Hive and Pig produce multiple MapReduce jobs and run them
in sequence.
What if we want to define a custom workflow of MR/Hive/Pig
jobs?
Oozie:
• Configure jobs
• Define workflows
• Schedule executions
39. FLUME
Files are not always uploaded to HDFS "by hand" from the
command line.
Flume can bring files to HDFS.
[Diagram: a web app (Apache front-end) feeding data through
Flume into the Big Data infrastructure.]
42. CONFIGURING FLUME
A simple agent (simple-flume.conf):
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
A netcat source listens
for incoming telnet data
43. CONFIGURING FLUME
(CONT.)
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
A logger sink just outputs
data (useful for debugging)
44. TESTING FLUME
Run the example with the command:
flume-ng agent --conf conf --conf-file
simple-flume.conf --name a1
-Dflume.root.logger=INFO,console
Then, open another terminal and send data to the listening
agent, e.g. with: telnet localhost 44444
45. INGESTING FROM
JDBC
Another tool useful to ingest data into HDFS from relational
databases is Sqoop2.
With Sqoop2, you can configure jobs made of:
• a JDBC source (table or query)
• an HDFS sink (folder, with type and compression options).
Other types of sinks are also supported, e.g. HBase.
Jobs can be configured and run with sqoop2-shell (or Hue).
Real-time
sync is not
supported.
46. MACHINE LEARNING:
MAHOUT
Mahout is about "machine learning on top of Hadoop".
Main packages:
• Collaborative filtering
• Classification
• Clustering
Many algorithms run on MapReduce.
Other algorithms run on other engines. Some of them can
run only locally (not parallelizable).
Sample usage: mahout kmeans -i … -o … -k …
49. SOME PROBLEMS
WITH MAPREDUCE
• Paradigm: MapReduce is a new "template" algorithm.
It is difficult to translate existing algorithms into MapReduce.
• Expressiveness: a single MapReduce job is often not
sufficient for enterprise data processing. You need to write
multiple jobs.
• Interactivity: MapReduce is terribly slow for small amounts of
data (time for initialization). Hive queries cannot be
interactive.
• Maintainability: writing a single MapReduce job can be
cumbersome. Writing a pipeline of MapReduce jobs produces
100% unmaintainable code.
Almost nobody writes MapReduce jobs directly anymore.
50. AN "UNEXPECTED"
ISSUE WITH MR
With "real" Big Data (Volume), another issue comes up:
Performance:
when you have multiple PB of data, scalability is not linear
anymore.
Google replaced MapReduce with "Cloud Dataflow"
years ago.
[Diagram of one MR job: HDFS read, MAP, SPILL (disk write/read,
partitioning), SHUFFLE (network transfer), REDUCE, HDFS write.
Disk and network are the slowest components: imagine a pipeline
of 20 MR jobs…]
transfer
51. MAPREDUCE 2
MapReduce v1 had too many components.
Changes:
• Resource management has been moved to YARN
• MR API rewritten (package changed from
org.apache.hadoop.mapred to
org.apache.hadoop.mapreduce)
52. YARN
Yet Another Resource Negotiator: started in 2008 at Yahoo!,
introduced in major data platforms in 2012.
Negotiates resources (allocates containers with):
• RAM
• DISK
• CPU
• NETWORK
54. HDFS LIMITATIONS
MapReduce problems have been solved with MR2.
What about HDFS?
• Can store large volumes of files
• Supports any format, from text files to custom records
• Supports "transparent" compression of data
• Parallel retrieval and storage of batches
• Does not provide:
• Fast random read/write (HDFS is append-only)
• Data updates (requires rewriting the entire 128 MB block)
55. HBASE: THE HADOOP
DATABASE
Google solved the problem starting in 2004.
In 2006 they published a paper about "Bigtable".
http://static.googleusercontent.com/media/research.google.com/it//archive/bigtable-osdi06.pdf
The Hadoop community made its own version of Bigtable
in 2010.
It has been called HBase. It provides:
• Fast read/write access to single records
• Organization of data in tables, column families and
columns
• Also: performance, replication, availability, consistency…
56. HBASE: DATA MODEL
[Diagram: a table contains column families; column families contain
columns; each row key/column intersection holds a cell.]
The basic operations (in HBase shell syntax):
put 'table1', 'row_key', 'cf1:col1', 'value'
get 'table1', 'row_key'
delete 'table1', 'row_key', 'cf1:col1'
scan 'table1'
57. HBASE:
ARCHITECTURE
We will have a whole presentation on HBase…
[Diagram: the HMaster runs next to the NameNode on the Hadoop
master node; each Hadoop slave runs a RegionServer next to its
DataNode; all data is persisted on HDFS.]
58. ACCESS HBASE
Different ways to access HBase:
• HBase driver API:
• CRUDL (create, read, update, delete, list)
• MapReduce:
• Hadoop InputFormat and OutputFormat to read/write data
in batches
• Hive/Impala:
• Do SQL on HBase, with limitations in "predicate pushdown"
• Apache Phoenix:
• A project to "translate" SQL queries into driver API calls
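A minimal sketch of the driver API route, in Scala on top of the standard HBase Java client (the table, column family and qualifier are the ones from the data-model slide; assumes an HBase 1.x or later client on the classpath):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseDemo {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()   // reads hbase-site.xml
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("table1"))

    // put 'table1', 'row_key', 'cf1:col1', 'value'
    val put = new Put(Bytes.toBytes("row_key"))
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("value"))
    table.put(put)

    // get 'table1', 'row_key'
    val result = table.get(new Get(Bytes.toBytes("row_key")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col1"))))

    table.close()
    connection.close()
  }
}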
59. NOSQL
Some data platforms include different NoSQL databases:
• Similar to HBase
• Graphs
• Documents
• Key/Value
Not Only SQL → NOw SQL
60. SOLR
NoSQL databases have some "features" in common:
• You need to model the database having the queries in
mind
• You need to add redundancy (to support different queries)
• Lack of a "good" indexing system (secondary indexes
absent or limited)
A solution:
• Solr, for full text search
61. APACHE SPARK: THE
GAME CHANGER
"Apache Spark is a fast and general engine for large-scale
data processing."
Spark vs MapReduce:
• Faster
• Clearer
• Shorter
• Easier
• More powerful
62. KEY DIFFERENCE
A complex algorithm in MapReduce:
the developer writes multiple applications, each one with 1 map
and 1 reduce step, reading from and writing to HDFS.
A scheduler (Oozie) is programmed to execute all the
applications in a configurable order.
A complex algorithm in Spark:
the developer writes 1 application using a simple API.
The Spark framework executes the application;
data is processed in memory as much as possible.
63. MIGRATION
Many applications originally developed on MapReduce are
gradually migrating to Spark (migration in progress).
Pig on Spark (Spork): just use "-x spark" in the shell
Hive on Spark: "set hive.execution.engine=spark"
Mahout: since 25/04/2014, no new MapReduce-based
algorithms are accepted.
64. USAGE
RDDs can be used like normal Scala collections; there are only
small differences in the API.
val book = sc.textFile("/books/dante/inferno.txt")  // sc: the SparkContext (provided by spark-shell)
val words = book.flatMap(f => f.split(" "))         // one element per word
val chart = words
  .map(w => (w, 1))
  .reduceByKey((n1, n2) => n1 + n2)                 // count occurrences per word
  .top(4)(Ordering.by(t => t._2))                   // the 4 most frequent words
SCALA!
65. STREAMING
Spark has a "Streaming" component
with the same Scala API.
(Storm's streaming model is different.)
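A minimal sketch of how the RDD-style API carries over to streams (sc and the text source on localhost:9999 are placeholders; this counts words over 10-second micro-batches):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))      // micro-batches of 10 seconds
val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical text source
val counts = lines
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)                                // same style as the RDD example
counts.print()                                       // print each batch's counts
ssc.start()
ssc.awaitTermination()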
66. SPARK COMPONENTS
Spark libraries are developed on top of the Spark core framework for
large scale data processing:
• Spark SQL: execute SQL queries on heterogeneous distributed
datasets
• Spark Streaming: execute micro-batches on streaming data
• Spark MLlib: ready-to-use machine learning algorithms
• Spark GraphX: algorithms and abstractions for working with graphs
[Diagram: Spark SQL, Spark Streaming, MLlib and GraphX sit on top
of Spark Core.]
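As a taste of Spark SQL, here is a hedged sketch of the group-by query from the earlier Hive slides, run on Spark (Spark 1.x style API; the JSON path and the "type" field of the events dataset are placeholders):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)                     // sc: an existing SparkContext
val events = sqlContext.read.json("/data/events.json")  // hypothetical dataset
events.registerTempTable("events")

// The same query that needed hours of sequential scanning before:
sqlContext.sql("select type, count(*) from events group by type").show()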
Speaker notes:
• People started discovering that no one had the potential to do the expected analysis (the Hype Cycle, Gartner).
• How much will you spend for a super fast SSD?
• Storage and computation can be parallelized.
• NameNode HA: only one writer at a time. Either an NFS shared partition to share updates (the secondary reads updates and refreshes its status), or Journal Nodes with the Quorum Journal Manager (QJM), at least 3 nodes.
• Also yarn-site.xml and mapred-site.xml in MR2.
• Metadata queries are for determining data types.
• With a different paradigm, some reads/writes can be avoided: re-sorting data (shuffle) at each step is not necessary. Also, for "small" data (TB), many writes can be replaced by in-memory storage.
• News in HDFS2:
- Automated failover with a hot standby and full-stack resiliency for the NameNode master service
- Enterprise-standard NFS read/write access to HDFS
- Point-in-time recovery with snapshots in HDFS
- Wire encryption for the HDFS Data Transfer Protocol
• Region servers can also work with remote DataNodes.