The Big Data and Hadoop training course is designed to provide the knowledge and skills needed to become a successful Hadoop developer. The course covers in-depth concepts such as the Hadoop Distributed File System (HDFS), setting up a Hadoop cluster, MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, etc.
2. How It Works…
LIVE On-Line classes
Class recordings in Learning Management System (LMS)
Module-wise quizzes and coding assignments
24x7 on-demand technical support
Project work on large Datasets
Online certification exam
Lifetime access to the LMS
Complimentary Java Classes
3. Course Topics
Module 1
Understanding Big Data
Hadoop Architecture
Introduction to Hadoop 2.x
Module 2
Data Loading Techniques
Hadoop Project Environment
Module 3
Hadoop MapReduce Framework
Programming in MapReduce
Module 4
Advanced MapReduce
YARN (MRv2) Architecture
Programming in YARN
Module 5
Analytics using Pig
Understanding Pig Latin
Module 6
Analytics using Hive
Understanding Hive QL
Module 7
NoSQL Databases
Understanding HBase
ZooKeeper
Module 8
Real-World Datasets and Analysis
Project Discussion
4. Topics for Today
What is Big Data?
Limitations of the existing solutions
Solving the problem with Hadoop
Introduction to Hadoop
Hadoop Eco-System
Hadoop Core Components
HDFS Architecture
MapReduce Job Execution
Anatomy of a File Write and Read
Hadoop 2.0 (YARN or MRv2) Architecture
5. What Is Big Data?
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
NYSE generates about one terabyte of new trade data per day to perform stock trading analytics and determine trends for optimal trades.
8. Annie’s Introduction
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you guys think and answer my questions.
9. Annie’s Question
Map the following to the corresponding data type:
- XML Files
- Word Docs, PDF files, Text files
- E-Mail body
- Data from Enterprise systems (ERP, CRM etc.)
10. Annie’s Answer
XML Files -> Semi-structured data
Word Docs, PDF files, Text files -> Unstructured Data
E-Mail body -> Unstructured Data
Data from Enterprise systems (ERP, CRM etc.) -> Structured Data
11. Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM's definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
12. Common Big Data Customer Scenarios
Web and e-tailing
Recommendation Engines
Ad Targeting
Search Quality
Abuse and Click Fraud Detection
Telecommunications
Customer Churn Prevention
Network Performance Optimization
Call Detail Record (CDR) Analysis
Analyzing Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
13. Common Big Data Customer Scenarios (Contd.)
Government
Fraud Detection And Cyber Security
Welfare schemes
Justice
Healthcare & Life Sciences
Health information exchange
Gene sequencing
Serialization
Healthcare service quality improvements
Drug Safety
http://wiki.apache.org/hadoop/PoweredBy
14. Common Big Data Customer Scenarios (Contd.)
Banks and Financial services
Modeling True Risk
Threat Analysis
Fraud Detection
Trade Surveillance
Credit Scoring And Analysis
Retail
Point of Sale Transaction Analysis
Customer Churn Analysis
Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
15. Hidden Treasure
Case Study: Sears Holding Corporation
Insight into data can provide business advantage.
Some key early indicators can mean fortunes to business.
More precise analysis with more data.
*Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.
16. Limitations of Existing Data Analytics Architecture
[Diagram: the existing analytics stack, bottom to top: Instrumentation → Collection (mostly append) → Storage-only Grid (original raw data) → ETL Compute Grid → RDBMS (Aggregated Data) → BI Reports + Interactive Apps]
A meagre 10% of the ~2PB of data is available for BI; 90% of the ~2PB is archived.
1. Can't explore the original high-fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
17. Solution: A Combined Storage and Compute Layer
[Diagram: the combined stack, bottom to top: Instrumentation → Collection (mostly append) → Hadoop: Storage + Compute Grid (both storage and processing) → RDBMS (Aggregated Data) → BI Reports + Interactive Apps]
The entire ~2PB of data is available for processing; no data archiving.
1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% that was the case with its existing non-Hadoop solutions.
19. Hadoop – It’s about Scale And Structure
Attribute       RDBMS / EDW / MPP                  HADOOP / NoSQL
Data Types      Structured                         Multi and unstructured
Processing      Limited, no data processing        Processing coupled with data
Governance      Standards & structured             Loosely structured
Schema          Required on write                  Required on read
Speed           Reads are fast                     Writes are fast
Cost            Software license                   Support only
Resources       Known entity                       Growing, complexities, wide
Best Fit Use    Interactive OLAP analytics,        Data discovery, processing
                complex ACID transactions,         unstructured data, massive
                operational data store             storage/processing
20. Why DFS?
Read 1 TB of data
1 Machine: 4 I/O channels, each channel 100 MB/s
10 Machines: 4 I/O channels each, each channel 100 MB/s
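Working the numbers shows why distributing the read wins: one machine sustains 4 × 100 MB/s = 400 MB/s, so scanning 1 TB (~1,000,000 MB) takes about 1,000,000 / 400 = 2,500 seconds, roughly 42 minutes. Ten machines each scan their own ~100 GB share in parallel, an aggregate 4,000 MB/s, finishing in about 250 seconds, roughly 4 minutes. This is the core argument for a distributed file system: aggregate I/O bandwidth grows with the number of nodes.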
23. What Is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
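To make that "simple programming model" concrete, here is a minimal sketch of the classic WordCount job written against the standard org.apache.hadoop.mapreduce API. It is an illustration, not part of the original deck; the input and output HDFS paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in this task's input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles splitting the input, scheduling map tasks near the data, shuffling intermediate (word, count) pairs, and re-running failed tasks; the programmer writes only the two functions above.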
25. Annie’s Question
Hadoop is a framework that allows for the distributed processing of:
- Small Data Sets
- Large Data Sets
26. Annie’s Answer
Large Data Sets. Hadoop can also process small data sets, but to experience its true power you need data at terabyte scale, where an RDBMS takes hours or fails outright while Hadoop does the same job in a couple of minutes.
27. Hadoop Eco-System
[Diagram: the Hadoop ecosystem stack]
Apache Oozie (Workflow)
Hive (DW System) and Pig Latin (Data Analysis)
Mahout (Machine Learning)
MapReduce Framework
HBase
HDFS (Hadoop Distributed File System)
Flume (import of unstructured or semi-structured data) and Sqoop (import or export of structured data)
28. Machine Learning with Mahout
Write intelligent applications using Apache Mahout.
Example: LinkedIn Recommendations, Hadoop and MapReduce magic in action.
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
29. Hadoop Core Components
Hadoop is a system for large-scale data processing. It has two main components:
HDFS – Hadoop Distributed File System (Storage)
- Distributed across "nodes"
- Natively redundant
- Self-healing, high-bandwidth clustered storage
- NameNode tracks block locations
MapReduce (Processing)
- Splits a task across processors, "near" the data, and assembles the results
- JobTracker manages the TaskTrackers
30. Hadoop Core Components (Contd.)
[Diagram: a Hadoop cluster. The MapReduce engine consists of a JobTracker on the admin node and a TaskTracker on each slave node; the HDFS cluster consists of a NameNode on the admin node and a DataNode on each slave node.]
32. Main Components Of HDFS
NameNode:
- master of the system
- maintains and manages the blocks which are present on the DataNodes
DataNodes:
- slaves which are deployed on each machine and provide the actual storage
- responsible for serving read and write requests for the clients
33. NameNode Metadata
Meta-data in Memory
- The entire metadata is in main memory
- No demand paging of FS metadata
Types of Metadata
- List of files
- List of blocks for each file
- List of DataNodes for each block
- File attributes, e.g. access time, replication factor
A Transaction Log
- Records file creations, file deletions, etc.
[Diagram: the NameNode stores metadata only, e.g. /user/doug/hinfo -> 1 3 5 and /user/doug/pdetail -> 4 2. It keeps track of the overall file directory structure and the placement of each data block.]
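As a small illustration of that metadata in action, a client can ask the NameNode for a file's block placement through the standard FileSystem#getFileBlockLocations call. A minimal sketch, assuming a hypothetical NameNode URI and reusing the /user/doug/hinfo path from the slide:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; fs.defaultFS normally comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/doug/hinfo"); // illustrative path from the slide

    // The NameNode answers from its in-memory metadata: which blocks make up
    // the file, and which DataNodes hold a replica of each block.
    FileStatus status = fs.getFileStatus(file);
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}
```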
34. Secondary Name Node
Secondary NameNode:
- Not a hot standby for the NameNode
- Connects to the NameNode every hour*
- Housekeeping, backup of NameNode metadata
- Saved metadata can be used to rebuild a failed NameNode
The NameNode remains a single point of failure.
[Diagram: the NameNode ships its metadata to the Secondary NameNode every hour: "You give me metadata every hour, I will make it secure."]
35. Annie’s Question
NameNode:
a) is the "Single Point of Failure" in a cluster
b) runs on 'Enterprise-class' hardware
c) stores meta-data
d) All of the above
36. Annie’s Answer
All of the above. The NameNode stores metadata and runs on reliable, enterprise-class hardware because it is a single point of failure in a Hadoop cluster.
37. Annie’s Question
When the NameNode fails, the Secondary NameNode takes over instantly and prevents cluster failure:
a) TRUE
b) FALSE
38. Annie’s Answer
False. The Secondary NameNode is used for creating NameNode checkpoints. A failed NameNode can be manually recovered using the 'edits' log and 'fsimage' stored on the Secondary NameNode.
39. JobTracker
[Diagram: job submission flow]
1. The user copies the input files into DFS.
2. The user submits the job to the client.
3. The client gets the input files' info from DFS.
4. The client creates the input splits.
5. The client uploads the job information (Job.xml, Job.jar) to DFS.
6. The client submits the job to the JobTracker.
40. JobTracker (Contd.)
[Diagram: job initialization]
6. The client submits the job to the JobTracker.
7. The JobTracker initializes the job in the job queue.
8. The JobTracker reads the job files (Job.xml, Job.jar) from DFS.
9. The JobTracker creates the maps and reduces: as many maps as there are input splits.
42. Annie’s Question
Which of the following daemons does the Hadoop framework pick for scheduling a task?
a) namenode
b) datanode
c) task tracker
d) job tracker
43. Annie’s Answer
(d) Job tracker. The JobTracker takes care of all the job scheduling and assigns tasks to the TaskTrackers.
44. Anatomy of A File Write
[Diagram: an HDFS file write through a pipeline of DataNodes]
1. The HDFS client calls create() on the DistributedFileSystem.
2. The DistributedFileSystem asks the NameNode to create the file in the namespace.
3. The client writes data to the stream.
4. Write packets are streamed to the first DataNode, which forwards them along a pipeline of DataNodes (one per replica).
5. Ack packets flow back up the pipeline to the client.
6. The client closes the stream.
7. The DistributedFileSystem signals "complete" to the NameNode.
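From the client's side, the whole write pipeline hides behind a single create() call on the FileSystem API. A minimal sketch, assuming a hypothetical NameNode URI and an illustrative path:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; fs.defaultFS normally comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");

    FileSystem fs = FileSystem.get(conf);           // talks to the NameNode
    Path file = new Path("/user/demo/hello.txt");   // illustrative path

    // create() covers steps 1-2: the NameNode records the new file, then the
    // returned stream pushes packets down the DataNode pipeline (steps 3-5).
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }                                               // close() covers steps 6-7
  }
}
```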
45. Anatomy of A File Read
[Diagram: an HDFS file read]
1. The HDFS client calls open() on the DistributedFileSystem.
2. The DistributedFileSystem gets the block locations from the NameNode.
3. The client calls read() on the stream.
4. The client reads each block directly from the closest DataNode holding a replica.
5. When a block ends, the stream moves on to the DataNode for the next block.
6. The client closes the stream.
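The read path looks symmetric from the client: open() fetches the block locations, and read() streams bytes directly from the DataNodes. A minimal sketch under the same assumptions as the write example:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode URI

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt");     // illustrative path

    // open() covers steps 1-2: the NameNode returns block locations, then
    // read() streams bytes straight from the DataNodes (steps 3-5).
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }                                                 // close() is step 6
  }
}
```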
47. Annie’s Question
In HDFS, blocks of a file are written in parallel; however, the replication of the blocks is done sequentially:
a) TRUE
b) FALSE
48. Annie’s Answer
True. A file is divided into blocks; these blocks are written in parallel, but each block is replicated in sequence along the DataNode pipeline.
49. Annie’s Question
A file of 400MB is being copied to HDFS. The system has finished copying 250MB. What happens if a client tries to access that file?
a) Can read up to the block that has been successfully written.
b) Can read up to the last bit successfully written.
c) Will throw an exception.
d) Cannot see that file until it has finished copying.
50. Annie’s Answer
The answer is (a): the client can read up to the last successfully written data block.
51. Hadoop 2.x (YARN or MRv2)
[Diagram: Hadoop 2.x architecture]
HDFS high availability:
- All namespace edits are logged to shared NFS storage, with a single writer (fencing).
- The Active NameNode serves clients and writes the shared edit logs.
- The Standby NameNode reads the shared edit logs and applies them to its own namespace.
- DataNodes report to both NameNodes.
YARN:
- A ResourceManager arbitrates resources across the cluster.
- A NodeManager runs on each DataNode machine and hosts Containers.
- Each application runs its own App Master inside a container on one of the nodes.
52. Further Reading
Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Hadoop 2.0 and YARN
http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/
53. Module-2 Pre-work
Set up the Hadoop development environment using the documents present in the LMS.
Hadoop Installation – Setup Cloudera CDH3 Demo VM
Hadoop Installation – Setup Cloudera CDH4 QuickStart VM
Execute Linux Basic Commands
Execute HDFS Hands On commands
Attempt the Module-1 Assignments present in the LMS.
Web and e-tailing: eBay is using Hadoop technology and the HBase database, which supports real-time analysis of Hadoop data, to build a new search engine for its auction site. eBay has more than 97 million active buyers and sellers and over 200 million items for sale in 50,000 categories. The site handles close to 2 billion page views, 250 million search queries, and tens of billions of database calls daily. The company has 9 petabytes of data stored on Hadoop and Teradata clusters, and the amount is growing quickly.
Telecommunications: China Mobile runs a data-mining platform for the telecom industry, handling 5-8 TB/day of CDR and network signaling data. Current solutions such as Oracle DB, SAS (data mining), Unix servers, and SAN aren't sufficient to store and process such a vast amount of data. Faster data processing is needed for precision marketing, network optimization, service optimization, and log processing.
Government: Aadhaar, by the Govt. of India: 5 MB of data per resident maps to about 10-15 PB of raw data. The Hadoop stack: HDFS (Hadoop distributed file system) is used to provide high data read/write throughput on the order of many terabytes per day, and its distributed architecture enables scale-out as needed. Hive is used for building the UIDAI data warehouse, HBase for indexed lookup of records across millions of rows, ZooKeeper as a distributed coordination service for server instances, and Pig as an ETL (extract, transform, and load) solution for loading data into Hive.
Healthcare and Life Sciences: Life-sciences research firm NextBio uses Hadoop and HBase to help big pharmaceutical companies conduct genomic research. The company embraced Hadoop in 2009 to make the sheer scale of genomic data analysis more affordable. Its core 100-node Hadoop cluster, which has processed as much as 100 terabytes of data, is used to compare data from drug studies to publicly available genomics data. Given that there are tens of thousands of such studies and 3.2 billion base pairs behind each of the hundreds of genomes that NextBio studies, it's clearly a big-data challenge.
Banks and Financial Services: 3 of the top 5 banks run Cloudera Hadoop. JPMorgan Chase uses Hadoop technology for a growing number of purposes, including fraud detection, IT risk management, and self-service. With over 150 petabytes of data stored online, 30,000 databases, and 3.5 billion log-ins to user accounts, data is the lifeblood of JPMorgan Chase.
Retail: Sears is an American multinational mid-range department store chain headquartered in Hoffman Estates, Illinois. It moved from Teradata and SAS to Hadoop to avoid archiving and deleting its meaningful sales and other customer-activity data; a 300-node Hadoop cluster lets Sears keep 100% of its data (~2PB) available to BI rather than the meager 10% that was the case with non-Hadoop solutions. Walmart migrated data from its existing Oracle, Netezza, and Greenplum gear to its 250-node Hadoop cluster.
This is why Oracle, HP, IBM, and other enterprise technology giants are in the red on growth. Sears: Sears wanted to personalize marketing campaigns, coupons, and offers down to the individual customer, but the existing legacy systems were incapable of supporting that. Sears' process for analyzing marketing campaigns for loyalty-club members used to take six weeks on mainframe, Teradata, and SAS servers; the new process running on Hadoop can be completed weekly. For certain online and mobile commerce scenarios, Sears can now perform daily analyses, and targeting is more granular, in some cases down to the individual customer. Moving up the stack, Sears is consolidating its databases to MySQL, InfoBright, and Teradata; EMC Greenplum, Microsoft SQL Server, and Oracle (including four Exadata boxes) are on their way out. Sears routinely replaces legacy Unix systems with Linux rather than upgrading them, and it has retired most of its Sun and HP-UX servers. Microsoft server and development technologies are also on the way out.
- ~2 PB of data, mostly structured and unstructured data such as customer transactions, point of sale, and supply chain.
- Because of the need to archive, 90% of the ~2PB of data is not available for BI.
The 300-node Hadoop cluster lets Sears keep 100% of its data (~2PB) available to BI rather than the meager 10% that was the case with non-Hadoop solutions. Sears now keeps all of its data, down to individual transactions (rather than aggregates) and years of history (rather than imposing quarterly windows on certain data, as it did previously). That is raw data, which Sears can refactor and combine as needed, quickly and efficiently, within Hadoop. To give a sense of how early Sears was to Hadoop development, Wal-Mart divulged early this year that it was scaling out an experimental 10-node Hadoop cluster for e-commerce analysis; Sears passed that size in 2010. Sears also has its own Hadoop solutions subsidiary, MetaScale, to provide Hadoop services to other retail companies along the lines of AWS.
Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon EC2.
Robust: Since Hadoop runs on commodity clusters, it is designed with the assumption of frequent hardware failure; it handles such failures gracefully, and computation doesn't stop because of a few failed devices or systems.
Scalable: Hadoop scales linearly to handle larger data by adding more slave nodes to the cluster.
Simple: It is easy to write efficient parallel programs with Hadoop.
We will cover other Hadoop Components in detail in future sessions of this course.
Data is transferred from the DataNode to the MapTask process. DBlk is the file data block; CBlk is the file checksum block.
1. File data is transferred to the client through Java NIO transferTo (aka the UNIX sendfile syscall). Checksum data is first fetched into the DataNode JVM buffer and then pushed to the client (details not shown). Both file data and checksum data are bundled into an HDFS packet (typically 64KB) in the format {packet header | checksum bytes | data bytes}.
2. Data received from the socket is buffered in a BufferedInputStream, presumably to reduce the number of syscalls to the kernel. This actually involves two buffer copies: first, data is copied from kernel buffers into a temporary direct buffer in JDK code; second, data is copied from the temporary direct buffer to the byte[] buffer owned by the BufferedInputStream. The size of the byte[] in BufferedInputStream is controlled by the configuration property "io.file.buffer.size" and defaults to 4K; in our production environment this parameter is customized to 128K.
3. Through the BufferedInputStream, the checksum bytes are saved into an internal ByteBuffer (whose size is roughly (PacketSize / 512 * 4), or 512B), and the file bytes (compressed data) are deposited into the byte[] buffer supplied by the decompression layer. Since the checksum calculation requires a full 512-byte chunk while a user's request may not be aligned with a chunk boundary, a 512B byte[] buffer is used to align the input before copying partial chunks into the user-supplied byte[] buffer. Also note that data is copied to the buffer in 512-byte pieces (as required by the FSInputChecker API). Finally, all checksum bytes are copied to a 4-byte array for FSInputChecker to perform checksum verification. Overall, this step involves an extra buffer copy.
4. The decompression layer uses a byte[] buffer to receive data from the DFSClient layer. The DecompressorStream copies the data from the byte[] buffer to a 64K direct buffer, calls the native library code to decompress the data, and stores the uncompressed bytes in another 64K direct buffer. This step involves two buffer copies.
5. LineReader maintains an internal buffer to absorb data from downstream. From the buffer, line separators are discovered and line bytes are copied to form Text objects. This step requires two buffer copies.
On the write side: the client creates the file by calling create() on DistributedFileSystem (step 1). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file.
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks of the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block.