Open for Business…
WHO AM I
• Big Data / Analytics / BI & Cloud Solutions Specialist
• http://www.linkedin.com/in/JulioPhilippe
• Skills
- Big Data
- Analytics
- Architecture
- Business Intelligence
- IT Transformation
- Management
- Cloud Computing
- Datacenter
- Data Warehousing
- IT Solutions
- Optimization
- Business Development
- Hadoop
- Mentoring
BIG DATA MANAGEMENT INSIGHT
« Data is not born relevant,
it becomes relevant! »
DATA-DRIVEN ON-LINE WEBSITES
• To run the apps: messages, posts, blog entries, video clips, maps, web graph...
• To give the data context: friends networks, social networks, collaborative filtering...
• To keep the applications running: web logs, system logs, system metrics, database query logs...
BIG DATA – NOT ONLY DATA VOLUME
• Improve analytics and statistical models
• Extract business value by analyzing large volumes of multi-structured data from various sources such as databases, websites, blogs, social media, smart sensors...
• Build efficient architectures that are massively parallel, highly scalable and highly available, to handle very large data volumes up to several petabytes
Thematics
• Web Technologies
• Database Scale-out
• Relational Data Analytics
• Distributed Data Analytics
• Distributed File Systems
• Real Time Analytics
BIG DATA APPLICATIONS DOMAINS 
• Digital marketing optimization (e.g., web analytics, attribution, golden 
path analysis) 
• Data exploration and discovery (e.g., identifying new data-driven 
products, new markets) 
• Fraud detection and prevention (e.g., revenue protection, site 
integrity & uptime) 
• Social network and relationship analysis (e.g., influencer marketing, 
outsourcing, attrition prediction) 
• Machine-generated data analytics (e.g., remote device insight, remote 
sensing, location-based intelligence) 
• Data retention (e.g. long term retention of data, data archiving) 
SOME BIG DATA USE CASES BY INDUSTRY 
Energy 
 Smart meter analytics 
 Distribution load forecasting & scheduling 
 Condition-based maintenance 
Telecommunications 
 Network performance 
 New products & services creation 
 Call Detail Records (CDRs) analysis 
 Customer relationship management 
Retail 
 Dynamic price optimization 
 Localized assortment 
 Supply-chain management 
 Customer relationship management 
Manufacturing 
 Supply chain management 
 Customer Care Call Centers 
 Preventive Maintenance and Repairs 
 Customer relationship management 
Banking 
 Fraud detection 
 Trade surveillance 
 Compliance and regulatory 
 Customer relationship management 
Insurance 
 Catastrophe modeling 
 Claims fraud 
 Reputation management 
 Customer relationship management 
Public 
 Fraud detection 
 Fighting criminality 
 Threats detection 
 Cyber security 
Media 
 Large-scale clickstream analytics 
 Abuse and click-fraud prevention 
 Social graph analysis and profile segmentation 
 Campaign management and loyalty programs 
Healthcare 
 Clinical trials data analysis 
 Patient care quality and program analysis 
 Supply chain management 
 Drug discovery and development analysis
TOP 10 BIG DATA SOURCES 
1. Social network profiles 
2. Social influencers 
3. Activity-generated data 
4. SaaS & Cloud Apps 
5. Public web information 
6. MapReduce results 
7. Data Warehouse appliances 
8. NoSQL databases 
9. Network and in-stream monitoring technologies 
10. Legacy documents 
NEW DATA AND MANAGEMENT ECONOMICS
• Storage Trends - New Data Structures (Distributed File Systems, NoSQL Databases, NewSQL...)
- Distribution models evolving from Master/Slave and Master/Master to Federated/Sharded, Distributed File Systems and object storage
• Compute Trends - New Analytics (Massively Parallel Processing, Algorithms...)
- Data warehouses evolving from "OLTP is the data warehouse", through general purpose, proprietary and dedicated, and enterprise data warehouses, toward the logical data warehouse over multi-structured data
• Underpinned by Master Data Management, Data Quality, Data Integration
DISTRIBUTED FILE SYSTEMS
• Systems that permanently store data
• Data is divided into logical units (files, shards, chunks, blocks...)
• A file path joins file and directory names into a relative or absolute address to identify a file
• Support access to files on remote servers
• Support concurrency
• Support distribution
• Support replication
• NFS, GPFS, Hadoop DFS, GlusterFS, MogileFS, MooseFS...
(Typical topology: an application talks to one master node coordinating several slave nodes.)
UNSTRUCTURED DATA AND OBJECT STORAGE
• Metadata values are specific to each individual content type (video, image, audio, text...)
• Enables automated management of content
• Ensures integrity, retention and authenticity
• Each object pairs a unique ID (UID) and metadata with the raw bit stream
Unstructured Data + Metadata = Object Storage
NOSQL DATABASES CATEGORIES
• NoSQL = Not only SQL
• Popular name for a subset of structured storage software that is designed to deliver increased optimization for high-performance operations on large datasets
• Basically available, scalable, eventually consistent
• Easy to use
• Tolerant of scale by way of horizontal distribution
Column: BigTable (Google), HBase, Cassandra (DataStax), Hypertable...
Key-Value: Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB...
Document: MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)...
Graph: Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)...
WHAT IS HADOOP?
"Flexible and available architecture for large scale computation and data processing on a network of commodity hardware"
Open Source Software + Commodity Hardware = IT Cost Reduction
WHAT IS HADOOP USED FOR?
• Searching
• Log processing
• Recommendation systems
• Analytics
• Video and image analysis
• Data retention
WHO USES HADOOP?
• Top level Apache Foundation project
• Large, active user base, mailing lists, user groups
• Very active development, strong development team
• http://wiki.apache.org/hadoop/PoweredBy#L
MOVING COMPUTATION TO STORAGE
General Purpose Storage Servers
• Combine servers with disks & networking to reduce latency
• Specialized software enables general purpose system designs to provide high performance data services
Moving Data Processing to Storage
• Legacy applications: data processing on servers, data on a separate storage array (SAN, NAS) across the network
• Emerging applications: data processing and metadata management on servers, storage still separate
• Next generation applications: data processing, metadata management and storage co-located on the same servers
BIG DATA ARCHITECTURE
BI & DWH Architecture - Conventional
• SQL based
• High availability
• Enterprise database
• Right design for structured data
• Current storage hardware (SAN, NAS, DAS)
• Topology: app servers and database servers attached to a storage array through SAN switches
Analytics Architecture - Next Generation
• Not only SQL based
• High scalability, availability and flexibility
• Compute and storage in the same box to reduce network latency
• Right design for semi-structured and unstructured data
• Topology: edge nodes and data nodes connected through network switches
SHARED-NOTHING ARCHITECTURE
• Share Everything: all database instances access the same filesystem and disks (e.g., Unix FS)
• Share Disks: database instances on separate servers share SAN disks over FC and an IP network (e.g., Oracle RAC)
• Share Nothing: each database instance owns its local storage, and nodes coordinate only over the IP network (e.g., HDFS)
APACHE HADOOP 2.0 ECOSYSTEM
• Storage: HDFS
• Cluster Resource Management: YARN
• Batch Processing: MapReduce
• Analytics: Impala
• Stream Processing: Storm
• Search: Solr
• Machine Learning: Mahout, Spark
• NoSQL: HBase
• Real-Time Processing: Spark, Tez
• Engines: Hive, Pig, Stinger, Spark, Tez, other ISV engines
• Serialization: Avro, Thrift
• Security: Knox, Sentry
• Management: Oozie, Zookeeper, Chukwa, Kafka
• Management Ops: Ambari, Bigtop, Whirr
• Development: Crunch, MRUnit, HDT
http://incubator.apache.org/projects/
HADOOP COMMON
• Hadoop Common is a set of utilities that support the Hadoop subprojects.
• Hadoop Common includes Filesystem, RPC, and Serialization libraries.
HDFS & MAPREDUCE
• Hadoop Distributed File System
- A scalable, fault-tolerant, high-performance distributed file system
- Asynchronous replication
- Write-once, read-many (WORM)
- Hadoop cluster with 3 DataNodes minimum
- Data divided into 64MB (default) or 128MB blocks, each block replicated 3 times (default)
- No RAID required for DataNodes
- Interfaces: Java, Thrift, C Library, FUSE, WebDAV, HTTP, FTP
- NameNode holds filesystem metadata
- Files are broken up and spread over the DataNodes
• Hadoop MapReduce
- Software framework for distributed computation
- Input | Map() | Copy/Sort | Reduce() | Output
- JobTracker (on the master node) schedules and manages jobs
- TaskTracker executes individual map() and reduce() tasks on each cluster (worker) node
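To make the Input | Map() | Copy/Sort | Reduce() | Output pipeline concrete, here is a minimal word-count sketch against the Hadoop 2.x Java MapReduce API; it is illustrative only and not taken from the slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map phase: emit (word, 1) for every token of the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework has already copied and sorted by key,
  // so each call receives one word together with all of its counts
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
}
```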
HDFS - READ FILE
1. The client API calculates the block index based on the offset of the file pointer and makes a request to the NameNode
2. The NameNode replies with the DataNodes that hold a copy of each block
3. The client contacts the DataNodes directly, without going through the NameNode
4. The DataNodes read the blocks
5. The DataNodes respond to the client with the result
(The client runs on an EdgeNode; only the metadata lookup touches the NameNode, while the data itself streams straight from the DataNodes.)
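The same sequence is what hides behind a few lines of client code. A minimal sketch with the standard org.apache.hadoop.fs.FileSystem API follows; the NameNode host and file path are made-up examples.

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    // Hypothetical cluster address and path
    String uri = "hdfs://namenode:8020/user/demo/sample.txt";
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    InputStream in = null;
    try {
      // open() asks the NameNode for block locations;
      // the returned stream then reads from the DataNodes directly
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```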
HDFS - WRITE FILE
1. The client contacts the NameNode, which designates one of the replicas as the primary
2. The NameNode response identifies the primary and the secondary replicas
3. The client pushes its changes to all DataNodes in any order; each change is stored in a buffer on each DataNode
4. The client sends a "commit" request to the primary, which determines an update order and pushes that order to all the secondaries
5. After all secondaries complete the commit, the primary responds to the client with the result
6. All block-distribution and metadata changes are written to an operation log file on the NameNode
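From the client's point of view the replication pipeline is transparent: create() returns a stream and the framework handles the primary/secondary commits. A minimal sketch (hypothetical host and path again):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode:8020/user/demo/out.txt"; // made-up path
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    // create() obtains the replica pipeline from the NameNode;
    // writes are buffered and committed across the DataNodes
    FSDataOutputStream out = fs.create(new Path(uri));
    out.write("hello hdfs\n".getBytes("UTF-8"));
    out.close(); // flushes the last block and finalizes the metadata
    fs.close();
  }
}
```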
MAPREDUCE - EXEC FILE
The client program is copied to each node
1. The JobTracker determines the number of splits from the input path, and selects TaskTrackers based on their network proximity to the data sources
2. The JobTracker sends the task requests to those selected TaskTrackers
3. Each TaskTracker starts the map phase processing by extracting the input data from its splits
4. When a map task completes, the TaskTracker notifies the JobTracker. When all the TaskTrackers are done, the JobTracker notifies the selected TaskTrackers for the reduce phase
5. Each TaskTracker reads the region files remotely and invokes the reduce function, which collects the key/aggregated value pairs into the output file (one per reducer node)
6. After both phases complete, the JobTracker unblocks the client program
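Submitting such a job from the client (the EdgeNode side of the diagram) is a short driver. This sketch wires the WordCount classes from the earlier slide into a job and blocks until the JobTracker (or, under YARN, the ResourceManager) reports completion; input/output paths come from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // one file per reducer
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // block until done
  }
}
```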
MR VS. YARN ARCHITECTURE
• MR v1: clients submit jobs to a single JobTracker (co-located with the NameNode); each worker runs a TaskTracker next to its DataNode
• YARN / MR v2: clients submit applications to the ResourceManager (co-located with the NameNode); each worker runs a NodeManager next to its DataNode, hosting per-application ApplicationMasters and task Containers
• YARN: Yet Another Resource Negotiator
• MR: MapReduce
HBASE
• It's not a relational database (no joins)
• Sparse data - nulls are stored for free
• Semi-structured or unstructured data
• Data changes through time
• Versioned data
• Scalable - goal of billions of rows x millions of columns
• Clone of BigTable (Google)
• Implemented in Java (clients: Java, C++, Ruby...)
• Data is stored column-oriented
• Distributed over many servers
• Tolerant of machine failure
• Layered over HDFS
• Strong consistency
Table - Example
(Table, Row_Key, Family, Column, Timestamp) = Cell (Value)
Row (grouped into Regions) | Timestamp | Animal:Type | Animal:Size | Repair:Cost
Enclosure1 | 12 | Zebra | Medium | 1000€
Enclosure1 | 11 | Lion | Big |
Enclosure2 | 13 | Monkey | Small | 1500€
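As a sketch of how that cell addressing looks from client code, the (pre-1.0) HBase Java API below writes and reads back part of the zoo example; the table name and cluster configuration are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ZooExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    HTable table = new HTable(conf, "Zoo");           // hypothetical table name

    // Write one cell per (family, column): Animal:Type and Animal:Size
    Put put = new Put(Bytes.toBytes("Enclosure1"));   // row key
    put.add(Bytes.toBytes("Animal"), Bytes.toBytes("Type"), Bytes.toBytes("Zebra"));
    put.add(Bytes.toBytes("Animal"), Bytes.toBytes("Size"), Bytes.toBytes("Medium"));
    table.put(put);

    // Read back the latest version of Animal:Type for that row
    Result result = table.get(new Get(Bytes.toBytes("Enclosure1")));
    byte[] type = result.getValue(Bytes.toBytes("Animal"), Bytes.toBytes("Type"));
    System.out.println(Bytes.toString(type)); // "Zebra"
    table.close();
  }
}
```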
HBASE
• Table
- Regions for scalability, defined by row [start_key, end_key)
- Stores for efficiency, 1 per Family
- 1..n StoreFiles (HFile format on HDFS)
• Everything is stored as bytes
• Rows are ordered sequentially by key
• Special tables -ROOT-, .META.
- Tell clients where to find user data
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
DATA ACCESS
HIVE
• Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
- MapReduce for execution
- HDFS for storage
• MetaStore
- Table/partition properties
- Thrift API: current clients in PHP (web interface), Python interface to Hive, Java (query engine and CLI)
- Metadata stored in any SQL backend
• Hive Query Language
- Basic SQL: Select, From, Join, Group By
- Equi-Join, Multi-Table Insert, Multi-Group-By
- Batch query
PIG
• A high-level data-flow language and execution framework for parallel computation
• Pig Latin
- Data processing language
- Compiler to translate to MapReduce
• Simple to write MapReduce programs
• Abstracts you from specific details
• Focus on data processing
• Data flow
• Data manipulation
HCATALOG
• Table and storage management service for data created using Apache Hadoop
• Provides a shared schema and data type mechanism
• Provides a table abstraction so that users need not be concerned with where or how their data is stored
• Provides interoperability across data processing tools such as Pig, MapReduce and Hive
• HCatalog DDL (Data Definition Language)
• HCatalog CLI (Command Line Interface)
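A hedged sketch of querying Hive from Java over the HiveServer2 JDBC driver; the host, credentials, table and query are invented for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 driver
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://hiveserver:10000/default", "hdfs", ""); // hypothetical host
    Statement stmt = con.createStatement();
    // HiveQL compiles down to MapReduce jobs over HDFS files
    ResultSet rs = stmt.executeQuery(
        "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page");
    while (rs.next()) {
      System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}
```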
DATA TRANSFER
SQOOP
• Data import/export
• Sqoop is a tool designed to help users import data from existing relational databases into their Hadoop clusters
• Automatic data import
• Easy import of data from many databases to Hadoop
• Generates code for use in MapReduce applications
• Flow: RDBMS → Hadoop cluster
FLUME
• Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
• Simple and flexible architecture based on streaming data flows
• Robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms
• The system is centrally managed and allows for intelligent dynamic management
• Flow: log files → streaming data flows (batching, compression, filtering, transformation) → Hadoop cluster
MANAGEMENT
OOZIE
• Oozie is a server-based Workflow Engine specialized in running workflow jobs with actions that execute Hadoop Map/Reduce and Pig jobs
• Oozie is a server-based Coordinator Engine specialized in running workflows based on time and data triggers
• Oozie is a server-based Bundle Engine that provides a higher-level abstraction to batch a set of coordinator applications. The user can start/stop/suspend/resume/rerun a set of coordinator jobs at the bundle level, resulting in better and easier operational control
CHUKWA
• A data collection system for managing large distributed systems
• Built on HDFS and MapReduce
• Toolkit for displaying, monitoring and analyzing the log files
ZOOKEEPER
• A high-performance coordination service for distributed applications
• Zookeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
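As a sketch of what "maintaining configuration information" means in practice, the ZooKeeper Java client below stores and reads back a small znode; the ensemble address and path are invented examples.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical 3-node ZooKeeper ensemble, 3s session timeout
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000,
        new Watcher() {
          public void process(WatchedEvent event) {
            System.out.println("event: " + event.getState());
          }
        });
    // Store a small piece of configuration as a persistent znode
    zk.create("/app/config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // Any client in the cluster can now read (and watch) it
    byte[] data = zk.getData("/app/config", false, null);
    System.out.println(new String(data)); // "v1"
    zk.close();
  }
}
```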
MACHINE LEARNING
• Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform
• Mahout machine learning algorithms:
- Recommendation mining: takes users' behavior and finds items a specified user might like
- Clustering: takes e.g. text documents and groups them based on related document topics
- Classification: learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the appropriate category
- Frequent itemset mining: takes a set of item groups (e.g. terms in a query session, shopping cart content) and identifies which individual items typically appear together
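For a taste of the recommendation-mining case, the sketch below uses Mahout's (non-distributed) Taste API; the ratings file name, neighborhood size and user ID are all made-up example values.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommendExample {
  public static void main(String[] args) throws Exception {
    // CSV of userID,itemID,rating lines (hypothetical file)
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Consider the 10 most similar users as the neighborhood
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 items user 42 might like, based on similar users' behavior
    List<RecommendedItem> items = recommender.recommend(42L, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}
```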
SERIALIZATION
• A data serialization system that provides dynamic integration with scripting languages
• Avro Data
- Expressive
- Smaller and faster
- Dynamic
- Schema stored with the data
- APIs permit reading and creating
- Includes a file format and a textual encoding
• Avro RPC
- Leverages versioning support
- Provides cross-language access to Hadoop services
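A minimal sketch of "schema stored with the data": the Avro generic API below writes one record to a container file that embeds its own schema (the record layout and file name are invented).

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroWriteExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical record schema, defined in JSON
    String schemaJson = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"msg\",\"type\":\"string\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    GenericRecord rec = new GenericData.Record(schema);
    rec.put("id", 1L);
    rec.put("msg", "hello avro");

    DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
    DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(writer);
    fileWriter.create(schema, new File("events.avro")); // schema embedded in the file
    fileWriter.append(rec);
    fileWriter.close(); // readers can later open the file with no external schema
  }
}
```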
MANAGEMENT OPS
WHIRR
• Apache Whirr is a set of libraries for running cloud services
• A common service API
• Provision, install, configure and manage
• Deploy clusters on demand for processing or for testing
• Command line for deploying clusters
AMBARI
• Ambari is a web-based set of tools for
- Deploying
- Administering
- Monitoring
Apache Hadoop clusters
OTHER APACHE HADOOP PROJECTS
• TEZ - An effort to develop a generic application framework to process arbitrarily complex data-processing tasks, plus a reusable set of data-processing primitives usable by other projects
• GORA - An ORM framework for column stores such as Apache HBase and Apache Cassandra, with a specific focus on Hadoop
• DRILL - A distributed system for interactive analysis of large-scale datasets, inspired by Google's Dremel
• LUCENE.NET - A source-code, class-per-class, API-per-API and algorithmic port of the Java Lucene search engine to the C# and .NET platform using the Microsoft .NET Framework
• BLUR - A search platform capable of searching massive amounts of data in a cloud computing environment
• GIRAPH - A large-scale, fault-tolerant, Bulk Synchronous Parallel (BSP)-based graph processing framework
• HAMA - A distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations, e.g., matrix, graph and network algorithms
• ACCUMULO - A distributed key/value store that provides expressive, cell-level access labels
• CRUNCH - A Java library for writing, testing, and running pipelines of MapReduce jobs on Apache Hadoop
• MRUNIT - A library to support unit testing of Hadoop MapReduce jobs
• HADOOP DEVELOPMENT TOOLS - Eclipse-based tools for developing applications on the Hadoop platform
• BIGTOP - A project for the development of packaging and tests of the Hadoop ecosystem
• THRIFT - Cross-language serialization and RPC framework
• KNOX - Knox Gateway is a system that provides a single point of secure access for Apache Hadoop clusters
• KAFKA - A distributed publish-subscribe system for processing large amounts of streaming data
• CASSANDRA - A columnar NoSQL store with scalability, availability and performance capabilities
• FALCON - A data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery
• SENTRY - A highly modular system for providing fine-grained role-based authorization to both data and metadata stored on an Apache Hadoop cluster
• STORM - A distributed, fault-tolerant, and high-performance real-time computation system that provides strong guarantees on the processing of data
• S4 - (Simple Scalable Streaming System) A general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous, unbounded streams of data
• SPARK - An open source engine for parallel data processing both in-memory and on disk, combining batch, streaming, and interactive analytics
https://incubator.apache.org/projects/
SOME HADOOP SERVICES PROVIDERS 
Cloudera 
Hortonworks 
MapR 
DataStax 
Microsoft 
VMware 
http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
CLOUDERA IMPALA
• Impala: real-time SQL queries, native distributed query engine, optimized for low latency
• Answers as fast as you can ask
• Lets everyone ask questions of all data; Big Data storage and analytics together
• Unified storage: supports HDFS and HBase, flexible file formats
• Unified metastore
• Unified security
• Unified client interface: ODBC, SQL syntax, Hue, Beeswax
• Architecture: each DataNode runs a query planner, query coordinator and query execution engine over HBase and HDFS, alongside the shared Hive Metastore, YARN, NameNode and State Store
Source: http://cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
CLOUDERA HUE
• HUE (aka Hadoop User Experience)
• Open source project started at Cloudera
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI library
• User Admin: account management for HUE users
• File Browser: browse HDFS; change permissions and ownership; upload, download, view and edit files
• Job Designer: create MapReduce jobs, which can be templates that prompt for parameters when they are submitted
• Job Browser: view jobs, tasks, counters, logs, etc.
• Beeswax: wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format
• Help: documentation and help
Source: http://blog.cloudera.com/blog/category/hue/
HORTONWORKS STINGER INITIATIVE
Interactive query for Apache Hive
• The Stinger Initiative is a broad, community-based effort to drive the future of Apache Hive
• Stinger delivers 100x performance improvements at petabyte scale with familiar SQL semantics
Source: http://hortonworks.com/labs/stinger/
MAPR HADOOP
MapR Distribution for Apache Hadoop Advantages
Source: http://www.mapr.com/products/why-mapr
MapR Distribution Editions
Source: http://www.mapr.com/products/mapr-editions
MICROSOFT POLYBASE
• PolyBase is a technology built into the data processing engine of SQL Server Parallel Data Warehouse (PDW), designed as the simplest way to combine non-relational data and traditional relational data in your analysis
• PolyBase provides the easiest and broadest way to access Hadoop with the standard SQL query language, without needing to learn MapReduce
• PolyBase moves data in parallel to and from Hadoop and PDW, allowing end users to perform their analysis without the help of IT
• SQL queries run across both Hadoop (HDFS) and the RDBMS (Microsoft SQL Server Parallel Data Warehouse)
Source: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx
VMWARE HADOOP VIRTUALIZATION EXTENSION
• Hadoop Virtualization Extension (HVE) is designed to enhance the reliability and performance of virtualized Hadoop clusters with an extended topology layer and refined locality-related policies
• HVE consists of hooks and extensions to data-locality-related components of Hadoop, including network topology, replica handling (choosing, placement, removal), balancer and task scheduling. It touches all subprojects of Hadoop: Common, HDFS and MapReduce
• Deployment models range from one Hadoop node per physical server, to multiple Hadoop nodes per server, to multiple compute nodes with one or more data nodes per server
Source: http://www.vmware.com/products/big-data-extensions
HADOOP AND CASSANDRA INTEGRATION
• The Brisk solution from DataStax combines the real-time capabilities of Cassandra with the analytical power of Hadoop
• The Hadoop Distributed File System (HDFS) is replaced by the Apache Cassandra File System (CFS) - data is stored in CFS
• Blocks are compressed with Snappy
• Hive Metastore in Cassandra - automatically maps Cassandra column families to Hive tables
• Stack: Job/Task Tracker and tools (Hive, Pig...) run over the CFS layer on Cassandra
HIGH AVAILABILITY SOLUTIONS
• Automatic block replication on 3 DataNodes - rack awareness
• NameNode (active) and Standby NameNode
• Quorum Journal Manager and Zookeeper
• Disaster recovery by replication
• Hive and NameNode metastore backups
• HDFS snapshots
(Topology: the active NameNode/JobTracker and a Standby NameNode oversee worker nodes, each running a DataNode and TaskTracker.)
HA* WITH CONVENTIONAL SHARED STORAGE
• NameNode and Standby NameNode
• Shared storage array
• NameNode and Standby NameNode servers: the active and standby nodes should have equivalent hardware, e.g., 2 CPUs with 6 cores, 96GB RAM, 6 x 600GB 15K HDD (RAID10), 2 x 1GbE ports
• Shared storage array: a shared directory to which both the active node and the standby node have read/write access, e.g., dual active/active controllers with 12 x 600GB 15K RPM HDD (RAID10), attached over SAS (6Gbps). The storage array supports NFS and is mounted on each node. Currently only a single shared edits directory is supported. The availability of the system rests on the redundancy of the shared edits directory, with multiple network paths to the storage and redundancy in the storage itself (disk, network, and power). It is recommended that the shared storage array be high-quality dedicated storage.
* High Availability
HA* WITH JOURNAL MANAGER AND ZOOKEEPER
• NameNode and Standby NameNode
• Automatic failover with Zookeeper: quorum, ZKFC (ZKFailoverController)
• Quorum Journal Manager for reliable edit log storage
• NameNode and Standby NameNode servers: the active and standby nodes should have equivalent hardware (e.g., 2 CPUs with 6 cores, 96GB RAM, 6 x 600GB 15K HDD in RAID10, 2 x 1GbE ports), plus HA software: 3 JournalNode daemons and 3 Zookeeper daemons
• Quorum Journal Manager: in order for the standby node to keep its state synchronized with the active node, both nodes communicate with a group of separate daemons called "JournalNodes"
• Zookeeper: a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. Zookeeper detects an active node server failure and automatically activates the standby node server.
* High Availability
HA* WITH DISASTER RECOVERY
• DistCp: tool for parallelized copying of large amounts of data
• Large inter-cluster copy
• Based on MapReduce
• Parallelized data replication from Hadoop cluster #1 (NameSpace #1, primary site) to Hadoop cluster #2 (NameSpace #2, disaster recovery site)
Moving an Elephant: Large Scale Hadoop Data Migration at Facebook
http://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920
* High Availability
SECURITY
HDFS / KERBEROS / AD
• File permissions
- File permissions like Unix (owner, group, mode)
• User identity
- Simple
- Super-user
• Kerberos connectivity
- Users authenticate to the edge of the cluster with Kerberos
- User and group access is maintained in cluster-specific access control lists
• Microsoft Active Directory connectivity
KNOX
• Knox is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster
• Knox simplifies Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster
• Knox runs as a server, or a cluster of servers, that serves one or more Hadoop clusters
GAZZANG
• Advanced key management - stores keys separate from the encrypted data
• Transparent data encryption - protects data at rest with minimal performance impact
• Process-based access controls - restricts access to specific processes rather than by OS user
• Encrypt and decrypt unstructured data - secures sensitive data that could be damaging if exposed outside the business
• Automation tools - rapid distributed deployment from ten to thousands of data nodes
BACKUP
• Block replication is not a form of backup
• HDFS snapshots
• When a file is deleted by a user or an application, it is not immediately removed from HDFS
- The file is moved to the /trash directory
- The file can be restored quickly as long as it remains in /trash
- A file remains in /trash for a configurable amount of time
• When a file is corrupted, restore from backup is necessary
- Create incremental backups of HDFS files
- Use a timestamp on the file
- Use a staging area to store the backup files
- Move the backup files to tape if necessary
• Back up with the DistCp or copyToLocal commands
ARCHIVING
• Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently
• Reduce the NameNode memory usage while still allowing transparent access to files
• An effective solution for the small-files problem
- http://developer.yahoo.com/blogs/hadoop/posts/2010/07/hadoop_archive_file_compaction/
- http://www.cloudera.com/blog/2009/02/the-small-files-problem/
• Archives are immutable
• Renames, deletes and creates return an error
• A Hadoop Archive is exposed as a file system, so MapReduce can use all the logical input files in Hadoop Archives as input
• HAR file layout: master index + index + data
DATA COMPRESSION
LZO
• https://github.com/toddlipcon/hadoop-lzo
• With compression enabled, the store file applies a compression algorithm to blocks as they are written (during flushes and compactions); blocks must therefore be decompressed when read
• Compression reduces the number of bytes written to / read from HDFS
• Compression effectively improves the efficiency of network bandwidth and disk space
• Compression reduces the size of data that needs to be read when issuing a read
SNAPPY
• Hadoop-Snappy is a project that gives Hadoop access to Snappy compression: http://code.google.com/p/snappy/
• Hadoop-Snappy can be used as an add-on for recent (released) versions of Hadoop that do not yet provide Snappy codec support
• Hadoop-Snappy is kept in sync with Hadoop Common
• Snappy is a compression/decompression library. It does not aim for maximum compression, or for compatibility with any other compression library
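A hedged sketch of enabling Snappy in a MapReduce job with the Hadoop 2.x APIs and property names; whether the native Snappy codec is actually available depends on how the cluster was built.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // Compress the intermediate map output shipped over the network
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed job");
    // Compress the final job output written to HDFS
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    return job;
  }
}
```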
HDFS OVER NFS & CIFS PROTOCOLS
• Make an HDFS filesystem available across networks as an exported share
• Over NFS
- Install a FUSE server
- Mount the HDFS filesystem over FUSE
- Export the HDFS filesystem
• Over CIFS
- Install a FUSE server
- Install a SAMBA server on the FUSE server
- Mount the HDFS filesystem over SAMBA
- Export the HDFS filesystem
FUSE is a framework which makes it possible to implement a filesystem in a userspace program. Features include:
• Simple yet comprehensive API
• Secure mounting by non-root users
• Multi-threaded operation
SAMBA is a suite of Linux applications that speak the SMB (Server Message Block) protocol. Many operating systems, including Windows, use SMB to perform client-server networking.
GANGLIA MONITORING
• Ganglia is a highly scalable cluster monitoring tool that provides visual information on the state of individual machines in a cluster, or summary information for a cluster or sets of clusters. Ganglia provides the ability to view different time windows
• 2 daemons: GMOND & GMETAD
• GMOND collects or receives metric data on each DataNode
• 1 GMETAD per grid
• GMETAD polls 1 GMOND per cluster for data, with failover to the other GMONDs in that cluster
• A node belongs to a cluster
• A cluster belongs to a grid
• An Apache web frontend serves the grid view to clients
Source: http://ganglia.sourceforge.net/
NAGIOS MONITORING
• A system and network monitoring application
• Nagios is an open source tool developed to monitor hosts and services, designed to report network incidents before end users and clients notice them
• Nagios watches the hosts and services we specify, and alerts when things go bad and when they recover
• Initially developed for server and application monitoring, it is widely used to monitor network availability as well
• Components: Nagios plugins watch the Hadoop infrastructure; results are stored in a database and served to users through CGI and a common graphical interface
Source: http://www.nagios.org/
CROWBAR SOFTWARE FOR HADOOP
Solution Components
• A modular, open source framework that accelerates multi-node deployments, simplifies maintenance, and streamlines ongoing updates
• Deploy a Hadoop cluster in hours instead of days
• Use or build barclamps to install and configure software modules
• Opscode Chef Server capabilities
• Supports a cloud operations model to interact, modify, and build based on changing needs
Download the open source software: https://github.com/dellcloudedge/crowbar
Active community: http://lists.us.dell.com/mailman/listinfo/crowbar
Resources on the Wiki: https://github.com/dellcloudedge/crowbar/wiki
LOGICAL DATA WAREHOUSE WITH HADOOP
• Users: administrators, data scientists, engineers, analysts and business users, plus mobile clients running mobile apps
• Workloads: development, BI/analytics, activity reporting, data modeling, data management, exploration and visualization
• Data sources: files, web data, RDBMS, feeding the platform through data transfer and data integration
• Unstructured and structured data warehouse: MPP, NoSQL engines, distributed file systems, shared-nothing architecture, algorithms
• Structured data warehouse: MPP, in-memory, column databases, SQL engines, shared-nothing architecture or shared-disk architecture via SAN (e.g., Oracle Exadata, SAP HANA, IBM Netezza, Teradata, EMC Greenplum, Microsoft PDW, HP Vertica...)
• Cross-cutting: data quality and master data management
INFRASTRUCTURE RECOMMENDATIONS
General
• Determine the volume of data that needs to be stored and processed
• Determine the rate at which the data is expected to grow
• Determine when the cluster needs to grow, and whether there will be a need for additional processing and storage
• For greater power efficiency and higher ROI over time, choose machines with more capacity. This helps to reduce the frequency at which new machines are added
• For high availability, 2 Admin servers in different racks are recommended
• For high availability, a minimum of 2 EdgeNodes in different racks is recommended
Storage
• The number of disks should be based on the amount of raw data required
• Check the rate of growth of data and try to reduce the need to add new machines every year. For example, depending on the net new data per year, it may be worthwhile using 12 hard drives per server rather than 6, to accommodate a larger amount of new data in the existing cluster
DataNode
• Each DataNode runs a TaskTracker daemon
• 2 CPUs with 6 cores are recommended for each DataNode in most cases
• To increase I/O performance use SAS 15K RPM disks; otherwise SATA/SAS NL 7.2K RPM at a better price is sufficient
• Using RAID is not recommended
• A JBOD configuration is required. HDFS provides built-in redundancy by replicating blocks across multiple nodes. The x3 replication factor is recommended
• 48GB RAM per server is recommended in most cases
• For tmp, logs, etc., add 20% to usable disk space
• The ratio between raw data and usable data is 3.6
NameNode
• The NameNode runs a JobTracker daemon
• A copy of the NameNode metadata is stored on a separate machine
• Losing the NameNode metadata would mean all data in HDFS is lost. Use a Standby NameNode for high availability
• The NameNode is not commodity hardware; it needs sufficient RAM and disk performance
• The amount of RAM allocated to the NameNode limits the size of the cluster
• Having plenty of extra NameNode memory is highly recommended, so that the cluster can grow without having to add more memory to the NameNode, which would require a restart
• 96GB of RAM per server is recommended in most cases for large clusters
• Using RAID10 and 15K RPM disks is highly recommended
• The NameNode and the Standby NameNode use the same server configuration
Network
• Use a 10GbE switch per rack, according to the performance of the Hadoop cluster
• Use low-latency 10GbE switches across multiple racks
• For high availability, 2 network switches on top of each rack are recommended
INFRASTRUCTURE SIZING
Feature | Description & Formula | Example
Replication_blocks | Number of block replicas; 3 recommended | 3
Usable_data_volume | Data source volume (business data) | 400TB
Temp_data_volume_ratio | tmp, logs... 20% of Usable_data_volume | 1.2
Data_compression_ratio | Data compression ratio | 0.6
Raw_data_volume | Usable_data_volume x Replication_blocks x Temp_data_volume_ratio x Data_compression_ratio | 400 x 3 x 1.2 x 0.6 = 864TB
Rack_units_per_rack | Number of rack units in one rack | 42RU
Rack_units_switch | Number of rack units for the network switches in one rack | 2RU
Rack_units_per_DataNode | Number of rack units for one DataNode | 2RU
DataNode_volume | Raw data volume in one DataNode | 12 disks x 2TB = 24TB
DataNodes | Number of DataNodes = Raw_data_volume / DataNode_volume | 864 / 24 = 36 DataNodes
Data_racks | Number of racks = (DataNodes x Rack_units_per_DataNode) / (Rack_units_per_rack - Rack_units_switch) | (36 x 2) / (42 - 2) = 2 Data Racks
• Add 1 or 2 Control Racks (24RU) integrating the NameNode, Standby NameNode, EdgeNodes and AdminNodes, depending on the required availability of the Hadoop cluster
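The sizing arithmetic is easy to script. This small sketch reproduces the worked example above; all input values are the example figures from the table, not recommendations.

```java
public class ClusterSizing {
  public static void main(String[] args) {
    double usableDataTB = 400;        // business data volume (example)
    double replication = 3;           // block replication factor
    double tempRatio = 1.2;           // +20% for tmp, logs...
    double compressionRatio = 0.6;    // expected compression ratio
    double dataNodeVolumeTB = 12 * 2; // 12 disks x 2TB per DataNode

    double rawDataTB = usableDataTB * replication * tempRatio * compressionRatio;
    int dataNodes = (int) Math.ceil(rawDataTB / dataNodeVolumeTB);

    int rackUnitsPerRack = 42, rackUnitsSwitch = 2, rackUnitsPerDataNode = 2;
    int dataRacks = (int) Math.ceil((double) (dataNodes * rackUnitsPerDataNode)
        / (rackUnitsPerRack - rackUnitsSwitch));

    // Prints: Raw data: 864 TB, DataNodes: 36, Data racks: 2
    System.out.printf("Raw data: %.0f TB, DataNodes: %d, Data racks: %d%n",
        rawDataTB, dataNodes, dataRacks);
  }
}
```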
HADOOP ARCHITECTURE
Example
Edge Nodes
• 2 x EdgeNode: 2 CPUs 6 cores, 96GB RAM, 6 x 600GB 15K HDD (RAID10), 2 x 10GbE ports
Control Nodes
• 2 x NameNode / Standby NameNode: 2 CPUs 6 cores, 96GB RAM, 6 x 600GB 15K HDD (RAID10), 2 x 10GbE ports
• 1 x AdminNode: 2 CPUs 6 cores, 48GB RAM, 6 x 600GB 15K HDD (RAID10), 2 x 10GbE ports
Worker Nodes
• 3 to n x DataNode: 2 CPUs 6 cores, 48GB RAM, 12 x 3TB 7.2K HDD, 2 x 10GbE ports
All nodes connect through network switches
RACKS CONFIGURATION OVERVIEW
• Control Racks (Rack1, Rack2): switches, EdgeNodes, NameNode / Standby NameNode, Admin Nodes
• Data Racks (Rack3 to Rack6): switches and DataNodes
• The control racks connect the cluster to the cloud / outside network
POC CONFIGURATION
Example
• Architecture example; the exact configuration and sizing is designed according to the customer's needs
• The AdminNode is on the Standby NameNode server
• Zookeeper processes are on the NameNode and Standby NameNode servers
Edge Nodes
• 1 x EdgeNode: 2 CPUs 6 cores, 32GB RAM, 6 x 600GB 15K HDD (RAID10), 2 x 10GbE ports
Control Nodes
• 1 x NameNode and 1 x Standby NameNode: 2 CPUs 6 cores, 96GB RAM, 6 x 600GB 15K HDD (RAID10), 2 x 10GbE ports
Worker Nodes
• 3 x DataNode: 2 CPUs 6 cores, 48GB RAM, 12 x 1TB 7.2K HDD, 2 x 10GbE ports
All nodes connect through a network switch
HADOOP BENCHMARKS
• Designing appropriate hardware for a Hadoop cluster requires benchmarking or a POC, and careful planning to fully understand the workload. However, Hadoop clusters are commonly heterogeneous, and it is recommended to deploy initial hardware with balanced specifications when getting started
• HiBench, a Hadoop benchmark suite constructed by Intel, is used intensively for Hadoop benchmarking, tuning & optimization
• A set of representative Hadoop programs including both micro-benchmarks and more "real world" applications such as search, machine learning and Hive queries
Source: Intel Cloud Builder Guide to Cloud Design and Deployment on Intel Platform - Apache Hadoop - February 2012
HIBENCH
Micro-benchmarks
• Sort - Sorts its binary input data, which is generated using the Hadoop RandomTextWriter example
• WordCount - Counts the occurrences of each word in the input data, which is generated using Hadoop RandomTextWriter
• TeraSort - A standard benchmark for large-size data sorting; the data is generated by the TeraGen program
• DFSIO - Computes the aggregated bandwidth by sampling the number of bytes read/written at fixed time intervals in each map task
Web Search
• Nutch Indexing - Tests the indexing subsystem of Nutch, a popular Apache open-source search engine. The crawler subsystem in Nutch is used to crawl an in-house Wikipedia mirror and generates 8.4 GB of compressed data (about 2.4 million web pages) as workload input
• Page Rank - An open-source implementation of the page-rank algorithm, a link analysis algorithm used widely in web search engines
Machine Learning
• K-Means Clustering - A typical application area of MapReduce for large-scale data mining and machine learning
• Bayesian Classification - Tests the naive Bayesian trainer (a well-known classification algorithm for knowledge discovery and data mining) in Mahout, an Apache open-source machine learning library
Analytical Query
• Hive Join - Models complex analytic queries over structured (relational) tables by computing both the average and sum for each group by joining two different tables
• Hive Aggregation - Models complex analytic queries over structured (relational) tables by computing the sum of each group over a single read-only table
HADOOP TERASORT WORKFLOW
• Teragen is a utility included with Hadoop for creating the data sets used by Terasort. Teragen uses the parallel framework within Hadoop to quickly create large data sets that can be manipulated. The time to create a given data set is an important data point when tracking the performance of a Hadoop environment
• The Terasort benchmark tests the HDFS and MapReduce functions of the Hadoop cluster. Terasort is a compute-intensive operation that uses the Teragen output as the Terasort input. Terasort reads the data created by Teragen into the system's physical memory, sorts it, and writes it back out to HDFS. Terasort exercises all portions of the Hadoop environment during these operations
• Teravalidate is used to ensure that the data produced by Terasort is accurate. It runs across the Terasort output data, verifies that all data is properly sorted with no errors produced, and reports the status of the results
• Workflow: create data (control node) → map tasks sort the data (n nodes, n tasks) → reduce tasks combine the sorted results (n nodes, n tasks) → complete Terasort (control node)
THANK YOU
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 

Recently uploaded (20)

Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 

Big Data Analytics with Hadoop

TOP 10 BIG DATA SOURCES
1. Social network profiles
2. Social influencers
3. Activity-generated data
4. SaaS & Cloud Apps
5. Public web information
6. MapReduce results
7. Data Warehouse appliances
8. NoSQL databases
9. Network and in-stream monitoring technologies
10. Legacy documents
NEW DATA AND MANAGEMENT ECONOMICS
• Storage trends – new data structures: object storage, distributed file systems, NoSQL databases, NewSQL…
• Compute trends – new analytics: massively parallel processing, new algorithms…
• Data warehouse evolution (diagram): OLTP as the data warehouse → general-purpose data warehouse → proprietary and dedicated data warehouse → enterprise data warehouse → logical data warehouse over multi-structured data
• Distribution evolution (diagram): master/slave → master/master → federated/sharded → distributed file systems
• Supported by Master Data Management, Data Quality and Data Integration
DISTRIBUTED FILE SYSTEMS
• A system that permanently stores data
• Data is divided into logical units (files, shards, chunks, blocks…)
• A file path joins file and directory names into a relative or absolute address that identifies a file
• Supports access to files on remote servers
• Supports concurrency, distribution and replication
• Examples: NFS, GPFS, Hadoop DFS, GlusterFS, MogileFS, MooseFS…
(Diagram: an application talks to a master node coordinating several slave nodes.)
UNSTRUCTURED DATA AND OBJECT STORAGE
• Unstructured Data + Metadata = Object Storage
• Each object (video, image, audio, text…) is stored as a unique ID (UID), metadata, and the raw bit stream
• Metadata values are specific to each individual content type
• Enables automated management of content
• Ensures integrity, retention and authenticity
NOSQL DATABASES CATEGORIES
• NoSQL = Not only SQL
• Popular name for a subset of structured storage software designed to deliver increased optimization for high-performance operations on large datasets
• Basically available, scalable, eventually consistent
• Easy to use
• Tolerant of scale by way of horizontal distribution
• Column: BigTable (Google), HBase, Cassandra (DataStax), Hypertable…
• Key-Value: Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB…
• Document: MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)…
• Graph: Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)…
WHAT IS HADOOP?
• “Flexible and available architecture for large scale computation and data processing on a network of commodity hardware”
• Open source software + commodity hardware = IT cost reduction
WHAT IS HADOOP USED FOR?
• Searching
• Log processing
• Recommendation systems
• Analytics
• Video and image analysis
• Data retention
WHO USES HADOOP?
• Top-level Apache Foundation project
• Large, active user base, mailing lists, user groups
• Very active development, strong development team
• http://wiki.apache.org/hadoop/PoweredBy#L
MOVING COMPUTATION TO STORAGE
• General-purpose storage servers combine server, disks and networking to reduce latency
• Specialized software enables general-purpose system designs to provide high-performance data services
• Moving data processing to storage (diagram): legacy applications keep data processing on servers separate from the storage array (SAN, NAS); emerging applications add metadata management alongside data processing; next-generation applications put data processing, metadata management and storage in the same systems
BIG DATA ARCHITECTURE
• BI & DWH architecture – conventional:
- SQL based
- High availability
- Enterprise database
- Right design for structured data
- Current storage hardware (SAN, NAS, DAS)
- App servers and database servers connected through network switches and a SAN switch to a storage array
• Analytics architecture – next generation:
- Not only SQL based
- High scalability, availability and flexibility
- Compute and storage in the same box to reduce network latency
- Right design for semi-structured and unstructured data
- Edge nodes and data nodes connected through network switches
SHARE-NOTHING ARCHITECTURE
• Share everything: all database instances access the same storage and memory (e.g., a Unix file system)
• Share disks: database instances on separate servers share SAN disks over IP/FC networks (e.g., Oracle RAC)
• Share nothing: each database instance owns its local storage, and nodes communicate only over the IP network (e.g., HDFS with local storage)
APACHE HADOOP 2.0 ECOSYSTEM
• Storage: HDFS
• Cluster resource management: YARN
• Batch processing: MapReduce
• Real-time processing: Spark, Tez
• Analytics: Impala, Hive, Pig, Stinger
• Stream: Storm
• Search: Solr
• Machine learning: Mahout, Spark
• NoSQL: HBase
• Serialization: Avro, Thrift
• Security: Knox, Sentry
• Management: Oozie, Zookeeper, Chukwa, Kafka
• Management ops: Ambari, Bigtop, Whirr
• Development: Crunch, MRUnit, HDT
• Others: ISV engines
• http://incubator.apache.org/projects/
HADOOP COMMON
• Hadoop Common is a set of utilities that support the Hadoop subprojects
• Hadoop Common includes Filesystem, RPC and Serialization libraries
HDFS & MAPREDUCE
• Hadoop Distributed File System
- A scalable, fault-tolerant, high-performance distributed file system
- Asynchronous replication
- Write-once, read-many (WORM)
- Hadoop cluster with 3 DataNodes minimum
- Data divided into 64MB (default) or 128MB blocks, each block replicated 3 times (default)
- No RAID required for DataNodes
- Interfaces: Java, Thrift, C library, FUSE, WebDAV, HTTP, FTP
- NameNode holds the filesystem metadata
- Files are broken up and spread over the DataNodes
• Hadoop MapReduce
- Software framework for distributed computation
- Input | Map() | Copy/Sort | Reduce() | Output (see the word-count sketch below)
- JobTracker schedules and manages jobs
- TaskTracker executes individual map() and reduce() tasks on each cluster node
(Diagram: master node and worker nodes.)
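As a concrete illustration of the Input | Map() | Copy/Sort | Reduce() | Output pipeline, here is a minimal word-count sketch against the Hadoop 2.x MapReduce Java API; the class names are ours, not part of the deck:

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {

      // Map(): emit (word, 1) for every token of the input split
      public static class TokenizerMapper
              extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable offset, Text line, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer tokens = new StringTokenizer(line.toString());
              while (tokens.hasMoreTokens()) {
                  word.set(tokens.nextToken());
                  context.write(word, ONE);
              }
          }
      }

      // Reduce(): sum the counts shuffled and sorted to each word
      public static class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable count : counts) sum += count.get();
              result.set(sum);
              context.write(word, result);
          }
      }
  }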
HDFS - READ FILE
1. The client API calculates the block index from the file-pointer offset and sends a request to the NameNode
2. The NameNode replies with the DataNodes that hold a copy of that block
3. The client contacts those DataNodes directly, without going through the NameNode
4. The DataNodes read the blocks
5. The DataNodes report success back to the client
(Diagram: client on an EdgeNode, one NameNode, several DataNodes. A sketch of the client API follows.)
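In application code, steps 1–3 are hidden behind the FileSystem client API. A minimal read sketch against the Hadoop Java API, with a placeholder path:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class HdfsRead {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
          FileSystem fs = FileSystem.get(conf);     // client entry point; talks to the NameNode
          // open() asks the NameNode for block locations, then streams from the DataNodes
          try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
              IOUtils.copyBytes(in, System.out, 4096, false);
          }
      }
  }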
HDFS - WRITE FILE
1. The client contacts the NameNode, which designates one of the replicas as the primary
2. The NameNode's response says which replica is the primary and which are the secondaries
3. The client pushes its changes to all DataNodes, in any order; each DataNode stores the change in a buffer
4. The client sends a "commit" request to the primary, which determines an update order and pushes that order to all the secondaries
5. After all secondaries complete the commit, the primary reports success back to the client
6. All block-distribution and metadata changes are written to an operation log file on the NameNode
(A sketch of the client API follows.)
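Writing is symmetric: create() obtains target DataNodes from the NameNode, and the client library drives the buffering and replication of steps 3–5. A minimal sketch, again with a placeholder path:

  import java.nio.charset.StandardCharsets;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWrite {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // create() asks the NameNode for target DataNodes; the client then
          // streams data down the replication pipeline (primary -> secondaries)
          try (FSDataOutputStream out = fs.create(new Path("/data/sample-out.txt"))) {
              out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
          }
      }
  }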
MAPREDUCE - EXEC FILE
The client program is copied to each node (see the driver sketch below).
1. The JobTracker determines the number of splits from the input path and selects TaskTrackers based on their network proximity to the data sources
2. The JobTracker sends the task requests to the selected TaskTrackers
3. Each TaskTracker starts the map phase by extracting the input data from its splits
4. When a map task completes, the TaskTracker notifies the JobTracker; when all TaskTrackers are done, the JobTracker notifies the TaskTrackers selected for the reduce phase
5. Each TaskTracker reads the region files remotely and invokes the reduce function, which collects the key/aggregated-value pairs into the output file (one per reducer node)
6. After both phases complete, the JobTracker unblocks the client program
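The job the client submits in step 1 is described through the Job API. A minimal driver for the WordCount classes sketched earlier; class names and paths are illustrative:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCountDriver {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCountDriver.class);
          job.setMapperClass(WordCount.TokenizerMapper.class);
          job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
          job.setReducerClass(WordCount.IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));   // splits come from this path
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);       // blocks until the job finishes
      }
  }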
MR VS. YARN ARCHITECTURE
• MR: MapReduce; YARN: Yet Another Resource Negotiator
• MR v1: clients talk to the JobTracker (alongside the NameNode); each worker runs a TaskTracker next to its DataNode
• YARN / MR v2: clients talk to the ResourceManager (alongside the NameNode); each worker runs a NodeManager next to its DataNode, hosting per-application ApplicationMasters and Containers
HBASE
• Not a relational database (no joins)
• Sparse data – nulls are stored for free
• Semi-structured or unstructured data
• Data changes through time; versioned data
• Scalable – goal of billions of rows x millions of columns
• Clone of BigTable (Google)
• Implemented in Java (clients: Java, C++, Ruby…)
• Data is stored column-oriented
• Distributed over many servers; tolerant of machine failure
• Layered over HDFS; strong consistency
• Data model: (Table, Row_Key, Family, Column, Timestamp) = Cell (Value); a table is split into regions by row key
• Example table with families Animal (Type, Size) and Repair (Cost):

  Row        | Timestamp | Animal:Type | Animal:Size | Repair:Cost
  Enclosure1 | 12        | Zebra       | Medium      | 1000€
  Enclosure1 | 11        | Lion        | Big         |
  Enclosure2 | 13        | Monkey      | Small       | 1500€
HBASE
• Table
- Regions for scalability, defined by row range [start_key, end_key)
- One Store per Family, for efficiency
- 1..n StoreFiles (HFile format on HDFS)
• Everything is bytes
• Rows are ordered sequentially by key
• Special tables -ROOT- and .META. tell clients where to find user data
• Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
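A minimal sketch of how a client writes and reads cells like those in the example table above, using the HBase 1.x Java client API; the table and cell names follow the example and are otherwise arbitrary:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ZooTableDemo {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
          try (Connection conn = ConnectionFactory.createConnection(conf);
               Table table = conn.getTable(TableName.valueOf("zoo"))) {
              // Write one cell: row "Enclosure1", family "Animal", column "Type"
              Put put = new Put(Bytes.toBytes("Enclosure1"));
              put.addColumn(Bytes.toBytes("Animal"), Bytes.toBytes("Type"), Bytes.toBytes("Zebra"));
              table.put(put);
              // Read it back; HBase returns the latest version by default
              Result result = table.get(new Get(Bytes.toBytes("Enclosure1")));
              byte[] value = result.getValue(Bytes.toBytes("Animal"), Bytes.toBytes("Type"));
              System.out.println(Bytes.toString(value));
          }
      }
  }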
DATA ACCESS
• HIVE
- Data warehouse infrastructure providing data summarization and ad hoc querying on top of Hadoop: MapReduce for execution, HDFS for storage
- MetaStore: table/partition properties; Thrift API with current clients in PHP (web interface), Python, and Java (query engine and CLI); metadata stored in any SQL backend
- Hive Query Language: basic SQL (Select, From, Join, Group By), equi-joins, multi-table insert, multi-group-by; batch queries (see the JDBC sketch below)
• PIG
- A high-level data-flow language and execution framework for parallel computation
- Pig Latin: a data processing language with a compiler that translates to MapReduce
- Simple way to write MapReduce programs; abstracts you from specific details; focus on data processing, data flow and data manipulation
• HCATALOG
- Table and storage management service for data created using Apache Hadoop
- Provides a shared schema and data-type mechanism
- Provides a table abstraction so that users need not be concerned with where or how their data is stored
- Provides interoperability across data processing tools such as Pig, MapReduce and Hive
- HCatalog DDL (Data Definition Language) and CLI (Command Line Interface)
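Applications typically reach Hive's batch queries over JDBC against a HiveServer2 endpoint. A minimal sketch, assuming a hypothetical weblogs table and placeholder host and credentials:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveQueryDemo {
      public static void main(String[] args) throws Exception {
          Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
          String url = "jdbc:hive2://hiveserver:10000/default";
          try (Connection conn = DriverManager.getConnection(url, "user", "");
               Statement stmt = conn.createStatement();
               // The query compiles to MapReduce jobs on the cluster
               ResultSet rs = stmt.executeQuery(
                   "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
              while (rs.next()) {
                  System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
              }
          }
      }
  }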
DATA TRANSFER
• SQOOP
- Data import/export: Sqoop is a tool designed to help users import existing relational databases into their Hadoop clusters (RDBMS → Hadoop cluster)
- Automatic data import; easy import from many databases to Hadoop
- Generates code for use in MapReduce applications
• FLUME
- Apache Flume is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data (log files → Hadoop cluster)
- Simple and flexible architecture based on streaming data flows, with batching, compression, filtering and transformation
- Robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms
- The system is centrally managed and allows intelligent dynamic management
MANAGEMENT
• OOZIE
- A server-based Bundle Engine providing a higher-level abstraction that batches a set of coordinator applications; users can start/stop/suspend/resume/rerun a set of coordinator jobs at the bundle level, resulting in better and easier operational control
- A server-based Coordinator Engine specialized in running workflows based on time and data triggers
- A server-based Workflow Engine specialized in running workflow jobs with actions that execute Hadoop MapReduce and Pig jobs
• CHUKWA
- A data collection system for managing large distributed systems
- Built on HDFS and MapReduce
- Toolkit for displaying, monitoring and analyzing log files
• ZOOKEEPER
- A high-performance coordination service for distributed applications (see the client sketch below)
- A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services
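A minimal sketch of the Zookeeper Java client storing and reading back a small piece of configuration in a znode; the server address and znode name are placeholders:

  import java.nio.charset.StandardCharsets;
  import java.util.concurrent.CountDownLatch;
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class ZkConfigDemo {
      public static void main(String[] args) throws Exception {
          CountDownLatch connected = new CountDownLatch(1);
          // The watcher is notified asynchronously once the session is established
          ZooKeeper zk = new ZooKeeper("zk1:2181", 5000, event -> {
              if (event.getState() == Watcher.Event.KeeperState.SyncConnected)
                  connected.countDown();
          });
          connected.await();
          // Store a small piece of configuration (fails if the znode already exists)
          zk.create("/app-config", "batch.size=128".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
          byte[] data = zk.getData("/app-config", false, null);
          System.out.println(new String(data, StandardCharsets.UTF_8));
          zk.close();
      }
  }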
MACHINE LEARNING
• Apache Mahout is an Apache project producing free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform
• Mahout machine learning algorithms (a recommender sketch follows):
- Recommendation mining: takes users' behavior and finds items a specified user might like
- Clustering: takes e.g. text documents and groups them by related document topics
- Classification: learns from existing categorized documents what documents of a specific category look like, and assigns unlabeled documents to the appropriate category
- Frequent itemset mining: takes a set of item groups (e.g. terms in a query session, shopping cart contents) and identifies which individual items typically appear together
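A minimal recommendation-mining sketch using Mahout's Taste API; note that this part of Mahout runs on a single node, while the distributed Hadoop jobs are launched separately. The ratings.csv input (userID,itemID,preference) is hypothetical:

  import java.io.File;
  import java.util.List;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.recommender.Recommender;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class RecommenderDemo {
      public static void main(String[] args) throws Exception {
          DataModel model = new FileDataModel(new File("ratings.csv"));
          UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
          // Consider the 10 most similar users when scoring candidate items
          UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
          Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
          List<RecommendedItem> items = recommender.recommend(42L, 3); // top 3 items for user 42
          for (RecommendedItem item : items) System.out.println(item);
      }
  }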
SERIALIZATION
• A data serialization system that provides dynamic integration with scripting languages
• Avro Data
- Expressive; smaller and faster; dynamic
- Schema stored with the data (see the sketch below)
- APIs permit reading and creating
- Includes a file format and a textual encoding
• Avro RPC
- Leverages versioning support
- Provides cross-language access to Hadoop services
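A minimal sketch of Avro's generic (no code generation) Java API: the schema is written into the file header, so any reader can decode the data without prior knowledge. The record and field names are illustrative:

  import java.io.File;
  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileReader;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;

  public class AvroDemo {
      public static void main(String[] args) throws Exception {
          Schema schema = new Schema.Parser().parse(
              "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"user\",\"type\":\"string\"},"
            + "{\"name\":\"clicks\",\"type\":\"long\"}]}");
          GenericRecord rec = new GenericData.Record(schema);
          rec.put("user", "alice");
          rec.put("clicks", 42L);
          File file = new File("events.avro");
          try (DataFileWriter<GenericRecord> writer =
                   new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
              writer.create(schema, file); // embeds the schema in the file header
              writer.append(rec);
          }
          try (DataFileReader<GenericRecord> reader =
                   new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
              while (reader.hasNext()) System.out.println(reader.next());
          }
      }
  }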
MANAGEMENT OPS
• WHIRR
- Apache Whirr is a set of libraries for running cloud services
- A common service API to provision, install, configure and manage
- Deploys clusters on demand for processing or for testing
- Command line for deploying clusters
• AMBARI
- A web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters
OTHER APACHE HADOOP PROJECTS
• TEZ – an effort to develop a generic application framework that can process arbitrarily complex data-processing tasks, plus a reusable set of data-processing primitives usable by other projects
• GORA – an ORM framework for column stores such as Apache HBase and Apache Cassandra, with a specific focus on Hadoop
• DRILL – a distributed system for interactive analysis of large-scale datasets, inspired by Google's Dremel
• LUCENE – Lucene.NET is a source-code, class-per-class, API-per-API and algorithmic port of the Java Lucene search engine to the C# and .NET platform on the Microsoft .NET Framework
• BLUR – a search platform capable of searching massive amounts of data in a cloud computing environment
• GIRAPH – a large-scale, fault-tolerant, Bulk Synchronous Parallel (BSP)-based graph processing framework
• HAMA – a distributed computing framework based on BSP computing techniques for massive scientific computations, e.g. matrix, graph and network algorithms
• ACCUMULO – a distributed key/value store that provides expressive, cell-level access labels
• CRUNCH – a Java library for writing, testing and running pipelines of MapReduce jobs on Apache Hadoop
• MRUNIT – a library to support unit testing of Hadoop MapReduce jobs
• HADOOP DEVELOPMENT TOOLS – Eclipse-based tools for developing applications on the Hadoop platform
• BIGTOP – a project for the development of packaging and tests of the Hadoop ecosystem
• THRIFT – cross-language serialization and RPC framework
• KNOX – Knox Gateway is a system that provides a single point of secure access for Apache Hadoop clusters
• KAFKA – a distributed publish-subscribe system for processing large amounts of streaming data
• CASSANDRA – a columnar NoSQL store with scalability, availability and performance capabilities
• FALCON – a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management and data discovery
• SENTRY – a highly modular system providing fine-grained, role-based authorization to both data and metadata stored on an Apache Hadoop cluster
• STORM – a distributed, fault-tolerant, high-performance real-time computation system that provides strong guarantees on the processing of data
• S4 – Simple Scalable Streaming System: a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform for processing continuous, unbounded streams of data
• SPARK – Apache Spark is open source, parallel data processing both in-memory and on disk, combining batch, streaming and interactive analytics
• https://incubator.apache.org/projects/
SOME HADOOP SERVICES PROVIDERS
• Cloudera
• Hortonworks
• MapR
• DataStax
• Microsoft
• VMware
• http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
CLOUDERA IMPALA
• Impala: real-time SQL queries through a native distributed query engine, optimized for low latency
• Answers as fast as you can ask; lets everyone ask questions of all data, with big data storage and analytics together
• Unified storage: supports HDFS and HBase, flexible file formats
• Unified Metastore (Hive Metastore)
• Unified security
• Unified client interface: ODBC, SQL syntax, Hue, Beeswax
• Each DataNode runs a query planner, a query coordinator and a query execution engine over HBase and HDFS, alongside a State Store, YARN and the NameNode (diagram)
• Source: http://cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
CLOUDERA HUE
• HUE (Hadoop User Experience) is an open source project started at Cloudera
• HUE is a web UI for Hadoop and a platform for building custom applications with a nice UI library
• User Admin: account management for HUE users
• File Browser: browse HDFS; change permissions and ownership; upload, download, view and edit files
• Job Designer: create MapReduce jobs, which can be templates that prompt for parameters when they are submitted
• Job Browser: view jobs, tasks, counters, logs, etc.
• Beeswax: wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format
• Help: documentation and help
• Source: http://blog.cloudera.com/blog/category/hue/
HORTONWORKS STINGER INITIATIVE
• Interactive query for Apache Hive
• The Stinger Initiative is a broad, community-based effort to drive the future of Apache Hive
• Stinger delivers 100x performance improvements at petabyte scale with familiar SQL semantics
• Source: http://hortonworks.com/labs/stinger/
MAPR HADOOP
• MapR Distribution for Apache Hadoop advantages: http://www.mapr.com/products/why-mapr
• MapR Distribution editions: http://www.mapr.com/products/mapr-editions
MICROSOFT POLYBASE
• PolyBase is a technology in the data processing engine of SQL Server Parallel Data Warehouse (PDW), designed as the simplest way to combine non-relational data and traditional relational data in your analysis
• PolyBase provides the easiest and broadest way to access Hadoop with the standard SQL query language, without needing to learn MapReduce
• PolyBase moves data in parallel to and from Hadoop and PDW, allowing end users to perform their analysis without the help of IT
• One SQL query returns combined results from PDW (Microsoft SQL Server Parallel Data Warehouse) and Hadoop (HDFS)
• Source: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx
VMWARE HADOOP VIRTUALIZATION EXTENSION
• Hadoop Virtualization Extension (HVE) is designed to enhance the reliability and performance of virtualized Hadoop clusters with an extended topology layer and refined locality-related policies
• HVE consists of hooks and extensions to the data-locality-related components of Hadoop: network topology, replica placement, replica choosing, replica removal, balancer and task scheduling; it touches all Hadoop subprojects: Common, HDFS and MapReduce
• Deployment options (diagram): one Hadoop node per server; multiple Hadoop nodes per server; multiple compute nodes with one data node per server; multiple compute and data nodes per server
• Source: http://www.vmware.com/products/big-data-extensions
HADOOP AND CASSANDRA INTEGRATION
• The Brisk solution from DataStax combines the real-time capabilities of Cassandra with the analytical power of Hadoop
• The Hadoop Distributed File System (HDFS) is replaced by the Apache Cassandra File System (CFS): data is stored in CFS
• Blocks are compressed with Snappy
• Hive Metastore in Cassandra – automatically maps Cassandra column families to Hive tables
• Stack: Cassandra CFS layer, Job/Task Tracker, Hive Metastore, tools (Hive, Pig…)
HIGH AVAILABILITY SOLUTIONS
• Automatic block replication on 3 DataNodes, with rack awareness
• NameNode and Standby NameNode
• Quorum Journal Manager and Zookeeper
• Disaster recovery by replication
• Hive and NameNode metastore backups
• HDFS snapshots
(Diagram: an active NameNode/JobTracker and a standby NameNode in front of worker nodes, each running a DataNode and a TaskTracker.)
HA* WITH CONVENTIONAL SHARED STORAGE
• NameNode and Standby NameNode servers: the active and standby nodes should have equivalent hardware (e.g., 2 x 6-core CPUs, 96GB RAM, 6 x 600GB 15K HDD in RAID10, 2 x 1GbE ports)
• Shared storage array: a shared edits directory to which both the active and standby servers have read/write access, e.g. an array with dual active/active controllers and 12 x 600GB 15K RPM disks (RAID10), attached over SAS (6Gbps). The array supports NFS and is mounted on each node. Currently only a single shared edits directory is supported; availability therefore relies on the redundancy of that directory: multiple network paths to the storage, and redundancy in the storage itself (disk, network and power). A high-quality dedicated storage array is recommended.
* High Availability
HA* WITH JOURNAL MANAGER AND ZOOKEEPER
• NameNode and Standby NameNode servers: the active and standby nodes should have equivalent hardware (e.g., 2 x 6-core CPUs, 96GB RAM, 6 x 600GB 15K HDD in RAID10, 2 x 1GbE ports)
• Automatic failover with Zookeeper: quorum plus ZKFC (ZKFailoverController)
• Quorum Journal Manager for reliable edit-log storage: for the standby node to keep its state synchronized with the active node, both nodes communicate with a group of separate daemons called JournalNodes (e.g., 3 JournalNode daemons)
• Zookeeper: a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures (e.g., 3 Zookeeper daemons). Zookeeper detects a failure of the active node server and automatically activates the standby node server.
* High Availability
HA* WITH DISASTER RECOVERY
• DistCp: a tool for parallelized copying of large amounts of data
• Large inter-cluster copies, based on MapReduce map() tasks
• Parallelized data replication from Hadoop cluster #1 (NameSpace #1, primary site) to Hadoop cluster #2 (NameSpace #2, disaster recovery site)
• Moving an Elephant: Large Scale Hadoop Data Migration at Facebook – http://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920
* High Availability
SECURITY
• HDFS / KERBEROS / AD
- File permissions, Unix-like (owner, group, mode)
- User identity: simple, or super-user
- Kerberos connectivity: users authenticate to the edge of the cluster with Kerberos; user and group access is maintained in cluster-specific access control lists
- Microsoft Active Directory connectivity
• KNOX
- Knox is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster
- Knox simplifies Hadoop security for users who access cluster data and execute jobs, and for operators who control access and manage the cluster
- Knox runs as a server, or a cluster of servers, serving one or more Hadoop clusters
• GAZZANG
- Advanced key management: stores keys separate from the encrypted data
- Transparent data encryption: protects data at rest with minimal performance impact
- Process-based access controls: restricts access to specific processes rather than by OS user
- Encrypt and decrypt unstructured data: secures sensitive data that could be damaging if exposed outside the business
- Automation tools: rapid distributed deployment from ten to thousands of data nodes
BACKUP
• Block replication is not a form of backup
• HDFS snapshots
• When a file is deleted by a user or an application, it is not immediately removed from HDFS:
- The file is moved to the /trash directory
- The file can be restored quickly as long as it remains in /trash
- A file remains in /trash for a configurable amount of time
• When a file is corrupted, restoring from backup is necessary:
- Create incremental backups of HDFS files
- Use a timestamp on the backup files
- Use a staging area to store the backup files
- Move the backup files to tape if necessary
• Back up with the DistCp or copyToLocal commands (see the sketch below)
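A minimal sketch of the copyToLocal approach from the Java API, time-stamping the staged copy as suggested above; the HDFS and staging paths are placeholders:

  import java.text.SimpleDateFormat;
  import java.util.Date;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsBackup {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          String stamp = new SimpleDateFormat("yyyyMMdd-HHmmss").format(new Date());
          // Copy an HDFS directory to a time-stamped local staging area
          // (false = do not delete the source after copying)
          fs.copyToLocalFile(false, new Path("/data/critical"),
                             new Path("/backup/staging/critical-" + stamp));
      }
  }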
ARCHIVING
• Hadoop Archives (HAR files) are a file archiving facility that packs files into HDFS blocks more efficiently
• Reduces NameNode memory usage while still allowing transparent access to files
• An effective solution for the small-files problem:
- http://developer.yahoo.com/blogs/hadoop/posts/2010/07/hadoop_archive_file_compaction/
- http://www.cloudera.com/blog/2009/02/the-small-files-problem/
• Archives are immutable: rename, delete and create return an error
• A Hadoop Archive is exposed as a file system, so MapReduce can use all the logical input files in archives as input
• HAR file layout: master index, index, data
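Transparent access works through the har:// filesystem scheme layered over HDFS. A sketch of listing an archive's contents; the NameNode address and archive path are placeholders, and the exact URI layout should be checked against the Hadoop Archives guide:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HarListDemo {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // The har:// scheme wraps the underlying HDFS URI; the archive
          // then behaves like a read-only directory tree
          URI har = URI.create("har://hdfs-namenode:8020/user/demo/logs.har");
          FileSystem harFs = FileSystem.get(har, conf);
          for (FileStatus status : harFs.listStatus(new Path("/user/demo/logs.har")))
              System.out.println(status.getPath());
      }
  }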
DATA COMPRESSION
• LZO
- https://github.com/toddlipcon/hadoop-lzo
- With compression enabled, the store file applies a compression algorithm to blocks as they are written (during flushes and compactions); blocks must therefore be decompressed when read
- Compression reduces the number of bytes written to and read from HDFS
- Compression effectively improves the efficiency of network bandwidth and disk space
- Compression reduces the amount of data that must be read when issuing a read
• SNAPPY
- Hadoop-Snappy is a project that gives Hadoop access to Snappy compression: http://code.google.com/p/snappy/
- Hadoop-Snappy can be used as an add-on for recent (released) versions of Hadoop that do not yet provide Snappy codec support
- Hadoop-Snappy is kept in sync with Hadoop Common
- Snappy is a compression/decompression library; it does not aim for maximum compression, or for compatibility with any other compression library
(A job-configuration sketch follows.)
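A minimal sketch of enabling Snappy on a MapReduce job from the Java API, for both the intermediate map output (shuffle traffic) and the final job output; the property names are the MRv2 ones:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.SnappyCodec;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class CompressedJobSetup {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Compress intermediate map output to cut shuffle traffic
          conf.setBoolean("mapreduce.map.output.compress", true);
          conf.setClass("mapreduce.map.output.compress.codec",
                        SnappyCodec.class, CompressionCodec.class);
          Job job = Job.getInstance(conf, "compressed job");
          // Compress the final job output written to HDFS
          FileOutputFormat.setCompressOutput(job, true);
          FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
          // ... mapper/reducer/paths would be configured here as in the earlier driver sketch
      }
  }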
HDFS OVER NFS & CIFS PROTOCOLS
• Makes an HDFS filesystem available across networks as an exported share
• Over NFS: install a FUSE server, mount the HDFS filesystem over FUSE, export the HDFS filesystem
• Over CIFS: install a FUSE server, install a SAMBA server on the FUSE server, mount the HDFS filesystem over SAMBA, export the HDFS filesystem
• FUSE is a framework that makes it possible to implement a filesystem in a userspace program. Features include a simple yet comprehensive API, secure mounting by non-root users, and multi-threaded operation
• SAMBA is a suite of Linux applications that speak the SMB (Server Message Block) protocol. Many operating systems, including Windows, use SMB for client-server networking
GANGLIA MONITORING
• Ganglia is a highly scalable cluster monitoring tool that provides visual information on the state of individual machines in a cluster, or summary information for a cluster or sets of clusters, over different time windows
• Two daemons: GMOND and GMETAD
• GMOND collects or receives metric data on each DataNode
• One GMETAD per grid polls one GMOND per cluster for data, with failover
• A node belongs to a cluster; a cluster belongs to a grid
• An Apache web frontend polls GMETAD to serve the grid view to web clients
• Source: http://ganglia.sourceforge.net/
NAGIOS MONITORING
• A system and network monitoring application
• Nagios is an open source tool developed to monitor hosts and services, designed to report network incidents before end users and clients notice them
• Nagios watches the hosts and services we specify, alerting when things go bad and again when they recover
• Initially developed for server and application monitoring, it is also widely used to monitor network availability
• Components: Apache CGI graphical interface for users, with Nagios plugins and a database over the Hadoop infrastructure
• Source: http://www.nagios.org/
CROWBAR SOFTWARE FOR HADOOP
• A modular, open source framework that accelerates multi-node deployments, simplifies maintenance and streamlines ongoing updates
• Deploys a Hadoop cluster in hours instead of days
• Use or build barclamps to install and configure software modules (built on the Opscode Chef Server)
• Supports a cloud operations model to interact, modify and build based on changing needs
• Download the open source software: https://github.com/dellcloudedge/crowbar
• Active community: http://lists.us.dell.com/mailman/listinfo/crowbar
• Resources on the wiki: https://github.com/dellcloudedge/crowbar/wiki
LOGICAL DATA WAREHOUSE WITH HADOOP
• Users: administrators, data scientists, engineers, analysts, business users, and mobile clients via mobile apps
• Data sources: NoSQL, SQL, files, web data, RDBMS, fed in through data transfer
• Unstructured and structured data warehouse: MPP, NoSQL engine, distributed file systems, share-nothing architecture, algorithms
• Structured data warehouse: MPP, in-memory, column databases, SQL engine, share-nothing architecture or share-disk architecture via SAN (Oracle Exadata, SAP HANA, IBM Netezza, Teradata, EMC Greenplum, Microsoft PDW, HP Vertica…)
• Surrounding services: development, BI/analytics, activity reporting, data modeling, data management, data integration, exploration and visualization, data quality, master data management
INFRASTRUCTURE RECOMMENDATIONS
• General
- Determine the volume of data that needs to be stored and processed, and the rate at which it is expected to grow
- Determine when the cluster needs to grow, and whether there will be a need for additional processing and storage
- For greater power efficiency and higher ROI over time, choose machines with more capacity; this reduces how often new machines must be added
- For high availability, 2 Admin servers in different racks are recommended
- For high availability, a minimum of 2 EdgeNodes in different racks is recommended
• Storage
- Base the number of disks on the amount of raw data required
- Check the data growth rate and try to reduce the need to add new machines every year; for example, depending on the net new data per year, it may be worthwhile to use 12 hard drives per server rather than 6, to accommodate a larger amount of new data in the existing cluster
• DataNode
- Each DataNode runs a TaskTracker daemon
- 2 x 6-core CPUs per DataNode are recommended in most cases
- To increase I/O performance use SAS 15K RPM disks; otherwise SATA/SAS NL 7.2K RPM disks, at a better price, are sufficient
- Using RAID is not recommended; a JBOD configuration is required. HDFS provides built-in redundancy by replicating blocks across multiple nodes; a x3 replication factor is recommended
- 48GB RAM per server is recommended in most cases
- For tmp, logs, etc., add 20% to the usable disk space
- The ratio between usable data and raw data is 3.6
• NameNode
- The NameNode runs a JobTracker daemon
- A copy of the NameNode metadata is stored on a separate machine; losing the NameNode metadata would mean losing all data in HDFS. Use the Standby NameNode for high availability
- The NameNode is not commodity hardware, and needs sufficient RAM and disk performance
- The amount of RAM allocated to the NameNode limits the size of the cluster; having plenty of extra NameNode memory is highly recommended, so the cluster can grow without adding more memory to the NameNode, which requires a restart
- 96GB of RAM per server is recommended in most large clusters
- RAID10 and 15K RPM disks are highly recommended
- The NameNode and the secondary NameNode should have the same server configuration
• Network
- Use a 10GbE switch per rack, according to the performance requirements of the Hadoop cluster
- Use low-latency 10GbE switches across multiple racks
- For high availability, 2 network switches at the top of each rack are recommended
INFRASTRUCTURE SIZING
Feature – description & formula – example:
• Replication_blocks – number of replication blocks (3 recommended) – 3
• Usable_data_volume – data source volume (business data) – 400TB
• Temp_data_volume_ratio – tmp, log… (20% of Usable_data_volume) – 1.2
• Data_compression_ratio – data compression ratio – 0.6
• Raw_data_volume – Usable_data_volume x Replication_blocks x Temp_data_volume_ratio x Data_compression_ratio – 400 x 3 x 1.2 x 0.6 = 864TB
• Rack_units_per_rack – number of rack units in one rack – 42RU
• Rack_units_switch – number of rack units for network switches in one rack – 2RU
• Rack_units_per_DataNode – number of rack units for one DataNode – 2RU
• DataNode_volume – raw data volume in one DataNode – 12 x 2TB disks = 24TB
• DataNodes – Raw_data_volume / DataNode_volume – 864 / 24 = 36 DataNodes
• Data racks – (DataNodes x Rack_units_per_DataNode) / (Rack_units_per_rack – Rack_units_switch) – (36 x 2) / (42 – 2) = 2 data racks
• Add 1 or 2 x 24RU control racks integrating the NameNode, Secondary NameNode, EdgeNodes and AdminNodes, depending on the required availability of the Hadoop cluster
(A small worked sketch of these formulas follows.)
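The formulas above are simple arithmetic; a small sketch that reproduces the example column, with the example's own assumptions as inputs:

  public class ClusterSizing {
      public static void main(String[] args) {
          double usableDataTB = 400;      // business data volume
          int replication = 3;            // HDFS replication factor
          double tempRatio = 1.2;         // +20% for tmp, logs...
          double compressionRatio = 0.6;  // effective data compression
          // Raw = usable x replication x temp x compression = 864 TB
          double rawTB = usableDataTB * replication * tempRatio * compressionRatio;
          double dataNodeTB = 12 * 2;     // 12 x 2TB disks per DataNode = 24 TB
          int dataNodes = (int) Math.ceil(rawTB / dataNodeTB);   // 36 DataNodes
          int rackUnits = 42, switchRU = 2, ruPerNode = 2;
          // (36 x 2) / (42 - 2) = 1.8, rounded up to 2 data racks
          int racks = (int) Math.ceil(
              (double) (dataNodes * ruPerNode) / (rackUnits - switchRU));
          System.out.printf("raw=%.0fTB dataNodes=%d dataRacks=%d%n",
                            rawTB, dataNodes, racks);
      }
  }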
HADOOP ARCHITECTURE (EXAMPLE)
• Edge nodes – 2 x EdgeNode: 2 x 6-core CPUs, 96GB RAM, 6 x 600GB 15K HDD (RAID10), 2 x 10GbE ports
• Control nodes – 2 x NameNode / Standby NameNode: 2 x 6-core CPUs, 96GB RAM, 6 x 600GB 15K HDD (RAID10), 2 x 10GbE ports
• Control nodes – 1 x AdminNode: 2 x 6-core CPUs, 48GB RAM, 6 x 600GB 15K HDD (RAID10), 2 x 10GbE ports
• Worker nodes – 3 to n x DataNode: 2 x 6-core CPUs, 48GB RAM, 12 x 3TB 7.5K HDD, 2 x 10GbE ports
• All nodes connected through a network switch
RACKS CONFIGURATION OVERVIEW
• Control racks (Rack1, Rack2): switches, EdgeNodes, NameNode / Standby NameNode, Admin Nodes
• Data racks (Rack3 to Rack6): switches and DataNodes
• The racks connect upstream to the cloud (diagram)
POC CONFIGURATION (EXAMPLE)
• Architecture example: the exact configuration and sizing is designed according to the customer's needs
• The AdminNode is on the Standby NameNode server
• Zookeeper processes run on the NameNode and Standby NameNode servers
• Edge nodes – 1 x EdgeNode: 2 x 6-core CPUs, 32GB RAM, 6 x 600GB 15K HDD (RAID10), 2 x 10GbE ports
• Control nodes – 1 x NameNode and 1 x Standby NameNode: 2 x 6-core CPUs, 96GB RAM, 6 x 600GB 15K HDD (RAID10), 2 x 10GbE ports
• Worker nodes – 3 x DataNode: 2 x 6-core CPUs, 48GB RAM, 12 x 1TB 7.2K HDD, 2 x 10GbE ports
• All nodes connected through a network switch
HADOOP BENCHMARKS
• Designing appropriate hardware for a Hadoop cluster requires benchmarking or a POC and careful planning to fully understand the workload. However, Hadoop clusters are commonly heterogeneous, and deploying initial hardware with balanced specifications is recommended when getting started
• HiBench, a Hadoop benchmark suite constructed by Intel, is used intensively for Hadoop benchmarking, tuning and optimization
• A set of representative Hadoop programs including both micro-benchmarks and more "real world" applications such as search, machine learning and Hive queries
• Source: Intel Cloud Builder Guide to Cloud Design and Deployment on Intel Platform – Apache Hadoop – February 2012
HIBENCH
• Micro-benchmarks
- Sort: sorts its binary input data, which is generated using the Hadoop RandomTextWriter example
- WordCount: counts the occurrences of each word in input data generated using Hadoop RandomTextWriter
- TeraSort: a standard benchmark for large-size data sorting, with input generated by the TeraGen program
- DFSIO: computes the aggregated bandwidth by sampling the number of bytes read/written at fixed time intervals in each map task
• Web search
- Nutch Indexing: tests the indexing subsystem of Nutch, a popular Apache open-source search engine; the Nutch crawler subsystem crawls an in-house Wikipedia mirror and generates 8.4GB of compressed data (about 2.4 million web pages) as workload input
- Page Rank: an open-source implementation of the page-rank algorithm, a link analysis algorithm used widely in web search engines
• Machine learning
- K-Means Clustering: a typical MapReduce application area for large-scale data mining and machine learning
- Bayesian Classification: tests the naive Bayesian trainer (a well-known classification algorithm for knowledge discovery and data mining) in Mahout, an Apache open-source machine-learning library
• Analytical query
- Hive Join: models complex analytic queries of structured (relational) tables by computing both the average and the sum for each group by joining two different tables
- Hive Aggregation: models complex analytic queries of structured (relational) tables by computing the sum of each group over a single read-only table
HADOOP TERASORT WORKFLOW
• Teragen is a utility included with Hadoop for creating the data sets used by Terasort. Teragen uses Hadoop's parallel framework to quickly create large data sets that can be manipulated; the time to create a given data set is an important data point when tracking the performance of a Hadoop environment
• Terasort benchmarks the HDFS and MapReduce functions of the Hadoop cluster. Terasort is a compute-intensive operation that uses the Teragen output as its input: it reads the data created by Teragen into the system's physical memory, sorts it, and writes it back out to HDFS, exercising all portions of the Hadoop environment in the process
• Teravalidate ensures the data produced by Terasort is accurate: it runs across the Terasort output, verifies all data is properly sorted with no errors produced, and reports the status of the results to the user
• Workflow: start Terasort (control, 1 node) → create data (map, n nodes / n tasks) → sort data (map, n nodes / n tasks) → reduce and combine sort results (reduce, n nodes / n tasks) → complete Terasort (control, 1 node)