SlideShare a Scribd company logo
1 of 49
BIG DATA &
HADOOP
FRAMEWORK

1
2

Who am I?
 Name:

Pham Phuong Tu
 Work as R&D developer at VC Corp
 Related Projects: muachung.vn,
enbac.com, rongbay.com, …
 Interest with system & software
architecture, big data, data statistic &
analytic
 Email: phamptu@gmail.com
3

Statistic System
4

Statistic System
Measuring Marketing Effective
5

User Activity Recorder
Mouse Click
6

User Activity Recorder
Mouse Click
7

User Activity Recorder
Mouse Scroll
8

Mining System Log
9

Table of content



Challenger
Big Data









Hadoop Framework










Overview
Data type
What – Who – Why
Extract, transform, load data
Data operation
Big data platform
Overview
History
User
Architecture
Map Reduce
Hadoop Environment

Q&A
Demo
10

CHALLENGER
11
12
13
14

BIG DATA
15
16

Analyze All Available Data
Data warehouse

Social Media

Website

Billing
ERP

CRM

Devices

Network Switches
17

Type of Data
 Plain

Text Data (Web)
 Semi-structured Data (XML, Json)
 Relational Data (Tables/Transaction/Legacy
Data)
 Graph Data


Social Network, Semantic Web (RDF), …

 Multi


Media Data

Image, Video, …

 Streaming


Data

You can only scan the data once
18

What is collecting all
this data?








Web Browsers
Web Sites (Search
Engine, Social Network,
E-commerce Platform…)
Applications
Computer, Smartphone,
Tablet, Games Boxes
Other System (Banking,
Phone, Medical, GPS)
Internet Service
Providers
19

Who is collecting your data?
 Government

Agencies

 Companies
 Service

Provider
 Big Stores
20

Why are they collecting
your data?


Search




Recommendation Systems




New York Times, Eyealike

Target Marketing




Facebook, AOL

Video and Image Analysis




Facebook, Yahoo, Google

Data Warehouse




Facebook, Amazon

Log analytic




Yahoo, Amazon, Zvents

Google Ads, Facebook Ads

Business strategy


Walmart
21

ETL
 Extract:

To convert the data into a single format
appropriate for transformation processing.
 Transform: Applies a series of rules or functions to
the extracted data from the source.
 Load: Loads the data into the end target, usually
the Data Warehouse.
22

Real-life ETL cycle
The typical real-life ETL cycle consists of the following
execution steps:
Cycle initiation
Build reference data
Extract (from sources)
Validate
Transform (clean, apply business rules, check for data
integrity, create aggregates or disaggregates)
Stage (load into staging tables, if used)
Audit reports (for example, on compliance with
business rules. Also, in case of failure, helps to
diagnose/repair)
Publish (to target tables)
Archive
Clean up
23

Operational data
with current implement
24

Operational data
with big data implement
25

Big Data Platform
Understand and navigate
federated big data sources

Federated Discovery and
Navigation

Manage & store huge
volume of any data

Hadoop File System
MapReduce

Structure and control data

Data Warehousing

Manage streaming data

Stream Computing

Analyze unstructured data

Text Analytics Engine

Integrate and govern all
data sources

Integration, Data Quality,
Security, Lifecycle Management,
MDM
26

HADOOP
FRAMEWORK
27

Single-node architecture
CPU
Machine Learning, Statistics
Memory
“Classical” Data Mining
Disk
28

Cluster Architecture
2-10 Gbps backbone between racks
1 Gbps between
any pair of nodes
in a rack

Switch

Switch

CPU
Mem
Disk

Switch

CPU

…

CPU

Mem

Mem

Disk

Disk

Each rack contains 16-64 nodes

CPU

…

Mem
Disk
29

Hadoop History
 Dec

2004 – Google GFS paper published
 July 2005 – Nutch uses MapReduce
 Feb 2006 – Becomes Lucene subproject
 Apr 2007 – Yahoo! on 1000-node cluster
 Jan 2008 – An Apache Top Level Project
 Jul 2008 – A 4000 node test cluster
30

Who uses Hadoop?
 Google,

Yahoo, Bing
 Amazon, Ebay, Alibaba
 Facebook, Twitter
 IBM, HP. Toshiba, Intel
 New York Times, BBC
 Line, Wechat
 VC Corp, FPT, VNG, VTC
31

Hadoop Components
 Distributed



file system (HDFS)

Single namespace for entire cluster
Replicates data 3x for fault-tolerance

 MapReduce



framework

Executes user jobs specified as “map” and
“reduce” functions
Manages work distribution & fault-tolerance
32
33

Goals of HDFS


Very Large Distributed File System




10K nodes, 100 million files, 10 PB

Assumes Commodity Hardware


Files are replicated to handle hardware failure

Detect failures and recovers from them
Optimized for Batch Processing






Provides very high aggregate bandwidth
User Space, runs on heterogeneous OS




Data locations exposed so that computations
can move to where data resides
34

HDFS Multi Cluster
35

Hadoop 1.0 vs 2.0
36

NameNode


Meta-data in Memory


The entire metadata is in main memory

No demand paging of meta-data
Types of Metadata






List of files

List of Blocks for each file
List of DataNodes for each block
File attributes, e.g creation time, replication
factor
A Transaction Log








Records file creations, file deletions. etc
37

DataNode


A Block Server


Stores data in the local file system (e.g. ext3)

Stores meta-data of a block
 Serves data and meta-data to Clients
Block Report








Periodically sends a report of all existing blocks
to the NameNode

Facilitates Pipelining of Data


Forwards data to other specified DataNodes
38

Block Placement
 Current


Strategy

One replica on local node

 Second

replica on a remote rack
 Third replica on same remote rack
 Additional replicas are randomly
placed
 Clients read from nearest replica
 Would like to make this policy pluggable
39

Data Correctness


Use Checksums to validate data




Use CRC32

File Creation


Client computes checksum per 512 byte

DataNode stores the checksum
File access






Client retrieves the data and checksum from
DataNode



If Validation fails, Client tries other replicas
40

NameNode Failure
A

single point of failure
 Transaction Log stored in multiple
directories


A directory on the local file system

A

directory on a remote file system
(NFS/CIFS)
 Need to develop a real HA solution
41

Data Pipelining
 Client

retrieves a list of DataNodes on
which to place replicas of a block
 Client writes block to the first DataNode
 The first DataNode forwards the data to
the next DataNode in the Pipeline
 When all replicas are written, the Client
moves on to write the next block in file
42

Rebalancer
 Goal:

% disk full on DataNodes should be

similar





Usually run when new DataNodes are added
Cluster is online when Rebalancer is active
Rebalancer is throttled to avoid network
congestion
Command line tool
43

What is MapReduce?
 Simple

data-parallel programming model designed
for scalability and fault-tolerance

 Pioneered


by Google

Processes 20 petabytes of data per day

 Popularized


by open-source Hadoop project

Used at Yahoo!, Facebook, Amazon, …
44

Map Reduce Data Flow
45

Execution
46

Parallel Execution
47

Example: Word count process
48

Hadoop Environment
49

More Related Content

What's hot

facebook architecture for 600M users
facebook architecture for 600M usersfacebook architecture for 600M users
facebook architecture for 600M usersJongyoon Choi
 
Facebook Architecture - Breaking it Open
Facebook Architecture - Breaking it OpenFacebook Architecture - Breaking it Open
Facebook Architecture - Breaking it OpenHARMAN Services
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingBart Vandewoestyne
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormMd. Shamsur Rahim
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Introduction to HADOOP.pdf
Introduction to HADOOP.pdfIntroduction to HADOOP.pdf
Introduction to HADOOP.pdf8840VinayShelke
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopApache Apex
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Simplilearn
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 

What's hot (20)

BIGDATA ANALYTICS LAB MANUAL final.pdf
BIGDATA  ANALYTICS LAB MANUAL final.pdfBIGDATA  ANALYTICS LAB MANUAL final.pdf
BIGDATA ANALYTICS LAB MANUAL final.pdf
 
facebook architecture for 600M users
facebook architecture for 600M usersfacebook architecture for 600M users
facebook architecture for 600M users
 
Facebook Architecture - Breaking it Open
Facebook Architecture - Breaking it OpenFacebook Architecture - Breaking it Open
Facebook Architecture - Breaking it Open
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache Storm
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to HADOOP.pdf
Introduction to HADOOP.pdfIntroduction to HADOOP.pdf
Introduction to HADOOP.pdf
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 

Viewers also liked

Lecture 16
Lecture 16Lecture 16
Lecture 16Shani729
 
Data warehouse solutions
Data warehouse solutionsData warehouse solutions
Data warehouse solutionsTu Pham
 
Big data in action
Big data in actionBig data in action
Big data in actionTu Pham
 
Recommendation system for ecommerce
Recommendation system for ecommerceRecommendation system for ecommerce
Recommendation system for ecommerceTu Pham
 
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB�MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB�
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUBTu Pham
 
Database, data storage, hosting with Firebase
Database, data storage, hosting with FirebaseDatabase, data storage, hosting with Firebase
Database, data storage, hosting with FirebaseTu Pham
 
Building Reactive Applications With Akka And Java
Building Reactive Applications With Akka And JavaBuilding Reactive Applications With Akka And Java
Building Reactive Applications With Akka And JavaTu Pham
 
Understanding Kubernetes
Understanding KubernetesUnderstanding Kubernetes
Understanding KubernetesTu Pham
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Etu DW Offload 解放資料倉儲的運算效能
Etu DW Offload 解放資料倉儲的運算效能Etu DW Offload 解放資料倉儲的運算效能
Etu DW Offload 解放資料倉儲的運算效能Etu Solution
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 
Big data initiative justification and prioritization framework
Big data initiative justification and prioritization frameworkBig data initiative justification and prioritization framework
Big data initiative justification and prioritization frameworkNeerajsabhnani
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataCyanny LIANG
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAPaolo Platter
 
Guagua an iterative computing framework on hadoop
Guagua an iterative computing framework on hadoopGuagua an iterative computing framework on hadoop
Guagua an iterative computing framework on hadooppengshanzhang
 
Agile Lab_BigData_Meetup
Agile Lab_BigData_MeetupAgile Lab_BigData_Meetup
Agile Lab_BigData_MeetupPaolo Platter
 

Viewers also liked (20)

Lecture 16
Lecture 16Lecture 16
Lecture 16
 
Data warehouse solutions
Data warehouse solutionsData warehouse solutions
Data warehouse solutions
 
Big data in action
Big data in actionBig data in action
Big data in action
 
Recommendation system for ecommerce
Recommendation system for ecommerceRecommendation system for ecommerce
Recommendation system for ecommerce
 
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB�MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB�
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Database, data storage, hosting with Firebase
Database, data storage, hosting with FirebaseDatabase, data storage, hosting with Firebase
Database, data storage, hosting with Firebase
 
Building Reactive Applications With Akka And Java
Building Reactive Applications With Akka And JavaBuilding Reactive Applications With Akka And Java
Building Reactive Applications With Akka And Java
 
Understanding Kubernetes
Understanding KubernetesUnderstanding Kubernetes
Understanding Kubernetes
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Etu DW Offload 解放資料倉儲的運算效能
Etu DW Offload 解放資料倉儲的運算效能Etu DW Offload 解放資料倉儲的運算效能
Etu DW Offload 解放資料倉儲的運算效能
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
eCommerce Justification Presentation
eCommerce Justification Presentation eCommerce Justification Presentation
eCommerce Justification Presentation
 
Big data initiative justification and prioritization framework
Big data initiative justification and prioritization frameworkBig data initiative justification and prioritization framework
Big data initiative justification and prioritization framework
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big data
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKA
 
Guagua an iterative computing framework on hadoop
Guagua an iterative computing framework on hadoopGuagua an iterative computing framework on hadoop
Guagua an iterative computing framework on hadoop
 
Agile Lab_BigData_Meetup
Agile Lab_BigData_MeetupAgile Lab_BigData_Meetup
Agile Lab_BigData_Meetup
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 

Similar to Big data & hadoop framework

A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataRobert Grossman
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom IndustryCloudera, Inc.
 
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...i_scienceEU
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big DataFrank Kienle
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentHostedbyConfluent
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreHPCC Systems
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...MLconf
 
Sensor metadata management with SWM (SMWCon fall 2013)
Sensor metadata management with SWM (SMWCon fall 2013)Sensor metadata management with SWM (SMWCon fall 2013)
Sensor metadata management with SWM (SMWCon fall 2013)jwnoteboom
 
Smith T Bio Hdf Bosc2008
Smith T Bio Hdf Bosc2008Smith T Bio Hdf Bosc2008
Smith T Bio Hdf Bosc2008bosc_2008
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of MetadataJim Dowling
 
詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systemshdhappy001
 

Similar to Big data & hadoop framework (20)

A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
 
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Bigdata
BigdataBigdata
Bigdata
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
 
Matlab, Big Data, and HDF Server
Matlab, Big Data, and HDF ServerMatlab, Big Data, and HDF Server
Matlab, Big Data, and HDF Server
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
Sensor metadata management with SWM (SMWCon fall 2013)
Sensor metadata management with SWM (SMWCon fall 2013)Sensor metadata management with SWM (SMWCon fall 2013)
Sensor metadata management with SWM (SMWCon fall 2013)
 
BigData
BigDataBigData
BigData
 
Smith T Bio Hdf Bosc2008
Smith T Bio Hdf Bosc2008Smith T Bio Hdf Bosc2008
Smith T Bio Hdf Bosc2008
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems
 

More from Tu Pham

Go from idea to app with no coding using AppSheet.pptx
Go from idea to app with no coding using AppSheet.pptxGo from idea to app with no coding using AppSheet.pptx
Go from idea to app with no coding using AppSheet.pptxTu Pham
 
Secure your app against DDOS, API Abuse, Hijacking, and Fraud
 Secure your app against DDOS, API Abuse, Hijacking, and Fraud Secure your app against DDOS, API Abuse, Hijacking, and Fraud
Secure your app against DDOS, API Abuse, Hijacking, and FraudTu Pham
 
Challenges In Implementing SRE
Challenges In Implementing SREChallenges In Implementing SRE
Challenges In Implementing SRETu Pham
 
IT Strategy
IT Strategy IT Strategy
IT Strategy Tu Pham
 
Set up Learn and Development program
Set up Learn and Development programSet up Learn and Development program
Set up Learn and Development programTu Pham
 
Cost Management For IT Project / Product
Cost Management For IT Project / ProductCost Management For IT Project / Product
Cost Management For IT Project / ProductTu Pham
 
Minimum Viable Product 101
Minimum Viable Product 101Minimum Viable Product 101
Minimum Viable Product 101Tu Pham
 
Understand your customers
Understand your customersUnderstand your customers
Understand your customersTu Pham
 
Let's build great products for mid-size companies
Let's build great products for mid-size companiesLet's build great products for mid-size companies
Let's build great products for mid-size companiesTu Pham
 
Latency Control And Supervision In Resilience Design Patterns
Latency Control And Supervision In Resilience Design Patterns Latency Control And Supervision In Resilience Design Patterns
Latency Control And Supervision In Resilience Design Patterns Tu Pham
 
End To End Business Intelligence On Google Cloud
End To End Business Intelligence On Google CloudEnd To End Business Intelligence On Google Cloud
End To End Business Intelligence On Google CloudTu Pham
 
High Output Tech Management
High Output Tech Management High Output Tech Management
High Output Tech Management Tu Pham
 
Big Data Driven At Eway
Big Data Driven At Eway Big Data Driven At Eway
Big Data Driven At Eway Tu Pham
 
Security On The Cloud
Security On The CloudSecurity On The Cloud
Security On The CloudTu Pham
 
Eway Tech Talk #2 Coding Guidelines
Eway Tech Talk #2 Coding GuidelinesEway Tech Talk #2 Coding Guidelines
Eway Tech Talk #2 Coding GuidelinesTu Pham
 
End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud Tu Pham
 
Eway Tech Talk #0 Knowledge Sharing
Eway Tech Talk #0 Knowledge SharingEway Tech Talk #0 Knowledge Sharing
Eway Tech Talk #0 Knowledge SharingTu Pham
 
Php 5.6 vs Php 7 performance comparison
Php 5.6 vs Php 7 performance comparisonPhp 5.6 vs Php 7 performance comparison
Php 5.6 vs Php 7 performance comparisonTu Pham
 
System Security on Cloud
System Security on CloudSystem Security on Cloud
System Security on CloudTu Pham
 
Big Data at DYNO
Big Data at DYNOBig Data at DYNO
Big Data at DYNOTu Pham
 

More from Tu Pham (20)

Go from idea to app with no coding using AppSheet.pptx
Go from idea to app with no coding using AppSheet.pptxGo from idea to app with no coding using AppSheet.pptx
Go from idea to app with no coding using AppSheet.pptx
 
Secure your app against DDOS, API Abuse, Hijacking, and Fraud
 Secure your app against DDOS, API Abuse, Hijacking, and Fraud Secure your app against DDOS, API Abuse, Hijacking, and Fraud
Secure your app against DDOS, API Abuse, Hijacking, and Fraud
 
Challenges In Implementing SRE
Challenges In Implementing SREChallenges In Implementing SRE
Challenges In Implementing SRE
 
IT Strategy
IT Strategy IT Strategy
IT Strategy
 
Set up Learn and Development program
Set up Learn and Development programSet up Learn and Development program
Set up Learn and Development program
 
Cost Management For IT Project / Product
Cost Management For IT Project / ProductCost Management For IT Project / Product
Cost Management For IT Project / Product
 
Minimum Viable Product 101
Minimum Viable Product 101Minimum Viable Product 101
Minimum Viable Product 101
 
Understand your customers
Understand your customersUnderstand your customers
Understand your customers
 
Let's build great products for mid-size companies
Let's build great products for mid-size companiesLet's build great products for mid-size companies
Let's build great products for mid-size companies
 
Latency Control And Supervision In Resilience Design Patterns
Latency Control And Supervision In Resilience Design Patterns Latency Control And Supervision In Resilience Design Patterns
Latency Control And Supervision In Resilience Design Patterns
 
End To End Business Intelligence On Google Cloud
End To End Business Intelligence On Google CloudEnd To End Business Intelligence On Google Cloud
End To End Business Intelligence On Google Cloud
 
High Output Tech Management
High Output Tech Management High Output Tech Management
High Output Tech Management
 
Big Data Driven At Eway
Big Data Driven At Eway Big Data Driven At Eway
Big Data Driven At Eway
 
Security On The Cloud
Security On The CloudSecurity On The Cloud
Security On The Cloud
 
Eway Tech Talk #2 Coding Guidelines
Eway Tech Talk #2 Coding GuidelinesEway Tech Talk #2 Coding Guidelines
Eway Tech Talk #2 Coding Guidelines
 
End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud
 
Eway Tech Talk #0 Knowledge Sharing
Eway Tech Talk #0 Knowledge SharingEway Tech Talk #0 Knowledge Sharing
Eway Tech Talk #0 Knowledge Sharing
 
Php 5.6 vs Php 7 performance comparison
Php 5.6 vs Php 7 performance comparisonPhp 5.6 vs Php 7 performance comparison
Php 5.6 vs Php 7 performance comparison
 
System Security on Cloud
System Security on CloudSystem Security on Cloud
System Security on Cloud
 
Big Data at DYNO
Big Data at DYNOBig Data at DYNO
Big Data at DYNO
 

Recently uploaded

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 

Big data & hadoop framework

  • 2. 2 Who am I?  Name: Pham Phuong Tu  Work as R&D developer at VC Corp  Related Projects: muachung.vn, enbac.com, rongbay.com, …  Interest with system & software architecture, big data, data statistic & analytic  Email: phamptu@gmail.com
  • 9. 9 Table of content   Challenger Big Data        Hadoop Framework         Overview Data type What – Who – Why Extract, transform, load data Data operation Big data platform Overview History User Architecture Map Reduce Hadoop Environment Q&A Demo
  • 11. 11
  • 12. 12
  • 13. 13
  • 15. 15
  • 16. 16 Analyze All Available Data Data warehouse Social Media Website Billing ERP CRM Devices Network Switches
  • 17. 17 Type of Data  Plain Text Data (Web)  Semi-structured Data (XML, Json)  Relational Data (Tables/Transaction/Legacy Data)  Graph Data  Social Network, Semantic Web (RDF), …  Multi  Media Data Image, Video, …  Streaming  Data You can only scan the data once
  • 18. 18 What is collecting all this data?       Web Browsers Web Sites (Search Engine, Social Network, E-commerce Platform…) Applications Computer, Smartphone, Tablet, Games Boxes Other System (Banking, Phone, Medical, GPS) Internet Service Providers
  • 19. 19 Who is collecting your data?  Government Agencies  Companies  Service Provider  Big Stores
  • 20. 20 Why are they collecting your data?  Search   Recommendation Systems   New York Times, Eyealike Target Marketing   Facebook, AOL Video and Image Analysis   Facebook, Yahoo, Google Data Warehouse   Facebook, Amazon Log analytic   Yahoo, Amazon, Zvents Google Ads, Facebook Ads Business strategy  Walmart
  • 21. 21 ETL  Extract: To convert the data into a single format appropriate for transformation processing.  Transform: Applies a series of rules or functions to the extracted data from the source.  Load: Loads the data into the end target, usually the Data Warehouse.
  • 22. 22 Real-life ETL cycle The typical real-life ETL cycle consists of the following execution steps: Cycle initiation Build reference data Extract (from sources) Validate Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates) Stage (load into staging tables, if used) Audit reports (for example, on compliance with business rules. Also, in case of failure, helps to diagnose/repair) Publish (to target tables) Archive Clean up
  • 25. 25 Big Data Platform Understand and navigate federated big data sources Federated Discovery and Navigation Manage & store huge volume of any data Hadoop File System MapReduce Structure and control data Data Warehousing Manage streaming data Stream Computing Analyze unstructured data Text Analytics Engine Integrate and govern all data sources Integration, Data Quality, Security, Lifecycle Management, MDM
  • 27. 27 Single-node architecture CPU Machine Learning, Statistics Memory “Classical” Data Mining Disk
  • 28. 28 Cluster Architecture 2-10 Gbps backbone between racks 1 Gbps between any pair of nodes in a rack Switch Switch CPU Mem Disk Switch CPU … CPU Mem Mem Disk Disk Each rack contains 16-64 nodes CPU … Mem Disk
  • 29. 29 Hadoop History  Dec 2004 – Google GFS paper published  July 2005 – Nutch uses MapReduce  Feb 2006 – Becomes Lucene subproject  Apr 2007 – Yahoo! on 1000-node cluster  Jan 2008 – An Apache Top Level Project  Jul 2008 – A 4000 node test cluster
  • 30. 30 Who uses Hadoop?  Google, Yahoo, Bing  Amazon, Ebay, Alibaba  Facebook, Twitter  IBM, HP. Toshiba, Intel  New York Times, BBC  Line, Wechat  VC Corp, FPT, VNG, VTC
  • 31. 31 Hadoop Components  Distributed   file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance  MapReduce   framework Executes user jobs specified as “map” and “reduce” functions Manages work distribution & fault-tolerance
  • 32. 32
  • 33. 33 Goals of HDFS  Very Large Distributed File System   10K nodes, 100 million files, 10 PB Assumes Commodity Hardware  Files are replicated to handle hardware failure Detect failures and recovers from them Optimized for Batch Processing    Provides very high aggregate bandwidth User Space, runs on heterogeneous OS   Data locations exposed so that computations can move to where data resides
  • 36. 36 NameNode  Meta-data in Memory  The entire metadata is in main memory No demand paging of meta-data Types of Metadata    List of files List of Blocks for each file List of DataNodes for each block File attributes, e.g creation time, replication factor A Transaction Log      Records file creations, file deletions. etc
  • 37. 37 DataNode  A Block Server  Stores data in the local file system (e.g. ext3) Stores meta-data of a block  Serves data and meta-data to Clients Block Report     Periodically sends a report of all existing blocks to the NameNode Facilitates Pipelining of Data  Forwards data to other specified DataNodes
  • 38. 38 Block Placement  Current  Strategy One replica on local node  Second replica on a remote rack  Third replica on same remote rack  Additional replicas are randomly placed  Clients read from nearest replica  Would like to make this policy pluggable
  • 39. 39 Data Correctness  Use Checksums to validate data   Use CRC32 File Creation  Client computes checksum per 512 byte DataNode stores the checksum File access    Client retrieves the data and checksum from DataNode  If Validation fails, Client tries other replicas
  • 40. 40 NameNode Failure A single point of failure  Transaction Log stored in multiple directories  A directory on the local file system A directory on a remote file system (NFS/CIFS)  Need to develop a real HA solution
  • 41. 41 Data Pipelining  Client retrieves a list of DataNodes on which to place replicas of a block  Client writes block to the first DataNode  The first DataNode forwards the data to the next DataNode in the Pipeline  When all replicas are written, the Client moves on to write the next block in file
  • 42. 42 Rebalancer  Goal: % disk full on DataNodes should be similar     Usually run when new DataNodes are added Cluster is online when Rebalancer is active Rebalancer is throttled to avoid network congestion Command line tool
  • 43. 43 What is MapReduce?  Simple data-parallel programming model designed for scalability and fault-tolerance  Pioneered  by Google Processes 20 petabytes of data per day  Popularized  by open-source Hadoop project Used at Yahoo!, Facebook, Amazon, …
  • 49. 49

Editor's Notes

  1. - Nói về Google Analytic
  2. Sự tăng trưởng ngày càng nhanh của dữ liệu (độ lớn, chủng loải) thông qua số liệu chi tiết Giá trị big data đem lại với các ngành (đặc biệt bán lẻ, tư vấn, vận chuyển, xây dựng, …) Khó khăn (57.6% tổ chức gặp khó khăn) The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[14] as of 2012, every day 2.5 exabytes (2.5×1018) of data were created.[15] The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.[16]
  3. - Các khó khăn thường gặp trong big data (Độ phức tạp dữ liệu, Độ lớn dữ liệu, Hiệu năng, Kĩ năng nhân viên, Sự tăng trưởng của dữ liệu, Giá thành)
  4. - 3 vấn đề lớn với big data
  5. Tổng kết big data là gì Big data[1][2] is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3]search, sharing, transfer, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases,link legal citations, combat crime, and determine real-time roadway traffic conditions."
  6. Big data = Giao dịch + Tương tác + Theo dõi ERP (Enterprise Resource Planning) CRM (Customer Relationship Management) Đi sâu vào log, user click stream, afilliate networks, behavioral targeting
  7. Website Billing ERP Device Network Social Media
  8. - Ví du chi tiết từng loại dữ liệu
  9. - Các nguồn thu thập dữ liệu
  10. - Các đơn vị thu thập dữ liệu
  11. - Mục đích thu thập dữ liệu
  12. Extract (Each separate system may also use a different data organization and/or format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data structures) Transform – PRE PROCESS (Selecting only certain columns to load (or selecting nullcolumns not to load), Translating coded values, Encoding free-form values, Deriving a new calculated value, Sorting, Joining data from multiple sources, Aggregation, Generating surrogate-key values, Transposing or pivoting, Splitting a column into multiple columns, Disaggregation of repeating columns into a separate detail table, Lookup and validate the relevant data from tables or referential files for slowly changing dimensions, Applying any form of simple or complex data validation.) Load ( Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; frequently, updating extracted data is done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in a historical form at regular intervals -- for example, hourly. To understand this, consider a data warehouse that is required to maintain sales records of the last year. This data warehouse overwrites any data older than a year with newer data. However, the entry of data for any one year window is made in a historical manner.)
  13. Reference data are data from outside the organization (often from standards organizations) which is, apart from occasional revisions, static. This non-dynamic data is sometimes also known as "standing data".[1] Examples would be currency codes,Countries (in this case covered by a global standard ISO 3166-1) etc. Reference data should be distinguished[2] from "Master Data" which is also relatively static data but originating from within the organization e.g. products, departments, even customers. A staging table is just a regular SQL server table. For example, if you have a process that imports some data from say .CSV files then you put this data in a staging table. You may then decide to apply some data cleaning or business rules to the data and move it to a different staging tables etc..
  14. Tệ nhất: + No log Phổ biến: + Chỉ phân tích được một phần nhỏ dữ liệu hiện tại + No data warehouse
  15. Lưu trữ lâu dài trong kho dữ liệu Phục vụ phân tích dữ liệu
  16. Vấn đề về scale up server, fault tolerance, performance
  17. Vấn đề quản lý lỗi (thiết bị, hệ thống), raid, network device, performance
  18. Sử dụng hadoop để làm gì Các hướng (core, distribution)
  19. MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.[1] A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy andfault tolerance. The model is inspired by the map and reduce functions commonly used in functional programming,[2] although their purpose in the MapReduce framework is not the same as in their original forms.[3] Furthermore, the key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine once. MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation is Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology but has since been genericized.
  20. View from task perspective
  21. View from scheduled m/c perspective
  22. - Giới thiệu các sub project
  23. Kết luận tương lai big data Q&A