Overview
 Big Data
 3 Vs of Big Data
 Hadoop
 HDFS
 Map Reduce
 Big Data Market Size
 Big Data in India
What is Big Data?
It's data that is created very fast and is too big to
be processed on a single machine. This data comes
from various sources in various formats.
o Order details for a store
o All orders across 100s of stores
o A person’s stock portfolio
o All stock transactions for a stock exchange
How do the 3 Vs define Big Data?
 Volume: Large volumes of data
 Velocity: Quickly moving data
 Variety: Structured, unstructured,
images, etc.
Volume
Volume is the size of the data, which determines the value and potential of the
data under consideration. The name ‘Big Data’ itself refers to size, hence
this characteristic.
Variety
Data today comes in all types of formats: structured data in traditional
databases; unstructured text documents, email, stock ticker data and
financial transactions; and semi-structured data too.
Velocity
Velocity is the speed at which data is generated and processed to
meet the demands and challenges that lie ahead on the path of growth and
development.
Why Big Data ?
 The real issue is not that you are acquiring large amounts of data. It's
what you do with the data that counts. The hopeful vision is that
organizations will be able to take data from any source, harness
relevant data and analyse it to find answers that enable
 1) cost reductions
 2) time reductions
 3) new product development and optimized offerings
 4) smarter business decision making
What is Hadoop?
 Hadoop is a distributed file system and data processing engine that is
designed to handle extremely high volumes of data in any structure.
 Hadoop has two components:
 The Hadoop Distributed File System (HDFS), which supports data in structured
relational form, in unstructured form, and in any form in between
 The MapReduce programming paradigm for managing applications on multiple
distributed servers
 The focus is on supporting redundancy, distributed architectures, and
parallel processing
 Low cost: The open-source framework is free and uses commodity hardware to
store large quantities of data.
 Computing power: Its distributed computing model can quickly process very large
volumes of data.
 Scalability: You can easily grow your system simply by adding more nodes with
little administration.
 Storage flexibility: Unlike traditional relational databases, you don’t have to preprocess
data before storing it. You can store as much data as you want.
 Inherent data protection: Data and application processing are protected against
hardware failure.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed
file system designed to run on commodity hardware. It’s a
scalable file system that distributes and stores data across
all machines in a Hadoop cluster.
HDFS Architecture
HDFS has a master/slave architecture.
An HDFS cluster consists of:
 A single NameNode, a master server that manages the file system
namespace and regulates access to files by clients.
 A number of DataNodes, which manage storage attached to the nodes
that they run on. Internally, a file is split into one or more blocks and
these blocks are stored in DataNodes.
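The split of responsibilities between the NameNode and the DataNodes can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not the real HDFS API: the class names, the 4-byte block size, the block ID format and the round-robin placement are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Stores block contents, keyed by block ID."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""
    datanodes: list
    block_size: int = 4  # tiny illustrative value, in bytes
    namespace: dict = field(default_factory=dict)  # path -> [(block_id, node name)]

    def write(self, path: str, data: bytes) -> None:
        entries = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{path}#blk{index}"  # hypothetical ID format
            # Round-robin placement is an assumption made for this sketch.
            node = self.datanodes[index % len(self.datanodes)]
            node.blocks[block_id] = data[offset:offset + self.block_size]
            entries.append((block_id, node.name))
        self.namespace[path] = entries

    def locate(self, path: str) -> list:
        """Return the ordered (block_id, node name) pairs for a file."""
        return self.namespace[path]


def read_file(namenode: NameNode, path: str) -> bytes:
    """Client side: ask the NameNode for locations, then fetch each block."""
    by_name = {node.name: node for node in namenode.datanodes}
    return b"".join(by_name[name].blocks[bid] for bid, name in namenode.locate(path))


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
nn = NameNode(nodes)
nn.write("/logs/app.log", b"hello hdfs!")
print(nn.locate("/logs/app.log"))
print(read_file(nn, "/logs/app.log"))
```

The NameNode in this sketch never stores file contents, only the mapping from paths to blocks and from blocks to nodes, which mirrors the role described above.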
Files in HDFS
The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can
create directories and store files inside these directories. The NameNode maintains the file
system namespace. Any change to the file system namespace or its properties is recorded
by the NameNode.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores
each file as a sequence of blocks; all blocks in a file except the last block are the same size.
The blocks of a file are replicated for fault tolerance. The block size and replication factor are
configurable per file.
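The block layout described under Data Replication comes down to simple arithmetic, sketched below. The 128 MB block size and the replication factor of 3 are illustrative values chosen for this example, and the function names are hypothetical.

```python
def split_into_blocks(file_size: int, block_size: int) -> list:
    """Return the size of each block: all equal except possibly the last."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def raw_storage(file_size: int, replication: int) -> int:
    """Total bytes stored across the cluster when every block is replicated."""
    return file_size * replication


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)
print([b // MB for b in blocks])        # [128, 128, 44]
print(raw_storage(300 * MB, 3) // MB)   # 900
```

A 300 MB file therefore occupies three blocks, and with three replicas of each block the cluster stores 900 MB in total.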
HDFS Robustness
The primary objective of HDFS is to store data reliably
even in the presence of failures. The common types of
failures are DataNode failures and NameNode failures.
Data Disk Failure and Re-Replication
DataNodes may lose connectivity with the NameNode. The NameNode detects this condition through missing heartbeat messages, marks those DataNodes as dead and
does not forward any new IO requests to them. The NameNode constantly tracks which blocks have fallen below their replication factor and initiates re-replication
whenever necessary.
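The failure handling described above can be sketched in two small steps: find the DataNodes that have gone quiet, then find the blocks that no longer have enough live replicas. This is a simplified illustration, not NameNode code: it assumes liveness is judged from the time of each node's last heartbeat, and the timeout, node names and block names are made up.

```python
def find_dead_nodes(last_heartbeat: dict, now: float, timeout: float) -> set:
    """DataNodes whose last heartbeat is older than `timeout` seconds."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def under_replicated(block_locations: dict, dead: set, replication: int) -> dict:
    """Map each affected block to the number of extra replicas it needs."""
    missing = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n not in dead]
        if len(live) < replication:
            missing[block] = replication - len(live)
    return missing


# Hypothetical cluster state: timestamps are seconds on an arbitrary clock.
last_heartbeat = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0}
block_locations = {"blk_1": ["dn1", "dn3"], "blk_2": ["dn1", "dn2"]}

dead = find_dead_nodes(last_heartbeat, now=101.0, timeout=30.0)
print(dead)                                        # {'dn3'}
print(under_replicated(block_locations, dead, 2))  # {'blk_1': 1}
```

Only blk_1 needs a new replica here, because blk_2 still has two copies on live nodes.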
Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS
instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple
copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and
EditLogs to get updated synchronously.
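The idea of updating several metadata copies synchronously can be illustrated with ordinary files. This is only a sketch: the file name `edits.log`, the record strings and the use of temporary directories as stand-ins for separate metadata disks are assumptions, and the real EditLog format is not reproduced here.

```python
import os
import tempfile


def append_edit(record: str, directories: list) -> None:
    """Append one edit record to the log copy in every directory.

    The call returns only after each copy has been flushed to disk,
    so all copies stay in step (a synchronous update).
    """
    for directory in directories:
        path = os.path.join(directory, "edits.log")  # hypothetical file name
        with open(path, "a", encoding="utf-8") as log:
            log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())


def read_log(directory: str) -> str:
    """Read back one copy of the log."""
    with open(os.path.join(directory, "edits.log"), encoding="utf-8") as log:
        return log.read()


# Two temporary directories stand in for two metadata disks.
dirs = [tempfile.mkdtemp(), tempfile.mkdtemp()]
append_edit("MKDIR /user/alice", dirs)
append_edit("CREATE /user/alice/data.csv", dirs)

copies = [read_log(d) for d in dirs]
print(copies[0] == copies[1])  # True
```

Writing every copy before returning costs some speed on each update, but it means any single copy can be used for recovery if another is corrupted.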
Mappers and Reducers
Mappers
 Mappers are small programs that each deal with a relatively small amount of data and work in parallel.
 A Mapper maps input to a set of intermediate key/value pairs.
 Once mapping is done, a MapReduce phase called shuffle and sort takes place on the intermediate data.
Reducers
 A Reducer reduces a set of intermediate values that share a key to a smaller set of values.
 It gets a key and the list of all values for that key, and then writes the final result.
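The three stages described above (map, shuffle and sort, reduce) can be sketched in plain Python with a word count. This runs in a single process and does not use the Hadoop API; the function names are chosen for illustration only.

```python
from collections import defaultdict


def mapper(line: str):
    """Map one input line to intermediate (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key: str, values: list):
    """Reduce all values that share a key to a single count."""
    return key, sum(values)


lines = ["big data is big", "hadoop handles big data"]
intermediate = [pair for line in lines for pair in mapper(line)]
result = [reducer(key, values) for key, values in shuffle_and_sort(intermediate)]
print(result)
# [('big', 3), ('data', 2), ('hadoop', 1), ('handles', 1), ('is', 1)]
```

Each reducer call sees one key together with every value emitted for it, which is exactly what the shuffle and sort phase guarantees.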
MapReduce
MapReduce applications typically implement the Mapper and Reducer interfaces
to provide the map and reduce methods.
MapReduce divides workloads into multiple tasks that can be executed in
parallel.
Why MapReduce?
The initial approach is to process data serially, i.e. from top to bottom.
With very large data sets, this approach runs into problems:
o It won’t work.
o We may run out of memory.
o Data processing may take a long time.
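The contrast between serial processing and dividing the workload into tasks can be sketched as follows. Threads on one machine stand in for cluster workers here, which is an assumption made purely for illustration; the split size and function names are also hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(split: list) -> Counter:
    """One task: count the words in a single input split."""
    counts = Counter()
    for line in split:
        counts.update(line.split())
    return counts


def make_splits(lines: list, size: int) -> list:
    """Cut the input into splits of at most `size` lines each."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


lines = ["a b a", "b c", "a c c", "d"]
splits = make_splits(lines, 2)

# Each split is handled by its own task; partial results are merged at the end.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_words, splits))

total = sum(partials, Counter())
print(dict(total))
```

No task ever holds the whole input in memory, only its own split, which is the property that lets the same pattern scale out across many machines.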
MapReduce in Action
[Figure: MapReduce execution flow. Components: User Program, Master, Workers. The Master assigns map and reduce tasks to the Workers (2). In the map phase, Workers read the input files (Split 0, Split 1, Split 2) (3) and write intermediate files to their local disks (4). In the reduce phase, Workers read the intermediate files remotely (5) and write the output files (Output File 0, Output File 1) (6).]
Mapper: split, read, emit intermediate key-value pairs
Reducer: repartition, emit final output
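Steps (3) to (6) of the MapReduce in Action figure can be sketched in a single process. Each map task writes its pairs into one bucket per reduce task, and each reduce task then collects its bucket from every map task. The CRC32-based partition function, the function names and the in-memory lists standing in for intermediate files are assumptions made for this illustration.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Pick the reduce task for a key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(split: str, num_reducers: int) -> list:
    """Steps (3)-(4): read a split, write pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for word in split.split():
        buckets[partition(word, num_reducers)].append((word, 1))
    return buckets


def run_reduce_task(r: int, map_outputs: list) -> dict:
    """Steps (5)-(6): fetch bucket r from every map task, then sum per key."""
    totals = defaultdict(int)
    for buckets in map_outputs:
        for key, value in buckets[r]:
            totals[key] += value
    return dict(sorted(totals.items()))


splits = ["x y x", "y z", "x z z"]
R = 2  # number of reduce tasks, and therefore of output files
map_outputs = [run_map_task(s, R) for s in splits]
output_files = [run_reduce_task(r, map_outputs) for r in range(R)]
print(output_files)
```

Because the partition function is deterministic, every occurrence of a key lands in the same bucket, so each key appears in exactly one output file.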
Market Size
Source: Wikibon Taming Big Data
By 2015: 4.5 million IT jobs in Big Data, 2 million of them in the US
In India
 Gaining traction
 Huge market opportunities for IT services (82.9% of revenues) and
analytics firms (17.1%)
 Market size by end of 2015: $1 billion
 India will require a minimum of 1 lakh (100,000) data scientists in the next couple
of years, in addition to data analysts and data managers, to support the
Big Data space.
Big Data & Hadoop
