This presentation helps you understand the basics of Hadoop.
What is Big Data? How does Google search so fast? What is the MapReduce algorithm? All of these questions are answered in the presentation.
2. Things included:
• History of data
• What is Big Data?
• Distributed Computing vs. Parallelization
• Google’s solution
• Hadoop
• HDFS
• MapReduce
• A video explaining MapReduce
3. History of Data
• Due to the advent of new technologies, devices, and
communication means like social networking sites, the amount
of data produced by mankind is growing rapidly every year.
• The amount of data produced by mankind from the beginning of time
until 2003 was about 5 billion gigabytes. Piled up in the form of
disks, it could fill an entire football field.
• The same amount was created every two days in 2011, and every
ten minutes in 2013, and this rate is still growing. Though all this
information is meaningful and can be useful when processed, most
of it goes neglected.
4. What is Big Data?
• Big data really does mean big: it is a collection of large datasets
that cannot be processed using traditional computing
techniques. Big data is not merely data; it has become a
complete subject, involving various tools, techniques, and
frameworks. Examples include:
• Black box data
• Social media data
• Power grid data
• Search engine data
7. What Caused The Problem?
Year   Data storage
1990   1,000 MB
2010   1,000 GB

Year   Transfer rate
1990   4.4 MB/s
2010   100 MB/s
8. So What Is The Problem?
• The transfer speed is around 100 MB/s
• A standard disk is 1 terabyte
• Time to read an entire disk = 10,000 seconds, or nearly 3 hours!
• A faster processor alone may not help, because:
• Network bandwidth is now more of a limiting factor
• The physical limits of processor chips have been reached
9. So What do We Do?
• The obvious solution is to use multiple processors to solve the
same problem by fragmenting it into pieces.
• Imagine if we had 100 drives, each holding one hundredth of the
data. Working in parallel, we could read the data in under two
minutes.
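The read-time arithmetic above can be checked in a few lines of Python (the figures are the ones quoted in the slides; the variable names are illustrative):

```python
# Back-of-the-envelope read times, using the slides' figures
disk_bytes = 1 * 10**12            # a standard 1 TB disk
rate_bytes_per_s = 100 * 10**6     # ~100 MB/s transfer speed

single_disk = disk_bytes / rate_bytes_per_s   # scan one disk sequentially
parallel_100 = single_disk / 100              # 100 drives, each holding 1/100 of the data

print(single_disk)    # 10000.0 seconds, i.e. nearly 3 hours
print(parallel_100)   # 100.0 seconds, i.e. under two minutes
```

This is exactly the argument for fragmenting the data across many machines: the scan time divides by the number of drives reading in parallel.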
14. Hadoop
• Doug Cutting and Mike Cafarella started an open-source project
called Hadoop in 2005.
• Now Apache Hadoop is a registered trademark of the Apache
Software Foundation.
• Hadoop runs applications using the MapReduce algorithm.
• Hadoop = HDFS + MapReduce.
16. HDFS(Hadoop Distributed File System)
• Hadoop comes with a distributed file system called HDFS, the
Hadoop Distributed File System.
• HDFS is designed to hold very large amounts of data (terabytes
or even petabytes) and to provide high-throughput access to this
information.
• HDFS is highly fault tolerant and designed to run on low-cost
commodity hardware.
17. MapReduce
• MapReduce is a programming model.
• Programs written in this functional style are automatically
parallelized and executed on a large cluster of commodity
machines.
MapReduce
• Map: a map function processes a key/value pair to generate a
set of intermediate key/value pairs.
• Reduce: a reduce function merges all intermediate values
associated with the same intermediate key.
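The model described above can be sketched on a single machine, using word count as the classic example. This is a toy simulation of the map/shuffle/reduce phases, not Hadoop's actual Java API; the function names and driver are illustrative:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    # emit an intermediate (word, 1) pair for every word
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: all intermediate counts for that word
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group intermediate values by intermediate key
    groups = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    # Reduce phase: merge each group's values
    result = {}
    for ik, ivs in groups.items():
        for ok, ov in reduce_fn(ik, ivs):
            result[ok] = ov
    return result

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts["the"])  # 2
```

In a real cluster, the map calls run in parallel across many machines, the framework performs the shuffle over the network, and the reduce calls run in parallel per key, which is what makes the functional style automatically parallelizable.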