Presented by,
Gavara Sai Sri Lakshmi Alekhya,
MTech Graduate
Big Data
• Big Data is data whose scale, diversity, and complexity require new
architectures, techniques, algorithms, and analytics to manage it and
extract value and hidden knowledge from it.
• Put simply, Big Data is like "small data", only bigger in size.
• Because the data is bigger, it requires different approaches:
techniques, tools, and architecture.
• Big Data aims to solve new problems, or old problems in a better
way.
• Big Data generates value from the storage and processing of very
large quantities of digital information that cannot be analyzed with
traditional computing techniques.
Generating Big data
Analysis - Big Data Generation
• Walmart handles more than 1 million customer transactions every
hour.
• Facebook handles 40 billion photos from its user base.
• Facebook generates 10 TB of data daily.
• Twitter generates 7 TB of data daily.
• Decoding the human genome originally took 10 years; it can now be
achieved in one week.
4 V’s in Big Data
Big Data to Value
• Big Data is not about the size of the data; it is mainly about the
value within the data.
Why Is Big Data Needed?
• Big Data growth is needed.
• Increase in storage capacities.
• Increase in processing power.
• Availability of data (different data types).
• Every day we create 2.5 quintillion bytes of data.
• IBM claims that 90% of the data in the world today has been created
in the last two years alone.
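To put the figures quoted in these slides on a common scale, here is a minimal sketch that converts raw byte counts into decimal (SI) units. The helper name `human_readable` is hypothetical, and the input numbers are simply the ones stated above, not independently verified.

```python
# Decimal (SI) byte units: each step is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(num_bytes: float) -> str:
    """Express a byte count in the largest SI unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

daily_bytes = 2.5e18  # "2.5 quintillion bytes" per day, as quoted in the slide
print(human_readable(daily_bytes))   # 2.5 EB

# Walmart figure from the earlier slide: 1 million transactions per hour.
transactions_per_second = 1_000_000 / 3600
print(f"{transactions_per_second:.0f} transactions per second")  # 278 transactions per second
```

In other words, 2.5 quintillion bytes is 2.5 exabytes per day, and 1 million transactions per hour is roughly 278 per second.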
Big Data Analytics
• Examines huge amounts of data.
• Yields accurate information.
• Identifies hidden patterns and unknown correlations.
• Competitive environment.
• Better business decisions, both strategic and operational.
• Effective marketing, customer satisfaction, and increased revenue.
Applications of Big Data
Risks of Big Data
• It is easy to be overwhelmed
  • You need the right people solving the right problems.
• Costs can escalate too fast
  • It is not necessary to capture 100% of the data.
• Many sources of Big Data raise privacy concerns
  • Self-regulation and legal regulation.
Challenges of Big Data
• Uncertainty of the data management landscape
• The Big Data talent gap
• Getting data into the Big Data platform
• Synchronization across the data sources
• Getting useful information out of the Big Data platform
Big Data Analytics Technologies
• NoSQL: non-relational, or at least non-SQL, database solutions
such as HBase (also part of the Hadoop ecosystem), Cassandra,
MongoDB, Riak, CouchDB, and many others.
• Hadoop: an ecosystem of software packages, including MapReduce,
HDFS, and a whole host of other software packages.
• Apache Hadoop is a framework that allows for the distributed
processing of large data sets across clusters of commodity
computers using a simple programming model.
• It is an open-source data management system with scale-out storage
and distributed processing.
• Hadoop is a system for large-scale data processing.
• It has two main components:
Hadoop = HDFS + MapReduce
HDFS – Hadoop Distributed File System
• HDFS (storage and file system): HDFS is a scalable,
fault-tolerant, reliable distributed file system that provides
high-throughput access to data.
• NameNode:
  • Master of the system.
  • Maintains and manages the blocks that are present on the
  DataNodes.
• DataNodes:
  • Slave (worker) nodes that are deployed on each machine and
  provide the actual storage.
  • Responsible for serving read and write requests from clients.
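The division of labour between the NameNode (metadata about blocks) and the DataNodes (the block contents themselves) can be sketched with a toy in-memory model. This is only an illustration, not the HDFS API: the class `ToyHDFS`, the 4-byte block size, the replication factor of 2, and the round-robin placement are all assumptions chosen to keep the example small.

```python
BLOCK_SIZE = 4    # bytes per block; tiny on purpose, real HDFS blocks are far larger
REPLICATION = 2   # copies kept of each block (illustrative value)

class ToyHDFS:
    """Toy model: the NameNode holds only metadata, the DataNodes hold the bytes."""

    def __init__(self, datanode_names):
        # NameNode metadata: file name -> [(block_id, [names of DataNodes holding it])]
        self.namenode = {}
        # DataNode storage: node name -> {block_id: block bytes}
        self.datanodes = {name: {} for name in datanode_names}

    def write(self, filename, data):
        names = list(self.datanodes)
        entries = []
        for index, offset in enumerate(range(0, len(data), BLOCK_SIZE)):
            block_id = f"{filename}#{index}"
            # Round-robin placement of replicas (an assumption of this sketch).
            replicas = [names[(index + r) % len(names)] for r in range(REPLICATION)]
            for name in replicas:
                self.datanodes[name][block_id] = data[offset:offset + BLOCK_SIZE]
            entries.append((block_id, replicas))
        self.namenode[filename] = entries

    def read(self, filename):
        chunks = []
        for block_id, replicas in self.namenode[filename]:
            # Use the first replica whose DataNode is still available.
            holder = next(n for n in replicas if n in self.datanodes)
            chunks.append(self.datanodes[holder][block_id])
        return b"".join(chunks)

fs = ToyHDFS(["dn1", "dn2", "dn3"])
fs.write("log.txt", b"hello big data")
del fs.datanodes["dn1"]        # simulate losing one machine
print(fs.read("log.txt"))      # b'hello big data'
```

Because every block has a second copy on another node, the file can still be read after one DataNode disappears, which is the fault tolerance the slide refers to.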
HDFS Architecture
MapReduce
• A MapReduce job usually splits the input data set into independent
chunks, which are processed by the map tasks in a completely parallel
manner. The framework sorts the outputs of the maps, which are then
input to the reduce tasks. Typically both the input and the output of
the job are stored in a file system.
• It has two phases.
• Mapper phase:
  • Processes a key/value pair to generate intermediate key/value pairs.
• Reducer phase:
  • Merges all intermediate values associated with the same key.
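The two phases can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the programming model only, not Hadoop's API; the function names `mapper`, `reducer`, and `run_job` are hypothetical.

```python
from collections import defaultdict

def mapper(line):
    """Mapper phase: turn one input record into intermediate (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1

def reducer(key, values):
    """Reducer phase: merge all intermediate values that share the same key."""
    return key, sum(values)

def run_job(lines):
    # Group intermediate values by key (the framework's shuffle step).
    intermediate = defaultdict(list)
    for line in lines:                 # each line stands in for an input chunk
        for key, value in mapper(line):
            intermediate[key].append(value)
    # Keys are handed to the reducer in sorted order, as the framework sorts map output.
    return dict(reducer(key, intermediate[key]) for key in sorted(intermediate))

counts = run_job(["big data is big", "data is value"])
print(counts)   # {'big': 2, 'data': 2, 'is': 2, 'value': 1}
```

In a real cluster the map calls run in parallel on different machines, and the grouping step moves data across the network; here everything runs sequentially for clarity.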
MapReduce Architecture
HDFS and MapReduce
Hadoop Ecosystem
PIG:
Pig was initially developed at Yahoo Research around 2006 and moved into the
Apache Software Foundation in 2007. It allows people using Apache Hadoop to
focus more on analyzing large data sets and spend less time writing mapper
and reducer programs.
The Pig programming language is designed to handle any kind of data, hence the
name!
• Pig consists of two components: the language, called Pig Latin, and
an execution environment in which Pig Latin programs are run.
HIVE:
Apache Hive is a data warehouse system for Apache Hadoop.
Hive, a technology developed by Facebook, turns Hadoop into a data
warehouse, complete with an SQL-like language for querying.
Hive is used through HiveQL, which is a declarative language.
In Pig Latin you describe a dataflow, whereas in Hive you describe the
results you want; Hive then works out a dataflow to produce those results.
Hive must have a schema, and it can have more than one.
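The difference between describing a dataflow and describing the result can be sketched in Python. Neither half is Pig Latin or HiveQL: the first half spells out each step by hand, in the spirit of a Pig script, and the second uses SQLite (through Python's standard `sqlite3` module) as a stand-in for a declarative query engine. The `orders` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

rows = [("alice", "books", 12.0), ("bob", "games", 30.0), ("alice", "games", 8.0)]

# Dataflow style: every step of the pipeline is written out explicitly.
filtered = [r for r in rows if r[2] >= 10.0]            # step 1: filter
grouped = defaultdict(list)                             # step 2: group by customer
for customer, _category, amount in filtered:
    grouped[customer].append(amount)
dataflow_result = sorted((c, sum(a)) for c, a in grouped.items())  # step 3: aggregate

# Declarative style: state the desired result; the engine chooses the steps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, category TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
declarative_result = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "WHERE amount >= 10 GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(dataflow_result)      # [('alice', 12.0), ('bob', 30.0)]
print(declarative_result)   # [('alice', 12.0), ('bob', 30.0)]
```

Both halves produce the same answer; the declarative query also depends on a declared table schema, which mirrors Hive's requirement for one.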
OOZIE:
Oozie is a Java-based web application that runs in a Java servlet
container and uses a database to store workflow definitions. A workflow
is a collection of actions. Oozie is used to manage Hadoop jobs.
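The idea of a workflow as a collection of actions with dependencies can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not Oozie's own format, which defines workflows in XML; the action names and the `workflow` dictionary are hypothetical.

```python
from graphlib import TopologicalSorter

# Each action maps to the set of actions that must finish before it starts.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "count_words": {"clean_logs"},
    "export_report": {"count_words"},
}

def execution_order(actions):
    """Return the actions in an order that respects every dependency."""
    return list(TopologicalSorter(actions).static_order())

order = execution_order(workflow)
print(order)   # ['import_logs', 'clean_logs', 'count_words', 'export_report']
```

A scheduler in this style would launch each action, for example a MapReduce or Hive job, only after the actions it depends on have completed.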
HBASE:
HBase is a non-relational, distributed, column-oriented database,
whereas HDFS is a file system.
• It is built and runs on top of HDFS.
• It is an open-source, versioned, distributed data management system
modeled on Google's Bigtable.
• It is written in Java. It can serve as the input and output for
MapReduce jobs.
• It suits workloads in which, for instance, read and write operations
involve all rows but only a small subset of the columns.
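Why a column-oriented layout helps when an operation touches all rows but few columns can be sketched with a toy store that keeps each column's values together. This is a deliberate simplification rather than HBase's data model or client API: the class `ColumnStore`, its methods, and the sample column names are hypothetical.

```python
class ColumnStore:
    """Toy column-oriented store: values are grouped by column, then by row key."""

    def __init__(self):
        self.columns = {}   # column name -> {row key: value}

    def put(self, row_key, column, value):
        self.columns.setdefault(column, {})[row_key] = value

    def scan_column(self, column):
        # Reads one column for every row without touching any other column's data.
        return dict(sorted(self.columns.get(column, {}).items()))

store = ColumnStore()
store.put("user1", "info:name", "Asha")
store.put("user1", "info:city", "Pune")
store.put("user2", "info:name", "Ravi")
store.put("user2", "stats:visits", 7)

print(store.scan_column("info:name"))   # {'user1': 'Asha', 'user2': 'Ravi'}
```

A row-oriented layout would have to read every field of every row to answer the same question, so the saving grows with the number of columns that are skipped.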
SQOOP:
Sqoop is a tool used to transfer data from relational database environments such as
Oracle, MySQL, and PostgreSQL into the Hadoop environment.
It is a command-line application for transferring data between relational
databases and Hadoop.
MAHOUT:
Mahout is a library for machine learning and data mining. It is divided
into four main groups: collaborative filtering, categorization, clustering,
and parallel frequent pattern mining.
These Mahout algorithms belong to the subset of the library that can be
executed in distributed mode by MapReduce.
FLUME:
Flume is open-source software, created by Cloudera, that acts as a service
for gathering and moving large amounts of data around a Hadoop cluster as
the data is produced or shortly afterwards.
Its key use case is gathering log files from all machines in a cluster and
persisting them in a centralized store.
Conclusion
• Real-time Big Data is not just a process for storing petabytes or
exabytes of data in a data warehouse; it is about the ability to make
better decisions and take meaningful actions at the right time.
• Fast forward to the present, and technologies like Hadoop give you
the scale and flexibility to store data before you know how you are
going to process it.
• Technologies such as MapReduce, Hive, and Impala enable you to
run queries without changing the data structures underneath.
• Big Data offers commercial opportunities of a comparable scale to
enterprise software in the late 1980s.
Vendors Using Big Data (Hadoop)
Future
• Our new research finds that organizations use Big Data to:
  • target customer-centric outcomes,
  • tap into internal data, and
  • build a better information ecosystem.
A Glimpse of Bigdata - Introduction
