Data Access with Hadoop 
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
ameet@hortonworks.com 
@ameetp512 
Ameet Paranjape
Interactive and real-time data analysis in Hadoop! 
Digital universe
2013: 2.3 Zettabytes
2020: 40 Zettabytes
85% of growth from new types of data, with machine-generated data increasing 15x
Analyst consensus estimates enterprise data growth of 50x year over year through 2020
1 Zettabyte (ZB) = 1 million Petabytes (PB); Sources: IDC and IDG Enterprise
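As a rough check on the two digital-universe figures, the annual growth rate they imply can be worked out directly. This is a minimal sketch that assumes smooth compounding between the 2013 and 2020 data points; the function name `implied_annual_growth` is ours and does not come from the slides.

```python
def implied_annual_growth(start_zb: float, end_zb: float, years: int) -> float:
    """Compound annual growth rate implied by two data points."""
    return (end_zb / start_zb) ** (1 / years) - 1

ZB_IN_PB = 1_000_000  # 1 ZB = 1 million PB, as stated on the slide

# 2.3 ZB in 2013 growing to 40 ZB in 2020 (figures from the slide)
rate = implied_annual_growth(2.3, 40.0, 2020 - 2013)
print(f"Implied growth: {rate:.1%} per year")
print(f"2020 estimate in petabytes: {40.0 * ZB_IN_PB:,.0f} PB")
```

Under that assumption the digital universe grows by roughly half again each year.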
Traditional systems under pressure
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP
• Silos of Data
• Costly to Scale
• Constrained Schemas
SOURCES
Existing Sources (CRM, ERP,…)
New Data Types: Clickstream; Geolocation; Sentiment, Web Data; Sensor, Machine Data; Unstructured docs, emails; Server logs
…and difficult to manage new data
Virtualization 
Slicing your servers into pieces so you can parcel out computing resources
Hadoop 
Tying your servers together to make them act like one big computer 
Cost of storage is going down 
According to StatisticBrain, the average cost per gigabyte of storage was 
$437,500 in 1980, $11 in 2000, and just five cents in 2013. 
Hadoop 101 
The basics 
1. Hadoop ties your servers together, and makes them act like one big computer 
• So you can use inexpensive servers to do your big data processing 
2. Hadoop works well with structured, semi-structured, 
and unstructured information 
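The "one big computer" idea can be illustrated with a toy split, process, and combine loop. This is only a sketch in plain Python, not Hadoop's API: the helper names (`split_across_nodes`, `map_on_node`, `combine`) and the sample log lines are hypothetical, and the "nodes" are just list slices processed one after another.

```python
from collections import Counter
from functools import reduce

def split_across_nodes(records, num_nodes):
    """Deal records round-robin to `num_nodes` hypothetical servers."""
    return [records[i::num_nodes] for i in range(num_nodes)]

def map_on_node(lines):
    """Work done locally on one node: count words in its share of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def combine(partials):
    """Merge the per-node results into one answer."""
    return reduce(lambda a, b: a + b, partials, Counter())

log_lines = [
    "error disk full",
    "warn disk slow",
    "error network down",
    "info disk ok",
]
partials = [map_on_node(chunk) for chunk in split_across_nodes(log_lines, 3)]
total = combine(partials)
print(total["disk"], total["error"])  # prints: 3 2
```

Each node only ever sees its own slice, yet the combined result matches what a single machine would compute over the full data set.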
Hadoop and the Modern Data Architecture (MDA)
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP, and Hadoop
Hadoop: Batch, Interactive, Real-Time on YARN: Data Operating System, over HDFS (Hadoop Distributed File System), nodes 1 to N
SOURCES: Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
It’s crowded out there! 
Recommended Reading 
The Forrester Wave Report – Big Data Hadoop Solutions, Q1 2014 
Hadoop Comparison Tips 
1. Is the solution open or closed source? 
2. If code is open, who owns the IP? 
3. What’s available for free and what do you pay for? 
4. Is the solution substrate agnostic? 
5. OS support options? 
6. Partnerships 
7. What’s the pricing model? 
8. Local resources to help? 
A Blueprint for Enterprise Hadoop
PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
GOVERNANCE & INTEGRATION: Load data and manage according to policy
DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
YARN: Data Operating System
DATA MANAGEMENT: Store and process all of your Corporate Data Assets
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
OPERATIONS: Deploy and effectively manage the platform
DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud
Apache Hadoop & A Hadoop “Distribution” 
Apache Hadoop is a project
• Governed by the Apache Software Foundation (ASF)
• Comprises YARN and HDFS (along with MapReduce and Hadoop Common)
A Hadoop distribution is a package of projects (e.g. HDP)
• Packages Apache Hadoop and related Apache projects
• Extends Hadoop with:
– Data access services to manipulate the data
– Data governance and integration services
– Security services
– Operational services to manage the cluster
• Tested for consistency across the entire package
• Hardened for the enterprise
YARN has transformed Hadoop
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Batch, Interactive, Real-Time
YARN: Data Operating System
DATA MANAGEMENT: HDFS (Hadoop Distributed File System), nodes 1 to N
Apache Projects for Data Access
Apache Pig, Apache Hive, Apache HBase, Apache Storm, Apache Solr, Apache Spark
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
• Script: Pig
• SQL: Hive, HCatalog
• NoSQL: HBase, Accumulo
• Stream: Storm
• Search: Solr
• In-Memory: Spark
• Others: ISV Engines
• Tez: shown beneath the Script (Pig) and SQL (Hive) engines
• Traditional Tools
YARN: Data Operating System
DATA MANAGEMENT: HDFS (Hadoop Distributed File System), nodes 1 to N
Apache Projects for Governance
Apache Falcon, Apache Sqoop, Apache Flume, Hadoop NFS & WebHDFS
GOVERNANCE & INTEGRATION
• Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS
Layered on the same stack: DATA ACCESS (Pig, Hive, HCatalog, HBase, Accumulo, Storm, Solr, Spark, Tez, ISV Engines), YARN: Data Operating System, and DATA MANAGEMENT on HDFS (Hadoop Distributed File System)
Apache Projects for Security
Apache Knox, Apache Argus
SECURITY: Authentication, Authorization, Accounting, Data Protection
• Storage: HDFS
• Resources: YARN
• Access: Hive, …
• Pipeline: Falcon
• Cluster: Knox
Entire Stack (HDFS, Hive, YARN)
Layered on the same stack: GOVERNANCE & INTEGRATION (Falcon, Sqoop, Flume, NFS, WebHDFS), DATA ACCESS (Pig, Hive, HCatalog, HBase, Accumulo, Storm, Solr, Spark, Tez, ISV Engines), YARN: Data Operating System, and DATA MANAGEMENT on HDFS (Hadoop Distributed File System)
Apache Projects for Operations
Apache Ambari, Apache ZooKeeper, Apache Oozie
OPERATIONS
• Provision, Manage & Monitor: Ambari, ZooKeeper
• Scheduling: Oozie
Layered on the same stack: GOVERNANCE & INTEGRATION (Falcon, Sqoop, Flume, NFS, WebHDFS), DATA ACCESS (Pig, Hive, HCatalog, HBase, Accumulo, Storm, Solr, Spark, Tez, ISV Engines), SECURITY (Authentication, Authorization, Accounting, Data Protection), YARN: Data Operating System, and DATA MANAGEMENT on HDFS (Hadoop Distributed File System)
Remember the MDA
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP, and Hadoop
Hadoop: Batch, Interactive, Real-Time on YARN: Data Operating System, over HDFS (Hadoop Distributed File System), nodes 1 to N
SOURCES: Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
What is Data Access? 
Data Access defines ALL the channels through which data can be accessed, analyzed, cleansed and consumed within Hadoop. Each channel can be categorized into THREE core patterns: Batch, Interactive, and Real-time.
Multiple engines provide optimized access to your mission-critical data.
Access patterns enabled by YARN
• Batch: Needs to happen, but with no timeframe limitations
• Interactive: Needs to happen at human time
• Real-Time: Needs to happen at machine execution time
All three run on YARN: Data Operating System, over HDFS (Hadoop Distributed File System), nodes 1 to N
HBase 
• Apache™ HBase is a non-relational (NoSQL) database that runs on top of the Hadoop® Distributed File System (HDFS).
• It is column-oriented and provides fault-tolerant storage and quick access to large quantities of sparse data.
• It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
• HBase was created for hosting very large tables with billions of rows and millions of columns.
Developers use it to:
• Provide low-latency access to massive amounts of data (e.g. recommendation engine results)
• Serve as a document store
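To make "sparse data" and "updates, inserts and deletes" concrete, here is a toy in-memory model of a sparse, column-family table. It is a sketch only and is not the HBase client API: the class `ToySparseTable`, its methods, and the row keys and column names are all hypothetical, chosen to echo the recommendation-engine use case.

```python
import time
from collections import defaultdict

class ToySparseTable:
    """Toy in-memory model of a sparse, column-family table (not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value) versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        """Insert or update a cell by appending a new timestamped version."""
        stamp = ts if ts is not None else time.time()
        self._rows[row][column].append((stamp, value))

    def get(self, row, column):
        """Return the newest value for a cell, or None if it was never written."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda v: v[0])[1]

    def delete(self, row, column):
        """Remove every version of a cell."""
        self._rows.get(row, {}).pop(column, None)

table = ToySparseTable()
table.put("user42", "recs:top_item", "sku-1001", ts=1)
table.put("user42", "recs:top_item", "sku-2002", ts=2)  # update = newer version
table.put("user99", "profile:city", "Austin", ts=1)     # rows need not share columns
print(table.get("user42", "recs:top_item"))  # prints: sku-2002
print(table.get("user99", "recs:top_item"))  # prints: None
```

Only cells that were written take up space, which is why a table can have millions of possible columns while each row stores just a handful.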
Spark 
• Spark is a general-purpose engine for ad-hoc interactive analytics, iterative machine learning, and other use cases well suited to interactive, in-memory processing of GB- to TB-sized datasets.
• Spark loads data into memory so it can be queried repeatedly. It can create a “shadow” of data that can be used in the next iteration of a query.
• Spark provides simple APIs for data scientists and engineers familiar with Scala (a programming language) to build applications.
• Spark is YARN-ready: another engine on YARN!
Developers use it for:
• Data science: machine learning and iterative analytics
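The "load once, query repeatedly" idea can be sketched without Spark itself. The snippet below is a toy illustration in plain Python, not Spark's API: `CachedDataset`, `read_ratings_from_disk`, and the sample ratings are hypothetical stand-ins for a dataset read from slow storage and then held in memory.

```python
class CachedDataset:
    """Load a dataset once, then answer repeated queries from memory (toy model, not Spark)."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical function that reads from slow storage
        self._data = None
        self.loads = 0         # how many times slow storage was touched

    def _rows(self):
        if self._data is None:
            self._data = list(self._loader())
            self.loads += 1
        return self._data

    def query(self, predicate):
        """Filter the in-memory rows; only the first call triggers a load."""
        return [row for row in self._rows() if predicate(row)]

def read_ratings_from_disk():
    # Stand-in for an expensive read from distributed storage
    yield from [("alice", 5), ("bob", 2), ("carol", 4), ("dave", 1)]

ratings = CachedDataset(read_ratings_from_disk)
high = ratings.query(lambda r: r[1] >= 4)
low = ratings.query(lambda r: r[1] <= 2)
print(len(high), len(low), ratings.loads)  # prints: 2 2 1
```

Two queries ran but storage was read only once, which is the property that makes iterative algorithms cheap when the working set fits in memory.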
Stream Processing in Hadoop 
Batch, Interactive, Real-Time
Sources: Sentiment, Clickstream, Machine/Sensor, Server Logs, Geo-location
How do I deal with this continuous stream of data coming in from sensors, etc.?
Apache Storm
Real-time event processing for sensor and business activity monitoring
• Unlocks new business cases for Hadoop
• Key component of a data lake architecture
• Scale: Ingest millions of events per second; fast query on petabytes of data
• Integrated with Ambari for management
• Predictive analytics
Monitor real-time data to…
Finance
- Prevent: Securities fraud, Compliance violations
- Optimize: Order routing, Pricing
Telco
- Prevent: Security breaches, Network outages
- Optimize: Bandwidth allocation, Customer service
Retail
- Optimize: Offers, Pricing
Manufacturing
- Prevent: Machine failures
- Optimize: Supply chain
Transportation
- Prevent: Driver & fleet issues
- Optimize: Routes, Pricing
Web
- Prevent: Application failures, Operational issues
- Optimize: Site content
YARN: Data Operating System
Trucking company w/ large fleet of trucks in Midwest
A truck generates millions of events for a given route; an event could be:
• ‘Normal’ events: starting / stopping of the vehicle
• ‘Violation’ events: speeding, excessive acceleration and braking, unsafe tail distance
Company uses an application that monitors truck locations and violations from the truck/driver in real-time
Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers
Route? Truck? Driver?
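The real-time half of this scenario amounts to filtering an event stream for violations as events arrive. Below is a minimal sketch using a Python generator; it is not Storm code, and the event fields (`truck`, `driver`, `route`, `type`), the violation names, and the sample events are hypothetical.

```python
from collections import Counter

VIOLATIONS = {
    "speeding",
    "excessive_acceleration",
    "excessive_braking",
    "unsafe_tail_distance",
}

def violation_alerts(events):
    """Yield each violation event as soon as it arrives; normal events pass silently."""
    for event in events:
        if event["type"] in VIOLATIONS:
            yield event

def truck_event_stream():
    # Hypothetical sample events; a real feed would be unbounded
    yield {"truck": "T1", "driver": "D7", "route": "R3", "type": "start"}
    yield {"truck": "T1", "driver": "D7", "route": "R3", "type": "speeding"}
    yield {"truck": "T2", "driver": "D4", "route": "R3", "type": "stop"}
    yield {"truck": "T2", "driver": "D4", "route": "R3", "type": "unsafe_tail_distance"}
    yield {"truck": "T1", "driver": "D7", "route": "R5", "type": "excessive_braking"}

by_driver = Counter()
by_route = Counter()
for alert in violation_alerts(truck_event_stream()):
    by_driver[alert["driver"]] += 1
    by_route[alert["route"]] += 1

print(by_driver.most_common(1), by_route.most_common(1))
# prints: [('D7', 2)] [('R3', 2)]
```

Because the generator handles one event at a time, the same loop works whether the source is a short list or a continuous feed.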
Solutions on Hadoop Require All!
• Truck Sensors
• Inbound Messaging (Kafka)
• Stream Processing (Storm)
• Alerts & Events (ActiveMQ)
• Real-time Serving (HBase)
• Real-Time User Interface
• Interactive Query (Hive on Tez)
• Microsoft Excel
• Many Workloads: YARN
• Distributed Storage: HDFS
Query Executes Blazingly Fast with Hive 13 on Tez 
Do Specific Routes Cause More Issues?
Do Specific Trucks Cause More Issues?
Do Specific Drivers in Trucks Cause More Issues?
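The three questions above are all group-by counts over the violation history. The sketch below runs that kind of query with Python's built-in `sqlite3` as a stand-in for Hive on Tez; the `violations` table, its columns, and the sample rows are hypothetical, and the slides do not show the actual queries used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE violations (route TEXT, truck TEXT, driver TEXT, kind TEXT)")
conn.executemany(
    "INSERT INTO violations VALUES (?, ?, ?, ?)",
    [
        ("R3", "T1", "D7", "speeding"),
        ("R3", "T2", "D4", "unsafe_tail_distance"),
        ("R5", "T1", "D7", "excessive_braking"),
        ("R3", "T1", "D7", "speeding"),
    ],
)

def violations_by(column):
    """Count violations grouped by route, truck, or driver, busiest first."""
    if column not in {"route", "truck", "driver"}:
        raise ValueError(f"unsupported dimension: {column}")
    sql = (
        f"SELECT {column}, COUNT(*) AS n FROM violations "
        f"GROUP BY {column} ORDER BY n DESC, {column}"
    )
    return conn.execute(sql).fetchall()

for dimension in ("route", "truck", "driver"):
    print(dimension, violations_by(dimension))
```

The column name is checked against a fixed set before being placed in the SQL string, since identifiers cannot be passed as bound parameters.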
Try it out... 
YARN Has Fundamentally Changed Hadoop
YARN enables: 
• More Workloads 
From batch to interactive & real-time 
• More Data 
Multiple data sets of varying types 
and structures 
• More Value 
Hosting multiple business cases 
in a single Hadoop cluster 
Enterprise Hadoop Enables… 
• More Workloads 
From batch to interactive & real-time 
• More Data 
Multiple data sets of varying types 
and structures 
• More Value 
Hosting multiple business cases 
in a single Hadoop cluster

 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 

Cloud Austin Meetup - Hadoop like a champion

  • 1. Data Access with Hadoop. Ameet Paranjape, ameet@hortonworks.com, @ameetp512. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 2. Interactive and real-time data analysis in Hadoop!
  • 3. 2013 digital universe: 2.3 Zettabytes. 2020 digital universe: 40 Zettabytes. 85% of growth from new types of data, with machine-generated data increasing 15x. Analysts consensus estimates enterprise data growth of 50x year over year through 2020. 1 Zettabyte (ZB) = 1 million Petabytes (PB). Sources: IDC and IDG Enterprise
  • 4. Traditional systems under pressure. Sources: existing sources (CRM, ERP, …) and new data types (clickstream; geolocation; sentiment and web data; sensor and machine data; unstructured docs and emails; server logs). Data systems: RDBMS, EDW, MPP. Applications: business analytics, custom applications, packaged applications. Pressures: • Silos of data • Costly to scale • Constrained schemas …and difficult to manage new data
  • 5. Virtualization: slicing your servers into pieces so you can parcel out computing resources
  • 6. Hadoop: tying your servers together to make them act like one big computer
  • 7. Cost of storage is going down. According to StatisticBrain, the average cost per gigabyte of storage was $437,500 in 1980, $11 in 2000, and just five cents in 2013.
  • 8. Hadoop 101: the basics. 1. Hadoop ties your servers together and makes them act like one big computer • So you can use inexpensive servers to do your big data processing 2. Hadoop works well with structured, semi-structured, and unstructured information
  • 9. Hadoop and the Modern Data Architecture (MDA). Sources: existing systems, clickstream, web & social, geolocation, sensor & machine, server logs, unstructured. Data system: RDBMS, EDW, MPP; batch, interactive and real-time access on YARN: Data Operating System, over HDFS (Hadoop Distributed File System), nodes 1 … N. Applications: business analytics, custom applications, packaged applications
  • 10. It’s crowded out there!
  • 11. Recommended Reading: The Forrester Wave Report – Big Data Hadoop Solutions, Q1 2014
  • 12. Hadoop Comparison Tips 1. Is the solution open or closed source? 2. If code is open, who owns the IP? 3. What’s available for free and what do you pay for? 4. Is the solution substrate agnostic? 5. OS support options? 6. Partnerships 7. What’s the pricing model? 8. Local resources to help?
  • 13. A Blueprint for Enterprise Hadoop. Presentation & Application: enable both existing and new applications to provide value to the organization. Enterprise Mgmt & Security: empower existing operations and security tools to manage Hadoop. Governance & Integration: load data and manage according to policy. Data Access: access your data simultaneously in multiple ways (batch, interactive, real-time). Data Management (YARN Data Operating System): store and process all of your corporate data assets. Security: provide a layered approach to security through authentication, authorization, accounting, and data protection. Operations: deploy and effectively manage the platform. Deployment Options: provide deployment choice across physical, virtual, cloud
  • 14. Apache Hadoop & a Hadoop “Distribution”. Apache Hadoop is a project: • Governed by the Apache Software Foundation (ASF) • Comprises YARN and HDFS. A Hadoop distribution is a package of projects (e.g. HDP): • Packages Apache Hadoop and related Apache projects • Extends Hadoop with: – Data access services to manipulate the data – Data governance and integration services – Security services – Operational services to manage the cluster • Tested for consistency across the entire package • Hardened for the enterprise
  • 15. YARN has transformed Hadoop. Batch, interactive & real-time data access: batch, interactive and real-time engines on YARN: Data Operating System. Data management: HDFS (Hadoop Distributed File System), nodes 1 … N
  • 16. Apache Projects for Data Access: Apache Pig, Apache Hive, Apache HBase, Apache Storm, Apache Solr, Apache Spark. Batch, interactive & real-time data access: Script (Pig), SQL (Hive, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), In-Memory (Spark), Others (ISV engines), Tez, traditional tools. Data management: YARN: Data Operating System, over HDFS (Hadoop Distributed File System), nodes 1 … N
  • 17. Apache Projects for Governance: Apache Falcon, Apache Sqoop, Apache Flume, Hadoop NFS & WebHDFS. Governance & integration: data workflow, lifecycle & governance (Falcon, Sqoop, Flume, NFS, WebHDFS). Data access and data management layers as on slide 16
  • 18. Apache Projects for Security: Apache Knox, Apache Argus, entire stack (HDFS, Hive, YARN). Security: authentication, authorization, accounting, data protection, covering storage (HDFS), resources (YARN), access (Hive, …), pipeline (Falcon), cluster (Knox). Governance, data access and data management layers as on slides 16 and 17
  • 19. Apache Projects for Operations: Apache Ambari, Apache Zookeeper, Apache Oozie. Operations: provision, manage & monitor (Ambari, Zookeeper); scheduling (Oozie). Security, governance, data access and data management layers as on slides 16 to 18
  • 20. Remember the MDA (same diagram as slide 9: sources, data system with YARN and HDFS, applications)
  • 21. What is Data Access? Data Access defines ALL the channels through which data can be accessed, analyzed, cleansed and consumed within Hadoop. Each channel can be categorized into THREE core patterns: Batch, Interactive and Real-time. Multiple engines provide optimized access to your mission-critical data.
  • 22. Access patterns enabled by YARN. Batch: needs to happen, but with no timeframe limitations. Interactive: needs to happen at human time. Real-Time: needs to happen at machine execution time. All three run on YARN: Data Operating System, over HDFS (Hadoop Distributed File System), nodes 1 … N
  • 23. HBase • Apache™ HBase is a non-relational (NoSQL) database that runs on top of the Hadoop® Distributed File System (HDFS). • It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. • It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. • HBase was created for hosting very large tables with billions of rows and millions of columns. Developers use it to: • Provide low-latency access to massive amounts of data (e.g. recommendation engine results) • Serve as a document store
  • 24. Spark • Spark is a general-purpose engine for ad-hoc interactive analytics, iterative machine learning, and other use cases well suited to interactive, in-memory data processing of GB- to TB-sized datasets. • Spark loads data into memory so it can be queried repeatedly. It can create a “shadow” of data that can be used in the next iteration of a query. • Spark provides simple APIs for data scientists and engineers familiar with Scala (programming language) to build applications. • Spark is YARN-ready – another engine on YARN! Developers use it for: • Data science: machine learning and iterative analytics
  • 25. Stream Processing in Hadoop. How do I deal with this continuous stream of data coming in from sensors, etc. (sentiment, clickstream, machine/sensor, server logs, geo-location)? Apache Storm: real-time event processing for sensor and business activity monitoring • Unlocks new business cases for Hadoop • Key component of a data lake architecture • Scale: ingest millions of events per second; fast query on petabytes of data • Integrated with Ambari to manage • Predictive analytics. Monitor real-time data to prevent and optimize, by industry: Finance: securities fraud, compliance violations, order routing, pricing. Telco: security breaches, network outages, bandwidth allocation, customer service. Retail: offers, pricing. Manufacturing: machine failures, supply chain. Transportation: driver & fleet issues, routes, pricing. Web: application failures, operational issues, site content. Storm runs as a real-time engine alongside batch and interactive on YARN: Data Operating System
  • 26. Trucking company w/ large fleet of trucks in Midwest. A truck generates millions of events for a given route; an event could be: • ‘Normal’ events: starting / stopping of the vehicle • ‘Violation’ events: speeding, excessive acceleration and braking, unsafe tail distance. Company uses an application that monitors truck locations and violations from the truck/driver in real-time. Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers (Route? Truck? Driver?)
  • 27. Solutions on Hadoop Require All! Components shown: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm); Alerts & Events (ActiveMQ); Real-time Serving (HBase); Real-Time User Interface; Interactive Query (Hive on Tez); Microsoft Excel; Many Workloads: YARN; Distributed Storage: HDFS
  • 28. Query Executes Blazingly Fast with Hive 13 on Tez
  • 29. Do Specific Routes Cause More Issues?
  • 30. Do Specific Trucks Cause More Issues?
  • 31. Do Specific Drivers in Trucks Cause More Issues?
  • 32. Try it out...
  • 33. YARN Has Fundamentally Changed Hadoop. YARN enables: • More Workloads: from batch to interactive & real-time • More Data: multiple data sets of varying types and structures • More Value: hosting multiple business cases in a single Hadoop cluster
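
Slides 26 and 29 to 31 describe two complementary uses of the truck events: picking out violations as they arrive, and querying history to see whether specific routes, trucks, or drivers account for more of them. A minimal sketch of both steps in plain Python follows. The event fields (truck_id, driver_id, route_id, event_type), the violation labels, and the sample records are all assumptions made for illustration; in the architecture on slide 27 this logic would sit in Storm and Hive rather than in a single script.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# Hypothetical event labels, loosely following the examples on slide 26.
VIOLATION_TYPES = {
    "speeding",
    "excessive_acceleration",
    "excessive_braking",
    "unsafe_tail_distance",
}

Event = Dict[str, str]


def violations(events: Iterable[Event]) -> Iterator[Event]:
    """Yield only the events whose type counts as a violation."""
    for event in events:
        if event["event_type"] in VIOLATION_TYPES:
            yield event


def count_by(events: Iterable[Event], key: str) -> Counter:
    """Count violation events grouped by one dimension (route, truck or driver)."""
    return Counter(event[key] for event in violations(events))


# Made-up sample history, for illustration only.
HISTORY: List[Event] = [
    {"truck_id": "T1", "driver_id": "D1", "route_id": "R1", "event_type": "start"},
    {"truck_id": "T1", "driver_id": "D1", "route_id": "R1", "event_type": "speeding"},
    {"truck_id": "T2", "driver_id": "D2", "route_id": "R1", "event_type": "unsafe_tail_distance"},
    {"truck_id": "T2", "driver_id": "D2", "route_id": "R2", "event_type": "stop"},
    {"truck_id": "T1", "driver_id": "D3", "route_id": "R2", "event_type": "excessive_braking"},
    {"truck_id": "T1", "driver_id": "D1", "route_id": "R1", "event_type": "speeding"},
]

by_route = count_by(HISTORY, "route_id")
by_truck = count_by(HISTORY, "truck_id")
by_driver = count_by(HISTORY, "driver_id")

if __name__ == "__main__":
    for name, counts in (("route", by_route), ("truck", by_truck), ("driver", by_driver)):
        print(name, counts.most_common())
```

The same filter serves both paths: applied to a live feed it produces the alerts, and applied to stored history it feeds the per-route, per-truck and per-driver counts.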

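Slide 23 describes HBase as columnar storage for sparse data that supports updates, inserts and deletes. The toy model below, in plain Python, illustrates only the idea of a sparse table: each row stores just the cells it actually has, keyed by column. The class name SparseTable, its methods, and the sample keys are hypothetical; this is not the HBase client API and it ignores distribution, versioning and fault tolerance.

```python
from typing import Dict, Optional


class SparseTable:
    """Toy in-memory sparse table: row key -> {column name -> value}."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, row: str, column: str, value: str) -> None:
        """Insert a new cell or overwrite an existing one."""
        self._rows.setdefault(row, {})[column] = value

    def get(self, row: str, column: str) -> Optional[str]:
        """Return the cell value, or None if the cell was never written."""
        return self._rows.get(row, {}).get(column)

    def delete(self, row: str, column: str) -> None:
        """Remove one cell; drop the row entirely once it has no cells left."""
        cells = self._rows.get(row)
        if cells is not None:
            cells.pop(column, None)
            if not cells:
                del self._rows[row]

    def cell_count(self) -> int:
        """Number of cells actually stored (absent cells cost nothing)."""
        return sum(len(cells) for cells in self._rows.values())


table = SparseTable()
table.put("user1", "recs:item_a", "0.91")   # insert
table.put("user1", "recs:item_b", "0.40")   # insert
table.put("user2", "recs:item_c", "0.75")   # insert
table.put("user1", "recs:item_b", "0.55")   # update an existing cell
table.delete("user2", "recs:item_c")        # delete; row user2 is now empty and removed
```

Because only written cells are stored, a table with millions of possible columns stays small when each row uses a handful of them.
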
Editor's Notes

  1. So, where does Hadoop fit in the data center? This picture is a very simple depiction of the typical data architecture in any organization. - There are sources of data: ERP, CRM, other digital sources. - That data is then stored in a data system: a data warehouse, MPP system, etc. - Then an application of some kind accesses that data system: a packaged application such as Excel or Tableau, a custom application written by a developer, or even another business application. This has been the foundation of the data center for years. We have had some challenges with this architecture all along; however, we are seeing increased pressure to modify and improve this basic blueprint because A) this approach created silos of data, and it was difficult to share the data or get a holistic view of it, B) these systems are costly to scale, and C) they are coupled to a very static schema. Changes to a data model are difficult if not impossible, which limits flexibility and insight. Finally, NEW types of data that emerge as we digitize the world around us, such as clickstream and machine sensor data, are growing at exponential rates. We are all becoming data-driven organizations. In fact, the sheer volume of data is set to grow 20X between 2013 and 2020, which puts tremendous pressure on this architecture. The old architecture is neither technologically nor commercially practical.
  2. When you distill down all the “new” types of big data that are being managed by Hadoop, they generally fall into the six categories represented as columns on the left side of this slide: - sentiment & web, - clickstream, - machine & sensor, - geographic data, - server logs, and - general unstructured content, the stuff we find in docs and PDFs throughout our organization. Within various verticals, best-practice architectures have emerged to surface the value from Hadoop and HDP. Some representative examples appear here: Advertisers target ads to their best customer segment and also analyze point-of-sale data to determine the effectiveness of campaigns. Banks detect fraud and money laundering while also improving customer service. Hospitals respond to patients in real time and then analyze historical data to reduce readmission rates. Manufacturers control quality on the production line and then diagnose product defects in the aggregate. Oil companies predict and repair equipment proactively and also analyze equipment durability under varied circumstances. Telecoms allocate bandwidth in real time, and later discover unforeseen patterns after analyzing billions of historical call records. Retailers make sure the shelves are stocked today and also plan their product mix for next year. **** WHAT IS YOUR USE CASE???
  3. YARN enables the modern data architecture, as it turns Hadoop into a truly multi-purpose data platform with batch, interactive and real-time workloads all running in a single cluster. It enables users to: - Create a central cluster into which data can be stored and then accessed using a range of processing engines: batch, interactive, real-time. - It is akin to the journey with virtualization: from a single virtual server to a pool of virtual infrastructure. It is the architectural center of Hadoop: - It provides the data operating system around which the core enterprise capabilities of security, governance and operations can be integrated. - It is the integration point into which all data processing engines integrate, from the open source community but also from the commercial vendor ecosystem.
  4. Hadoop has evolved over the years: beyond providing linear-scale compute and storage, it needed explicit functions to make it a complete data platform. New projects spun up around Hadoop to meet some of the complex requirements of the modern enterprise. A good way to look at the evolution of Hadoop is through this picture. - When Hadoop began it was simply a data management layer (HDFS) and a single data access engine (MapReduce). Over the past several years the range of components in the Hadoop ecosystem has exploded: - Data Access: the emergence of multiple access engines spanning SQL, NoSQL, scripting, streaming and more. YARN ensures that they can all be part of Hadoop seamlessly. - Security: to address the key requirements of authorization, access, audit/accounting and data protection. - Operations: tools to manage the platform. - Governance and integration: tools to load and manage data according to policy. These are all core requirements of any data platform, and over time the Hadoop community has expanded to include all of these capabilities. Why are there 5 categories? Because each addresses the requirements of a different persona that engages with a data platform: Developers (Data Access), Administrators (Security, Operations), Data Architects (Governance).
  5. YARN enables the modern data architecture, as it turns Hadoop into a truly multi-purpose data platform with batch, interactive and real-time workloads all running in a single cluster. It enables users to: - Create a central cluster into which data can be stored and then accessed using a range of processing engines: batch, interactive, real-time. - It is akin to the journey with virtualization: from a single virtual server to a pool of virtual infrastructure. It is the architectural center of Hadoop: - It provides the data operating system around which the core enterprise capabilities of security, governance and operations can be integrated. - It is the integration point into which all data processing engines integrate, from the open source community but also from the commercial vendor ecosystem.
  6. Twitter Johnson Controls Cablevision….Plans
  7. An Elasticsearch Flume sink does exist
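
Note 1 cites roughly 20X growth in data volume between 2013 and 2020, while slide 3 gives 2.3 ZB for 2013 and 40 ZB for 2020. As a quick check, the sketch below computes the overall growth factor and the compound annual rate implied by the slide's two figures. It assumes smooth compound growth over the seven years, which is a simplification, and the function and variable names are illustrative.

```python
START_YEAR, END_YEAR = 2013, 2020
START_ZB, END_ZB = 2.3, 40.0  # digital universe sizes quoted on slide 3


def growth_factor(start: float, end: float) -> float:
    """Overall multiple between the start and end values."""
    return end / start


def annual_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1.0 / years) - 1.0


factor = growth_factor(START_ZB, END_ZB)
rate = annual_rate(START_ZB, END_ZB, END_YEAR - START_YEAR)

if __name__ == "__main__":
    print(f"{factor:.1f}x overall, {rate:.1%} per year")
```

With these inputs the overall factor comes out a little above 17x, in the same range as the 20X quoted in the note, and the implied rate is about 50 percent per year.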