Building a Hadoop Data Warehouse
Hadoop 101 for enterprise data warehouse professionals
Ralph Kimball
April 2014
© Ralph Kimball, Cloudera, 2014
The Data Warehouse Mission
• Identify all possible enterprise data assets
• Select those assets that have actionable content and can be accessed
• Bring the data assets into a logically centralized “enterprise data warehouse”
• Expose those data assets most effectively for decision making
Enormous RDBMS Legacy
• Legacy RDBMSs have been spectacularly successful, and we will continue to use them.
• Too successful… if all you have is a hammer, everything looks like a nail.
• The RDBMS dilemma: an ocean of new data types that are being monetized for strategic advantage
  • Unstructured, semi-structured, and machine data
  • Evolving schemas, just-in-time schemas
  • Links, images, genomes, geo-positions, log data…
Houston: we have a problem
• Traditional RDBMSs cannot handle:
  • The new data types
  • Extended analytic processing
  • Terabytes/hour loading with immediate query access
• We want to use SQL and SQL-like languages, but we don’t want the RDBMS storage constraints…
• The disruptive solution: Hadoop
The Data Warehouse Stack in Hadoop
• Hadoop is an open source distributed storage and processing framework
• To understand how data warehousing is different in Hadoop, start with this powerful architecture difference:
Hadoop for Exploratory DW/BI
• Query engines can access HDFS files before ETL
• BI tools are the ultimate glue integrating EDW
The slide’s architecture diagram:
• Sources: transactions, free text, images, machines/sensors, links/networks; loaded with an ETL tool or Sqoop (EDW overflow)
• HDFS files: industry-standard HW; fault tolerant; replicated; write once(!); agnostic content; scalable to “infinity”
• Metadata (system table): HCatalog, which all clients can use to read files
• Query engines: Hive SQL, Impala SQL, others… (these are query engines, not databases; purpose built for EXTREME I/O speeds)
• BI tools: Tableau, BusinessObjects, Cognos, QlikView, others…
Data Load to Query in One Step
• Copy into HDFS with ETL tool, Sqoop, or Flume into standard HDFS files (write once), registering metadata with HCatalog
• Declare query schema in Hive or Impala (no data copying or reloading)
• Immediately launch familiar SQL queries: “Exploratory BI”
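The declare-then-query step happens in Hive or Impala on a cluster; as a minimal stand-alone illustration of the schema-on-read idea (pure Python with hypothetical data, not Hive itself), the same stored bytes can serve multiple query-time schemas with no reload and no copy:

```python
# Minimal schema-on-read sketch: the raw file is written once; each
# "table" is just a schema declared at query time over the same bytes.
import csv, io

raw_file = "2014-04-07,store_17,broom,2,5.99\n2014-04-08,store_03,mop,1,12.50\n"

def query(raw, schema, predicate):
    """Apply a declared schema to raw text at read time and filter rows."""
    rows = []
    for values in csv.reader(io.StringIO(raw)):
        record = dict(zip(schema, values))   # schema applied on read
        if predicate(record):
            rows.append(record)
    return rows

# Two different schemas over the same stored bytes -- no reload, no copy.
sales_schema = ["date", "store", "product", "qty", "amount"]
narrow_schema = ["date", "store"]            # project only what we need

mops = query(raw_file, sales_schema, lambda r: r["product"] == "mop")
days = query(raw_file, narrow_schema, lambda r: True)
print(len(mops), days[0]["store"])
```

The file itself never changes; only the declared schema does, which is the property that makes exploratory BI possible before any formal ETL.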
Typical Large Hadoop Cluster
• 100 nodes (5 racks)
• Each node:
  • Dual hex-core CPU running at 3 GHz
  • 64-378 GB of RAM
  • 24-36 TB disk storage (6-10 TB effective storage with default redundancy of 3X)
• Overall cluster (!):
  • 6.4-37.8 TB of RAM (wow, think about this…)
  • Up to a PB of effective storage
  • Approximate fully loaded cost per TB: $1000 +/-
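The cluster totals follow arithmetically from the per-node figures; a quick sanity check in Python using the slide's numbers (the $1000/TB figure is the slide's estimate, not computed here):

```python
# Back-of-the-envelope check of the slide's cluster totals.
nodes = 100

ram_low_gb, ram_high_gb = 64, 378     # per-node RAM range from the slide
disk_low_tb, disk_high_tb = 24, 36    # per-node raw disk range
replication = 3                       # HDFS default redundancy

total_ram_tb = (nodes * ram_low_gb / 1000, nodes * ram_high_gb / 1000)

# Effective storage = raw capacity divided by the 3x replication factor.
# (The slide's 6-10 TB/node is a bit lower, leaving working/temp space.)
effective_tb = (nodes * disk_low_tb / replication, nodes * disk_high_tb / replication)

print(total_ram_tb)    # (6.4, 37.8) TB of cluster RAM
print(effective_tb)    # (800.0, 1200.0) TB, i.e. up to ~1 PB effective
```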
Committing to High Performance HDFS Files with Embedded Schemas
The slide’s architecture diagram adds a columnar layer to the exploratory stack:
• Sources: transactions, free text, images, machines/sensors, links/networks; loaded with an ETL tool or Sqoop (EDW overflow)
• HDFS raw files: commodity HW; fault tolerant; replicated; append only(!); agnostic content; scalable to “infinity”
• Parquet columnar files: a read-optimized, schema-defined column store; purpose built for EXTREME I/O speeds
• Metadata (system table): HCatalog, which all clients can use to read files
• Query engines: Hive SQL, Impala SQL, others… (these are query engines, not databases!)
• BI tools: Tableau, BusinessObjects, Cognos, QlikView, others…
High Performance Data Warehouse Thread in Hadoop
• Copy data from raw HDFS file into Parquet columnar file
  • Parquet is not a database: it’s a file accessible to multiple query and analysis apps
  • Parquet data can be updated and the schema modified
• Query Parquet data with Hive or Impala
  • At least 10x performance gain over simple raw file
  • Hive launches MapReduce jobs: relation scan
    • Ideal for ETL and transfer to conventional EDW
  • Impala launches in-memory individual queries
    • Ideal for interactive query in Hadoop destination DW
    • Impala: at least 10x additional performance gain over Hive
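Much of the columnar gain comes from reading only the columns a query references. A toy row-store vs. column-store comparison (plain Python, not Parquet's actual encoding) makes the effect visible:

```python
# Toy illustration of why a columnar layout speeds up analytic scans:
# a query touching one column reads far less data than a row-store scan.
rows = [{"id": i, "amount": i * 1.5, "note": "x" * 100} for i in range(1000)]

# Row store: every query walks whole rows, dragging the wide 'note'
# field along even when it is never used.
row_bytes_scanned = sum(len(str(r)) for r in rows)

# Column store: the same data pivoted into one array per column;
# SUM(amount) now touches only the 'amount' array.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_bytes_scanned = len(str(columns["amount"]))

total = sum(columns["amount"])
print(row_bytes_scanned, col_bytes_scanned)   # columnar scan touches far fewer bytes
```

Real Parquet adds compression and encoding on top of this layout, which is where the "at least 10x" figure on the slide comes from.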
Use Hadoop as Platform for Direct Analysis or ETL to Text/Number DB
• Huge array of special analysis apps for:
  • Unstructured text
  • Hyper-structured text/numbers (machine data)
  • Positional data from GPS
  • Images
  • Audio, video
• Consume results with increasing SQL support from these individual apps
• Or, write text/number data into Hadoop from unstructured sources or an external EDW relational DBMS
The Larger Picture: Why Use Hadoop as Part of Your EDW?
• Strategic:
  • Open floodgates to new kinds of data
  • New kinds of analysis impossible in RDBMS
  • “Schema on read” for exploratory BI
  • Attack the same data from multiple perspectives: choose SQL and non-SQL approaches at query time
  • Keep hyper-granular data in an “active archive” forever: no-compromise data analysis; compliance
  • Simultaneous incompatible analysis modes on the same data files
  • Enterprise data hub: one location for all data resources
• Tactical:
  • Dramatically lowered operational costs
  • Linear scaling across response time, concurrency, and data size well beyond petabytes
  • Highly reliable write-once, redundantly stored data
  • Meet ETL SLAs
It’s Not That Difficult
• Important existing tools already work in Hadoop:
  • ETL tool suites: familiar data flows and user interfaces
  • BI query tools: identical user interfaces, integration
  • Standard job schedulers, sort packages (e.g., SyncSort)
• Skills you need anyway:
  • Java, Python or Ruby, C, SQL, Sqoop data transfer
  • Linux admin
  • But MapReduce programming is no longer needed
• Investigate and add incrementally:
  • Analytic tools: MADlib extensions to RDBMS, SAS, R
  • Specialty data tools, e.g., Splunk (machine data)
Integration is Crucial
• Integration is MORE than bringing separate data sources onto a common platform.
• Suppose you have two customer-facing data sources in your DW producing the following results. Is this integration?
Doing Integration the Right Way
• A teaspoon sip of EDW 101 for Hadoop professionals!
• Build a conformed dimension library
  • Plan to download dimensions from the EDW
• Attach conformed dimensions to every possible source
  • Join dimensions at query time to fact tables in SQL-capable files
  • Embed dimension content as columns in NoSQL structures, including HBase
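Embedding conformed dimension content as columns is a simple denormalization step; a sketch with hypothetical names (a real pipeline would do this during ETL, before writing to the NoSQL store):

```python
# Denormalize conformed dimension attributes into event records,
# so NoSQL/HBase-style rows carry the conformed labels directly.
customer_dim = {   # conformed dimension downloaded from the EDW
    "C001": {"customer_category": "Retail", "region": "West"},
    "C002": {"customer_category": "Wholesale", "region": "East"},
}

events = [
    {"customer_id": "C001", "amount": 40.0},
    {"customer_id": "C002", "amount": 75.0},
]

def embed_dimensions(events, dim):
    """Attach conformed attributes as plain columns on each event."""
    return [{**e, **dim[e["customer_id"]]} for e in events]

enriched = embed_dimensions(events, customer_dim)
print(enriched[0]["customer_category"])   # Retail
```

Once the conformed attributes travel with every record, any engine that can read the file can group on Customer Category without a join.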
Integrating Big Data
• Remember: data warehouse integration is drilling across:
  • Establish conformed attributes (e.g., Customer Category) in each database
  • Fetch separate answer sets from different platforms grouped on the same conformed attributes
  • Sort-merge the answer sets at the BI layer
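The three steps above can be sketched end to end. Assume each platform has already returned an answer set grouped on the conformed attribute (all names and figures here are hypothetical):

```python
# Drill-across sketch: separate answer sets from two platforms, each
# grouped on the conformed attribute, sort-merged at the BI layer.
hadoop_answers = {"Retail": {"web_visits": 120}, "Wholesale": {"web_visits": 45}}
edw_answers = {"Retail": {"revenue": 900.0}, "Premier": {"revenue": 300.0}}

def drill_across(*answer_sets):
    """Sort-merge answer sets on their shared conformed attribute."""
    keys = sorted(set().union(*(a.keys() for a in answer_sets)))
    merged = {}
    for k in keys:
        row = {}
        for a in answer_sets:
            row.update(a.get(k, {}))   # outer-join semantics across platforms
        merged[k] = row
    return merged

report = drill_across(hadoop_answers, edw_answers)
print(report["Retail"])   # {'web_visits': 120, 'revenue': 900.0}
```

Note that the merge happens entirely at the BI layer: neither platform ever sees the other's data, which is exactly what makes drill-across work across Hadoop and the EDW.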
Out of the Box Possibility: Billions of Rows, Millions of Columns
• A tough problem for all current relational platforms: huge name-value data sources (e.g., customer observations)
• Think about HBase (!)
  • Intended for “impossibly wide schemas”
  • Fully general binary data content
  • Fire-hose SCD1 and SCD2 updates of individual records
  • Continuously growing rows and columns
  • Only simple SQL direct access possible now: no joins…
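HBase’s versioned cell model, the reason it handles “impossibly wide” name-value schemas and fire-hose updates, can be mimicked with a plain dictionary (an illustrative sketch, not the HBase API):

```python
# Mimic HBase's cell model: each cell is (row_key, column) -> a list of
# timestamped versions, so rows can be arbitrarily wide and sparse.
from collections import defaultdict

table = defaultdict(list)   # (row_key, column) -> [(timestamp, value), ...]

def put(row, column, ts, value):
    table[(row, column)].append((ts, value))

def get(row, column):
    """Latest version wins -- an SCD1-style overwrite read."""
    return max(table[(row, column)])[1]

def history(row, column):
    """All versions retained -- SCD2-style change tracking for free."""
    return sorted(table[(row, column)])

# Sparse name-value observations: any customer can have any column.
put("cust_1", "obs:shoe_size", 1, "9.5")
put("cust_1", "obs:shoe_size", 2, "10")      # fire-hose update of one record
put("cust_2", "obs:favorite_team", 1, "Giants")

print(get("cust_1", "obs:shoe_size"))            # 10
print(len(history("cust_1", "obs:shoe_size")))   # 2
```

Because columns exist only where a value was written, a table with millions of possible columns costs nothing for the columns a given row lacks.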
Summing Up: The Data Warehouse Renaissance
• Hadoop DW becomes an equal partner with the Enterprise DW
• Hadoop will be the strategic environment of choice for new data types and new analysis modes
• Hadoop:
  • Extreme data type diversity
  • Huge library of specialty analysis tools with SQL extensions
  • Starting point for exploratory BI and ETL-to-EDW processing
  • Destination point for serious BI
  • Permanent active archive of hyper-granular data
• BI tools implement Hadoop-to-EDW integration
The Kimball Group Resource
• www.kimballgroup.com
• Best-selling data warehouse books
  • NEW BOOK! The classic “Toolkit,” 3rd Ed.
• In-depth data warehouse classes taught by primary authors
  • Dimensional modeling (Ralph/Margy)
  • ETL architecture (Ralph/Bob)
• Dimensional design reviews and consulting by Kimball Group principals
• White papers on integration, data quality, and big data analytics
“A data warehouse DBMS is now expected to coordinate data virtualization strategies, and distributed file and/or processing approaches, to address changes in data management and access requirements.”
– 2014 Gartner Magic Quadrant for Data Warehouse DBMS
Data Warehousing, Meet Hadoop
What inhibits “Big Data” initiatives?
• No compelling business need
• Not enough staff to support
• Lack of “data science” expertise
• Missing enterprise-grade features
• Complexity of DIY open source
From Apache Hadoop to an enterprise data hub
The original deck builds this diagram up in four stages; consolidated, the stack looks like this:
• Storage for any type of data (unified, elastic, resilient, secure): HDFS filesystem, plus online NoSQL (HBase)
• Processing engines: batch processing (MapReduce), analytic SQL (Impala), search engine (Solr), machine learning (Spark), stream processing (Spark Streaming), and 3rd-party apps
• Workload management: YARN
• System management: Cloudera Manager
• Data management: Cloudera Navigator, with Sentry for security
The base Apache Hadoop layer (HDFS plus MapReduce) is open source, scalable, flexible, and cost-effective (✔), but on its own it is not managed, not an open architecture, and not secure and governed (✖). The added layers supply each missing check mark: managed (Cloudera Manager), open architecture (the multi-engine stack on YARN), and secure and governed (Navigator and Sentry).
Cloudera: Your Trusted Advisor for Big Data
Advance from Strategy to ROI with Best Practices and Peak Performance
• Partners
• Proactive & Predictive Support
• Professional Services
• Training
Disrupt the Industry, Not Your Business
Your Journey to Gaining Value from All Your Data: from operational efficiency (IT: faster, bigger, cheaper) to transformative applications (business: new business value), spanning cheap storage, EDW optimization, ETL acceleration, agile exploration, data science, and Customer 360.
Thank you for attending!
• Submit questions in the Q&A panel
• For a comprehensive set of data warehouse resources (books, in-depth classes, overall design consulting): http://www.kimballgroup.com
• Follow: @cloudera, @mattbrandwein

Register now for our next webinar with Dr. Ralph Kimball:
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Online Webinar | May 29, 2014, 10AM PT / 1PM ET
http://tinyurl.com/kimballwebinar
Building a Hadoop Data Warehouse: Hadoop 101 for Enterprise Data Warehouse Professionals

  • 1. Building a Hadoop Data Warehouse Hadoop 101 for enterprise data warehouse professionals Ralph Kimball APRIL 2014 Building a Hadoop Data Warehouse © Ralph Kimball, Cloudera, 2014 April 2014
  • 2. The Data Warehouse Mission  Identify all possible enterprise data assets  Select those assets that have actionable content and can be accessed  Bring the data assets into a logically centralized “enterprise data warehouse”  Expose those data assets most effectively for decision making
  • 3. Enormous RDBMS Legacy  Legacy RDBMSs have been spectacularly successful, and we will continue to use them.  Too successful… If all you have is a hammer, everything looks like a nail.  RDBMS dilemma: a new ocean of new data types that are being monetized for strategic advantage  Unstructured, semi-structured and machine data  Evolving schemas, just-in-time schemas  Links, images, genomes, geo-positions, log data …
  • 4. Houston: we have a problem  Traditional RDBMSs cannot handle  The new data types  Extended analytic processing  Terabytes/hour loading with immediate query access  We want to use SQL and SQL-like languages, but we don’t want the RDBMS storage constraints…  The disruptive solution: Hadoop
  • 5. The Data Warehouse Stack in Hadoop  Hadoop is an open source distributed storage and processing framework  To understand how data warehousing is different in Hadoop, start with this powerful architecture difference:
  • 6. The Data Warehouse Stack in Hadoop  Hadoop is an open source distributed storage and processing framework  To understand how data warehousing is different in Hadoop, start with this powerful architecture difference:
  • 7. Hadoop for Exploratory DW/BI • Query engines can access HDFS files before ETL • BI tools are the ultimate glue integrating EDW HDFS Files: Sources: Trans- actions Free text Images Machines/ Sensors Links/ Networks Metadata (system table): HCatalo g Query Engines: BI Tools: Tableau Industry standad HW; Fault tolerant; Replicated; Write once(!); Agnostic content; Scalable to “infinity” Others… Bus Obj Cognos QlikVie w Others… All clients can use this to read files These are query engines, not databases! Purpose built for EXTREME I/O speeds; Use ETL tool or Sqoop EDW Overflow Hive SQL Impala SQL
  • 8. Data Load to Query in One Step  Copy into HDFS with ETL tool, Sqoop, or Flume into standard HDFS files (write once) registering metadata with HCatalog  Declare query schema in Hive or Impala (no data copying or reloading)  Immediately launch familiar SQL queries: “Exploratory BI”
  • 9. Typical Large Hadoop Cluster  100 nodes (5 racks)  Each node  Dual hex core CPU running at 3 GHz  64-378 GB of RAM  24-36 TB disk storage (6-10 TB effective storage with default redundancy of 3X)  Overall cluster (!)  6.4-37.8 TB of RAM (wow, think about this…)  Up to a PB of effective storage  Approximate fully loaded cost per TB: $1000 +/-
Committing to High Performance: HDFS Files with Embedded Schemas
[Diagram: the same sources load via ETL tool or Sqoop into HDFS raw files (commodity HW; fault tolerant; replicated; append only(!); agnostic content; scalable to “infinity”), now alongside Parquet columnar FILES: a read optimized, schema defined column store purpose built for EXTREME I/O speeds. HCatalog metadata lets all clients read the files; Hive SQL and Impala SQL query engines (query engines, not databases!) feed BI tools (Tableau, BusinessObjects, Cognos, QlikView, others) and EDW overflow.]
High Performance Data Warehouse Thread in Hadoop
 Copy data from raw HDFS file into Parquet columnar file
   Parquet is not a database: it’s a file accessible to multiple query and analysis apps
   Parquet data can be updated and the schema modified
 Query Parquet data with Hive or Impala
   At least 10x performance gain over simple raw file
   Hive launches MapReduce jobs: relation scan
     Ideal for ETL and transfer to conventional EDW
   Impala launches in-memory individual queries
     Ideal for interactive query in Hadoop destination DW
     Impala at least 10x additional performance gain over Hive
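The performance gain comes from columnar layout: a query that touches one column reads only that column's values. A toy illustration of the organizing idea (not Parquet's actual on-disk format; table and column names are hypothetical):

```python
# Row-oriented layout: every field of every row sits together.
row_store = [
    {"order_id": 1, "region": "EU", "amount": 10.0},
    {"order_id": 2, "region": "US", "amount": 20.0},
    {"order_id": 3, "region": "EU", "amount": 30.0},
]

# Columnar layout: pivot the table so each column's values are contiguous.
column_store = {name: [row[name] for row in row_store]
                for name in row_store[0]}

# SELECT SUM(amount): the column store touches only the "amount" values,
# never the other fields, which is where the big scan speedups come from.
total = sum(column_store["amount"])
```

Columnar files also compress much better, since each contiguous run holds values of a single type.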
Use Hadoop as Platform for Direct Analysis or ETL to Text/Number DB
 Huge array of special analysis apps for
   Unstructured text
   Hyper structured text/numbers (machine data)
   Positional data from GPS
   Images
   Audio, video
 Consume results with increasing SQL support from these individual apps
 Or, write text/number data into Hadoop from unstructured source or external EDW relational DBMS
The Larger Picture: Why Use Hadoop as Part of Your EDW?
 Strategic:
   Open floodgates to new kinds of data
   New kinds of analysis impossible in RDBMS
   “Schema on read” for exploratory BI
   Attack same data from multiple perspectives
     Choose SQL and non-SQL approaches at query time
   Keep hyper granular data in “active archive” forever
     No compromise data analysis
     Compliance
   Simultaneous incompatible analysis modes on same data files
   Enterprise data hub: one location for all data resources
 Tactical:
   Dramatically lowered operational costs
   Linear scaling across response time, concurrency, and data size well beyond petabytes
   Highly reliable write-once, redundantly stored data
   Meet ETL SLAs
It’s Not That Difficult
 Important existing tools already work in Hadoop
   ETL tool suites: familiar data flows and user interfaces
   BI query tools: identical user interfaces, integration
   Standard job schedulers, sort packages (e.g. SyncSort)
 Skills you need anyway:
   Java, Python or Ruby, C, SQL, Sqoop data transfer
   Linux admin
   … but MapReduce programming no longer needed
 Investigate and add incrementally:
   Analytic tools: MADlib extensions to RDBMS, SAS, R
   Specialty data tools, e.g., Splunk (machine data)
Integration is Crucial
 Integration is MORE than bringing separate data sources onto a common platform.
 Suppose you have two customer-facing data sources in your DW producing the following results. Is this integration?
Doing Integration the Right Way
 Teaspoon sip of EDW 101 for Hadoop professionals!
 Build a conformed dimension library
   Plan to download dimensions from EDW
 Attach conformed dimensions to every possible source
   Join dimensions at query time to fact tables in SQL-capable files
   Embed dimension content as columns in NoSQL structures, including HBase
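Attaching a conformed dimension can be sketched as a query-time join: dimension rows downloaded from the EDW are joined to each source's fact records so every source carries the same attributes. A toy sketch in Python, with hypothetical keys and attribute names:

```python
# Conformed customer dimension, downloaded from the EDW: key -> attributes.
customer_dim = {
    101: {"customer_name": "Acme", "customer_category": "Enterprise"},
    102: {"customer_name": "Zeta", "customer_category": "SMB"},
}

# Fact records from one customer-facing source (e.g. web activity).
web_facts = [
    {"customer_key": 101, "page_views": 40},
    {"customer_key": 102, "page_views": 7},
]

# Query-time join: each fact row gains the conformed attributes, so results
# from this source can later be merged with any other conformed source.
joined = [{**fact, **customer_dim[fact["customer_key"]]}
          for fact in web_facts]
```

Because every source now labels its facts with identical attribute values, results from different platforms line up row for row.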
Integrating Big Data
 Remember: Data warehouse integration is drilling across:
   Establish conformed attributes (e.g., Customer Category) in each database
   Fetch separate answer sets from different platforms grouped on the same conformed attributes
   Sort-merge the answer sets at the BI layer
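The drill-across steps above can be sketched in a few lines: two answer sets, each already grouped on the same conformed attribute by its own platform, are sort-merged at the BI layer into one row per category. The numbers and measure names are made up:

```python
# Answer set from the Hadoop side, grouped on the conformed attribute.
hadoop_answer_set = {"Enterprise": {"web_visits": 500},
                     "SMB": {"web_visits": 120}}

# Answer set from the EDW side, grouped on the same conformed attribute.
edw_answer_set = {"Enterprise": {"revenue": 90000.0},
                  "SMB": {"revenue": 15000.0}}

# Sort-merge at the BI layer: one combined row per conformed value.
drill_across = {
    category: {**hadoop_answer_set.get(category, {}),
               **edw_answer_set.get(category, {})}
    for category in sorted(set(hadoop_answer_set) | set(edw_answer_set))
}
```

Note that no platform ever joins the other's detail rows; only small, already-aggregated answer sets are merged, which is why this works across incompatible systems.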
Out of the Box Possibility: Billions of Rows, Millions of Columns
 Tough problem for all current relational platforms: huge name-value data sources (e.g. customer observations)
 Think about HBase (!)
   Intended for “impossibly wide schemas”
   Fully general binary data content
   Fire hose SCD1 and SCD2 updates of individual records
   Continuously growing rows and columns
 Only simple SQL direct access possible now: no joins…
Summing Up: The Data Warehouse Renaissance
 Hadoop DW becomes equal partner with Enterprise DW
 Hadoop will be the strategic environment of choice for new data types and new analysis modes
 Hadoop:
   Extreme data type diversity
   Huge library of specialty analysis tools with SQL extensions
   Starting point for exploratory BI and ETL-to-EDW processing
   Destination point for serious BI
   Permanent active archive of hyper granular data
 BI tools implement Hadoop-to-EDW integration
The Kimball Group Resource
 www.kimballgroup.com
 Best selling data warehouse books (NEW BOOK! The classic “Toolkit,” 3rd Ed.)
 In depth data warehouse classes taught by primary authors
   Dimensional modeling (Ralph/Margy)
   ETL architecture (Ralph/Bob)
 Dimensional design reviews and consulting by Kimball Group principals
 White Papers on Integration, Data Quality, and Big Data Analytics
Data Warehousing, Meet Hadoop
“A data warehouse DBMS is now expected to coordinate data virtualization strategies, and distributed file and/or processing approaches, to address changes in data management and access requirements.”
– 2014 Gartner Magic Quadrant for Data Warehouse DBMS
What inhibits “Big Data” initiatives?
• No compelling business need
• Not enough staff to support
• Lack of “data science” expertise
• Missing enterprise-grade features
• Complexity of DIY open source
From Apache Hadoop to an enterprise data hub
[Diagram: core Hadoop, MAPREDUCE batch processing on the HDFS filesystem: unified, elastic, resilient, secure storage for any type of data.]
 Open Source, Scalable, Flexible, Cost-Effective ✔
 Managed ✖
 Open Architecture ✖
 Secure and Governed ✖
From Apache Hadoop to an enterprise data hub
[Diagram: MAPREDUCE batch processing on the HDFS filesystem, now with CLOUDERA MANAGER for system management.]
 Open Source, Scalable, Flexible, Cost-Effective ✔
 Managed ✔
 Open Architecture ✖
 Secure and Governed ✖
From Apache Hadoop to an enterprise data hub
[Diagram: YARN workload management over HDFS and HBASE online NoSQL, running MAPREDUCE batch processing, IMPALA analytic SQL, SOLR search engine, SPARK machine learning, SPARK STREAMING stream processing, and 3rd party apps, all under CLOUDERA MANAGER system management.]
 Open Source, Scalable, Flexible, Cost-Effective ✔
 Managed ✔
 Open Architecture ✔
 Secure and Governed ✖
From Apache Hadoop to an enterprise data hub
[Diagram: the full stack: YARN workload management; MAPREDUCE, IMPALA, SOLR, SPARK, SPARK STREAMING, and 3rd party apps over HDFS and HBASE; with CLOUDERA MANAGER system management, CLOUDERA NAVIGATOR data management, and SENTRY security.]
 Open Source, Scalable, Flexible, Cost-Effective ✔
 Managed ✔
 Open Architecture ✔
 Secure and Governed ✔
Cloudera: Your Trusted Advisor for Big Data
 Partners
 Proactive & Predictive Support
 Professional Services
 Training
Advance from Strategy to ROI with Best Practices and Peak Performance
What inhibits “Big Data” initiatives?
• No compelling business need
• Not enough staff to support
• Lack of “data science” expertise
• Missing enterprise-grade features
• Complexity of DIY open source
Disrupt the Industry, Not Your Business
Your Journey to Gaining Value from All Your Data
[Diagram: a journey from IT-driven Operational Efficiency (Faster, Bigger, Cheaper: Cheap Storage, EDW Optimization, ETL Acceleration) to Business-driven Transformative Applications (New Business Value: Agile Exploration, Data Science, Customer 360).]
Thank you for attending!
• Submit questions in the Q&A panel
• For a comprehensive set of Data Warehouse resources (books, in depth classes, overall design consulting): http://www.kimballgroup.com
• Follow: @cloudera, @mattbrandwein

Register now for our next Webinar with Dr. Ralph Kimball:
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Online Webinar | May 29, 2014, 10AM PT / 1PM ET
http://tinyurl.com/kimballwebinar