Designing a Modern
Data Warehouse + Data Lake
Strategies & architecture options for implementing a
modern data warehousing environment
Agenda
• What is a traditional Data warehouse
• ETL Vs ELT
• Evolving to a Modern Data Warehouse
• Current DW challenges
• Evolution of Modern Data warehouse
• Data Lakes Objectives, Challenges, & Implementation
Options
• Some Real use cases
Evolving to a
Modern Data Warehouse
DW+BI Systems Used to Be Fairly Straightforward
Third Party
Data
Reporting Tool
of Choice
OLAP
Semantic
Layer
Batch ETL Enterprise Data
Warehouse
Operational
Data Store
Data
Marts
Operational
Reporting
Historical
Analytical
Reporting
Organizational
Data (Sales,
Inventory, etc)
Master
Data
The traditional data warehouse
• The traditional data warehouse was designed specifically
to be a central repository for all data in a company.
Disparate data from transactional systems, ERP, CRM, and
LOB applications are cleansed—that is, extracted,
transformed, and loaded (ETL)—into the warehouse
within an overall relational schema. The predictable data
structure and quality optimized processing and
operational reporting.
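As a minimal illustration of the batch ETL pattern described above (a sketch only, with hypothetical database files, table names, and columns), a pipeline extracts rows from an operational source, cleanses and transforms them, and loads them into a relational fact table:

```python
import sqlite3
from datetime import datetime

# Hypothetical example: extract orders from an operational system,
# cleanse/transform them, and load them into a warehouse fact table.
source = sqlite3.connect("crm.db")        # stand-in for a CRM/ERP source
warehouse = sqlite3.connect("dw.db")      # stand-in for the relational DW

warehouse.execute("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount REAL,
        order_date TEXT
    )""")

# Extract
rows = source.execute(
    "SELECT order_id, customer_id, amount, order_date FROM orders").fetchall()

# Transform: drop invalid rows, normalize dates to ISO format
cleansed = []
for order_id, customer_id, amount, order_date in rows:
    if amount is None or amount < 0:
        continue  # reject bad records before they reach the warehouse
    iso_date = datetime.strptime(order_date, "%m/%d/%Y").date().isoformat()
    cleansed.append((order_id, customer_id, round(amount, 2), iso_date))

# Load
warehouse.executemany(
    "INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?, ?)", cleansed)
warehouse.commit()
```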
Current DW Challenges
• The following challenges with today’s data warehouses can impede usage and prevent users from maximizing their analytic
capabilities:
• Timeliness. Introducing new content to the enterprise data warehouse can be a time-consuming and cumbersome process, so users may resort to getting the data quickly themselves. Users also may waste valuable time and resources to pull the data from operational systems,
store and manage it themselves, and then analyse it.
• Flexibility. Users not only lack on-demand access to any data they may need at any time, but also the ability to use the tools of
their choice to analyse the data and derive critical insights.
• Quality. Users may view the current data warehouse with suspicion. If where the data originated and how it has been acted on
are unclear, users may not trust the data.
• Findability. With many current data warehousing solutions, users do not have a function to rapidly and easily search for and find
the data they need when they need it. Inability to find data also limits the users’ ability to leverage and build on existing data
analyses.
This new solution needs to support multiple reporting tools in a self-serve capacity, to allow rapid ingestion of new datasets without
extensive modelling, and to scale large datasets while delivering performance. It should support advanced analytics, like machine
learning and text analytics, and allow users to cleanse and process the data iteratively and to track lineage of data for compliance.
Users should be able to easily search and explore structured, unstructured, internal, and external data from multiple sources in one
secure place.
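To make the findability and lineage requirements above concrete, here is a minimal, hypothetical sketch of a data catalog entry that records where a dataset came from, what was done to it, and how users might search for it (all names and fields are illustrative, not a specific catalog product):

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a minimal catalog entry carrying lineage and
# search tags, so users can find a dataset and see where it came from.
@dataclass
class CatalogEntry:
    name: str
    location: str           # e.g. a lake path or warehouse table
    source_systems: list    # lineage: where the data originated
    transformations: list   # lineage: what has been done to it
    tags: list = field(default_factory=list)

catalog = [
    CatalogEntry("sales_curated", "lake/curated/sales/",
                 ["ERP"], ["deduplicated", "currency normalized"],
                 ["sales", "revenue"]),
    CatalogEntry("web_clickstream_raw", "lake/raw/clickstream/",
                 ["web logs"], [], ["web", "behaviour"]),
]

def search(keyword: str):
    """Return entries whose name or tags match the keyword."""
    k = keyword.lower()
    return [e for e in catalog if k in e.name.lower()
            or any(k in t.lower() for t in e.tags)]

for entry in search("sales"):
    print(entry.name, "<-", entry.source_systems, entry.transformations)
```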
Modern Data Warehousing
Third Party
Data
Enterprise Data
Warehouse
Reporting Tool
of Choice
Organizational
Data
Devices &
Sensors
Social Media
Demographics
Data
Near-Real-Time Monitoring
Data Lake
Data
Marts
OLAP
Semantic
Layer
Operational Data Store
Hadoop
Machine Learning
Streaming Data
Batch
ETL
Data Science
Advanced Analytics
Historical
Analytical
Reporting
Operational
Reporting
Mobile
Self-Service
Reports & Models
Master
Data
Federated
Queries
In-Memory
Model
Alerts
Microsoft Azure:
1) Azure SQL Data Warehouse
2) Azure Data Lake
3) Azure Machine
Learning
4) Power BI
AWS:
1) Amazon Redshift
2) Amazon EMR
3) Amazon Athena
4) Amazon QuickSight
GCP:
1) Google BigQuery
2) Google Cloud Datalab
3) Google Cloud Dataprep
4) Google Cloud Dataflow
5) Google Data Studio
Modern DW
• New trends are causing the traditional data warehouse to break in four different ways: data growth, fast query expectations from users, non-relational/unstructured data, and cloud-born data.
• Enter the modern data warehouse, which is built to handle these trends: it accommodates all types of data (including Hadoop-based stores), provides a way to easily interface with all of it, and delivers fast queries even at “big data” scale.
Evolve to a modern data warehouse
• The modern data warehouse lives up to the promise of delivering insights through techniques such as business intelligence and advanced analytics, drawing on all types of data that are growing explosively, processing them in real time, and providing a more robust ability to deliver the right data at the right time.
• A modern data warehouse delivers a comprehensive logical data and analytics platform with a complete suite of fully supported solutions and technologies that can meet the needs of even the most sophisticated and demanding modern enterprise, whether on-premises, in the cloud, or within any hybrid scenario.
What Makes a Data Warehouse “Modern”
Variety of data
sources;
multistructured
Coexists with
Data lake
Coexists with
Hadoop
Larger data
volumes; MPP
Multi-platform
architecture
Data
virtualization +
integration
Support all
user types &
levels
Flexible
deployment
Deployment
decoupled
from dev
Governance
model & MDM
Promotion of
self-service
solutions
Near real-time
data; Lambda
arch
Advanced
analytics
Agile
delivery
Cloud
integration;
hybrid env
Automation &
APIs
Data catalog;
search ability
Scalable
architecture
Analytics
sandbox w/
promotability
Bimodal
environment
Growing an Existing DW Environment
Hadoop
Internal to the data warehouse:
✓ Data modeling strategies
✓ Partitioning
✓ Clustered columnstore index
✓ In-memory structures
✓ MPP (massively parallel processing)
Augment the data warehouse:
✓ Complementary data storage &
analytical solutions
✓ Cloud & hybrid solutions
✓ Data virtualization / virtual DW
Data Lake
NoSQL
In-Memory Model
-- Grow around your existing data warehouse --
Data
Warehouse
Multi-Structured Data
Devices,
Sensors
Social
Media
Web
Logs
Images,
Audio, Video
Spatial,
GPS
Hadoop
Data Lake
NoSQL
Big Data & Analytics
Infrastructure
Data
Warehouse
Reporting tool
of choice
Organizational
Data
Objectives:
1. Storage for multi-structured data (JSON, XML, CSV…) with a ‘polyglot persistence’ strategy
2. Integrate portions of the data into the data warehouse
3. Federated query access (see the sketch below)
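A minimal sketch of objective 3 (federated query access), assuming a hypothetical warehouse table and newline-delimited JSON files in the lake; it joins a warehouse dimension with raw lake events in application code rather than through a virtualization product:

```python
import json
import sqlite3
from pathlib import Path

# Hypothetical sketch of federated access: combine a warehouse
# dimension with raw JSON events sitting in the data lake.
warehouse = sqlite3.connect("dw.db")
customers = dict(warehouse.execute(
    "SELECT customer_id, region FROM dim_customer"))

clicks_per_region = {}
for path in Path("lake/raw/clickstream").glob("*.json"):
    with open(path) as f:
        for line in f:                      # newline-delimited JSON events
            event = json.loads(line)
            region = customers.get(event["customer_id"], "unknown")
            clicks_per_region[region] = clicks_per_region.get(region, 0) + 1

print(clicks_per_region)
```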
Lambda Architecture
Devices &
Sensors
Data Lake
Speed Layer
Event Hub
Stream Analytics
Streaming
Dashboard
Organizational
Data
Batch
Layer
Reporting tool
of choice
Scheduled ETL Reports,
Dashboards
Advanced
Analytics
Curated Data
Data
Warehouse
Data
Mart(s)
Serving Layer
Machine
Learning
Speed Layer:
Low latency data
Batch Layer:
Data processing to
support complex
analysis
Serving Layer:
Responds to queries
Objectives:
• Support large
volume of high-
velocity data
• Near real-time
analysis +
persisted history
Master Data
Raw Data
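The three layers can be sketched in a few lines. This is an illustrative toy, not a production design: the speed layer keeps an incremental, low-latency view, the batch layer periodically recomputes from raw history in the lake, and the serving layer merges the two to answer queries:

```python
from collections import Counter

# Hypothetical sketch of the three Lambda layers over sensor readings.
speed_counts = Counter()      # speed layer: incremental, low-latency view
batch_counts = Counter()      # batch layer: recomputed from raw history

def speed_layer(event):
    """Update the real-time view as each event arrives."""
    speed_counts[event["device_id"]] += 1

def batch_layer(raw_history):
    """Periodically recompute the authoritative view from the data lake."""
    batch_counts.clear()
    for event in raw_history:
        batch_counts[event["device_id"]] += 1
    speed_counts.clear()      # assume the batch view now covers everything

def serving_layer(device_id):
    """Answer queries by merging batch and speed views."""
    return batch_counts[device_id] + speed_counts[device_id]

# Usage: stream events through the speed layer, rebuild the batch view
# on a schedule, and query the serving layer at any time.
for e in [{"device_id": "sensor-1"}, {"device_id": "sensor-2"}]:
    speed_layer(e)
print(serving_layer("sensor-1"))
```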
Larger Scale Data Warehouse: MPP
Massively Parallel Processing (MPP) operates on
high volumes of data across distributed nodes
Shared-nothing architecture: each node has its
own disk, memory, CPU
Decoupled storage and compute
Scale up compute nodes to increase parallelism
Integrates with relational & non-relational data
Control Node
Compute
Node
Compute
Node
Compute
Node
Data Storage
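A toy sketch of the shared-nothing MPP idea: rows are hash-distributed across worker processes standing in for compute nodes, each aggregates only the rows it owns, and a control step combines the partial results (illustrative only; real MPP engines do this in SQL across physical nodes):

```python
from multiprocessing import Pool

# Hypothetical sales rows: (category, amount)
ROWS = [("electronics", 120.0), ("grocery", 35.5),
        ("electronics", 80.0), ("toys", 15.0)] * 1000

def local_aggregate(partition):
    """Each 'node' sums sales over only the rows it owns (shared nothing)."""
    totals = {}
    for category, amount in partition:
        totals[category] = totals.get(category, 0.0) + amount
    return totals

def run(num_nodes=4):
    partitions = [[] for _ in range(num_nodes)]
    for row in ROWS:                              # distribute rows by hash key
        partitions[hash(row[0]) % num_nodes].append(row)
    with Pool(num_nodes) as pool:                 # compute nodes work in parallel
        partials = pool.map(local_aggregate, partitions)
    combined = {}                                 # control node merges partials
    for part in partials:
        for category, total in part.items():
            combined[category] = combined.get(category, 0.0) + total
    return combined

if __name__ == "__main__":
    print(run())
```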
ETL Process
ETL is normally a continuous, ongoing process with a
well-defined workflow. ETL first extracts data from
homogeneous or heterogeneous data sources. Then,
data is cleansed, enriched, transformed, and stored
either back in the lake or in a data warehouse.
ELT (Extract, Load, Transform) is a variant of ETL
wherein the extracted data is first loaded into the
target system. Transformations are performed
after the data is loaded into the data warehouse.
ELT typically works well when the target system is
powerful enough to handle transformations.
Analytical databases like Amazon Redshift and
Google BigQuery are often used in ELT pipelines
because they are highly efficient at performing
transformations at scale.
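A minimal ELT sketch under the assumption of a SQL target (SQLite stands in for the warehouse, and the table names are hypothetical): raw rows are loaded first, then the transformation runs as SQL inside the target itself:

```python
import sqlite3

# Hypothetical ELT sketch: load raw rows into the target system first,
# then transform with SQL inside the warehouse itself.
warehouse = sqlite3.connect("dw.db")
warehouse.executescript("""
    CREATE TABLE IF NOT EXISTS stg_orders_raw (
        order_id TEXT, customer_id TEXT, amount TEXT, order_date TEXT);
    CREATE TABLE IF NOT EXISTS fact_orders (
        order_id INTEGER, customer_id INTEGER, amount REAL, order_date TEXT);
""")

# Extract + Load: land the data as-is, with no cleansing yet
raw_rows = [("1001", "7", "49.90", "2024-01-05"),
            ("1002", "8", "-1", "2024-01-06")]       # bad amount kept for now
warehouse.executemany("INSERT INTO stg_orders_raw VALUES (?, ?, ?, ?)", raw_rows)

# Transform: push the work down to the (presumably powerful) target engine
warehouse.execute("""
    INSERT INTO fact_orders
    SELECT CAST(order_id AS INTEGER),
           CAST(customer_id AS INTEGER),
           CAST(amount AS REAL),
           order_date
    FROM stg_orders_raw
    WHERE CAST(amount AS REAL) >= 0
""")
warehouse.commit()
```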
Summarizing 10 Pros & Cons of ETL
and ELT
• 1. Time - Load
ETL: Uses staging area and system, extra time to load data
ELT: All in one system, load only once
• 2. Time - Transformation
ETL: Need to wait, especially for big data sizes - as data grows, transformation time increases
ELT: All in one system, speed is not dependent on data size
• 3. Time - Maintenance
ETL: High maintenance - choice of data to load and transform and must do it again if deleted
or want to enhance the main data repository
ELT: Low maintenance - all data is always available
• 4. Implementation complexity
ETL: At early stage, requires less space and result is clean
ELT: Requires in-depth knowledge of tools and expert design of the main large repository
• 5. Analysis & processing style
ETL: Based on multiple scripts to create the views - deleting view means deleting data
ELT: Creating ad hoc views - low cost for building and maintaining
ETL Vs ELT (Continued)
• 6. Data limitation or restriction in supply
ETL: By presuming and choosing data a priori
ELT: By HW (none) and data retention policy
• 7. Data warehouse support
ETL: Prevalent legacy model used for on-premises and relational, structured data
ELT: Tailored to scalable cloud infrastructure, supporting structured and unstructured big data sources
• 8. Data lake support
ETL: Not part of approach
ELT: Enables use of lake with unstructured data supported
• 9. Usability
ETL: Fixed tables, Fixed timeline, Used mainly by IT
ELT: Ad Hoc, Agility, Flexibility, Usable by everyone from developer to citizen integrator
• 10. Cost-effective
ETL: Not cost-effective for small and medium businesses
ELT: Scalable and available to all business sizes using online SaaS solutions
Data Lake
Objectives, Challenges &
Implementation Options
Data Lake
Data Lake
Repository
Web
Log
Social
Sensor
Images
A repository for analyzing large
quantities of disparate sources of data in
its native format
One architectural platform to house all
types of data:
o Machine-generated data (ex: IoT, logs)
o Human-generated data (ex: tweets, e-mail)
o Traditional operational data (ex: sales, inventory)
Objectives of a Data Lake
✓ Reduce up-front effort by ingesting
data in any format without requiring a
schema initially
✓ Make acquiring new data easy, so it can
be available for data science & analysis
quickly
✓ Store large volume of multi-structured
data in its native format
Data Lake
Repository
Web
Log
Social
Sensor
Images
Objectives of a Data Lake
✓ Defer work to ‘schematize’ after value &
requirements are known (see the sketch below)
✓ Achieve agility faster than a traditional
data warehouse can
✓ Speed up decision-making ability
✓ Storage for additional types of data
which were historically difficult to obtain
Data Lake
Repository
Web
Log
Social
Sensor
Images
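A small sketch of the ‘schema-on-read’ objective above, assuming raw JSON events were landed with no upfront modelling; a schema is applied only when the data is read, once the required fields are known (paths and field names are hypothetical):

```python
import json
from pathlib import Path

# Hypothetical schema-on-read sketch: events were landed as raw JSON with
# no upfront modelling; a schema is applied only when the data is read.
raw_dir = Path("lake/raw/devices")

def read_with_schema(required=("device_id", "temperature", "ts")):
    """Project raw events onto the fields that analysis now requires."""
    for path in raw_dir.glob("*.json"):
        for line in path.read_text().splitlines():
            event = json.loads(line)
            if all(k in event for k in required):
                yield {k: event[k] for k in required}

readings = list(read_with_schema())
print(len(readings), "events matched the schema applied at read time")
```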
Data Lake as a Staging Area for DW
Corporate
Data
CRM
Data
Warehouse Reporting tool
of choice
Social Media
Data
Devices &
Sensors
Raw Data:
Staging Area
Data Lake Store
Strategy:
• Reduce storage
needs in data
warehouse
• Practical use for
data stored in the
data lake
1. Utilize the data lake as a landing area for the DW staging area, instead of the relational database (see the sketch below)
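A minimal sketch of this staging pattern, with hypothetical paths and tables: the raw extract lands in a date-partitioned folder in the lake's staging zone, and only the curated subset is later loaded into the warehouse:

```python
import csv
import sqlite3
from datetime import date
from pathlib import Path

# Hypothetical sketch: land the raw extract in the lake's staging zone
# (date-partitioned folders) instead of a relational staging schema,
# then load only the curated subset into the warehouse.
landing = Path(f"lake/staging/crm/{date.today():%Y/%m/%d}")
landing.mkdir(parents=True, exist_ok=True)

raw_extract = [{"customer_id": "7", "country": "IN", "status": "active"},
               {"customer_id": "8", "country": "",   "status": "inactive"}]
with open(landing / "customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=raw_extract[0].keys())
    writer.writeheader()
    writer.writerows(raw_extract)

# Later: load only what the warehouse needs from the staged file
warehouse = sqlite3.connect("dw.db")
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS dim_customer (customer_id INT, country TEXT)")
with open(landing / "customers.csv") as f:
    rows = [(r["customer_id"], r["country"])
            for r in csv.DictReader(f) if r["country"]]   # skip incomplete rows
warehouse.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)
warehouse.commit()
```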
Data Lake Implementation
A data lake is a conceptual idea.
It can be implemented with one or more
technologies.
HDFS (Hadoop Distributed File System)
is a very common option for data lake
storage. However, Hadoop is not a
requirement for a data lake. A data lake
may also span > 1 Hadoop cluster.
NoSQL databases are also very common.
Data Lake
NoSQL
HDFS
Multi-Technology Data Lake
Corporate
Data
CRM
Data
Warehouse
Reporting tool
of choice
Social Media
Data
Devices &
Sensors
Raw Data
Data Lake Store
Strategy:
Use a ‘conceptual’
data lake strategy
with multiple
technologies, each
best suited to the
data (Polyglot Persistence)
DocumentDB Collection
Standardized (Curated) Data
Data Lake Store
1. Ingest hierarchical, nested JSON data into DocumentDB
2. Standardize data into tabular text file (see the sketch below)
“Conceptual” Data Lake
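A small sketch of step 2 above: flattening hierarchical, nested JSON documents (as they might come out of a document store) into a tabular text file for the curated zone. The document structure and paths are hypothetical, and the document-store ingestion itself is out of scope here:

```python
import csv
from pathlib import Path

# Hypothetical nested order documents, e.g. as exported from a document store
docs = [
    {"order_id": 1001, "customer": {"id": 7, "region": "west"},
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B9", "qty": 1}]},
]

def flatten(doc):
    """Emit one flat row per line item in the nested order document."""
    for item in doc["items"]:
        yield {"order_id": doc["order_id"],
               "customer_id": doc["customer"]["id"],
               "region": doc["customer"]["region"],
               "sku": item["sku"],
               "qty": item["qty"]}

curated = Path("lake/curated")
curated.mkdir(parents=True, exist_ok=True)
with open(curated / "orders.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["order_id", "customer_id", "region", "sku", "qty"])
    writer.writeheader()
    for doc in docs:
        writer.writerows(flatten(doc))
```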
Coexistence of Data Lake & Data Warehouse
Data Lake Values:
✓ Agility
✓ Flexibility
✓ Rapid Delivery
✓ Exploration
DW Values:
✓ Governance
✓ Reliability
✓ Standardization
✓ Security
Data ingestion: less effort in the Data Lake, more effort in the Enterprise Data Warehouse
Data retrieval: more effort in the Data Lake, less effort in the Enterprise Data Warehouse
Challenges of a Data Lake: Technology
Complex, multi-
layered architecture
✓ Polygot persistence strategy
✓ Architectures & toolsets are emerging and maturing
Unknown storage &
scalability
✓ Can we realistically store “everything?”
✓ Cloud deployments are attractive when scale is undetermined
Working with un-curated data
✓ Missing or incomplete data
✓ Inconsistent dates and data types
✓ Data type mismatches
✓ Flex-field data which can vary per record
✓ Different granularities
✓ Incremental data loads, etc…
Challenges of a Data Lake: Technology
Performance
✓ Trade-offs between latency, scalability, & query performance
✓ Monitoring & auditing
Data retrieval
✓ Easy access for data consumers
✓ Organization of data to facilitate data retrieval
✓ Business metadata is *critical* for making sense of the data
Change
management
✓ File structure changes (inconsistent ‘schema-on-read’)
✓ Integration with master data
✓ Inherent risks associated with a highly flexible repository
✓ Maintaining & updating over time (meeting future needs)
Challenges of a Data Lake: Process
Finding right balance of agility (deferred work vs. up-front work)
✓ Finding optimal level of chaos which invites experimentation
✓ Risk & complexity are still there – just shifted
✓ When schema-on-read is appropriate (Temporarily? Permanently?)
✓ How the data lake will coexist with the data warehouse
✓ How to operationalize self-service/sandbox work from analysts
Data quality
✓ How to reconcile or confirm accuracy of results
✓ Data validation between systems
Governance & security
✓ Challenging to implement data governance
✓ Difficult to enforce standards
✓ Securing and obfuscating confidential data
✓ Meeting compliance and regulatory requirements
Data Lake vs. Data Warehouse

Storage
• Data Lake: All data is kept regardless of its source and structure, in its raw form; it is transformed only when it is ready to be used.
• Data Warehouse: Consists of data extracted from transactional systems, or quantitative metrics with their attributes; the data is cleaned and transformed.

History
• Data Lake: The big data technologies used in data lakes are relatively new.
• Data Warehouse: The data warehouse concept, unlike big data, has been used for decades.

Data Capturing
• Data Lake: Captures all kinds of data, structured, semi-structured, and unstructured, in its original form from source systems.
• Data Warehouse: Captures structured information and organizes it in schemas defined for data warehouse purposes.

Data Timeline
• Data Lake: Can retain all data, including data not currently in use but which might be used in the future; data is kept indefinitely so analysts can go back in time.
• Data Warehouse: In the development process, significant time is spent analyzing the various data sources.

Users
• Data Lake: Ideal for users who perform deep analysis, such as data scientists who need advanced analytical tools with capabilities like predictive modeling and statistical analysis.
• Data Warehouse: Ideal for operational users because it is well structured and easy to use and understand.

Storage Costs
• Data Lake: Storing data with big data technologies is relatively inexpensive.
• Data Warehouse: Storing data in a data warehouse is costlier and more time-consuming.

Task
• Data Lake: Can contain all data and data types; it empowers users to access data before it has been transformed, cleansed, and structured.
• Data Warehouse: Provides insights into pre-defined questions for pre-defined data types.

Processing time
• Data Lake: Users can access data before it has been transformed, cleansed, and structured, so they get to results more quickly than with a traditional data warehouse.
• Data Warehouse: Offers insights into pre-defined questions for pre-defined data types, so any change to the data warehouse needs more time.

Position of Schema
• Data Lake: Typically the schema is defined after data is stored. This offers high agility and ease of data capture but requires work at the end of the process.
• Data Warehouse: Typically the schema is defined before data is stored. This requires work at the start of the process, but offers performance, security, and integration.

Data processing
• Data Lake: Uses the ELT (Extract, Load, Transform) process.
• Data Warehouse: Uses a traditional ETL (Extract, Transform, Load) process.

Complaints
• Data Lake: Data is kept in its raw form and only transformed when it is ready to be used.
• Data Warehouse: The chief complaint against data warehouses is the difficulty of making changes in them.

Key Benefits
• Data Lake: Integrates different types of data to answer entirely new questions; these users are not likely to use data warehouses because they may need to go beyond their capabilities.
• Data Warehouse: Most users in an organization are operational; these types of users only care about reports and key performance metrics.
What Makes A Data Warehouse
“Modern”
What Makes a Data Warehouse “Modern”
Variety of subject areas & data sources for analysis
with capability to handle large volumes of data
Expansion beyond a single relational DW/data mart
structure to include Hadoop, Data Lake, or NoSQL
Logical design across multi-platform architecture
balancing scalability & performance
Data virtualization in addition to data integration
What Makes a Data Warehouse “Modern”
Support for all types & levels of users
Flexible deployment (including mobile) which is
decoupled from tool used for development
Governance model to support trust and security,
and master data management
Support for promoting self-service solutions to
the corporate environment
What Makes a Data Warehouse “Modern”
Ability to facilitate near real-time analysis on
high velocity data (Lambda architecture)
Support for advanced analytics
Agile delivery approach with fast delivery cycle
Hybrid integration with cloud services
APIs for downstream access to data
Challenges With Modern
Data Warehousing
Challenges with Modern Data Warehousing
Agility
Reducing time to
value Minimizing chaos
Evolving &
maturing
technology
Balancing ‘schema
on write’ with
‘schema on read’
How strict to be
with dimensional
design?
Challenges with Modern Data Warehousing
Complexity
Hybrid
scenarios
Multi-
platform
infrastructure
Ever-
increasing
data volumes
File type &
format
diversity
Real-time
reporting
needs
Effort & cost
of data
integration
Broad skillsets
needed
Challenges with Modern Data Warehousing
Balance with Self-Service Initiatives
Self-service solutions
which challenge
centralized DW
Managing ‘production’
delivery from IT and
user-created solutions
Handling ownership
changes (promotion) of
valuable solutions
How Delta Airlines Uses Big Data for
Customer Sentiment Analysis
• Airline companies use sentiment analysis to understand the flyer experience. Large airlines like Delta monitor tweets to find out how their customers feel about delays, upgrades, in-flight entertainment, and more. For example, a customer may tweet negatively about lost baggage before boarding a connecting flight. The airline identifies such negative tweets and forwards them to its support team. The support team sends a representative to meet the passenger at the destination, presenting a free first-class upgrade for the return flight along with information about the tracked baggage and a promise to deliver it as soon as the passenger steps off the plane. The customer then tweets happily for the rest of the trip, helping the airline build positive brand recognition.
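A toy sketch of the routing step in this workflow: flag likely-negative tweets and forward them to a support queue. The keyword heuristic is purely illustrative; a real pipeline would use a trained sentiment model and the airline's own tooling:

```python
# Hypothetical sketch: flag likely-negative tweets and route them to a
# support queue. A real pipeline would use a trained sentiment model
# rather than this simple keyword heuristic.
NEGATIVE_TERMS = {"lost", "delayed", "worst", "angry", "missing"}

def is_negative(tweet: str) -> bool:
    words = {w.strip(".,!?").lower() for w in tweet.split()}
    return bool(words & NEGATIVE_TERMS)

support_queue = []
stream = [
    "@airline my baggage is lost and I board my connection in 20 minutes!",
    "Great crew on flight 332 today, thanks!",
]
for tweet in stream:
    if is_negative(tweet):
        support_queue.append(tweet)   # forwarded to the support team

print(support_queue)
```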
Streaming analytics helps retail stores track the indoor location of their customers, discover the most attractive products in the shop, send an 'offer of the day' when customers are within a certain radius, provide offers related to their specific products of interest, suggest other products related to what they are looking at right now, offer coupons based on customers' previous purchases, and obtain more advanced statistics for the store manager to refine and redefine business strategies.
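The geofenced 'offer of the day' rule can be sketched with a simple distance check (coordinates, radius, and customer IDs are hypothetical; a real system would consume a location stream and push offers through a messaging service):

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical store location and geofence radius
STORE = (18.5204, 73.8567)
RADIUS_KM = 0.5

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def maybe_send_offer(customer_id, location):
    """Send the offer only when the customer is within the geofence."""
    if haversine_km(STORE, location) <= RADIUS_KM:
        print(f"offer of the day sent to {customer_id}")

maybe_send_offer("cust-42", (18.5213, 73.8575))   # inside the radius
maybe_send_offer("cust-77", (18.6000, 73.9000))   # too far away
```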
Doing Away with Dangerous Police Chases
(IoT)
• Law enforcement agencies in Austin (Texas), Los Angeles, Arizona, and Wisconsin have been trying out a system that eliminates the need for police to engage in dangerous high-speed chases of suspects. An air-compressor launcher on the front of the patrol car fires a sticky GPS locator with a transmitter. Police can then remotely track the vehicle instead of chasing it.
Final Thoughts
Traditional data warehousing still is important, but needs to co-exist with
other platforms. Build around your existing DW infrastructure.
Plan your data lake with data retrieval in mind.
Expect to balance ETL with some (perhaps limited) data virtualization
techniques in a multi-platform environment.
Be fully aware of data lake & data virtualization challenges in order to
craft your own achievable & realistic best practices.
Plan to work in an agile fashion, leveraging self-service & sandbox
solutions which can be promoted to the corporate solution.
Recommended Resources
Fundamentals
More Advanced
Recommended Whitepaper:
How to Build An Enterprise Data Lake:
Important Considerations Before Jumping
In by Mark Madsen, Third Nature Inc.
Wrap-Up &
Questions
Appendix :
New sources and types of data
• The traditional data warehouse was built as a well-structured, sanitized, and trusted repository. Yet today more than 85 percent of data volume comes from a variety of new data types proliferating from mobile and social channels, scanners, sensors, RFID tags, devices, feeds, and other sources outside the business. These data types do not easily fit the business schema model and may not be cost-effective to ETL into the relational data warehouse. Yet these new types of data have the potential to enhance business operations. For example, a shipping company might use fuel and weight sensors with GPS, traffic, and weather feeds to optimize shipping routes or fleet usage.
Big data is driving us to a tipping point
Large data
volumes
Multiple
data types
Real-time
data creation
User
expectations
Mobility
Hardware and storage economics
Clickstream analysis
Clickstream data is the trail of digital breadcrumbs left by users as they click their way through
a website, and it’s loaded with valuable customer information for businesses.
Clickstream analysis is the process of collecting, analysing, and reporting aggregate data about
which pages visitors visit, and in what order. This reveals usage patterns, which in turn gives a
heightened understanding of customer behaviour. This use of the analysis creates a user
profile that aids in understanding the types of people that visit a company’s website.
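A minimal sketch of clickstream aggregation as described above, with hypothetical events: page views are grouped by visitor, ordered in time, and the most common paths are counted:

```python
from collections import Counter, defaultdict

# Hypothetical raw clickstream events (visitor, page, timestamp)
events = [
    {"visitor": "v1", "page": "/home",    "ts": 1},
    {"visitor": "v1", "page": "/pricing", "ts": 2},
    {"visitor": "v2", "page": "/home",    "ts": 1},
    {"visitor": "v2", "page": "/docs",    "ts": 3},
    {"visitor": "v2", "page": "/pricing", "ts": 5},
]

# Group page views by visitor in time order
sessions = defaultdict(list)
for e in sorted(events, key=lambda e: (e["visitor"], e["ts"])):
    sessions[e["visitor"]].append(e["page"])

# Count the most common navigation paths
path_counts = Counter(" -> ".join(pages) for pages in sessions.values())
for path, count in path_counts.most_common():
    print(count, path)
```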
Challenges of a Data Lake
Technology
People
Process

More Related Content

What's hot

Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Delivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeDelivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeKent Graziano
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
A 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with SnowflakeA 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with SnowflakeSnowflake Computing
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceSnowflake Computing
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation Brett VanderPlaats
 
Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWKent Graziano
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Amazon Web Services
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the CloudHow to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the CloudDenodo
 

What's hot (20)

Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Delivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeDelivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with Snowflake
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
 
A 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with SnowflakeA 30 day plan to start ending your data struggle with Snowflake
A 30 day plan to start ending your data struggle with Snowflake
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
data warehouse vs data lake
data warehouse vs data lakedata warehouse vs data lake
data warehouse vs data lake
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a Service
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFW
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
 
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the CloudHow to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
 

Similar to Designing modern dw and data lake

Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Denodo
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)Moacyr Passador
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 
Managing The Data Deluge By Optimizing Storage
Managing The Data Deluge By Optimizing StorageManaging The Data Deluge By Optimizing Storage
Managing The Data Deluge By Optimizing StorageDell World
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...DATAVERSITY
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Data Mesh using Microsoft Fabric
Data Mesh using Microsoft FabricData Mesh using Microsoft Fabric
Data Mesh using Microsoft FabricNathan Bijnens
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfTop 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfDatacademy.ai
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKRajesh Jayarman
 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousingsumit621
 
the Data World Distilled
the Data World Distilledthe Data World Distilled
the Data World DistilledRTTS
 
Modern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | QuboleModern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | QuboleVasu S
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?RTTS
 
BD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdfBD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdferamfatima43
 

Similar to Designing modern dw and data lake (20)

Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
ETL Technologies.pptx
ETL Technologies.pptxETL Technologies.pptx
ETL Technologies.pptx
 
Managing The Data Deluge By Optimizing Storage
Managing The Data Deluge By Optimizing StorageManaging The Data Deluge By Optimizing Storage
Managing The Data Deluge By Optimizing Storage
 
Big Data
Big DataBig Data
Big Data
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Data Mesh using Microsoft Fabric
Data Mesh using Microsoft FabricData Mesh using Microsoft Fabric
Data Mesh using Microsoft Fabric
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfTop 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdf
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Datawarehousing
DatawarehousingDatawarehousing
Datawarehousing
 
the Data World Distilled
the Data World Distilledthe Data World Distilled
the Data World Distilled
 
Modern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | QuboleModern Integrated Data Environment - Whitepaper | Qubole
Modern Integrated Data Environment - Whitepaper | Qubole
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?
 
BD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdfBD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdf
 
Data wirehouse
Data wirehouseData wirehouse
Data wirehouse
 

More from punedevscom

Cloud Security Webinar
Cloud Security WebinarCloud Security Webinar
Cloud Security Webinarpunedevscom
 
Understanding .Net Standards, .Net Core & .Net Framework
Understanding .Net Standards, .Net Core & .Net FrameworkUnderstanding .Net Standards, .Net Core & .Net Framework
Understanding .Net Standards, .Net Core & .Net Frameworkpunedevscom
 
Text Mining - Text data Visualization
Text Mining - Text data VisualizationText Mining - Text data Visualization
Text Mining - Text data Visualizationpunedevscom
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processingpunedevscom
 
Enabling DevOps for enterprise
Enabling DevOps for enterpriseEnabling DevOps for enterprise
Enabling DevOps for enterprisepunedevscom
 
Machine Learning: Real life business application
Machine Learning: Real life business applicationMachine Learning: Real life business application
Machine Learning: Real life business applicationpunedevscom
 
Technical Documentation Within SDLC
Technical Documentation Within SDLC Technical Documentation Within SDLC
Technical Documentation Within SDLC punedevscom
 
Data Preparation and Dimension Reduction
Data Preparation and Dimension Reduction Data Preparation and Dimension Reduction
Data Preparation and Dimension Reduction punedevscom
 

More from punedevscom (10)

Cloud Security Webinar
Cloud Security WebinarCloud Security Webinar
Cloud Security Webinar
 
Understanding .Net Standards, .Net Core & .Net Framework
Understanding .Net Standards, .Net Core & .Net FrameworkUnderstanding .Net Standards, .Net Core & .Net Framework
Understanding .Net Standards, .Net Core & .Net Framework
 
Text Mining - Text data Visualization
Text Mining - Text data VisualizationText Mining - Text data Visualization
Text Mining - Text data Visualization
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
IoT and AI
 IoT and AI IoT and AI
IoT and AI
 
Enabling DevOps for enterprise
Enabling DevOps for enterpriseEnabling DevOps for enterprise
Enabling DevOps for enterprise
 
Remote working
Remote workingRemote working
Remote working
 
Machine Learning: Real life business application
Machine Learning: Real life business applicationMachine Learning: Real life business application
Machine Learning: Real life business application
 
Technical Documentation Within SDLC
Technical Documentation Within SDLC Technical Documentation Within SDLC
Technical Documentation Within SDLC
 
Data Preparation and Dimension Reduction
Data Preparation and Dimension Reduction Data Preparation and Dimension Reduction
Data Preparation and Dimension Reduction
 

Recently uploaded

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Recently uploaded (20)

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Designing modern dw and data lake

  • 1. Designing a Modern Data Warehouse + Data Lake Strategies & architecture options for implementing a modern data warehousing environment
  • 2. Agenda • What is a traditional Data warehouse • ETL Vs ELT • Evolving to a Modern Data Warehouse • Current DW challenges • Evolution of Modern Data warehouse • Data Lakes Objectives, Challenges, & Implementation Options • Some Real use cases
  • 3. Evolving to a Modern Data Warehouse
  • 4. DW+BI Systems Used to Be Fairly Straightforward Third Party Data Reporting Tool of Choice OLAP Semantic Layer Batch ETL Enterprise Data Warehouse Operational Data Store Data Marts Operational Reporting Historical Analytical Reporting Organizational Data (Sales, Inventory, etc) Master Data
  • 5. The traditional data warehouse • The traditional data warehouse was designed specifically to be a central repository for all data in a company. Disparate data from transactional systems, ERP, CRM, and LOB applications are cleansed—that is, extracted, transformed, and loaded (ETL)—into the warehouse within an overall relational schema. The predictable data structure and quality optimized processing and operational reporting.
  • 6. Current DW Challenges • The following challenges with today’s data warehouses can impede usage and prevent users from maximizing their analytic capabilities: • Timeliness. Introducing new content to the enterprise data warehouse can be a time-consuming and cumbersome process. getting the data quickly themselves. Users also may waste valuable time and resources to pull the data from operational systems, store and manage it themselves, and then analyse it. • Flexibility. Users not only lack on-demand access to any data they may need at any time, but also the ability to use the tools of their choice to analyse the data and derive critical insights. • Quality. Users may view the current data warehouse with suspicion. If where the data originated and how it has been acted on are unclear, users may not trust the data. • Findability. With many current data warehousing solutions, users do not have a function to rapidly and easily search for and find the data they need when they need it. Inability to find data also limits the users’ ability to leverage and build on existing data analyses. This new solution needs to support multiple reporting tools in a self-serve capacity, to allow rapid ingestion of new datasets without extensive modelling, and to scale large datasets while delivering performance. It should support advanced analytics, like machine learning and text analytics, and allow users to cleanse and process the data iteratively and to track lineage of data for compliance. Users should be able to easily search and explore structured, unstructured, internal, and external data from multiple sources in one secure place.
  • 7. Modern Data Warehousing Third Party Data Enterprise Data Warehouse Reporting Tool of Choice Organizational Data Devices & Sensors Social Media Demographics Data Near-Real-Time Monitoring Data Lake Data Marts OLAP Semantic Layer Operational Data Store Hadoop Machine Learning Streaming Data Batch ETL Data Science Advanced Analytics Historical Analytical Reporting Operational Reporting Mobile Self-Service Reports & Models Master Data Federated Queries In-Memory Model Alerts
  • 8. Microsoft Azure: 1) Azure Data warehouse 2) Azure Data Lake 3) Azure Machine Learning 4) Power BI AWS: 1) Amazon Redshift 2) Amazon EMR 3) Amazon Athena 4) Amazon QuickSight GCP: 1) Google Big Query 2) Google Cloud Datalab 3) Google Cloud DataPrep 4) Google Cloud Dataflow 5) Cloud Data Studio
  • 9. Modern DW • New trends are causing it to break in four different ways: data growth, fast query expectations from users, non- relational/unstructured data, and cloud-born data. • Enter the modern data warehouse, which is able to handle and excel with these new trends. It handles all types of data (Hadoop), provides a way to easily interface with all these types of data and can handle “big data” and provide fast queries.
  • 10. Evolve to a modern data warehouse • The modern data warehouse lives up to the promise to get insights from a variety of techniques like business intelligence and advanced analytics from all types of data that are growing explosively, and processing in real-time, with a more robust ability to deliver the right data at the right time. • A modern data warehouse delivers a comprehensive logical data and analytics platform with a complete suite of fully supported, solutions and technologies that can meet the needs of even the most sophisticated and demanding modern enterprise—on- premises, in the cloud, or within any hybrid scenario
  • 11. What Makes a Data Warehouse “Modern” Variety of data sources; multistructured Coexists with Data lake Coexists with Hadoop Larger data volumes; MPP Multi-platform architecture Data virtualization + integration Support all user types & levels Flexible deployment Deployment decoupled from dev Governance model & MDM Promotion of self-service solutions Near real-time data; Lambda arch Advanced analytics Agile delivery Cloud integration; hybrid env Automation & APIs Data catalog; search ability Scalable architecture Analytics sandbox w/ promotability Bimodal environment
  • 12. Growing an Existing DW Environment Hadoop Internal to the data warehouse: ✓ Data modeling strategies ✓ Partitioning ✓ Clustered columnstore index ✓ In-memory structures ✓ MPP (massively parallel processing) Augment the data warehouse: ✓ Complementary data storage & analytical solutions ✓ Cloud & hybrid solutions ✓ Data virtualization / virtual DW Data Lake NoSQLIn-Memory Model -- Grow around your existing data warehouse -- Data Warehouse
  • 13. Multi-Structured Data Devices, Sensors Social Media Web Logs Images, Audio, Video Spatial, GPS Hadoop Data Lake NoSQL Big Data & Analytics Infrastructure Data Warehouse Reporting tool of choice Organizational Data Objectives: 1 Storage for multi structured data (json, xml, csv…) with a ‘polygot persistence’ strategy - Integrate portions of the data into data warehouse Federated query access1 22 3 3
  • 14. Lambda Architecture Devices & Sensors Data Lake Speed LayerEvent Hub Stream Analytics Streaming Dashboard Organizational Data Batch Layer Reporting tool of choice Scheduled ETL Reports, Dashboards Advanced Analytics Curated Data Data Warehouse Data Mart(s) Serving Layer Machine Learning Speed Layer: Low latency data Batch Layer: Data processing to support complex analysis Serving Layer: Responds to queries Objectives: • Support large volume of high- velocity data • Near real-time analysis + persisted history Master Data Raw Data
  • 15. Larger Scale Data Warehouse: MPP Massively Parallel Processing (MPP) operates on high volumes of data across distributed nodes Shared-nothing architecture: each node has its own disk, memory, CPU Decoupled storage and compute Scale up compute nodes to increase parallelism Integrates with relational & non-relational data Control Node Compute Node Compute Node Compute Node Data Storage
  • 16. ETL Process ETL is normally a continuous, ongoing process with a well-defined workflow. ETL first extracts data from homogeneous or heterogeneous data sources. Then, data is cleansed, enriched, transformed, and stored either back in the lake or in a data warehouse. ELT (Extract, Load, Transform) is a variant of ETL wherein the extracted data is first loaded into the target system. Transformations are performed after the data is loaded into the data warehouse. ELT typically works well when the target system is powerful enough to handle transformations. Analytical databases like Amazon Redshift and Google BigQuery are often used in ELT pipelines because they are highly efficient in performing
  • 17. Summarizing 10 Pros & Cons of ETL and ELT • 1. Time - Load. ETL: uses a separate staging area and system, so extra time is spent loading data. ELT: everything is in one system, and data is loaded only once. • 2. Time - Transformation. ETL: you must wait for transformations, especially with big data; as data grows, transformation time increases. ELT: everything is in one system, and speed is not dependent on data size. • 3. Time - Maintenance. ETL: high maintenance; you must choose which data to load and transform, and repeat the work if data is deleted or you want to enhance the main data repository. ELT: low maintenance, since all data is always available. • 4. Implementation complexity. ETL: at an early stage it requires less space, and the result is clean. ELT: requires in-depth knowledge of the tools and expert design of the main, large repository. • 5. Analysis & processing style. ETL: based on multiple scripts that create views; deleting a view means deleting data. ELT: ad hoc view creation, with low cost to build and maintain.
  • 18. ETL Vs ELT (Continued) • 6. Data limitation or restriction in supply. ETL: limited by presuming and choosing data a priori. ELT: limited only by hardware (effectively not at all) and the data retention policy. • 7. Data warehouse support. ETL: the prevalent legacy model for on-premises, relational, structured data. ELT: tailored to scalable cloud infrastructure and supports structured and unstructured (big data) sources. • 8. Data lake support. ETL: not part of the approach. ELT: enables use of a data lake, with unstructured data supported. • 9. Usability. ETL: fixed tables and fixed timelines, used mainly by IT. ELT: ad hoc, agile, and flexible; usable by everyone from developers to citizen integrators. • 10. Cost-effectiveness. ETL: not cost-effective for small and medium businesses. ELT: scalable and available to all business sizes through online SaaS solutions.
  • 19. Data Lake Objectives, Challenges & Implementation Options
  • 20. Data Lake
A repository for analyzing large quantities of disparate sources of data (web logs, social, sensor, images) in their native format.
One architectural platform to house all types of data:
o Machine-generated data (ex: IoT, logs)
o Human-generated data (ex: tweets, e-mail)
o Traditional operational data (ex: sales, inventory)
  • 21. Objectives of a Data Lake
✓ Reduce up-front effort by ingesting data in any format without requiring a schema initially
✓ Make acquiring new data easy, so it can be available for data science & analysis quickly
✓ Store large volumes of multi-structured data in their native format
  • 22. Objectives of a Data Lake
✓ Defer the work to 'schematize' until value & requirements are known
✓ Achieve agility faster than a traditional data warehouse can
✓ Speed up decision-making ability
✓ Store additional types of data which were historically difficult to obtain
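The "defer schematization" objective is essentially schema-on-read. A minimal sketch, assuming raw JSON events landed as-is and a schema applied only at read time; the file name, device fields, and sample events are illustrative, not from the slides.

```python
# Schema-on-read sketch: ingest raw events untouched, project a schema only when reading.
import json

RAW_EVENTS = [
    '{"device": "thermostat-7", "temp_c": 21.5, "ts": "2024-01-01T08:00:00Z"}',
    '{"device": "thermostat-7", "temp_c": 22.1, "ts": "2024-01-01T09:00:00Z", "fw": "1.2"}',
]

def land_raw(lines, path="events.jsonl"):
    """Ingest step: write events exactly as received; no schema is enforced here."""
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

def read_with_schema(path, fields):
    """Read step: project only the fields this analysis needs, tolerating extras or gaps."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            yield {field: record.get(field) for field in fields}

land_raw(RAW_EVENTS)
print(list(read_with_schema("events.jsonl", ["device", "temp_c"])))
```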
  • 23. Data Lake as a Staging Area for DW
Sources: corporate data, CRM, social media data, devices & sensors.
Flow: raw data lands in the data lake store (staging area) and is then loaded into the data warehouse for the reporting tool of choice.
Strategy:
• Utilize the data lake as the landing/staging area for the DW, instead of the relational database
• Reduce storage needs in the data warehouse
• Provide a practical use for data stored in the data lake
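One common way to realize this is a date-partitioned raw zone in the lake that the warehouse load reads from. A small sketch under that assumption; the folder layout, source name, and file name are hypothetical.

```python
# Lake-as-staging sketch: land incoming extracts under raw/<source>/<yyyy>/<mm>/<dd>/,
# then point the DW load job at that path instead of a relational staging database.
import os
from datetime import date

LAKE_ROOT = "datalake"

def land_extract(source_system: str, file_name: str, content: bytes) -> str:
    """Write an incoming extract into the date-partitioned raw zone and return its path."""
    today = date.today()
    target_dir = os.path.join(
        LAKE_ROOT, "raw", source_system,
        f"{today.year:04d}", f"{today.month:02d}", f"{today.day:02d}",
    )
    os.makedirs(target_dir, exist_ok=True)
    target_path = os.path.join(target_dir, file_name)
    with open(target_path, "wb") as f:
        f.write(content)
    return target_path  # the warehouse load would read from this staging path

print(land_extract("crm", "contacts.csv", b"id,name\n1,Alice\n"))
```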
  • 24. Data Lake Implementation A data lake is a conceptual idea. It can be implemented with one or more technologies. HDFS (Hadoop Distributed File System) is a very common option for data lake storage; however, Hadoop is not a requirement for a data lake, and a data lake may also span more than one Hadoop cluster. NoSQL databases are also very common.
  • 25. Multi-Technology ("Conceptual") Data Lake
Sources: corporate data, CRM, social media data, devices & sensors.
Strategy: use a 'conceptual' data lake spanning multiple technologies, each best suited to the data it holds (polyglot persistence).
1. Ingest hierarchical, nested JSON data into a DocumentDB collection
2. Standardize (curate) the data into tabular text files in the data lake store, feeding the data warehouse and the reporting tool of choice
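Step 2, the standardization from nested documents into tabular text, can be sketched as follows; the document shape, column names, and order data are illustrative assumptions.

```python
# Flatten nested JSON documents (as they might sit in a document store) into a
# standardized tabular text file for the curated zone of the lake.
import csv
import io
import json

DOCS = [
    '{"orderId": 1, "customer": {"id": "C10", "country": "DE"}, "total": 99.5}',
    '{"orderId": 2, "customer": {"id": "C11", "country": "US"}, "total": 15.0}',
]

def flatten(doc: dict) -> dict:
    """Pull nested attributes up into flat, warehouse-friendly columns."""
    return {
        "order_id": doc["orderId"],
        "customer_id": doc["customer"]["id"],
        "customer_country": doc["customer"]["country"],
        "total": doc["total"],
    }

buffer = io.StringIO()  # stands in for a curated file in the data lake store
writer = csv.DictWriter(buffer, fieldnames=["order_id", "customer_id", "customer_country", "total"])
writer.writeheader()
for raw in DOCS:
    writer.writerow(flatten(json.loads(raw)))
print(buffer.getvalue())
```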
  • 26. Coexistence of Data Lake & Data Warehouse
Data Lake values: ✓ Agility ✓ Flexibility ✓ Rapid Delivery ✓ Exploration. Data ingestion takes less effort; data retrieval takes more effort.
DW values: ✓ Governance ✓ Reliability ✓ Standardization ✓ Security. Data ingestion takes more effort; data retrieval takes less effort.
  • 27. Challenges of a Data Lake: Technology
Complex, multi-layered architecture:
✓ Polyglot persistence strategy
✓ Architectures & toolsets are still emerging and maturing
Unknown storage & scalability:
✓ Can we realistically store "everything"?
✓ Cloud deployments are attractive when scale is undetermined
Working with un-curated data:
✓ Missing or incomplete data
✓ Inconsistent dates and data types
✓ Data type mismatches
✓ Flex-field data which can vary per record
✓ Different granularities
✓ Incremental data loads, etc.
  • 28. Challenges of a Data Lake: Technology
Performance:
✓ Trade-offs between latency, scalability, & query performance
✓ Monitoring & auditing
Data retrieval:
✓ Easy access for data consumers
✓ Organization of data to facilitate retrieval
✓ Business metadata is *critical* for making sense of the data
Change management:
✓ File structure changes (inconsistent 'schema-on-read')
✓ Integration with master data
✓ Inherent risks associated with a highly flexible repository
✓ Maintaining & updating over time (meeting future needs)
  • 29. Challenges of a Data Lake: Process
Finding the right balance of agility (deferred work vs. up-front work):
✓ Finding the optimal level of chaos which invites experimentation
✓ Risk & complexity are still there, just shifted
✓ When schema-on-read is appropriate (temporarily? permanently?)
✓ How the data lake will coexist with the data warehouse
✓ How to operationalize self-service/sandbox work from analysts
Data quality:
✓ How to reconcile or confirm accuracy of results
✓ Data validation between systems
Governance & security:
✓ Challenging to implement data governance
✓ Difficult to enforce standards
✓ Securing and obfuscating confidential data
✓ Meeting compliance and regulatory requirements
  • 30. Data Lake vs. Data Warehouse, by parameter
Storage: Data Lake - all data is kept irrespective of its source and structure, in raw form; it is transformed only when it is ready to be used. Data Warehouse - contains data extracted from transactional systems, or quantitative metrics with their attributes; the data is cleaned and transformed.
History: Data Lake - the big data technologies used in data lakes are relatively new. Data Warehouse - the concept has been in use for decades.
Data capturing: Data Lake - captures all kinds of data (structured, semi-structured, and unstructured) in its original form from source systems. Data Warehouse - captures structured information and organizes it into schemas defined for data warehouse purposes.
Data timeline: Data Lake - can retain all data, including data not currently in use but potentially useful in the future, and keeps it indefinitely so analysis can go back in time. Data Warehouse - significant development time is spent analyzing the various data sources.
Users: Data Lake - ideal for users who do deep analysis, such as data scientists who need advanced analytical tools with capabilities like predictive modeling and statistical analysis. Data Warehouse - ideal for operational users, because it is well structured and easy to use and understand.
Storage costs: Data Lake - storing data in big data technologies is relatively inexpensive. Data Warehouse - storing data is costlier and more time-consuming.
Task: Data Lake - can contain all data and data types, empowering users to access data before it is transformed, cleansed, and structured. Data Warehouse - provides insights into pre-defined questions for pre-defined data types.
Processing time: Data Lake - users can access data before it has been transformed, cleansed, and structured, so they reach results more quickly than with a traditional data warehouse. Data Warehouse - any change to the warehouse requires more time.
Position of schema: Data Lake - the schema is typically defined after data is stored, which offers high agility and ease of data capture but requires work at the end of the process. Data Warehouse - the schema is typically defined before data is stored; this requires work at the start of the process but offers performance, security, and integration.
Data processing: Data Lake - uses the ELT (Extract, Load, Transform) process. Data Warehouse - uses a traditional ETL (Extract, Transform, Load) process.
Common complaint: Data Lake - data is kept in its raw form and only transformed when it is ready to be used. Data Warehouse - the chief complaint is the inability, or the difficulty, of making changes to it.
Key benefits: Data Lake - integrates different types of data to ask entirely new questions; such users are unlikely to use data warehouses, because they may need to go beyond their capabilities. Data Warehouse - most users in an organization are operational and care mainly about reports and key performance metrics.
  • 31. What Makes A Data Warehouse “Modern”
  • 32. What Makes a Data Warehouse “Modern”
✓ Variety of subject areas & data sources for analysis, with the capability to handle large volumes of data
✓ Expansion beyond a single relational DW/data mart structure to include Hadoop, Data Lake, or NoSQL
✓ Logical design across a multi-platform architecture, balancing scalability & performance
✓ Data virtualization in addition to data integration
  • 33. What Makes a Data Warehouse “Modern”
✓ Support for all types & levels of users
✓ Flexible deployment (including mobile) which is decoupled from the tool used for development
✓ Governance model to support trust and security, and master data management
✓ Support for promoting self-service solutions to the corporate environment
  • 34. What Makes a Data Warehouse “Modern”
✓ Ability to facilitate near real-time analysis of high-velocity data (Lambda architecture)
✓ Support for advanced analytics
✓ Agile delivery approach with a fast delivery cycle
✓ Hybrid integration with cloud services
✓ APIs for downstream access to data
  • 36. Challenges with Modern Data Warehousing: Agility
✓ Reducing time to value
✓ Minimizing chaos
✓ Evolving & maturing technology
✓ Balancing 'schema on write' with 'schema on read'
✓ How strict to be with dimensional design?
  • 37. Challenges with Modern Data Warehousing: Complexity
✓ Hybrid scenarios
✓ Multi-platform infrastructure
✓ Ever-increasing data volumes
✓ File type & format diversity
✓ Real-time reporting needs
✓ Effort & cost of data integration
✓ Broad skillsets needed
  • 38. Challenges with Modern Data Warehousing: Balance with Self-Service Initiatives
✓ Self-service solutions which challenge the centralized DW
✓ Managing 'production' delivery from both IT and user-created solutions
✓ Handling ownership changes (promotion) of valuable solutions
  • 39. How Delta Airlines uses Big Data for Customer Sentiment Analysis • Airline companies are using sentiment analysis to track and analyse the flyer experience. Large airlines like Delta monitor tweets to find out how their customers feel about delays, upgrades, in-flight entertainment, and more. For example, a customer tweets negatively about lost baggage prior to boarding a connecting flight; the airline identifies the negative tweet and forwards it to the support team. The support team sends a representative to the passenger's destination with a free first-class upgrade for the return trip, along with information about the tracked baggage and a promise to deliver it as soon as the passenger steps off the plane. The customer then tweets happily for the rest of the trip, helping the airline build positive brand recognition.
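The triage step in that story (classify a tweet, route negatives to support) can be sketched very simply. This is a minimal, rule-based illustration; a real deployment would use a trained sentiment model and the social platform's APIs, and the keyword lists and sample tweets below are assumptions.

```python
# Rule-based tweet triage: score sentiment from keyword hits and route negatives
# to the support queue.
NEGATIVE_TERMS = {"lost", "delayed", "worst", "angry", "missing"}
POSITIVE_TERMS = {"thanks", "great", "love", "upgrade", "amazing"}

def sentiment(tweet: str) -> str:
    words = set(tweet.lower().split())
    score = len(words & POSITIVE_TERMS) - len(words & NEGATIVE_TERMS)
    return "negative" if score < 0 else "positive" if score > 0 else "neutral"

def route(tweet: str) -> str:
    """Negative tweets are forwarded to the support queue; others are archived."""
    return "support-team-queue" if sentiment(tweet) == "negative" else "archive"

print(route("My bag is lost and my connection is delayed"))    # support-team-queue
print(route("Thanks for the surprise upgrade amazing crew"))   # archive
```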
  • 40. Streaming analytics helps retail stores track the indoor location of their customers, discover the most attractive products in the shop, send an 'offer of the day' when customers are within a certain radius, provide offers related to their specific products of interest, suggest other products related to what they are looking at right now, offer coupons based on customers' previous purchases, and obtain more advanced statistics for the store manager to refine and redefine business strategies.
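The "offer when within a certain radius" step amounts to a geofence check on each streaming location update. A small sketch under that assumption; the store coordinates, radius, offer text, and customer positions are made-up values.

```python
# Geofence sketch: push the offer of the day when a streamed location update
# falls within the configured radius of the store.
import math

STORE_LOCATION = (0.0, 0.0)      # in-store coordinates (e.g. beacon-derived), in metres
OFFER_RADIUS_M = 50.0
OFFER_OF_THE_DAY = "10% off running shoes"

def within_radius(customer_xy, store_xy=STORE_LOCATION, radius=OFFER_RADIUS_M) -> bool:
    return math.dist(customer_xy, store_xy) <= radius

def on_location_event(customer_id: str, customer_xy) -> None:
    """Called for each streaming location update; pushes the offer when in range."""
    if within_radius(customer_xy):
        print(f"push to {customer_id}: {OFFER_OF_THE_DAY}")

on_location_event("cust-42", (12.0, 30.0))   # inside the radius -> offer sent
on_location_event("cust-43", (400.0, 10.0))  # outside -> no offer
```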
  • 41. Doing Away with Dangerous Police Chases (IoT) • Law enforcement agencies in Austin (Texas), Los Angeles, Arizona, and Wisconsin have been trying out a system that eliminates the need for police to engage in dangerous high-speed chases of suspects. An air-compressor launcher on the front of the patrol car fires a sticky GPS locator with a transmitter onto the suspect vehicle. Police can then track the vehicle remotely rather than chasing it.
  • 42. Final Thoughts
✓ Traditional data warehousing is still important, but it needs to co-exist with other platforms.
✓ Build around your existing DW infrastructure.
✓ Plan your data lake with data retrieval in mind.
✓ Expect to balance ETL with some (perhaps limited) data virtualization techniques in a multi-platform environment.
✓ Be fully aware of data lake & data virtualization challenges in order to craft your own achievable & realistic best practices.
✓ Plan to work in an agile fashion, leveraging self-service & sandbox solutions which can be promoted to the corporate solution.
  • 43. Recommended Resources (the slide shows fundamentals and more advanced titles). Recommended whitepaper: "How to Build An Enterprise Data Lake: Important Considerations Before Jumping In" by Mark Madsen, Third Nature Inc.
  • 46. New sources and types of data • The traditional data warehouse was built on the strategy of a well-structured, sanitized, and trusted repository. Yet today more than 85 percent of data volume comes from a variety of new data types proliferating from mobile and social channels, scanners, sensors, RFID tags, devices, feeds, and other sources outside the business. These data types do not fit easily into the business schema model and may not be cost-effective to ETL into the relational data warehouse. Yet these new types of data have the potential to enhance business operations. For example, a shipping company might use fuel and weight sensors with GPS, traffic, and weather feeds to optimize shipping routes or fleet usage.
  • 47. Big data is driving us to a tipping point: large data volumes, multiple data types, real-time data creation, user expectations, mobility, and hardware and storage economics.
  • 48. Clickstream analysis Clickstream data is the trail of digital breadcrumbs left by users as they click their way through a website, and it is loaded with valuable customer information for businesses. Clickstream analysis is the process of collecting, analysing, and reporting aggregate data about which pages visitors view, and in what order. This reveals usage patterns, which in turn gives a heightened understanding of customer behaviour. The analysis can also build user profiles that help a company understand the types of people who visit its website.
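A tiny sketch of the aggregation step described above: count how often each page-to-page transition occurs across sessions, so the most common navigation paths surface. The session data is made up, and the example uses itertools.pairwise (Python 3.10+).

```python
# Aggregate clickstream sessions into (from_page, to_page) transition counts.
from collections import Counter
from itertools import pairwise  # Python 3.10+

SESSIONS = {
    "visitor-1": ["/home", "/products", "/products/shoes", "/cart"],
    "visitor-2": ["/home", "/blog", "/products", "/cart"],
    "visitor-3": ["/home", "/products", "/cart", "/checkout"],
}

transitions = Counter()
for pages in SESSIONS.values():
    transitions.update(pairwise(pages))   # consecutive page pairs in click order

# The most common transitions reveal usage patterns, e.g. /home -> /products.
for (src, dst), count in transitions.most_common(3):
    print(f"{src} -> {dst}: {count}")
```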
  • 49. Challenges of a Data Lake: Technology, People, Process