1. Designing a Modern
Data Warehouse + Data Lake
Strategies & architecture options for implementing a
modern data warehousing environment
2. Agenda
• What is a traditional data warehouse
• ETL vs. ELT
• Evolving to a modern data warehouse
• Current DW challenges
• Evolution of the modern data warehouse
• Data lake objectives, challenges, & implementation options
• Some real use cases
4. DW+BI Systems Used to Be Fairly Straightforward
[Diagram: organizational data (sales, inventory, etc.), third-party data, and master data flow through batch ETL into an operational data store and the enterprise data warehouse; data marts and an OLAP semantic layer sit on top, serving operational reporting and historical analytical reporting through the reporting tool of choice.]
5. The traditional data warehouse
• The traditional data warehouse was designed specifically to be a central repository for all data in a company. Disparate data from transactional systems, ERP, CRM, and LOB applications is cleansed—that is, extracted, transformed, and loaded (ETL)—into the warehouse within an overall relational schema. The predictable data structure and quality optimized processing and operational reporting.
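As a rough illustration of the batch-ETL flow just described, the sketch below extracts raw records, cleanses and conforms them to a relational shape, and loads them into a toy in-memory warehouse; the table and field names are hypothetical:

```python
# Minimal batch-ETL sketch (illustrative; table and field names are hypothetical).

def extract(source_rows):
    """Extract: pull raw records from a source system."""
    return list(source_rows)

def transform(rows):
    """Transform: cleanse and conform records to the warehouse schema."""
    cleaned = []
    for row in rows:
        if row.get("amount") is None:      # drop incomplete records
            continue
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "amount": round(float(row["amount"]), 2),
            "source": row.get("source", "unknown").upper(),
        })
    return cleaned

def load(rows, warehouse):
    """Load: append conformed rows to the warehouse table."""
    warehouse.setdefault("fact_sales", []).extend(rows)
    return warehouse

warehouse = {}
raw = [
    {"customer_id": "1", "amount": "19.99", "source": "crm"},
    {"customer_id": "2", "amount": None},   # rejected by cleansing
]
load(transform(extract(raw)), warehouse)
print(len(warehouse["fact_sales"]))  # 1 row survives cleansing
```

The point of the predictable schema is visible here: anything that does not conform is handled before the warehouse ever sees it.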
6. Current DW Challenges
• The following challenges with today’s data warehouses can impede usage and prevent users from maximizing their analytic
capabilities:
• Timeliness. Introducing new content to the enterprise data warehouse can be a time-consuming and cumbersome process, so users may resort to getting the data quickly themselves. Users may waste valuable time and resources pulling the data from operational systems, storing and managing it themselves, and then analysing it.
• Flexibility. Users not only lack on-demand access to any data they may need at any time, but also the ability to use the tools of
their choice to analyse the data and derive critical insights.
• Quality. Users may view the current data warehouse with suspicion. If where the data originated and how it has been acted on
are unclear, users may not trust the data.
• Findability. With many current data warehousing solutions, users do not have a function to rapidly and easily search for and find
the data they need when they need it. Inability to find data also limits the users’ ability to leverage and build on existing data
analyses.
This new solution needs to support multiple reporting tools in a self-serve capacity, to allow rapid ingestion of new datasets without
extensive modelling, and to scale large datasets while delivering performance. It should support advanced analytics, like machine
learning and text analytics, and allow users to cleanse and process the data iteratively and to track lineage of data for compliance.
Users should be able to easily search and explore structured, unstructured, internal, and external data from multiple sources in one
secure place.
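The lineage tracking mentioned above can be sketched minimally: each derived dataset records which inputs and which processing step produced it, so an audit can walk back to the original sources. The dataset and step names here are hypothetical:

```python
# Toy lineage registry (dataset and step names are hypothetical).
lineage = {}

def derive(name, inputs, step):
    """Register a dataset along with its provenance."""
    lineage[name] = {"inputs": list(inputs), "step": step}
    return name

derive("raw_sales", [], "ingest from POS export")
derive("clean_sales", ["raw_sales"], "dedupe + currency normalization")
derive("sales_report", ["clean_sales", "raw_fx_rates"], "join + aggregate")

def trace(name):
    """Walk lineage back to the original sources for an audit."""
    node = lineage.get(name)
    if node is None or not node["inputs"]:
        return {name}
    sources = set()
    for parent in node["inputs"]:
        sources |= trace(parent)
    return sources

print(sorted(trace("sales_report")))  # ['raw_fx_rates', 'raw_sales']
```

Production systems record this metadata in a catalog rather than a dictionary, but the compliance question answered is the same: "where did this number come from?"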
7. Modern Data Warehousing
[Diagram: organizational data, third-party data, master data, devices & sensors, social media, and demographics data feed the platform. Batch ETL loads an operational data store and the enterprise data warehouse, while streaming data flows through near-real-time monitoring into a data lake and Hadoop for machine learning, data science, and advanced analytics. Data marts, an OLAP semantic layer, federated queries, and in-memory models serve operational reporting, historical analytical reporting, alerts, mobile, and self-service reports & models through the reporting tool of choice.]
8. Microsoft Azure:
1) Azure SQL Data Warehouse
2) Azure Data Lake
3) Azure Machine Learning
4) Power BI
AWS:
1) Amazon Redshift
2) Amazon EMR
3) Amazon Athena
4) Amazon QuickSight
GCP:
1) Google BigQuery
2) Google Cloud Datalab
3) Google Cloud Dataprep
4) Google Cloud Dataflow
5) Google Data Studio
9. Modern DW
• New trends are causing the traditional data warehouse to break in four different ways: data growth, users' expectations of fast queries, non-relational/unstructured data, and cloud-born data.
• Enter the modern data warehouse, which handles these new trends well. It handles all types of data (including Hadoop-resident data), provides a way to easily interface with all of them, and can handle "big data" while still providing fast queries.
10. Evolve to a modern data warehouse
• The modern data warehouse lives up to the promise of delivering insights, via techniques such as business intelligence and advanced analytics, from all types of explosively growing data, processed in real time, with a more robust ability to deliver the right data at the right time.
• A modern data warehouse delivers a comprehensive logical data and analytics platform with a complete suite of fully supported solutions and technologies that can meet the needs of even the most sophisticated and demanding modern enterprise—on-premises, in the cloud, or within any hybrid scenario.
11. What Makes a Data Warehouse “Modern”
• Variety of data sources; multistructured
• Coexists with data lake
• Coexists with Hadoop
• Larger data volumes; MPP
• Multi-platform architecture
• Data virtualization + integration
• Support all user types & levels
• Flexible deployment
• Deployment decoupled from dev
• Governance model & MDM
• Promotion of self-service solutions
• Near real-time data; Lambda arch
• Advanced analytics
• Agile delivery
• Cloud integration; hybrid env
• Automation & APIs
• Data catalog; searchability
• Scalable architecture
• Analytics sandbox w/ promotability
• Bimodal environment
12. Growing an Existing DW Environment
Internal to the data warehouse:
✓ Data modeling strategies
✓ Partitioning
✓ Clustered columnstore indexes
✓ In-memory structures
✓ MPP (massively parallel processing)
Augment the data warehouse:
✓ Complementary data storage & analytical solutions
✓ Cloud & hybrid solutions
✓ Data virtualization / virtual DW
-- Grow around your existing data warehouse --
[Diagram: the existing data warehouse surrounded by Hadoop, a data lake, NoSQL stores, and in-memory models.]
14. Lambda Architecture
[Diagram: devices & sensors stream through an event hub and stream analytics (speed layer) into a streaming dashboard and the data lake, where raw data is persisted; organizational and master data flow through scheduled ETL (batch layer) into curated data, the data warehouse, and data mart(s) (serving layer), feeding reports, dashboards, advanced analytics, and machine learning through the reporting tool of choice.]
Speed layer: low-latency data
Batch layer: data processing to support complex analysis
Serving layer: responds to queries
Objectives:
• Support large volumes of high-velocity data
• Near real-time analysis + persisted history
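A toy sketch of the three layers, assuming a simple per-device event-count query; real implementations use an event hub, a stream processor, and a batch engine rather than in-process structures:

```python
# Toy Lambda-architecture sketch (hypothetical event stream and view names).
from collections import defaultdict

batch_view = {}                      # batch layer: recomputed from full history
speed_view = defaultdict(int)        # speed layer: low-latency increments
history = []                         # the persisted raw data (the "lake")

def ingest(event):
    """Events land in both layers: persisted for batch, counted for speed."""
    history.append(event)
    speed_view[event["device"]] += 1

def batch_recompute():
    """Batch layer: periodically reprocess all persisted events."""
    counts = defaultdict(int)
    for event in history:
        counts[event["device"]] += 1
    batch_view.clear()
    batch_view.update(counts)
    speed_view.clear()               # speed layer resets once batch catches up

def serve(device):
    """Serving layer: merge the complete batch view with recent speed data."""
    return batch_view.get(device, 0) + speed_view.get(device, 0)

ingest({"device": "sensor-1"})
ingest({"device": "sensor-1"})
batch_recompute()
ingest({"device": "sensor-1"})       # arrives after the last batch run
print(serve("sensor-1"))             # 3: 2 from batch + 1 from speed
```

This is exactly the trade the slide describes: the speed layer gives near real-time answers, while the batch layer guarantees a complete, reprocessable history.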
15. Larger Scale Data Warehouse: MPP
Massively Parallel Processing (MPP) operates on high volumes of data across distributed nodes.
Shared-nothing architecture: each node has its own disk, memory, and CPU.
Decoupled storage and compute.
Scale out by adding compute nodes to increase parallelism.
Integrates with relational & non-relational data.
[Diagram: a control node distributing work to multiple compute nodes over shared data storage.]
16. ETL Process
ETL is normally a continuous, ongoing process with a well-defined workflow. ETL first extracts data from homogeneous or heterogeneous data sources. Then, data is cleansed, enriched, transformed, and stored either back in the lake or in a data warehouse.
ELT (Extract, Load, Transform) is a variant of ETL wherein the extracted data is first loaded into the target system. Transformations are performed after the data is loaded into the data warehouse. ELT typically works well when the target system is powerful enough to handle transformations. Analytical databases like Amazon Redshift and Google BigQuery are often used in ELT pipelines because they are highly efficient in performing large-scale transformations.
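A minimal ELT sketch, using sqlite3 as a stand-in for an analytical warehouse such as Redshift or BigQuery (the schema is hypothetical): the raw extract is loaded first, and cleansing happens in-database with SQL afterwards:

```python
# ELT sketch: load raw extracted rows into the target first, then transform
# inside the engine with SQL. sqlite3 stands in for an analytical warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")

# Load: raw data lands untransformed in the target system.
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [("alice", "10.50"), ("bob", None), ("alice", "4.50")])

# Transform: cleansing and shaping happen in-database, after the load.
conn.execute("""
    CREATE TABLE curated_orders AS
    SELECT customer, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
""")

total = conn.execute(
    "SELECT SUM(amount) FROM curated_orders WHERE customer = 'alice'"
).fetchone()[0]
print(total)  # 15.0
```

Note how the raw table is preserved: if the transformation rules change, `curated_orders` can simply be rebuilt from `raw_orders`, which is the low-maintenance property claimed for ELT in the next slide.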
17. Summarizing 10 Pros & Cons of ETL and ELT
• 1. Time - Load
ETL: Uses a staging area and system; extra time to load data
ELT: All in one system; load only once
• 2. Time - Transformation
ETL: Need to wait, especially for big data sizes - as data grows, transformation time increases
ELT: All in one system; speed is not dependent on data size
• 3. Time - Maintenance
ETL: High maintenance - must choose which data to load and transform, and must do it again if data is deleted or you want to enhance the main data repository
ELT: Low maintenance - all data is always available
• 4. Implementation complexity
ETL: At an early stage, requires less space and the result is clean
ELT: Requires in-depth knowledge of tools and expert design of the main large repository
• 5. Analysis & processing style
ETL: Based on multiple scripts to create the views - deleting a view means deleting data
ELT: Creating ad hoc views - low cost for building and maintaining
18. ETL vs. ELT (Continued)
• 6. Data limitation or restriction in supply
ETL: By presuming and choosing data a priori
ELT: Only by hardware (effectively none) and data retention policy
• 7. Data warehouse support
ETL: Prevalent legacy model used for on-premises, relational, structured data
ELT: Tailored for use in scalable cloud infrastructure to support structured and unstructured big data sources
• 8. Data lake support
ETL: Not part of the approach
ELT: Enables use of a lake, with unstructured data supported
• 9. Usability
ETL: Fixed tables, fixed timeline; used mainly by IT
ELT: Ad hoc, agile, flexible; usable by everyone from developer to citizen integrator
• 10. Cost-effectiveness
ETL: Not cost-effective for small and medium businesses
ELT: Scalable and available to all business sizes using online SaaS solutions
20. Data Lake
A repository for analyzing large quantities of disparate sources of data in its native format.
One architectural platform to house all types of data:
o Machine-generated data (ex: IoT, logs)
o Human-generated data (ex: tweets, e-mail)
o Traditional operational data (ex: sales, inventory)
[Diagram: a data lake repository ingesting web, log, social, sensor, and image data.]
21. Objectives of a Data Lake
✓ Reduce up-front effort by ingesting data in any format without requiring a schema initially
✓ Make acquiring new data easy, so it can be available for data science & analysis quickly
✓ Store large volumes of multi-structured data in its native format
22. Objectives of a Data Lake
✓ Defer work to 'schematize' until value & requirements are known
✓ Achieve agility faster than a traditional data warehouse can
✓ Speed up decision-making ability
✓ Store additional types of data which were historically difficult to obtain
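Deferring the schema this way is usually called schema-on-read, and it can be sketched as follows: records are stored in their native JSON form and a schema is imposed only at read time. The field names are hypothetical:

```python
# Schema-on-read sketch: raw records are ingested as-is (no up-front schema);
# a schema is applied only when the data is read for analysis.
import json

# Ingest: store heterogeneous records in their native (JSON) form.
lake = [
    json.dumps({"sensor": "t-1", "temp_c": 21.5, "ts": "2021-01-01"}),
    json.dumps({"sensor": "t-2", "temperature": 19.0}),   # different field name
    json.dumps({"tweet": "great flight!", "user": "a"}),  # different shape entirely
]

def read_temperatures(raw_records):
    """Apply a schema at read time, tolerating shape differences."""
    for line in raw_records:
        rec = json.loads(line)
        temp = rec.get("temp_c", rec.get("temperature"))  # reconcile field names
        if temp is not None:
            yield {"sensor": rec.get("sensor"), "temp_c": float(temp)}

readings = list(read_temperatures(lake))
print(len(readings))  # 2 of the 3 records satisfy this read-time schema
```

Ingestion required no modeling at all; the cost shows up later, in read functions that must reconcile whatever shapes actually landed in the lake.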
23. Data Lake as a Staging Area for DW
[Diagram: corporate data, CRM, social media data, and devices & sensors land as raw data in a data lake store, which serves as the staging area feeding the data warehouse and the reporting tool of choice.]
Strategy:
• Utilize the data lake as the landing/staging area for the DW, instead of the relational database
• Reduce storage needs in the data warehouse
• Practical use for data stored in the data lake
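One common way to organize such a raw/staging zone is a date-partitioned folder convention, so the warehouse load can pick up one day's partition at a time. The path layout and source names below are just one possible convention, not a standard:

```python
# Sketch of a raw-zone landing layout for the staging pattern above
# (the /raw/<source>/<yyyy>/<mm>/<dd>/ convention is illustrative).
from datetime import date

def landing_path(source, ingest_date, filename):
    """Raw-zone convention: /raw/<source>/<yyyy>/<mm>/<dd>/<file>."""
    return "/raw/{}/{:%Y/%m/%d}/{}".format(source, ingest_date, filename)

path = landing_path("crm", date(2021, 3, 15), "contacts.json")
print(path)  # /raw/crm/2021/03/15/contacts.json
```

Partitioning by ingestion date makes the daily DW load idempotent: a failed load can simply reprocess that day's folder.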
24. Data Lake Implementation
A data lake is a conceptual idea. It can be implemented with one or more technologies.
HDFS (Hadoop Distributed File System) is a very common option for data lake storage. However, Hadoop is not a requirement for a data lake, and a data lake may also span more than one Hadoop cluster.
NoSQL databases are also very common.
25. Multi-Technology Data Lake
[Diagram: corporate data, CRM, social media data, and devices & sensors feed a "conceptual" data lake composed of a raw-data data lake store, a DocumentDB collection, and a standardized (curated) data lake store, which in turn feed the data warehouse and the reporting tool of choice.]
Strategy: Use a 'conceptual' data lake strategy with multiple technologies, each best suited to the data (polyglot persistence):
1) Ingest hierarchical, nested JSON data into DocumentDB
2) Standardize the data into tabular text files
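Step 2 above, standardizing nested JSON into a tabular text file, can be sketched like this, with a hypothetical order document flattened to one row per line item:

```python
# Flatten a nested, hierarchical JSON document (as stored in a document
# database) into tabular rows for a curated, delimited file. Document shape
# is hypothetical.
import csv, io, json

doc = json.loads("""{
    "order_id": 7,
    "customer": {"id": 42, "name": "acme"},
    "lines": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]
}""")

# One tabular row per nested line item, with parent fields repeated.
rows = [
    {"order_id": doc["order_id"],
     "customer_id": doc["customer"]["id"],
     "sku": line["sku"],
     "qty": line["qty"]}
    for line in doc["lines"]
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["order_id", "customer_id", "sku", "qty"])
writer.writeheader()
writer.writerows(rows)
print(len(rows))  # 2 tabular rows from one nested document
```

Repeating the parent keys on every row is the usual price of flattening; it is what lets the curated file be joined and aggregated like any other table.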
26. Coexistence of Data Lake & Data Warehouse
Data lake values:
✓ Agility
✓ Flexibility
✓ Rapid delivery
✓ Exploration
DW values:
✓ Governance
✓ Reliability
✓ Standardization
✓ Security
Data ingestion: less effort into the data lake, more effort into the enterprise data warehouse.
Data retrieval: more effort from the data lake, less effort from the enterprise data warehouse.
27. Challenges of a Data Lake: Technology
Complex, multi-layered architecture:
✓ Polyglot persistence strategy
✓ Architectures & toolsets are emerging and maturing
Unknown storage & scalability:
✓ Can we realistically store "everything"?
✓ Cloud deployments are attractive when scale is undetermined
Working with un-curated data:
✓ Missing or incomplete data
✓ Inconsistent dates and data types
✓ Data type mismatches
✓ Flex-field data which can vary per record
✓ Different granularities
✓ Incremental data loads
✓ etc.
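Working with un-curated data usually starts with profiling rather than failing: the sketch below counts records with missing fields or inconsistent date formats instead of erroring out. The formats and field names are illustrative:

```python
# Profile raw lake records for the problems listed above: missing fields,
# inconsistent date formats, type mismatches. (Hypothetical feed.)
from datetime import datetime

DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]   # formats seen in this example feed

def parse_date(value):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            continue
    return None

def profile(records):
    """Count records with missing or unparseable fields instead of failing."""
    issues = {"missing_amount": 0, "bad_date": 0, "clean": 0}
    for rec in records:
        if "amount" not in rec:
            issues["missing_amount"] += 1
        elif parse_date(rec.get("date")) is None:
            issues["bad_date"] += 1
        else:
            issues["clean"] += 1
    return issues

raw = [
    {"amount": 10, "date": "2021-03-01"},
    {"amount": 12, "date": "01/03/2021"},   # same date, different format
    {"date": "2021-03-02"},                 # missing amount
    {"amount": 9, "date": "yesterday"},     # unparseable date
]
print(profile(raw))  # {'missing_amount': 1, 'bad_date': 1, 'clean': 2}
```

Such a profile tells the team how much curation work a dataset needs before it can be trusted for analysis.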
28. Challenges of a Data Lake: Technology
Performance:
✓ Trade-offs between latency, scalability, & query performance
✓ Monitoring & auditing
Data retrieval:
✓ Easy access for data consumers
✓ Organization of data to facilitate data retrieval
✓ Business metadata is *critical* for making sense of the data
Change management:
✓ File structure changes (inconsistent 'schema-on-read')
✓ Integration with master data
✓ Inherent risks associated with a highly flexible repository
✓ Maintaining & updating over time (meeting future needs)
29. Challenges of a Data Lake: Process
Finding the right balance of agility (deferred work vs. up-front work):
✓ Finding the optimal level of chaos which invites experimentation
✓ Risk & complexity are still there - just shifted
✓ When schema-on-read is appropriate (temporarily? permanently?)
✓ How the data lake will coexist with the data warehouse
✓ How to operationalize self-service/sandbox work from analysts
Data quality:
✓ How to reconcile or confirm accuracy of results
✓ Data validation between systems
Governance & security:
✓ Challenging to implement data governance
✓ Difficult to enforce standards
✓ Securing and obfuscating confidential data
✓ Meeting compliance and regulatory requirements
30. Data Lake vs. Data Warehouse
Storage
Data lake: All data is kept, irrespective of its source and structure. Data is kept in its raw form and is only transformed when it is ready to be used.
Data warehouse: Consists of data extracted from transactional systems, or data comprising quantitative metrics with their attributes. The data is cleaned and transformed.
History
Data lake: The big data technologies used in data lakes are relatively new.
Data warehouse: The data warehouse concept, unlike big data, has been used for decades.
Data capturing
Data lake: Captures all kinds of data - structured, semi-structured, and unstructured - in its original form from source systems.
Data warehouse: Captures structured information and organizes it in schemas defined for data warehouse purposes.
Data timeline
Data lake: Can retain all data, including not only the data in use but also data that might be used in the future. Data is kept for all time, so analyses can go back in time.
Data warehouse: In the development process, significant time is spent analyzing the various data sources.
Users
Data lake: Ideal for users who engage in deep analysis, such as data scientists who need advanced analytical tools with capabilities like predictive modeling and statistical analysis.
Data warehouse: Ideal for operational users, because it is well structured and easy to use and understand.
Storage costs
Data lake: Storing data in big data technologies is relatively inexpensive.
Data warehouse: Storing data in a data warehouse is costlier and more time-consuming.
Task
Data lake: Can contain all data and data types; empowers users to access data before it has been transformed, cleansed, and structured.
Data warehouse: Provides insights into pre-defined questions for pre-defined data types.
Processing time
Data lake: Users can access data before it has been transformed, cleansed, and structured, so they get to their results more quickly than with a traditional data warehouse.
Data warehouse: Offers insights into pre-defined questions for pre-defined data types, so any change to the data warehouse needs more time.
Position of schema
Data lake: Typically, the schema is defined after data is stored. This offers high agility and ease of data capture, but requires work at the end of the process.
Data warehouse: Typically, the schema is defined before data is stored. This requires work at the start of the process, but offers performance, security, and integration.
Data processing
Data lake: Uses the ELT (Extract, Load, Transform) process.
Data warehouse: Uses the traditional ETL (Extract, Transform, Load) process.
Chief complaint
Data lake: Data is kept in its raw form and is only transformed when it is ready to be used.
Data warehouse: The chief complaint against data warehouses is the difficulty of making changes to them.
Key benefits
Data lake: Integrates different types of data to pose entirely new questions; such users are unlikely to use a data warehouse because they may need to go beyond its capabilities.
Data warehouse: Most users in an organization are operational; these users only care about reports and key performance metrics.
32. What Makes a Data Warehouse “Modern”
Variety of subject areas & data sources for analysis
with capability to handle large volumes of data
Expansion beyond a single relational DW/data mart
structure to include Hadoop, Data Lake, or NoSQL
Logical design across multi-platform architecture
balancing scalability & performance
Data virtualization in addition to data integration
33. What Makes a Data Warehouse “Modern”
Support for all types & levels of users
Flexible deployment (including mobile) which is
decoupled from tool used for development
Governance model to support trust and security,
and master data management
Support for promoting self-service solutions to
the corporate environment
34. What Makes a Data Warehouse “Modern”
Ability to facilitate near real-time analysis on
high velocity data (Lambda architecture)
Support for advanced analytics
Agile delivery approach with fast delivery cycle
Hybrid integration with cloud services
APIs for downstream access to data
36. Challenges with Modern Data Warehousing: Agility
• Reducing time to value
• Minimizing chaos
• Evolving & maturing technology
• Balancing 'schema on write' with 'schema on read'
• How strict to be with dimensional design?
37. Challenges with Modern Data Warehousing: Complexity
• Hybrid scenarios
• Multi-platform infrastructure
• Ever-increasing data volumes
• File type & format diversity
• Real-time reporting needs
• Effort & cost of data integration
• Broad skillsets needed
38. Challenges with Modern Data Warehousing: Balance with Self-Service Initiatives
• Self-service solutions which challenge the centralized DW
• Managing 'production' delivery from IT alongside user-created solutions
• Handling ownership changes (promotion) of valuable solutions
39. How Delta Airlines Uses Big Data for Customer Sentiment Analysis
• Airline companies use sentiment analysis to analyse the flyer experience. Large airlines like Delta monitor tweets to find out how their customers feel about delays, upgrades, in-flight entertainment, and more. For example, a customer may tweet negatively about lost baggage before boarding a connecting flight. The airline identifies such negative tweets and forwards them to its support team. The support team sends a representative to the passenger's destination to present a free first-class upgrade for the return flight, along with information about the tracked baggage and a promise to deliver it as soon as the passenger steps off the plane. The customer then tweets like a happy camper for the rest of the trip, helping the airline build positive brand recognition.
40. Streaming analytics helps retail stores track the indoor location of their customers, discover the most attractive product in the shop, send an 'offer of the day' when customers are within a certain radius, provide offers related to their specific products of interest, suggest other products related to what they are looking at right now, offer coupons based on customers' previous purchases, and obtain more advanced statistics for the store manager to refine and redefine business strategies (Figure 1).
41. Doing Away with Dangerous Police Chases (IoT)
• Law enforcement agencies in Austin, Texas; Los Angeles; Arizona; and Wisconsin have been trying out a system that eliminates the need for police to engage in dangerous high-speed chases of suspects. An air-compressor launcher on the front of the patrol car fires a sticky GPS locator with a transmitter. Police can then remotely track the vehicle instead of chasing it.
42. Final Thoughts
Traditional data warehousing still is important, but needs to co-exist with
other platforms. Build around your existing DW infrastructure.
Plan your data lake with data retrieval in mind.
Expect to balance ETL with some (perhaps limited) data virtualization
techniques in a multi-platform environment.
Be fully aware of data lake & data virtualization challenges in order to
craft your own achievable & realistic best practices.
Plan to work in an agile fashion, leveraging self-service & sandbox
solutions which can be promoted to the corporate solution.
46. New sources and types of data
• The traditional data warehouse was built on a strategy of being a well-structured, sanitized, and trusted repository. Yet today, more than 85 percent of data volume comes from a variety of new data types proliferating from mobile and social channels, scanners, sensors, RFID tags, devices, feeds, and other sources outside the business. These data types do not easily fit the business schema model and may not be cost-effective to ETL into the relational data warehouse. Yet these new types of data have the potential to enhance business operations. For example, a shipping company might use fuel and weight sensors with GPS, traffic, and weather feeds to optimize shipping routes or fleet usage.
47. Big data is driving us to a tipping point
• Large data volumes
• Multiple data types
• Real-time data creation
• User expectations
• Mobility
• Hardware and storage economics
48. Clickstream analysis
Clickstream data is the trail of digital breadcrumbs left by users as they click their way through
a website, and it’s loaded with valuable customer information for businesses.
Clickstream analysis is the process of collecting, analysing, and reporting aggregate data about
which pages visitors visit, and in what order. This reveals usage patterns, which in turn gives a heightened understanding of customer behaviour. The analysis can also be used to create a user profile that aids in understanding the types of people who visit a company's website.
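The page-order aggregation described above can be sketched as a count of page-to-page transitions per visitor session; the event data here is illustrative:

```python
# Clickstream aggregation sketch: count page-to-page transitions per visitor
# session to surface common navigation paths. (Hypothetical event data.)
from collections import Counter, defaultdict

events = [  # (visitor_id, page) in click order
    ("v1", "/home"), ("v1", "/pricing"), ("v1", "/signup"),
    ("v2", "/home"), ("v2", "/pricing"),
    ("v3", "/home"), ("v3", "/docs"),
]

# Group clicks into per-visitor sequences, preserving click order.
sessions = defaultdict(list)
for visitor, page in events:
    sessions[visitor].append(page)

# Count each consecutive page transition across all sessions.
transitions = Counter()
for pages in sessions.values():
    for src, dst in zip(pages, pages[1:]):
        transitions[(src, dst)] += 1

print(transitions.most_common(1))  # [(('/home', '/pricing'), 2)]
```

Scaled up, the same transition counts feed funnel reports and the usage-pattern profiles the slide mentions.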