Logical Data Warehouses and Data Lakes can play a role in many different types of projects, and in this presentation we will look at some of the most common patterns and use cases. Learn about analytical and big data patterns as well as performance considerations. Example implementations will be discussed for each pattern.
- Architectural patterns for logical data warehouse and data lakes.
- Performance considerations.
- Customer use cases and demo.
This presentation is part of the Denodo Educational Seminar series, and you can watch the video here: goo.gl/vycYmZ.
3. THE LEADER IN DATA VIRTUALIZATION
HEADQUARTERS
Palo Alto, CA
DENODO OFFICES, CUSTOMERS, PARTNERS
Global presence throughout North America, EMEA, APAC, and Latin America.
CUSTOMERS
250+ customers, including many F500 and G2000 companies across every major industry, have gained significant business agility and ROI.
LEADERSHIP
Longest continuous focus on data virtualization and data services. Product leadership. Solutions expertise.
Denodo provides agile, high-performance data integration and data abstraction across the broadest range of enterprise, cloud, big data and unstructured data sources, and real-time data services at half the cost of traditional approaches.
4. Speakers
Paul Moxon, Senior Director of Product Management, Denodo
Pablo Álvarez, Principal Technical Account Manager, Denodo
Rubén Fernández, Technical Account Manager, Denodo
6. Agenda
1. The Logical Data Warehouse
2. Different Types, Different Needs
3. Performance in a LDW
4. Customer Success Stories
5. Q&A
7. What is a Logical Data Warehouse?
A logical data warehouse is a data system that follows the ideas of a traditional EDW (star or snowflake schemas) and includes, in addition to one (or more) core DWs, data from external sources.
The main motivations are improved decision making and/or cost reduction.
8. Logical Data Warehouse
Gartner Definition:
“The Logical Data Warehouse (LDW) is a new data management architecture for analytics combining the strengths of traditional repository warehouses with alternative data management and access strategies. The LDW will form a new best practice by the end of 2015.”
“The LDW is an evolution and augmentation of DW practices, not a replacement.”
“A repository-only style DW contains a single ontology/taxonomy, whereas in the LDW a semantic layer can contain many combinations of use cases, many business definitions of the same information.”
“The LDW permits an IT organization to make a large number of datasets available for analysis via query tools and applications.”
Gartner Hype Cycle for Enterprise Information Management, 2012
10. Logical Data Warehouse
Description:
A semantic layer on top of the data warehouse that keeps the business data definitions.
Allows the integration of multiple data sources, including enterprise systems, the data warehouse, additional processing nodes (analytical appliances, Big Data, …), Web, Cloud, and unstructured data.
Publishes data to multiple applications and reporting tools.
11. Three Integration/Semantic Layer Alternatives
Gartner’s View of Data Integration
1. Application/BI Tool as Data Integration/Semantic Layer
2. EDW as Data Integration/Semantic Layer
3. Data Virtualization as Data Integration/Semantic Layer
[Diagram: each alternative places the integration/semantic layer (BI tool, EDW, or data virtualization) above the underlying EDW and ODS sources.]
12. Application/BI Tool as the Data Integration Layer
• Integration is delegated to end-user tools and applications
• e.g. BI tools with ‘data blending’
• Results in duplication of effort – integration is defined many times in different tools
• What is the impact of a change in a data schema?
• End-user tools are not intended to be integration middleware
• Not their primary purpose or expertise
13. EDW as the Data Integration Layer
• Access to ‘other’ data (query federation) via the EDW
• Teradata QueryGrid, IBM Fluid Query, SAP Smart Data Access, etc.
• Often coupled with traditional ETL replication of data into the EDW
• The EDW becomes the ‘center of the data universe’
• Provides the data integration and semantic layer
• Appears attractive to organizations heavily invested in the EDW
• But what if there is more than one EDW? And what about EDW costs?
14. Data Virtualization as the Data Integration Layer
• Moves data integration and the semantic layer to an independent data virtualization platform
• Purpose-built for supporting data access across multiple heterogeneous data sources
• The separate layer provides semantic models for the underlying data
• Physical-to-logical mapping
• Enforces common and consistent security and governance policies
• Gartner’s recommended approach
17. The State and Future of Data Integration. Gartner, 25 May 2016
“Physical data movement architectures that aren’t designed to support the dynamic nature of business change, volatile requirements and massive data volume are increasingly being replaced by data virtualization.”
“Evolving approaches (such as the use of LDW architectures) include implementations beyond repository-centric techniques.”
18. What about the Logical Data Lake?
A Data Lake will not have a star or snowflake schema, but rather a more heterogeneous collection of views with raw data from heterogeneous sources.
The virtual layer acts as a common umbrella under which these different sources are presented to the end user as a single system.
However, from the virtualization perspective, a Virtual Data Lake shares many technical aspects with an LDW, and most of this content also applies to a Logical Data Lake.
20. Common Patterns for a Logical Data Warehouse
1. The Virtual Data Mart
2. DW + MDM
Data Warehouse extended with master data
3. DW + Cloud
Data Warehouse extended with cloud data
4. DW + DW
Integration of multiple Data Warehouses
5. DW historical offloading
DW horizontal partitioning with historical data in cheaper storage
6. Slim DW extension
DW vertical partitioning with rarely used data in cheaper storage
21. Virtual Data Marts
Business-friendly models defined on top of one or multiple systems, often “flavored” for a particular division.
Motivation
Hide the complexity of star schemas from business users
Simplify the model for a particular vertical
Reuse semantic models and security across multiple reporting engines
Typical queries (sketched after the diagram below)
Simple projections, filters and aggregations on top of curated “fat tables” that merge data from facts and many dimensions
Simplified semantic models for business users
22. Virtual Data Marts
[Diagram: a “Sales” virtual data mart joins the fact table (sales) in the EDW with the Time, Product and Retailer dimensions; extended product details come from other sources.]
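To make the pattern concrete, here is a hedged SQL sketch; all view and column names (sales, time_dim, product_dim, retailer_dim, etc.) are illustrative assumptions, not taken from the deck. The “fat table” is a wide virtual view that pre-joins the star schema, and business users issue simple filters and aggregations against it:

-- the curated "fat table": the fact pre-joined with its dimensions
CREATE VIEW sales_fat AS
SELECT s.amount,
       t.year, t.month,
       p.name AS product, p.brand,
       r.name AS retailer, r.region
FROM sales s
  JOIN time_dim t ON s.time_fk = t.id
  JOIN product_dim p ON s.product_fk = p.id
  JOIN retailer_dim r ON s.retailer_fk = r.id;

-- a typical business-user query: a simple filter and aggregation on top
SELECT retailer, SUM(amount) AS total_sales
FROM sales_fat
WHERE brand = 'ACME'
GROUP BY retailer;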
23. DW + MDM
Slim dimensions with extended information maintained in an external MDM system.
Motivation
Keep a single copy of the golden records in the MDM, which can be reused across systems and managed in a single place
Typical queries
Join a large fact table (DW) with several MDM dimensions, with aggregations on top
Example
Revenue by customer, projecting the address from the MDM
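A minimal sketch of the example query, assuming illustrative names (dw.sales for the fact table, mdm.customer for the golden-record dimension):

-- revenue by customer, projecting the address from the MDM golden record
SELECT m.customer_id, m.name, m.address,
       SUM(s.amount) AS revenue
FROM dw.sales s
  JOIN mdm.customer m ON s.customer_fk = m.customer_id
GROUP BY m.customer_id, m.name, m.address;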
25. DW + Cloud dimensional data
Fresh data from cloud systems (e.g. SFDC) is mixed with the EDW, usually in the dimensions. The DW itself is sometimes also in the cloud.
Motivation
Take advantage of “fresh” data coming straight from SaaS systems
Avoid local replication of cloud systems
Typical queries
Dimensions are joined with cloud data to filter on an external attribute that is not available (or not current) in the EDW
Example (sketched after the diagram below)
Report on current revenue for accounts where the potential for an expansion is higher than 80%
26. DW + Cloud dimensional data
[Diagram: the fact table (sales) and the Time and Product dimensions live in the EDW; the Customer dimension is joined with CRM account data fetched live from SFDC.]
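A hedged sketch of the example query; the table and attribute names (sfdc.account, crm_account_id, expansion_potential) are illustrative assumptions about the CRM model:

-- current revenue for accounts whose expansion potential (a CRM-only
-- attribute, fresh from SFDC) is higher than 80%
SELECT a.name AS account,
       SUM(s.amount) AS current_revenue
FROM edw.sales s
  JOIN edw.customer_dim d ON s.customer_fk = d.id
  JOIN sfdc.account a ON d.crm_account_id = a.id
WHERE a.expansion_potential > 80
GROUP BY a.name;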
27. Multiple DW integration
Use of multiple DWs as if they were only one.
Motivation
Mergers and acquisitions
Different DWs by department
Transition to new EDW deployments (migration to Spark, Redshift, etc.)
Typical queries
Joins across fact tables in different DWs, with aggregations before or after the JOIN
Example (sketched after the diagram below)
Get customers with purchases higher than 100 USD that do not have a fidelity card (purchases and fidelity card data live in different DWs)
28. Multiple DW integration
[Diagram: a Sales fact with Time, Product, Region, City and Store dimensions in the Finance EDW is joined with a Customer Fidelity fact and its own Product dimension in the Marketing EDW.]
*Real examples: Nationwide POC, IBM tests
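A sketch of the example query under assumed names (the finance_dw and marketing_dw schemas are illustrative stand-ins for the two warehouses):

-- customers with purchases over 100 USD that do not have a fidelity card,
-- where the two facts live in different data warehouses
SELECT c.customer_id,
       SUM(s.amount) AS purchases
FROM finance_dw.sales s
  JOIN finance_dw.customer_dim c ON s.customer_fk = c.id
WHERE NOT EXISTS (SELECT 1
                  FROM marketing_dw.fidelity_fact f
                  WHERE f.customer_id = c.customer_id)
GROUP BY c.customer_id
HAVING SUM(s.amount) > 100;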
29. DW Historical Partitioning
Horizontal partitioning: only the most current data (e.g. the last year) is in the EDW; historical data is offloaded to a Hadoop cluster.
Motivation
Reduce storage cost
Transparently use the two datasets as if they were one
Typical queries
Facts are defined as a partitioned UNION based on date
Queries join the “virtual fact” with dimensions and aggregate on top
Example
Queries on current data only need to go to the DW, but longer timespans need to merge with Hadoop
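A minimal sketch of how the partitioned virtual fact might be defined; the source and table names (edw.sales_current, hadoop.sales_hist) are illustrative:

-- the "virtual fact" is a partitioned UNION based on date
CREATE VIEW sales AS
SELECT sale_id, sale_date, time_fk, product_fk, retailer_fk, amount
FROM edw.sales_current        -- last year only, kept in the EDW
UNION ALL
SELECT sale_id, sale_date, time_fk, product_fk, retailer_fk, amount
FROM hadoop.sales_hist;       -- older data offloaded to cheaper storage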
31. Slim DW extension
Vertical partitioning: a minimal DW, with more complete raw data in a Hadoop cluster.
Motivation
Reduce cost
Transparently use the two datasets as if they were one
Typical queries
Tables are defined virtually as 1-to-1 joins between the two systems
Queries join the facts with dimensions and aggregate on top
Example
Common queries only need to go to the DW, but some queries need attributes or measures from Hadoop
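A sketch of the vertical-partitioning view, again with illustrative names (the rarely used attributes are assumed to live only in Hadoop):

-- the complete table is a virtual 1-to-1 join between the slim DW fact
-- and the extra attributes stored in Hadoop
CREATE VIEW sales_full AS
SELECT s.sale_id, s.time_fk, s.product_fk, s.amount,
       x.channel, x.raw_details        -- rarely used attributes from Hadoop
FROM edw.sales_slim s
  JOIN hadoop.sales_extra x ON s.sale_id = x.sale_id;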
34. It is a common assumption that a virtualized solution will be much slower than a persisted approach via ETL, because:
1. A large amount of data is moved through the network for each query
2. Network transfer is slow
But is this really true?
35. Debunking the myths of virtual performance
1. Complex queries can be solved by transferring moderate data volumes when the right techniques are applied
Operational queries: predicate delegation produces small result sets
Logical Data Warehouse and Big Data: Denodo uses the characteristics of the underlying star schemas to apply query rewriting rules that maximize delegation to specialized sources (especially heavy GROUP BYs) and minimize data movement
2. Current networks are almost as fast as reading from disk
10 GbE and 100 GbE networking are a commodity
36. Performance Comparison: Logical Data Warehouse vs. Physical Data Warehouse
Denodo has done extensive testing using queries from the standard benchmark TPC-DS* in the following scenario:
It compares the performance of a federated approach in Denodo (Sales facts, 290 M rows; Customer dimension, 2 M rows; Items dimension, 400 K rows, spread across separate systems) vs. an MPP system holding the same tables, where all the data has been replicated via ETL.
* TPC-DS is the de facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
37. Performance Comparison: Logical Data Warehouse vs. Physical Data Warehouse

Query Description | Returned Rows | Time Netezza | Time Denodo (federated Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected)
Total sales by customer | 1.99 M | 20.9 sec | 21.4 sec | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec | 59.0 sec | Full aggregation push-down
Total sales by item brand | 31.35 K | 4.7 sec | 5.0 sec | Partial aggregation push-down
Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec | 5.2 sec | On-the-fly data movement
38. Performance and optimizations in Denodo
Focused on 3 core concepts:
Dynamic Multi-Source Query Execution Plans
Leverages the processing power & architecture of the data sources
Dynamic, to support ad hoc queries
Uses statistics for cost-based query plans
Selective Materialization
Intelligent caching of only the most relevant and often-used information
Optimized Resource Management
Smart allocation of resources to handle high concurrency
Throttling to control and mitigate impact on the sources
Resource plans based on rules
39. Performance and optimizations in Denodo
Comparing optimizations in DV vs. ETL
Although Data Virtualization is a data integration platform, architecturally speaking it is more similar to an RDBMS:
Uses relational logic
Its metadata is equivalent to that of a database
Enables ad hoc querying
The key difference between ETL engines and DV:
ETL engines are optimized for static bulk movements (fixed data flows)
Data virtualization is optimized for queries (a dynamic execution plan per query)
Therefore, the performance architecture presented here resembles that of an RDBMS.
41. How the Dynamic Query Optimizer Works: Step by Step
Metadata / Query Tree
• Maps query entities (tables, fields) to the actual metadata
• Retrieves execution capabilities and restrictions for the views involved in the query
Static Optimizer
• Query delegation
• SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)
• Data movement query plans
Cost-Based Optimizer
• Picks optimal JOIN methods and orders based on data distribution statistics, indexes, transfer rates, etc.
Physical Execution Plan
• Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)
42. How the Dynamic Query Optimizer Works
Example: total sales by retailer and product during the last month for the brand ACME.
[Diagram: the fact table (sales) and the Time dimension live in the EDW; the Product and Retailer dimensions live in the MDM.]

SELECT retailer.name,
       product.name,
       SUM(sales.amount)
FROM sales
  JOIN retailer ON sales.retailer_fk = retailer.id
  JOIN product ON sales.product_fk = product.id
  JOIN time ON sales.time_fk = time.id
WHERE time.date < ADDMONTH(NOW(), -1)
  AND product.brand = 'ACME'
GROUP BY product.name, retailer.name
43. How the Dynamic Query Optimizer Works
Example: non-optimized plan. Each table is queried separately and everything is joined and grouped in the virtualization layer:

SELECT sales.retailer_fk, sales.product_fk, sales.time_fk, sales.amount
FROM sales
→ 1,000,000,000 rows

SELECT retailer.name, retailer.id
FROM retailer
→ 100 rows

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
→ 10 rows

SELECT time.date, time.id
FROM time
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
→ 30 rows

The three JOINs leave 10,000,000 rows for the final GROUP BY product.name, retailer.name, all processed in the virtualization layer.
44. How the Dynamic Query Optimizer Works
Step 1: applies JOIN reordering to maximize delegation. The join between sales and time is pushed down to the EDW:

SELECT sales.retailer_fk, sales.product_fk, sales.amount
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
→ 100,000,000 rows

SELECT retailer.name, retailer.id
FROM retailer
→ 100 rows

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
→ 10 rows

The remaining JOINs leave 10,000,000 rows for the GROUP BY product.name, retailer.name in the virtualization layer.
45. How the Dynamic Query Optimizer Works
Step 2: since the JOIN is on foreign keys (1-to-many), and the GROUP BY is on attributes from the dimensions, it applies the partial aggregation push-down optimization:

SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
GROUP BY sales.retailer_fk, sales.product_fk
→ 10,000 rows

SELECT retailer.name, retailer.id
FROM retailer
→ 100 rows

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
→ 10 rows

The JOINs now leave only 1,000 rows for the final GROUP BY product.name, retailer.name in the virtualization layer.
46. How the Dynamic Query Optimizer Works
Step 3: selects the right JOIN strategies based on cost estimations of the data volumes. The 10 ACME products are retrieved first and a NESTED JOIN pushes their ids into the fact query; the result is then HASH JOINed with retailer:

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
→ 10 rows

SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
  AND sales.product_fk IN (1, 2, …)    -- ids injected by the nested join
GROUP BY sales.retailer_fk, sales.product_fk
→ 1,000 rows

SELECT retailer.name, retailer.id
FROM retailer
→ 100 rows

The final GROUP BY product.name, retailer.name processes 1,000 rows in the virtualization layer.
47. How the Dynamic Query Optimizer Works: Summary
1. Automatic JOIN reordering
Groups branches that go to the same source to maximize query delegation and reduce processing in the DV layer
End users don’t need to worry about the optimal “pairing” of the tables
2. The partial aggregation push-down optimization is key in these scenarios. Based on PK-FK restrictions, it pushes the aggregation (for the PKs) to the DW
Leverages the processing power of the DW, which is optimized for these aggregations
Significantly reduces the data transferred through the network (from 1 billion rows to 10,000)
3. The cost-based optimizer picks the right JOIN strategies based on estimations of data volumes, existence of indexes, transfer rates, etc.
Denodo estimates costs differently for parallel databases (Vertica, Netezza, Teradata) than for regular databases, to take into account the different way those systems operate (distributed data, parallel processing, different aggregation techniques, etc.)
48. How the Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
Automatic data movement (see the sketch below)
Creation of temp tables in one of the systems to enable complete delegation
Only considered as an option if the target source has the “data movement” option enabled
Uses native bulk load APIs for better performance
Execution alternatives
If a view exists in more than one system, Denodo can decide at execution time which one to use
The goal is to maximize query delegation, depending on the other tables involved in the query
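A conceptual sketch of the data movement technique; this is plain SQL illustrating what the engine does internally, not Denodo’s actual syntax, and the names are illustrative:

-- 1. Fetch the small branch (e.g. the 10 filtered ACME products) from the
--    secondary source and bulk-load it into a temp table in the DW
CREATE TEMPORARY TABLE tmp_product (id INT, name VARCHAR(100));
-- ... rows loaded via the DW's native bulk load API ...

-- 2. The complete JOIN and aggregation can now be delegated to the DW
SELECT p.name, SUM(s.amount) AS total_sales
FROM sales s
  JOIN tmp_product p ON s.product_fk = p.id
GROUP BY p.name;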
49. How the Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
Optimizations for virtual partitioning
Eliminates unnecessary queries and processing based on a pre-execution analysis of the views and the queries:
Pruning of unnecessary JOIN branches
Relevant for horizontal partitioning and “fat” semantic models when queries do not need attributes from all the tables
Pruning of unnecessary UNION branches (see the example below)
Enables detection of unnecessary UNION branches in vertical partitioning scenarios
Push-down of JOINs under UNION views
Enables the delegation of JOINs with dimensions
Automatic data movement for partition scenarios
Enables the delegation of JOINs with dimensions
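As an example of branch pruning, consider the partitioned virtual fact sketched earlier for the historical offloading pattern (names are illustrative):

-- 'sales' is the UNION of edw.sales_current and hadoop.sales_hist;
-- the date filter matches only the current partition, so the optimizer
-- prunes the Hadoop branch and sends no query to that system
SELECT product_fk, SUM(amount) AS total_sales
FROM sales
WHERE sale_date >= add_months(CURRENT_TIMESTAMP, -1)
GROUP BY product_fk;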
51. Caching: real time vs. caching
Sometimes real-time access & federation are not a good fit:
Sources are slow (e.g. text files, cloud apps like Salesforce.com)
A lot of data processing is needed (e.g. complex combinations, transformations, matching, cleansing, etc.)
Access is limited, or impact on the sources has to be mitigated
For these scenarios, Denodo can replicate just the relevant data in the cache.
52. Caching: overview
Denodo’s cache system is based on an external relational database:
Traditional (Oracle, SQL Server, DB2, MySQL, etc.)
MPP (Teradata, Netezza, Vertica, Redshift, etc.)
In-memory storage (Oracle TimesTen, SAP HANA)
Works at the view level.
Allows hybrid access (real-time / cached) within an execution tree.
Cache control (population / maintenance):
Manually – user-initiated at any time
Time-based – using the TTL or the Denodo Scheduler
Event-based – e.g. using JMS messages triggered in the DB
54. Further Reading
Data Virtualization Blog (http://www.datavirtualizationblog.com)
Check the following articles written by our CTO, Alberto Pan:
• Myths in data virtualization performance
• Performance of Data Virtualization in Logical Data Warehouse scenarios
• Physical vs Logical Data Warehouse: the numbers
• Cost Based Optimization in Data Virtualization
Denodo Cookbook
• Data Warehouse Offloading
56. Autodesk Overview
• Founded 1982 (NASDAQ: ADSK)
• Annual revenues (FY 2015): $2.5B
Over 8,800 employees
• 3D modeling and animation software
Flagship product is AutoCAD
• Market sectors:
Architecture, Engineering, and Construction
Manufacturing
Media and Entertainment
Recently started 3D printing offerings
57. Business Drivers for Change
• The software consumption model is changing
From perpetual licenses to subscriptions
Users want more flexibility in how they use software
• Autodesk needed to transition to subscription pricing
2016 – some products will be subscription-only
• Lifetime revenue is higher with subscriptions
Over 3-5 years, subscriptions = more revenue
• Changing a licensing model is disruptive
58. Technology Challenges
• The current ‘traditional’ BI/EDW architecture was not designed for data streams from online apps
Weblogs, clickstreams, cloud/desktop apps, etc.
• The existing infrastructure can’t simply ‘go away’
Regulatory reporting (e.g. SEC)
Existing ‘perpetual’ customers
• The ‘subscription’ infrastructure must work in parallel
Extend and enhance existing systems
With a single access point to all data
• Solution – a ‘Logical Data Warehouse’
63. Case Study: Autodesk Successfully Changes Its Revenue Model and Transforms the Business

Problem
Autodesk was changing its business revenue model from a conventional perpetual license model to a subscription-based license model.
Inability to deliver high-quality data in a timely manner to business stakeholders.
Evolution from a traditional operational data warehouse to a contemporary logical data warehouse was deemed necessary for faster speed.

Solution
A general-purpose platform to deliver data through a logical data warehouse.
The Denodo abstraction layer supports live invoicing with SAP.
Data virtualization enabled a culture of “see before you build”.

Results
Successfully transitioned to subscription-based licensing.
For the first time, Autodesk can enforce security at a single point and has a uniform data environment for access.

Autodesk, Inc. is an American multinational software corporation that makes software for the architecture, engineering, construction, manufacturing, media, and entertainment industries.