Logical Data Warehouse: How to Build a Virtualized Data Services Layer

1
Ajay Shriwastava
Sachin Ghai
Impetus Technologies Inc.
Logical Data Warehouse:
Building a virtual data services layer
Hadoop Summit – San Jose – 11 June 2015

2
AGENDA
• Emergence of Logical Data Warehouse
• Virtualization
• Offload

3
EVOLUTION OF THE DATA WAREHOUSE

4
If this slide looks inverted to you, it actually is.
The concepts behind the data warehouse as we have known it so far have been inverted
by the emergence of BIG DATA.

5
Enterprise Data Warehouse (EDW)
• Pre-determined input schema
• Extensive data governance
• ANSI SQL compliance
• IT teams ownership
• Concurrent users
• Pre-canned BI
Big Data Warehouse (BDW)
• Low cost storage/archive
• Non SQL access
• Exploratory analysis
• Machine Learning/Graph
• Data discovery
• Self Service BI/analytics

6
CO-EXISTENCE OF EDW AND BDW – A REALITY
• Organizations are still in the initial phase of their Big Data journey.
• Existing ETL jobs feed the EDW.
• BI and downstream applications use ANSI SQL to query data in the EDW.
• Business use cases for big data are emerging.

7
EMERGENCE OF LOGICAL DATA WAREHOUSE
In response to emerging forces such as Big Data, data warehousing practices evolved, leading
to the emergence of the “Logical Data Warehouse”.
Key components include:
• Repository management
• Data virtualization
• Distributed process
• Auditing statistics and performance evaluation services
• SLA management
• Taxonomy/Ontology resolution
• Metadata management
First proposed in May 2009 and published in August 2011 research by Gartner.

8
NEW PARADIGMS
DATA LAKE
• Repositories are no longer only the Enterprise Data Warehouse or data marts: HDFS has emerged as the “data lake”, along with NoSQL data stores.
DISTRIBUTED PARALLEL PROCESS
• Distributed processing has become synonymous with MapReduce on files or databases. With Spark, more distributed operators are becoming commonplace.
VIRTUALIZATION
• Virtualization is gaining favor as an access mechanism where transient consolidation is required for a use case.
OFFLOAD
• Offload to newer repositories and processing engines now requires a more exact science and process.

9
VIRTUALIZATION – AND ITS MANY DELINEATIONS
Data Virtualization
• simplified, unified, and integrated view
Data Federation
• is a subset of data virtualization
• enhanced with query optimization strategies for specific source
Data Blending
• involves actual data movement and ‘write’ to a repository rather than just ‘read’ for a transient use case

10
BRIDGING THE CHANNEL WITH VIRTUALIZATION
SOURCE SYSTEMS
• Relational: Oracle, DB2…
• NoSQL: Cassandra, MongoDB…
• File Systems: HDFS, GPFS…
• MPP: Teradata, Netezza…
• Hadoop based Warehouse: Hive, Tajo…
VIRTUALIZATION
TARGET SYSTEMS
• Users: Enterprise, Web, Mobile
• Applications: Enterprise, ESB…
• BI: Reporting, Visualization…
• Data Science: Machine Learning, Graph…
• Data Management: MDM, Discovery…
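
As a rough illustration of the idea on this slide, the sketch below exposes one read-only view over two independent stores without copying data between them. It is a toy under stated assumptions, not the presenters' implementation: SQLite in-memory databases stand in for the source systems, and the names client_sys, account_sys, client_balances and the sample rows are all hypothetical.

```python
import sqlite3


def build_virtual_layer() -> sqlite3.Connection:
    """Attach two independent stores and expose one unified view over both."""
    # The main connection stands in for the virtualization layer (assumption).
    conn = sqlite3.connect(":memory:")
    # Each attached in-memory database is a separate store, standing in for
    # a distinct source system (hypothetical names).
    conn.execute("ATTACH DATABASE ':memory:' AS client_sys")
    conn.execute("ATTACH DATABASE ':memory:' AS account_sys")
    conn.execute("CREATE TABLE client_sys.clients (client_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE account_sys.accounts (client_id INTEGER, balance REAL)")
    # Made-up sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO client_sys.clients VALUES (?, ?)",
        [(1, "Alice"), (2, "Bob")],
    )
    conn.executemany(
        "INSERT INTO account_sys.accounts VALUES (?, ?)",
        [(1, 100.0), (1, 50.0), (2, 75.0)],
    )
    # A TEMP view is used because SQLite only lets temporary views reference
    # tables in other attached databases. Nothing is copied: the join runs
    # against the sources each time the view is queried.
    conn.execute(
        """
        CREATE TEMP VIEW client_balances AS
        SELECT c.client_id, c.name, SUM(a.balance) AS total_balance
        FROM client_sys.clients AS c
        JOIN account_sys.accounts AS a ON a.client_id = c.client_id
        GROUP BY c.client_id, c.name
        """
    )
    return conn


def total_balances(conn: sqlite3.Connection) -> list:
    """Query the unified view exactly as a BI tool or application would."""
    return conn.execute(
        "SELECT name, total_balance FROM client_balances ORDER BY client_id"
    ).fetchall()
```

Calling total_balances(build_virtual_layer()) reads through the view; consumers see a single schema while the data stays in the two source stores. A production layer would additionally need connectors, query pushdown and security, none of which this sketch attempts.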

11
WEALTH MANAGEMENT – USE CASE
Big data can transform client- and account-centric wealth management into personalized, goal-based
wealth management.
Unfortunately, the information is spread across many different lines of business that use separate data
sources and platforms.

12
GOAL BASED WEALTH MANAGEMENT
• Move from an account or client centric view to a household and relationship view.
• Take a comprehensive approach to understanding the long term financial goals of the client.
• Facilitate financial security during life changing events – marriage, college, job changes, retirement,
intergenerational wealth transfer.

13
BUILDING RELATIONSHIPS
Household View, Business Network View, Hierarchy View
• Collect the data and process it on a common platform.
  - Different LOBs
  - IVR Logs/Web Logs
• Discover relationships.
• Unified view over existing data in client and account systems.
• Integrate with social data.
• Implement Governance.
• Implement corporate hierarchy over client data.
[Figure: example relationship graphs. Node labels: Root Node, Self, Spouse, Father, Son, Daughter-1, Daughter-2, Jerry Mayfield, Paul Robinson, Andrew Madura, Linda Mays, Jack Kline. Relationship labels: Friends, Golf Buddy, College Alumni, Neighbors.]

14
PERSONALIZED SERVICES CROSS/UP SELL
Household View, Business Network View, Hierarchy View
• 401(k) / IRA
• 529
• College loans
• Student credit cards
• Gift cards
• Estate planning
• Alimonies
• Company loans
• Asset standing
• Business prospects
• Corporate accounts
• Corporate discounts
• Group based services
• Prospects identification
• Goal based services
[Charts: Assets Distribution; Goals vs. Actuals; Liabilities]

15
TRADITIONAL DATA INTEGRATION
• Data is embedded in silos
• ETL processes are time consuming and resource intensive
• ETL creates data duplication
• Governance is an inhibitor, not an enabler
• Inability to handle the V’s of big data

16
LOGICAL DATA WAREHOUSE REFERENCE ARCHITECTURE
[Architecture diagram; labels recovered from the figure:]
• Cross-cutting: Governance – Security, Audit, Lineage; Monitoring and Cluster Management; Big Data Governance; Ankush; Jumbune
• Sources: Producer A; Producer B; OLTP/RDBMS; EDW – Teradata / Netezza etc.
• Stream Ingestion: Distributed Messaging Layer (Kafka); Spark Streaming; Storm; StreamAnalytix
• Batch Ingestion: Batch Data Ingestion; Sqoop / Talend etc.; SQL Offload Solution; EDW Migration
• Hadoop Cluster / YARN: HDFS/Hive/PIG; In memory data layer (Spark); Centralized Schema
• Analytics/Virtualization: Data Virtualization Layer; Kundera; Kyvos Engine; ML Lib; Mahout; R Algorithms; Data Quality
• Data Access: REST API; Query API (sql); Search API (ES, Solr, NLP); JDBC; Proprietary connectors / ODBC Drivers; Custom layer for universal connectivity
• BI Tools – MicroStrategy, Tableau, Kyvos…
• Legend: Impetus Offerings (Details: Appendix 1); Recommended Platform; Platform Requirements

17
UNIFIED VIEW - ADVANTAGES
• Fast, real-time data integration without creating expensive copies of data.
• Significant saving of time and resources required for ETL.
• Facility to create a final composite schema.
• Information management capability.
• Meet stringent service level agreements.

18
OFFLOAD
Offload cold data and exploratory analysis workloads to a Hadoop cluster
running on commodity hardware
– Save cost and resources

19
OFFLOAD CONCEPT
An Enterprise Data Warehouse (EDW) to Big Data Warehouse (BDW) offload
essentially involves migrating tables and code.
• EDW inputs: Tables; SQL Scripts; Static SQL; Procedures/PL-SQL; Proprietary script (RoadMap)
• SQL Script Parser: converts the inputs into Hive Queries, or JAVA code with Hive Queries
• BDW: Run Hive queries on Hadoop

20
KEY CHALLENGES IN OFFLOAD
• Varied input sources
• Validating complete schema and data offload
• ANSI SQL incompatibility
• User Defined Functions unavailability in target system
• Lack of unified view and UI
• Missing Data Quality checks
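
One of the challenges above, validating that a table was offloaded completely, can be sketched as a comparison of row counts and an order-independent checksum between source and target. This is a minimal illustration, not the presenters' tooling: SQLite connections stand in for the EDW and the Hadoop-side warehouse, and the function names table_fingerprint and validate_offload are hypothetical.

```python
import sqlite3
import zlib


def table_fingerprint(conn: sqlite3.Connection, table: str, columns: list) -> tuple:
    """Return (row_count, checksum) for the chosen columns of a table.

    Table and column names are interpolated into the SQL text, so they must
    come from a trusted catalog, never from user input.
    """
    col_list = ", ".join(columns)
    rows = conn.execute(f"SELECT {col_list} FROM {table}").fetchall()
    # Summing per-row CRCs makes the checksum independent of row order,
    # which matters because the two systems may return rows differently.
    checksum = sum(zlib.crc32(repr(row).encode("utf-8")) for row in rows) % 2**32
    return len(rows), checksum


def validate_offload(source: sqlite3.Connection, target: sqlite3.Connection,
                     table: str, columns: list) -> dict:
    """Compare the same table on both sides and report what matches."""
    src_count, src_sum = table_fingerprint(source, table, columns)
    tgt_count, tgt_sum = table_fingerprint(target, table, columns)
    return {
        "row_count_match": src_count == tgt_count,
        "checksum_match": src_sum == tgt_sum,
    }
```

The sketch pulls every row to the client, which is only reasonable for small tables. On real warehouse volumes the same comparison would more plausibly be pushed down as aggregate queries on each side, and type or formatting differences between engines would need normalizing before checksums are compared.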

21
HOW WE BUILT THE OFFLOAD SOLUTION
Step 1: Identification
Step 2: Schema and Data Migration
Step 3: Logic Migration
Step 4: Data Quality Enhancement
Step 5: Transformed code Execution
Reduced Time!
Reduced Risk!
Automation!

22
SAMPLE QUERY – AUTO TRANSFORMED
Teradata Query:
INSERT INTO month_wise_ship_agg
select d_month_seq
,substr(w_warehouse_name,1,20) , TRIM(TRAILING '_' FROM sm_type)
,cc_name ,sum(case when (cs_ship_date_sk - cs_sold_date_sk <= 30 ) then 1 else 0 end) as "30 days"
,sum(case when (cs_ship_date_sk - cs_sold_date_sk > 30) and (cs_ship_date_sk - cs_sold_date_sk <= 60)
then 1 else 0 end ) as "31-60 days"
from catalog_sales ,warehouse ,ship_mode ,call_center C,date_dim
where cs_ship_date_sk = d_date_sk
and EXTRACT (YEAR FROM d_date) IN (2000,2001)
and DAYNUMBER_OF_MONTH(d_date) > 1
and TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE
and cs_warehouse_sk = w_warehouse_sk
and cs_ship_mode_sk = sm_ship_mode_sk
and cs_call_center_sk = cc_call_center_sk
and (C.cust_id , C.address) LIKE ANY ( SELECT C1.cust_id, C1.address FROM Customer C1 )
group by substr(w_warehouse_name,1,20) ,sm_type ,cc_name ,d_month_seq ;
Hive Query:
INSERT INTO TABLE month_wise_ship_agg
SELECT d_month_seq,SUBSTR( w_warehouse_name , 1 , 20 ) AS auto_c01,
sm_type, UDF_TRIM('TRAILING ','_' , sm_type) ,
cc_name,SUM( CASE WHEN ( cs_ship_date_sk - cs_sold_date_sk <= 30) THEN 1 ELSE 0 END ) AS 30_days,
SUM( CASE WHEN ( cs_ship_date_sk - cs_sold_date_sk > 30) AND ( cs_ship_date_sk - cs_sold_date_sk <=
60) THEN 1 ELSE 0 END ) AS 31_60_days
FROM catalog_sales, warehouse, ship_mode, call_center C, date_dim
WHERE cs_ship_date_sk = d_date_sk
AND EXTRACT ('YEAR', d_date) IN (2000,2001)
AND DAYNUMBER_OF_MONTH(d_date) > 1
and TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE()
AND cs_warehouse_sk = w_warehouse_sk
AND cs_ship_mode_sk = sm_ship_mode_sk
AND cs_call_center_sk = cc_call_center_sk
AND EXISTS ( SELECT * FROM Customer C1 WHERE C.cust_id LIKE C1.cust_id AND C.address LIKE C1.address )
GROUP BY SUBSTR( w_warehouse_name , 1 , 20 ) ,sm_type,cc_name,d_month_seq;
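
A few of the rewrites visible in the sample above can be imitated with simple pattern rules. The sketch below is a toy, not the presenters' SQL script parser: it assumes four regex rules inferred from the slide (INSERT INTO TABLE, TRIM to the slide's UDF_TRIM, quoted aliases to underscore names, CURRENT_DATE to CURRENT_DATE()), and the names RULES and to_hive are hypothetical. A real converter needs a proper SQL grammar to handle nesting and constructs such as LIKE ANY.

```python
import re


def _alias(match):
    """Turn a quoted alias such as "31-60 days" into an underscore name."""
    return "AS " + re.sub(r"\W+", "_", match.group(1).strip())


def _trim(match):
    """Map TRIM(<side> '<char>' FROM <expr>) to the UDF_TRIM call on the slide."""
    side, char, expr = match.group(1).upper(), match.group(2), match.group(3)
    return f"UDF_TRIM('{side}', {char}, {expr})"


# Each rule is (pattern, replacement); rules are applied in order.
RULES = [
    # INSERT INTO t  ->  INSERT INTO TABLE t (skipped if TABLE is already there)
    (re.compile(r"\bINSERT\s+INTO\b(?!\s+TABLE\b)", re.I), "INSERT INTO TABLE"),
    # Only flat arguments are handled; nested parentheses are left untouched.
    (re.compile(r"\bTRIM\s*\(\s*(LEADING|TRAILING|BOTH)\s+('[^']*')\s+FROM\s+([^()]+?)\s*\)", re.I), _trim),
    # as "30 days"  ->  AS 30_days
    (re.compile(r'\bAS\s+"([^"]+)"', re.I), _alias),
    # CURRENT_DATE  ->  CURRENT_DATE() (skipped if already a call)
    (re.compile(r"\bCURRENT_DATE\b(?!\s*\()", re.I), "CURRENT_DATE()"),
]


def to_hive(sql: str) -> str:
    """Apply every rewrite rule once, in order, and return the new text."""
    for pattern, replacement in RULES:
        sql = pattern.sub(replacement, sql)
    return sql
```

Because every rule refuses to match its own output, running to_hive twice gives the same text as running it once, which makes the rules safe to apply to partially converted scripts.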

23
OFFLOAD - KEY ADVANTAGES
• Optimize MPP and Relational database resources for workloads
• Re-use millions of lines of code and $
• Avoid the learning curve and re-code, re-test cycles
• Seamless integration of downstream / upstream apps and reports

24
SUMMARY
To summarize, the following steps are recommended for creating a logical data warehouse:
• Establish your data strategy
• Identify key components: Hadoop, MPP, Spark, NoSQL etc.
• Segregate your workloads
• Offload to low cost Hadoop where required
• Leverage Virtualization for key use cases
• Establish Data Quality, SLA, Semantics and MetaData as key supporting pillars

25
SIGN-OFF QUOTE
“In the end, there can be only 2 types of data
warehouses: Logical data warehouse and illogical
data warehouse…”

26
Thank you.
Questions??
ajay.shriwastava@impetus.co.in
sachin.ghai@impetus.co.in

27
APPENDIX 1
Product: URL
StreamAnalytix: http://streamanalytix.com/
Kyvos: http://www.kyvosinsights.com/
Kundera: http://bigdata.impetus.com/open_source_kundera
SQL Offload Solution: http://www.impetus.com/sites/impetus.com/impetus/brochures/ETL_Offloading_Datasheet.pdf
Ankush: http://bigdata.impetus.com/ankush
Jumbune: http://www.jumbune.org/
