1
Ajay Shriwastava
Sachin Ghai
Impetus Technologies Inc.
Logical Data Warehouse:
Building a virtual data services layer
Hadoop Summit – San Jose – 11 June 2015
2
AGENDA
• Emergence of Logical Data Warehouse
• Virtualization
• Offload
3
EVOLUTION OF THE DATA WAREHOUSE
4
EVOLUTION OF THE DATA WAREHOUSE
If this slide looks inverted to you, it actually is.
The data warehouse as we knew it has had its concepts inverted by the emergence of big data.
5
EVOLUTION OF THE DATA WAREHOUSE
Enterprise Data Warehouse (EDW)
• Pre-determined input schema
• Extensive data governance
• ANSI SQL compliance
• IT teams ownership
• Concurrent users
• Pre-canned BI
Big Data Warehouse (BDW)
• Low cost storage/archive
• Non SQL access
• Exploratory analysis
• Machine Learning/Graph
• Data discovery
• Self Service BI/analytics
6
CO-EXISTENCE OF EDW AND BDW – A REALITY
• Organizations are still in the initial phase of their big data journey.
• Existing ETL jobs feed the EDW.
• BI and downstream applications use ANSI SQL to query data in the EDW.
• Business use cases for big data are emerging.
7
EMERGENCE OF LOGICAL DATA WAREHOUSE
In response to emerging forces like big data, the evolution of data warehousing practices led to the emergence of the “Logical Data Warehouse”.
Key components include:
• Repository management
• Data virtualization
• Distributed process
• Auditing statistics and performance evaluation services
• SLA management
• Taxonomy/Ontology resolution
• Metadata management
First proposed in May 2009 and published in August 2011 research by Gartner.
8
NEW PARADIGMS
DATA LAKE
Repositories are no longer just the Enterprise Data Warehouse or data marts: HDFS has emerged as the “data lake”, along with NoSQL data stores.
DISTRIBUTED PARALLEL PROCESS
Distributed processing has become synonymous with MapReduce on files or databases. With Spark, more distributed operators are becoming commonplace.
VIRTUALIZATION
Virtualization is gaining favor as an access mechanism where transient consolidation is required for a use case.
OFFLOAD
Offload to newer repositories and processing engines now requires a more exact science and process.
9
VIRTUALIZATION – AND ITS MANY DELINEATIONS
Data Virtualization
• simplified, unified, and integrated view
Data Federation
• is a subset of data virtualization
• enhanced with query optimization strategies for specific sources
Data Blending
• involves actual data movement and ‘write’ to a repository rather than just ‘read’ for a transient use case
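To make the federation idea concrete, here is a minimal sketch, assuming two in-memory SQLite databases stand in for separate source systems. The table names, client names, and the `federated_balances` helper are all made up for illustration and are not from the talk. The filter and aggregation are pushed down to the account source, the join happens in the virtual layer, and nothing is written to a repository.

```python
import sqlite3

# Two independent in-memory databases stand in for separate source systems
# (hypothetical: a client store and an account store).
clients_db = sqlite3.connect(":memory:")
clients_db.execute("CREATE TABLE clients (client_id INTEGER, name TEXT)")
clients_db.executemany("INSERT INTO clients VALUES (?, ?)",
                       [(1, "Asha"), (2, "Ben"), (3, "Chen")])

accounts_db = sqlite3.connect(":memory:")
accounts_db.execute("CREATE TABLE accounts (client_id INTEGER, balance REAL)")
accounts_db.executemany("INSERT INTO accounts VALUES (?, ?)",
                        [(1, 1200.0), (1, 300.0), (2, 50.0), (3, 9000.0)])


def federated_balances(min_balance):
    """Return (name, total_balance) for clients at or above min_balance.

    The aggregation and filter run inside the account source (pushdown),
    so only qualifying rows leave it. The join is done here, in the
    virtual layer, and the result is transient.
    """
    totals = accounts_db.execute(
        "SELECT client_id, SUM(balance) FROM accounts "
        "GROUP BY client_id HAVING SUM(balance) >= ?", (min_balance,)
    ).fetchall()
    if not totals:
        return []
    ids = [client_id for client_id, _ in totals]
    placeholders = ",".join("?" * len(ids))
    names = dict(clients_db.execute(
        f"SELECT client_id, name FROM clients WHERE client_id IN ({placeholders})",
        ids,
    ).fetchall())
    return sorted((names[client_id], total) for client_id, total in totals)
```

Data blending, by contrast, would persist the joined rows into a target table instead of returning them to the caller.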
10
BRIDGING THE CHANNEL WITH VIRTUALIZATION
SOURCE SYSTEMS
• Relational: Oracle, DB2…
• NoSQL: Cassandra, MongoDB…
• File Systems: HDFS, GPFS…
• MPP: Teradata, Netezza…
• Hadoop based Warehouse: Hive, Tajo…
VIRTUALIZATION
TARGET SYSTEMS
• Users: Enterprise, Web, Mobile
• Applications: Enterprise, ESB…
• BI: Reporting, Visualization…
• Data Science: Machine Learning, Graph…
• Data Management: MDM, Discovery…
11
WEALTH MANAGEMENT – USE CASE
Big data can transform client and account centric wealth management into personalized, goal based wealth management.
Unfortunately, the information is spread across many different lines of business using separate data sources and platforms.
12
GOAL BASED WEALTH MANAGEMENT
• From an account or client centric view to a household and relationship view.
• A comprehensive approach to understanding the long term financial goals of the client.
• Facilitate financial security during life changing events – marriage, college, job changes, retirement, inter generational wealth transfer.
13
BUILDING RELATIONSHIPS
Household View | Business Network View | Hierarchy View
• Collect the data and process it on a common platform.
  - Different LOBs
  - IVR logs / web logs
• Discover relationships.
• Unified view over existing data in client and account systems.
• Integrate with social data.
• Implement governance.
• Implement corporate hierarchy over client data.
[Diagram: relationship graphs around a root node. Example people: Jerry Mayfield, Paul Robinson, Andrew Madura, Linda Mays, Jack Kline. Relationship labels: Friends, Golf Buddy, College Alumni, Neighbors, Father, Son, Daughter-1, Daughter-2, Self, Spouse.]
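The slide does not say how relationships are turned into a household view. One possible sketch, assuming relationship links (for example spouse or parent and child) have already been discovered as pairs of client ids, is to merge linked clients with a union-find structure. The function name and the sample ids are hypothetical.

```python
from collections import defaultdict


def build_households(clients, links):
    """Group client ids into households from pairwise relationship links."""
    parent = {client: client for client in clients}

    def find(client):
        # Walk up to the group's representative, shortening the path as we go.
        while parent[client] != client:
            parent[client] = parent[parent[client]]
            client = parent[client]
        return client

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for client in clients:
        groups[find(client)].add(client)
    return sorted(sorted(group) for group in groups.values())
```

For example, `build_households(["self", "spouse", "son", "neighbor"], [("self", "spouse"), ("self", "son")])` puts the first three ids in one household and leaves the neighbor alone. Which link types count as "same household" is a business decision that this sketch leaves to the caller.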
14
PERSONALIZED SERVICES CROSS/UP SELL
Household View | Business Network View | Hierarchy View
• 401 K / IRA
• 529
• College loans
• Student Credit Cards.
• Gift Cards
• Estate Planning
• Alimonies
• Company loans
• Asset standing
• Business prospects
• Corporate Accounts.
• Corporate discounts.
• Group based services.
• Prospects identification.
• Goal based services.
[Charts: Assets, Distribution, Liabilities; legend: Goals, Actuals]
15
TRADITIONAL DATA INTEGRATION
• Data is embedded in silos
• Time consuming and resource intensive ETL processes
• Creates data duplication
• Governance is an inhibitor, not an enabler
• Inability to handle the V’s of big data
16
LOGICAL DATA WAREHOUSE REFERENCE ARCHITECTURE
[Architecture diagram. Legend: Impetus Offerings (Details: Appendix 1), Recommended Platform, Platform Requirements. Labeled components:]
• Sources: Producer A, Producer B, OLTP/RDBMS, EDW (Teradata/Netezza etc.)
• Stream Ingestion: Distributed Messaging Layer (Kafka), StreamAnalytix, Spark Streaming, Storm
• Batch Ingestion: Batch Data Ingestion (Sqoop/Talend etc.), SQL Offload Solution, EDW Migration
• Hadoop Cluster/YARN: HDFS/Hive/Pig, In-memory data layer (Spark), Centralized Schema
• Analytics/Virtualization: Data Virtualization Layer, Kundera, Kyvos Engine, MLlib, Mahout, R Algorithms, Data Quality
• Data Access: Query API (SQL), REST API, Search API (ES, Solr, NLP), JDBC, proprietary connectors/ODBC drivers, custom layer for universal connectivity
• BI Tools: MicroStrategy, Tableau, Kyvos…
• Governance: Security, Audit, Lineage; Big Data Governance
• Monitoring and Cluster Management: Ankush, Jumbune
17
UNIFIED VIEW - ADVANTAGES
• Fast real time data integration without creating expensive copies of data.
• Significant saving of time and resources required for ETL
• Facility to create a final composite schema.
• Information management capability.
• Meet stringent service level agreements.
18
OFFLOAD
Offload cold data and exploratory analysis workloads to a Hadoop cluster running on commodity hardware
– Save cost, resources
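As an illustration of how "cold" data might be picked out, here is a small sketch. It assumes a catalog that records each table's size and the date it was last queried; the field names and the 365 day threshold are assumptions for the example, not figures from the talk.

```python
from datetime import date


def offload_candidates(tables, today, cold_after_days=365):
    """Return names of tables not queried within cold_after_days, largest first."""
    cold = [
        table for table in tables
        if (today - table["last_queried"]).days > cold_after_days
    ]
    # Largest tables first: they free the most warehouse capacity when moved.
    return [
        table["name"]
        for table in sorted(cold, key=lambda t: t["size_gb"], reverse=True)
    ]
```

In practice the "last queried" signal would come from the warehouse's own query logs, and the threshold would be tuned per workload.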
19
OFFLOAD CONCEPT
An Enterprise Data Warehouse (EDW) to Big Data Warehouse (BDW) offload essentially involves migrating tables and code.
[Diagram: offload flow]
• EDW inputs: Tables; SQL Scripts; Static SQL Procedures/PL-SQL; Proprietary scripts (Roadmap)
• SQL Script Parser converts the scripts into Hive queries, or Java code with Hive queries
• BDW: run the Hive queries on Hadoop
20
KEY CHALLENGES IN OFFLOAD
• Varied input sources
• Validating complete schema and data offload
• ANSI SQL incompatibility
• User Defined Functions unavailability in target system
• Lack of unified view and UI
• Missing Data Quality checks
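Validating that a schema and its data were offloaded completely can be sketched as a fingerprint comparison. The example below is one possible approach, not the method used in the talk: it compares column names, row counts, and an order-independent content hash, with two SQLite connections standing in for the EDW and the Hadoop side. All function names are hypothetical.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table):
    """Return (column names, row count, order-independent content hash).

    The table name is interpolated directly, so it must come from a
    trusted catalog rather than from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0].lower() for description in cursor.description]
    row_hashes = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in cursor
    )
    digest = hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()
    return columns, len(row_hashes), digest


def validate_offload(source, target, table):
    """List the differences found between source and target copies of a table."""
    src_cols, src_count, src_hash = table_fingerprint(source, table)
    tgt_cols, tgt_count, tgt_hash = table_fingerprint(target, table)
    problems = []
    if src_cols != tgt_cols:
        problems.append("schema mismatch")
    if src_count != tgt_count:
        problems.append(f"row count {src_count} != {tgt_count}")
    elif src_hash != tgt_hash:
        problems.append("content mismatch")
    return problems


def make_db(rows):
    """Build a throwaway database with one small table for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ship_mode (sm_ship_mode_sk INTEGER, sm_type TEXT)")
    conn.executemany("INSERT INTO ship_mode VALUES (?, ?)", rows)
    return conn
```

Hashing every row in the client works for a demo; on real volumes the same idea is usually pushed into each engine as an aggregate query so that only the fingerprints are compared.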
21
HOW WE BUILT THE OFFLOAD SOLUTION
Step 1: Identification
Step 2: Schema and Data Migration
Step 3: Logic Migration
Step 4: Data Quality Enhancement
Step 5: Transformed code Execution
Reduced Time! Reduced Risk! Automation!
22
SAMPLE QUERY – AUTO TRANSFORMED
Teradata Query:
INSERT INTO month_wise_ship_agg
select d_month_seq
,substr(w_warehouse_name,1,20) , TRIM(TRAILING '_' FROM sm_type)
,cc_name
,sum(case when (cs_ship_date_sk - cs_sold_date_sk <= 30 ) then 1 else 0 end) as "30 days"
,sum(case when (cs_ship_date_sk - cs_sold_date_sk > 30) and (cs_ship_date_sk - cs_sold_date_sk <= 60) then 1 else 0 end ) as "31-60 days"
from catalog_sales ,warehouse ,ship_mode ,call_center C,date_dim
where cs_ship_date_sk = d_date_sk
and EXTRACT (YEAR FROM d_date) IN (2000,2001)
and DAYNUMBER_OF_MONTH(d_date) > 1
and TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE
and cs_warehouse_sk = w_warehouse_sk
and cs_ship_mode_sk = sm_ship_mode_sk
and cs_call_center_sk = cc_call_center_sk
and (C.cust_id , C.address) LIKE ANY ( SELECT C1.cust_id, C1.address FROM Customer C1 )
group by substr(w_warehouse_name,1,20) ,sm_type ,cc_name ,d_month_seq ;
Hive Query:
INSERT INTO TABLE month_wise_ship_agg
SELECT d_month_seq, SUBSTR( w_warehouse_name , 1 , 20 ) AS auto_c01,
sm_type, UDF_TRIM('TRAILING ','_' , sm_type) ,
cc_name, SUM( CASE WHEN ( cs_ship_date_sk - cs_sold_date_sk <= 30) THEN 1 ELSE 0 END ) AS 30_days,
SUM( CASE WHEN ( cs_ship_date_sk - cs_sold_date_sk > 30) AND ( cs_ship_date_sk - cs_sold_date_sk <= 60) THEN 1 ELSE 0 END ) AS 31_60_days
FROM catalog_sales, warehouse, ship_mode, call_center C, date_dim
WHERE cs_ship_date_sk = d_date_sk
AND EXTRACT ('YEAR', d_date) IN (2000,2001)
AND DAYNUMBER_OF_MONTH(d_date) > 1
AND TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE()
AND cs_warehouse_sk = w_warehouse_sk
AND cs_ship_mode_sk = sm_ship_mode_sk
AND cs_call_center_sk = cc_call_center_sk
AND EXISTS ( SELECT * FROM Customer C1 WHERE C.cust_id LIKE C1.cust_id AND C.address LIKE C1.address )
GROUP BY SUBSTR( w_warehouse_name , 1 , 20 ) ,sm_type,cc_name,d_month_seq;
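A few of the rewrites visible in the sample can be imitated with pattern rules. The sketch below is a deliberately simplified illustration using regular expressions. It is not the parser from the offload solution, and a real translator would work on a parsed syntax tree. `UDF_TRIM` and the `EXTRACT('YEAR', ...)` call form are copied from the sample output above; the `RULES` table and `rewrite` function are hypothetical.

```python
import re

# Each rule pairs a pattern in the source dialect with a function that
# builds the replacement text in the target dialect.
RULES = [
    # TRIM(TRAILING '_' FROM col)  ->  UDF_TRIM('TRAILING', '_', col)
    (re.compile(
        r"\bTRIM\s*\(\s*(LEADING|TRAILING|BOTH)\s+('[^']*')\s+FROM\s+([\w.]+)\s*\)",
        re.IGNORECASE),
     lambda m: f"UDF_TRIM('{m.group(1).upper()}', {m.group(2)}, {m.group(3)})"),
    # EXTRACT (YEAR FROM col)  ->  EXTRACT('YEAR', col)
    (re.compile(r"\bEXTRACT\s*\(\s*(\w+)\s+FROM\s+([\w.]+)\s*\)", re.IGNORECASE),
     lambda m: f"EXTRACT('{m.group(1).upper()}', {m.group(2)})"),
    # as "31-60 days"  ->  AS 31_60_days
    (re.compile(r'\bAS\s+"([^"]+)"', re.IGNORECASE),
     lambda m: "AS " + re.sub(r"\W+", "_", m.group(1))),
    # bare CURRENT_DATE  ->  CURRENT_DATE()
    (re.compile(r"\bCURRENT_DATE\b(?!\s*\()", re.IGNORECASE),
     lambda m: "CURRENT_DATE()"),
]


def rewrite(sql):
    """Apply every rule in order and return the rewritten SQL text."""
    for pattern, build_replacement in RULES:
        sql = pattern.sub(build_replacement, sql)
    return sql
```

Constructs such as the `LIKE ANY` subquery, which the sample turns into an `EXISTS` clause, change the shape of the query and are beyond what text substitution can do safely.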
23
OFFLOAD - KEY ADVANTAGES
• Optimize MPP and relational database resources for workloads
• Re-use millions of lines of code and $
• Avoid the learning curve and re-code, re-test cycles
• Seamless integration of downstream/upstream apps and reports
24
SUMMARY
To summarize, the following steps are recommended for creating a logical data warehouse:
• Establish your data strategy
• Identify key components: Hadoop, MPP, Spark, NoSQL etc.
• Segregate your workloads
• Offload to low cost Hadoop where required
• Leverage virtualization for key use cases
• Establish Data Quality, SLA, Semantics and Metadata as key supporting pillars
25
SIGN-OFF QUOTE
“In the end, there can be only 2 types of data warehouses: Logical data warehouse and illogical data warehouse…”
26
Thank you.
Questions??
ajay.shriwastava@impetus.co.in
sachin.ghai@impetus.co.in
27
APPENDIX 1
Product               URL
StreamAnalytix        http://streamanalytix.com/
Kyvos                 http://www.kyvosinsights.com/
Kundera               http://bigdata.impetus.com/open_source_kundera
SQL Offload Solution  http://www.impetus.com/sites/impetus.com/impetus/brochures/ETL_Offloading_Datasheet.pdf
Ankush                http://bigdata.impetus.com/ankush
Jumbune               http://www.jumbune.org/

Logical Data Warehouse: How to Build a Virtualized Data Services Layer

  • 1. 1 Ajay Shriwastava Sachin Ghai ImpetusTechnologies Inc. Logical Data Warehouse: Building a virtual data services layer Hadoop Summit – San Jose – 11 June 2015
  • 4. 4 EVOLUTION OFTHE DATAWAREHOUSE If this slide looks inverted to you, it actually is. Data warehouse as we knew so far has inverted the concepts with emergence of BIG DATA.
  • 5. 5 EVOLUTION OFTHE DATAWAREHOUSE Pre- determin ed input schema Extensive data governance ANSI SQL compliance IT teams ownership Concurren t users Pre- canned BI Low cost storage/ archive Non SQL access Explorator y analysis Machine Learning/ Graph Data discovery Self Service BI/analytics Enterprise Data Warehouse (EDW) Big Data Warehouse (BDW)
  • 6. 6 CO-EXISTENCE OF EDW AND BDW – A REALITY Organizations still on initial phase of Big Data journey. Existing ETL jobs feeding EDW. BI and downstream applications usingANSI SQL for querying data in EDW. Business use cases for big data are emerging.
  • 7. 7 EMERGENCE OF LOGICAL DATAWAREHOUSE Logical Data Warehouse In response to emerging forces like Big Data, the data warehousing practices evolution led to emergence of “Logical Data Warehouse”. Key components include: Repository management Data virtualization Distributed process Auditing statistics and performance evaluation services SLA management Taxonomy/Ontology resolution Metadata management First proposed in May 2009 and published in August 2011 research by Gartner.
  • 8. 8 NEW PARADIGMS DATA LAKE DISTRIBUTED PARALLEL PROCESS VIRTUALIZATION OFFLOAD Repositories continue to be no longer Enterprise Data Warehouse or Data marts – emergence of HDFS as the “data lake” along with NoSQL data stores. Distributed process now have become synonymous with MapReduce on files or DB. With Spark, more distributed operators are becoming common place. Virtualization gaining favor as an access mechanism where transient consolidation is required for a use case. Offload to newer repositories and process engines requiring more accurate science and process now.
  • 9. 9 VIRTUALIZATION – AND ITS MANY DELINEATIONS • simplified, unified, and integrated viewDataVirtualization • is a subset of data virtualization • enhanced with query optimization strategies for specific source Data Federation • involves actual data movement and ‘write’ to a repository rather than just ‘read’ for a transient use caseData Blending
  • 10. 10 BRIDGING THE CHANNEL WITH VIRTUALIZATION Source systems: Relational (Oracle, DB2…); NoSQL (Cassandra, MongoDB…); File Systems (HDFS, GPFS…); MPP (Teradata, Netezza…); Hadoop-based Warehouse (Hive, Tajo…). Target systems: Users (Enterprise Web, Mobile); Applications (Enterprise, ESB…); BI (Reporting, Visualization…); Data Science (Machine Learning, Graph…); Data Management (MDM, Discovery…).
  • 11. 11 WEALTH MANAGEMENT – USE CASE Big data can transform client- and account-centric wealth management into personalized, goal-based wealth management. Unfortunately, the information is spread across many different lines of business that use separate data sources and platforms.
  • 12. 12 GOAL BASED WEALTH MANAGEMENT • From an account- or client-centric view to a household and relationship view. • A comprehensive approach to understanding the long-term financial goals of the client. • Facilitate financial security during life-changing events: marriage, college, job changes, retirement, inter-generational wealth transfer.
  • 13. 13 BUILDING RELATIONSHIPS Views: Household View; Business Network View; Hierarchy View. Collect the data and process it on a common platform (different LOBs, IVR logs/web logs). Discover relationships. Unified view over existing data in client and account systems. Integrate with social data. Implement governance. Implement corporate hierarchy over client data. [Relationship diagram labels: Jerry Mayfield, Paul Robinson, Andrew Madura, Linda Mays, Jack Kline; Root Node; Friends, Golf Buddy, College Alumni, Neighbors; Father, Son, Daughter-1, Daughter-2, Self, Spouse]
  • 14. 14 PERSONALIZED SERVICES CROSS/UP SELL Views: Household View; Business Network View; Hierarchy View. • 401(k) / IRA • 529 • College loans • Student credit cards • Gift cards • Estate planning • Alimonies • Company loans • Asset standing • Business prospects • Corporate accounts • Corporate discounts • Group-based services • Prospects identification • Goal-based services [Chart labels: Assets Distribution; Goals; Actuals; Liabilities]
  • 15. 15 TRADITIONAL DATA INTEGRATION • Data is embedded in silos • Time-consuming and resource-intensive ETL processes • Creates data duplication • Governance is an inhibitor, not an enabler • Inability to handle the V’s of big data
  • 16. 16 LOGICAL DATA WAREHOUSE REFERENCE ARCHITECTURE [Architecture diagram labels: Governance (Security, Audit, Lineage); Monitoring and Cluster Management; HDFS/Hive/Pig; EDW (Teradata/Netezza etc.); OLTP/RDBMS; Producer A; Producer B; Stream Ingestion; Batch Ingestion; Batch Data Ingestion (Sqoop/Talend etc.); Distributed Messaging Layer (Kafka); StreamAnalytix; Spark Streaming; Storm; SQL Offload Solution; EDW Migration; Kyvos Engine; In-memory data layer (Spark); Hadoop Cluster/YARN; Centralized Schema; Big Data Governance; Ankush; Jumbune; Analytics/Virtualization; Kundera; Data Virtualization Layer; Data Access; REST API; Query API (SQL); JDBC; Search API (ES, Solr, NLP); Custom layer for universal connectivity; Proprietary connectors/ODBC drivers; BI tools (MicroStrategy, Tableau, Kyvos…); MLlib; R; Mahout; Algorithms; Data Quality. Legend: Impetus Offerings (details: Appendix 1); Recommended Platform; Platform Requirements]
  • 17. 17 UNIFIED VIEW - ADVANTAGES • Fast, real-time data integration without creating expensive copies of data. • Significant savings in the time and resources required for ETL. • Facility to create a final composite schema. • Information management capability. • Meet stringent service level agreements.
  • 18. 18 OFFLOAD Offload cold data and exploratory analysis workloads to a Hadoop cluster running on commodity hardware to save cost and resources.
  • 19. 19 OFFLOAD CONCEPT An Enterprise Data Warehouse (EDW) to Big Data Warehouse (BDW) offload essentially involves table and code migration. [Diagram labels: EDW: Tables, SQL Scripts, Static SQL Procedures/PL-SQL, Proprietary script (roadmap); SQL Script Parser; BDW: Hive Queries, Java code with Hive Queries; Run Hive queries on Hadoop]
  • 20. 20 KEY CHALLENGES IN OFFLOAD Varied input sources; validating complete schema and data offload; ANSI SQL incompatibility; unavailability of user-defined functions in the target system; lack of a unified view and UI; missing data quality checks.
  • 21. 21 HOW WE BUILT THE OFFLOAD SOLUTION Step 1: Identification. Step 2: Schema and Data Migration. Step 3: Logic Migration. Step 4: Data Quality Enhancement. Step 5: Transformed Code Execution. Reduced time! Reduced risk! Automation!
  • 22. 22 SAMPLE QUERY – AUTO TRANSFORMED Teradata Query: INSERT INTO month_wise_ship_agg select d_month_seq ,substr(w_warehouse_name,1,20) , TRIM(TRAILING '_' FROM sm_type) ,cc_name ,sum(case when (cs_ship_date_sk - cs_sold_date_sk <= 30 ) then 1 else 0 end) as "30 days" ,sum(case when (cs_ship_date_sk - cs_sold_date_sk > 30) and (cs_ship_date_sk - cs_sold_date_sk <= 60) then 1 else 0 end ) as "31-60 days" from catalog_sales ,warehouse ,ship_mode ,call_center C,date_dim where cs_ship_date_sk = d_date_sk and EXTRACT (YEAR FROM d_date) IN (2000,2001) and DAYNUMBER_OF_MONTH(d_date) > 1 and TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE and cs_warehouse_sk = w_warehouse_sk and cs_ship_mode_sk = sm_ship_mode_sk and cs_call_center_sk = cc_call_center_sk and (C.cust_id , C.address) LIKE ANY ( SELECT C1.cust_id, C1.address FROM Customer C1 ) group by substr(w_warehouse_name,1,20) ,sm_type ,cc_name ,d_month_seq ; Hive Query: INSERT INTO TABLE month_wise_ship_agg SELECT d_month_seq,SUBSTR( w_warehouse_name , 1 , 20 ) AS auto_c01, sm_type, UDF_TRIM('TRAILING ','_' , sm_type) , cc_name,SUM( CASE WHEN ( cs_ship_date_sk - cs_sold_date_sk <= 30) THEN 1 ELSE 0 END ) AS 30_days, SUM( CASE WHEN ( cs_ship_date_sk - cs_sold_date_sk > 30) AND ( cs_ship_date_sk - cs_sold_date_sk <= 60) THEN 1 ELSE 0 END ) AS 31_60_days FROM catalog_sales, warehouse, ship_mode, call_center C, date_dim WHERE cs_ship_date_sk = d_date_sk AND EXTRACT ('YEAR', d_date) IN (2000,2001) AND DAYNUMBER_OF_MONTH(d_date) > 1 AND TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE() AND cs_warehouse_sk = w_warehouse_sk AND cs_ship_mode_sk = sm_ship_mode_sk AND cs_call_center_sk = cc_call_center_sk AND EXISTS ( SELECT * FROM Customer C1 WHERE C.cust_id LIKE C1.cust_id AND C.address LIKE C1.address ) GROUP BY SUBSTR( w_warehouse_name , 1 , 20 ) ,sm_type,cc_name,d_month_seq;
  • 23. 23 OFFLOAD - KEY ADVANTAGES Optimize MPP and relational database resources for workloads; re-use millions of lines of code and $; avoid the learning curve and re-code, re-test cycles; seamless integration of downstream/upstream apps and reports.
  • 24. 24 SUMMARY To summarize, the following steps are recommended for creating a logical data warehouse: Establish your data strategy. Identify key components: Hadoop, MPP, Spark, NoSQL, etc. Segregate your workloads. Offload to low-cost Hadoop where required. Leverage virtualization for key use cases. Establish data quality, SLA, semantics, and metadata as key supporting pillars.
  • 25. 25 SIGN-OFF QUOTE “In the end, there can be only 2 types of data warehouses: logical data warehouse and illogical data warehouse…”
  • 27. 27 APPENDIX 1 Product and URL: StreamAnalytix: http://streamanalytix.com/; Kyvos: http://www.kyvosinsights.com/; Kundera: http://bigdata.impetus.com/open_source_kundera; SQL Offload Solution: http://www.impetus.com/sites/impetus.com/impetus/brochures/ETL_Offloading_Datasheet.pdf; Ankush: http://bigdata.impetus.com/ankush; Jumbune: http://www.jumbune.org/
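
The deck does not show how the SQL Script Parser on slide 19 works internally. As a rough illustration of the kind of rewrite visible in the slide 22 sample (TRIM to UDF_TRIM, EXTRACT with a quoted field, CURRENT_DATE with parentheses, quoted aliases turned into plain identifiers), here is a minimal pattern-based sketch in Python. The function name `teradata_to_hive` and the regular expressions are assumptions made for this example; a real offload tool would more likely use a full SQL parser, and the output spacing differs slightly from the slide.

```python
import re


def teradata_to_hive(sql: str) -> str:
    """Apply a few text rewrites modelled on the slide 22 example.

    Hypothetical helper: it only covers the four patterns below and
    does not handle nested parentheses inside the rewritten calls.
    """
    # TRIM(TRAILING '_' FROM col) -> UDF_TRIM('TRAILING', '_', col)
    sql = re.sub(
        r"\bTRIM\s*\(\s*(LEADING|TRAILING|BOTH)\s+('[^']*')\s+FROM\s+([^()]+?)\s*\)",
        lambda m: f"UDF_TRIM('{m.group(1).upper()}', {m.group(2)}, {m.group(3)})",
        sql,
        flags=re.IGNORECASE,
    )
    # EXTRACT (YEAR FROM col) -> EXTRACT('YEAR', col)
    sql = re.sub(
        r"\bEXTRACT\s*\(\s*(\w+)\s+FROM\s+([^()]+?)\s*\)",
        lambda m: f"EXTRACT('{m.group(1).upper()}', {m.group(2)})",
        sql,
        flags=re.IGNORECASE,
    )
    # CURRENT_DATE -> CURRENT_DATE() (skipped if parentheses already follow)
    sql = re.sub(
        r"\bCURRENT_DATE\b(?!\s*\()", "CURRENT_DATE()", sql, flags=re.IGNORECASE
    )
    # AS "31-60 days" -> AS 31_60_days (non-word characters become underscores)
    sql = re.sub(
        r'\bAS\s+"([^"]+)"',
        lambda m: "AS " + re.sub(r"\W+", "_", m.group(1)),
        sql,
        flags=re.IGNORECASE,
    )
    return sql


if __name__ == "__main__":
    print(teradata_to_hive(
        "select TRIM(TRAILING '_' FROM sm_type), sum(x) as \"31-60 days\" "
        "from t where EXTRACT (YEAR FROM d_date) IN (2000,2001) "
        "and d_date <= CURRENT_DATE"
    ))
```

Each rule is written so that running the function on its own output changes nothing, which makes it safe to re-run a script through the same step. Functions such as DAYNUMBER_OF_MONTH and TD_QUARTER_BEGIN pass through untouched, as they do on the slide.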

Editor's Notes

  1. Add list of implemented functions
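
The transcript does not include the list of implemented functions that this note asks for. As a stand-in, the sketch below shows one way such an inventory could be derived from a query: it collects the function names a script calls and labels them with the handling visible on slide 22. The table `SLIDE_22_HANDLING`, the keyword set, and the name `function_inventory` are all hypothetical and cover only what that one slide shows, not the actual offload solution.

```python
import re

# Handling observed in the slide 22 example only; the labels are this sketch's own wording.
SLIDE_22_HANDLING = {
    "SUBSTR": "kept as-is",
    "SUM": "kept as-is",
    "TRIM": "rewritten to UDF_TRIM",
    "EXTRACT": "arguments rewritten",
    "DAYNUMBER_OF_MONTH": "name unchanged in the Hive query",
    "TD_QUARTER_BEGIN": "name unchanged in the Hive query",
}

# SQL keywords that can precede "(" without being function calls.
SQL_KEYWORDS = {
    "IN", "AND", "OR", "NOT", "ANY", "ALL", "EXISTS", "WHEN",
    "FROM", "WHERE", "ON", "AS", "SELECT", "VALUES", "BY",
}


def function_inventory(sql: str):
    """Return (known, unknown) function names called in the given SQL text."""
    # Any identifier directly followed by "(" is treated as a call candidate.
    names = {
        m.group(1).upper()
        for m in re.finditer(r"\b([A-Za-z_]\w*)\s*\(", sql)
    } - SQL_KEYWORDS
    known = {n: SLIDE_22_HANDLING[n] for n in sorted(names) if n in SLIDE_22_HANDLING}
    unknown = sorted(n for n in names if n not in SLIDE_22_HANDLING)
    return known, unknown


if __name__ == "__main__":
    known, unknown = function_inventory(
        "select TRIM(TRAILING '_' FROM a), FOO(b) from t "
        "where EXTRACT (YEAR FROM d) IN (2000) and (x > 1)"
    )
    print(known)
    print(unknown)
```

The `unknown` list is the useful part for the challenge named on slide 20: any function that lands there would need either a native equivalent or a user-defined function in the target system before the script can be offloaded.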