Data
Warehousing
A Look Back, Moving Forward
Dale Sanders
June 2005
2
Introduction & Warnings
 Why am I here?
 Teach
 Stimulate some thought
 Share some of my experiences and lessons
 Learn
 From you, please…
 Ask questions, challenge opinions, share your knowledge
 I’ll do my best to live up to my end of the bargain
 Warnings
 The pictures in this presentation
 May or may not have any relevance
whatsoever to the topic or slide
 Mostly intended to break up the monotony
3
Expectation Management
 DW Strengths (according to others)
 I know what not to do as much as I know what to do
 Seen and made all the big mistakes
 Vision, strategy, system architecture, data management &
DW modeling, complex cultural issues, “leapfrog” problem
solving
 What not to expect: DW weaknesses
 My programming skills suck
 Haven’t written a decent line of code in four years!
 Some might say it’s been 24 years… :)
 Knowledge of leading products is very rusty
 Though I’m beefing up on Microsoft and Cognos
Within these expectations, make no mistake about it… I know data warehousing :)
4
Today’s Discussions
 I am a good “Idea Guy”
 But, ideas are worthless without someone to implement and
enhance them
 Steve Barlow, Dan Lidgard, Jon Despain, Chuck Lyon, Laure
Shull, Kris Mitchell, Peter Hess, Ron Gault, Rob Carpenter, my
wife, and many others
 My greatest strength and blessing
 The ability to recognize, listen to, and hold
onto good people
 Knock on wood
 My achievements in personal and professional life
 More a function of those around me than a reflection on me
5
DW Best Practices:
The Most Important Metrics
 Employee satisfaction
 Without it, long-term customer satisfaction is impossible
 Customer satisfaction
 That’s the nature of the Information Services career field
 Some people in our profession still don’t get it
 We are here to serve
 The Organizational Laugh Metric
 How many times do you hear laughter in the day-to-day
operations of your team?
 It is the single most important vital sign to organizational health
and business success
6
My Background
 Three eight-year chapters
 Captain, Information Systems Engineer, US Air Force
 Nuclear warfare battle management
 Force status data integration
 Intelligence and attack warning data “fusion”
 Consultant in several industries
 TRW
 CIA Data Center
 TRW Credit Reporting Data Base
 National Security Agency (NSA)
 Intel: New Mexico Data Repository (NMDR)
 Air Force
 Integrated Minuteman Data Base (IMDB)
 Peacekeeper Information Retrieval System (PIRS)
 Many others…
 Healthcare
 Intermountain Health Care Enterprise Data Warehouse
 Consultant to other healthcare organizations’ data warehouses
 Now at Northwestern University Medical System
7
Overview
 Data warehousing history
 According to Sanders
 Why and how did this become a sub-specialty in information
systems?
 What have we learned so far?
 My take on “Best Practices”
 Key lessons-learned
 My thoughts on the most popular authors in the field
 What they contribute, where they detract
8
Data Warehousing History
“Newspaper Rock”
100 B.C.
American Retail
2005 A.D.
Lots of stuff happened
9
What Happened in the Cloud?
 Stage 1: Laziness
 Operators grew tired of hanging tapes
 In response to requests for historical financial data
 They stored data on-line, in “unauthorized” mainframe databases
 Stage 2: End of the mainframe bully
 Computing moved out from finance to the rest of the business
 Unix and relational databases
 Distributed computing created islands of information
 Stage 2.1: The government gets involved
 Consolidating IRS and military databases to save money on mainframes
 “Hey, look what I can do with this data…”
 Stage 3: Deming comes along
 Push towards constant business “reengineering”
 Cultural emphasis on “continuous quality improvement” and “business
innovation” drives the need for data
 Stage 4: Data warehousing has its own language
 Ralph Kimball publishes “The Data Warehouse Toolkit”
10
The Real Truth
 Data warehousing is a symptom of a
problem
 Technological inability to deploy single-platform
information systems that:
 Capture data once and reuse it throughout an
enterprise
 Support high transaction rates (single record
INSERT, SELECT, UPDATE, DELETE) and analytic queries
on the same computing platform, with the same
data, at the same time
 Someday, maybe we will address the root cause
 Until then, it’s a good way to make a living
11
The “Ideal Library” Practice
 Stores all of the books and other reference material you need to
conduct your research
 The Enterprise data warehouse
 A single place to visit
 One database environment
 Contents are kept current and refreshed
 Timely, well choreographed data loads
 Staffed with friendly, knowledgeable people that can help you
find your way around
 Your Data Warehouse team
 Organized for easy navigation and use
 Metadata
 Data models
 “User friendly” naming conventions
12
Cultural Detractors
The two biggies…
 The business supported by the data
warehouse must be motivated by a desire
for constant improvement and fact-based
decision making
 The data warehouse team falls victim to the
“Politics of Data”
 Through naivety
 Through misguided motives, themselves
13
Business Culture
 Does your CEO…
 Talk about constant improvement, constantly?
 Drive corporate goals that are SMART?
 Specific, Measurable, Attainable, Realistic,
Tangible
 Crave data to make better informed decisions?
 Become visibly, buoyantly excited at a demo for
a data cube?
 If so, the success of your data warehouse
is right around the corner… sort of…
I love data!
14
Political Best Practices
 You will be called a “data thief”
 Get used to it
 Encourage life cycle ownership of the OLTP
data, even in the EDW
 You will be called “dangerous”
 “You don’t understand our data!”
 OLTP owners know their data better than
you do– acknowledge it and leverage it
 You will be blamed for poor data quality in the OLTP systems
 This is a natural reaction
 Data warehouses raise the visibility of poor data quality
 Use the EDW as a tool for raising overall data quality
 You will be called a “job robber”
 EDW is perceived as a replacement for OLTP systems
 Educate people: The EDW depends on OLTP systems for its existence
 Stick to your values and pure motives
 The politics will fade away
15
Data Quality
 Pitfall
 Taking accountability for data quality on the source system
 Spending gobs of time and money “cleansing” data before it’s loaded into
the DW
 It’s a never ending, never win battle
 You will always be one step behind data quality
 You will always be in the cross-hairs of
blame
 Best Practice
 Push accountability where it belongs– to the
source system
 Use the data warehouse as a tool to reveal
data quality, either good or bad
 Be prepared to weather the initial storm of
blame
16
Measuring Data Quality
 Data Quality = Completeness x Validity
 Can it be measured objectively?
 Measuring “Completeness”
 Number of null values in a column
 Measuring “Validity”
 Cardinality is a simple way to measure validity
 “We only have four standard regions in the business, but we
have 18 distinct values in the region column.”
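A minimal sketch of the two measurements above, assuming a column is available as a plain Python list and that the business can name its allowed values. The function names and the validity ratio (share of non-null values found in the allowed set) are illustrative assumptions, not the method used on any particular system; the distinct-value count is the cardinality check described on the slide.

```python
from typing import Iterable, Optional, Set


def completeness(values: Iterable[Optional[str]]) -> float:
    """Fraction of values in the column that are not null (None)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def validity(values: Iterable[Optional[str]], allowed: Set[str]) -> float:
    """Fraction of non-null values that belong to the allowed set."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(v in allowed for v in present) / len(present)


def distinct_count(values: Iterable[Optional[str]]) -> int:
    """Cardinality of the column, ignoring nulls."""
    return len({v for v in values if v is not None})


def data_quality(values: Iterable[Optional[str]], allowed: Set[str]) -> float:
    """Data Quality = Completeness x Validity."""
    values = list(values)
    return completeness(values) * validity(values, allowed)


# Hypothetical region column: four standard regions, but six distinct values.
STANDARD_REGIONS = {"NORTH", "SOUTH", "EAST", "WEST"}
region_column = ["NORTH", "SOUTH", "EAST", "WEST", "Nrth", None, "WEST", "S."]
```

A distinct count larger than the number of standard values is the quick warning sign; the ratio only adds a number that can be tracked over time.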
17
Business Validity
 How can you measure it? You can’t…
 “I collect this data from our customers, but I have to guess
sometimes because I don’t speak Spanish.”
 “This data is valid for trend analysis decisions before
9/11/2001, but should not be used after that date, due to
changes in security procedures.”
 “You can’t use insurance billing and reimbursement data to
make clinical, patient care decisions.”
 “This customer purchased four copies of ‘Zamfir, Master of
the Pan Flute’, therefore he loves everything about Zamfir.”
 What Amazon didn’t know: I bought them for my mom and her
sewing circle.
Where do you capture subjective data quality? Metadata….
18
The Importance of Metadata
 Maybe the most over-hyped, underserved area of
data warehousing common sense
 Vendors want to charge you big $$$$$ for their tools
 Consultants would like you to think that it’s the Holy Grail in
disguise and only they can help you find it
 Authors who have never been in an operational environment
would have you chasing your tail in pursuit of an esoteric,
mythological Metadata Nirvana
 Don’t listen to the confusing messages! You know the
answer… just listen to your common sense…
19
Metadata: Keep It Simple!
 Ultimately, what are the most valuable business
motives behind metadata?
 Make data more “understandable” to those who are
not familiar with it
 Data quality issues
 Data timeliness and temporal issues
 Context in which it was collected
 Translating physical names to natural language
 Make data more “findable” to those who don’t know
where it is
 Organize it
 Take a lesson from library science and the card
catalog
20
Table Elements
Required Elements
 Long Name (or English name)
 Description
Semi-optional Elements
 Source
 Example
 Data Steward
21
Column Elements
Required Elements
 Long Name
 Description
Optional Elements
 Value Range
 Data Quality
 Associated Lookup
22
The Data Model
TABLE_ENT
TABLE_ENT_ID: NUMBER
TABLE_ENT_DESC: VARCHAR2(4000)
TABLE_ENT_SRC: VARCHAR2(50)
TABLE_ENT_NAME: VARCHAR2(50)
TABLE_TYPE: VARCHAR2(10)
CREATE_DT: DATE
LAST_LOAD_DT: DATE
SCHEMA_ID: NUMBER
DATA_MART
DATA_MART_ID: NUMBER
DATA_MART_NAME: VARCHAR2(50)
DATA_MART_DESC: VARCHAR2(4000)
DATA_STEWARD: VARCHAR2(50)
LAST_LOAD_DT: DATE
UPDATE_FREQ: VARCHAR2(50)
DATA_BEG_DT: DATE
DATA_END_DT: DATE
DATA_MART_TABLE_ENT
DATA_MART_ID: NUMBER
TABLE_ENT_ID: NUMBER
FOLDER
FOLDER_ID: NUMBER
PARENT_FOLDER_ID: NUMBER
FOLDER_NM: VARCHAR2(50)
FOLDER_DSC: VARCHAR2(4000)
CREATE_USER_ID: VARCHAR2(20)
CREATE_DT: DATE
REPORT
RPT_ID: NUMBER
FOLDER_ID: NUMBER
RPT_NM: VARCHAR2(250)
RPT_LOC_TXT: VARCHAR2(1000)
PURPOSE_TXT: VARCHAR2(4000)
RUN_FREQ_TXT: VARCHAR2(1000)
AUDIENCE_TXT: VARCHAR2(500)
EDW_RPT_FLG: NUMBER
DATA_SOURCE_TXT: VARCHAR2(4000)
SELECT_CRITERIA_TXT: VARCHAR2(4000)
STAT_METHODS_TXT: VARCHAR2(4000)
RPT_TOOL_TXT: VARCHAR2(250)
CODE_TXT: CLOB
FORMULA_TXT: CLOB
COMMENTARY_TXT: VARCHAR2(4000)
AUTHOR_NM: VARCHAR2(500)
AUTHOR_TITLE_TXT: VARCHAR2(500)
AUTHOR_DEPT_TXT: VARCHAR2(500)
AUTHOR_LOC_TXT: VARCHAR2(500)
AUTHOR_PHONE_TXT: VARCHAR2(500)
AUTHOR_EMAIL_TXT: VARCHAR2(500)
BUSINESS_OWNER_TXT: VARCHAR2(500)
METADATA_UPDATE_DT: DATE
VALIDATION_DT: DATE
CREATE_USER_ID: VARCHAR2(20)
CREATE_DT: DATE
REPORT_TABLE_ENT_ASSOC
RPT_ID: NUMBER
TABLE_ENT_ID: NUMBER
ATTRIBUTE
ATTRIBUTE_ID: NUMBER
TABLE_ENT_ID: NUMBER
ATTRIBUTE_DESC: VARCHAR2(4000)
ATTRIBUTE_NAME: VARCHAR2(50)
ATTRIBUTE_DATATYPE: VARCHAR2(50)
SAMPLE_VALUE: VARCHAR2(100)
INDEX_FLG: NUMBER
PRIMARY_KEY_FLG: NUMBER
TABLE_POSITION_NO: NUMBER
SCHEMA
SCHEMA_ID: NUMBER
SCHEMA_DESC: VARCHAR2(50)
23
Example Metadata Entry
LKUP.POSTAL_CD_MASTER Table
Long Name:
Postal Code Master - IHC
Description:
Contains Postal (Zip) codes for the IHC referral region and
IHC specific descriptions. These descriptions allow for
specific IHC groupings used in various analyses.
Data Steward:
Jim Allred, ext. 3518
24
Metadata on the Web
25
Some Info Is Free
It can be collected from the database.
For example:
 Primary and Foreign Keys
 Indexed Columns
 Table Creation Dates
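As a rough illustration of "free" metadata, the sketch below reads keys and indexed columns straight from the database catalog. It uses Python's built-in SQLite as a stand-in (an assumption; the warehouse described here runs on Oracle, whose catalog views are different), and the function name is hypothetical. SQLite does not record table creation dates, so that item is not covered.

```python
import sqlite3


def collect_free_metadata(conn: sqlite3.Connection, table: str) -> dict:
    """Collect key and index facts for one table from the SQLite catalog."""
    # pragma_table_info columns: cid, name, type, notnull, dflt_value, pk
    primary_key = [
        row[1]
        for row in conn.execute("SELECT * FROM pragma_table_info(?)", (table,))
        if row[5] > 0
    ]
    # pragma_foreign_key_list columns: id, seq, table, from, to, ...
    foreign_keys = [
        (row[3], row[2], row[4])
        for row in conn.execute(
            "SELECT * FROM pragma_foreign_key_list(?)", (table,)
        )
    ]
    # pragma_index_list columns: seq, name, unique, origin, partial
    index_names = [
        row[1]
        for row in conn.execute("SELECT * FROM pragma_index_list(?)", (table,))
    ]
    indexed_columns = set()
    for index_name in index_names:
        # pragma_index_info columns: seqno, cid, name
        for row in conn.execute(
            "SELECT * FROM pragma_index_info(?)", (index_name,)
        ):
            if row[2] is not None:
                indexed_columns.add(row[2])
    return {
        "primary_key": primary_key,
        "foreign_keys": foreign_keys,
        "indexed_columns": indexed_columns,
    }


# Hypothetical two-table schema used only to exercise the function.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(
    """
    CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_nm TEXT);
    CREATE TABLE sale (
        sale_id INTEGER PRIMARY KEY,
        region_id INTEGER REFERENCES region (region_id),
        sale_dt TEXT
    );
    CREATE INDEX sale_dt_ix ON sale (sale_dt);
    """
)
sale_metadata = collect_free_metadata(demo_conn, "sale")
```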
26
Most Valuable Info is Subjective
The human element
 Most metadata is not automatically
collected by tools because it does NOT
exist in that form
 Interviews with data stewards are the key
 It can take months (and months and
months) of effort to collect initial
metadata.
27
Holding Feet to the Fire
 Made data architects
responsible for metadata
in their subject areas
 Metadata completion
reports in every staff
meeting for a year
 Standing rule: No new
data marts go live
without metadata
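A metadata completion report can be very simple. The sketch below assumes each table entry carries the two required elements named earlier (Long Name and Description) plus a subject area; the class and function names are hypothetical, and the go-live check is one possible reading of the standing rule.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class TableMetadata:
    subject_area: str
    table_name: str
    long_name: str = ""
    description: str = ""

    def is_complete(self) -> bool:
        """Both required elements must be filled in."""
        return bool(self.long_name.strip()) and bool(self.description.strip())


def completion_by_subject_area(entries: Iterable[TableMetadata]) -> Dict[str, float]:
    """Fraction of tables with complete metadata, per subject area."""
    totals: Dict[str, int] = {}
    done: Dict[str, int] = {}
    for entry in entries:
        totals[entry.subject_area] = totals.get(entry.subject_area, 0) + 1
        if entry.is_complete():
            done[entry.subject_area] = done.get(entry.subject_area, 0) + 1
    return {area: done.get(area, 0) / count for area, count in totals.items()}


def ready_for_go_live(entries: Iterable[TableMetadata]) -> bool:
    """Standing rule: every table in the data mart has its metadata."""
    return all(entry.is_complete() for entry in entries)


# Hypothetical entries for two subject areas.
catalog = [
    TableMetadata("Oncology", "TUMOR", "Tumor Registry", "One row per tumor."),
    TableMetadata("Oncology", "TREATMENT", "Treatment", ""),
    TableMetadata("Lookup", "POSTAL_CD_MASTER", "Postal Code Master", "Zip codes."),
]
```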
28
Is it all worth it?
Data analysts think so.
“I couldn’t do my job without it.”
It will push the ROI of a ho-hum DW into the
stratosphere
It does for DW’ing what the Yellow Pages did for
the business ROI of the telephone
29
It Gets Used
At Intermountain Health Care
 210 web hits on average each weekday
 (23,000 employees, $2B revenue)
Avg Hits by Day of Week (April 2004 - Sep 2004)
MON 189, TUE 217, WED 212, THU 240, FRI 188
30
“What’s New”
31
Report Quality
 A function of…
 Data quality
 How well does the report reflect the intent behind the question being
asked?
 “This report doesn’t make sense. I’m trying to find out how many
widgets we can produce next year, based on the last four years’
production.”
 “That’s not what you asked for.”
 SQL and other programming accuracy
 Statistical validity– population size of the data
 Timeliness of the data relative to the decision
 Event Correlation
 Best Practice:
 An accompanying “meta-report” for every report that involves
significant, high risk decisions
32
Meta Report
A document, associated with a
published report, which defines the
report.
33
Repository
A central place for storing and sharing
information about business reports
34
IHC Analyst Use of Meta Reports
[Bar chart, data collected Aug-04, N=32. Categories: Read Others, Search Duplication, Search SQL, Audience Request. Values shown: 37%, 89%, 21%, 95%.]
35
Meta Report
 Core Elements
 Author Information
 Report Name
 Report Purpose
 Data Source(s)
 Report Methods
 Recommended Elements
 Business Owner
 Run Frequency
 Intended Audience
 Statistical Tests
 Software Used
 Source Code
 Formulas
 Relevant Issues &
Commentary
36
• Title
• Location
• Author
• Owner
37
• Purpose
• Frequency
• Audience
• Data Source(s)
38
• Selection Criteria
• Statistics
• Software
• Source Code
• Formulas
39
What’s
It Look
Like?
40
41
Utilization and Creation Rate
Error
42
Operations Best Practices
Think: Mission Control
 Customized ETL Library
 Schedule of operations
 Alerting tool
 Storage strategies / backups
 Development philosophy and environment
 Performance: monitoring and tuning
43
IHC Architecture
 EDW
 Oracle v 9.2.0.3 on AIX 5.2
 Storage: IBM SAN (Shark), >3 TB
 ETL tools
 Ascential’s DataStage
 Korn shell (Unix), SQL scripts, PL/SQL
scripting
 OLAP: Microsoft Analysis Services
 BI: Business Objects (Crystal Enterprise)
 With a Cube presentation layer
 Dashboard: Visual Mining’s NetCharts
 EDW Team: ~16 FTEs, plus SAs and DBAs
44
Customized
ETL
Library
45
History
 One of our ETL programmers noticed he kept
doing the same things over and over for all of
his ETL jobs. Rather than copying and
pasting this repetitive code, he created a
library. Now we all use the ETL Library.
 We named the library EDW_UTIL (EDW
Utilities)
46
Implementation
 Executes via Oracle stored procedures
 Supported by associated tables to hold data
when necessary
 Error table
 QA table
 Index table
47
Benefits
 Provides standardization
 Eliminates code rewrites
 Can hide complexities
 Such as the appropriate way to
analyze and gather statistics
on tables
 Very accessible to all of
our ETL tools
 Simply an Oracle stored
procedure call
48
Index Management
 Past process included:
 Dropping the table’s indexes with a script
 Loading the table
 Creating the indexes with a script
 The past process resulted in messy
scripts to manage and
coordinate
49
Index Management
 New process includes:
 Capturing a table’s existing index metadata
 Dropping the table’s indexes with a single procedure call
 Loading the table
 Recreating the indexes with a single
procedure call
 There are no more messy scripts to
manage and coordinate
 No more “lost” indexes that someone
neglected to add to the create-index script
50
Index Management
 Samples
 IMPORT_SCHEMA_INDEX_DATA
 IMPORT_TABLE_INDEX_DATA
 DROP_TABLE_INDEXES
 CREATE_TABLE_INDEXES
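The capture, drop, load, recreate cycle can be sketched as follows. This is not the EDW_UTIL code, which is an Oracle stored-procedure library; it is a stand-in using Python's built-in SQLite, where the index definitions are read from the `sqlite_master` catalog. The function names loosely mirror the sample procedure names and are otherwise hypothetical.

```python
import sqlite3
from typing import List, Tuple

IndexDef = Tuple[str, str]  # (index name, CREATE INDEX statement)


def capture_table_indexes(conn: sqlite3.Connection, table: str) -> List[IndexDef]:
    """Read the definitions of a table's explicitly created indexes."""
    return conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? AND sql IS NOT NULL "
        "ORDER BY name",
        (table,),
    ).fetchall()


def drop_table_indexes(conn: sqlite3.Connection, captured: List[IndexDef]) -> None:
    """Drop every captured index before a bulk load."""
    for name, _ddl in captured:
        conn.execute(f'DROP INDEX "{name}"')


def create_table_indexes(conn: sqlite3.Connection, captured: List[IndexDef]) -> None:
    """Recreate the indexes from the captured definitions after the load."""
    for _name, ddl in captured:
        conn.execute(ddl)


def load_with_index_management(conn, table, insert_sql, rows) -> None:
    """Capture, drop, load, recreate: nothing is maintained by hand."""
    captured = capture_table_indexes(conn, table)
    drop_table_indexes(conn, captured)
    conn.executemany(insert_sql, rows)
    create_table_indexes(conn, captured)
    conn.commit()
```

Because the index list is captured at load time, an index added last week by someone else is recreated without anyone having to edit a script.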
51
Background Loading of Tables
 We often load data into tables which are not
accessible to end users. A simple rename puts them
into production.
 Helps transfer the identical attributes from the live to
the background table
 Samples
 COPY_TABLE_METADATA
 TRANSFER_TABLE_PRIVS
 DROP_TABLE_INDEXES
 CREATE_TABLE_INDEXES
(Create on background table, identical to production table)
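A sketch of the rename step, again using SQLite as a stand-in for Oracle. The function name and the `_old` suffix are assumptions; copying privileges and index definitions to the background table (the job of the procedures listed above) is not shown.

```python
import sqlite3


def swap_into_production(conn: sqlite3.Connection, live: str, background: str) -> None:
    """Put a freshly loaded background table into production by renaming."""
    retired = f"{live}_old"
    # Move the current production table out of the way.
    conn.execute(f'ALTER TABLE "{live}" RENAME TO "{retired}"')
    # The background table takes over the production name.
    conn.execute(f'ALTER TABLE "{background}" RENAME TO "{live}"')
    # The previous copy is no longer needed.
    conn.execute(f'DROP TABLE "{retired}"')
    conn.commit()
```

End users only ever see a complete table: either the old copy or the new one.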
52
Load Times, Errors, QA
 We had no idea who was loading what and when
 Each staff member logged in their own way and for their
own interest
 ETL error capturing and QA were difficult
 We can now capture errors and QA information in a
somewhat standardized fashion
53
Load Times, Errors, QA
Samples
 BEGIN_JOB_TIME
 (ex: CASEMIX)
 BEGIN_LOAD_TIME
 (ex: CASEMIX INDEX)
 END_LOAD_TIME
 END_JOB_TIME
 COMPLETE_LOAD_TIME
(Begin and end together)
 LOAD_TIME_ERROR
(Alert on these errors)
 LOAD_TIME_METRICS
QA (row counts)
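One way to picture the begin/end calls is a small wrapper that records start time, end time, row count, and any error for each load. The table layout, column names, and the `timed_load` helper below are assumptions made for illustration; the real procedures are Oracle stored procedures whose signatures are not shown in these slides.

```python
import sqlite3
from contextlib import contextmanager
from datetime import datetime, timezone


def init_load_log(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) load-time table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_time ("
        "load_id INTEGER PRIMARY KEY, job_nm TEXT, load_nm TEXT, "
        "begin_ts TEXT, end_ts TEXT, row_cnt INTEGER, status TEXT, error_txt TEXT)"
    )


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@contextmanager
def timed_load(conn: sqlite3.Connection, job_nm: str, load_nm: str):
    """Log the begin and end of one load; record errors instead of losing them."""
    cur = conn.execute(
        "INSERT INTO load_time (job_nm, load_nm, begin_ts, status) "
        "VALUES (?, ?, ?, 'RUNNING')",
        (job_nm, load_nm, _now()),
    )
    load_id = cur.lastrowid
    result = {"row_cnt": None}  # the caller fills in the QA row count
    try:
        yield result
    except Exception as exc:
        conn.execute(
            "UPDATE load_time SET end_ts = ?, status = 'ERROR', error_txt = ? "
            "WHERE load_id = ?",
            (_now(), str(exc), load_id),
        )
        conn.commit()
        raise
    else:
        conn.execute(
            "UPDATE load_time SET end_ts = ?, status = 'COMPLETE', row_cnt = ? "
            "WHERE load_id = ?",
            (_now(), result["row_cnt"], load_id),
        )
        conn.commit()
```

An alert on rows with status 'ERROR' then gives management by exception without anyone reading logs.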
54
Miscellaneous Procedures
 Hide the “gory” details from the majority
of the EDW team
 Such as Oracle’s table analyze command
 Gives us consistent application of
system wide parameters such as:
 A new box with a different number of CPUs
(parallel slaves)
or
 A new version of Oracle
 We populate some metadata too, such
as last load date
55
DW Schedule of Operations
 Some loads are ad hoc, not scheduled
 Users query in an ad hoc fashion
 We have a minimal service/application tier
implemented (loss of control)
 Use of a variety of ETL tools
 Use of a variety of user categories
 DBA, SA, ETL user, end users
 Use of a variety of servers
 Production EDW, Stage EDW, ETL servers, OLAP servers,
Presentation layer servers
56
General Approach
 Focus on load jobs against production EDW
 Still working on all the reporting aspects (a sample on the
next slide)
 Pull this information out of the “load times” data
captured by these ETL library calls
 BEGIN_JOB_TIME
 BEGIN_LOAD_TIME
 END_LOAD_TIME
 END_JOB_TIME
 COMPLETE_LOAD_TIME
57
Sample Report
58
DW Alerting Tool
 DW alerting
 Aggregate data alerts, such as, your
average length of stay just crossed a
certain threshold
 A simple tool was created which
sends a text email, based on
existence of data returned from a
query
 Primarily embraced by DW team
members for internal DW
operations, though the original
intent has not been abandoned
59
Features
 Web based
 Open to all EDW users
 Run daily, weekly, every two weeks, monthly,
quarterly (wakes every 5 minutes)
 This is passive polling
 Ability to enter query in SQL
 Alert (email) on 3 situations
 Query returns data
 Query returns no data
 Always
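The three alert situations reduce to one small decision. The sketch below assumes an alert is a SQL query plus a condition; the enum, the function name, and the SQLite connection are illustrative, and sending the email is left out.

```python
import sqlite3
from enum import Enum


class AlertWhen(Enum):
    RETURNS_DATA = "query returns data"
    RETURNS_NO_DATA = "query returns no data"
    ALWAYS = "always"


def should_alert(conn: sqlite3.Connection, sql: str, when: AlertWhen) -> bool:
    """Run the alert query and decide whether an email should go out."""
    has_rows = conn.execute(sql).fetchone() is not None
    if when is AlertWhen.ALWAYS:
        return True
    if when is AlertWhen.RETURNS_DATA:
        return has_rows
    return not has_rows
```

A scheduler that wakes every few minutes would call `should_alert` for each alert that is due and hand the positive results to a mailer.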
60
User Interface
61
Examples
 ~100 alerts in use
 Live performance check
 Every 4 hours—look for inactive sessions holding active
slaves
 Daily—look for any active sessions older than 72 hours
 ETL monitoring; alert only if problem
 Alert on errors logged via the EDW_UTIL library (manage by
exception)
 Alert on existence of “bad” records captured during ETL
62
Storage and Backup
 Inherited state of affairs
 Running like any OLTP database
 High end expensive SANs (storage area
networks)
 FULL nightly online backups
 Out of space? Just buy more
63
Nightmare in the Making
 Exponential growth
 More data sources
 More summary tables
 More indexes
 No data has yet been
purged
 Relaxed attitude
 Disk is cheap
 Reality: Disk management is expensive
64
Looming Crisis
 Backups often run 16 hours or more
 Performance degradation witnessed by users
 Good backups obtained less than 50% of the time
 Literally running out of space
 Gross underestimating
 Some reckless overuse
 Financial $$$$ cost
 The system administrators (SAs) quadruple the price of
disk purchase from the previous budget year. Ouch!
 SAs roll in the price of tape drives, etc.
65
Major Changes in Operations
 Transfer some disk ownership AND backup
responsibilities to DW team, away from
SAs and DBAs
 EDW team more aware of upcoming space
demands
 EDW team more in tune with which data sets
are easily recreated from the source (don’t
need a backup)
 Stop performing full daily backups
 Move towards less expensive disk
option
 IBM offers a few levels of SANs
66
Tracking and Predicting
Storage Use
67
Changes to Backup Strategy
 Perform full backup once monthly during
downtime
 Perform no data backup on DEV/STAGE
environments
 Do backup DDL (all code)
daily in all environments
 Implement daily
“incremental” backup
68
Daily Incremental Backups
 Easier said than done
 We’ve resorted to a table level backup (in Oracle,
that’s an EXPORT)
 The EDW team owns which tables are exported
 EDW team populates a table, the “export table list” with
each table’s export frequency
 Populated via an application in development
 The DBAs run an export based on the “export table
list”
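A sketch of how an "export table list" might drive the daily run: each entry names a table, its export frequency, and when it was last exported. The frequency labels, day counts, and names below are assumptions; the actual list and the application that populates it are not described in these slides.

```python
from datetime import date
from typing import List, NamedTuple, Optional

# Assumed frequency labels and their length in days.
FREQUENCY_DAYS = {"DAILY": 1, "WEEKLY": 7, "MONTHLY": 30}


class ExportEntry(NamedTuple):
    table_name: str
    frequency: str
    last_export: Optional[date]  # None means never exported


def tables_due(entries: List[ExportEntry], today: date) -> List[str]:
    """Return the tables whose export interval has elapsed."""
    due = []
    for entry in entries:
        if entry.last_export is None:
            due.append(entry.table_name)
            continue
        age_days = (today - entry.last_export).days
        if age_days >= FREQUENCY_DAYS[entry.frequency]:
            due.append(entry.table_name)
    return due


# Hypothetical export table list.
export_table_list = [
    ExportEntry("CASEMIX", "DAILY", date(2005, 6, 14)),
    ExportEntry("CLAIMS", "WEEKLY", date(2005, 6, 10)),
    ExportEntry("LKUP_REGION", "MONTHLY", None),
]
```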
69
Use Cheaper Disk
 General practice: You can take greater risks with DW reliability
and availability vs. OLTP systems
 Use it to your advantage
 Our SAN vendor (IBM) offers a few levels of SANs. Next level
down is a big step down in price, small step down in features.
 Feature loss:
 Read cache (referring to disk cache, not box memory).
 We rarely read the same thing twice anyway
 No “phone home” to IBM (auto paging)
 Mean time to failure is lower, but still acceptable
70
Performance Monitoring & Tuning
 Err on the side of freedom and empowerment
 How much harm can really be done?
 We’d rather not constrain our customers
 “Pounding queries” do find
their way to production
 Opportunity to educate users
 Opportunity for us to tune
underlying structures
71
The Focus Areas
 Indexing
 Well-defined criteria for when and how to apply indexes
 Is this a lost art?
 Big use of BITMAPS
 Composite index trick (acts like a table)
 Partitioning for performance, rather than data management
 Exploiting Oracle’s Direct Path INSERT feature
 Avoiding UPDATE and DELETE commands
 Copy with MINUS instead
 Implementing Oracle's Parallel Query
 Turn off referential integrity in the DW.. no brainer
 That’s the job of the source system
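One reading of "copy with MINUS instead" is to build a new table from everything except the unwanted rows, rather than deleting in place. The sketch below shows that idea with SQLite's `EXCEPT`, which plays the role of Oracle's `MINUS`; the function name is hypothetical, and this interpretation of the bullet is an assumption. Like `MINUS`, `EXCEPT` returns distinct rows, so exact duplicate rows in the source collapse to one.

```python
import sqlite3


def copy_without(
    conn: sqlite3.Connection, source: str, target: str, unwanted_query: str
) -> None:
    """Create `target` holding the rows of `source` minus the unwanted rows."""
    conn.execute(
        f'CREATE TABLE "{target}" AS '
        f'SELECT * FROM "{source}" EXCEPT {unwanted_query}'
    )
    conn.commit()
```

The new table can then be renamed into place, which avoids row-by-row delete processing on a large table.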
72
DW Monitoring:
Empowering End Users
 Motive
 Too many calls from end users about their
queries
 “Please kill it.”
 “Is it still running or is my PC locked up?”
 “Why is the DW so slow?”
 Give them the insight and tools
 Give them the ability to kill their own
queries
 Still in the works
73
The Insight
74
Tracking Long-Running Queries
 We use Pinecone (from Ambeo) to monitor
the duration of all queries and the SQL
 Each week, we look at the top few
 Typical outcome?
 We’ll add indexes
 We’ll denormalize
 We'll contact the user
and assist them with
writing a better query
75
The DW Sandbox
 More empowerment for customers
 Motive
 Lots of little MS Access databases
(with valuable data) spread
all over the place
 Needed to be joined
with DW data
 Costly to maintain
 PC hogs
 Solution
 Provide customers with their own “sandbox” on the DW,
with DBA-like priv’s
76
Features
 Web based tool for creating tables and
loading MS Access data to the DW
 Simple, easy to use interface
 Privileges
 Users have full rights to the
tables they create
 Can grant rights to others
 Big, big victory for customer
service and data “maturity”
 10% of DW customers use the
Sandbox
 About 600 tables in use now
 About 2 GB of data
77
Design-Build Best Practices
 Build vertically, design horizontally
 Start by building data marts that address analytic
needs in one area of the business with a fairly
limited data set
 But, design with the horizontal needs of the
company in mind, so that you will eventually “tie”
all of these vertical data marts together with a
common semantic layer
78
Creating Value In Both Axes
Build
Design
79
For Example…
An Integrated Reporting Model of Cancer Patient’s Data
Oncology Data Integration Strategy
[Diagram: top down reporting requirements and data model, above disparate sources “connected” semantically to the data bus]
Sources: Cancer Registry, Mammography, Radiology, Pathology, Laboratory, Continuing Care and Follow-Up, Quality of Life Survey, Radiation Therapy, Health Plans Claims, Ambulatory Casemix, Acute Care Casemix
80
The Logic Layer in Data Warehouses
[Diagram: Source System, ETL Process, Data Warehouse, Reports]
[Diagram labels: Data Layer, Logic Layer, Presentation Layer; Transaction Systems and Analytic Systems; markers reading “Here” and “Not Here”]
81
Evidence of Business Process Alignment
1. Map out your high level business process
 Don’t fall prey to analysis paralysis with endless business
process modeling diagrams!
2. Identify and associate the transaction systems that support
those processes
3. Identify the common, overlapping semantics/data attributes
and their utilization rates
4. Build your data marts within an enterprise framework
that is aligned with the processes
you are trying to understand
82
For example…
[Diagram: Healthcare business process. Labels: Diagnosis, Health Need, Patient Perception, Procedure, Results & Outcomes, Episode of Care, AP/AR, Claims Processing]
Supported by non-integrated data in Transaction Systems…
HELP, Lab, HPI, MC400, Survey, AS400, IDX, HDMCIS/CDRHNA, Rx
Integrated in the Data Warehouse
83
Event Correlation
 A leading edge Best Practice
 The third dimension to rows and columns
 Overlays the data that underlies a report or
graph
 “In 2004, we experienced a drop in revenue
as a result of the earthquake that destroyed
our plant in the Philippines.”
 “In January of 2005, we saw a spike in the
North America market for snow shovel sales
that coincided with an increase in sales for
pain relievers. This correlates to the record
snowfall in that region and should not be
considered a trend. Barring major product
innovation, we consider the market for snow
shovels in this area as saturated. Sales will be
slow for the next several years.”
84
Standardizing Semantics
 Sweet irony: there are many synonyms for “standard
semantics”
 Data dictionary
 Vocabulary
 Dimensions
 Data elements
 Data attributes
 The bottom line issue: Standardizing the terms you
use to describe key facts about your business
85
Standardizing “Names of Things”
 You better do it within the first two months of your
data warehouse project
 If you are beyond that point, you better stop and do it now,
lest you pay a bigger price later
 Don’t…
 Push the standard on the source systems, unless it’s easy
to accomplish
 This was one of the common pitfalls of early data
warehousing project failures
 Try to standardize everything under
the sun!
 Focus on the high value facts
86
Where Are The “High Value” Semantics?
In the high-overlap, high-utilization areas…
[Venn diagram of Source System X, Source System Y, and Source System Z; the region where they overlap is the highest value area for standardizing semantics]
87
Another Perspective
[Chart axes: Semantic Utilization versus Semantic Overlap]
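To make the overlap and utilization idea concrete, the sketch below ranks attributes by a simple score: the number of source systems that carry the attribute multiplied by how often it is queried. Multiplying the two is an assumption chosen for illustration; the slides only say that the high-overlap, high-utilization attributes deserve standardizing first. All names and figures are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def rank_semantics(
    attribute_sources: Dict[str, Set[str]], query_counts: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Rank attributes by (number of source systems) x (query count)."""
    scored = []
    for attribute, systems in attribute_sources.items():
        overlap = len(systems)
        utilization = query_counts.get(attribute, 0)
        scored.append((attribute, overlap * utilization))
    # Highest score first; ties broken alphabetically for a stable report.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Hypothetical inventory: which systems hold each attribute, and query volume.
attribute_sources = {
    "patient_id": {"X", "Y", "Z"},
    "diagnosis_cd": {"X", "Y"},
    "room_color": {"Z"},
}
query_counts = {"patient_id": 900, "diagnosis_cd": 1200, "room_color": 3}
```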
88
The Standard Semantic “Layer”
Data
Warehouse
Source
Systems Extract, Transform, Load
Semantic Standards
89
Data Modeling
 Star schemas are great and simple, but they aren’t
the end-all, be-all of analytic data modeling
 Best practices: Do what makes sense– don’t be a schema
bigot
 I’ve seen great analytic value from 3NF models
 Maintain data familiarity for your customers
 When meeting vertical needs
 Don’t make massive changes to the way the model looks
and feels, nor the naming conventions– you will alienate
existing users of the data
 Use views to achieve “new” or standards-compliant
perspectives on data
 When meeting horizontal needs
90
For Example…
Source perspective
DW perspective
Similar names
& organization
Vertical data customer
Horizontal data customer
“Standardized” view
91
The Case For Timely Updates
[Chart: % of requests for data (utilization, 0 to 100) plotted against data age (today, 1 year, 2 years)]
Generally, to minimize Total Cost of Ownership (TCO), update the data no more
often than the decision making cycle associated with that data requires.
But… everyone wants more timely data.
92
Best Practice: Measure Yourself
 Employee satisfaction
 Customer satisfaction
 Average number of
queries/month
 Number of queries above a
threshold (30 minutes?)
 Average query response time
 Total number of records
 Total number of query-able
tables
 Total number of query-able
columns
 Number of “users”
 Average rows delivered per
month
 Storage utilization
 CPU utilization
 Downtime per month by data
mart
The Data Warehouse Dashboard
93
Other Best Practices
 The Data Warehouse Information
Systems Team reports to the CIO
 Most data analysts can and probably
should report to the business units
 Change management/service level
agreements with the source systems
 No changes in the source systems
unless they are coordinated with the data
warehouse team
94
More Best Practices
 Skills of the Data Warehouse IS Team
 Experienced chief architect/project manager
 Procedural/script programmers
 SQL/declarative programmers
 Data warehouse storage management architects
 Data warehouse hardware architects and system
administrators
 Data architects/modelers
 DBAs
95
More Best Practices
 Evidence of project collaboration
 A cross section of members and expertise from the data
warehouse IS team
 Statisticians and data analysts who understand the
business domain
 A customer that understands the process(es) being
measured and can influence change
 A data steward– usually someone from the front lines who
knows how the data is collected
Project = complex reports or a data mart
96
More Best Practices
 Whenever possible, extract as close
to the source as you can
Primary
Source
Copy A
Copy B
Data
Warehouse
Best Practice Path
97
The Most Popular Authors
 I appreciate…
 The interest they stir
 The vocabulary– semantics– of this new specialty that they helped
create
 The downside…
 The buzzwords that are more buzz than substance
 “Corporate Information Factories”
 Endless, meaningless debate
 “That’s not an Operational Data Store!”
 “Do you follow Kimball or Inmon?”
 Follow your own common sense
 Most of these authors have not had to build a data warehouse from
scratch and live with their decisions through a complete lifecycle
98
ETL Operations
 Besides the cultural risks and challenges, the riskiest part of a
data warehouse…
 The Extract, Transform, and Load
processes
 Worthy of its own “Best Practices” discussion
 Suffice to say, mitigate risks in this area carefully and deliberately
 The major design errors don’t show up until late in the lifecycle,
when the cost of repair is great
 Good book
 Westerman, “Data Warehousing: Using the Wal-Mart Model”
99
Two Essential ETL Functions
 Initial loads
 How far back do we go in history?
 Maintenance loads
 Differential loads or total refresh?
 How often?
 You will run and tune these processes several times
before you go into production
 How many records are we dealing with?
 How long will this take to run?
 What’s the impact on the source system performance?
100
Maintenance Loads
 Total refresh vs. Incremental loads
 Total refresh: Truncate and reload everything from the
source system
 Incremental: Load only the new and updated records
 For small data sets, a total refresh strategy is the
easiest to implement
 How do you define “small”? You will know it when you
see it.
 Sometimes the fastest strategy when you are trying to
show quick results
 Grab and go…
101
Incremental Loads
 How do we get a snapshot of the data that
has changed since the last load?
 Many source systems will have an existing log file
of some kind
 Take advantage of these when you can,
otherwise incremental loads can be complicated
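When the source system does keep a change log, an incremental load can be as simple as replaying the entries written since the last load. The sketch below assumes a log table with a sequence number, an operation code (I, U, or D), and the changed values; the table names, column names, and function name are hypothetical, and SQLite stands in for both source and warehouse.

```python
import sqlite3


def apply_changes(conn: sqlite3.Connection, last_seq: int) -> int:
    """Apply log entries newer than last_seq; return the new high-water mark."""
    changes = conn.execute(
        "SELECT seq, op, id, name FROM src_change_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, op, customer_id, name in changes:
        if op == "D":
            conn.execute("DELETE FROM dw_customer WHERE id = ?", (customer_id,))
        else:
            # Inserts and updates are handled the same way: replace by key.
            conn.execute(
                "INSERT OR REPLACE INTO dw_customer (id, name) VALUES (?, ?)",
                (customer_id, name),
            )
        last_seq = seq
    conn.commit()
    return last_seq
```

The returned sequence number is stored and passed in on the next run, so each load picks up exactly where the previous one stopped.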
102
File Transfer Formats
Design your extract so that it uses…
 Fixed, predetermined length for all records and fields
 Avoid variable length if at all possible
 A unique character that separates each field in a record, such as ~
 A standard format for header records across all source systems
 Such as the first three records in each file
 Include name of source system,
file, and record count and
number of fields in the
record
 This will be handy for
monitoring jobs and
collecting load metadata
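A sketch of a reader for such a file, under an assumed layout: record 1 is the source system name, record 2 the file name, record 3 holds the record count and field count, and every field is separated by `~`. That layout and the function name are illustrative assumptions; the fixed-length aspect recommended above is not checked here.

```python
from typing import Dict, List

FIELD_SEPARATOR = "~"


def parse_extract(text: str) -> Dict[str, object]:
    """Parse a transfer file and verify it against its own header records."""
    lines = text.splitlines()
    if len(lines) < 3:
        raise ValueError("file is missing its three header records")
    source_system, file_name = lines[0], lines[1]
    record_count, field_count = (int(x) for x in lines[2].split(FIELD_SEPARATOR))
    records: List[List[str]] = [line.split(FIELD_SEPARATOR) for line in lines[3:]]
    if len(records) != record_count:
        raise ValueError(
            f"header promised {record_count} records, found {len(records)}"
        )
    for number, record in enumerate(records, start=1):
        if len(record) != field_count:
            raise ValueError(
                f"record {number} has {len(record)} fields, expected {field_count}"
            )
    return {
        "source_system": source_system,
        "file_name": file_name,
        "record_count": record_count,
        "field_count": field_count,
        "records": records,
    }


# Hypothetical extract from a billing system.
sample_extract = (
    "BILLING\n"
    "charges.dat\n"
    "2~3\n"
    "1001~2005-06-01~125.50\n"
    "1002~2005-06-02~80.00\n"
)
```

Because the header states what to expect, a truncated transfer is caught before any rows reach the warehouse.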
103
Benefits of Standard
File Transfer Format
 Compatible with standard database and operating system utilities
 Dynamically create initial and maintenance load scripts
 Read the table definitions (DDL) then merge that with the
standard transfer file format
 Dynamically generate load monitoring data
 Read the header row, insert that into a
“Load Status” table with status “Running”,
# of records, start time
 At EOF, change status to “Complete” and
capture end of load time
 I wish I had thought about this topic
more, and earlier in my career
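The load monitoring idea above can be sketched as follows, assuming a hypothetical load_status table. The column names and the start_load and finish_load helpers are illustrative, not from any particular product.

```python
import sqlite3
from datetime import datetime, timezone

def now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()

def start_load(db, source_system, file_name, record_count):
    """Record the header values with status 'Running'; return the load id."""
    cur = db.execute(
        "INSERT INTO load_status "
        "(source_system, file_name, record_count, status, start_time) "
        "VALUES (?, ?, ?, 'Running', ?)",
        (source_system, file_name, record_count, now()),
    )
    db.commit()
    return cur.lastrowid

def finish_load(db, load_id):
    """At end of file, mark the load 'Complete' and capture the end time."""
    db.execute(
        "UPDATE load_status SET status = 'Complete', end_time = ? "
        "WHERE load_id = ?",
        (now(), load_id),
    )
    db.commit()

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE load_status ("
    "load_id INTEGER PRIMARY KEY, source_system TEXT, file_name TEXT, "
    "record_count INTEGER, status TEXT, start_time TEXT, end_time TEXT)"
)

load_id = start_load(db, "LAB", "results_20050601.txt", 2)
status_during = db.execute(
    "SELECT status FROM load_status WHERE load_id = ?", (load_id,)
).fetchall()[0][0]
# ... the data records would be loaded here ...
finish_load(db, load_id)
status_after, end_time = db.execute(
    "SELECT status, end_time FROM load_status WHERE load_id = ?", (load_id,)
).fetchall()[0]
```

A load that stays in "Running" long after its start time is then easy to spot with a single query.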
104
Westerman Makes A Good Point
 My experience: ETL is the least appealing and least
productive use of a veteran EDW team member’s time, so
I like Westerman’s insight on this topic
 If you design for
instantaneous updates
from the beginning, it
translates to less ETL
maintenance and labor
time for the EDW staff,
later
105
Messaging Applied to ETL
 Basic concepts
 Use a load message queue for records that need to be updated, coming
from the source systems
 When the EDW analytical processing workload is low (off-peak), pick the
next message off the load queue and load the data
 Run this in parallel so that you can process several load messages at
the same time while you have a window of opportunity
 Sometimes called “throttling”
 Speed up and slow down based upon traffic conditions
 Motive behind the concept
 Continuous updates in a mixed
workload environment
 Mixed: Analytical processing at the
same time as transaction oriented,
constant updates, deletes, inserts
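A rough sketch of the throttling idea, assuming the warehouse can report its current workload as a number between 0 and 1. The get_workload callback, the 0.7 threshold, and the batch sizing rule are all hypothetical. For simplicity, the sketch processes batches sequentially rather than in parallel.

```python
from collections import deque

def drain_queue(queue, get_workload, load_record,
                busy_threshold=0.7, max_batch=4):
    """Load queued messages while the warehouse is quiet; stop when it is busy.

    get_workload() returns a 0..1 utilisation figure (a hypothetical metric).
    Returns the number of messages loaded in this pass.
    """
    loaded = 0
    while queue:
        workload = get_workload()
        if workload >= busy_threshold:
            break  # back off: analytical queries have priority
        # Speed up when idle: the quieter the system, the bigger the batch.
        batch = max(1, int(max_batch * (1 - workload)))
        for _ in range(min(batch, len(queue))):
            load_record(queue.popleft())
            loaded += 1
    return loaded

# Ten pending load messages and three simulated workload readings.
queue = deque({"op": "insert", "id": i} for i in range(10))
readings = iter([0.1, 0.5, 0.9])
applied = []
count = drain_queue(queue, lambda: next(readings), applied.append)
```

With these simulated readings the loader takes three messages while the system is nearly idle, two more at moderate load, then stops and leaves the rest on the queue for the next quiet window.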
106
ETL Message Queue Process
[Diagram components: Source Systems (updates, inserts, deletes); ETL Message Queue; ETL Manager; Database Workload and Performance Metrics; EDW Production Tables]
107
Four Data Maintenance
Processes
 Initial load
 Loading into an empty table
 Append load
 Update process
 Delete process
 As much as practical, use your database utilities for these
processes
 Study and know your database utilities for data warehousing;
they are getting better all the time
 I see some bad strategies in this area-- companies spending
time building their own utilities…aye cucumber!
108
A Few Planning Thoughts
 Understand the percentage of records
that will be updated, deleted, or inserted
 You’ll probably develop a different
process for 90% inserts vs. 90%
updates
 Logging
 In general, turn logging off during the processes, if logging was
on at all
 Field vs. Record level updates
 Some folks, in the interest of purity, will build complex update
processes for passing only field (attribute) level changes
 No brainer: Pass the whole record
109
Initial Load
 Every table will, at some time, require an
initial load
 For some tables, it will be the best choice for data
maintenance
 Total data refresh
 Best for “small” tables
 Simple process to implement
 Simply delete (or truncate) and reload with fresh
data
110
A Better Initial Load Process
 Background load
 Safer– protects against corrupt files
 Higher availability to customers
 Three or four steps… maybe 6?
1. Create a temporary table
2. Load the temporary table
3. Run quality checks
4. Rename the temporary table to the production table name
5. Delete the old table
6. Regrant rights, if necessary
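The numbered steps can be sketched with SQLite as a stand-in database. The background_reload helper, the _tmp and _old suffixes, and the row-count quality check are assumptions for illustration. Because two tables cannot share a name, this sketch moves the old table aside before renaming the temporary one, which differs slightly from the order listed above.

```python
import sqlite3

def background_reload(db, table, columns_sql, rows, min_rows=1):
    """Rebuild `table` in the background, then swap it into place."""
    tmp, old = f"{table}_tmp", f"{table}_old"
    db.execute(f"DROP TABLE IF EXISTS {tmp}")
    db.execute(f"CREATE TABLE {tmp} ({columns_sql})")      # 1. create temp table
    if rows:                                                # 2. load temp table
        marks = ", ".join("?" * len(rows[0]))
        db.executemany(f"INSERT INTO {tmp} VALUES ({marks})", rows)
    loaded = db.execute(f"SELECT COUNT(*) FROM {tmp}").fetchall()[0][0]
    if loaded < min_rows:                                   # 3. quality check
        db.execute(f"DROP TABLE {tmp}")
        db.commit()
        raise ValueError(f"{table}: only {loaded} rows loaded, keeping old data")
    db.execute(f"DROP TABLE IF EXISTS {old}")
    db.execute(f"ALTER TABLE {table} RENAME TO {old}")     # move old table aside
    db.execute(f"ALTER TABLE {tmp} RENAME TO {table}")     # 4. temp becomes production
    db.execute(f"DROP TABLE {old}")                         # 5. delete the old table
    db.commit()
    # 6. Regrant rights here if your database needs it (SQLite has no GRANT).

# Hypothetical production table with stale data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO dept VALUES (1, 'Old')")
db.commit()

background_reload(db, "dept", "id INTEGER PRIMARY KEY, name TEXT",
                  [(1, "Cardiology"), (2, "Oncology")])
```

If the quality check fails, the production table is never touched, which is the safety benefit the slide describes.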
 Westerman: “You want to use as many initial load processes as
possible.”
 I agree!
111
Append Load
 For larger tables that accumulate historical
data
 There are no updates, just appends
 A hard fact that will not change
 Example
 Sales that are closed
 Lab results
112
Append Load Options
 Load a single part of a table
 Load a partition and ‘attach’ it to the table
 Create a new, empty partition
 Load the new records
 Attach the partition to the table
 Look to use the “LOAD APPEND” command
in your database
113
Another Append Option
1. Create a temp table identical to the one you
are loading
2. Load the new records into the empty temp
table
3. Issue INSERT/SELECT
INSERT INTO Big_Table (SELECT * FROM
Temp_Big_Table)
4. Delete the temp table
IF # RECORDS IN TEMP IS MUCH < # OF RECORDS IN BIG
THEN GOOD TECHNIQUE
ELSE NOT GOOD
114
Update Process
 The most difficult and risky to build
 Use this process only if the tables are too large for a
complete refresh, “Initial Load” process
 Updates affect data that changes over time
 Like Purchase Orders, hospital transactions, etc.
 Medical records, if you treat the data maintenance at the
macroscopic level
115
Update Process Options
 Simple process
1. Separate the affected records into an update file, an insert file, and a
delete file
 Do this on the source system, if possible
2. Transfer the files to the data warehouse staging area
3. Create and run two processes
– A delete process for deleting the records in the production table
that need to be updated or deleted
– An insert process for inserting the entirely new “updated” record
into the production table, as well as the true inserts
Simple, but typically not very fast
116
Simple Process
[Diagram: the Source System sends updated, deleted, and new records as Updates, Deletes, and Inserts files to the EDW Staging Area, where a Delete Process and an Insert Process act on the EDW Production Table in the numbered steps below]
1. Delete Process identifies records for deletion from the Production Table based upon contents of the Updates file.
2. Delete Process identifies records for deletion from the Production Table based upon contents of the Deletes file.
3. Delete Process deletes records from the Production Table.
4. Insert Process identifies records for insert to the Production Table based upon contents of the Updates file.
5. Insert Process identifies records for insert to the Production Table based upon contents of the Inserts file.
6. Insert Process inserts records into the Production Table.
117
When You Are Unsure
 Sometimes, source system log and audit files make
it difficult to know if a record was updated or
inserted (i.e. created)
 Try this…
1. Load the records into a temp table that is
identical to the production table to be updated
2. Delete corresponding records from the
production table
DELETE FROM prod_table WHERE key_field
IN (SELECT temp_key_field FROM temp_table)
3. Insert all the records from the temp table into
the production table
Most databases now support this with an UPSERT
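The delete-then-insert approach can be tried end to end with SQLite. This is a minimal sketch: the table contents are made up, and because the temp table is described as identical to the production table, both use the column name key_field here.

```python
import sqlite3

def apply_changes(db):
    """Replace production rows with whatever is staged in temp_table."""
    with db:  # commit both statements together, or roll both back
        db.execute(
            "DELETE FROM prod_table WHERE key_field IN "
            "(SELECT key_field FROM temp_table)"
        )
        db.execute("INSERT INTO prod_table SELECT * FROM temp_table")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prod_table (key_field INTEGER PRIMARY KEY, val TEXT)")
db.execute("CREATE TABLE temp_table (key_field INTEGER PRIMARY KEY, val TEXT)")
db.executemany("INSERT INTO prod_table VALUES (?, ?)", [(1, "a"), (2, "b")])
# Key 2 is an update and key 3 is a true insert; the process does not need to know.
db.executemany("INSERT INTO temp_table VALUES (?, ?)", [(2, "B"), (3, "c")])
db.commit()

apply_changes(db)
result = db.execute(
    "SELECT key_field, val FROM prod_table ORDER BY key_field"
).fetchall()
```

Running both statements in one transaction keeps readers from ever seeing the production table with the old rows deleted but the new ones not yet inserted.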
118
Massive Deletes
 Just as with Updates and Inserts, the number of Deletes you
have to manage is inversely proportional to the frequency of your
ETL processes
 Infrequent ETL means massive data operations
 Partitions work well for this, again
 E.g., keeping a 5 year window of data
 Insert most recent year with a partition
 Delete the last year’s partition
 Blazing fast!
[Diagram: five yearly partitions, numbered 1 to 5, with one labeled “Delete partition” and one labeled “Insert partition”]
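A sketch of the rolling five-year window. SQLite has no native partitioning, so this example emulates partitions with one table per year behind a view; that emulation, and the names sales, sales_<year>, and roll_window, are assumptions for illustration. On a database with real partitions the same steps would be attach and drop partition operations.

```python
import sqlite3

def rebuild_view(db, years):
    """Expose the yearly tables as one logical 'sales' table."""
    db.execute("DROP VIEW IF EXISTS sales")
    union = " UNION ALL ".join(f"SELECT * FROM sales_{y}" for y in years)
    db.execute(f"CREATE VIEW sales AS {union}")

def roll_window(db, years, new_year, new_rows, window=5):
    """Add the newest year, drop the oldest ones beyond the window."""
    db.execute("DROP VIEW IF EXISTS sales")
    db.execute(f"CREATE TABLE sales_{new_year} (sale_date TEXT, amount REAL)")
    db.executemany(f"INSERT INTO sales_{new_year} VALUES (?, ?)", new_rows)
    years = sorted(years + [new_year])
    while len(years) > window:
        oldest = years.pop(0)
        db.execute(f"DROP TABLE sales_{oldest}")  # no row-by-row DELETE
    rebuild_view(db, years)
    db.commit()
    return years

# Start with a made-up five-year window, 2000 through 2004.
db = sqlite3.connect(":memory:")
years = []
for y in range(2000, 2005):
    db.execute(f"CREATE TABLE sales_{y} (sale_date TEXT, amount REAL)")
    db.execute(f"INSERT INTO sales_{y} VALUES (?, ?)", (f"{y}-06-30", 100.0))
    years.append(y)
rebuild_view(db, years)

years = roll_window(db, years, 2005, [("2005-03-01", 50.0)])
```

Dropping a whole year as one object is what makes this fast: the database discards the storage in one step instead of deleting and logging each row.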
119
“Raw” Data Standards for ETL
 Makes the process of communicating with your source system
partners much easier
 Data type (e.g., format for date time stamps)
 File formats (ASCII vs. EBCDIC)
 Header records
 Control characters
 Rule of thumb
 Never transfer data at the binary level unless you are transferring
between binary compatible computer systems
 Use only text-displayable characters
 Less rework time vs. Less storage space and faster transfer speed
 Storage and CPU time are cheap compared to labor
120
Last Thought…Indexing Strategies
 Define these early, practice them
religiously, use them extensively
 This is “Database Design 101”
 Don’t fall prey to this most common
performance problem!
121
My Thanks
 For being invited…
 For your time and attention
 For the many folks who have worked for and with
me over the years that made me look better as a
result
 Please contact me if you have any questions
 dsanders@nmff.org
 PH: 312-695-8618

Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 

Recently uploaded (20)

办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 

Data Warehousing History and Best Practices

  • 1. Data Warehousing A Look Back, Moving Forward Dale Sanders June 2005
  • 2. 2 Introduction & Warnings  Why am I here?  Teach  Stimulate some thought  Share some of my experiences and lessons  Learn  From you, please…  Ask questions, challenge opinions, share your knowledge  I’ll do my best to live up to my end of the bargain  Warnings  The pictures in this presentation  May or may not have any relevance whatsoever to the topic or slide  Mostly intended to break up the monotony
  • 3. 3 Expectation Management  DW Strengths (according to others)  I know what not to do as much as I know what to do  Seen and made all the big mistakes  Vision, strategy, system architecture, data management & DW modeling, complex cultural issues, “leapfrog” problem solving  What not to expect: DW weaknesses  My programming skills suck  Haven’t written a decent line of code in four years!  Some might say it’s been 24 years…   Knowledge of leading products is very rusty  Though I’m beefing up on Microsoft and Cognos Within these expectations, make no mistake about it… I know data warehousing 
  • 4. 4 Today’s Discussions  I am a good “Idea Guy”  But, ideas are worthless without someone to implement and enhance them  Steve Barlow, Dan Lidgard, Jon Despain, Chuck Lyon, Laure Shull, Kris Mitchell, Peter Hess, Ron Gault, Rob Carpenter, my wife, and many others  My greatest strength and blessing  The ability to recognize, listen to, and hold onto good people  Knock on wood  My achievements in personal and professional life  More a function of those around me than a reflection on me
  • 5. 5 DW Best Practices: The Most Important Metrics  Employee satisfaction  Without it, long-term customer satisfaction is impossible  Customer satisfaction  That’s the nature of the Information Services career field  Some people in our profession still don’t get it  We are here to serve  The Organizational Laugh Metric  How many times do you hear laughter in the day-to-day operations of your team?  It is the single most important vital sign to organizational health and business success
  • 6. 6 My Background  Three, eight-year chapters  Captain, Information Systems Engineer, US Air Force  Nuclear warfare battle management  Force status data integration  Intelligence and attack warning data “fusion”  Consultant in several industries  TRW  CIA Data Center  TRW Credit Reporting Data Base  National Security Agency (NSA)  Intel: New Mexico Data Repository (NMDR)  Air Force  Integrated Minuteman Data Base (IMDB)  Peacekeeper Information Retrieval System (PIRS)  Many others…  Healthcare  Intermountain Health Care Enterprise Data Warehouse  Consultant to other healthcare organizations’ data warehouses  Now at Northwestern University Medical System
  • 7. 7 Overview  Data warehousing history  According to Sanders  Why and how did this become a sub-specialty in information systems?  What have we learned so far?  My take on “Best Practices”  Key lessons-learned  My thoughts on the most popular authors in the field  What they contribute, where they detract
  • 8. 8 Data Warehousing History “Newspaper Rock” 100 B.C. American Retail 2005 A.D. Lots of stuff happened
  • 9. 9 What Happened in the Cloud?  Stage 1: Laziness  Operators grew tired of hanging tapes  In response to requests for historical financial data  They stored data on-line, in “unauthorized” mainframe databases  Stage 2: End of the mainframe bully  Computing moved out from finance to the rest of the business  Unix and relational databases  Distributed computing created islands of information  Stage 2.1: The government gets involved  Consolidating IRS and military databases to save money on mainframes  “Hey, look what I can do with this data…”  Stage 3: Deming comes along  Push towards constant business “reengineering”  Cultural emphasis on “continuous quality improvement” and “business innovation” drives the need for data  Stage 4: Data warehousing has its own language  Ralph Kimball publishes “The Data Warehouse Toolkit”
  • 10. 10 The Real Truth  Data warehousing is a symptom of a problem  Technological inability to deploy single-platform information systems that:  Capture data once and reuse it throughout an enterprise  Support high-transaction rates (single record CREATE, SELECT, UPDATE, DELETE) and analytic queries on the same computing platform, with the same data, at the same time  Someday, maybe we will address the root cause  Until then, it’s a good way to make a living
  • 11. 11 The “Ideal Library” Practice  Stores all of the books and other reference material you need to conduct your research  The Enterprise data warehouse  A single place to visit  One database environment  Contents are kept current and refreshed  Timely, well choreographed data loads  Staffed with friendly, knowledgeable people that can help you find your way around  Your Data Warehouse team  Organized for easy navigation and use  Metadata  Data models  “User friendly” naming conventions
  • 12. 12 Cultural Detractors The two biggies…  The business supported by the data warehouse must be motivated by a desire for constant improvement and fact-based decision making  The data warehouse team falls victim to the “Politics of Data”  Through naivety  Through misguided motives, themselves
  • 13. 13 Business Culture  Does your CEO…  Talk about constant improvement, constantly?  Drive corporate goals that are SMART?  Specific, Measurable, Attainable, Realistic, Tangible  Crave data to make better informed decisions?  Become visibly, buoyantly excited at a demo for a data cube?  If so, the success of your data warehouse is right around the corner… sort of… I love data!
  • 14. 14 Political Best Practices  You will be called a “data thief”  Get used to it  Encourage life cycle ownership of the OLTP data, even in the EDW  You will be called “dangerous”  “You don’t understand our data!”  OLTP owners know their data better than you do– acknowledge it and leverage it  You will be blamed for poor data quality in the OLTP systems  This is a natural reaction  Data warehouses raise the visibility of poor data quality  Use the EDW as a tool for raising overall data quality  You will be called a “job robber”  EDW is perceived as a replacement for OLTP systems  Educate people: The EDW depends on OLTP systems for its existence  Stick to your values and pure motives  The politics will fade away
  • 15. 15 Data Quality  Pitfall  Taking accountability for data quality on the source system  Spending gobs of time and money “cleansing” data before it’s loaded into the DW  It’s a never ending, never win battle  You will always be one step behind data quality  You will always be in the cross-hairs of blame  Best Practice  Push accountability where it belongs– to the source system  Use the data warehouse as a tool to reveal data quality, either good or bad  Be prepared to weather the initial storm of blame
  • 16. 16 Measuring Data Quality  Data Quality = Completeness x Validity  Can it be measured objectively?  Measuring “Completeness”  Number of null values in a column  Measuring “Validity”  Cardinality is a simple way to measure validity  “We only have four standard regions in the business, but we have 18 distinct values in the region column.”
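The Completeness x Validity formula on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample region values, and the choice to score validity as the share of non-null values that fall inside the approved list are all hypothetical, not taken from the slides (which suggest comparing distinct-value counts as the simple check).

```python
from typing import Iterable, Optional, Set


def completeness(values: Iterable[Optional[str]]) -> float:
    """Share of values in a column that are not null."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def validity(values: Iterable[Optional[str]], allowed: Set[str]) -> float:
    """Share of non-null values that belong to the approved set (assumed definition)."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(v in allowed for v in present) / len(present)


def data_quality(values: Iterable[Optional[str]], allowed: Set[str]) -> float:
    """Data Quality = Completeness x Validity, as on the slide."""
    values = list(values)
    return completeness(values) * validity(values, allowed)


# Hypothetical region column: four standard regions, one misspelling, two nulls.
region_column = ["NORTH", "SOUTH", None, "EAST", "WEST", "Nrth", "SOUTH", None]
standard_regions = {"NORTH", "SOUTH", "EAST", "WEST"}

# The cardinality check from the slide: more distinct values than standard ones.
distinct_values = {v for v in region_column if v is not None}
extra_values = distinct_values - standard_regions

score = data_quality(region_column, standard_regions)
```

Here `extra_values` surfaces the unexpected region codes, which is usually the part worth sending back to the source system owner.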
  • 17. 17 Business Validity  How can you measure it? You can’t…  “I collect this data from our customers, but I have to guess sometimes because I don’t speak Spanish.”  “This data is valid for trend analysis decisions before 9/11/2001, but should not be used after that date, due to changes in security procedures.”  “You can’t use insurance billing and reimbursement data to make clinical, patient care decisions.”  “This customer purchased four copies of ‘Zamfir, Master of the Pan Flute’, therefore he loves everything about Zamfir.”  What Amazon didn’t know: I bought them for my mom and her sewing circle. Where do you capture subjective data quality? Metadata….
  • 18. 18 The Importance of Metadata  Maybe the most over-hyped, underserved area of data warehousing common sense  Vendors want to charge you big $$$$$ for their tools  Consultants would like you to think that it’s the Holy Grail in disguise and only they can help you find it  Authors who have never been in an operational environment would have you chasing your tail in pursuit of an esoteric, mythological Metadata Nirvana  Don’t listen to the confusing messages! You know the answer… just listen to your common sense…
  • 19. 19 Metadata: Keep It Simple!  Ultimately, what are the most valuable business motives behind metadata?  Make data more “understandable” to those who are not familiar with it  Data quality issues  Data timeliness and temporal issues  Context in which it was collected  Translating physical names to natural language  Make data more “findable” to those who don’t know where it is  Organize it  Take a lesson from library science and the card catalog
  • 20. 20 Table Elements Required Elements  Long Name (or English name)  Description Semi-optional Elements  Source  Example  Data Steward
  • 21. 21 Column Elements Required Elements  Long Name  Description Optional Elements  Value Range  Data Quality  Associated Lookup
  • 22. 22 The Data Model TABLE_ENT TABLE_ENT_ID: NUMBER TABLE_ENT_DESC: VARCHAR2(4000) TABLE_ENT_SRC: VARCHAR2(50) TABLE_ENT_NAME: VARCHAR2(50) TABLE_TYPE: VARCHAR2(10) CREATE_DT: DATE LAST_LOAD_DT: DATE SCHEMA_ID: NUMBER DATA_MART DATA_MART_ID: NUMBER DATA_MART_NAME: VARCHAR2(50) DATA_MART_DESC: VARCHAR2(4000) DATA_STEWARD: VARCHAR2(50) LAST_LOAD_DT: DATE UPDATE_FREQ: VARCHAR2(50) DATA_BEG_DT: DATE DATA_END_DT: DATE DATA_MART_TABLE_ENT DATA_MART_ID: NUMBER TABLE_ENT_ID: NUMBER FOLDER FOLDER_ID: NUMBER PARENT_FOLDER_ID: NUMBER FOLDER_NM: VARCHAR2(50) FOLDER_DSC: VARCHAR2(4000) CREATE_USER_ID: VARCHAR2(20) CREATE_DT: DATE REPORT RPT_ID: NUMBER FOLDER_ID: NUMBER RPT_NM: VARCHAR2(250) RPT_LOC_TXT: VARCHAR2(1000) PURPOSE_TXT: VARCHAR2(4000) RUN_FREQ_TXT: VARCHAR2(1000) AUDIENCE_TXT: VARCHAR2(500) EDW_RPT_FLG: NUMBER DATA_SOURCE_TXT: VARCHAR2(4000) SELECT_CRITERIA_TXT: VARCHAR2(4000) STAT_METHODS_TXT: VARCHAR2(4000) RPT_TOOL_TXT: VARCHAR2(250) CODE_TXT: CLOB FORMULA_TXT: CLOB COMMENTARY_TXT: VARCHAR2(4000) AUTHOR_NM: VARCHAR2(500) AUTHOR_TITLE_TXT: VARCHAR2(500) AUTHOR_DEPT_TXT: VARCHAR2(500) AUTHOR_LOC_TXT: VARCHAR2(500) AUTHOR_PHONE_TXT: VARCHAR2(500) AUTHOR_EMAIL_TXT: VARCHAR2(500) BUSINESS_OWNER_TXT: VARCHAR2(500) METADATA_UPDATE_DT: DATE VALIDATION_DT: DATE CREATE_USER_ID: VARCHAR2(20) CREATE_DT: DATE REPORT_TABLE_ENT_ASSOC RPT_ID: NUMBER TABLE_ENT_ID: NUMBER ATTRIBUTE ATTRIBUTE_ID: NUMBER TABLE_ENT_ID: NUMBER ATTRIBUTE_DESC: VARCHAR2(4000) ATTRIBUTE_NAME: VARCHAR2(50) ATTRIBUTE_DATATYPE: VARCHAR2(50) SAMPLE_VALUE: VARCHAR2(100) INDEX_FLG: NUMBER PRIMARY_KEY_FLG: NUMBER TABLE_POSITION_NO: NUMBER SCHEMA SCHEMA_ID: NUMBER SCHEMA_DESC: VARCHAR2(50)
  • 23. 23 Example Metadata Entry LKUP.POSTAL_CD_MASTER Table Long Name: Postal Code Master - IHC Description: Contains Postal (Zip) codes for the IHC referral region and IHC specific descriptions. These descriptions allow for specific IHC groupings used in various analyses. Data Steward: Jim Allred, ext. 3518
  • 25. 25 Some Info Is Free It can be collected from the database. For example:  Primary and Foreign Keys  Indexed Columns  Table Creation Dates
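As a rough illustration of harvesting the "free" metadata, the sketch below reads primary keys, foreign keys, and indexed columns from a database catalog. It uses Python's built-in `sqlite3` module and SQLite `PRAGMA` statements purely as a stand-in; the slides describe an Oracle warehouse, where the equivalent information lives in the data dictionary views. The table names and the `collect_free_metadata` helper are hypothetical. SQLite does not record table creation dates, so that item is not shown.

```python
import sqlite3


def collect_free_metadata(conn: sqlite3.Connection, table: str) -> dict:
    """Read key and index metadata for one table from the SQLite catalog."""
    # table_info rows: (cid, name, type, notnull, dflt_value, pk)
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    primary_key = [row[1] for row in columns if row[5] > 0]

    indexed = set()
    # index_list rows start with (seq, name, ...); index_info rows: (seqno, cid, name)
    for idx in conn.execute(f'PRAGMA index_list("{table}")').fetchall():
        for info in conn.execute(f'PRAGMA index_info("{idx[1]}")').fetchall():
            indexed.add(info[2])

    # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
    foreign_keys = [
        (row[3], row[2], row[4])
        for row in conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
    ]
    return {
        "primary_key": primary_key,
        "indexed_columns": sorted(indexed),
        "foreign_keys": foreign_keys,
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postal_cd_master (postal_cd TEXT PRIMARY KEY, region_nm TEXT)")
conn.execute(
    "CREATE TABLE patient_visit ("
    " visit_id INTEGER PRIMARY KEY,"
    " postal_cd TEXT REFERENCES postal_cd_master (postal_cd),"
    " visit_dt TEXT)"
)
conn.execute("CREATE INDEX ix_visit_postal ON patient_visit (postal_cd)")

meta = collect_free_metadata(conn, "patient_visit")
```

The result could seed the structural columns of a metadata repository, leaving the descriptive fields for the data stewards to fill in.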
  • 26. 26 Most Valuable Info is Subjective The human element  Most metadata is not automatically collected by tools because it does NOT exist in that form  Interviews with data stewards are the key  It can take months (and months and months) of effort to collect initial metadata.
  • 27. 27 Holding Feet to the Fire  Made data architects responsible for metadata in their subject areas  Metadata completion reports in every staff meeting for a year  Standing rule: No new data marts go live without metadata
  • 28. 28 Is it all worth it? Data analysts think so. “I couldn’t do my job without it.” It will push the ROI of a ho-hum DW into the stratosphere. It does for DW’ing what the Yellow Pages did for the business ROI of the telephone.
  • 29. 29 It Gets Used At Intermountain Health Care  210 web hits on average each week day  (23,000 employees, $2B revenue) Avg Hits by Day of Week (April 2004 - Sep 2004): MON 189, TUE 217, WED 212, THU 240, FRI 188
  • 31. 31 Report Quality  A function of…  Data quality  How well does the report reflect the intent behind the question being asked?  “This report doesn’t make sense. I’m trying to find out how many widgets we can produce next year, based on the last four years’ production.”  “That’s not what you asked for.”  SQL and other programming accuracy  Statistical validity– population size of the data  Timeliness of the data relative to the decision  Event Correlation  Best Practice:  An accompanying “meta-report” for every report that involves significant, high risk decisions
  • 32. 32 Meta Report A document, associated with a published report, which defines the report.
  • 33. 33 Repository A central place for storing and sharing information about business reports
  • 34. 34 IHC Analyst Use of Meta Reports (Data Collected Aug-04, N=32) Chart values: 37%, 89%, 21%, 95% Chart categories: Read Others, Search Duplication, Search SQL, Audience Request
  • 35. 35 Meta Report  Core Elements  Author Information  Report Name  Report Purpose  Data Source(s)  Report Methods  Recommended Elements  Business Owner  Run Frequency  Intended Audience  Statistical Tests  Software Used  Source Code  Formulas  Relevant Issues & Commentary
  • 36. 36 • Title • Location • Author • Owner
  • 37. 37 • Purpose • Frequency • Audience • Data Source(s)
  • 38. 38 • Selection Criteria • Statistics • Software • Source Code • Formulas
  • 42. 42 Think: Mission Control  Customized ETL Library  Schedule of operations  Alerting tool  Storage strategies / backups  Development philosophy and environment  Performance—monitoring and tuning Operations Best Practices
  • 43. 43  EDW  Oracle v 9.2.0.3 on AIX 5.2  Storage: IBM SAN (shark), >3T  ETL tools  Ascential’s Data Stage  Kornshell (unix), SQL scripts, PL/SQL scripting  OLAP: MS’ Analysis Services  BI: Business Objects (Crystal Enterprise)  With a Cube presentation layer  Dashboard: Visual Mining’s Net Charts  EDW Team: ~16 FTEs, plus SAs and DBAs IHC Architecture
  • 45. 45  One of our ETL programmers noticed he kept doing the same things over and over for all of his ETL jobs. Rather than copying and pasting this repetitive code, he created a library. Now we all use the ETL Library.  We named the library EDW_UTIL (EDW Utilities) History
  • 46. 46 Implementation  Executes via Oracle stored procedures  Supported by associated tables to hold data when necessary  Error table  QA table  Index table
  • 47. 47 Benefits  Provides standardization  Eliminates code rewrites  Can hide complexities  Such as the appropriate way to analyze and gather statistics on tables  Very accessible to all of our ETL tools  Simply an Oracle stored procedure call
  • 48. 48 Index Management  Past process included:  Dropping the table’s indexes with a script  Loading the table  Creating the indexes with a script  The past process resulted in messy scripts to manage and coordinate
  • 49. 49 Index Management  New process includes:  Capturing a table’s existing index metadata  Dropping the table’s indexes with a single procedure call  Loading the table  Recreating the indexes with a single procedure call  There are no more messy scripts to manage and coordinate  No more “lost” indexes that someone neglected to add to the create-index script
  • 50. 50 Index Management  Samples  IMPORT_SCHEMA_INDEX_DATA  IMPORT_TABLE_INDEX_DATA  DROP_TABLE_INDEXES  CREATE_TABLE_INDEXES
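The capture, drop, load, recreate cycle described on the Index Management slides might look roughly like the following. This is a minimal sketch, not the EDW_UTIL implementation: it uses Python's `sqlite3` and the `sqlite_master` catalog in place of Oracle stored procedures, and the function names merely echo the procedure names listed above. The `casemix` table and its indexes are hypothetical.

```python
import sqlite3


def capture_index_ddl(conn: sqlite3.Connection, table: str) -> dict:
    """Capture the CREATE INDEX statement for every explicit index on a table."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? AND sql IS NOT NULL",
        (table,),
    ).fetchall()
    return dict(rows)


def drop_table_indexes(conn: sqlite3.Connection, index_ddl: dict) -> None:
    """Drop every captured index in one call."""
    for name in index_ddl:
        conn.execute(f'DROP INDEX "{name}"')


def create_table_indexes(conn: sqlite3.Connection, index_ddl: dict) -> None:
    """Recreate the indexes from the captured DDL, so none can be forgotten."""
    for ddl in index_ddl.values():
        conn.execute(ddl)


def reload_table(conn: sqlite3.Connection, table: str, rows: list) -> list:
    """Capture indexes, drop them, load the rows, then rebuild the indexes."""
    index_ddl = capture_index_ddl(conn, table)
    drop_table_indexes(conn, index_ddl)
    placeholders = ", ".join("?" * len(rows[0]))
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    create_table_indexes(conn, index_ddl)
    return sorted(index_ddl)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE casemix (case_id INTEGER, drg_cd TEXT)")
conn.execute("CREATE INDEX ix_casemix_case ON casemix (case_id)")
conn.execute("CREATE INDEX ix_casemix_drg ON casemix (drg_cd)")

rebuilt = reload_table(conn, "casemix", [(1, "470"), (2, "291")])
```

Because the index definitions come from the catalog at load time rather than from a hand-maintained script, an index added later is picked up automatically.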
  • 51. 51 Background Loading of Tables  We often load data into tables which are not accessible to end users. A simple rename puts them into production.  Helps transfer the identical attributes from the live to the background table  Samples  COPY_TABLE_METADATA  TRANSFER_TABLE_PRIVS  DROP_TABLE_INDEXES  CREATE_TABLE_INDEXES (Create on background table, identical to production table)
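The background-load-then-rename idea can be illustrated with a small sketch. It assumes SQLite via Python's `sqlite3` rather than Oracle, and the `swap_in_background_table` helper and table names are hypothetical; it makes no claim about how the swap is made safe for concurrent users in a production warehouse.

```python
import sqlite3


def swap_in_background_table(conn: sqlite3.Connection, live: str, background: str) -> None:
    """Promote a fully loaded background table to production by renaming it."""
    retired = f"{live}_old"
    conn.execute(f"ALTER TABLE {live} RENAME TO {retired}")
    conn.execute(f"ALTER TABLE {background} RENAME TO {live}")
    conn.execute(f"DROP TABLE {retired}")


conn = sqlite3.connect(":memory:")

# The live table that end users query.
conn.execute("CREATE TABLE casemix (case_id INTEGER, drg_cd TEXT)")
conn.execute("INSERT INTO casemix VALUES (1, '470')")

# The background table is loaded out of sight of end users.
conn.execute("CREATE TABLE casemix_bg (case_id INTEGER, drg_cd TEXT)")
conn.executemany("INSERT INTO casemix_bg VALUES (?, ?)", [(1, "470"), (2, "291")])

swap_in_background_table(conn, "casemix", "casemix_bg")

live_rows = conn.execute("SELECT case_id, drg_cd FROM casemix ORDER BY case_id").fetchall()
table_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
]
```

In the process the slide describes, privileges and indexes are copied to the background table before the rename so that it matches the production table.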
  • 52. 52 Load Times, Errors, QA  We had no idea who was loading what and when  Each staff member logged in their own way and for their own interest  ETL error capturing and QA was difficult  We can now capture errors and QA information in a somewhat standardized fashion
  • 53. 53 Load Times, Errors, QA Samples  BEGIN_JOB_TIME  (ex: CASEMIX)  BEGIN_LOAD_TIME  (ex: CASEMIX INDEX)  END_LOAD_TIME  END_JOB_TIME  COMPLETE_LOAD_TIME (Begin and end together)  LOAD_TIME_ERROR (Alert on these errors)  LOAD_TIME_METRICS QA (row counts)
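One way to picture the begin/end logging calls is a small wrapper that records when a job starts, when it ends, and any error it raised. The sketch below is an assumption-laden illustration in Python with `sqlite3`: the `load_time` table layout and the `job_timer` helper are hypothetical and only loosely mirror the BEGIN_JOB_TIME, END_JOB_TIME, and LOAD_TIME_ERROR procedures named on the slide.

```python
import sqlite3
from contextlib import contextmanager
from datetime import datetime, timezone


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@contextmanager
def job_timer(conn: sqlite3.Connection, job_name: str):
    """Log the begin time, then the end time (and any error) for one ETL job."""
    cur = conn.execute(
        "INSERT INTO load_time (job_name, begin_ts) VALUES (?, ?)", (job_name, _now())
    )
    job_id = cur.lastrowid
    try:
        yield job_id
    except Exception as exc:
        # Record the failure so an alert can pick it up, then re-raise.
        conn.execute(
            "UPDATE load_time SET end_ts = ?, error_txt = ? WHERE job_id = ?",
            (_now(), str(exc), job_id),
        )
        raise
    else:
        conn.execute("UPDATE load_time SET end_ts = ? WHERE job_id = ?", (_now(), job_id))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE load_time ("
    " job_id INTEGER PRIMARY KEY, job_name TEXT,"
    " begin_ts TEXT, end_ts TEXT, error_txt TEXT)"
)

with job_timer(conn, "CASEMIX"):
    pass  # the real load would run here

try:
    with job_timer(conn, "CASEMIX INDEX"):
        raise RuntimeError("source file missing")
except RuntimeError:
    pass

log_rows = conn.execute(
    "SELECT job_name, end_ts IS NOT NULL, error_txt FROM load_time ORDER BY job_id"
).fetchall()
```

With every job logging through the same wrapper, a schedule-of-operations report becomes a query over one table.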
  • 54. 54 Miscellaneous Procedures  Hide the “gory” details from the majority of the EDW team  Such as Oracle’s table analyze command  Gives us consistent application of system wide parameters such as:  A new box with a different number of CPUs (parallel slaves) or  A new version of Oracle  We populate some metadata too, such as last load date
  • 55. 55 DW Schedule of Operations  Some loads are ad hoc, not scheduled  Users query in an ad hoc fashion  We have a minimal service/application tier implemented (loss of control)  Use of a variety of ETL tools  Use of a variety of user categories  DBA, SA, ETL user, end users  Use of a variety of servers  Production EDW, Stage EDW, ETL servers, OLAP servers, Presentation layer servers
  • 56. 56 General Approach  Focus on load jobs against production EDW  Still working on all the reporting aspects (a sample on the next slide)  Pull this information out of the “load times” data captured by these ETL library calls  BEGIN_JOB_TIME  BEGIN_LOAD_TIME  END_LOAD_TIME  END_JOB_TIME  COMPLETE_LOAD_TIME
  • 58. 58 DW Alerting Tool  DW alerting  Aggregate data alerts, such as, your average length of stay just crossed a certain threshold  A simple tool was created which sends a text email, based on existence of data returned from a query  Primarily embraced by DW team members for internal DW operations, not that the original intent is abandoned
  • 59. 59 Features  Web based  Open to all EDW users  Run daily, weekly, every two weeks, monthly, quarterly (wakes every 5 minutes)  This is a passive polling  Ability to enter query in SQL  Alert (email) on 3 situations  Query returns data  Query returns no data  Always
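The three alert situations listed above reduce to a small decision on whether a query returned any rows. The following Python sketch shows that decision only; the `AlertWhen` enum, the `should_alert` function, and the `etl_error` table are hypothetical names, and sending the email and scheduling the poll are left out.

```python
import sqlite3
from enum import Enum


class AlertWhen(Enum):
    RETURNS_DATA = "returns_data"
    RETURNS_NO_DATA = "returns_no_data"
    ALWAYS = "always"


def should_alert(conn: sqlite3.Connection, query: str, when: AlertWhen) -> bool:
    """Decide whether to send the alert, based on whether the query returns rows."""
    has_rows = conn.execute(query).fetchone() is not None
    if when is AlertWhen.ALWAYS:
        return True
    if when is AlertWhen.RETURNS_DATA:
        return has_rows
    return not has_rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE etl_error (job_name TEXT, error_txt TEXT)")
conn.execute("INSERT INTO etl_error VALUES ('CASEMIX', 'duplicate key')")

# "Manage by exception": alert only when the error query finds something.
error_query = "SELECT job_name, error_txt FROM etl_error"
alert_on_errors = should_alert(conn, error_query, AlertWhen.RETURNS_DATA)

# A heartbeat-style check: alert when an expected row is missing.
missing_query = "SELECT 1 FROM etl_error WHERE job_name = 'PHARMACY'"
alert_on_missing = should_alert(conn, missing_query, AlertWhen.RETURNS_NO_DATA)
```

A scheduler that wakes every few minutes would call `should_alert` for each alert whose run frequency is due.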
  • 61. 61 Examples  ~100 alerts in use  Live performance check  Every 4 hours: look for inactive sessions holding active slaves  Daily: look for any active sessions older than 72 hours  ETL monitoring; alert only if problem  Alert on errors logged via the EDW_UTIL library (manage by exception)  Alert on existence of “bad” records captured during ETL
  • 62. 62 Storage and Backup  Inherited state of affairs  Running like any OLTP database  High end expensive SANs (storage area networks)  FULL nightly online backups  Out of space? Just buy more
  • 63. 63 Nightmare in the Making  Exponential growth  More data sources  More summary tables  More indexes  No data has yet been purged  Relaxed attitude  Disk is cheap  Reality: Disk management is expensive
  • 64. 64 Looming Crisis  Backups often run 16 hours or more  Performance degradation witnessed by users  Good backups obtained less than 50% of the time  Literally running out of space  Gross underestimating  Some reckless overuse  Financial $$$$ cost  The system administrators (SAs) quadruple the price of disk purchase from the previous budget year. Ouch!  SAs roll in the price of tape drives, etc.
  • 65. 65 Major Changes in Operations  Transfer some disk ownership AND backup responsibilities to DW team, away from SAs and DBAs  EDW team more aware of upcoming space demands  EDW team more in tune with which data sets are easily recreated from the source (don’t need a backup)  Stop performing full daily backups  Move towards less expensive disk option  IBM offers a few levels of SANs
  • 67. 67 Changes to Backup Strategy  Perform full backup once monthly during downtime  Perform no data backup on DEV/STAGE environments  Do backup DDL (all code) daily in all environments  Implement daily “incremental” backup
  • 68. 68 Daily Incremental Backups  Easier said than done  We’ve resorted to a table level backup (in Oracle, that’s an EXPORT)  The EDW team owns which tables are exported  EDW team populates a table, the “export table list”, with each table’s export frequency  Populated via an application in development  The DBAs run an export based on the “export table list”
  • 69. 69 Use Cheaper Disk  General practice: You can take greater risks with DW reliability and availability vs. OLTP systems  Use it to your advantage  Our SAN vendor (IBM) offers a few levels of SANs. Next level down is a big step down in price, small step down in features.  Feature loss:  Read cache (referring to disk cache, not box memory).  We rarely read the same thing twice anyway  No “phone home” to IBM (auto paging)  Mean time to failure is lower, but still acceptable
  • 70. 70 Performance Monitoring & Tuning  Err on the side of freedom and empowerment  How much harm can really be done?  We’d rather not constrain our customers  “Pounding queries” do find their way to production  Opportunity to educate users  Opportunity for us to tune underlying structures
  • 71. 71 The Focus Areas  Indexing  Well-defined criteria for when and how to apply indexes  Is this a lost art?  Big use of BITMAPS  Composite index trick (acts like a table)  Partitioning for performance, rather than data management  Exploiting Oracle’s Direct Path INSERT feature  Avoiding UPDATE and DELETE commands  Copy with MINUS instead  Implementing Oracle's Parallel Query  Turn off referential integrity in the DW.. no brainer  That’s the job of the source system
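The "copy with MINUS" bullet can be read as: rather than deleting rows in place, build a new table that holds everything except the unwanted rows, then swap it in. The sketch below illustrates that reading with Python's `sqlite3`, where the set-difference operator is spelled `EXCEPT` (Oracle spells it `MINUS`). The table names and data are hypothetical, and this is an interpretation of the bullet rather than the team's actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?)", [(1, "PAID"), (2, "VOID"), (3, "PAID")]
)

# Rows that would otherwise be removed with a DELETE statement.
conn.execute("CREATE TABLE claims_to_remove (claim_id INTEGER, status TEXT)")
conn.execute("INSERT INTO claims_to_remove VALUES (2, 'VOID')")

# Copy everything except the unwanted rows into a fresh table.
# EXCEPT is a set operation, so exact duplicate rows are also collapsed.
conn.execute(
    "CREATE TABLE claims_new AS "
    "SELECT claim_id, status FROM claims "
    "EXCEPT "
    "SELECT claim_id, status FROM claims_to_remove"
)

# Swap the copy into place instead of modifying the original rows.
conn.execute("DROP TABLE claims")
conn.execute("ALTER TABLE claims_new RENAME TO claims")

remaining = conn.execute(
    "SELECT claim_id, status FROM claims ORDER BY claim_id"
).fetchall()
```

The appeal in a warehouse setting is that a bulk copy into a new table can use fast insert paths, whereas row-by-row UPDATE and DELETE work tends to be slow on large tables.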
  • 72. 72 DW Monitoring: Empowering End Users  Motive  Too many calls from end users about their queries  “Please kill it.”  “Is it still running or is my PC locked up?”  “Why is the DW so slow?”  Give them the insight and tools  Give them the ability to kill their own queries  Still in the works
  • 74. 74 Tracking Long-Running Queries  We use Pinecone (from Ambeo) to monitor the duration of all queries and the SQL  Each week, we look at the top few  Typical outcome?  We’ll add indexes  We’ll denormalize  We'll contact the user and assist them with writing a better query
  • 75. 75 The DW Sandbox  More empowerment for customers  Motive  Lots of little MS Access databases (with valuable data) spread all over the place  Needed to be joined with DW data  Costly to maintain  PC hogs  Solution  Provide customers with their own “sandbox” on the DW, with DBA-like priv’s
  • 76. 76 Features  Web based tool for creating tables and loading MS Access data to the DW  Simple, easy to use interface  Privileges  Users have full rights to the tables they create  Can grant rights to others  Big, big victory for customer service and data “maturity”  10% of DW customers use the Sandbox  About 600 tables in use now  About 2G of data
  • 77. 77 Design-Build Best Practices  Build vertically, design horizontally  Start by building data marts that address analytic needs in one area of the business with a fairly limited data set  But, design with the horizontal needs of the company in mind, so that you will eventually “tie” all of these vertical data marts together with a common semantic layer
  • 78. 78 Creating Value In Both Axes Build Design
  • 79. 79 For Example… Oncology Data Integration Strategy: An Integrated Reporting Model of Cancer Patient’s Data. Top down reporting requirements and data model. Disparate Sources “connected” semantically to the data bus: Cancer Registry, Mammography, Radiology, Pathology, Laboratory, Continuing Care and Follow-Up, Quality of Life Survey, Radiation Therapy, Health Plans Claims, Ambulatory Casemix, Acute Care Casemix
  • 80. 80 The Logic Layer in Data Warehouses Source System ETL Process Data Warehouse Reports Data Layer Logic Layer Presentation Layer Analytic Systems Transaction Systems HereNot Here
  • 81. 81 Evidence of Business Process Alignment 1. Map out your high level business process  Don’t fall prey to analysis paralysis with endless business process modeling diagrams! 2. Identify and associate the transaction systems that support those processes 3. Identify the common, overlapping semantics/data attributes and their utilization rates 4. Build your data marts within an enterprise framework that is aligned with the processes you are trying to understand
  • 82. 82 For example… DiagnosisHealth Need Patient Perception Procedure Results & Outcomes Episode of Care AP/AR Claims ProcessingHealthcare business process HELP Lab HPI MC400 SurveyAS400IDX HDMCIS/CDRHNA Supported by non-integrated data in Transaction Systems… Rx Integrated in the Data Warehouse Data Warehouse
  • 83. 83 Event Correlation  A leading edge Best Practice  The third dimension to rows and columns  Overlays the data that underlies a report or graph  “In 2004, we experienced a drop in revenue as a result of the earthquake that destroyed our plant in the Philippines.”  “In January of 2005, we saw a spike in the North America market for snow shovel sales that coincided with an increase in sales for pain relievers. This correlates to the record snowfall in that region and should not be considered a trend. Barring major product innovation, we consider the market for snow shovels in this area as saturated. Sales will be slow for the next several years.”
  • 84. 84 Standardizing Semantics  The sweet irony: there are many synonyms for “standard semantics”  Data dictionary  Vocabulary  Dimensions  Data elements  Data attributes  The bottom line issue: Standardizing the terms you use to describe key facts about your business
  • 85. 85 Standardizing “Names of Things”  You better do it within the first two months of your data warehouse project  If you are beyond that point, you better stop and do it now, lest you pay a bigger price later  Don’t…  Push the standard on the source systems, unless it’s easy to accomplish  This was one of the common pitfalls of early data warehousing project failures  Try to standardize everything under the sun!  Focus on the high value facts
  • 86. 86 Where Are The “High Value” Semantics? In the high-overlap, high-utilization areas… Source System X Source System Y Source System Z Highest value area for standardizing semantics
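One way to locate the high-overlap, high-utilization area is to count how many source systems share each attribute and break ties by how often it is queried. The sketch below is illustrative only: the system names, attribute names, and query counts are invented for the example and do not come from the presentation.

```python
from collections import Counter

# Hypothetical attribute inventories for three source systems.
source_attributes = {
    "system_x": {"patient_id", "diagnosis_code", "admit_date", "bed_number"},
    "system_y": {"patient_id", "diagnosis_code", "claim_amount"},
    "system_z": {"patient_id", "admit_date", "survey_score"},
}

# Hypothetical utilization: queries per month that touch each attribute.
monthly_queries = {
    "patient_id": 900,
    "diagnosis_code": 400,
    "admit_date": 250,
    "claim_amount": 120,
    "survey_score": 30,
    "bed_number": 5,
}


def rank_semantics(sources, usage):
    """Rank attributes by (systems sharing them, utilization), highest first."""
    overlap = Counter(attr for attrs in sources.values() for attr in attrs)
    return sorted(overlap, key=lambda a: (overlap[a], usage.get(a, 0)), reverse=True)


ranking = rank_semantics(source_attributes, monthly_queries)
print(ranking[:3])  # the first candidates for standard names
```

Attributes at the top of the ranking are the ones worth standardizing first; those at the bottom can usually keep their source-system names.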
  • 88. 88 The Standard Semantic “Layer” Data Warehouse Source Systems Extract, Transform, Load Semantic Standards
  • 89. 89 Data Modeling  Star schemas are great and simple, but they aren’t the end-all, be-all of analytic data modeling  Best practices: Do what makes sense– don’t be a schema bigot  I’ve seen great analytic value from 3NF models  Maintain data familiarity for your customers  When meeting vertical needs  Don’t make massive changes to the way the model looks and feels, nor the naming conventions– you will alienate existing users of the data  Use views to achieve “new” or standards-compliant perspectives on data  When meeting horizontal needs
  • 90. 90 For Example… Source perspective DW perspective Similar names & organization Vertical data customer Horizontal data customer “Standardized” view
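The "standardized view" idea can be sketched with SQLite from Python's standard library: the table keeps the names existing (vertical) users already know, while a view offers the same rows under standard names for horizontal users. The table, view, and column names here are hypothetical, and a production warehouse would use its own database's view syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The table keeps the source system's familiar (hypothetical) column names.
conn.execute("CREATE TABLE lab_result (pt_no INTEGER, rslt_dt TEXT, rslt_val REAL)")
conn.execute("INSERT INTO lab_result VALUES (1001, '2005-06-01', 4.2)")

# The view exposes the same rows under hypothetical enterprise-standard names.
conn.execute(
    "CREATE VIEW std_lab_result AS "
    "SELECT pt_no AS patient_id, rslt_dt AS result_date, rslt_val AS result_value "
    "FROM lab_result"
)

cur = conn.execute("SELECT * FROM std_lab_result")
standard_columns = [d[0] for d in cur.description]
standard_rows = cur.fetchall()
print(standard_columns, standard_rows)
```

Nothing is copied or renamed in the underlying table, so existing reports keep working while new, cross-mart reports use the view.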
  • 91. 91 The Case For Timely Updates Chart: % of requests for data (utilization, 0 to 100) vs. data age (today, 1 year, 2 years) Generally, to minimize Total Cost of Ownership (TCO), you should update no more often than the decision making cycle associated with the data requires. But… everyone wants more timely data.
  • 92. 92 Best Practice: Measure Yourself  Employee satisfaction  Customer satisfaction  Average number of queries/month  Number of queries above a threshold (30 minutes?)  Average query response time  Total number of records  Total number of query-able tables  Total number of query-able columns  Number of “users”  Average rows delivered per month  Storage utilization  CPU utilization  Downtime per month by data mart The Data Warehouse Dashboard
  • 93. 93 Other Best Practices  The Data Warehouse Information Systems Team reports to the CIO  Most data analysts can and probably should report to the business units  Change management/service level agreements with the source systems  No changes in the source systems unless they are coordinated with the data warehouse team
  • 94. 94 More Best Practices  Skills of the Data Warehouse IS Team  Experienced chief architect/project manager  Procedural/script programmers  SQL/declarative programmers  Data warehouse storage management architects  Data warehouse hardware architects and system administrators  Data architects/modelers  DBAs
  • 95. 95 More Best Practices  Evidence of project collaboration  A cross section of members and expertise from the data warehouse IS team  Statisticians and data analysts who understand the business domain  A customer that understands the process(es) being measured and can influence change  A data steward– usually someone from the front lines who knows how the data is collected Project = complex reports or a data mart
  • 96. 96 More Best Practices  Whenever possible, extract as close to the primary source as you can Primary Source Copy A Copy B Data Warehouse Best Practice Path
  • 97. 97 The Most Popular Authors  I appreciate…  The interest they stir  The vocabulary– semantics– of this new specialty that they helped create  The downside…  The buzzwords that are more buzz than substance  “Corporate Information Factories”  Endless, meaningless debate  “That’s not an Operational Data Store!”  “Do you follow Kimball or Inmon?”  Follow your own common sense  Most of these authors have not had to build a data warehouse from scratch and live with their decisions through a complete lifecycle
  • 98. 98 ETL Operations  Besides the cultural risks and challenges, the riskiest part of a data warehouse…  The Extract, Transform, and Load processes  Worthy of its own “Best Practices” discussion  Suffice it to say, mitigate risks in this area carefully and deliberately  The major design errors don’t show up until late in the lifecycle, when the cost of repair is great  Good book  Westerman, WalMart Data Warehousing
  • 99. 99 Two Essential ETL Functions  Initial loads  How far back do we go in history?  Maintenance loads  Differential loads or total refresh?  How often?  You will run and tune these processes several times before you go into production  How many records are we dealing with?  How long will this take to run?  What’s the impact on the source system performance?
  • 100. 100 Maintenance Loads  Total refresh vs. Incremental loads  Total refresh: Truncate and reload everything from the source system  Incremental: Load only the new and updated records  For small data sets, a total refresh strategy is the easiest to implement  How do you define “small”? You will know it when you don’t see it.  Sometimes the fastest strategy when you are trying to show quick results  Grab and go…
  • 101. 101 Incremental Loads  How do we get a snapshot of the data that has changed since the last load?  Many source systems will have an existing log file of some kind  Take advantage of these when you can, otherwise incremental loads can be complicated
  • 102. 102 File Transfer Formats Design your extract so that it uses…  Fixed, predetermined length for all records and fields  Avoid variable length if at all possible  A unique character that separates each field in a record, such as ~  A standard format for header records across all source systems  Such as the first three records in each file  Include the name of the source system, the file name, the record count, and the number of fields per record  This will be handy for monitoring jobs and collecting load metadata
  • 103. 103 Benefits of Standard File Transfer Format  Compatible with standard database and operating system utilities  Dynamically create initial and maintenance load scripts  Read the table definitions (DDL) then merge that with the standard transfer file format  Dynamically generate load monitoring data  Read the header row, insert that into a “Load Status” table with status “Running”, # of records, start time  At EOF, change status to “Complete” and capture end of load time  I wish I had thought about this topic more, and earlier in my career
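The load-monitoring idea above can be sketched in a few lines of Python with SQLite. This is a minimal illustration under stated assumptions, not the author's implementation: it assumes a single header record of the form source~file~record_count~field_count (the slides suggest up to three header records), and the `load_status` and `sales` tables, the file name, and the "Count mismatch" status are all invented for the example.

```python
import sqlite3
from datetime import datetime, timezone


def utc_now():
    return datetime.now(timezone.utc).isoformat()


def load_with_status(conn, lines, table):
    """Load '~'-delimited lines; the first line is the assumed header record."""
    source, file_name, n_records, n_fields = lines[0].split("~")
    n_records, n_fields = int(n_records), int(n_fields)

    # Header read: register the load as "Running" with its expected record count.
    cur = conn.execute(
        "INSERT INTO load_status (source, file_name, expected, status, started) "
        "VALUES (?, ?, ?, 'Running', ?)",
        (source, file_name, n_records, utc_now()),
    )
    load_id = cur.lastrowid

    placeholders = ", ".join("?" * n_fields)
    loaded = 0
    for line in lines[1:]:
        fields = line.split("~")
        if len(fields) != n_fields:
            raise ValueError(f"expected {n_fields} fields, got {len(fields)}")
        # The table name is trusted here; it comes from the job definition.
        conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", fields)
        loaded += 1

    # End of file: record the outcome and the end time.
    status = "Complete" if loaded == n_records else "Count mismatch"
    conn.execute(
        "UPDATE load_status SET status = ?, loaded = ?, finished = ? WHERE id = ?",
        (status, loaded, utc_now(), load_id),
    )
    conn.commit()
    return load_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE load_status (id INTEGER PRIMARY KEY, source TEXT, file_name TEXT, "
    "expected INTEGER, loaded INTEGER, status TEXT, started TEXT, finished TEXT)"
)
conn.execute("CREATE TABLE sales (sale_id INTEGER, region TEXT, amount REAL)")

extract = ["POS~sales_20050601.txt~2~3", "1~West~19.99", "2~East~5.00"]
load_id = load_with_status(conn, extract, "sales")
final_status = conn.execute(
    "SELECT status, loaded FROM load_status WHERE id = ?", (load_id,)
).fetchone()
print(final_status)
```

Because every source system sends the same header layout, the same function can monitor every load, and the `load_status` table doubles as load metadata.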
  • 104. 104 Westerman Makes A Good Point  My experience: ETL is the least tasteful and productive use of a veteran EDW Team member, so I like Westerman’s insight on this topic  If you design for instantaneous updates from the beginning, it translates to less ETL maintenance and labor time for the EDW staff, later
  • 105. 105 Messaging Applied to ETL  Basic concepts  Use a load message queue for records that need to be updated, coming from the source systems  When the EDW analytical processing workload is low (off-peak), pick the next message off the load queue and load the data  Run this in parallel so that you can process several load messages at the same time while you have a window of opportunity  Sometimes called “throttling”  Speed up and slow down based upon traffic conditions  Motive behind the concept  Continuous updates in a mixed workload environment  Mixed: Analytical processing at the same time as transaction oriented, constant updates, deletes, inserts
  • 106. 106 ETL Message Queue Process Source Systems •Updates •Inserts •Deletes ETL Message Queue ETL Manager Database Workload and Performance Metrics EDW Production Tables
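The throttling behavior of the ETL Manager can be sketched as a loop that checks the warehouse workload before taking each batch of messages off the load queue. This is a simplified, single-threaded illustration: the 0.7 threshold, the batch size, and the workload readings are assumptions, and a real ETL Manager would read workload from database performance metrics and could apply each batch in parallel.

```python
from collections import deque


def drain_load_queue(queue, current_workload, apply_message,
                     busy_threshold=0.7, batch_size=2):
    """Apply queued load messages while the warehouse is below the busy threshold."""
    applied = 0
    while queue and current_workload() < busy_threshold:
        # Take a small batch, then re-check the workload before continuing.
        for _ in range(min(batch_size, len(queue))):
            apply_message(queue.popleft())
            applied += 1
    return applied


# Hypothetical workload readings (fraction of capacity in use), one per check.
readings = iter([0.2, 0.4, 0.9])
queue = deque(f"msg-{i}" for i in range(5))
loaded = []

applied = drain_load_queue(queue, lambda: next(readings), loaded.append)
print(applied, list(queue))
```

When the third reading shows the warehouse is busy, the loop stops and the remaining message waits on the queue for the next quiet window.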
  • 107. 107 Four Data Maintenance Processes  Initial load  Loading into an empty table  Append load  Update process  Delete process  As much as practical, use your database utilities for these processes  Study and know your database utilities for data warehousing; they are getting better all the time  I see some bad strategies in this area-- companies spending time building their own utilities…aye cucumber!
  • 108. 108 A Few Planning Thoughts  Understand the percentage of records that will be updated, deleted, or inserted  You’ll probably develop a different process for 90% inserts vs. 90% updates  Logging  In general, turn logging off during the processes, if logging was on at all  Field vs. Record level updates  Some folks, in the interest of purity, will build complex update processes for passing only field (attribute) level changes  No brainer: Pass the whole record
  • 109. 109 Initial Load  Every table will, at some time, require an initial load  For some tables, it will be the best choice for data maintenance  Total data refresh  Best for “small” tables  Simple process to implement  Simply delete (or truncate) and reload with fresh data
  • 110. 110 A Better Initial Load Process  Background load  Safer– protects against corrupt files  Higher availability to customers  Three or four steps… maybe 6? 1. Create a temporary table 2. Load the temporary table 3. Run quality checks 4. Rename the temporary table to the production table name 5. Delete the old table 6. Regrant rights, if necessary  Westerman: “You want to use as many initial load processes as possible.”  I agree!
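The background load steps above can be sketched with SQLite as follows. It is a minimal illustration under stated assumptions: the quality check is only a minimum row count, the `_new` and `_old` table suffixes and the `region_dim` table are invented for the example, and step 6 (regranting rights) is omitted because SQLite has no privilege system.

```python
import sqlite3


def background_reload(conn, table, ddl_columns, rows, min_rows=1):
    """Reload a table in the background, swapping it in only if checks pass."""
    temp, old = f"{table}_new", f"{table}_old"

    # 1. Create a temporary table.
    conn.execute(f"DROP TABLE IF EXISTS {temp}")
    conn.execute(f"CREATE TABLE {temp} ({ddl_columns})")

    # 2. Load the temporary table.
    if rows:
        placeholders = ", ".join("?" * len(rows[0]))
        conn.executemany(f"INSERT INTO {temp} VALUES ({placeholders})", rows)

    # 3. Run quality checks (here, only a minimum row count).
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {temp}").fetchone()
    if count < min_rows:
        conn.execute(f"DROP TABLE {temp}")
        conn.commit()
        raise ValueError("quality check failed; production table left untouched")

    # 4. Rename the temporary table to the production table name.
    conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
    conn.execute(f"ALTER TABLE {temp} RENAME TO {table}")

    # 5. Delete the old table.
    conn.execute(f"DROP TABLE {old}")
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE region_dim (region_id INTEGER, name TEXT)")
conn.execute("INSERT INTO region_dim VALUES (1, 'Stale')")
conn.commit()

columns = "region_id INTEGER, name TEXT"
reloaded = background_reload(conn, "region_dim", columns, [(1, "West"), (2, "East")])
print(reloaded)
```

The production table stays queryable until the rename, and a corrupt or empty file never reaches it because the swap happens only after the check passes.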
  • 111. 111 Append Load  For larger tables that accumulate historical data  There are no updates, just appends  A hard fact that will not change  Example  Sales that are closed  Lab results
  • 112. 112 Append Load Options  Load a single part of a table  Load a partition and ‘attach’ it to the table  Create a new, empty partition  Load the new records  Attach the partition to the table  Look to use the “LOAD APPEND” command in your database
  • 113. 113 Another Append Option 1. Create a temp table identical to the one you are loading 2. Load the new records into the empty temp table 3. Issue INSERT/SELECT INSERT INTO Big_Table (SELECT * FROM Temp_Big_Table) 4. Delete the temp table Rule of thumb: if the number of records in the temp table is much smaller than the number of records in the big table, this is a good technique; otherwise, it is not
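The four append steps can be run end to end with SQLite as a rough sketch. The `Big_Table` and `Temp_Big_Table` names follow the slide; the columns and sample rows are invented, and `CREATE TABLE ... AS SELECT ... WHERE 0` is just one convenient way to get an empty copy of the table structure in SQLite.

```python
import sqlite3


def append_via_temp(conn, big_table, new_rows):
    """Append new rows to a large table through an identical temp table."""
    temp = f"Temp_{big_table}"

    # 1. Create an empty temp table with the same columns as the big table.
    conn.execute(f"DROP TABLE IF EXISTS {temp}")
    conn.execute(f"CREATE TABLE {temp} AS SELECT * FROM {big_table} WHERE 0")

    # 2. Load the new records into the empty temp table.
    placeholders = ", ".join("?" * len(new_rows[0]))
    conn.executemany(f"INSERT INTO {temp} VALUES ({placeholders})", new_rows)

    # 3. Issue the INSERT/SELECT into the big table.
    conn.execute(f"INSERT INTO {big_table} SELECT * FROM {temp}")

    # 4. Delete the temp table.
    conn.execute(f"DROP TABLE {temp}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Big_Table (sale_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO Big_Table VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)])
conn.commit()

append_via_temp(conn, "Big_Table", [(4, 40.0), (5, 50.0)])
total_rows = conn.execute("SELECT COUNT(*) FROM Big_Table").fetchone()[0]
print(total_rows)
```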
  • 114. 114 Update Process  The most difficult and risky to build  Use this process only if the tables are too large for a complete refresh, “Initial Load” process  Updates affect data that changes over time  Like Purchase Orders, hospital transactions, etc.  Medical records, if you treat the data maintenance at the macroscopic level
  • 115. 115 Update Process Options  Simple process 1. Separate the affected records into an update file, insert file, or delete file  Do this on the source system, if possible 2. Transfer the files to the data warehouse staging area 3. Create and run two processes – A delete process for deleting the records in the production table that need to be updated or deleted – An insert process for inserting the entirely new “updated” record into the production table, as well as the true inserts Simple, but typically not very fast
  • 116. 116 Simple Process Updated records Deleted records New records Source System Updates Deletes Inserts EDW Staging Area Delete Process Insert Process EDW Production Table 1. Delete Process identifies records for deletion from the Production Table based upon contents of the Updates file. 2. Delete Process identifies records for deletion from the Production Table based upon contents of the Deletes file. 3. Delete process deletes records from Production Table. 4. Insert Process identifies records for insert to the Production Table based upon contents of the Updates file. 5. Insert Process identifies records for insert to the Production Table based upon contents of the Inserts file. 6. Insert Process inserts records into the Production Table.
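The delete process and insert process described above can be sketched together in Python with SQLite. This is an illustration under assumptions: the three staged files are represented as in-memory lists, and the two-column `prod_table` with a `key_field` column is a hypothetical stand-in for a production table.

```python
import sqlite3


def apply_changes(conn, updates, deletes, inserts):
    """Apply staged changes: updates/inserts are full rows, deletes are keys."""
    # Delete process: remove rows that are about to be replaced or removed.
    doomed = [(row[0],) for row in updates] + [(key,) for key in deletes]
    conn.executemany("DELETE FROM prod_table WHERE key_field = ?", doomed)

    # Insert process: insert whole updated records plus the true inserts.
    conn.executemany(
        "INSERT INTO prod_table VALUES (?, ?)", list(updates) + list(inserts)
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod_table (key_field INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO prod_table VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.commit()

apply_changes(conn, updates=[(2, "B")], deletes=[3], inserts=[(4, "d")])
final_rows = conn.execute("SELECT * FROM prod_table ORDER BY key_field").fetchall()
print(final_rows)
```

Passing the whole record for each update keeps the logic to two simple statements, at the cost of more row movement than a field-level update would need.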
  • 117. 117 When You Are Unsure  Sometimes, source system log and audit files make it difficult to know if a record was updated or inserted (i.e. created)  Try this… 1. Load the records into a temp table that is identical to the production table to be updated 2. Delete corresponding records from the production table DELETE FROM prod_table WHERE key_field IN (SELECT temp_key_field FROM temp_table) 3. Insert all the records from the temp table into the production table Most databases now support this with an UPSERT
  • 118. 118 Massive Deletes  Just as with Updates and Inserts, the number of Deletes you have to manage is inversely proportional to the frequency of your ETL processes  Infrequent ETL means massive data operations  Partitions work well for this, again  E.g., keeping a 5 year window of data  Insert most recent year with a partition  Delete the oldest year’s partition  Blazing fast! 1 2 3 4 5 Delete partition Insert partition
  • 119. 119 “Raw” Data Standards for ETL  Makes the process of communicating with your source system partners much easier  Data type (e.g., format for date time stamps)  File formats (ASCII vs. EBCDIC)  Header records  Control characters  Rule of thumb  Never transfer data at the binary level unless you are transferring between binary compatible computer systems  Use only text-displayable characters  Less rework time vs. Less storage space and faster transfer speed  Storage and CPU time are cheap compared to labor
  • 120. 120 Last Thought…Indexing Strategies  Define these early, practice them religiously, use them extensively  This is “Database Design 101”  Don’t fall prey to this most common performance problem!
  • 121. 121 My Thanks  For being invited…  For your time and attention  For the many folks who have worked for and with me over the years that made me look better as a result  Please contact me if you have any questions  dsanders@nmff.org  PH: 312-695-8618

Editor's Notes

  1. Place students goals in the parking lot.
  2. Briefly explain the objectives for the
  3. ** Allow 45 Minutes for Metadata Transition Statement: After this discussion of relational database concepts, let’s examine the EDW database and some of its content.
  4. At the beginning of this module, refer to the glossary of terms found at the end of the documentation before discussing the content — Refer students to glossary when they can’t define a term — Have students mark the glossary with a flag in the documentation — If a term needs an explanation but is not in the glossary, place it in the parking lot as you define it — Encourage students to ask questions if they don’t understand a term — Also write terms in the parking lot when they come up in class