The document discusses the history and evolution of data warehousing. It notes that data warehousing emerged because technological limitations prevented transactional and analytical uses of data on the same platform. Early stages included mainframe operators storing historical data online to avoid hanging tapes, and government projects consolidating databases. Factors like business reengineering and a focus on continuous improvement drove more analytical uses of data. Key lessons discussed include the importance of a business culture supportive of fact-based decision making, and managing the political issues raised by data warehouses. The document advocates keeping metadata simple and focused on the understandability and findability of data.
2. 2
Introduction & Warnings
Why am I here?
Teach
Stimulate some thought
Share some of my experiences and lessons
Learn
From you, please…
Ask questions, challenge opinions, share your knowledge
I’ll do my best to live up to my end of the bargain
Warnings
The pictures in this presentation
May or may not have any relevance
whatsoever to the topic or slide
Mostly intended to break up the monotony
3. 3
Expectation Management
DW Strengths (according to others)
I know what not to do as much as I know what to do
Seen and made all the big mistakes
Vision, strategy, system architecture, data management &
DW modeling, complex cultural issues, “leapfrog” problem
solving
What not to expect: DW weaknesses
My programming skills suck
Haven’t written a decent line of code in four years!
Some might say it’s been 24 years…
Knowledge of leading products is very rusty
Though I’m beefing up on Microsoft and Cognos
Within these expectations, make no mistake about it… I know data warehousing
4. 4
Today’s Discussions
I am a good “Idea Guy”
But, ideas are worthless without someone to implement and
enhance them
Steve Barlow, Dan Lidgard, Jon Despain, Chuck Lyon, Laure
Shull, Kris Mitchell, Peter Hess, Ron Gault, Rob Carpenter, my
wife, and many others
My greatest strength and blessing
The ability to recognize, listen to, and hold
onto good people
Knock on wood
My achievements in personal and professional life
More a function of those around me than a reflection on me
5. 5
DW Best Practices:
The Most Important Metrics
Employee satisfaction
Without it, long-term customer satisfaction is impossible
Customer satisfaction
That’s the nature of the Information Services career field
Some people in our profession still don’t get it
We are here to serve
The Organizational Laugh Metric
How many times do you hear laughter in the day-to-day
operations of your team?
It is the single most important vital sign to organizational health
and business success
6. 6
My Background
Three eight-year chapters
Captain, Information Systems Engineer, US Air Force
Nuclear warfare battle management
Force status data integration
Intelligence and attack warning data “fusion”
Consultant in several industries
TRW
CIA Data Center
TRW Credit Reporting Data Base
National Security Agency (NSA)
Intel: New Mexico Data Repository (NMDR)
Air Force
Integrated Minuteman Data Base (IMDB)
Peacekeeper Information Retrieval System (PIRS)
Many others…
Healthcare
Intermountain Health Care Enterprise Data Warehouse
Consultant to other healthcare organizations’ data warehouses
Now at Northwestern University Medical System
7. 7
Overview
Data warehousing history
According to Sanders
Why and how did this become a sub-specialty in information
systems?
What have we learned so far?
My take on “Best Practices”
Key lessons-learned
My thoughts on the most popular authors in the field
What they contribute, where they detract
9. 9
What Happened in the Cloud?
Stage 1: Laziness
Operators grew tired of hanging tapes
In response to requests for historical financial data
They stored data on-line, in “unauthorized” mainframe databases
Stage 2: End of the mainframe bully
Computing moved out from finance to the rest of the business
Unix and relational databases
Distributed computing created islands of information
Stage 2.1: The government gets involved
Consolidating IRS and military databases to save money on mainframes
“Hey, look what I can do with this data…”
Stage 3: Deming comes along
Push towards constant business “reengineering”
Cultural emphasis on “continuous quality improvement” and “business
innovation” drives the need for data
Stage 4: Data warehousing has its own language
Ralph Kimball publishes “The Data Warehouse Toolkit”
10. 10
The Real Truth
Data warehousing is a symptom of a
problem
Technological inability to deploy single-platform
information systems that:
Capture data once and reuse it throughout an
enterprise
Support high-transaction rates (single record
CREATE, SELECT, UPDATE, DELETE) and analytic queries
on the same computing platform, with the same
data, at the same time
Someday, maybe we will address the root cause
Until then, it’s a good way to make a living
11. 11
The “Ideal Library” Practice
Stores all of the books and other reference material you need to
conduct your research
The Enterprise data warehouse
A single place to visit
One database environment
Contents are kept current and refreshed
Timely, well choreographed data loads
Staffed with friendly, knowledgeable people that can help you
find your way around
Your Data Warehouse team
Organized for easy navigation and use
Metadata
Data models
“User friendly” naming conventions
12. 12
Cultural Detractors
The two biggies…
The business supported by the data
warehouse must be motivated by a desire
for constant improvement and fact-based
decision making
The data warehouse team falls victim to the
“Politics of Data”
Through naiveté
Through their own misguided motives
13. 13
Business Culture
Does your CEO…
Talk about constant improvement, constantly?
Drive corporate goals that are SMART?
Specific, Measurable, Attainable, Realistic,
Tangible
Crave data to make better informed decisions?
Become visibly, buoyantly excited at a demo for
a data cube?
If so, the success of your data warehouse
is right around the corner… sort of…
I love data!
14. 14
Political Best Practices
You will be called a “data thief”
Get used to it
Encourage life cycle ownership of the OLTP
data, even in the EDW
You will be called “dangerous”
“You don’t understand our data!”
OLTP owners know their data better than
you do– acknowledge it and leverage it
You will be blamed for poor data quality in the OLTP systems
This is a natural reaction
Data warehouses raise the visibility of poor data quality
Use the EDW as a tool for raising overall data quality
You will be called a “job robber”
EDW is perceived as a replacement for OLTP systems
Educate people: The EDW depends on OLTP systems for its existence
Stick to your values and pure motives
The politics will fade away
15. 15
Data Quality
Pitfall
Taking accountability for data quality on the source system
Spending gobs of time and money “cleansing” data before it’s loaded into
the DW
It’s a never-ending, never-win battle
You will always be one step behind data quality
You will always be in the cross-hairs of
blame
Best Practice
Push accountability where it belongs– to the
source system
Use the data warehouse as a tool to reveal
data quality, either good or bad
Be prepared to weather the initial storm of
blame
16. 16
Measuring Data Quality
Data Quality = Completeness x Validity
Can it be measured objectively?
Measuring “Completeness”
Number of null values in a column
Measuring “Validity”
Cardinality is a simple way to measure validity
“We only have four standard regions in the business, but we
have 18 distinct values in the region column.”
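A sketch of how these two measures translate into SQL; the CUSTOMER table and REGION_CD column here are hypothetical:

-- Completeness: fraction of rows with a non-null region
SELECT COUNT(region_cd) / COUNT(*) AS completeness
FROM customer;

-- Validity: observed cardinality vs. the four standard regions
SELECT COUNT(DISTINCT region_cd) AS distinct_regions
FROM customer;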
17. 17
Business Validity
How can you measure it? You can’t…
“I collect this data from our customers, but I have to guess
sometimes because I don’t speak Spanish.”
“This data is valid for trend analysis decisions before
9/11/2001, but should not be used after that date, due to
changes in security procedures.”
“You can’t use insurance billing and reimbursement data to
make clinical, patient care decisions.”
“This customer purchased four copies of ‘Zamfir, Master of
the Pan Flute’, therefore he loves everything about Zamfir.”
What Amazon didn’t know: I bought them for my mom and her
sewing circle.
Where do you capture subjective data quality? Metadata….
18. 18
The Importance of Metadata
Maybe the most over-hyped, underserved area of
data warehousing common sense
Vendors want to charge you big $$$$$ for their tools
Consultants would like you to think that it’s the Holy Grail in
disguise and only they can help you find it
Authors who have never been in an operational environment
would have you chasing your tail in pursuit of an esoteric,
mythological Metadata Nirvana
Don’t listen to the confusing messages! You know the
answer… just listen to your common sense…
19. 19
Metadata: Keep It Simple!
Ultimately, what are the most valuable business
motives behind metadata?
Make data more “understandable” to those who are
not familiar with it
Data quality issues
Data timeliness and temporal issues
Context in which it was collected
Translating physical names to natural language
Make data more “findable” to those who don’t know
where it is
Organize it
Take a lesson from library science and the card
catalog
22. 22
The Data Model
TABLE_ENT
  TABLE_ENT_ID: NUMBER
  TABLE_ENT_DESC: VARCHAR2(4000)
  TABLE_ENT_SRC: VARCHAR2(50)
  TABLE_ENT_NAME: VARCHAR2(50)
  TABLE_TYPE: VARCHAR2(10)
  CREATE_DT: DATE
  LAST_LOAD_DT: DATE
  SCHEMA_ID: NUMBER

DATA_MART
  DATA_MART_ID: NUMBER
  DATA_MART_NAME: VARCHAR2(50)
  DATA_MART_DESC: VARCHAR2(4000)
  DATA_STEWARD: VARCHAR2(50)
  LAST_LOAD_DT: DATE
  UPDATE_FREQ: VARCHAR2(50)
  DATA_BEG_DT: DATE
  DATA_END_DT: DATE

DATA_MART_TABLE_ENT
  DATA_MART_ID: NUMBER
  TABLE_ENT_ID: NUMBER

FOLDER
  FOLDER_ID: NUMBER
  PARENT_FOLDER_ID: NUMBER
  FOLDER_NM: VARCHAR2(50)
  FOLDER_DSC: VARCHAR2(4000)
  CREATE_USER_ID: VARCHAR2(20)
  CREATE_DT: DATE

REPORT
  RPT_ID: NUMBER
  FOLDER_ID: NUMBER
  RPT_NM: VARCHAR2(250)
  RPT_LOC_TXT: VARCHAR2(1000)
  PURPOSE_TXT: VARCHAR2(4000)
  RUN_FREQ_TXT: VARCHAR2(1000)
  AUDIENCE_TXT: VARCHAR2(500)
  EDW_RPT_FLG: NUMBER
  DATA_SOURCE_TXT: VARCHAR2(4000)
  SELECT_CRITERIA_TXT: VARCHAR2(4000)
  STAT_METHODS_TXT: VARCHAR2(4000)
  RPT_TOOL_TXT: VARCHAR2(250)
  CODE_TXT: CLOB
  FORMULA_TXT: CLOB
  COMMENTARY_TXT: VARCHAR2(4000)
  AUTHOR_NM: VARCHAR2(500)
  AUTHOR_TITLE_TXT: VARCHAR2(500)
  AUTHOR_DEPT_TXT: VARCHAR2(500)
  AUTHOR_LOC_TXT: VARCHAR2(500)
  AUTHOR_PHONE_TXT: VARCHAR2(500)
  AUTHOR_EMAIL_TXT: VARCHAR2(500)
  BUSINESS_OWNER_TXT: VARCHAR2(500)
  METADATA_UPDATE_DT: DATE
  VALIDATION_DT: DATE
  CREATE_USER_ID: VARCHAR2(20)
  CREATE_DT: DATE

REPORT_TABLE_ENT_ASSOC
  RPT_ID: NUMBER
  TABLE_ENT_ID: NUMBER

ATTRIBUTE
  ATTRIBUTE_ID: NUMBER
  TABLE_ENT_ID: NUMBER
  ATTRIBUTE_DESC: VARCHAR2(4000)
  ATTRIBUTE_NAME: VARCHAR2(50)
  ATTRIBUTE_DATATYPE: VARCHAR2(50)
  SAMPLE_VALUE: VARCHAR2(100)
  INDEX_FLG: NUMBER
  PRIMARY_KEY_FLG: NUMBER
  TABLE_POSITION_NO: NUMBER

SCHEMA
  SCHEMA_ID: NUMBER
  SCHEMA_DESC: VARCHAR2(50)
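One example of the “findability” this model enables; a hedged sketch joining the tables above to answer “which reports draw on a given table?”:

SELECT r.rpt_nm, r.author_nm, f.folder_nm
FROM report r
JOIN report_table_ent_assoc a ON a.rpt_id = r.rpt_id
JOIN table_ent t ON t.table_ent_id = a.table_ent_id
JOIN folder f ON f.folder_id = r.folder_id
WHERE t.table_ent_name = 'POSTAL_CD_MASTER';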
23. 23
Example Metadata Entry
LKUP.POSTAL_CD_MASTER Table
Long Name:
Postal Code Master - IHC
Description:
Contains Postal (Zip) codes for the IHC referral region and
IHC specific descriptions. These descriptions allow for
specific IHC groupings used in various analyses.
Data Steward:
Jim Allred, ext. 3518
25. 25
Some Info Is Free
It can be collected from the database.
For example:
Primary and Foreign Keys
Indexed Columns
Table Creation Dates
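In Oracle, for instance, all three come straight from the data dictionary; a sketch using the standard USER_* views:

-- Primary keys
SELECT table_name, constraint_name
FROM user_constraints
WHERE constraint_type = 'P';

-- Indexed columns
SELECT table_name, index_name, column_name
FROM user_ind_columns;

-- Table creation dates
SELECT object_name, created
FROM user_objects
WHERE object_type = 'TABLE';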
26. 26
Most Valuable Info is Subjective
The human element
Most metadata is not automatically
collected by tools because it does NOT
exist in that form
Interviews with data stewards are the key
It can take months (and months and
months) of effort to collect initial
metadata.
27. 27
Holding Feet to the Fire
Made data architects
responsible for metadata
in their subject areas
Metadata completion
reports in every staff
meeting for a year
Standing rule: No new
data marts go live
without metadata
28. 28
Is it all worth it?
Data analysts think so.
“I couldn’t do my job without it.”
It will push the ROI of a ho-hum DW into the
stratosphere
It does for DW’ing what the Yellow Pages did for
the business ROI of the telephone
29. 29
It Gets Used
At Intermountain Health Care
210 web hits on average each week day
(23,000 employees, $2B revenue)
[Chart: Avg Hits by Day of Week (April 2004 - Sep 2004): MON 189, TUE 217, WED 212, THU 240, FRI 188]
31. 31
Report Quality
A function of…
Data quality
How well does the report reflect the intent behind the question being
asked?
“This report doesn’t make sense. I’m trying to find out how many
widgets we can produce next year, based on the last four years’
production.”
“That’s not what you asked for.”
SQL and other programming accuracy
Statistical validity– population size of the data
Timeliness of the data relative to the decision
Event Correlation
Best Practice:
An accompanying “meta-report” for every report that involves
significant, high risk decisions
34. 34
IHC Analyst Use of Meta Reports
[Chart: IHC analyst use of meta reports (data collected Aug-04, N=32): Read Others 37%, Search Duplication 89%, Search SQL 21%, Audience Request 95%]
35. 35
Meta Report
Core Elements
Author Information
Report Name
Report Purpose
Data Source(s)
Report Methods
Recommended Elements
Business Owner
Run Frequency
Intended Audience
Statistical Tests
Software Used
Source Code
Formulas
Relevant Issues &
Commentary
45. 45
One of our ETL programmers noticed he kept
doing the same things over and over for all of
his ETL jobs. Rather than copying and
pasting this repetitive code, he created a
library. Now we all use the ETL Library.
We named the library EDW_UTIL (EDW
Utilities)
History
46. 46
Implementation
Executes via Oracle stored procedures
Supported by associated tables to hold data
when necessary
Error table
QA table
Index table
47. 47
Benefits
Provides standardization
Eliminates code rewrites
Can hide complexities
Such as the appropriate way to
analyze and gather statistics
on tables
Very accessible to all of
our ETL tools
Simply an Oracle stored
procedure call
48. 48
Index Management
Past process included:
Dropping the table’s indexes with a script
Loading the table
Creating the indexes with a script
The past process resulted in messy
scripts to manage and
coordinate
49. 49
Index Management
New process includes:
Capturing a table’s existing index metadata
Dropping the table’s indexes with a single procedure call
Loading the table
Recreating the indexes with a single
procedure call
There are no more messy scripts to
manage and coordinate
No more “lost” indexes, forgotten because someone
neglected to update the create-index script
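A minimal sketch of how such a procedure pair might work. The idea matches the slides, but the bodies and the EDW_INDEX_METADATA table are my assumptions; DBMS_METADATA.GET_DDL is real Oracle (9i+), and passing its CLOB to EXECUTE IMMEDIATE requires 11g or later:

CREATE OR REPLACE PROCEDURE drop_table_indexes (p_table IN VARCHAR2) AS
BEGIN
  -- Capture each index's DDL before dropping it, so nothing gets "lost"
  FOR ix IN (SELECT index_name FROM user_indexes
             WHERE table_name = UPPER(p_table)) LOOP
    INSERT INTO edw_index_metadata (table_name, index_name, index_ddl)
    VALUES (UPPER(p_table), ix.index_name,
            DBMS_METADATA.GET_DDL('INDEX', ix.index_name));
    EXECUTE IMMEDIATE 'DROP INDEX ' || ix.index_name;
  END LOOP;
END;
/

CREATE OR REPLACE PROCEDURE create_table_indexes (p_table IN VARCHAR2) AS
BEGIN
  -- Recreate every captured index, then clear the saved metadata
  FOR ix IN (SELECT index_ddl FROM edw_index_metadata
             WHERE table_name = UPPER(p_table)) LOOP
    EXECUTE IMMEDIATE ix.index_ddl;
  END LOOP;
  DELETE FROM edw_index_metadata WHERE table_name = UPPER(p_table);
END;
/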
51. 51
Background Loading of Tables
We often load data into tables which are not
accessible to end users. A simple rename puts them
into production.
Helps transfer the identical attributes from the live to
the background table
Samples
COPY_TABLE_METADATA
TRANSFER_TABLE_PRIVS
DROP_TABLE_INDEXES
CREATE_TABLE_INDEXES
(Create on background table, identical to production table)
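A hedged sketch of the swap itself; the EDW_UTIL procedure names come from the list above, while the SALES/SALES_BG names and call signatures are assumptions:

BEGIN
  edw_util.copy_table_metadata('SALES', 'SALES_BG');   -- clone the structure
  edw_util.create_table_indexes('SALES_BG');           -- identical to production
END;
/
-- ...load SALES_BG in the background, run QA checks...
ALTER TABLE sales RENAME TO sales_old;
ALTER TABLE sales_bg RENAME TO sales;                  -- a simple rename goes live
BEGIN
  edw_util.transfer_table_privs('SALES_OLD', 'SALES'); -- regrant end-user rights
END;
/
DROP TABLE sales_old;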
52. 52
Load Times, Errors, QA
We had no idea who was loading what and when
Each staff member logged in their own way and for their
own interest
ETL error capturing and QA was difficult
We can now capture errors and QA information in a
somewhat standardized fashion
53. 53
Load Times, Errors, QA
Samples
BEGIN_JOB_TIME
(ex: CASEMIX)
BEGIN_LOAD_TIME
(ex: CASEMIX INDEX)
END_LOAD_TIME
END_JOB_TIME
COMPLETE_LOAD_TIME
(Begin and end together)
LOAD_TIME_ERROR
(Alert on these errors)
LOAD_TIME_METRICS
QA (row counts)
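As a sketch, an ETL job instrumented with these calls might read like this; the names and the CASEMIX examples come from the slide, the signatures are assumptions:

BEGIN
  edw_util.begin_job_time('CASEMIX');
  edw_util.begin_load_time('CASEMIX INDEX');
  -- ...load step...
  edw_util.end_load_time('CASEMIX INDEX');
  edw_util.end_job_time('CASEMIX');
EXCEPTION
  WHEN OTHERS THEN
    edw_util.load_time_error('CASEMIX', SQLERRM);  -- alert on these errors
    RAISE;
END;
/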
54. 54
Miscellaneous Procedures
Hide the “gory” details from the majority
of the EDW team
Such as Oracle’s table analyze command
Gives us consistent application of
system wide parameters such as:
A new box with a different number of CPUs
(parallel slaves)
or
A new version of Oracle
We populate some metadata too, such
as last load date
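One way such a wrapper might look. DBMS_STATS and its parameters are real Oracle; the procedure itself is hypothetical, and the TABLE_ENT update assumes the metadata model shown earlier:

CREATE OR REPLACE PROCEDURE edw_analyze_table (p_table IN VARCHAR2) AS
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => USER,
    tabname          => UPPER(p_table),
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
    degree           => 8,      -- one system-wide setting, tuned to the box's CPUs
    cascade          => TRUE);  -- gather index statistics too

  -- Populate some metadata as a side effect: last load date
  UPDATE table_ent
     SET last_load_dt = SYSDATE
   WHERE table_ent_name = UPPER(p_table);
END;
/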
55. 55
DW Schedule of Operations
Some loads are ad hoc, not scheduled
Users query in an ad hoc fashion
We have a minimal service/application tier
implemented (loss of control)
Use of a variety of ETL tools
Use of a variety of user categories
DBA, SA, ETL user, end users
Use of a variety of servers
Production EDW, Stage EDW, ETL servers, OLAP servers,
Presentation layer servers
56. 56
General Approach
Focus on load jobs against production EDW
Still working on all the reporting aspects (a sample on the
next slide)
Pull this information out of the “load times” data
captured by these ETL library calls
BEGIN_JOB_TIME
BEGIN_LOAD_TIME
END_LOAD_TIME
END_JOB_TIME
COMPLETE_LOAD_TIME
58. 58
DW Alerting Tool
DW alerting
Aggregate data alerts, such as, your
average length of stay just crossed a
certain threshold
A simple tool was created which
sends a text email, based on
existence of data returned from a
query
Primarily embraced by DW team
members for internal DW
operations, though the original
intent hasn’t been abandoned
59. 59
Features
Web based
Open to all EDW users
Run daily, weekly, every two weeks, monthly,
quarterly (wakes every 5 minutes)
This is passive polling
Ability to enter query in SQL
Alert (email) on 3 situations
Query returns data
Query returns no data
Always
61. 61
Examples
~100 alerts in use
Live performance check
Every 4 hours—look for inactive sessions holding active
slaves
Daily—look for any active sessions older than 72 hours
ETL monitoring; alert only if problem
Alert on errors logged via the EDW_UTIL library (manage by
exception)
Alert on existence of “bad” records captured during ETL
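A sketch of what one such alert query might look like under the “query returns data” rule; V$SESSION is Oracle’s real session view, and the 72-hour threshold comes from the slide:

-- Alert if any active session is older than 72 hours
SELECT sid, username, logon_time
FROM v$session
WHERE status = 'ACTIVE'
AND logon_time < SYSDATE - 3;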
62. 62
Storage and Backup
Inherited state of affairs
Running like any OLTP database
High end expensive SANs (storage area
networks)
FULL nightly online backups
Out of space? Just buy more
63. 63
Nightmare in the Making
Exponential growth
More data sources
More summary tables
More indexes
No data has yet been
purged
Relaxed attitude
Disk is cheap
Reality: Disk management is expensive
64. 64
Looming Crisis
Backups often run 16 hours or more
Performance degradation witnessed by users
Good backups obtained less than 50% of the time
Literally running out of space
Gross underestimating
Some reckless overuse
Financial $$$$ cost
The system administrators (SAs) quadrupled the price of
disk purchases from the previous budget year. Ouch!
SAs rolled in the price of tape drives, etc.
65. 65
Major Changes in Operations
Transfer some disk ownership AND backup
responsibilities to DW team, away from
SAs and DBAs
EDW team more aware of upcoming space
demands
EDW team more in tune with which data sets
are easily recreated from the source (don’t
need a backup)
Stop performing full daily backups
Move towards less expensive disk
option
IBM offers a few levels of SANs
67. 67
Changes to Backup Strategy
Perform full backup once monthly during
downtime
Perform no data backup on DEV/STAGE
environments
Do backup DDL (all code)
daily in all environments
Implement daily
“incremental” backup
68. 68
Daily Incremental Backups
Easier said than done
We’ve resorted to a table level backup (in Oracle,
that’s an EXPORT)
The EDW team owns which tables are exported
EDW team populates a table, the “export table list” with
each table’s export frequency
Populated via an application in development
The DBAs run an export based on the “export table
list”
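A hypothetical shape for that “export table list”; the slide names the table and the per-table frequency, the columns are my guess:

CREATE TABLE export_table_list (
  table_name     VARCHAR2(30),
  export_freq    VARCHAR2(20),   -- e.g., DAILY, WEEKLY, MONTHLY
  last_export_dt DATE
);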
69. 69
Use Cheaper Disk
General practice: You can take greater risks with DW reliability
and availability vs. OLTP systems
Use it to your advantage
Our SAN vendor (IBM) offers a few levels of SANs. Next level
down is a big step down in price, small step down in features.
Feature loss:
Read cache (referring to disk cache, not box memory).
We rarely read the same thing twice anyway
No “phone home” to IBM (auto paging)
Mean time to failure is higher, but still acceptable
70. 70
Performance Monitoring & Tuning
Err on the side of freedom and empowerment
How much harm can really be done?
We’d rather not constrain our customers
“Pounding queries” do find
their way to production
Opportunity to educate users
Opportunity for us to tune
underlying structures
71. 71
The Focus Areas
Indexing
Well-defined criteria for when and how to apply indexes
Is this a lost art?
Big use of BITMAPS
Composite index trick (acts like a table)
Partitioning for performance, rather than data management
Exploiting Oracle’s Direct Path INSERT feature
Avoiding UPDATE and DELETE commands
Copy with MINUS instead
Implementing Oracle's Parallel Query
Turn off referential integrity in the DW… no-brainer
That’s the job of the source system
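Three of these techniques sketched in Oracle SQL; the table and constraint names are hypothetical:

-- Direct Path INSERT combined with Parallel Query
INSERT /*+ APPEND PARALLEL(sales_fact, 8) */ INTO sales_fact
SELECT /*+ PARALLEL(stage_sales, 8) */ * FROM stage_sales;

-- "Copy with MINUS" instead of DELETE: rebuild without the unwanted rows
CREATE TABLE sales_fact_new AS
  SELECT * FROM sales_fact
  MINUS
  SELECT * FROM sales_fact WHERE batch_id = 'BAD_LOAD';

-- Referential integrity off in the DW (it's the source system's job)
ALTER TABLE sales_fact DISABLE CONSTRAINT fk_sales_customer;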
72. 72
DW Monitoring:
Empowering End Users
Motive
Too many calls from end users about their
queries
“Please kill it.”
“Is it still running or is my PC locked up?”
“Why is the DW so slow?”
Give them the insight and tools
Give them the ability to kill their own
queries
Still in the works
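For reference, the raw Oracle mechanics such a self-service tool would wrap; a DBA-side sketch only, since the end-user version is still in the works:

-- Find a user's sessions (hypothetical username)
SELECT sid, serial#, username, status
FROM v$session
WHERE username = 'JSMITH';

-- Kill one, using the sid and serial# from the query above
ALTER SYSTEM KILL SESSION '123,45678';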
74. 74
Tracking Long-Running Queries
We use Pinecone (from Ambeo) to monitor
the duration of all queries and the SQL
Each week, we look at the top few
Typical outcome?
We’ll add indexes
We’ll denormalize
We'll contact the user
and assist them with
writing a better query
75. 75
The DW Sandbox
More empowerment for customers
Motive
Lots of little MS
Access databases
(with valuable data) spread
all over the place
Needed to be joined
with DW data
Costly to maintain
PC hogs
Solution
Provide customers with their own “sandbox” on the DW,
with DBA-like priv’s
76. 76
Features
Web based tool for creating tables and
loading MS Access data to the DW
Simple, easy to use interface
Privileges
Users have full rights to the
tables they create
Can grant rights to others
Big, big victory for customer
service and data “maturity”
10% of DW customers use the
Sandbox
About 600 tables in use now
About 2G of data
77. 77
Design-Build Best Practices
Build vertically, design horizontally
Start by building data marts that address analytic
needs in one area of the business with a fairly
limited data set
But, design with the horizontal needs of the
company in mind, so that you will eventually “tie”
all of these vertical data marts together with a
common semantic layer
80. 80
The Logic Layer in Data Warehouses
[Diagram: Source System → ETL Process → Data Warehouse → Reports, spanning the data layer, the logic layer, and the presentation layer; callouts mark where the logic layer belongs (“here”) and does not belong (“not here”) for analytic vs. transaction systems.]
81. 81
Evidence of Business Process Alignment
1. Map out your high level business process
Don’t fall prey to analysis paralysis with endless business
process modeling diagrams!
2. Identify and associate the transaction systems that support
those processes
3. Identify the common, overlapping semantics/data attributes
and their utilization rates
4. Build your data marts within an enterprise framework
that is aligned with the processes
you are trying to understand
82. 82
For example…
[Diagram: a healthcare business process (Health Need → Patient Perception → Diagnosis → Procedure → Results & Outcomes → Episode of Care → AP/AR → Claims Processing) supported by non-integrated data in transaction systems (HELP, Lab, HPI, MC400, Survey, AS400, IDX, HDM, CIS/CDR, HNA, Rx), all integrated in the Data Warehouse.]
83. 83
Event Correlation
A leading edge Best Practice
The third dimension to rows and columns
Overlays the data that underlies a report or
graph
“In 2004, we experienced a drop in revenue
as a result of the earthquake that destroyed
our plant in the Philippines.”
“In January of 2005, we saw a spike in the
North America market for snow shovel sales
that coincided with an increase in sales for
pain relievers. This correlates to the record
snowfall in that region and should not be
considered a trend. Barring major product
innovation, we consider the market for snow
shovels in this area as saturated. Sales will be
slow for the next several years.”
84. 84
Standardizing Semantics
Sweet irony: the many synonyms for “standard
semantics”
Data dictionary
Vocabulary
Dimensions
Data elements
Data attributes
The bottom line issue: Standardizing the terms you
use to describe key facts about your business
85. 85
Standardizing “Names of Things”
You better do it within the first two months of your
data warehouse project
If you are beyond that point, you better stop and do it now,
lest you pay a bigger price later
Don’t…
Push the standard on the source systems, unless it’s easy
to accomplish
This was one of the common pitfalls of early data
warehousing project failures
Try to standardize everything under
the sun!
Focus on the high value facts
86. 86
Where Are The “High Value” Semantics?
In the high-overlap, high-utilization areas…
[Diagram: Venn overlap of Source System X, Source System Y, and Source System Z; the intersection of all three is the highest value area for standardizing semantics.]
88. 88
The Standard Semantic “Layer”
[Diagram: Source Systems → Extract, Transform, Load → Data Warehouse, with the semantic standards applied in the ETL layer.]
89. 89
Data Modeling
Star schemas are great and simple, but they aren’t
the end-all, be-all of analytic data modeling
Best practices: Do what makes sense– don’t be a schema
bigot
I’ve seen great analytic value from 3NF models
Maintain data familiarity for your customers
When meeting vertical needs
Don’t make massive changes to the way the model looks
and feels, nor the naming conventions– you will alienate
existing users of the data
Use views to achieve “new” or standards-compliant
perspectives on data
When meeting horizontal needs
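A sketch of the view technique, with hypothetical names; existing users keep the model they know, while the view layers a standards-compliant perspective on the same data:

CREATE OR REPLACE VIEW std_patient_encounter AS
SELECT pat_id AS patient_id,      -- physical names translated to the standard
       enc_dt AS encounter_date,
       dx_cd  AS diagnosis_code
FROM casemix_encounter;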
91. 91
The Case For Timely Updates
[Chart: % of requests for data (utilization) vs. data age; utilization is near 100 for today’s data and falls toward 0 between one and two years.]
Generally, to minimize Total Cost of Ownership (TCO), your update frequency
should be no greater than the decision making cycle associated with the data.
But… everyone wants more timely data.
92. 92
Best Practice: Measure Yourself
Employee satisfaction
Customer satisfaction
Average number of
queries/month
Number of queries above a
threshold (30 minutes?)
Average query response time
Total number of records
Total number of query-able
tables
Total number of query-able
columns
Number of “users”
Average rows delivered per
month
Storage utilization
CPU utilization
Downtime per month by data
mart
The Data Warehouse Dashboard
93. 93
Other Best Practices
The Data Warehouse Information
Systems Team reports to the CIO
Most data analysts can and probably
should report to the business units
Change management/service level
agreements with the source systems
No changes in the source systems
unless they are coordinated with the data
warehouse team
94. 94
More Best Practices
Skills of the Data Warehouse IS Team
Experienced chief architect/project manager
Procedural/script programmers
SQL/declarative programmers
Data warehouse storage management architects
Data warehouse hardware architects and system
administrators
Data architects/modelers
DBAs
95. 95
More Best Practices
Evidence of project collaboration
A cross section of members and expertise from the data
warehouse IS team
Statisticians and data analysts who understand the
business domain
A customer that understands the process(es) being
measured and can influence change
A data steward– usually someone from the front lines who
knows how the data is collected
Project = complex reports or a data mart
96. 96
More Best Practices
Whenever possible, extract as close
to the source as possible
[Diagram: Primary Source → Copy A → Copy B; the best practice path extracts into the Data Warehouse directly from the Primary Source, not from downstream copies.]
97. 97
The Most Popular Authors
I appreciate…
The interest they stir
The vocabulary– semantics– of this new specialty that they helped
create
The downside…
The buzzwords that are more buzz than substance
“Corporate Information Factories”
Endless, meaningless debate
“That’s not an Operational Data Store!”
“Do you follow Kimball or Inmon?”
Follow your own common sense
Most of these authors have not had to build a data warehouse from
scratch and live with their decisions through a complete lifecycle
98. 98
ETL Operations
Besides the cultural risks and challenges, the riskiest part of a
data warehouse… the Extract, Transform, and Load
processes
Worthy of its own “Best Practices” discussion
Suffice it to say, mitigate risks in this area carefully and deliberately
The major design errors don’t show up until late in the lifecycle,
when the cost of repair is great
Good book
Westerman, Data Warehousing: Using the Wal-Mart Model
99. 99
Two Essential ETL Functions
Initial loads
How far back do we go in history?
Maintenance loads
Differential loads or total refresh?
How often?
You will run and tune these processes several times
before you go into production
How many records are we dealing with?
How long will this take to run?
What’s the impact on the source system performance?
100. 100
Maintenance Loads
Total refresh vs. Incremental loads
Total refresh: Truncate and reload everything from the
source system
Incremental: Load only the new and updated records
For small data sets, a total refresh strategy is the
easiest to implement
How do you define “small”? You will know it when you
see it.
Sometimes the fastest strategy when you are trying to
show quick results
Grab and go…
101. 101
Incremental Loads
How do we get a snapshot of the data that
has changed since the last load?
Many source systems will have an existing log file
of some kind
Take advantage of these when you can,
otherwise incremental loads can be complicated
102. 102
File Transfer Formats
Design your extract so that it uses…
Fixed, predetermined length for all records and fields
Avoid variable length if at all possible
A unique character that separates each field in a record, such as ~
A standard format for header records across all source systems
Such as the first three records in each file
Include name of source system,
file, and record count and
number of fields in the
record
This will be handy for
monitoring jobs and
collecting load metadata
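For illustration only, a hypothetical three-record header using the ~ separator suggested above; this exact field layout is my invention, not a standard:

SRC_SYSTEM~CASEMIX
FILE~casemix_extract.dat~FIELD_COUNT~12
RECORD_COUNT~184326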
103. 103
Benefits of Standard
File Transfer Format
Compatible with standard database and operating system utilities
Dynamically create initial and maintenance load scripts
Read the table definitions (DDL) then merge that with the
standard transfer file format
Dynamically generate load monitoring data
Read the header row, insert that into a
“Load Status” table with status “Running”,
# of records, start time
At EOF, change status to “Complete” and
capture end of load time
I wish I had thought about this topic
more, and earlier in my career
104. 104
Westerman Makes A Good Point
My experience: ETL is the least palatable and least
productive use of a veteran EDW Team member, so
I like Westerman’s insight on this topic
If you design for
instantaneous updates
from the beginning, it
translates to less ETL
maintenance and labor
time for the EDW staff,
later
105. 105
Messaging Applied to ETL
Basic concepts
Use a load message queue for records that need to be updated, coming
from the source systems
When the EDW analytical processing workload is low (off-peak), pick the
next message off the load queue and load the data
Run this in parallel so that you can process several load messages at
the same time while you have a window of opportunity
Sometimes called “throttling”
Speed up and slow down based upon traffic conditions
Motive behind the concept
Continuous updates in a mixed
workload environment
Mixed: Analytical processing at the
same time as transaction oriented,
constant updates, deletes, inserts
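A minimal sketch of the load queue as a plain Oracle table (Oracle Advanced Queuing is another route); every name here is hypothetical:

CREATE TABLE load_queue (
  msg_id      NUMBER,
  source_sys  VARCHAR2(30),
  load_file   VARCHAR2(250),
  enqueued_dt DATE DEFAULT SYSDATE,
  status      VARCHAR2(10) DEFAULT 'PENDING'
);

-- Off-peak worker: take the oldest pending messages, several in parallel
SELECT msg_id, source_sys, load_file
FROM load_queue
WHERE status = 'PENDING'
ORDER BY enqueued_dt;

-- Mark each message done after a successful load
UPDATE load_queue SET status = 'LOADED' WHERE msg_id = :done_id;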
107. 107
Four Data Maintenance
Processes
Initial load
Loading into an empty table
Append load
Update process
Delete process
As much as practical, use your database utilities for these
processes
Study and know your database utilities for data warehousing;
they are getting better all the time
I see some bad strategies in this area-- companies spending
time building their own utilities…aye cucumber!
108. 108
A Few Planning Thoughts
Understand the percentage of records
that will be updated, deleted, or inserted
You’ll probably develop a different
process for 90% inserts vs. 90%
updates
Logging
In general, turn logging off during the processes, if logging was
on at all
Field vs. Record level updates
Some folks, in the interest of purity, will build complex update
processes for passing only field (attribute) level changes
No brainer: Pass the whole record
109. 109
Initial Load
Every table will, at some time, require an
initial load
For some tables, it will be the best choice for data
maintenance
Total data refresh
Best for “small” tables
Simple process to implement
Simply delete (or truncate) and reload with fresh
data
110. 110
A Better Initial Load Process
Background load
Safer– protects against corrupt files
Higher availability to customers
Three or four steps… maybe 6?
1. Create a temporary table
2. Load the temporary table
3. Run quality checks
4. Rename the temporary table to the production table name
5. Delete the old table
6. Regrant rights, if necessary
Westerman: “You want to use as many initial load processes as
possible.”
I agree!
111. 111
Append Load
For larger tables that accumulate historical
data
There are no updates, just appends
A hard fact that will not change
Example
Sales that are closed
Lab results
112. 112
Append Load Options
Load a single part of a table
Load a partition and ‘attach’ it to the table
Create a new, empty partition
Load the new records
Attach the partition to the table
Look to use the “LOAD APPEND” command
in your database
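In Oracle, the partition route looks roughly like this; EXCHANGE PARTITION is the real mechanism, the table and partition names are hypothetical:

-- 1. Create a new, empty partition
ALTER TABLE sales_fact ADD PARTITION p_2004_09
  VALUES LESS THAN (TO_DATE('2004-10-01','YYYY-MM-DD'));

-- 2. Load the new records into a standalone work table (sales_stage)

-- 3. Attach: swap the loaded table into the empty partition
ALTER TABLE sales_fact EXCHANGE PARTITION p_2004_09
  WITH TABLE sales_stage INCLUDING INDEXES;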
113. 113
Another Append Option
1. Create a temp table identical to the one you
are loading
2. Load the new records into the empty temp
table
3. Issue INSERT/SELECT
INSERT INTO Big_Table (SELECT * FROM
Temp_Big_Table)
4. Delete the temp table
IF # RECORDS IN TEMP IS MUCH < # OF RECORDS IN BIG
THEN GOOD TECHNIQUE
ELSE NOT GOOD
114. 114
Update Process
The most difficult and risky to build
Use this process only if the tables are too large for a
complete refresh, “Initial Load” process
Updates affect data that changes over time
Like Purchase Orders, hospital transactions, etc.
Medical records, if you treat the data maintenance at the
macroscopic level
115. 115
Update Process Options
Simple process
1. Separate the affected records into an update file, insert file, or
delete file
Do this on the source system, if possible
2. Transfer the files to the data warehouse staging area
3. Create and run two processes
– A delete process for deleting the records in the production table
that need to be updated or deleted
– An insert process for inserting the entirely new “updated” record
into the production table, as well as the true inserts
Simple, but typically not very fast
116. 116
Simple Process
[Diagram: the Source System emits files of updated, deleted, and new records (Updates, Deletes, Inserts) into the EDW Staging Area; a Delete Process and an Insert Process apply them to the EDW Production Table.]
1. Delete Process identifies records for deletion from the Production Table based upon contents of the Updates file.
2. Delete Process identifies records for deletion from the Production Table based upon contents of the Deletes file.
3. Delete Process deletes records from the Production Table.
4. Insert Process identifies records for insert to the Production Table based upon contents of the Updates file.
5. Insert Process identifies records for insert to the Production Table based upon contents of the Inserts file.
6. Insert Process inserts records into the Production Table.
117. 117
When You Are Unsure
Sometimes, source system log and audit files make
it difficult to know if a record was updated or
inserted (i.e. created)
Try this…
1. Load the records into a temp table that is
identical to the production table to be updated
2. Delete corresponding records from the
production table
DELETE FROM prod_table WHERE key_field
IN (SELECT temp_key_field FROM temp_table)
3. Insert all the records from the temp table into
the production table
Most databases now support this with an UPSERT
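In Oracle (9i and later) that upsert is the MERGE statement; a sketch with hypothetical columns, equivalent to the delete-then-insert above:

MERGE INTO prod_table p
USING temp_table t
ON (p.key_field = t.key_field)
WHEN MATCHED THEN
  UPDATE SET p.amount = t.amount, p.status_cd = t.status_cd
WHEN NOT MATCHED THEN
  INSERT (key_field, amount, status_cd)
  VALUES (t.key_field, t.amount, t.status_cd);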
118. 118
Massive Deletes
Just as with Updates and Inserts, the number of Deletes you
have to manage is inversely proportional to the frequency of your
ETL processes
Infrequent ETL → massive data operations
Partitions work well for this, again
E.g., keeping a 5 year window of data
Insert the most recent year as a new partition
Drop the oldest year’s partition
Blazing fast!
[Diagram: five yearly partitions; the oldest partition is deleted as the newest is inserted.]
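The same rolling window in Oracle partition DDL; table and partition names are hypothetical:

-- Drop the oldest year: a metadata operation, hence "blazing fast"
ALTER TABLE sales_fact DROP PARTITION p_1999;

-- Add the incoming year
ALTER TABLE sales_fact ADD PARTITION p_2004
  VALUES LESS THAN (TO_DATE('2005-01-01','YYYY-MM-DD'));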
119. 119
“Raw” Data Standards for ETL
Makes the process of communicating with your source system
partners much easier
Data type (e.g., format for date time stamps)
File formats (ASCII vs. EBCDIC)
Header records
Control characters
Rule of thumb
Never transfer data at the binary level unless you are transferring
between binary compatible computer systems
Use only text-displayable characters
The trade-off: less rework time vs. less storage space and faster transfer speed
Storage and CPU time are cheap compared to labor
120. 120
Last Thought…Indexing Strategies
Define these early, practice them
religiously, use them extensively
This is “Database Design 101”
Don’t fall prey to this most common
performance problem!
121. 121
My Thanks
For being invited…
For your time and attention
For the many folks who have worked for and with
me over the years that made me look better as a
result
Please contact me if you have any questions
dsanders@nmff.org
PH: 312-695-8618