The document discusses the history and evolution of data warehousing. It notes that data warehousing emerged because technological limitations prevented transactional and analytical uses of data on the same platform. Early stages included mainframe operators storing historical data online to avoid hanging tapes, and government projects consolidating databases. Factors like business reengineering and a focus on continuous improvement drove more analytical uses of data. Key lessons discussed include the importance of a business culture supportive of fact-based decision making, and managing the political issues raised by data warehouses. The document advocates keeping metadata simple and focused on the understandability and findability of data.
2. 2
Introduction & Warnings
Why am I here?
Teach
Stimulate some thought
Share some of my experiences and lessons
Learn
From you, please…
Ask questions, challenge opinions, share your knowledge
I’ll do my best to live up to my end of the bargain
Warnings
The pictures in this presentation
May or may not have any relevance
whatsoever to the topic or slide
Mostly intended to break up the monotony
3. 3
Expectation Management
DW Strengths (according to others)
I know what not to do as much as I know what to do
Seen and made all the big mistakes
Vision, strategy, system architecture, data management &
DW modeling, complex cultural issues, “leapfrog” problem
solving
What not to expect: DW weaknesses
My programming skills suck
Haven’t written a decent line of code in four years!
Some might say it’s been 24 years…
Knowledge of leading products is very rusty
Though I’m beefing up on Microsoft and Cognos
Within these expectations, make no mistake about it… I know data warehousing
4. 4
Today’s Discussions
I am a good “Idea Guy”
But, ideas are worthless without someone to implement and
enhance them
Steve Barlow, Dan Lidgard, Jon Despain, Chuck Lyon, Laure
Shull, Kris Mitchell, Peter Hess, Ron Gault, Rob Carpenter, my
wife, and many others
My greatest strength and blessing
The ability to recognize, listen to, and hold
onto good people
Knock on wood
My achievements in personal and professional life
More a function of those around me than a reflection on me
5. 5
DW Best Practices:
The Most Important Metrics
Employee satisfaction
Without it, long-term customer satisfaction is impossible
Customer satisfaction
That’s the nature of the Information Services career field
Some people in our profession still don’t get it
We are here to serve
The Organizational Laugh Metric
How many times do you hear laughter in the day-to-day
operations of your team?
It is the single most important vital sign to organizational health
and business success
6. 6
My Background
Three eight-year chapters
Captain, Information Systems Engineer, US Air Force
Nuclear warfare battle management
Force status data integration
Intelligence and attack warning data “fusion”
Consultant in several industries
TRW
CIA Data Center
TRW Credit Reporting Data Base
National Security Agency (NSA)
Intel: New Mexico Data Repository (NMDR)
Air Force
Integrated Minuteman Data Base (IMDB)
Peacekeeper Information Retrieval System (PIRS)
Many others…
Healthcare
Intermountain Health Care Enterprise Data Warehouse
Consultant to other healthcare organizations’ data warehouses
Now at Northwestern University Medical System
7. 7
Overview
Data warehousing history
According to Sanders
Why and how did this become a sub-specialty in information
systems?
What have we learned so far?
My take on “Best Practices”
Key lessons-learned
My thoughts on the most popular authors in the field
What they contribute, where they detract
9. 9
What Happened in the Cloud?
Stage 1: Laziness
Operators grew tired of hanging tapes
In response to requests for historical financial data
They stored data on-line, in “unauthorized” mainframe databases
Stage 2: End of the mainframe bully
Computing moved out from finance to the rest of the business
Unix and relational databases
Distributed computing created islands of information
Stage 2.1: The government gets involved
Consolidating IRS and military databases to save money on mainframes
“Hey, look what I can do with this data…”
Stage 3: Deming comes along
Push towards constant business “reengineering”
Cultural emphasis on “continuous quality improvement” and “business
innovation” drives the need for data
Stage 4: Data warehousing has its own language
Ralph Kimball publishes “The Data Warehouse Toolkit”
10. 10
The Real Truth
Data warehousing is a symptom of a
problem
Technological inability to deploy single-platform
information systems that:
Capture data once and reuse it throughout an
enterprise
Support high-transaction rates (single record
CREATE, SELECT, UPDATE, DELETE) and analytic queries
on the same computing platform, with the same
data, at the same time
Someday, maybe we will address the root cause
Until then, it’s a good way to make a living
11. 11
The “Ideal Library” Practice
Stores all of the books and other reference material you need to
conduct your research
The Enterprise data warehouse
A single place to visit
One database environment
Contents are kept current and refreshed
Timely, well choreographed data loads
Staffed with friendly, knowledgeable people that can help you
find your way around
Your Data Warehouse team
Organized for easy navigation and use
Metadata
Data models
“User friendly” naming conventions
12. 12
Cultural Detractors
The two biggies…
The business supported by the data
warehouse must be motivated by a desire
for constant improvement and fact-based
decision making
The data warehouse team falls victim to the
“Politics of Data”
Through naiveté
Through their own misguided motives
13. 13
Business Culture
Does your CEO…
Talk about constant improvement, constantly?
Drive corporate goals that are SMART?
Specific, Measurable, Attainable, Realistic,
Tangible
Crave data to make better informed decisions?
Become visibly, buoyantly excited at a demo for
a data cube?
If so, the success of your data warehouse
is right around the corner… sort of…
I love data!
14. 14
Political Best Practices
You will be called a “data thief”
Get used to it
Encourage life cycle ownership of the OLTP
data, even in the EDW
You will be called “dangerous”
“You don’t understand our data!”
OLTP owners know their data better than
you do– acknowledge it and leverage it
You will be blamed for poor data quality in the OLTP systems
This is a natural reaction
Data warehouses raise the visibility of poor data quality
Use the EDW as a tool for raising overall data quality
You will be called a “job robber”
EDW is perceived as a replacement for OLTP systems
Educate people: The EDW depends on OLTP systems for its existence
Stick to your values and pure motives
The politics will fade away
15. 15
Data Quality
Pitfall
Taking accountability for data quality on the source system
Spending gobs of time and money “cleansing” data before it’s loaded into
the DW
It’s a never-ending, never-win battle
You will always be one step behind data quality
You will always be in the cross-hairs of
blame
Best Practice
Push accountability where it belongs– to the
source system
Use the data warehouse as a tool to reveal
data quality, either good or bad
Be prepared to weather the initial storm of
blame
16. 16
Measuring Data Quality
Data Quality = Completeness x Validity
Can it be measured objectively?
Measuring “Completeness”
Number of null values in a column
Measuring “Validity”
Cardinality is a simple way to measure validity
“We only have four standard regions in the business, but we
have 18 distinct values in the region column.”
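A sketch of how these two measures translate into SQL; the CUSTOMER table and REGION_CD column here are hypothetical:

-- Completeness: fraction of rows with a non-null region
SELECT COUNT(region_cd) / COUNT(*) AS completeness
FROM customer;

-- Validity: observed cardinality vs. the four standard regions
SELECT COUNT(DISTINCT region_cd) AS distinct_regions
FROM customer;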
17. 17
Business Validity
How can you measure it? You can’t…
“I collect this data from our customers, but I have to guess
sometimes because I don’t speak Spanish.”
“This data is valid for trend analysis decisions before
9/11/2001, but should not be used after that date, due to
changes in security procedures.”
“You can’t use insurance billing and reimbursement data to
make clinical, patient care decisions.”
“This customer purchased four copies of ‘Zamfir, Master of
the Pan Flute’, therefore he loves everything about Zamfir.”
What Amazon didn’t know: I bought them for my mom and her
sewing circle.
Where do you capture subjective data quality? Metadata….
18. 18
The Importance of Metadata
Maybe the most over-hyped, underserved area of
data warehousing common sense
Vendors want to charge you big $$$$$ for their tools
Consultants would like you to think that it’s the Holy Grail in
disguise and only they can help you find it
Authors who have never been in an operational environment
would have you chasing your tail in pursuit of an esoteric,
mythological Metadata Nirvana
Don’t listen to the confusing messages! You know the
answer… just listen to your common sense…
19. 19
Metadata: Keep It Simple!
Ultimately, what are the most valuable business
motives behind metadata?
Make data more “understandable” to those who are
not familiar with it
Data quality issues
Data timeliness and temporal issues
Context in which it was collected
Translating physical names to natural language
Make data more “findable” to those who don’t know
where it is
Organize it
Take a lesson from library science and the card
catalog
22. 22
The Data Model
TABLE_ENT
  TABLE_ENT_ID: NUMBER
  TABLE_ENT_DESC: VARCHAR2(4000)
  TABLE_ENT_SRC: VARCHAR2(50)
  TABLE_ENT_NAME: VARCHAR2(50)
  TABLE_TYPE: VARCHAR2(10)
  CREATE_DT: DATE
  LAST_LOAD_DT: DATE
  SCHEMA_ID: NUMBER

DATA_MART
  DATA_MART_ID: NUMBER
  DATA_MART_NAME: VARCHAR2(50)
  DATA_MART_DESC: VARCHAR2(4000)
  DATA_STEWARD: VARCHAR2(50)
  LAST_LOAD_DT: DATE
  UPDATE_FREQ: VARCHAR2(50)
  DATA_BEG_DT: DATE
  DATA_END_DT: DATE

DATA_MART_TABLE_ENT
  DATA_MART_ID: NUMBER
  TABLE_ENT_ID: NUMBER

FOLDER
  FOLDER_ID: NUMBER
  PARENT_FOLDER_ID: NUMBER
  FOLDER_NM: VARCHAR2(50)
  FOLDER_DSC: VARCHAR2(4000)
  CREATE_USER_ID: VARCHAR2(20)
  CREATE_DT: DATE

REPORT
  RPT_ID: NUMBER
  FOLDER_ID: NUMBER
  RPT_NM: VARCHAR2(250)
  RPT_LOC_TXT: VARCHAR2(1000)
  PURPOSE_TXT: VARCHAR2(4000)
  RUN_FREQ_TXT: VARCHAR2(1000)
  AUDIENCE_TXT: VARCHAR2(500)
  EDW_RPT_FLG: NUMBER
  DATA_SOURCE_TXT: VARCHAR2(4000)
  SELECT_CRITERIA_TXT: VARCHAR2(4000)
  STAT_METHODS_TXT: VARCHAR2(4000)
  RPT_TOOL_TXT: VARCHAR2(250)
  CODE_TXT: CLOB
  FORMULA_TXT: CLOB
  COMMENTARY_TXT: VARCHAR2(4000)
  AUTHOR_NM: VARCHAR2(500)
  AUTHOR_TITLE_TXT: VARCHAR2(500)
  AUTHOR_DEPT_TXT: VARCHAR2(500)
  AUTHOR_LOC_TXT: VARCHAR2(500)
  AUTHOR_PHONE_TXT: VARCHAR2(500)
  AUTHOR_EMAIL_TXT: VARCHAR2(500)
  BUSINESS_OWNER_TXT: VARCHAR2(500)
  METADATA_UPDATE_DT: DATE
  VALIDATION_DT: DATE
  CREATE_USER_ID: VARCHAR2(20)
  CREATE_DT: DATE

REPORT_TABLE_ENT_ASSOC
  RPT_ID: NUMBER
  TABLE_ENT_ID: NUMBER

ATTRIBUTE
  ATTRIBUTE_ID: NUMBER
  TABLE_ENT_ID: NUMBER
  ATTRIBUTE_DESC: VARCHAR2(4000)
  ATTRIBUTE_NAME: VARCHAR2(50)
  ATTRIBUTE_DATATYPE: VARCHAR2(50)
  SAMPLE_VALUE: VARCHAR2(100)
  INDEX_FLG: NUMBER
  PRIMARY_KEY_FLG: NUMBER
  TABLE_POSITION_NO: NUMBER

SCHEMA
  SCHEMA_ID: NUMBER
  SCHEMA_DESC: VARCHAR2(50)
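One example of the “findability” this model enables; a hedged sketch joining the tables above to answer “which reports draw on a given table?”:

SELECT r.rpt_nm, r.author_nm, f.folder_nm
FROM report r
JOIN report_table_ent_assoc a ON a.rpt_id = r.rpt_id
JOIN table_ent t ON t.table_ent_id = a.table_ent_id
JOIN folder f ON f.folder_id = r.folder_id
WHERE t.table_ent_name = 'POSTAL_CD_MASTER';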
23. 23
Example Metadata Entry
LKUP.POSTAL_CD_MASTER Table
Long Name:
Postal Code Master - IHC
Description:
Contains Postal (Zip) codes for the IHC referral region and
IHC specific descriptions. These descriptions allow for
specific IHC groupings used in various analyses.
Data Steward:
Jim Allred, ext. 3518
25. 25
Some Info Is Free
It can be collected from the database.
For example:
Primary and Foreign Keys
Indexed Columns
Table Creation Dates
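In Oracle, for instance, all three come straight from the data dictionary; a sketch using the standard USER_* views:

-- Primary keys
SELECT table_name, constraint_name
FROM user_constraints
WHERE constraint_type = 'P';

-- Indexed columns
SELECT table_name, index_name, column_name
FROM user_ind_columns;

-- Table creation dates
SELECT object_name, created
FROM user_objects
WHERE object_type = 'TABLE';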
26. 26
Most Valuable Info is Subjective
The human element
Most metadata is not automatically
collected by tools because it does NOT
exist in that form
Interviews with data stewards are the key
It can take months (and months and
months) of effort to collect initial
metadata.
27. 27
Holding Feet to the Fire
Made data architects
responsible for metadata
in their subject areas
Metadata completion
reports in every staff
meeting for a year
Standing rule: No new
data marts go live
without metadata
28. 28
Is it all worth it?
Data analysts think so.
“I couldn’t do my job without it.”
It will push the ROI of a ho-hum DW into the
stratosphere
It does for DW’ing what the Yellow Pages did for
the business ROI of the telephone
29. 29
It Gets Used
At Intermountain Health Care
210 web hits on average each week day
(23,000 employees, $2B revenue)
[Chart: Avg Hits by Day of Week (April 2004 - Sep 2004): MON 189, TUE 217, WED 212, THU 240, FRI 188]
31. 31
Report Quality
A function of…
Data quality
How well does the report reflect the intent behind the question being
asked?
“This report doesn’t make sense. I’m trying to find out how many
widgets we can produce next year, based on the last four years’
production.”
“That’s not what you asked for.”
SQL and other programming accuracy
Statistical validity– population size of the data
Timeliness of the data relative to the decision
Event Correlation
Best Practice:
An accompanying “meta-report” for every report that involves
significant, high risk decisions
34. 34
IHC Analyst Use of Meta Reports
[Chart: IHC analyst use of meta reports (data collected Aug-04, N=32): Read Others 37%, Search Duplication 89%, Search SQL 21%, Audience Request 95%]
35. 35
Meta Report
Core Elements
Author Information
Report Name
Report Purpose
Data Source(s)
Report Methods
Recommended Elements
Business Owner
Run Frequency
Intended Audience
Statistical Tests
Software Used
Source Code
Formulas
Relevant Issues &
Commentary
45. 45
One of our ETL programmers noticed he kept
doing the same things over and over for all of
his ETL jobs. Rather than copying and
pasting this repetitive code, he created a
library. Now we all use the ETL Library.
We named the library EDW_UTIL (EDW
Utilities)
History
46. 46
Implementation
Executes via Oracle stored procedures
Supported by associated tables to hold data
when necessary
Error table
QA table
Index table
47. 47
Benefits
Provides standardization
Eliminates code rewrites
Can hide complexities
Such as the appropriate way to
analyze and gather statistics
on tables
Very accessible to all of
our ETL tools
Simply an Oracle stored
procedure call
48. 48
Index Management
Past process included:
Dropping the table’s indexes with a script
Loading the table
Creating the indexes with a script
The past process resulted in messy
scripts to manage and
coordinate
49. 49
Index Management
New process includes:
Capturing a table’s existing index metadata
Dropping the table’s indexes with a single procedure call
Loading the table
Recreating the indexes with a single
procedure call
There are no more messy scripts to
manage and coordinate
No more “lost” indexes, forgotten because someone
neglected to update the create-index script
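A minimal sketch of how such a procedure pair might work. The idea matches the slides, but the bodies and the EDW_INDEX_METADATA table are my assumptions; DBMS_METADATA.GET_DDL is real Oracle (9i+), and passing its CLOB to EXECUTE IMMEDIATE requires 11g or later:

CREATE OR REPLACE PROCEDURE drop_table_indexes (p_table IN VARCHAR2) AS
BEGIN
  -- Capture each index's DDL before dropping it, so nothing gets "lost"
  FOR ix IN (SELECT index_name FROM user_indexes
             WHERE table_name = UPPER(p_table)) LOOP
    INSERT INTO edw_index_metadata (table_name, index_name, index_ddl)
    VALUES (UPPER(p_table), ix.index_name,
            DBMS_METADATA.GET_DDL('INDEX', ix.index_name));
    EXECUTE IMMEDIATE 'DROP INDEX ' || ix.index_name;
  END LOOP;
END;
/

CREATE OR REPLACE PROCEDURE create_table_indexes (p_table IN VARCHAR2) AS
BEGIN
  -- Recreate every captured index, then clear the saved metadata
  FOR ix IN (SELECT index_ddl FROM edw_index_metadata
             WHERE table_name = UPPER(p_table)) LOOP
    EXECUTE IMMEDIATE ix.index_ddl;
  END LOOP;
  DELETE FROM edw_index_metadata WHERE table_name = UPPER(p_table);
END;
/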
51. 51
Background Loading of Tables
We often load data into tables which are not
accessible to end users. A simple rename puts them
into production.
Helps transfer the identical attributes from the live to
the background table
Samples
COPY_TABLE_METADATA
TRANSFER_TABLE_PRIVS
DROP_TABLE_INDEXES
CREATE_TABLE_INDEXES
(Create on background table, identical to production table)
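A hedged sketch of the swap itself; the EDW_UTIL procedure names come from the list above, while the SALES/SALES_BG names and call signatures are assumptions:

BEGIN
  edw_util.copy_table_metadata('SALES', 'SALES_BG');   -- clone the structure
  edw_util.create_table_indexes('SALES_BG');           -- identical to production
END;
/
-- ...load SALES_BG in the background, run QA checks...
ALTER TABLE sales RENAME TO sales_old;
ALTER TABLE sales_bg RENAME TO sales;                  -- a simple rename goes live
BEGIN
  edw_util.transfer_table_privs('SALES_OLD', 'SALES'); -- regrant end-user rights
END;
/
DROP TABLE sales_old;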
52. 52
Load Times, Errors, QA
We had no idea who was loading what and when
Each staff member logged in their own way and for their
own interest
ETL error capturing and QA was difficult
We can now capture errors and QA information in a
somewhat standardized fashion
53. 53
Load Times, Errors, QA
Samples
BEGIN_JOB_TIME
(ex: CASEMIX)
BEGIN_LOAD_TIME
(ex: CASEMIX INDEX)
END_LOAD_TIME
END_JOB_TIME
COMPLETE_LOAD_TIME
(Begin and end together)
LOAD_TIME_ERROR
(Alert on these errors)
LOAD_TIME_METRICS
QA (row counts)
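As a sketch, an ETL job instrumented with these calls might read like this; the names and the CASEMIX examples come from the slide, the signatures are assumptions:

BEGIN
  edw_util.begin_job_time('CASEMIX');
  edw_util.begin_load_time('CASEMIX INDEX');
  -- ...load step...
  edw_util.end_load_time('CASEMIX INDEX');
  edw_util.end_job_time('CASEMIX');
EXCEPTION
  WHEN OTHERS THEN
    edw_util.load_time_error('CASEMIX', SQLERRM);  -- alert on these errors
    RAISE;
END;
/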
54. 54
Miscellaneous Procedures
Hide the “gory” details from the majority
of the EDW team
Such as Oracle’s table analyze command
Gives us consistent application of
system wide parameters such as:
A new box with a different number of CPUs
(parallel slaves)
or
A new version of Oracle
We populate some metadata too, such
as last load date
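One way such a wrapper might look. DBMS_STATS and its parameters are real Oracle; the procedure itself is hypothetical, and the TABLE_ENT update assumes the metadata model shown earlier:

CREATE OR REPLACE PROCEDURE edw_analyze_table (p_table IN VARCHAR2) AS
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => USER,
    tabname          => UPPER(p_table),
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
    degree           => 8,      -- one system-wide setting, tuned to the box's CPUs
    cascade          => TRUE);  -- gather index statistics too

  -- Populate some metadata as a side effect: last load date
  UPDATE table_ent
     SET last_load_dt = SYSDATE
   WHERE table_ent_name = UPPER(p_table);
END;
/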
55. 55
DW Schedule of Operations
Some loads are ad hoc, not scheduled
Users query in an ad hoc fashion
We have a minimal service/application tier
implemented (loss of control)
Use of a variety of ETL tools
Use of a variety of user categories
DBA, SA, ETL user, end users
Use of a variety of servers
Production EDW, Stage EDW, ETL servers, OLAP servers,
Presentation layer servers
56. 56
General Approach
Focus on load jobs against production EDW
Still working on all the reporting aspects (a sample on the
next slide)
Pull this information out of the “load times” data
captured by these ETL library calls
BEGIN_JOB_TIME
BEGIN_LOAD_TIME
END_LOAD_TIME
END_JOB_TIME
COMPLETE_LOAD_TIME
58. 58
DW Alerting Tool
DW alerting
Aggregate data alerts, such as, your
average length of stay just crossed a
certain threshold
A simple tool was created which
sends a text email, based on
existence of data returned from a
query
Primarily embraced by DW team
members for internal DW
operations, though the original
intent hasn’t been abandoned
59. 59
Features
Web based
Open to all EDW users
Run daily, weekly, every two weeks, monthly,
quarterly (wakes every 5 minutes)
This is passive polling
Ability to enter query in SQL
Alert (email) on 3 situations
Query returns data
Query returns no data
Always
61. 61
Examples
~100 alerts in use
Live performance check
Every 4 hours—look for inactive sessions holding active
slaves
Daily—look for any active sessions older than 72 hours
ETL monitoring; alert only if problem
Alert on errors logged via the EDW_UTIL library (manage by
exception)
Alert on existence of “bad” records captured during ETL
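A sketch of what one such alert query might look like under the “query returns data” rule; V$SESSION is Oracle’s real session view, and the 72-hour threshold comes from the slide:

-- Alert if any active session is older than 72 hours
SELECT sid, username, logon_time
FROM v$session
WHERE status = 'ACTIVE'
AND logon_time < SYSDATE - 3;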
62. 62
Storage and Backup
Inherited state of affairs
Running like any OLTP database
High end expensive SANs (storage area
networks)
FULL nightly online backups
Out of space? Just buy more
63. 63
Nightmare in the Making
Exponential growth
More data sources
More summary tables
More indexes
No data has yet been
purged
Relaxed attitude
Disk is cheap
Reality: Disk management is expensive
64. 64
Looming Crisis
Backups often run 16 hours or more
Performance degradation witnessed by users
Good backups obtained less than 50% of the time
Literally running out of space
Gross underestimating
Some reckless overuse
Financial $$$$ cost
The system administrators (SAs) quadrupled the price of
disk purchases from the previous budget year. Ouch!
SAs rolled in the price of tape drives, etc.
65. 65
Major Changes in Operations
Transfer some disk ownership AND backup
responsibilities to DW team, away from
SAs and DBAs
EDW team more aware of upcoming space
demands
EDW team more in tune with which data sets
are easily recreated from the source (don’t
need a backup)
Stop performing full daily backups
Move towards less expensive disk
option
IBM offers a few levels of SANs
67. 67
Changes to Backup Strategy
Perform full backup once monthly during
downtime
Perform no data backup on DEV/STAGE
environments
Do backup DDL (all code)
daily in all environments
Implement daily
“incremental” backup
68. 68
Daily Incremental Backups
Easier said than done
We’ve resorted to a table level backup (in Oracle,
that’s an EXPORT)
The EDW team owns which tables are exported
EDW team populates a table, the “export table list” with
each table’s export frequency
Populated via an application in development
The DBAs run an export based on the “export table
list”
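A hypothetical shape for that “export table list”; the slide names the table and the per-table frequency, the columns are my guess:

CREATE TABLE export_table_list (
  table_name     VARCHAR2(30),
  export_freq    VARCHAR2(20),   -- e.g., DAILY, WEEKLY, MONTHLY
  last_export_dt DATE
);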
69. 69
Use Cheaper Disk
General practice: You can take greater risks with DW reliability
and availability vs. OLTP systems
Use it to your advantage
Our SAN vendor (IBM) offers a few levels of SANs. Next level
down is a big step down in price, small step down in features.
Feature loss:
Read cache (referring to disk cache, not box memory).
We rarely read the same thing twice anyway
No “phone home” to IBM (auto paging)
Mean time to failure is higher, but still acceptable
70. 70
Performance Monitoring & Tuning
Err on the side of freedom and empowerment
How much harm can really be done?
We’d rather not constrain our customers
“Pounding queries” do find
their way to production
Opportunity to educate users
Opportunity for us to tune
underlying structures
71. 71
The Focus Areas
Indexing
Well-defined criteria for when and how to apply indexes
Is this a lost art?
Big use of BITMAPS
Composite index trick (acts like a table)
Partitioning for performance, rather than data management
Exploiting Oracle’s Direct Path INSERT feature
Avoiding UPDATE and DELETE commands
Copy with MINUS instead
Implementing Oracle's Parallel Query
Turn off referential integrity in the DW… no-brainer
That’s the job of the source system
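Three of these techniques sketched in Oracle SQL; the table and constraint names are hypothetical:

-- Direct Path INSERT combined with Parallel Query
INSERT /*+ APPEND PARALLEL(sales_fact, 8) */ INTO sales_fact
SELECT /*+ PARALLEL(stage_sales, 8) */ * FROM stage_sales;

-- "Copy with MINUS" instead of DELETE: rebuild without the unwanted rows
CREATE TABLE sales_fact_new AS
  SELECT * FROM sales_fact
  MINUS
  SELECT * FROM sales_fact WHERE batch_id = 'BAD_LOAD';

-- Referential integrity off in the DW (it's the source system's job)
ALTER TABLE sales_fact DISABLE CONSTRAINT fk_sales_customer;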
72. 72
DW Monitoring:
Empowering End Users
Motive
Too many calls from end users about their
queries
“Please kill it.”
“Is it still running or is my PC locked up?”
“Why is the DW so slow?”
Give them the insight and tools
Give them the ability to kill their own
queries
Still in the works
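For reference, the raw Oracle mechanics such a self-service tool would wrap; a DBA-side sketch only, since the end-user version is still in the works:

-- Find a user's sessions (hypothetical username)
SELECT sid, serial#, username, status
FROM v$session
WHERE username = 'JSMITH';

-- Kill one, using the sid and serial# from the query above
ALTER SYSTEM KILL SESSION '123,45678';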
74. 74
Tracking Long-Running Queries
We use Pinecone (from Ambeo) to monitor
the duration of all queries and the SQL
Each week, we look at the top few
Typical outcome?
We’ll add indexes
We’ll denormalize
We'll contact the user
and assist them with
writing a better query
75. 75
The DW Sandbox
More empowerment for customers
Motive
Lots of little MS
Access databases
(with valuable data) spread
all over the place
Needed to be joined
with DW data
Costly to maintain
PC hogs
Solution
Provide customers with their own “sandbox” on the DW,
with DBA-like priv’s
76. 76
Features
Web based tool for creating tables and
loading MS Access data to the DW
Simple, easy to use interface
Privileges
Users have full rights to the
tables they create
Can grant rights to others
Big, big victory for customer
service and data “maturity”
10% of DW customers use the
Sandbox
About 600 tables in use now
About 2G of data
77. 77
Design-Build Best Practices
Build vertically, design horizontally
Start by building data marts that address analytic
needs in one area of the business with a fairly
limited data set
But, design with the horizontal needs of the
company in mind, so that you will eventually “tie”
all of these vertical data marts together with a
common semantic layer
80. 80
The Logic Layer in Data Warehouses
[Diagram: Source System → ETL Process → Data Warehouse → Reports, spanning the data layer, the logic layer, and the presentation layer; callouts mark where the logic layer belongs (“here”) and does not belong (“not here”) for analytic vs. transaction systems.]
81. 81
Evidence of Business Process Alignment
1. Map out your high level business process
Don’t fall prey to analysis paralysis with endless business
process modeling diagrams!
2. Identify and associate the transaction systems that support
those processes
3. Identify the common, overlapping semantics/data attributes
and their utilization rates
4. Build your data marts within an enterprise framework
that is aligned with the processes
you are trying to understand
82. 82
For example…
[Diagram: a healthcare business process (Health Need → Patient Perception → Diagnosis → Procedure → Results & Outcomes → Episode of Care → AP/AR → Claims Processing) supported by non-integrated data in transaction systems (HELP, Lab, HPI, MC400, Survey, AS400, IDX, HDM, CIS/CDR, HNA, Rx), all integrated in the Data Warehouse.]
83. 83
Event Correlation
A leading edge Best Practice
The third dimension to rows and columns
Overlays the data that underlies a report or
graph
“In 2004, we experienced a drop in revenue
as a result of the earthquake that destroyed
our plant in the Philippines.”
“In January of 2005, we saw a spike in the
North America market for snow shovel sales
that coincided with an increase in sales for
pain relievers. This correlates to the record
snowfall in that region and should not be
considered a trend. Barring major product
innovation, we consider the market for snow
shovels in this area as saturated. Sales will be
slow for the next several years.”
84. 84
Standardizing Semantics
Sweet irony: the many synonyms for “standard
semantics”
Data dictionary
Vocabulary
Dimensions
Data elements
Data attributes
The bottom line issue: Standardizing the terms you
use to describe key facts about your business
85. 85
Standardizing “Names of Things”
You better do it within the first two months of your
data warehouse project
If you are beyond that point, you better stop and do it now,
lest you pay a bigger price later
Don’t…
Push the standard on the source systems, unless it’s easy
to accomplish
This was one of the common pitfalls of early data
warehousing project failures
Try to standardize everything under
the sun!
Focus on the high value facts
86. 86
Where Are The “High Value” Semantics?
In the high-overlap, high-utilization areas…
[Diagram: Venn overlap of Source System X, Source System Y, and Source System Z; the intersection of all three is the highest value area for standardizing semantics.]
88. 88
The Standard Semantic “Layer”
[Diagram: Source Systems → Extract, Transform, Load → Data Warehouse, with the semantic standards applied in the ETL layer.]
89. 89
Data Modeling
Star schemas are great and simple, but they aren’t
the end-all, be-all of analytic data modeling
Best practices: Do what makes sense– don’t be a schema
bigot
I’ve seen great analytic value from 3NF models
Maintain data familiarity for your customers
When meeting vertical needs
Don’t make massive changes to the way the model looks
and feels, nor the naming conventions– you will alienate
existing users of the data
Use views to achieve “new” or standards-compliant
perspectives on data
When meeting horizontal needs
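A sketch of the view technique, with hypothetical names; existing users keep the model they know, while the view layers a standards-compliant perspective on the same data:

CREATE OR REPLACE VIEW std_patient_encounter AS
SELECT pat_id AS patient_id,      -- physical names translated to the standard
       enc_dt AS encounter_date,
       dx_cd  AS diagnosis_code
FROM casemix_encounter;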
91. 91
The Case For Timely Updates
[Chart: % of requests for data (utilization) vs. data age; utilization is near 100 for today’s data and falls toward 0 between one and two years.]
Generally, to minimize Total Cost of Ownership (TCO), your update frequency
should be no greater than the decision making cycle associated with the data.
But… everyone wants more timely data.
92. 92
Best Practice: Measure Yourself
Employee satisfaction
Customer satisfaction
Average number of
queries/month
Number of queries above a
threshold (30 minutes?)
Average query response time
Total number of records
Total number of query-able
tables
Total number of query-able
columns
Number of “users”
Average rows delivered per
month
Storage utilization
CPU utilization
Downtime per month by data
mart
The Data Warehouse Dashboard
93. 93
Other Best Practices
The Data Warehouse Information
Systems Team reports to the CIO
Most data analysts can and probably
should report to the business units
Change management/service level
agreements with the source systems
No changes in the source systems
unless they are coordinated with the data
warehouse team
94. 94
More Best Practices
Skills of the Data Warehouse IS Team
Experienced chief architect/project manager
Procedural/script programmers
SQL/declarative programmers
Data warehouse storage management architects
Data warehouse hardware architects and system
administrators
Data architects/modelers
DBAs
95. 95
More Best Practices
Evidence of project collaboration
A cross section of members and expertise from the data
warehouse IS team
Statisticians and data analysts who understand the
business domain
A customer that understands the process(es) being
measured and can influence change
A data steward– usually someone from the front lines who
knows how the data is collected
Project = complex reports or a data mart
96. 96
More Best Practices
Whenever possible, extract as close
to the source as possible
[Diagram: Primary Source → Copy A → Copy B; the best practice path extracts into the Data Warehouse directly from the Primary Source, not from downstream copies.]
97. 97
The Most Popular Authors
I appreciate…
The interest they stir
The vocabulary– semantics– of this new specialty that they helped
create
The downside…
The buzzwords that are more buzz than substance
“Corporate Information Factories”
Endless, meaningless debate
“That’s not an Operational Data Store!”
“Do you follow Kimball or Inmon?”
Follow your own common sense
Most of these authors have not had to build a data warehouse from
scratch and live with their decisions through a complete lifecycle
98. 98
ETL Operations
Besides the cultural risks and challenges, the riskiest part of a
data warehouse… the Extract, Transform, and Load
processes
Worthy of its own “Best Practices” discussion
Suffice it to say, mitigate risks in this area carefully and deliberately
The major design errors don’t show up until late in the lifecycle,
when the cost of repair is great
Good book
Westerman, Data Warehousing: Using the Wal-Mart Model
99. 99
Two Essential ETL Functions
Initial loads
How far back do we go in history?
Maintenance loads
Differential loads or total refresh?
How often?
You will run and tune these processes several times
before you go into production
How many records are we dealing with?
How long will this take to run?
What’s the impact on the source system performance?
100. 100
Maintenance Loads
Total refresh vs. Incremental loads
Total refresh: Truncate and reload everything from the
source system
Incremental: Load only the new and updated records
For small data sets, a total refresh strategy is the
easiest to implement
How do you define “small”? You will know it when you
see it.
Sometimes the fastest strategy when you are trying to
show quick results
Grab and go…
101. 101
Incremental Loads
How do we get a snapshot of the data that
has changed since the last load?
Many source systems will have an existing log file
of some kind
Take advantage of these when you can,
otherwise incremental loads can be complicated
102. 102
File Transfer Formats
Design your extract so that it uses…
Fixed, predetermined length for all records and fields
Avoid variable length if at all possible
A unique character that separates each field in a record, such as ~
A standard format for header records across all source systems
Such as the first three records in each file
Include name of source system,
file, and record count and
number of fields in the
record
This will be handy for
monitoring jobs and
collecting load metadata
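For illustration only, a hypothetical three-record header using the ~ separator suggested above; this exact field layout is my invention, not a standard:

SRC_SYSTEM~CASEMIX
FILE~casemix_extract.dat~FIELD_COUNT~12
RECORD_COUNT~184326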
103. 103
Benefits of Standard
File Transfer Format
Compatible with standard database and operating system utilities
Dynamically create initial and maintenance load scripts
Read the table definitions (DDL) then merge that with the
standard transfer file format
Dynamically generate load monitoring data
Read the header row, insert that into a
“Load Status” table with status “Running”,
# of records, start time
At EOF, change status to “Complete” and
capture end of load time
I wish I had thought about this topic
more, and earlier in my career
104. 104
Westerman Makes A Good Point
My experience: ETL is the least palatable and least
productive use of a veteran EDW Team member, so
I like Westerman’s insight on this topic
If you design for
instantaneous updates
from the beginning, it
translates to less ETL
maintenance and labor
time for the EDW staff,
later
105. 105
Messaging Applied to ETL
Basic concepts
Use a load message queue for records that need to be updated, coming
from the source systems
When the EDW analytical processing workload is low (off-peak), pick the
next message off the load queue and load the data
Run this in parallel so that you can process several load messages at
the same time while you have a window of opportunity
Sometimes called “throttling”
Speed up and slow down based upon traffic conditions
Motive behind the concept
Continuous updates in a mixed
workload environment
Mixed: Analytical processing at the
same time as transaction oriented,
constant updates, deletes, inserts
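A minimal sketch of the load queue as a plain Oracle table (Oracle Advanced Queuing is another route); every name here is hypothetical:

CREATE TABLE load_queue (
  msg_id      NUMBER,
  source_sys  VARCHAR2(30),
  load_file   VARCHAR2(250),
  enqueued_dt DATE DEFAULT SYSDATE,
  status      VARCHAR2(10) DEFAULT 'PENDING'
);

-- Off-peak worker: take the oldest pending messages, several in parallel
SELECT msg_id, source_sys, load_file
FROM load_queue
WHERE status = 'PENDING'
ORDER BY enqueued_dt;

-- Mark each message done after a successful load
UPDATE load_queue SET status = 'LOADED' WHERE msg_id = :done_id;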
107. 107
Four Data Maintenance
Processes
Initial load
Loading into an empty table
Append load
Update process
Delete process
As much as practical, use your database utilities for these
processes
Study and know your database utilities for data warehousing;
they are getting better all the time
I see some bad strategies in this area-- companies spending
time building their own utilities…aye cucumber!
108. 108
A Few Planning Thoughts
Understand the percentage of records
that will be updated, deleted, or inserted
You’ll probably develop a different
process for 90% inserts vs. 90%
updates
Logging
In general, turn logging off during the processes, if logging was
on at all
Field vs. Record level updates
Some folks, in the interest of purity, will build complex update
processes for passing only field (attribute) level changes
No brainer: Pass the whole record
109. 109
Initial Load
Every table will, at some time, require an
initial load
For some tables, it will be the best choice for data
maintenance
Total data refresh
Best for “small” tables
Simple process to implement
Simply delete (or truncate) and reload with fresh
data
110. 110
A Better Initial Load Process
Background load
Safer– protects against corrupt files
Higher availability to customers
Three or four steps… maybe 6?
1. Create a temporary table
2. Load the temporary table
3. Run quality checks
4. Rename the temporary table to the production table name
5. Delete the old table
6. Regrant rights, if necessary
Westerman: “You want to use as many initial load processes as
possible.”
I agree!
111. 111
Append Load
For larger tables that accumulate historical
data
There are no updates, just appends
A hard fact that will not change
Example
Sales that are closed
Lab results
112. 112
Append Load Options
Load a single part of a table
Load a partition and ‘attach’ it to the table
Create a new, empty partition
Load the new records
Attach the partition to the table
Look to use the “LOAD APPEND” command
in your database
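In Oracle, the partition route looks roughly like this; EXCHANGE PARTITION is the real mechanism, the table and partition names are hypothetical:

-- 1. Create a new, empty partition
ALTER TABLE sales_fact ADD PARTITION p_2004_09
  VALUES LESS THAN (TO_DATE('2004-10-01','YYYY-MM-DD'));

-- 2. Load the new records into a standalone work table (sales_stage)

-- 3. Attach: swap the loaded table into the empty partition
ALTER TABLE sales_fact EXCHANGE PARTITION p_2004_09
  WITH TABLE sales_stage INCLUDING INDEXES;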
113. 113
Another Append Option
1. Create a temp table identical to the one you
are loading
2. Load the new records into the empty temp
table
3. Issue INSERT/SELECT
INSERT INTO Big_Table (SELECT * FROM
Temp_Big_Table)
4. Delete the temp table
IF # RECORDS IN TEMP IS MUCH < # OF RECORDS IN BIG
THEN GOOD TECHNIQUE
ELSE NOT GOOD
114. 114
Update Process
The most difficult and risky to build
Use this process only if the tables are too large for a
complete refresh, “Initial Load” process
Updates affect data that changes over time
Like Purchase Orders, hospital transactions, etc.
Medical records, if you treat the data maintenance at the
macroscopic level
115. 115
Update Process Options
Simple process
1. Separate the affected records into an update file, insert file, or
delete file
Do this on the source system, if possible
2. Transfer the files to the data warehouse staging area
3. Create and run two processes
– A delete process for deleting the records in the production table
that need to be updated or deleted
– An insert process for inserting the entirely new “updated” record
into the production table, as well as the true inserts
Simple, but typically not very fast
116. 116
Simple Process
[Diagram: the Source System emits files of updated, deleted, and new records (Updates, Deletes, Inserts) into the EDW Staging Area; a Delete Process and an Insert Process apply them to the EDW Production Table.]
1. Delete Process identifies records for deletion from the Production Table based upon contents of the Updates file.
2. Delete Process identifies records for deletion from the Production Table based upon contents of the Deletes file.
3. Delete Process deletes records from the Production Table.
4. Insert Process identifies records for insert to the Production Table based upon contents of the Updates file.
5. Insert Process identifies records for insert to the Production Table based upon contents of the Inserts file.
6. Insert Process inserts records into the Production Table.
117. 117
When You Are Unsure
Sometimes, source system log and audit files make
it difficult to know if a record was updated or
inserted (i.e. created)
Try this…
1. Load the records into a temp table that is
identical to the production table to be updated
2. Delete corresponding records from the
production table
DELETE FROM prod_table WHERE key_field
IN (SELECT temp_key_field FROM temp_table)
3. Insert all the records from the temp table into
the production table
Most databases now support this with an UPSERT
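In Oracle (9i and later) that upsert is the MERGE statement; a sketch with hypothetical columns, equivalent to the delete-then-insert above:

MERGE INTO prod_table p
USING temp_table t
ON (p.key_field = t.key_field)
WHEN MATCHED THEN
  UPDATE SET p.amount = t.amount, p.status_cd = t.status_cd
WHEN NOT MATCHED THEN
  INSERT (key_field, amount, status_cd)
  VALUES (t.key_field, t.amount, t.status_cd);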
118. 118
Massive Deletes
Just as with Updates and Inserts, the number of Deletes you
have to manage is inversely proportional to the frequency of your
ETL processes
Infrequent ETL → massive data operations
Partitions work well for this, again
E.g., keeping a 5 year window of data
Insert the most recent year as a new partition
Drop the oldest year’s partition
Blazing fast!
[Diagram: five yearly partitions; the oldest partition is deleted as the newest is inserted.]
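The same rolling window in Oracle partition DDL; table and partition names are hypothetical:

-- Drop the oldest year: a metadata operation, hence "blazing fast"
ALTER TABLE sales_fact DROP PARTITION p_1999;

-- Add the incoming year
ALTER TABLE sales_fact ADD PARTITION p_2004
  VALUES LESS THAN (TO_DATE('2005-01-01','YYYY-MM-DD'));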
119. 119
“Raw” Data Standards for ETL
Makes the process of communicating with your source system
partners much easier
Data type (e.g., format for date time stamps)
File formats (ASCII vs. EBCDIC)
Header records
Control characters
Rule of thumb
Never transfer data at the binary level unless you are transferring
between binary compatible computer systems
Use only text-displayable characters
The trade-off: less rework time vs. less storage space and faster transfer speed
Storage and CPU time are cheap compared to labor
120. 120
Last Thought…Indexing Strategies
Define these early, practice them
religiously, use them extensively
This is “Database Design 101”
Don’t fall prey to this most common
performance problem!
121. 121
My Thanks
For being invited…
For your time and attention
For the many folks who have worked for and with
me over the years that made me look better as a
result
Please contact me if you have any questions
dsanders@nmff.org
PH: 312-695-8618