The document discusses operational data warehousing and the Data Vault model. It begins with an agenda for the presentation and introduction of the speaker. It then provides a short review of the Data Vault model. The remainder of the document discusses operational data warehousing, how the Data Vault model is well-suited for this purpose, and the benefits it provides including flexibility, scalability, and productivity. It also discusses how tools and technologies are advancing to support automation and self-service business intelligence using an operational data warehouse architecture based on the Data Vault model.
2. Agenda: Introduction (why are you here?); Short Data Vault Review; What’s Next? Advanced Architecture…; Defining Operational Data Warehousing; Why is Data Vault a Good Fit?; <BREAK>; Fundamental Paradigm Shift; Business Keys & Business Processes; Technical Review; Query Performance (PIT & Bridge); What wasn’t covered in this presentation…
3. A bit about me… Author, Inventor, Speaker, and part-time photographer. 25+ years in the IT industry. Worked in DoD, US Gov’t, Fortune 50, and so on. Find out more about the Data Vault: http://YouTube.com/LearnDataVault and http://LearnDataVault.com. Slides available: http://SlideShare.net (search: “Advanced Architecture Data Vault”). Full profile on http://www.LinkedIn.com/dlinstedt
4. Why Are You Here? 4 Your Expectations? Your Questions? Your Background? Areas of Interest? Biggest question: What are the top 3 pains your current EDW / BI solution is experiencing?
6. Data Warehousing Timeline (1960 to 2010): Mid 60’s: Dimension & Fact Modeling presented by General Mills and Dartmouth University; E.F. Codd invented relational modeling; Early 70’s: Bill Inmon began discussing Data Warehousing; Mid 70’s: AC Nielsen popularized Dimension & Fact terms; 1976: Dr Peter Chen created E-R Diagramming; Chris Date and Hugh Darwen maintained and refined modeling; Mid 80’s: Bill Inmon popularizes Data Warehousing; Mid to Late 80’s: Dr Kimball popularizes Star Schema; Late 80’s: Barry Devlin and Dr Kimball release “Business Data Warehouse”; 1990: Dan Linstedt begins R&D on Data Vault Modeling; 2000: Dan Linstedt releases first 5 articles on Data Vault Modeling; 2010 onward: DV alive and well around the world.
7. Data Vault Modeling… Took 10 years of research and design, including TESTING, to become flexible, consistent, and scalable.
13. Complete with Best Practices for BI/DW. Business Keys Span / Cross Lines of Business. Functional Areas: Sales, Contracts, Planning, Delivery, Finance, Operations, Procurement.
17. Hub = List of Unique Business Keys; Link = List of Relationships, Associations; Satellites = Descriptive Data
18. Colorized Perspective… Data Vault 3rd NF & Star Schema (separation): Business Keys, Associations, Details. The Data Vault uniquely separates the Business Keys (Hubs) from the Associations (Links), and both of these from the Details that describe them and provide context (Satellites). (Colors Concept Originated By: Hans Hultgren)
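The three building blocks named on this slide can be sketched as plain data structures. This is a minimal illustration, not the author's reference design: the class layout, the `sqn`, `load_dts`, and `record_source` field names (borrowed from the column names on the later structure slides), and the sample values are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Hub:
    """One row per unique business key."""
    sqn: int                 # surrogate sequence number (hypothetical name)
    business_key: str
    load_dts: datetime
    record_source: str


@dataclass
class Link:
    """One row per unique relationship between two or more hub rows."""
    sqn: int
    hub_sqns: tuple          # sequence numbers of the hub rows being related
    load_dts: datetime
    record_source: str


@dataclass
class Satellite:
    """Descriptive data hanging off one hub or link row."""
    parent_sqn: int          # sqn of the hub or link row being described
    load_dts: datetime
    record_source: str
    attributes: dict         # the descriptive columns


# Illustrative values only.
loaded = datetime(2011, 9, 27)
customer = Hub(1, "CUST-001", loaded, "SALES")
product = Hub(2, "ABC-123-9K", loaded, "SALES")
purchase = Link(1, (customer.sqn, product.sqn), loaded, "SALES")
customer_details = Satellite(customer.sqn, loaded, "SALES", {"name": "Acme"})
```

The point of the split is that business keys, relationships, and descriptive data each live in their own structure, so a change to one does not force a change to the others.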
19. A Quick Look at Methodology Issues Business Rule Processing, Lack of Agility, and Future proofing your new solution 12
28. Re-Engineering Business Rules Data Flow (Mapping) Current Sources Sales Customer Source Join Finance Customer Transactions Customer Purchases IMPACT!! ** NEW SYSTEM** 15
29. Federated Star Schema Inhibiting Agility [Chart: Effort & Cost (Low to High) over Time; markers: Start, Maintenance Cycle Begins, Data Mart 1, Data Mart 2, Data Mart 3] Changing and adjusting conformed dimensions causes an exponential rise in the cost curve over time. RESULT: Business builds their own Data Marts! The main driver for this is the maintenance cost and the re-engineering of the existing system that occurs for each new “federated/conformed” effort. This increases delivery time, difficulty, and maintenance costs.
35. Auditable. The business rules are moved closer to the business, improving IT reaction time, reducing cost, and minimizing impacts to the enterprise data warehouse (EDW).
36. NO Re-Engineering Current Sources Data Vault Sales Stage Copy Hub Customer Customer Finance Stage Copy Link Transaction Customer Transactions Hub Acct Hub Product Customer Purchases Stage Copy NO IMPACT!!! NO RE-ENGINEERING! ** NEW SYSTEM** IMPACT!! 18
37. Progressive Agility and Responsiveness of IT [Chart: Effort & Cost (Low to High) over Time; markers: Start, Initial DV Build Out, Foundational Base Built, New Functional Areas Added, Maintenance Cycle Begins] Re-engineering does NOT occur with a Data Vault model. This keeps costs down and maintenance easy. It also reduces the complexity of the existing architecture.
39. What are the top business obstacles in your data warehouse today?
40. Poor Agility; Inconsistent Answer Sets; Needs Accountability; Demands Auditability; Desires IT Transparency. Are you feeling Pinned Down?
41. What are the top technology obstacles in your data warehouse today?
42. Complex Systems; Real-Time Data Arrival; Unimaginable Data Growth; Master Data Alignment; Bad Data Quality; Late Delivery/Over Budget. Are your systems CRUMBLING?
44. Projects cancelled & restarted; re-engineering required to absorb new systems; complexity drives maintenance cost sky high; disparate silo solutions provide inaccurate answers!; severe lack of accountability.
47. What is it? It’s a simple, easy-to-use plan to build your valuable Data Warehouse!
48. What’s the Value? Painless Auditability; Understandable Standards; Rapid Adaptability; Simple Build-out; Uncomplicated Design; Effortless Scalability. Pursue Your Goals!
49. Why Bother With Something New? Old Chinese proverb: 'Unless you change direction, you're apt to end up where you're headed.' 31
50. What Are the Issues? This is NOT what you want happening to your project! Business… Changes Frequently; Needs Accountability; Demands Auditability; Has No Visibility; Wants More Control. IT… Takes Too Long; Is Over-budget; Too Complex; Can’t Sustain Growth. THE GAP!!
51. What Are the Foundational Keys? Flexibility Scalability Productivity 33
65. Case In Point: Result of scalability was to produce a Data Vault model that scaled to 3 Petabytes in size, and is still growing today! 41
66. Key: Scalability in Team Size You should be able to SCALE your TEAM as well! With the Data Vault methodology, you can: Scale your team when desired, at different points in the project! 42
67. Case In Point: (Dutch Tax Authority) Result of scalability was to increase ETL developers for each new source system, and reassign them when the system was completely loaded to the Data Vault 43
77. The Competing Bid? The competition bid this with 15 people and 3 months to completion, at a cost of $250k! (they bid a Very complex system) Our total cost? $30k and 2 weeks! 46
78. Results? Changing the direction of the river takes less effort than stopping the flow of water 47
80. What’s Next? A look at what’s around the corner for Data Warehousing and Business Intelligence. Believe me, it’s going to get interesting fast.
81. Operational Data Vault. Data Co-Location: Transactions & Transaction History; Master Data & Master Data History; Metadata & Metadata History; External Data & External Data History; Business Rules & Business Rule History; Security / Access Data & History; Unstructured Data Ties & History. Real-time data feeds DIRECTLY into the data store. Operational applications ON TOP of the warehouse!
95. Results of all of this? The EDW will: become BACK OFFICE!!; become SELF-RELIANT / SELF-HEALING; adapt to new structures, new hardware, and new data; automatically back up and remove old data. Self-Reliance http://images.businessweek.com/ss/06/10/bestunder25/source/1.htm
96. How Long Will it Take? My milestone predictions: 1 yr: Operational Data Vault; 2 yrs: beginning automation of business rules; 3 yrs: beginning dynamic restructuring in the DV; 4 yrs: Oper Apps contain BI, metadata, and Master Data GUIs in a single place; 5 yrs: the “all-in-one” appliance, containing 75% of what we need at the firmware level to do all these things. http://thypolarlife.wordpress.com/2011/08/02/this-moment-in-time/
105. What IS An Operational DW? A raw, time-variant, integrated, non-volatile data warehouse, on top of which sits an operational application that is “editing and changing data”. However, instead of updates and deletes in place, the data is “marked” deleted, and updates are turned into inserts, creating a delta audit trail along the way. Yes, it’s an operational application on top of the integrated data warehouse (or in this case, the Data Vault model).
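The "updates become inserts, deletes become marks" idea on this slide can be sketched with an in-memory SQLite table. This is a minimal sketch under stated assumptions: the table name `sat_customer`, the `is_deleted` flag, and the integer `load_seq` (a logical clock standing in for a load timestamp) are hypothetical and do not come from the presentation.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_customer (
        parent_sqn INTEGER NOT NULL,
        load_seq   INTEGER NOT NULL,  -- logical clock standing in for a load timestamp
        is_deleted INTEGER NOT NULL,  -- 1 = row "marked" deleted
        name       TEXT,
        PRIMARY KEY (parent_sqn, load_seq)
    )
""")
_clock = itertools.count(1)


def _insert_version(parent_sqn, name, is_deleted):
    # Every change is a new row; existing rows are never touched.
    conn.execute(
        "INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
        (parent_sqn, next(_clock), is_deleted, name),
    )


def latest(parent_sqn):
    """Return (name, is_deleted) of the most recent version, or None."""
    return conn.execute(
        "SELECT name, is_deleted FROM sat_customer "
        "WHERE parent_sqn = ? ORDER BY load_seq DESC LIMIT 1",
        (parent_sqn,),
    ).fetchone()


def update_customer(parent_sqn, name):
    # An "update" is stored as an insert of the new state.
    _insert_version(parent_sqn, name, 0)


def delete_customer(parent_sqn):
    # A "delete" is stored as a new version flagged as deleted.
    current = latest(parent_sqn)
    _insert_version(parent_sqn, current[0] if current else None, 1)


update_customer(1, "Acme")
update_customer(1, "Acme Corp")
delete_customer(1)
history = conn.execute(
    "SELECT name, is_deleted FROM sat_customer ORDER BY load_seq"
).fetchall()
```

After the three calls, `history` holds all three versions in order, which is the delta audit trail the slide describes, while `latest(1)` reports the record as marked deleted.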
106. Oper/Active DW Timeline (1980 to 2010): Data Warehouses split from Operational Systems; “Appliances” begin appearing on-scene; Teradata makes real advances in Active DW; Real-Time & Oper BI make the scene (users want direct control & up-to-the-minute data). Dated milestones: Mid 90’s: “Active” DW becomes important but has to wait for technology to catch up!; 2002: Cendant-TRG creates the world’s first Operational Data Vault.
119. Why should I care? 66 TWO REASONS: CONVERGENCE SELF-SERVICE BI
120. Under the Covers… Application (presents data to the user in conformed screens): 3. Present in GUI; 4. Accept Ins, Upd, Del. Data Access Control Layer: 1. Read Data for Edit; 2. Lock Business Key Rows; 5. Perform Insert / Status change; 6. Release Lock On Business Key Rows. Operational Data Vault (ODW) Layer: Hub Parts, Hub Seller, Hub Product, Links, and Satellites (Sat 1 to Sat 4).
121. Dropping by the Way-Side: No… ETL; BATCH DRIVEN PROCESSING; “Synchronization” with the Source System; missing source data. No scalability problems. No ODS needed! No “Master Data” system needed. No Staging area needed.
122. Positives: Data in the ODW can be governed; audit trail built in; only deltas are stored; NEW applications can be created to “automatically” generate Cubes/Star Schemas, and these apps can be run by the users, so Self-Service BI is enabled!; Master data can be “marked, scored, stored” in the same place as the EDW.
123. Old Components Still There? Staging areas will exist as long as there is external data to load and integrate. ODS areas may still exist as long as there are other legacy applications existing as source systems. Master Data areas may still exist as long as the logic is not built directly into the “operational DW application”.
124. Secure ODV Technical Layers 71 Visible Objects Inbound API Outbound API Services Authentication API Master Data API Component Groups Packaging API Pedigree API Security Key Mgr API Transaction API Aggregation API File Management Interface Kit API Busn. Intelligence API Notification Interface Vault Accessibility Subject Area API Scheduling Interface Local DB Interface Global DB Interface Common Data Object Area Security Interface (Encryption Too) Format Interface Persistence Cache DB Interface Logging Interface Database Interface Web Server Locally Based Persistent DB Cache for Joining Global DB Local DB1 Local DB2
125. What are the benefits? Simplified architecture; single copy of the data!; no “intermediate” IT work to do; users become empowered, with direct access to data sets. Of course, using the Data Vault model, you gain ALL the benefits of the Data Vault (scalability, flexibility, etc.). NOTE: Two or more “users” can actually EDIT different parts of the same record at the same time! Integrating external data basically makes it all available to the application immediately! NO NEED TO BUILD A SEPARATE EDW!!
126. What are the drawbacks? No current “application” is using the Data Vault for operational data. In other words, off-the-shelf apps in this area do not yet exist; you have to “build it” yourself. Self-Service BI application technology is nascent or non-existent today. Master Data & Metadata applications are not currently available on top of Data Vault.
129. Link Structures
Link_Product_Supplier: LPS_SQN, PRODUCT_SQN, SUPPLIER_SQN, LPS_LOAD_DTS, LPS_REC_SOURCE, LPS_ENCR_KEY
Link_Customer_Account_Employee: LCAE_SQN, CUSTOMER_SQN, ACCOUNT_SQN, EMPLOYEE_SQN, LCAE_LOAD_DTS, LCAE_REC_SOURCE (Unique Index)
Link Structure: SEQUENCE, <HUB KEY SQN 1>, <HUB KEY SQN 2>, <HUB KEY SQN N>, {LAST SEEN DATE}, {CONFIDENCE}, {STRENGTH}, <LOAD DATE>, <RECORD SOURCE> (Unique Index)
Fields in braces: Optional Dynamic Link
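The Link_Product_Supplier layout above can be tried out in SQLite. This is a hedged sketch: the slide marks a unique index but the transcript does not say which columns it covers, so the code assumes it spans the hub sequence columns. The `LPS_ENCR_KEY` column is left out, and the `load_link` helper and sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE link_product_supplier (
        lps_sqn        INTEGER PRIMARY KEY,   -- SEQUENCE
        product_sqn    INTEGER NOT NULL,      -- <HUB KEY SQN 1>
        supplier_sqn   INTEGER NOT NULL,      -- <HUB KEY SQN 2>
        lps_load_dts   TEXT NOT NULL,         -- <LOAD DATE>
        lps_rec_source TEXT NOT NULL          -- <RECORD SOURCE>
    );
    -- Assumption: the unique index covers the combination of hub keys.
    CREATE UNIQUE INDEX ux_lps_hub_keys
        ON link_product_supplier (product_sqn, supplier_sqn);
""")


def load_link(product_sqn, supplier_sqn, load_dts, rec_source):
    """Record a relationship once; repeat arrivals of the same pair are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO link_product_supplier "
        "(product_sqn, supplier_sqn, lps_load_dts, lps_rec_source) "
        "VALUES (?, ?, ?, ?)",
        (product_sqn, supplier_sqn, load_dts, rec_source),
    )


load_link(2756, 10, "2000-08-01", "SALES")
load_link(2756, 10, "2000-09-01", "FINANCE")   # same relationship: ignored
load_link(2756, 11, "2000-09-01", "FINANCE")   # new relationship: inserted
link_rows = conn.execute(
    "SELECT product_sqn, supplier_sqn, lps_rec_source "
    "FROM link_product_supplier ORDER BY lps_sqn"
).fetchall()
```

With the unique index in place the load is repeatable: replaying the same source data leaves the link table unchanged, and the first-seen load date and record source are preserved.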
130. Satellites Split By Source System
Satellites: SAT_FINANCE_CUST, SAT_CONTRACTS_CUST, SAT_SALES_CUST
Each satellite carries PARENT SEQUENCE, LOAD DATE, <LOAD-END-DATE>, <RECORD-SOURCE>, followed by its own descriptive columns. The three column sets shown are:
Contact Name, Contact Email, Contact Phone Number
First Name, Last Name, Guardian Full Name, Co-Signer Full Name, Phone Number, Address, City, State/Province, Zip Code
Name, Phone Number, Best time of day to reach, Do Not Call Flag
Satellite Structure: PARENT SEQUENCE, LOAD DATE, <LOAD-END-DATE>, <RECORD-SOURCE>, {user defined descriptive data} {or temporal based timelines} (Primary Key)
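A satellite load that stores only deltas can be sketched as follows. The assumptions are labeled: the table `sat_cust_example` is hypothetical, the primary key is taken to be PARENT SEQUENCE plus LOAD DATE, and closing the previous row by setting its load end date is one common convention rather than something the slide spells out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_cust_example (
        parent_sqn   INTEGER NOT NULL,
        load_dts     TEXT NOT NULL,
        load_end_dts TEXT,              -- NULL marks the current row
        rec_source   TEXT NOT NULL,
        name         TEXT,
        phone_number TEXT,
        PRIMARY KEY (parent_sqn, load_dts)  -- assumed key: parent + load date
    )
""")


def load_satellite(parent_sqn, load_dts, rec_source, name, phone_number):
    """Insert a row only when the descriptive data changed. Returns True if stored."""
    current = conn.execute(
        "SELECT load_dts, name, phone_number FROM sat_cust_example "
        "WHERE parent_sqn = ? AND load_end_dts IS NULL",
        (parent_sqn,),
    ).fetchone()
    if current is not None and current[1:] == (name, phone_number):
        return False                     # no delta, nothing to store
    if current is not None:
        # Close the previous version by end-dating it.
        conn.execute(
            "UPDATE sat_cust_example SET load_end_dts = ? "
            "WHERE parent_sqn = ? AND load_dts = ?",
            (load_dts, parent_sqn, current[0]),
        )
    conn.execute(
        "INSERT INTO sat_cust_example VALUES (?, ?, NULL, ?, ?, ?)",
        (parent_sqn, load_dts, rec_source, name, phone_number),
    )
    return True


results = [
    load_satellite(1, "2000-08-01", "SALES", "Acme", "555-0100"),
    load_satellite(1, "2000-09-01", "SALES", "Acme", "555-0100"),  # no change
    load_satellite(1, "2000-10-01", "SALES", "Acme", "555-0199"),  # phone changed
]
sat_rows = conn.execute(
    "SELECT load_dts, load_end_dts, phone_number FROM sat_cust_example "
    "ORDER BY load_dts"
).fetchall()
```

Splitting satellites by source system means each source gets its own table like this one, so a new source adds a table instead of altering an existing one.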
132. History Teaches Us… If we model for ONE relationship in the EDW, we BREAK the others! The EDW is designed to handle TODAY’S relationship; as soon as history is loaded, it breaks the model! Today: Portfolio (1) to Customer (M). 5 years from now: Portfolio (M) to Customer (M). 10 years ago: Portfolio (M) to Customer (1). This situation forces re-engineering of the model, load routines, and queries!
133. History Teaches Us… If we model with a LINK table, we can handle ALL the requirements! Hub Portfolio and Hub Customer are joined through LNK Cust-Port. Today: Portfolio (1) to Customer (M). 5 years from now: Portfolio (M) to Customer (M). 10 years ago: Portfolio (M) to Customer (1). This design is flexible and handles past, present, and future relationship changes with NO RE-ENGINEERING!
134. Applying the Data Vault to Global DW2.0 Manufacturing EDW in China Planning in Brazil Hub Hub Link Sat Sat Link Sat Sat Link Hub Link Hub Hub Sat Sat Sat Sat Sat Sat Sat Sat Base EDW Created in Corporate Financials in USA 81
137. Purpose Of PIT & Bridge: To reduce the number of joins, and to reduce the amount of data being queried for a given range of time. Together, these two allow “direct table match” as well as table elimination to occur in the queries. These tables are not necessary for the entire model; only when: massive amounts of data are found; large numbers of Satellites surround a Hub or Link; a large query across multiple Hubs & Links is necessary; real-time data is flowing in, uninterrupted. What are they? Snapshot tables, specifically built for query speed.
138. PIT Table Architecture. Satellite: Point In Time. Columns (Primary Key first): PARENT SEQUENCE, LOAD DATE, {Satellite 1 Load Date}, {Satellite 2 Load Date}, {Satellite 3 Load Date}, {…}, {Satellite N Load Date}. [Diagram: Hub Customer, Hub Order, Hub Product, Link Line Item, Satellite Line Item, Sat 1 to Sat 4, and two PIT Sats]
140. Bridge Table Architecture. Satellite: Bridge. Columns (Primary Key first): UNIQUE SEQUENCE, LOAD DATE, {Hub 1 Sequence #}, {Hub 2 Sequence #}, {Hub 3 Sequence #}, {Link 1 Sequence #}, {Link 2 Sequence #}, {…}, {Link N Sequence #}, {Hub 1 Business Key}, {Hub 2 Business Key}, {…}, {Hub N Business Key}. [Diagram: Bridge over Hub Parts, Hub Seller, Hub Product, two Links, and Satellites Sat 1 to Sat 4]
141. Bridge Table Data Example. Bridge Table: Seller by Product by Part. Snapshot Date.
SQN LOAD_DTS SELL_SQN SELL_ID PROD_SQN PROD_NUM PART_SQN PART_NUM
1 08-01-2000 15 NY*1 2756 ABC-123-9K 525 JK*2*4
2 09-01-2000 16 CO*242654 DEF-847-0L 324 MN*5-2
3 10-01-2000 16 CO*2482374 PPA-252-2A 9938 DD*2*3
4 11-01-2000 24 AZ*2525222 UIF-525-88 7 UF*9*0
5 12-01-2000 99 NM*581 DAN-347-7F 16 KI*9-2
6 01-01-2001 99 NM*581 DAN-347-7F 24 DL*0-5
142. What WASN’T Covered: ETL Automation; ETL Implementation; SQL Query Logic; Balanced MPP design; Data Vault Modeling on Appliances; Deep Dive on Structures (Hubs, Links, Satellites); What happens when you break the rules?; Project management, risk management & mitigation, methodology & approach; Automation: automated DV modeling, automated ETL production; Change Management; Temporal Data Modeling Concerns… And so on…
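The PIT layout above (parent sequence, snapshot load date, and one load date per surrounding satellite) can be sketched in SQLite. Table names, sample rows, and the `build_pit_row` helper are hypothetical; ISO date strings are used so that text comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sat_1 (parent_sqn INTEGER, load_dts TEXT, name TEXT);
    CREATE TABLE sat_2 (parent_sqn INTEGER, load_dts TEXT, status TEXT);
    CREATE TABLE pit_customer (
        parent_sqn     INTEGER,
        load_dts       TEXT,      -- snapshot date
        sat_1_load_dts TEXT,      -- {Satellite 1 Load Date}
        sat_2_load_dts TEXT,      -- {Satellite 2 Load Date}
        PRIMARY KEY (parent_sqn, load_dts)
    );
    INSERT INTO sat_1 VALUES (1, '2000-01-01', 'Acme'), (1, '2000-03-01', 'Acme Corp');
    INSERT INTO sat_2 VALUES (1, '2000-02-01', 'active');
""")


def build_pit_row(parent_sqn, snapshot_dts):
    """Store, per satellite, the latest load date at or before the snapshot."""
    conn.execute("""
        INSERT INTO pit_customer
        SELECT ?, ?,
            (SELECT MAX(load_dts) FROM sat_1 WHERE parent_sqn = ? AND load_dts <= ?),
            (SELECT MAX(load_dts) FROM sat_2 WHERE parent_sqn = ? AND load_dts <= ?)
    """, (parent_sqn, snapshot_dts, parent_sqn, snapshot_dts, parent_sqn, snapshot_dts))


build_pit_row(1, "2000-02-15")
build_pit_row(1, "2000-03-15")

# Queries now join each satellite by equality on the stored load date.
pit_rows = conn.execute("""
    SELECT p.load_dts, s1.name, s2.status
    FROM pit_customer p
    LEFT JOIN sat_1 s1 ON s1.parent_sqn = p.parent_sqn AND s1.load_dts = p.sat_1_load_dts
    LEFT JOIN sat_2 s2 ON s2.parent_sqn = p.parent_sqn AND s2.load_dts = p.sat_2_load_dts
    ORDER BY p.load_dts
""").fetchall()
```

The "latest row as of this date" search is paid once when the snapshot is built; readers then use plain equality joins, which is the "direct table match" the previous slide refers to.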
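A bridge snapshot like the one in this example can be produced by walking the links once and storing the hub sequences and business keys side by side. This sketch reuses the values from the first data row; the hub and link table names and the `snapshot_bridge` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_seller  (sell_sqn INTEGER PRIMARY KEY, sell_id  TEXT);
    CREATE TABLE hub_product (prod_sqn INTEGER PRIMARY KEY, prod_num TEXT);
    CREATE TABLE hub_part    (part_sqn INTEGER PRIMARY KEY, part_num TEXT);
    CREATE TABLE link_seller_product (sell_sqn INTEGER, prod_sqn INTEGER);
    CREATE TABLE link_product_part   (prod_sqn INTEGER, part_sqn INTEGER);
    CREATE TABLE bridge_seller_product_part (
        sqn INTEGER PRIMARY KEY, load_dts TEXT,
        sell_sqn INTEGER, sell_id TEXT,
        prod_sqn INTEGER, prod_num TEXT,
        part_sqn INTEGER, part_num TEXT
    );
    INSERT INTO hub_seller  VALUES (15, 'NY*1');
    INSERT INTO hub_product VALUES (2756, 'ABC-123-9K');
    INSERT INTO hub_part    VALUES (525, 'JK*2*4');
    INSERT INTO link_seller_product VALUES (15, 2756);
    INSERT INTO link_product_part   VALUES (2756, 525);
""")


def snapshot_bridge(load_dts):
    """Flatten seller -> product -> part into bridge rows for one snapshot date."""
    conn.execute("""
        INSERT INTO bridge_seller_product_part
            (load_dts, sell_sqn, sell_id, prod_sqn, prod_num, part_sqn, part_num)
        SELECT ?, s.sell_sqn, s.sell_id, p.prod_sqn, p.prod_num,
               t.part_sqn, t.part_num
        FROM link_seller_product lsp
        JOIN link_product_part lpp ON lpp.prod_sqn = lsp.prod_sqn
        JOIN hub_seller  s ON s.sell_sqn = lsp.sell_sqn
        JOIN hub_product p ON p.prod_sqn = lsp.prod_sqn
        JOIN hub_part    t ON t.part_sqn = lpp.part_sqn
    """, (load_dts,))


snapshot_bridge("08-01-2000")

# Readers hit the bridge directly, with no hub or link joins at query time.
bridge_rows = conn.execute(
    "SELECT sqn, load_dts, sell_id, prod_num, part_num "
    "FROM bridge_seller_product_part"
).fetchall()
```

The five-way join runs once per snapshot; queries that need seller, product, and part keys together then read a single table.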
145. The Experts Say… “The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.” Bill Inmon “The Data Vault is foundationally strong and exceptionally scalable architecture.” Stephen Brobst “The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing....” Doug Laney
146. More Notables… “This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before.” Howard Dresner “[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from..” Scott Ambler
147. Where To Learn More. The Technical Modeling Book: http://LearnDataVault.com. The Discussion Forums & events: http://LinkedIn.com (Data Vault Discussions). Contact me: http://DanLinstedt.com (web site), DanLinstedt@gmail.com (email). World wide User Group (Free): http://dvusergroup.com. Certification Training: contact me, or learn more at http://GeneseeAcademy.com
148. ODV – Case Study Operational Data Vault – IN THE REAL WORLD! 95
149. E-Pedigree, Drug Track & Trace 96 Product Returns And Recalls Product Packaging CorpSite Server Secure Integration Services Corporate Serialization Vault Serialization Analytics Engine Packaging Orders Product Authenticator 3rd Party Logistics Distribution Warehouse Secure Integration Services E-Pedigree Management Manufacturer Product Packager Supply Chain
159. Changes to the data model ripple (larger impacts) as more customers are signed up.
160. Each “support call” requires a separate login to see the data set. Data Exchange/Sharing Through Code Only Web-Services and Flat File Delivery Customer Login Corp Login Customer Login Corp Login Employee Validation Admin Login Encrypt Key Encrypt Key Encrypt Key Mart 1 Mart 2 Mart 3 Mart 1 Mart 2 Mart 3 Tracking # Machine Info SQL View Layer SQL View Layer Global Data Vault Data Vault Manufacturer Shipper 9/27/2011
166. Corporate Owned Key (Encrypts data internally) Corp Managed / Owned Copy Web Services Customer Copy Customer Login Corp Login +HTTPS Corp Encrypt Key Web Services Encrypted Flat Files Decryption Key + SFTP Customer Local Copy
167. Security: ODV Web Services 102 Corp Managed / Owned Copy Web Browser Web Site / Server Java Script Or PHP Web Services Customer Login Corp Login Corporate Encrypt Key Corporate Owned Encryption Key Global DB
168. Inflow/Outflow Applications 103 Customer Corporation Corporation Customer Source Machine Encrypts Data Using Customer Key Corp Decrypts Data According to Customer Key Corp Re-Encrypts Data According to Internal Key For Specific Customer Corp Decrypts Data According to Internal Key For Specific Customer Corp Encrypts Data According to Customer Key Customer Decrypts Data According to Customer Key DB DB Transmit Encrypted Data over HTTPS Transmit Encrypted Data over HTTPS Web Service Sender Web Service Collector
169. ODV: Secure File Request 104 Corporation Customer ** Note: Each Customer DB is encrypted via an internally owned Corp key which is unique to EACH customer. Customer Decrypts File According to Customer Key Transmit Encrypted Data over FTPS Encrypted File
170. ODV: Front-End Ping Request 105 Corporation Customer Corp One-Way Hash of key Number To Execute Ping Web-Based PING Validation DBMS Unencrypted Data Transfer Login / Auth
Editor's Notes
Before we begin exploring how the Data Vault can help you, or even defining what a Data Vault is, we need to first understand some of the business problems that may be causing you heartburn on a daily basis.
Everything from poor agility to a lack of IT transparency plagues today’s data warehouses. I can’t begin to tell you how much pain these businesses are suffering as a result of these problems. Inconsistent answer sets, lack of accountability, and inadequate auditability all play a part in data warehouses that are currently on the brink of falling apart. But it’s not just business issues; there are technical ones to cope with as well.
There are always technology obstacles that we face in any data warehousing project. So the question is: what kinds of problems have you seen in your journey? Do they haunt you today?
Complexity drives high cost, resulting in unnecessarily late delivery schedules and unsustainable business logic in the integration channels. Real-time data is flooding our data warehouses; has your architecture fallen down on the job? Unstructured data and legal requirements for auditability are bringing huge data volumes. Master Data Alignment is missing from our data warehouses, as they are split across disparate systems all over the world. Bad data quality is covered up through the transformation layers on the way IN to your EDW. Data warehouses grow so large and become so difficult to maintain that IT teams are often delivering late and beyond original costs. The foundations of your data warehouse are probably crumbling under sheer weight and pressure.
Disparate data marts, unmatched answer sets, geographical problems, and worse… Projects are under fire from a number of areas. Let’s take a look at what happens when a data warehouse project reaches the brick wall head-on, at 90 miles an hour.
I think this says it all… Projects cancelled and restarted, re-engineering required to absorb changes, high complexity making it difficult to upgrade, change, and keep up at the speed of business. Disparate silo solutions screaming for consolidation, and of course a lack of accountability on BOTH sides of the fence… All signs of an ailing BI solution on the brink of being shut down.
We have got to keep focus on the prize. Business still wants a BI system backed by an enterprise EDW. IT still wants a manageable system that will grow and change without major re-engineering. There is a better way, and I can help you with it.
The Data Vault model is really just another name for “Common foundational architecture and design”. It’s based on 10 years of research and design work, followed by 10 years of implementation best practices. It is architected to help you solve the problems!
Put quite simply: it’s an easy-to-use architecture and plan, a guide-book for building a repeatable, consistent, and scalable data warehouse system. So just what is the value of the Data Vault?
The Data Vault model and methodology provide: Painless Auditability, Understandable Standards, Rapid Adaptability, Simple Build-out, Uncomplicated Design, and Effortless Scalability. Go after your goals, build a wildly successful data warehouse just like I have.
Beginning: 5 advanced ETL developers. By the 1st month, they had 5 advanced and 15 basic/intro. By the 6th month, they had 5 advanced but 50 basic. By the end of the 8th month they went to production with 10 MF sources, and their team size was 12 people (5 advanced, 7 basic, for support).
You’re not the first, nor will you be the last, to use it. Some of the world’s biggest companies are implementing Data Vaults, from Daimler Motors to Lockheed Martin to the Department of Defense. JPMorgan and Chase used the Data Vault model to merge 3 companies in 90 days!