Column Replication or Movement
May want to replicate columns in order to facilitate co-location of commonly joined
tables.
Before denormalization:
A three table join requires re-distribution of significant amounts of data to answer many
important questions related to customer transaction behavior.
CustTable: Customer_Id | Customer_Nm | Address | Ph | …
    1 : m to AcctTable (Customer_Id)
AcctTable: Account_Id | Customer_Id | Balance$ | Open_Dt | …
    1 : m to TrxTable (Account_Id)
TrxTable:  Tx_Id | Account_Id | Tx$ | Tx_Dt | Location_Id | …
Column Replication or Movement
May want to replicate columns in order to facilitate co-location of commonly joined
tables.
After denormalization:
All three tables can be co-located by using Customer_Id as the primary index, so
the three-table join runs much more quickly.
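CustTable: Customer_Id | Customer_Nm | Address | Ph | …
    1 : m to AcctTable (Customer_Id)
    1 : m to TrxTable (Customer_Id, the replicated column)
AcctTable: Account_Id | Customer_Id | Balance$ | Open_Dt | …
    1 : m to TrxTable (Account_Id)
TrxTable:  Tx_Id | Account_Id | Customer_Id | Tx$ | Tx_Dt | Location_Id | …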
Column Replication or Movement
What is the impact of this approach to achieving table co-location?
• Increases the size of the transaction table (the largest table in the
database) by the size of the customer_id key on every row.
• If a customer key changes (consider the impact of
individualization), the update must be propagated down to the
transaction table.
• Must include customer_id in the join between the transaction
table and the account table so that the optimizer recognizes the
co-location (even though that predicate is logically redundant
given the join on account_id).
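The key-propagation cost in the second bullet can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite database, a cut-down version of the three tables, and a hypothetical helper named `rekey_customer`; none of these come from the slides, and SQLite has no notion of co-location.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_nm TEXT);
    CREATE TABLE account  (account_id INTEGER PRIMARY KEY, customer_id INTEGER);
    -- customer_id on tx is the replicated (denormalized) column
    CREATE TABLE tx (tx_id INTEGER PRIMARY KEY, account_id INTEGER,
                     customer_id INTEGER, tx_amt REAL);
    INSERT INTO customer VALUES (1, 'Ann');
    INSERT INTO account  VALUES (10, 1);
    INSERT INTO tx VALUES (100, 10, 1, 25.0), (101, 10, 1, 40.0);
""")


def rekey_customer(conn, old_id, new_id):
    """Change a customer key and propagate it to every table that carries it.

    Returns the number of transaction rows that had to be rewritten.
    """
    with conn:  # one transaction: all three updates commit or roll back together
        conn.execute("UPDATE customer SET customer_id = ? WHERE customer_id = ?",
                     (new_id, old_id))
        conn.execute("UPDATE account SET customer_id = ? WHERE customer_id = ?",
                     (new_id, old_id))
        cur = conn.execute("UPDATE tx SET customer_id = ? WHERE customer_id = ?",
                           (new_id, old_id))
        return cur.rowcount


touched = rekey_customer(conn, old_id=1, new_id=7)
print(touched)  # number of rows rewritten in the largest table
```

Without the replicated column, only the customer and account tables would need the update; with it, every transaction row for that customer is rewritten as well.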
Column Replication or Movement
Resultant query example:
select sum(tx.tx_amt)
from customer
    ,account
    ,tx
where customer.customer_id = account.customer_id
  and account.customer_id = tx.customer_id   -- logically redundant; exposes co-location to the optimizer
  and account.account_id = tx.account_id
  and customer.birth_dt > '1972-01-01'
  and account.registration_cd = 'IRA'
  and tx.tx_dt between '2000-01-01' and '2000-04-15'
;
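To see that the extra customer_id predicate changes only how the query can be executed and not what it returns, the query can be run with and without it. The sketch below assumes an in-memory SQLite database and made-up sample rows; SQLite is a single-node engine, so it shows only the logical equivalence, not the co-location benefit itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, birth_dt TEXT);
    CREATE TABLE account  (account_id INTEGER PRIMARY KEY, customer_id INTEGER,
                           registration_cd TEXT);
    CREATE TABLE tx (tx_id INTEGER PRIMARY KEY, account_id INTEGER,
                     customer_id INTEGER, tx_amt REAL, tx_dt TEXT);
    -- hypothetical sample data, not from the slides
    INSERT INTO customer VALUES (1, '1980-05-01'), (2, '1960-03-15');
    INSERT INTO account  VALUES (10, 1, 'IRA'), (20, 2, 'IRA');
    INSERT INTO tx VALUES
        (100, 10, 1, 50.0, '2000-02-01'),
        (101, 10, 1, 25.0, '2000-03-10'),
        (102, 10, 1, 99.0, '2000-06-01'),  -- outside the date range
        (103, 20, 2, 70.0, '2000-02-20');  -- customer born before 1972
""")

BASE = """
    SELECT SUM(tx.tx_amt)
    FROM customer, account, tx
    WHERE customer.customer_id = account.customer_id
      {extra}
      AND account.account_id = tx.account_id
      AND customer.birth_dt > '1972-01-01'
      AND account.registration_cd = 'IRA'
      AND tx.tx_dt BETWEEN '2000-01-01' AND '2000-04-15'
"""

# Same query, with and without the replicated-key predicate.
total_with = conn.execute(
    BASE.format(extra="AND account.customer_id = tx.customer_id")).fetchone()[0]
total_without = conn.execute(BASE.format(extra="")).fetchone()[0]
print(total_with, total_without)
```

The two totals agree as long as the replicated customer_id on each transaction row matches its account's customer_id, which is exactly the consistency the denormalized design has to maintain.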
Pre-aggregation
Take aggregate values that are frequently used in decision-making
and pre-compute them into physical tables in the database.
Can provide huge performance advantage in avoiding frequent
aggregation of detailed data.
Storage implications are usually small compared to size of detailed
data - but can be very large if many multi-dimensional summaries
are constructed.
Pre-aggregation
Ease of use of the data warehouse can be significantly increased with
selective pre-aggregation.
Pre-aggregation adds a significant maintenance burden to the DW.
Pre-aggregation
Typical pre-aggregate summary tables:
Retail: Inventory on hand, sales revenue, cost of goods sold, quantity of goods sold,
etc. by store, item, and week.
Healthcare: Effective membership by member age and gender, product, network,
and month.
Telecommunications: Toll call activity in time slot and destination region buckets by
customer and month.
Financial Services: First DOE, last DOE, first DOI, last DOI, rolling $ and
transaction volume in account type buckets, etc. by household.
Transportation: Transaction quantity and $ by customer, source, destination, class of
service, and month.
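As an illustration of the retail case, a summary table by store, item, and week might be built from detail rows roughly as follows. This is a sketch, assuming SQLite, hypothetical table and column names (`sales_detail`, `sales_by_store_item_week`), and SQLite's `%W` week numbering (weeks start on Monday); a real warehouse would use its own calendar or accounting-period definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_detail (store_id INTEGER, item_id INTEGER,
                               sale_dt TEXT, qty INTEGER, revenue REAL);
    -- hypothetical detail rows
    INSERT INTO sales_detail VALUES
        (1, 500, '2000-01-03', 2, 20.0),
        (1, 500, '2000-01-05', 1, 10.0),
        (1, 500, '2000-01-12', 4, 40.0),
        (2, 500, '2000-01-04', 3, 30.0);

    -- pre-computed physical summary table: one row per store, item, week
    CREATE TABLE sales_by_store_item_week AS
    SELECT store_id,
           item_id,
           strftime('%Y-W%W', sale_dt) AS sale_week,
           SUM(qty)     AS qty_sold,
           SUM(revenue) AS sales_revenue
    FROM sales_detail
    GROUP BY store_id, item_id, sale_week;
""")

rows = conn.execute("""
    SELECT store_id, item_id, sale_week, qty_sold, sales_revenue
    FROM sales_by_store_item_week
    ORDER BY store_id, sale_week
""").fetchall()
for row in rows:
    print(row)
```

Queries that only need weekly figures can then read the summary table instead of re-aggregating the detail rows each time.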
Pre-aggregation
Standardized definitions for aggregates are critical...
Need business agreement on aggregate definitions.
e.g., accounting period vs. calendar month vs. billing cycle
Must ensure stability in aggregate definitions to provide value in
historical analysis.
Pre-aggregation
Overhead for maintaining aggregates should not be underestimated.
Can choose transactional update strategy or re-build strategy for
maintaining aggregates.
Choice depends on volatility of aggregates and ability to segregate
aggregate records that need to be refreshed based on incoming
data.
e.g., customer aggregates vs. weekly POS activity aggregates.
Cost of updating an aggregate record is typically ten times higher
than the cost of inserting a new record in a detail table
(transactional update cost versus bulk loading cost).
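The two maintenance strategies can be contrasted in a small sketch. It assumes SQLite and hypothetical tables `pos_detail` and `pos_weekly`; the function names are illustrative only, and the example says nothing about the relative cost figures quoted above.

```python
import sqlite3


def setup():
    """Create an empty detail table and its weekly aggregate table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pos_detail (store_id INTEGER, week TEXT, revenue REAL);
        CREATE TABLE pos_weekly (store_id INTEGER, week TEXT, revenue REAL,
                                 PRIMARY KEY (store_id, week));
    """)
    return conn


def load_transactional(conn, batch):
    """Strategy 1: adjust only the aggregate rows touched by incoming data."""
    with conn:
        for store_id, week, revenue in batch:
            conn.execute("INSERT INTO pos_detail VALUES (?, ?, ?)",
                         (store_id, week, revenue))
            cur = conn.execute(
                "UPDATE pos_weekly SET revenue = revenue + ? "
                "WHERE store_id = ? AND week = ?",
                (revenue, store_id, week))
            if cur.rowcount == 0:  # no aggregate row exists yet for this key
                conn.execute("INSERT INTO pos_weekly VALUES (?, ?, ?)",
                             (store_id, week, revenue))


def load_rebuild(conn, batch):
    """Strategy 2: bulk-load the detail rows, then rebuild the whole aggregate."""
    with conn:
        conn.executemany("INSERT INTO pos_detail VALUES (?, ?, ?)", batch)
        conn.execute("DELETE FROM pos_weekly")
        conn.execute("""
            INSERT INTO pos_weekly
            SELECT store_id, week, SUM(revenue)
            FROM pos_detail
            GROUP BY store_id, week
        """)


batch = [(1, "2000-W01", 10.0), (1, "2000-W01", 5.0), (2, "2000-W01", 7.0)]
a, b = setup(), setup()
load_transactional(a, batch)
load_rebuild(b, batch)

query = "SELECT store_id, week, revenue FROM pos_weekly ORDER BY store_id, week"
result_transactional = a.execute(query).fetchall()
result_rebuild = b.execute(query).fetchall()
print(result_transactional == result_rebuild)
```

Both strategies end in the same aggregate state; they differ in how much work each load does, which is why the choice depends on how volatile the aggregate is and how easily the affected rows can be isolated.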
Pre-aggregation
An aggregate table must be used many, many times per day to
justify its existence in terms of maintenance overhead in most
environments.
Consider views if primary motivation is ease-of-use as opposed to
a need for performance enhancement.
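When ease of use is the only goal, a view gives users the same simplified shape without a physical table to maintain. A minimal sketch, assuming SQLite and a hypothetical view named `customer_tx_summary`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tx (tx_id INTEGER PRIMARY KEY, customer_id INTEGER, tx_amt REAL);
    INSERT INTO tx VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 4.0);

    -- the view stores no data; it is re-evaluated against tx on every query
    CREATE VIEW customer_tx_summary AS
    SELECT customer_id, COUNT(*) AS tx_count, SUM(tx_amt) AS tx_total
    FROM tx
    GROUP BY customer_id;
""")

# New detail rows show up in the view with no maintenance step.
conn.execute("INSERT INTO tx VALUES (4, 2, 6.0)")
summary = conn.execute(
    "SELECT customer_id, tx_count, tx_total FROM customer_tx_summary "
    "ORDER BY customer_id").fetchall()
print(summary)
```

The trade-off is that the aggregation runs at query time, so a view helps usability but does not provide the performance gain of a pre-computed table.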
Pre-aggregation
Aggregates should NOT replace detailed data.
Aggregates enhance performance and usability for accessing
predefined views of the data.
Detailed data will still be required for ad hoc and more
sophisticated analyses.
Other types of de-normalization
Adding derived columns
  • May reduce or remove joins, as well as aggregation at run time
  • Requires maintenance of the derived column
  • Increases storage
Splitting
  • Horizontal: placing rows in two separate tables, depending on data values in one or more
    columns.
  • Vertical: placing the primary key and some columns in one table, and placing other
    columns and the primary key in another table.
Surrogate keys
Virtual de-normalization
Derived Attributes
Age is a derived attribute, calculated as Current_Date - DoB
(recalculated periodically).
The GP (Grade Point) column in the data warehouse data model is also included as
a derived value. The formula for calculating this field is Grade * Credits.
Business Data Model: #SID | DoB | Degree | Course | Grade | Credits
DWH Data Model:      #SID | DoB | Degree | Course | Grade | Credits | GP | Age
DoB: Date of Birth
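The two derived columns can be sketched in a few lines. This assumes that Grade is stored as a numeric grade value and that Age means completed years as of a given load date; the function names and the sample row are hypothetical.

```python
from datetime import date


def grade_points(grade, credits):
    """GP as defined on the slide: Grade * Credits."""
    return grade * credits


def age_in_years(dob, as_of):
    """Completed years between dob and as_of (Current_Date - DoB)."""
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)


def to_dwh_row(row, as_of):
    """Copy a business-model row and add the derived GP and Age columns."""
    out = dict(row)
    out["GP"] = grade_points(row["Grade"], row["Credits"])
    out["Age"] = age_in_years(row["DoB"], as_of)
    return out


# Hypothetical sample row in the business data model.
business_row = {"SID": 1, "DoB": date(1980, 6, 15), "Degree": "BS",
                "Course": "CS437", "Grade": 3.5, "Credits": 4}
dwh_row = to_dwh_row(business_row, as_of=date(2000, 6, 14))
print(dwh_row["GP"], dwh_row["Age"])
```

GP depends only on columns in the same row, so it changes only when the row does. Age depends on the current date, which is why it has to be recalculated periodically.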
Splitting
Table:            ColA | ColB | ColC
Vertical split:   Table_v1 (ColA | ColB) and Table_v2 (ColA | ColC)
Horizontal split: Table_h1 (ColA | ColB | ColC) and Table_h2 (ColA | ColB | ColC)
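Both kinds of split can be sketched on an in-memory list of rows. This is a minimal illustration, assuming ColA is the primary key and using a made-up split condition on ColC; the helper names are hypothetical.

```python
def vertical_split(rows, key, left_cols, right_cols):
    """Split columns into two tables; both keep the primary key."""
    left = [{c: r[c] for c in [key] + left_cols} for r in rows]
    right = [{c: r[c] for c in [key] + right_cols} for r in rows]
    return left, right


def horizontal_split(rows, predicate):
    """Split rows into two tables with identical columns, based on data values."""
    matching = [r for r in rows if predicate(r)]
    rest = [r for r in rows if not predicate(r)]
    return matching, rest


table = [
    {"ColA": 1, "ColB": "x", "ColC": 10},
    {"ColA": 2, "ColB": "y", "ColC": 200},
]

table_v1, table_v2 = vertical_split(table, "ColA", ["ColB"], ["ColC"])
table_h1, table_h2 = horizontal_split(table, lambda r: r["ColC"] < 100)
print(table_v1, table_v2)
print(table_h1, table_h2)
```

A vertical split can be reversed by joining the two tables on the key, and a horizontal split by taking the union of the two row sets.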
Bottom Line
In a perfect world of infinitely fast machines and well-designed end
user access tools, de-normalization would never be discussed.
In the reality in which we design very large databases, selective
denormalization is usually required - but it is important to initiate the
design from a clean (normalized) starting point.
A good approach is to normalize your data (to 3NF) and then perform
selective denormalization if and when required by performance issues.
Denormalization is NOT “total chaos” but more like a controlled crash.
Bottom Line
When a table is normalized, the non-key columns depend on the
key, the whole key, and nothing but the key.
In order to denormalize, you should have very good knowledge of
the underlying database schema.
Need to be acutely aware of storage and maintenance costs
associated with de-normalization techniques.
Bottom Line
The process of denormalizing:
Can be done with tables or columns
Assumes prior normalization
Requires a thorough knowledge of how the data is being used
Good reasons for denormalizing are:
All or nearly all of the most frequent queries require access to the full
set of joined data
A majority of applications perform table scans when joining tables
Computational complexity of derived columns requires temporary
tables or excessively complex queries
Bottom Line
Advantages of de-normalization:
• Minimizing the need for joins
• Reducing the number of foreign keys on tables
• Reducing the number of indexes, saving storage space and reducing data modification time
• Precomputing aggregate values, that is, computing them at data modification time rather than at select time
• Reducing the number of tables (in some cases)
Disadvantages of de-normalization:
• It usually speeds retrieval but can slow data modification.
• It is always application-specific and needs to be re-evaluated if the application changes.
• It can increase the size of tables.
• In some instances, it simplifies coding; in others, it makes coding more complex.

Cs437 lecture 7-8

  • 1. Column Replication or Movement May want to replicate columns in order to facilitate co-location of commonly joined tables. Before denormalization: A three table join requires re-distribution of significant amounts of data to answer many important questions related to customer transaction behavior. Customer_Id Customer_Nm Address Ph … Account_Id Customer_Id Balance$ Open_Dt … Tx_Id Account_Id Tx$ Tx_Dt Location_Id … 1 m 1 m CustTable AcctTable TrxTable
  • 2. Column Replication or Movement May want to replicate columns in order to facilitate co-location of commonly joined tables. After denormalization: All three tables can be co-located using customer# as primary index to make the three table join run much more quickly. Customer_Id Customer_Nm Address Ph … Account_Id Customer_Id Balance$ Open_Dt … Tx_Id Account_Id Customer_Id Tx$ Tx_Dt Location_Id … 1 m 1 m 1 m
  • 3. Column Replication or Movement What is the impact of this approach to achieving table co- location? • Increases size of transaction table (largest table in the database) by the size of the customer_id key. • If customer key changes (consider impact of individualization), then updates down to transaction table must be propagated. • Must include customer_id in join between transaction table and account table to ensure optimizer recognition of co-location (even though it is redundant to join on account_id).
  • 4. Column Replication or Movement Resultant query example: select sum(tx.tx_amt) from customer ,account ,tx where customer.customer_id = account.customer_id and account.customer_id = tx.customer_id and account.account_id = tx.account_id and customer.birth_dt > '1972-01-01' and account.registration_cd = 'IRA' and tx.tx_dt between '2000-01-01' and '2000-04- 15' ;
  • 5. Pre-aggregation Take aggregate values that are frequently used in decision-making and pre-compute them into physical tables in the database. Can provide huge performance advantage in avoiding frequent aggregation of detailed data. Storage implications are usually small compared to size of detailed data - but can be very large if many multi-dimensional summaries are constructed.
  • 6. Pre-aggregation Ease-of-use for data warehouse can be significantly increased with selective pre-aggregation. Pre-aggregation adds significant burden to maintenance for DW.
  • 7. Pre-aggregation Typical pre-aggregate summary tables: Retail: Inventory on hand, sales revenue, cost of goods sold, quantity of goods sold, etc. by store, item, and week. Healthcare: Effective membership by member age and gender, product, network, and month. Telecommunications: Toll call activity in time slot and destination region buckets by customer and month. Financial Services: First DOE, last DOE, first DOI, last DOI, rolling $ and transaction volume in account type buckets, etc. by household. Transportation: Transaction quantity and $ by customer, source, destination, class of service, and month.
  • 8. Pre-aggregation Standardized definitions for aggregates are critical. Need business agreement on aggregate definitions (e.g., accounting period vs. calendar month vs. billing cycle). Must ensure stability in aggregate definitions to provide value in historical analysis.
  • 9. Pre-aggregation Overhead for maintaining aggregates should not be underestimated. Can choose a transactional update strategy or a re-build strategy for maintaining aggregates. The choice depends on the volatility of the aggregates and on the ability to segregate the aggregate records that need to be refreshed based on incoming data (e.g., customer aggregates vs. weekly POS activity aggregates). The cost of updating an aggregate record is typically ten times higher than the cost of inserting a new record in a detail table (transactional update cost versus bulk loading cost).
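The two maintenance strategies named above can be sketched against an in-memory SQLite database. This is only an illustration under stated assumptions: the table names `pos_detail` and `sales_by_week`, their columns, and the sample rows are all hypothetical, and SQLite stands in for whatever warehouse platform is actually in use. `rebuild_aggregate` recomputes the whole summary from detail, while `apply_increment` touches only the summary rows affected by newly loaded detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pos_detail (
        store_id INTEGER, item_id INTEGER, week TEXT, qty INTEGER, revenue REAL
    );
    CREATE TABLE sales_by_week (
        store_id INTEGER, item_id INTEGER, week TEXT, qty INTEGER, revenue REAL,
        PRIMARY KEY (store_id, item_id, week)
    );
""")


def load_detail(rows):
    """Bulk-insert new detail rows."""
    conn.executemany("INSERT INTO pos_detail VALUES (?, ?, ?, ?, ?)", rows)


def rebuild_aggregate():
    """Re-build strategy: discard the summary and recompute it from detail."""
    conn.execute("DELETE FROM sales_by_week")
    conn.execute("""
        INSERT INTO sales_by_week
        SELECT store_id, item_id, week, SUM(qty), SUM(revenue)
        FROM pos_detail
        GROUP BY store_id, item_id, week
    """)


def apply_increment(rows):
    """Transactional update strategy: adjust only the summary rows
    touched by the incoming detail rows."""
    for store_id, item_id, week, qty, revenue in rows:
        cur = conn.execute(
            "UPDATE sales_by_week SET qty = qty + ?, revenue = revenue + ? "
            "WHERE store_id = ? AND item_id = ? AND week = ?",
            (qty, revenue, store_id, item_id, week),
        )
        if cur.rowcount == 0:  # no summary row yet for this key
            conn.execute(
                "INSERT INTO sales_by_week VALUES (?, ?, ?, ?, ?)",
                (store_id, item_id, week, qty, revenue),
            )


def summary():
    return conn.execute(
        "SELECT * FROM sales_by_week ORDER BY store_id, item_id, week"
    ).fetchall()


initial = [(1, 7, "2000-W01", 2, 20.0), (1, 7, "2000-W01", 1, 10.0),
           (2, 7, "2000-W01", 4, 40.0)]
load_detail(initial)
rebuild_aggregate()

new_rows = [(1, 7, "2000-W01", 3, 30.0), (1, 7, "2000-W02", 5, 50.0)]
load_detail(new_rows)
apply_increment(new_rows)
incremental = summary()

rebuild_aggregate()
rebuilt = summary()
print(incremental == rebuilt)
```

Both strategies should arrive at the same summary; the difference is in how much work each one does per load, which is the trade-off the slide describes.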
  • 10. Pre-aggregation An aggregate table must be used many, many times per day to justify its existence in terms of maintenance overhead in most environments. Consider views if primary motivation is ease-of-use as opposed to a need for performance enhancement.
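When ease of use is the main goal, a view gives users the summarized shape without a physical table to maintain. The following is a minimal sketch in SQLite; the view name `tx_by_account_month`, the simplified `tx` columns, and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tx (
        tx_id INTEGER PRIMARY KEY, account_id INTEGER, tx_dt TEXT, tx_amt REAL
    );
    INSERT INTO tx VALUES
        (1, 10, '2000-01-05', 25.0),
        (2, 10, '2000-01-20', 40.0),
        (3, 11, '2000-02-02', 15.0);

    -- The view is computed at query time, so it carries no maintenance burden,
    -- but it also gives no performance benefit over querying the detail.
    CREATE VIEW tx_by_account_month AS
    SELECT account_id,
           substr(tx_dt, 1, 7) AS month,
           COUNT(*)            AS tx_count,
           SUM(tx_amt)         AS tx_total
    FROM tx
    GROUP BY account_id, substr(tx_dt, 1, 7);
""")

QUERY = "SELECT * FROM tx_by_account_month ORDER BY account_id, month"
rows = conn.execute(QUERY).fetchall()

# New detail shows up in the view immediately, with no refresh step.
conn.execute("INSERT INTO tx VALUES (4, 11, '2000-02-10', 5.0)")
rows_after = conn.execute(QUERY).fetchall()
print(rows_after)
```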
  • 11. Pre-aggregation Aggregates should NOT replace detailed data. Aggregates enhance performance and usability for accessing pre-defined views of the data. Detailed data will still be required for ad hoc and more sophisticated analyses.
  • 12. Other types of de-normalization Adding derived columns: may reduce or remove joins as well as aggregates at run time; requires maintenance of the derived column; increases storage. Splitting: horizontal splitting places rows in two separate tables, depending on data values in one or more columns; vertical splitting places the primary key and some columns in one table, and places the other columns and the primary key in another table. Surrogate keys. Virtual de-normalization.
  • 13. Derived Attributes Age is also a derived attribute, calculated as Current_Date - DoB (calculated periodically). The GP (Grade Point) column in the data warehouse data model is included as a derived value. The formula for calculating this field is Grade*Credits. Business Data Model: (#SID, DoB, Degree, Course, Grade, Credits). DWH Data Model: (#SID, DoB, Degree, Course, Grade, Credits, GP, Age). DoB: Date of Birth.
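The two derived attributes on this slide can be sketched as plain functions applied when a business-model row is loaded into the warehouse model. The function names, the sample row, and the choice to express Age in whole years are assumptions; the slide only gives the formulas Grade*Credits and Current_Date - DoB, and Grade is assumed here to be numeric.

```python
from datetime import date


def grade_point(grade, credits):
    """GP as given on the slide: Grade * Credits (Grade assumed numeric)."""
    return grade * credits


def age_in_years(dob, as_of):
    """Age as Current_Date - DoB, expressed in completed years (an assumption)."""
    years = as_of.year - dob.year
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1  # birthday not yet reached in the as_of year
    return years


def to_dwh_row(row, as_of):
    """Copy a business-model row and add the derived GP and Age columns."""
    derived = dict(row)
    derived["GP"] = grade_point(row["Grade"], row["Credits"])
    derived["Age"] = age_in_years(row["DoB"], as_of)
    return derived


# Hypothetical sample row in the business data model
business_row = {
    "SID": 1, "DoB": date(1980, 6, 15), "Degree": "BS",
    "Course": "DB101", "Grade": 3.5, "Credits": 4,
}

dwh_row = to_dwh_row(business_row, as_of=date(2000, 4, 15))
print(dwh_row["GP"], dwh_row["Age"])
```

Because Age depends on the current date, it has to be recalculated periodically, which is the maintenance cost of a derived column that the earlier slide mentions.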
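  • 14. Splitting Vertical split: Table (ColA, ColB, ColC) becomes Table_v1 (ColA, ColB) and Table_v2 (ColA, ColC). Horizontal split: Table (ColA, ColB, ColC) becomes Table_h1 (ColA, ColB, ColC) and Table_h2 (ColA, ColB, ColC), each holding a subset of the rows.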
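The two kinds of split can be sketched on an in-memory list of rows. The helper names, the sample data, and the split predicate are hypothetical; ColA is assumed to be the primary key, as in the diagram, where it appears in both halves of the vertical split.

```python
table = [
    {"ColA": 1, "ColB": "x", "ColC": 10},
    {"ColA": 2, "ColB": "y", "ColC": 20},
    {"ColA": 3, "ColB": "z", "ColC": 30},
]


def vertical_split(rows, key, left_cols, right_cols):
    """Put the key plus some columns in one table, the key plus the rest in another."""
    left = [{c: r[c] for c in [key] + left_cols} for r in rows]
    right = [{c: r[c] for c in [key] + right_cols} for r in rows]
    return left, right


def horizontal_split(rows, predicate):
    """Place rows in two tables with identical columns, based on data values."""
    matching = [r for r in rows if predicate(r)]
    rest = [r for r in rows if not predicate(r)]
    return matching, rest


table_v1, table_v2 = vertical_split(table, "ColA", ["ColB"], ["ColC"])
table_h1, table_h2 = horizontal_split(table, lambda r: r["ColC"] < 25)

# Reassembling the original: a key join for the vertical split,
# a union for the horizontal split.
v2_by_key = {r["ColA"]: r for r in table_v2}
rejoined = [{**r, **v2_by_key[r["ColA"]]} for r in table_v1]
reunited = sorted(table_h1 + table_h2, key=lambda r: r["ColA"])

print(rejoined == table, reunited == table)
```

The reassembly step shows the cost side of each technique: queries that need all columns pay for a join after a vertical split, and queries that need all rows pay for a union after a horizontal split.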
  • 15. Bottom Line In a perfect world of infinitely fast machines and well-designed end user access tools, de-normalization would never be discussed. In the reality in which we design very large databases, selective denormalization is usually required - but it is important to initiate the design from a clean (normalized) starting point. A good approach is to normalize your data (to 3NF) and then perform selective denormalization if and when required by performance issues. Denormalization is NOT “total chaos” but more like a controlled crash.
  • 16. Bottom Line When a table is normalized, the non-key columns depend on the key, the whole key, and nothing but the key. In order to denormalize, you should have very good knowledge of the underlying database schema. Need to be acutely aware of storage and maintenance costs associated with de-normalization techniques.
  • 17. Bottom Line The process of denormalizing: can be done with tables or columns; assumes prior normalization; requires a thorough knowledge of how the data is being used. Good reasons for denormalizing are: all or nearly all of the most frequent queries require access to the full set of joined data; a majority of applications perform table scans when joining tables; the computational complexity of derived columns requires temporary tables or excessively complex queries.
  • 18. Bottom Line Advantages of de-normalization: minimizing the need for joins; reducing the number of foreign keys on tables; reducing the number of indexes, saving storage space and reducing data modification time; precomputing aggregate values, that is, computing them at data modification time rather than at select time; reducing the number of tables (in some cases). Disadvantages of de-normalization: it usually speeds retrieval but can slow data modification; it is always application-specific and needs to be re-evaluated if the application changes; it can increase the size of tables; in some instances it simplifies coding, in others it makes coding more complex.