2. Objective
At the end of this module, you will know:
Trainer Introduction
What is Data Warehousing?
What is Data Warehouse Architecture?
What is Dimensional Modelling & Design?
What is Business Intelligence?
3. Personal, Academic & Professional Information
Name: Kiran Kumar
Academic: BE
Companies: Graymatter Software Service Pvt. Ltd., India
BI/DWH Technologies Exposure
Domain Knowledge
4. What is Data Warehouse
Loosely Speaking
Refers to a database which is maintained separately from an organization’s operational database.
Officially Speaking
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support
of management's decision-making process.
10. Goals of Data Warehousing / Business Intelligence
• DW/BI system must make information easily accessible.
• DW/BI system must present information consistently.
• DW/BI system must adapt to change.
• DW/BI system must be a secure bastion that protects the information assets.
• DW/BI system must serve as the authoritative and trustworthy foundation for improved
decision making.
• DW/BI system must present information in a timely way.
• Business community must accept the DW/BI system to deem it successful.
11. Strategic uses of Data Warehousing
Industry | Functional areas of use | Strategic use
Airline | Operations; marketing | Crew assignment, aircraft development, mix of fares, analysis of route profitability, frequent flyer program promotions
Banking | Product development; operations; marketing | Customer service, trend analysis, product and service promotions, reduction of IS expenses
Credit card | Product development; marketing | Customer service, new information service, fraud detection
Health care | Operations | Reduction of operational expenses
Investment and insurance | Product development; operations; marketing | Risk management, market movements analysis, customer tendencies analysis, portfolio management
Retail chain | Distribution; marketing | Trend analysis, buying pattern analysis, pricing policy, inventory control, sales promotions, optimal distribution channel
Telecommunications | Product development; operations; marketing | New product and service promotions, reduction of IS budget, profitability analysis
Personal care | Distribution; marketing | Distribution decisions, product promotions, sales decisions, pricing policy
Public sector | Operations | Intelligence gathering
12. Evolution in Organizational use of data warehouses
• Offline Data Warehouse
Data warehouses at this stage are updated from data in the operational systems on a regular
basis and the data warehouse data is stored in a data structure designed to facilitate reporting.
• Real Time Data Warehouse
Data warehouses at this stage are updated every time an operational system performs a
transaction (e.g. an order or a delivery or a booking.)
13. Data Marts
• A data mart is a scaled down version of a data warehouse that focuses on a particular subject area.
• A data mart is a subset of an organizational data store, usually oriented to a specific purpose or
major data subject, that may be distributed to support business needs.
• Data marts are analytical data stores designed to focus on specific business functions for a specific
community within an organization.
• Usually designed to support the unique business requirements of a specified department or
business process
• Implemented as the first step in proving the usefulness of the technologies to solve business
problems
Reasons for creating a data mart
• Easy access to frequently needed data
• Creates collective view by a group of users
• Improves end-user response time
• Ease of creation in less time
• Lower cost than implementing a full Data warehouse
• Potential users are more clearly defined than in a full Data warehouse
14. From the Data Warehouse to Data Marts
[Figure: Data Warehouse (organizationally structured) → departmentally structured → individually structured. Scale labels: Less / More (history, normalized, detailed); Data / Information.]
15. Characteristics of the Departmental Data Mart
• Small
• Flexible
• Customized by Department
• Source is departmentally structured data warehouse
[Figure labels: Data mart, Data warehouse]
17. Data warehousing Integration
[Figure: data sources (databases) → data warehouse (data organization; storage) → analytical processing, data mining, data visualization (generate knowledge) → end users: decision making and other tasks (CRM, DSS, EIS). Arrows labelled "Direct use", "Use" and "Use of knowledge" lead to the end users.]
19. DWH Architecture Cont..
• Data Source Layer
• Data Extraction Layer
• Staging Area
• ETL Layer
• Data Storage Layer
• Data Logic Layer
• Data Presentation Layer
• Metadata Layer
20. Advantages & Disadvantages of Data Warehouse
Advantages:
Data warehouses tend to have a very high query success as they have complete control
over the four main areas of data management systems.
• Bottom-up approach
• Clean data
• Indexes: multiple types
• Query processing: multiple options
• Security: data and access
• Easy report creation
• Enhanced access to data and information
Disadvantages:
• Preparation may be time consuming
• Long initial implementation time and associated high cost
• Because data must be extracted, transformed and loaded into the warehouse, there is an
element of latency in data warehouse data.
23. Data, Data everywhere yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
24. Business Intelligence
• One ultimate use of the data gathered and processed in the data life cycle is for business
intelligence.
• Business intelligence generally involves the creation or use of a data warehouse and/or data
mart for storage of data, and the use of front-end analytical tools such as Pentaho BI Suite,
SAP BO, MSBI, Oracle’s Sales Analyzer and Financial Analyzer or MicroStrategy’s Web.
• Such tools can be employed by end users to access data, ask queries, request ad hoc (special)
reports, examine scenarios, create CRM activities, devise pricing strategies, and much more.
25. A producer wants to know….
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• What impact will new products/services have on revenue and margins?
• Which customers are most likely to go to the competition?
26. How Business Intelligence works?
• The process starts with raw data, which is usually kept in corporate databases. For
example, a national retail chain that sells everything from grills and patio furniture to plastic
utensils has data about inventory, customer information, data about past promotions, and
sales numbers in various databases.
• Though all this information may be scattered across multiple systems and may seem
unrelated, business intelligence software can bring it together. This is done by using a data
warehouse.
• In the data warehouse (or mart) tables can be linked, and data cubes are formed. For
instance, inventory information is linked to sales numbers and customer databases, allowing
for deep analysis of information.
• Using the business intelligence software the user can ask queries, request ad-hoc reports, or
conduct any other analysis.
• For example, deep analysis can be carried out by performing multilayer queries. Because all
the databases are linked, one can search for what products a store has too much of and
determine which of these products commonly sell with popular items, based on previous
sales. After planning a promotion to move the excess stock along with the popular products
(by bundling them together, for example), one can dig deeper to see where this promotion
would be most popular (and most profitable).
• The results of the request can be reports, predictions, alerts, and/or graphical presentations.
These can be disseminated to decision makers to help them in their decision-making tasks.
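As a rough sketch of the "linked tables" idea on this slide, the Python snippet below joins an inventory table to a sales table to flag overstocked products. The table names, columns, figures and the "five times units sold" threshold are all assumptions made up for illustration, and the standard-library sqlite3 module stands in for a real warehouse.

```python
import sqlite3

# Hypothetical tables and figures, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (product TEXT PRIMARY KEY, units_in_stock INTEGER);
    CREATE TABLE sales (product TEXT, units_sold INTEGER);
    INSERT INTO inventory VALUES
        ('grill', 40), ('patio chair', 900), ('plastic forks', 300);
    INSERT INTO sales VALUES
        ('grill', 35), ('grill', 30), ('patio chair', 20),
        ('plastic forks', 150), ('plastic forks', 120);
""")

# Link inventory to sales and keep products whose stock exceeds
# five times the units sold (an arbitrary threshold for this sketch).
overstocked = conn.execute("""
    SELECT i.product, i.units_in_stock, SUM(s.units_sold) AS total_sold
    FROM inventory AS i
    JOIN sales AS s ON s.product = i.product
    GROUP BY i.product, i.units_in_stock
    HAVING i.units_in_stock > 5 * SUM(s.units_sold)
    ORDER BY i.product
""").fetchall()

print(overstocked)
```

A BI tool would typically generate a query of this kind behind a report or dashboard rather than have the end user write it by hand.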
27. Dimension Tables
• A dimension table contains text and descriptive information about the business entities
of an enterprise, represented as hierarchical, categorical information such as Customer,
Product, Date, Location, Department etc.
• The "1" side of a 1-M relationship
• Also called lookup or reference tables
• Typically contain the attributes for the SQL answer set.
28. Types of Dimension Tables
• Standard / Common Dimension
• Conformed Dimension
• Junk Dimension
• Degenerate Dimension
• Role-Playing Dimension
• Denormalized Flattened Dimension
• Snowflaked Dimension
• Outrigger Dimension
• Shrunken Dimension
29. Slowly Changing Dimensions
• Dimension attributes that change slowly over time, rather than on a regular,
time-based schedule.
• In a data warehouse, changes in dimension attributes need to be tracked in order to report
historical data.
• Ex: A person changing his/her city from Bangalore to Mumbai.
Types of SCD:
– Type 1: Store only the current value (Overwrite)
– Type 2: Maintain history of changes (Add New Row)
– Type 3: Create an attribute in the dimension record for the previous value (Add New Attribute)
– Type 4: Use a historical table (Add Mini-Dimension Table)
– Type 5: Add Mini-Dimension & Type 1 Outrigger
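To make the first three SCD types concrete, here is a minimal Python sketch that applies the Bangalore-to-Mumbai change in each style. The row layout (customer_key, valid_from, valid_to, is_current, previous_city) and the function names are assumptions chosen for illustration, not a prescribed design.

```python
from datetime import date

def scd_type1(row, new_city):
    """Type 1: overwrite the attribute; no history is kept."""
    return {**row, "city": new_city}

def scd_type2(rows, customer_id, new_city, change_date):
    """Type 2: expire the current row and add a new row with a new surrogate key."""
    next_key = max(r["customer_key"] for r in rows) + 1
    updated = []
    for r in rows:
        if r["customer_id"] == customer_id and r["is_current"]:
            r = {**r, "valid_to": change_date, "is_current": False}
        updated.append(r)
    updated.append({
        "customer_key": next_key, "customer_id": customer_id, "city": new_city,
        "valid_from": change_date, "valid_to": None, "is_current": True,
    })
    return updated

def scd_type3(row, new_city):
    """Type 3: keep the prior value in an extra 'previous_city' attribute."""
    return {**row, "previous_city": row["city"], "city": new_city}

# Hypothetical customer row before the move.
current = {"customer_key": 1, "customer_id": "C100", "city": "Bangalore",
           "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True}

t1 = scd_type1(current, "Mumbai")
t2 = scd_type2([current], "C100", "Mumbai", date(2024, 6, 1))
t3 = scd_type3(current, "Mumbai")
```

Type 1 leaves a single row with the new city, Type 2 leaves two rows (the expired Bangalore row and the current Mumbai row), and Type 3 leaves a single row that carries both values.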
34. SCD 4
• What is a Mini Dimension?
– When a dimension has attributes that change rapidly or at frequent intervals, they are split
off to form a separate dimension table called a mini-dimension
Ex: Age of a Customer or Employee, Salary Band, Designation etc.
• Design aspects of Mini Dimension
– The mini dimension table should have its own surrogate key.
– There is no direct connection between the base and mini dimension tables.
– The fact table contains foreign keys to the primary keys of both the base and mini dimension tables.
• What is SCD4?
– Involves the use of two or more dimension tables:
one acts as the base dimension and
the others are mini dimension tables
• When to use?
– Handling rapidly changing attributes
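The SCD4 layout described above can be sketched with an in-memory SQLite database. The table and column names (customer_dim, customer_profile_mini_dim, sales_fact) and the sample rows are hypothetical; the point is only that the fact table carries both keys while the two dimension tables do not reference each other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base dimension: slowly changing attributes (hypothetical columns).
    CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT);
    -- Mini-dimension: one row per combination of rapidly changing attributes.
    CREATE TABLE customer_profile_mini_dim (
        profile_key INTEGER PRIMARY KEY, age_band TEXT, salary_band TEXT);
    -- The fact row carries both keys; the dimensions are not linked directly.
    CREATE TABLE sales_fact (
        customer_key INTEGER REFERENCES customer_dim(customer_key),
        profile_key  INTEGER REFERENCES customer_profile_mini_dim(profile_key),
        sales_amount REAL);
    INSERT INTO customer_dim VALUES (1, 'Asha');
    INSERT INTO customer_profile_mini_dim VALUES
        (10, '20-29', 'Low'), (11, '30-39', 'Medium');
    -- The same customer bought once under each profile.
    INSERT INTO sales_fact VALUES (1, 10, 50.0), (1, 11, 80.0);
""")

# The profile that applied at the time of each sale is reached through the fact table.
sales_by_age_band = conn.execute("""
    SELECT m.age_band, SUM(f.sales_amount)
    FROM sales_fact AS f
    JOIN customer_profile_mini_dim AS m ON m.profile_key = f.profile_key
    GROUP BY m.age_band
    ORDER BY m.age_band
""").fetchall()

print(sales_by_age_band)
```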
35. SCD 5
• What is SCD 5?
– SCD 5 involves the use of one or more mini dimension tables and a base dimension table that holds a reference to the
mini dimension key.
– This reference key in the base dimension should be Type 1 in nature, so it reflects the current version of the
mini dimension attributes in the dimension table.
– The Type 1 reference key should be updated in all versions of the base dimension records
whenever the corresponding mini dimension attribute values change.
• When to use?
– When there is a need to access the current values in the mini-dimension directly from the base dimension without
joining a fact table
• Design aspects of Mini Dimension
– The mini dimension table should have its own surrogate key.
– There is a direct connection between the base and mini dimension tables.
– The fact table contains foreign keys to the primary keys of both the base and mini dimension tables.
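A minimal sketch of the SCD 5 reference key, again using SQLite with hypothetical names (current_profile_key, customer_profile_mini_dim). It assumes two stored versions of one customer and shows the Type 1 style overwrite being applied to every version, after which current profile values can be read from the base dimension without touching a fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_profile_mini_dim (
        profile_key INTEGER PRIMARY KEY, salary_band TEXT);
    -- Base dimension with a Type 1 reference to the mini-dimension.
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY,
        customer_id TEXT,
        name TEXT,
        current_profile_key INTEGER
            REFERENCES customer_profile_mini_dim(profile_key));
    INSERT INTO customer_profile_mini_dim VALUES (10, 'Low'), (11, 'Medium');
    -- Two versions of the same customer, both pointing at profile 10.
    INSERT INTO customer_dim VALUES
        (1, 'C100', 'Asha', 10), (2, 'C100', 'Asha K', 10);
""")

# The customer's profile changes: overwrite the reference in ALL versions.
conn.execute(
    "UPDATE customer_dim SET current_profile_key = ? WHERE customer_id = ?",
    (11, "C100"),
)

# Current mini-dimension values, read straight from the base dimension.
current_bands = conn.execute("""
    SELECT d.customer_key, m.salary_band
    FROM customer_dim AS d
    JOIN customer_profile_mini_dim AS m
      ON m.profile_key = d.current_profile_key
    ORDER BY d.customer_key
""").fetchall()

print(current_bands)
```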
36. Fact Tables
• Stores the performance measurements resulting from an organization’s business process events
• Stores the low-level measurement data resulting from a business process in a single dimensional
model
• The term fact represents a business measure.
• Each row in a fact table corresponds to a measurement event
• Contains two or more foreign keys
• Tend to have huge numbers of records
• Useful facts tend to be numeric and additive
Types of Fact Table:
1. Transactional Fact Table
2. Factless Fact Table
3. Snapshot Fact Table
4. Accumulating Fact Table
5. Aggregate Fact Table
6. Consolidated Fact Tables
37. Transactional Fact table
• These fact tables represent an event that occurred at an instantaneous point in time.
• A row exists in the fact table for a given customer or product only if a transaction has occurred
• Grain is the individual transaction
• Mostly Additive Facts
38. Periodic Snapshot Fact table
• The fact table summarizes many measurement events occurring over a standard period such as a
day, week, month or quarter
• Grain is the period, not the individual transaction
• If we have 1000 people living in a region at the end of month 1 and 1500 people living in the
same region at the end of month 2, the total number of people is not 2500: each row is a level
measured at period end, so it cannot be summed across periods
• Semi-Additive & Non-Additive Facts
42. Consolidated Fact table
It is often convenient to combine facts from multiple processes together into a
single consolidated fact table if they can be expressed at the same grain. For example, sales
actuals can be consolidated with sales forecasts in a single fact table to make the task of
analyzing actuals versus forecasts simple and fast, as compared to assembling a drill-across
application using separate fact tables. Consolidated fact tables add burden to the ETL
processing, but ease the analytic burden on the BI applications. They should be considered
for cross-process metrics that are frequently analyzed together.
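The actuals-versus-forecast case can be sketched in a few lines of Python. The grain (month, product) and all figures below are assumptions for illustration; the merge step stands in for the extra ETL work the slide mentions.

```python
# Two hypothetical source fact sets at the same grain: (month, product).
actuals = {("2024-01", "grill"): 120.0, ("2024-02", "grill"): 90.0}
forecasts = {("2024-01", "grill"): 100.0, ("2024-02", "grill"): 110.0}

# ETL step: combine both processes into one consolidated fact table,
# one row per (month, product), with a column for each measure.
consolidated = [
    {
        "month": month,
        "product": product,
        "actual": actuals.get((month, product), 0.0),
        "forecast": forecasts.get((month, product), 0.0),
    }
    for (month, product) in sorted(set(actuals) | set(forecasts))
]

# The BI side can now compare the measures row by row, with no drill-across.
variance = {
    (row["month"], row["product"]): row["actual"] - row["forecast"]
    for row in consolidated
}

print(variance)
```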
43. Type of Fact / Measure
• Additive: Additive facts are facts that can be summed up through all of the dimensions in the
fact table.
• Semi-Additive: Semi-additive facts are facts that can be summed up for some of the
dimensions in the fact table, but not the others.
• Non-Additive: Non-additive facts are facts that cannot be summed up for any of the
dimensions present in the fact table.
44. Type of Fact / Measure Cont..
• Additive
The purpose of this table is to record the sales amount for each product in each store on a daily
basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up
this fact along any of the three dimensions present in the fact table -- date, store, and product. For
example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that
week.
• Semi-Additive & Non-Additive:
The purpose of this table is to record the current balance for each account at the end of each day, as well
as the profit margin for each account for each day. Current_Balance and Profit_Margin are the
facts. Current_Balance is a semi-additive fact: it makes sense to add it up across all accounts (what's
the total current balance for all accounts in the bank?), but it does not make sense to add it up
through time (adding up all current balances for a given account for each day of the month does not give
us any useful information). Profit_Margin is a non-additive fact, because it does not make sense to add it up
at either the account level or the day level.
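The three kinds of fact can be illustrated with a small Python sketch. The rows and amounts are invented; the field names mirror Sales_Amount, Current_Balance and Profit_Margin from the slides.

```python
# Additive: Sales_Amount can be summed along date, store and product.
daily_sales = [
    {"day": 1, "store": "S1", "product": "grill", "sales_amount": 200.0},
    {"day": 1, "store": "S2", "product": "grill", "sales_amount": 300.0},
    {"day": 2, "store": "S1", "product": "forks", "sales_amount": 250.0},
]
total_sales = sum(row["sales_amount"] for row in daily_sales)

# Semi-additive and non-additive: one row per account per day.
daily_snapshot = [
    {"day": 1, "account": "A", "current_balance": 100.0, "profit_margin": 0.10},
    {"day": 1, "account": "B", "current_balance": 300.0, "profit_margin": 0.20},
    {"day": 2, "account": "A", "current_balance": 150.0, "profit_margin": 0.10},
    {"day": 2, "account": "B", "current_balance": 250.0, "profit_margin": 0.30},
]

# Summing balances ACROSS ACCOUNTS for one day is meaningful.
bank_total_day2 = sum(
    row["current_balance"] for row in daily_snapshot if row["day"] == 2
)

# Across TIME, use the closing value or an average instead of a sum.
account_a = [
    row["current_balance"] for row in daily_snapshot if row["account"] == "A"
]
account_a_closing = account_a[-1]
account_a_average = sum(account_a) / len(account_a)

# Profit_Margin is a ratio: it is reported per row and never summed.
```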
45. Dimensional Models
• A denormalized relational model
– Made up of tables with attributes
– Relationships defined by keys and foreign keys
• Organized for understandability and ease of reporting rather than update.
• Queried and maintained by SQL or special purpose management tools.
• Star Schemas Versus OLAP Cubes
– Dimensional models implemented in relational database management systems are
referred to as star schemas because of their resemblance to a star-like structure.
– Dimensional models implemented in multidimensional database environments are
referred to as online analytical processing (OLAP) cubes.
– Both stars and cubes have a common logical design with recognizable dimensions;
however, the physical implementation differs
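A star schema of the kind described above might look like the following sketch, built in an in-memory SQLite database. The table names, columns and rows are assumptions for illustration: one fact table in the middle, with a foreign key to each surrounding dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes (hypothetical columns).
    CREATE TABLE date_dim (
        date_key INTEGER PRIMARY KEY, calendar_date TEXT, month TEXT);
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE store_dim (
        store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT);
    -- Fact table: foreign keys to every dimension plus a numeric measure.
    CREATE TABLE sales_fact (
        date_key    INTEGER REFERENCES date_dim(date_key),
        product_key INTEGER REFERENCES product_dim(product_key),
        store_key   INTEGER REFERENCES store_dim(store_key),
        sales_amount REAL);
    INSERT INTO date_dim VALUES
        (20240101, '2024-01-01', '2024-01'), (20240201, '2024-02-01', '2024-02');
    INSERT INTO product_dim VALUES
        (1, 'Grill', 'Outdoor'), (2, 'Plastic forks', 'Kitchen');
    INSERT INTO store_dim VALUES
        (1, 'Store A', 'Bangalore'), (2, 'Store B', 'Mumbai');
    INSERT INTO sales_fact VALUES
        (20240101, 1, 1, 500.0), (20240101, 2, 2, 40.0),
        (20240201, 1, 2, 450.0), (20240201, 2, 1, 60.0);
""")

# A typical star-join: constrain and group by dimension attributes,
# aggregate the measure from the fact table.
sales_by_category_and_month = conn.execute("""
    SELECT p.category, d.month, SUM(f.sales_amount)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_key = f.product_key
    JOIN date_dim AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

print(sales_by_category_and_month)
```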
46. OLAP
• OLAP stands for On-Line Analytical Processing
• For people on the business side, the key feature of OLAP is "Multidimensional."
In other words, the ability to analyze metrics in different dimensions such as time, geography,
gender, product, etc.
For example, sales for the company are up.
- What region is most responsible for this increase?
- Which store in this region is most responsible for the increase?
- What particular product category contributed the most to the increase?
Answering these types of questions in order means that you are performing an OLAP
analysis.
• In the OLAP world, there are three main types:
1. Multidimensional OLAP (MOLAP)
2. Relational OLAP (ROLAP)
3. Hybrid OLAP (HOLAP), which refers to technologies that combine MOLAP and ROLAP.
47. MOLAP
• This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for
slicing and dicing operations.
• Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: Because all calculations are performed when the
cube is built, it is not possible to include a large amount of data in the cube itself. This is not
to say that the data in the cube cannot be derived from a large amount of data. Indeed, this
is possible. But in this case, only summary-level information will be included in the cube itself.
• Requires additional investment: Cube technology is often proprietary and does not already
exist in the organization. Therefore, to adopt MOLAP technology, chances are additional
investments in human and capital resources are needed.
48. MOLAP Operation
• Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
• Here is the list of OLAP operations:
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
49. MOLAP Operation – Roll Up
• Roll-up performs aggregation on a data cube in any of the following ways:
– By climbing up a concept hierarchy for a dimension
– By dimension reduction
• The following diagram illustrates how roll-up works
– Roll-up is performed by climbing up a concept hierarchy for the dimension location.
– Initially the concept hierarchy was "street < city < province < country".
– On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
– The data is grouped into countries rather than cities.
– When roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.
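Both forms of roll-up can be sketched in plain Python, treating the cube as a dictionary keyed by (city, quarter). The cities, the city-to-country mapping and the unit counts are invented for illustration.

```python
from collections import defaultdict

# One level of the location hierarchy: city < country (hypothetical data).
city_to_country = {
    "Toronto": "Canada", "Vancouver": "Canada",
    "Chicago": "USA", "New York": "USA",
}

# Cube cells keyed by (city, quarter) -> units sold.
sales_by_city = {
    ("Toronto", "Q1"): 395, ("Vancouver", "Q1"): 605,
    ("Chicago", "Q1"): 440, ("New York", "Q1"): 1560,
    ("Toronto", "Q2"): 400, ("Vancouver", "Q2"): 600,
    ("Chicago", "Q2"): 500, ("New York", "Q2"): 1500,
}

def roll_up_to_country(cube):
    """Roll up by climbing the location hierarchy from city to country."""
    rolled = defaultdict(int)
    for (city, quarter), units in cube.items():
        rolled[(city_to_country[city], quarter)] += units
    return dict(rolled)

def roll_up_drop_time(cube):
    """Roll up by dimension reduction: aggregate the time dimension away."""
    rolled = defaultdict(int)
    for (city, _quarter), units in cube.items():
        rolled[city] += units
    return dict(rolled)

by_country = roll_up_to_country(sales_by_city)
by_city_all_time = roll_up_drop_time(sales_by_city)
```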
50. MOLAP Operation – Drill Down
• Drill-down is the reverse operation of roll-up. It is performed by either of the following ways:
– By stepping down a concept hierarchy for a dimension
– By introducing a new dimension.
• The following diagram illustrates how drill-down works:
– Drill-down is performed by stepping down a concept hierarchy for the dimension time.
– Initially the concept hierarchy was "day < month < quarter < year."
– On drilling down, the time dimension is descended from the level of quarter to the level of month.
– When drill-down is performed by introducing a new dimension, one or more dimensions are added to the data cube.
– It navigates the data from less detailed data to highly detailed data.
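Drill-down from quarter to month can be sketched as follows. The monthly figures are invented, and the sketch assumes the detailed monthly data is still available underneath the quarterly view, since drill-down can only expose detail that was stored.

```python
# One level of the time hierarchy: month < quarter (hypothetical data).
month_to_quarter = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}
monthly_sales = {
    "Jan": 100, "Feb": 150, "Mar": 200,
    "Apr": 120, "May": 130, "Jun": 170,
}

# The less detailed starting view: totals per quarter.
quarterly_view = {}
for month, units in monthly_sales.items():
    quarter = month_to_quarter[month]
    quarterly_view[quarter] = quarterly_view.get(quarter, 0) + units

def drill_down(quarter):
    """Step down the time hierarchy: show the months behind one quarter."""
    return {
        month: units
        for month, units in monthly_sales.items()
        if month_to_quarter[month] == quarter
    }

q1_detail = drill_down("Q1")
```

The monthly values returned for a quarter always sum back to that quarter's total, which is what makes drill-down the reverse of roll-up.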
51. MOLAP Operation – Slice
• The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.
– Here, slice is performed for the dimension "time" using the criterion time = "Q1".
– It forms a new sub-cube by fixing that dimension to a single value and keeping the remaining dimensions.
52. MOLAP Operation – Dice
• Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
• The dice operation on the cube based on the following selection criteria involves three
dimensions.
– (location = "Toronto" or "Vancouver")
– (time = "Q1" or "Q2")
– (item = "Mobile" or "Modem")
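Slice and dice can both be expressed as filters over cube cells. In the sketch below the cube is a dictionary keyed by (location, time, item); the cell values and the extra items are invented, while the selection criteria are the ones listed on the slides.

```python
# Cube cells keyed by (location, time, item) -> units sold (hypothetical data).
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q1", "Modem"): 825,
    ("Toronto", "Q3", "Phone"): 14,
    ("Vancouver", "Q2", "Mobile"): 400,
    ("Vancouver", "Q1", "Security"): 512,
    ("Chicago", "Q1", "Mobile"): 854,
    ("New York", "Q2", "Modem"): 968,
}

def slice_cube(cube, time):
    """Slice: fix the time dimension to one value; the result has two dimensions."""
    return {
        (location, item): units
        for (location, t, item), units in cube.items()
        if t == time
    }

def dice_cube(cube, locations, times, items):
    """Dice: keep cells that match value sets on several dimensions at once."""
    return {
        key: units
        for key, units in cube.items()
        if key[0] in locations and key[1] in times and key[2] in items
    }

q1_slice = slice_cube(cube, "Q1")
diced = dice_cube(
    cube,
    locations={"Toronto", "Vancouver"},
    times={"Q1", "Q2"},
    items={"Mobile", "Modem"},
)
```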
53. MOLAP Operation – Pivot
• The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows the
pivot operation
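As a small illustration of rotation, the sketch below pivots a two-dimensional view so that rows and columns swap places. The locations, items and values are invented; no data changes, only the presentation.

```python
# A 2-D view with locations as rows and items as columns (hypothetical data).
by_location = {
    "Toronto": {"Mobile": 605, "Modem": 825},
    "Vancouver": {"Mobile": 400, "Modem": 512},
}

def pivot(table):
    """Rotate the axes: row labels become column labels and vice versa."""
    rotated = {}
    for row_label, columns in table.items():
        for column_label, value in columns.items():
            rotated.setdefault(column_label, {})[row_label] = value
    return rotated

by_item = pivot(by_location)
print(by_item)
```

Pivoting twice returns the original layout, which is a quick way to check that the operation only rearranges the view.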
54. ROLAP
• This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action
of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
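The point that slicing and dicing map onto "WHERE" clauses can be sketched with SQLite. The sales table, its columns and its rows are assumptions for illustration; a ROLAP tool would generate comparable SQL against the real warehouse tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (location TEXT, quarter TEXT, item TEXT, units INTEGER);
    INSERT INTO sales VALUES
        ('Toronto', 'Q1', 'Mobile', 605), ('Toronto', 'Q2', 'Modem', 825),
        ('Vancouver', 'Q1', 'Modem', 400), ('Chicago', 'Q1', 'Mobile', 854),
        ('Chicago', 'Q3', 'Phone', 14);
""")

# Slice: one dimension fixed to a single value.
slice_q1 = conn.execute("""
    SELECT location, item, units
    FROM sales
    WHERE quarter = 'Q1'
    ORDER BY location, item
""").fetchall()

# Dice: value sets on several dimensions, combined with AND.
dice = conn.execute("""
    SELECT location, quarter, item, units
    FROM sales
    WHERE location IN ('Toronto', 'Vancouver')
      AND quarter IN ('Q1', 'Q2')
      AND item IN ('Mobile', 'Modem')
    ORDER BY location, quarter, item
""").fetchall()

print(slice_q1)
print(dice)
```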
Advantages:
• Can handle large amounts of data: The data size limitation of ROLAP technology is the
limitation on data size of the underlying relational database. In other words, ROLAP itself
places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: Relational databases often
already come with a host of functionalities. ROLAP technologies, since they sit on top of the
relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple
SQL queries) in the relational database, the query time can be long if the underlying data size
is large.
• Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL
statements to query the relational database, and SQL statements do not fit all needs (for
example, it is difficult to perform complex calculations using SQL), ROLAP technologies are
therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by
building into the tool out-of-the-box complex functions as well as the ability to allow users to
define their own functions.
59. Fact Constellation Schema
• For each star schema it is possible to construct a fact constellation schema
(for example, by splitting the original star schema into several star schemas, each of which describes
facts at a different level of the dimension hierarchies). The fact constellation architecture contains
multiple fact tables that share many dimension tables.
• The main shortcoming of the fact constellation schema is a more complicated design
because many variants for particular kinds of aggregation must be considered and selected.
Moreover, dimension tables are still large.
60. HOLAP
• HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For
summary-type information, HOLAP leverages cube technology for faster performance. When
detail information is needed, HOLAP can "drill through" from the cube into the underlying
relational data.
61. Difference btw ERD & Dimensional Model
ERD (the transaction processing model):
• One table per entity
• Minimize data redundancy
• Optimize update / insert
Dimensional model (the data warehousing model):
• One fact table for data organization
• Maximize understandability
• Optimized for retrieval
62. Choosing the Data Mart / Dimensional Design Process
1. Select the business process
2. Declare the grain
3. Identify the dimensions
4. Identify the facts
63. Business Process
• Business processes are the operational activities performed by your organization, such as taking
an order, registering students etc.
• It is important to determine the identity of the transaction table and specify exactly what it
represents.
• Represent a process or reporting environment that is of value to the organization
64. Grain (unit of analysis)
• Atomic grain refers to the lowest level at which data is captured by a given business process
• The grain determines what each fact record represents: the level of detail
• For example
– Individual transactions
– Snapshots (points in time)
– Line items
• Generally better to focus on the smallest grain
65. Dimensions
• A table (or hierarchy of tables) connected with the fact table with keys and foreign keys
• Preferably single valued for each fact record (1:m)
• Connected with surrogate (generated) keys, not operational keys
• Dimension tables contain text or numeric attributes
66. Facts
• Normally numeric Keys and additive measures
• Measurements associated with fact table records at fact table granularity
• Non-key attributes in the fact table
Attributes in dimension tables are constants. Facts vary with the granularity of the fact
table