M.Sc. Computer Science Data Mining
The secret of success is to know something nobody
else knows - Aristotle Onassis
DATA MINING
 Introduction
 What is data mining?
 Data Mining: On what kind of data?
 Data mining functionality
 Are all the patterns interesting?
 Classification of data mining systems
 Major issues in data mining
March 28, 2014 Module I : Data Mining and Warehousing
Introduction
 Data is growing at a phenomenal rate
 Users expect more sophisticated information
 How? Uncover hidden information through data mining.
© Prentice Hall
Evolution of Database Technology
 1960s:
 Data collection, database creation, data management (primitive file
processing)
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
and application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s—2000s:
 Data mining and data warehousing, multimedia databases, and Web
databases
What Is Data Mining?
 Data mining (knowledge discovery in databases):
 The non-trivial process of identifying
 valid
 novel
 potentially useful, and
 ultimately understandable patterns in data.
 Data mining refers to the discovery of new information in terms of
patterns or rules from vast amounts of data
Why Data Mining?
 From a managerial perspective:
 Strategic Decision Making
 Wealth Generation
 Analyzing trends
 Security
Database Processing vs. Data Mining Processing
Database processing
 Query: well defined; expressed in SQL
 Data: operational data
 Output: precise; a subset of the database
Data mining processing
 Query: poorly defined; no precise query language
 Data: not operational data
 Output: not a subset of the database
Query Examples
 Database
• Find all customers who live in Boa Vista
• Find all customers who use Mastercard
• Find all customers who missed one payment
 Data Mining
• Find all customers who are likely to miss one payment (Classification)
• List all items that are frequently purchased with bicycles (Association rules)
• Find any “unusual” customers or behavior (e.g., phone calls)
(Outlier detection, anomaly discovery)
Data Mining vs. KDD
 Knowledge Discovery in Databases (KDD): process of finding useful
information and patterns in data.
 Data Mining: Use of algorithms to extract the information and patterns
derived by the KDD process.
Data Mining: A KDD Process
 Data mining: the core of knowledge
discovery process.
[Figure: the KDD process. Databases → data cleaning and data integration → data warehouse → selection and transformation → task-relevant data → data mining → pattern evaluation]
Steps of a KDD Process
 Data cleaning: Remove noise and inconsistent data.
 Data integration: Combine multiple data sources.
 Data selection: Retrieve the data relevant to the analysis task from the database.
 Data transformation: Convert the data to a common format, or consolidate it into
forms appropriate for mining by performing aggregation or summary
operations.
 Data mining: Apply algorithms to extract patterns from the data.
 Pattern evaluation: The patterns obtained in the data mining stage are
converted into knowledge based on interestingness measures.
 Knowledge presentation: The knowledge obtained is presented to end
users in an understandable form, for example, through visualization.
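The steps above can be sketched as a small pipeline. This is only a minimal illustration in Python: the two source lists, the field names, and the frequency count used as the "mining" step are hypothetical stand-ins, not part of the original slides.

```python
from collections import Counter

# Hypothetical raw records from two sources (assumed data, for illustration only).
source_a = [{"customer": "c1", "item": "pen", "amount": "12.5"},
            {"customer": "c2", "item": None, "amount": "3.0"}]   # noisy: missing item
source_b = [{"customer": "c3", "item": "Pen", "amount": "7.5"},
            {"customer": "c4", "item": "honey", "amount": "4.0"}]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for src in sources for r in src]


def select(records, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: convert values to a common format."""
    return [{"item": r["item"].lower(), "amount": float(r["amount"])} for r in records]


def mine(records):
    """Data mining (stand-in): count how often each item occurs."""
    return Counter(r["item"] for r in records)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet an interestingness threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}


data = transform(select(integrate(clean(source_a), clean(source_b)), ["item", "amount"]))
knowledge = evaluate(mine(data), min_count=2)
print(knowledge)  # knowledge presentation: here, simply printed
```

In a real system each stage would be far richer; the point of the sketch is only the order in which the stages feed one another.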
Architecture of a Typical Data Mining System
[Figure: architecture of a typical data mining system. From bottom to top: databases, data warehouse, and WWW; data cleaning, integration, and selection; database or data warehouse server; data mining engine; pattern evaluation; graphical user interface. A knowledge base sits alongside these components.]
 Database, data warehouse, World Wide Web, or other information
repository: This is one or a set of databases, data warehouses,
spreadsheets, or other kinds of information repositories. Data
cleaning and data integration techniques may be performed on
the data.
 Database or data warehouse server: The database or data
warehouse server is responsible for fetching the relevant data,
based on the user’s data mining request.
 Knowledge base: This is the domain knowledge that is used to
guide the search or evaluate the interestingness of resulting
patterns. Such knowledge can include concept hierarchies, used
to organize attributes or attribute values into different levels of
abstraction.
 Data mining engine: This is essential to the data mining system
and ideally consists of a set of functional modules for tasks such
as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.
 Pattern evaluation module: This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the
search toward interesting patterns.
 User interface: This module communicates between users and the data
mining system, allowing the user to interact with the system by specifying a
data mining query or task, providing information to help focus the search, and
performing exploratory data mining based on the intermediate data mining
results.
Relational Databases
 A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set
of software programs to manage and access the data.
 A relational database is a collection of tables, each of which is assigned a
unique name. Each table consists of a set of attributes (columns or fields)
and usually stores a large set of tuples (records or rows). Each tuple in a
relational table represents an object identified by a unique key and
described by a set of attribute values.
Data Mining: On What Kind of Data?
[Figure: a typical data warehouse framework. Data sources in Chicago, New York, and Toronto are cleaned, integrated, transformed, loaded, and refreshed into the data warehouse, which clients access through query and analysis tools]
Data warehouses
A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site. Data
warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
Multidimensional Data
 Sales volume as a function of product, month, and region
[Figure: sales data cube with dimensions address, time (quarters), and item (types)]
A Sample Data Cube
[Figure: sample data cube of sales by time (1Qtr to 4Qtr), item (TV, comp, phone), and city (Chicago, New York, Toronto), with sum cells along each dimension. One highlighted cell holds the total annual sales of TVs in Chicago over the past four quarters]
Total sales
Sales
182

Sales by product
Product  Sales
Pen      120
Honey    12
Pencil   50

Sales by store
Store  Sales
1      102
2      80

Sales by store and product
Store  Product  Sales
1      Pen      90
1      Honey    12
2      Pencil   50
2      Pen      30
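The tables above are related by aggregation: the per-product, per-store, and overall totals can all be derived from the store-by-product table. A minimal Python sketch, using the figures from the slide (the function name `aggregate` is an assumption made for illustration):

```python
from collections import defaultdict

# Base table from the slide: (store, product, sales).
base = [(1, "Pen", 90), (1, "Honey", 12), (2, "Pencil", 50), (2, "Pen", 30)]


def aggregate(rows, key_index):
    """Sum sales grouped by the column at key_index (0 = store, 1 = product)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


by_product = aggregate(base, 1)      # aggregates away the store dimension
by_store = aggregate(base, 0)        # aggregates away the product dimension
total = sum(row[2] for row in base)  # aggregates away both dimensions

print(by_product, by_store, total)
```

Each result matches one of the summary tables, which is the idea behind viewing a warehouse as a cube with summaries at several levels.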
 Transactional databases
 Object-oriented and object-relational databases
 Spatial databases: contain space related information
 Time-series data and temporal data: Time related attributes
 Text databases and multimedia databases
 Heterogeneous and legacy databases
 WWW
Trans_ID  List of Items
T100      11, 13, 15, 16
T200      12, 13, 18
Data Mining Functionalities (1)
 Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
 Data mining tasks can be predictive or descriptive.
 Concept/class description: characterization and discrimination
 Data can be associated with classes or concepts.
 The description of a class in summarized, concise, and yet precise terms is
called a class/concept description.
 These descriptions can be derived via
 Data characterization
 Data discrimination
 Both characterization and discrimination
 Data characterization is a summarization of the general characteristics or
features of a target class of data.
 The data corresponding to the user-specified class are typically collected by a
database query.
 For example, a DM system should be able to produce a description
summarizing the characteristics of customers who spend more than $1,000 a
year.
 Data discrimination is a comparison of the general features of target class
data objects with the general features of objects from one or a set of contrasting classes.
 For example, a DM system should be able to compare two groups of customers: those who
shop for computer products regularly (more than two times a month) and
those who rarely shop for such products (i.e., less than three times a year).
Data Mining Functionalities (2)
 Mining Frequent Patterns, Associations, and Correlations
 A frequent itemset typically refers to a set of items that frequently appear
together in a transactional data set, such as milk and bread.
 Association analysis is the discovery of association rules showing attribute value
conditions that occur frequently together in a given set of data.
 X => Y
 E.g., buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
 Confidence “is a measure of how often the consequent is true when
the antecedent is true.”
 Here, if a customer buys a computer, there is a 50% chance that
the customer will buy software as well.
Data Mining Functionalities (3)
 “support is a measure of what fraction of the population
satisfies both the antecedent and the consequent of the rule”
 Here, 1% support means that in 1% of all transactions under
analysis, computer and software were purchased together.
 Association rules that contain a single predicate are referred to as
single-dimensional association rules.
 A rule can also involve more than one attribute or predicate; such a rule is a
multidimensional association rule:
 age(X, “20…29”) ^ income(X, “20K...29K”) => buys(X, “computer”) [support = 2%,
confidence = 60%]
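Support and confidence, as defined above, can be computed directly from a list of transactions. The sketch below is a minimal Python illustration; the transactions and the threshold values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Hypothetical transactions (assumed data); each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "printer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """How often the consequent holds among transactions where the antecedent holds."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


# Rule: buys(X, "computer") => buys(X, "software")
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)

# A rule is kept only if it meets user-specified thresholds (assumed values here).
min_support, min_confidence = 0.3, 0.5
is_interesting = rule_support >= min_support and rule_confidence >= min_confidence

print(rule_support, rule_confidence, is_interesting)
```

With these five transactions, two contain both items and four contain a computer, so the rule has 40% support and 50% confidence.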
Data Mining Functionalities (4)
 Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the
model to predict the class of objects whose class label is unknown.
 Given a set of items that have several classes, and given the past instances
(training instances) with their associated class, Classification is the process of
predicting the class of a new item.
 The derived model can be represented using
 IF-THEN rules
 Decision trees
 Neural networks, etc.
Data Mining Functionalities (5) - Classification
Classification Process: Model Construction
Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
The classification algorithm derives the classifier (model):
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classification Process: Use the Model in Prediction
Testing data (input to the classifier):
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen data: (Jeff, Professor, 4). Tenured?
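The two phases above can be traced in a few lines of Python. The rule and the tuples are taken from the slides; the function names `predict_tenured` and `accuracy` are assumptions made for this sketch, and the rule is hard-coded rather than learned by a classification algorithm.

```python
# Training and testing tuples from the slides: (name, rank, years, tenured).
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def predict_tenured(rank, years):
    """The slide's rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"


def accuracy(rows):
    """Fraction of tuples whose predicted label matches the recorded label."""
    correct = sum(predict_tenured(rank, years) == label for _, rank, years, label in rows)
    return correct / len(rows)


train_accuracy = accuracy(training)
test_accuracy = accuracy(testing)
jeff = predict_tenured("Professor", 4)  # the unseen tuple (Jeff, Professor, 4)

print(train_accuracy, test_accuracy, jeff)
```

The rule fits all six training tuples, but on the testing data it misclassifies Merlisa (Associate Prof, 7 years, not tenured), giving 3 correct out of 4. For the unseen tuple it predicts tenured = 'yes'.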
 age(X, “youth”) AND income(X, “high”) => class(X, “A”)
 age(X, “youth”) AND income(X, “low”) => class(X, “B”)
 age(X, “middle_aged”) => class(X, “C”)
 age(X, “senior”) => class(X, “C”)
Data Mining Functionalities (6) - Classification
[Figure: decision tree. The root tests age?; the youth branch leads to an income? test (high: class A, low: class B); the middle_aged and senior branch leads to class C]
Data Mining Functionalities (7) - Classification
[Figure: neural network with inputs age and income, units f1 to f8, and outputs Class A, Class B, and Class C]
Data Mining Functionalities (8) - Prediction
 Prediction is used to predict missing or unavailable numeric data values
rather than class labels.
 Classification and prediction may need to be preceded by relevance analysis,
which attempts to identify attributes that do not contribute to the
classification or prediction process so that they can be excluded.
Data Mining Functionalities (9)
 Cluster analysis
 Similar to classification, but the class labels are unknown and it is up to the
clustering algorithm to discover acceptable classes
 “Clustering algorithms find groups of items that are similar. … It divides a
data set so that records with similar content are in the same group, and
groups are as different as possible from each other. ”
 Example: An insurance company could use clustering to group clients by
their age, location, and types of insurance purchased.
 The categories are not specified in advance; this is referred to as ‘unsupervised
learning’.
Data Mining Functionalities (10)
Data Mining Functionalities (11)
 Clustering is based on the principle of maximizing the intra-class similarity and
minimizing the inter-class similarity
 Intra-class similarity means similarity between objects in same class
 Inter-class similarity means similarity between objects of different classes
 Each cluster that is formed can be viewed as a class of objects, from which
rules can be derived
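As an illustration of this principle, the sketch below groups client ages with a bare-bones one-dimensional k-means. The slides do not name a clustering algorithm, so k-means is a swapped-in choice here, and the ages and the naive "first k values" initialization are assumptions made to keep the example deterministic.

```python
def kmeans_1d(values, k, max_iter=100):
    """Group numbers into k clusters by repeatedly assigning each to its nearest centroid."""
    centroids = list(values[:k])  # naive deterministic initialization (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return clusters, centroids


ages = [22, 25, 27, 61, 64, 68]  # hypothetical client ages
clusters, centroids = kmeans_1d(ages, k=2)
print(clusters)
```

No class labels are supplied; the two age groups emerge from the data, and members of each group are closer to one another than to members of the other group.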
Data Mining Functionalities (12)
 Outlier analysis
 Outlier: a data object that does not comply with the general behavior of the
data
 It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis
 Trend and evolution analysis
 Describes and models regularities or trends for objects whose behavior
changes over time
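For the outlier analysis described above, one simple way to flag objects that do not comply with the general behavior of the data is a z-score test. This is a minimal sketch rather than a recommended method: the call durations and the threshold of 2 standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


call_minutes = [3, 4, 5, 4, 3, 5, 4, 60]  # hypothetical phone-call durations
outliers = zscore_outliers(call_minutes)
print(outliers)
```

Whether the flagged call is noise to discard or an exception worth investigating (for example, possible fraud) depends on the application.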
Classification of Data Mining systems:
Confluence of Multiple Disciplines
[Figure: data mining at the confluence of multiple disciplines]
Data Mining: Classification Schemes
 Different views, different classifications
 Kinds of databases to be mined:
relational, transactional, spatial etc.
 Kinds of knowledge to be discovered : based on DM functionalities;
characterization, discrimination, association, classification etc.
 Kinds of techniques utilized: DM systems can be categorized according to
the underlying DM techniques employed.
 These techniques can be described according to the degree of user interaction
involved or the methods of data analysis employed
 Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, neural network, etc.
 Kinds of applications adapted: DM systems can also be classified
according to the applications they are adapted to
 Retail, telecommunication, banking, fraud analysis, DNA mining,
stock market analysis, Web mining, Weblog analysis, etc.
DATA MINING TASK PRIMITIVES
 A data mining task can be specified in the form of a data mining query
 A data mining query is defined in terms of the following data mining task
primitives.
 Task-relevant data: This specifies the portions of the database or the set of data
in which the user is interested.
 This includes the database attributes or data warehouse dimensions of interest.
 Kind of knowledge to be mined: This specifies the data mining functions to be performed,
such as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.
 Background knowledge: Knowledge about the domain to be mined is
useful for guiding the knowledge discovery process and for evaluating the
patterns found.
 Concept hierarchies are a popular form of background knowledge; they allow
data to be mined at multiple levels of abstraction.
[Figure: concept hierarchy for location]
all
  India: Kerala (TVM, EKM), Tamil Nadu (Coimb, Chennai, …)
  Canada: Columbia, Ontario
 Interestingness measures and thresholds: They may be used to guide the
mining process or, after discovery, to evaluate the discovered patterns.
 Different kinds of knowledge may have different interestingness measures.
 For example, interestingness measures for association rules include support
and confidence.
 Rules whose support and confidence values are below user-specified
thresholds are considered uninteresting.
 Representation for visualizing: This refers to the form in which discovered
patterns are to be displayed, such as rules, tables, charts, graphs, decision
trees, and cubes.
INTEGRATION OF A DATA MINING SYSTEM WITH A
DATABASE OR DATA WAREHOUSE SYSTEM
No Coupling:
 The DM system does not utilize any function of a DB or DW system.
 It may fetch data from a particular source (such as a file system), process
the data using some data mining algorithms, and then store the mining results
in another file.
 This approach has several drawbacks. First, a DB system provides a great deal
of flexibility and efficiency at storing, organizing, accessing, and processing data.
 Without using a DB/DW system, a DM system may spend a substantial
amount of time finding, collecting, cleaning, and transforming data.
 Second, there are many tested, scalable algorithms and data structures
implemented in DB and DW systems. It is feasible to realize efficient,
scalable implementations using such systems.
 Third, most data have been or will be stored in DB/DW systems. Without any
coupling to such systems, a DM system will need to use other tools to
extract data, making it difficult to integrate such a system into an
information processing environment. Thus, no coupling is a poor design.
 Loose Coupling:
 The DM system uses some facilities of a DB/DW system,
fetching data from a data repository managed by these systems,
performing data mining, and then storing the mining results either in a
file or in a designated place in a database or data warehouse.
 It gains some of the advantages of flexibility, efficiency, and other features
provided by such systems.
 However, loosely coupled mining systems are main memory-based. Because
mining does not exploit the data structures and query optimization
methods provided by DB or DW systems, it is difficult for loose coupling
to achieve high scalability and good performance with large data sets.
 Semi-Tight Coupling:
 Besides linking a DM system to a DB/DW system, efficient
implementations of a few essential data mining primitives can be
provided in the DB/DW system.
 Some frequently used intermediate mining results can also be precomputed
and stored in the DB/DW system. This enhances the performance of the
DM system.
 Tight Coupling:
 The DM system is smoothly integrated into the DB/DW system.
 The data mining subsystem is treated as one functional component of an
information system.
 Data mining queries and functions are optimized based on query
analysis, data structures, indexing schemes, and query processing
methods of a DB or DW system.
 Tight coupling provides a uniform information processing
environment.
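Loose coupling, as described above, can be sketched with Python's built-in `sqlite3` module standing in for the DB system. The table names, the sample purchases, and the 0.6 support threshold are hypothetical. The point is the shape of the interaction: fetch with a query, mine in main memory, and write the results back to a designated table.

```python
import sqlite3
from collections import Counter

# An in-memory SQLite database stands in for the DB/DW system (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (trans_id TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("T100", "milk"), ("T100", "bread"), ("T200", "milk"),
     ("T300", "bread"), ("T300", "milk")],
)

# 1. Fetch data through the database's query facility.
rows = conn.execute("SELECT trans_id, item FROM purchases").fetchall()

# 2. Mine in main memory, outside the database: support of each single item.
n_transactions = len({trans_id for trans_id, _ in rows})
counts = Counter(item for _, item in rows)
frequent = [(item, n / n_transactions) for item, n in counts.items()
            if n / n_transactions >= 0.6]

# 3. Store the mining results in a designated place in the database.
conn.execute("CREATE TABLE mining_results (item TEXT, support REAL)")
conn.executemany("INSERT INTO mining_results VALUES (?, ?)", frequent)
conn.commit()

stored = conn.execute(
    "SELECT item, support FROM mining_results ORDER BY item").fetchall()
print(stored)
```

The mining step never touches the database's indexes or query optimizer, which is the scalability limitation attributed to loose coupling. A tightly coupled design would push that computation into the DB/DW system itself.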
Major Issues in Data Mining (1)
 Mining methodology and user interaction
 Mining different kinds of knowledge in databases
 Interactive mining of knowledge at multiple levels of abstraction
 Incorporation of background knowledge
 Data mining query languages and ad-hoc data mining
 Expression and visualization of data mining results
 Handling noise and incomplete data
 Pattern evaluation: the interestingness problem
 Performance and scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed and incremental mining methods
Major Issues in Data Mining (2)
 Issues relating to the diversity of data types
 Handling relational and complex types of data
 Mining information from heterogeneous databases and global information
systems (WWW)
 Issues related to applications and social impacts
 Application of discovered knowledge
Domain-specific data mining tools
Intelligent query answering
Process control and decision making
 Integration of the discovered knowledge with existing knowledge: A
knowledge fusion problem
 Protection of data security, integrity, and privacy
DATA WAREHOUSE
 The main repository of an organization’s historical data
 It contains the raw material for management’s decision
support system
 The term Data Warehouse was coined by Bill Inmon in 1990
 “A DW is a subject oriented, integrated, time-variant and non-
volatile collection of data in support of management’s decision
making process.”
 Subject oriented: A DW is organized around major subjects, such as
customer, supplier, product, sales etc.
 Rather than focusing on day-to-day operations, a DW concentrates on
the modeling and analysis of data for decision makers.
 It provides a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process
[Figure: subject areas Sales, Products, and Customers]
 Integrated: A DW is usually constructed by integrating multiple
heterogeneous sources such as relational databases, flat files, etc.
 Data cleaning and data integration techniques are applied
[Figure: savings account and loans account data integrated under the subject “account”]
 Time-variant:
 The time horizon for the data warehouse is significantly longer
than that of operational systems
 Operational database: current value data
 Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
 Nonvolatile:
 A physically separate store of data transformed from the
operational environment
 Operational update of data does not occur in the data
warehouse environment
 Does not require transaction processing, recovery, and
concurrency control mechanisms
 Requires only two operations in data accessing: initial loading of data
and access of data
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
Build wrappers/mediators on top of heterogeneous databases
Query driven approach
 When a query is posed to a client site, a meta-dictionary
is used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results
are integrated into a global answer set
 Complex information filtering
Data warehouse: update-driven, high performance
Information from heterogeneous sources is integrated in advance
and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational DBMS
 OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries
A multi-dimensional data model
 From Tables and Spreadsheets to Data Cubes:
 A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled and
viewed in multiple dimensions
 Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
 The data cube can be n-dimensional
 In data warehousing literature, an n-D base cube is called a base
cuboid. The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. The lattice of cuboids forms
a data cube.
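The lattice can be enumerated directly: each cuboid corresponds to one subset of the dimensions. The Python sketch below assumes four dimensions named as in the sales examples elsewhere in these slides and ignores concept hierarchies, so it yields 2^n cuboids for n dimensions.

```python
from itertools import combinations

dimensions = ["time", "item", "branch", "location"]


def cuboids(dims):
    """Return every cuboid as a tuple of dimensions, from the apex (0-D) to the base (n-D)."""
    return [combo for k in range(len(dims) + 1) for combo in combinations(dims, k)]


lattice = cuboids(dimensions)
apex, base = lattice[0], lattice[-1]

print(len(lattice))  # number of cuboids in the lattice
print(apex)          # the apex cuboid groups by no dimension at all
print(base)          # the base cuboid groups by every dimension
```

With concept hierarchies on the dimensions, each dimension contributes more than two choices, so the lattice grows larger than this count.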
Cube: A Lattice of Cuboids
Conceptual Modeling of Data Warehouses
 The most popular data model for a data warehouse is a
multidimensional model. Such a model can exist in the form of a star
schema, a snowflake schema or a fact constellation schema.
 Star schema: A fact table in the middle connected to a set of
dimension tables
 Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to snowflake
 Fact constellations: Multiple fact tables share dimension tables,
viewed as a collection of stars, therefore called galaxy schema
or fact constellation
Example of Star Schema
Example of Snowflake Schema
Example of Fact Constellation
A Data Mining Query Language, DMQL:
Language Primitives
 Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>
 Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
 Special Case (Shared Dimension Tables)
define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL
 define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
 define dimension time as (time_key, day, day_of_week, month, quarter, year)
 define dimension item as (item_key, item_name, brand, type, supplier_type)
 define dimension branch as (branch_key, branch_name, branch_type)
 define dimension location as (location_key, street, city, province_or_state, country)
Defining a Snowflake Schema in DMQL
 define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
 define dimension time as (time_key, day, day_of_week, month, quarter, year)
 define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
 define dimension branch as (branch_key, branch_name, branch_type)
 define dimension location as (location_key, street, city(city_key,
province_or_state, country))
Defining a Fact Constellation in DMQL
 define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
 define dimension time as (time_key, day, day_of_week, month, quarter, year)
 define dimension item as (item_key, item_name, brand, type, supplier_type)
 define dimension branch as (branch_key, branch_name, branch_type)
 define dimension location as (location_key, street, city, province_or_state, country)
 define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
 define dimension time as time in cube sales
 define dimension item as item in cube sales
 define dimension shipper as (shipper_key, shipper_name,
location as location in cube sales, shipper_type)
 define dimension from_location as location in cube sales
 define dimension to_location as location in cube sales
Measures of Data Cube
 A data cube measure is a numerical function that can be
evaluated at each point in the data cube space.
 A measure value is computed for a given point by aggregating the
data corresponding to the respective dimension-value pairs
defining the given point.
Measures of Data Cube: Three Categories
 Distributive: if the result derived by applying the function to n
aggregate values is the same as that derived by applying the
function on all the data without partitioning
 E.g., count(), sum(), min(), max()
 Algebraic: if it can be computed by an algebraic function with M
arguments (where M is a bounded integer), each of which is
obtained by applying a distributive aggregate function
 E.g., avg(), min_N(), standard_deviation()
 Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
 E.g., median(), mode(), rank()
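The difference between the three categories shows up when the data are split into partitions and each partition is summarized separately. A minimal Python sketch, with hypothetical partitions chosen for illustration:

```python
from statistics import mean, median

# Hypothetical data split into three partitions (assumed values).
partitions = [[1, 2, 3], [10, 20], [4, 5, 6, 100]]
all_values = [v for part in partitions for v in part]

# Distributive: sum() of the per-partition sums equals sum() over all the data.
partial_sums = [sum(part) for part in partitions]
total_from_parts = sum(partial_sums)

# Algebraic: avg() is computed from two distributive results, sum() and count().
partial_counts = [len(part) for part in partitions]
avg_from_parts = sum(partial_sums) / sum(partial_counts)

# Holistic: median() cannot in general be rebuilt from per-partition medians.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total_from_parts == sum(all_values))
print(abs(avg_from_parts - mean(all_values)) < 1e-9)
print(median_of_medians, true_median)
```

The sum and the average are recovered from small, fixed-size summaries of each partition, whereas the median of the partition medians differs from the true median here. An exact median needs access to the underlying values.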
Concept Hierarchy
 A concept hierarchy defines a sequence of mappings from a
set of low-level concepts to higher-level, more general
concepts
 Consider the dimension location with the cities Vancouver, Toronto, New York,
and Chicago. Each city can be mapped to the province or state
to which it belongs. Each province or state can in turn be mapped to its
country.
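The sequence of mappings for the location dimension can be written as a chain of lookup tables. In this Python sketch the level names and the function `generalize` are assumptions made for illustration; the city, province or state, and country values follow the example above.

```python
# Each table maps a value to its parent at the next, more general level.
city_to_state = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
state_to_country = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}

LEVELS = ["city", "province_or_state", "country"]
MAPPINGS = [city_to_state, state_to_country]


def generalize(city, level):
    """Map a city up the hierarchy to the requested level."""
    value = city
    for mapping in MAPPINGS[:LEVELS.index(level)]:
        value = mapping[value]
    return value


print(generalize("Chicago", "country"))
print(generalize("Toronto", "province_or_state"))
```

Replacing each city by its country in this way is what lets sales recorded per city be summarized per country.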
A Concept Hierarchy: Dimension (location)
[Figure: location hierarchy with the levels all, region, country, city, and office. all → Europe, North_America; Europe → Germany, Spain, ...; North_America → Canada, Mexico, ...; Germany → Frankfurt, ...; Canada → Vancouver, Toronto, ...; Vancouver → L. Chan, M. Wind, ...]
Typical OLAP Operations
 Roll up (drill-up): summarize data
 by climbing up hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
 from higher level summary to lower level summary or detailed
data, or introducing new dimensions
 Slice and dice: project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its back-
end relational tables (using SQL)
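The operations above can be illustrated on a tiny fact table held in memory. This is a sketch under stated assumptions, not how an OLAP server implements them: the fact rows, the city-to-country mapping, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows (assumed data): (city, quarter, item, sales).
facts = [
    ("Chicago", "Q1", "TV", 10), ("Chicago", "Q2", "TV", 20),
    ("Toronto", "Q1", "TV", 5), ("Toronto", "Q1", "phone", 7),
    ("New York", "Q2", "phone", 8),
]
country_of = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada"}


def roll_up(rows):
    """Roll up: climb the location hierarchy from city to country and sum sales."""
    totals = defaultdict(int)
    for city, quarter, item, sales in rows:
        totals[(country_of[city], quarter, item)] += sales
    return dict(totals)


def slice_quarter(rows, quarter):
    """Slice: select on a single dimension (here, time)."""
    return [r for r in rows if r[1] == quarter]


def dice(rows, cities, items):
    """Dice: select on two or more dimensions at once."""
    return [r for r in rows if r[0] in cities and r[2] in items]


def pivot(view):
    """Pivot: swap the two axes of a 2-D view keyed by (row, column)."""
    return {(col, row): value for (row, col), value in view.items()}


rolled = roll_up(facts)
q1 = slice_quarter(facts, "Q1")
toronto_tv = dice(facts, {"Toronto"}, {"TV"})
by_city_item = {(r[0], r[2]): r[3] for r in q1}  # a 2-D view of the Q1 slice
rotated = pivot(by_city_item)

print(rolled[("USA", "Q2", "TV")], len(q1), toronto_tv, rotated)
```

Drill-down is the reverse of `roll_up`: it returns from the country-level totals to the city-level rows in `facts`.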
[Figure: typical OLAP operations on a sales cube, with time in quarters]
Design of Data Warehouse: A Business Analysis
Framework
 Four views regarding the design of a data warehouse
 Top-down view
• allows selection of the relevant information necessary for the data
warehouse
 Data source view
• exposes the information being captured, stored, and managed by
operational systems
 Data warehouse view
• consists of fact tables and dimension tables
 Business query view
• sees the perspectives of data in the warehouse from the viewpoint of the end user
Data Warehouse Design Process
 Top-down, bottom-up approaches or a combination of both
 Top-down: Starts with overall design and planning (mature and well known)
 Bottom-up: Starts with experiments and prototypes (rapid)
 From software engineering point of view
 Waterfall: structured and systematic analysis at each step before
proceeding to the next
 Spiral: rapid generation of increasingly functional systems, with short
interval between successive releases
 Typical data warehouse design process
 Choose a business process to model, e.g., orders, invoices, etc.
 Choose the grain (atomic level of data) of the business process
 Choose the dimensions that will apply to each fact table record
 Choose the measure that will populate each fact table record
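The four design choices above can be written down as a small record. This is a minimal sketch: the class `FactTableDesign`, its field names, and the example values are hypothetical and only show how process, grain, dimensions, and measures fit together.

```python
from dataclasses import dataclass, field


@dataclass
class FactTableDesign:
    business_process: str                            # step 1: the process being modeled
    grain: str                                       # step 2: what one fact row represents
    dimensions: list = field(default_factory=list)   # step 3: dimensions per fact record
    measures: list = field(default_factory=list)     # step 4: measures per fact record

    def columns(self):
        """Fact-table columns: one key per dimension, followed by the measures."""
        return [f"{d}_key" for d in self.dimensions] + list(self.measures)


# Hypothetical example: modeling the orders process.
orders = FactTableDesign(
    business_process="orders",
    grain="one row per order line item",
    dimensions=["time", "item", "location"],
    measures=["units_sold", "dollars_sold"],
)
```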
Data Warehouse: A Three-Tier DW Architecture
[Figure: three-tier data warehouse architecture. Bottom tier: the data warehouse server, with metadata and data marts, fed from operational DBs and other sources through extract, transform, load, and refresh, with a monitor & integrator. Middle tier: OLAP server. Top tier: front-end tools for analysis, query, reports, and data mining.]
Three Data Warehouse Models
 Enterprise warehouse
 collects all of the information about subjects spanning the entire
organization
 Data Mart
 a subset of corporate-wide data that is of value to a specific
group of users. Its scope is confined to specific, selected
groups, such as a marketing data mart
 Targeted to meet the needs of small groups within the
organization
o Independent vs. dependent (directly from warehouse) data mart
 Dependent data mart: a subset that is created directly from a data warehouse
 Independent data mart: a small data warehouse designed for a strategic business unit or a
department
 Virtual warehouse
 A set of views over operational databases
 Only some of the possible summary views may be
materialized
Data Warehouse Back-End Tools and Utilities
 Data extraction
 get data from multiple, heterogeneous, and external sources
 Data cleaning
 detect errors in the data and rectify them when possible
 Data transformation
 convert data from legacy or host format to warehouse format
 Load
 sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions
 Refresh
 propagate the updates from the data sources to the
warehouse
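The five back-end steps can be pictured as a small pipeline. The sketch below is an assumption-laden toy: the record layout (`city`, `amount`), the source lists `crm` and `erp`, and every function name are invented for illustration and do not correspond to a real ETL tool.

```python
def extract(sources):
    """Extraction: gather raw records from several heterogeneous sources."""
    return [record for source in sources for record in source]


def clean(records):
    """Cleaning: drop records with a missing amount and normalise city names."""
    cleaned = []
    for rec in records:
        if rec.get("amount") is None:
            continue  # an error that cannot be rectified in this sketch
        cleaned.append({**rec, "city": rec["city"].strip().title()})
    return cleaned


def transform(records):
    """Transformation: convert source fields into the warehouse format."""
    return [(rec["city"], float(rec["amount"])) for rec in records]


def load(warehouse, rows):
    """Load: consolidate rows into a summary keyed by city."""
    for city, amount in rows:
        warehouse[city] = warehouse.get(city, 0.0) + amount
    return warehouse


def refresh(warehouse, new_sources):
    """Refresh: propagate later source updates through the same pipeline."""
    return load(warehouse, transform(clean(extract(new_sources))))


# Two hypothetical operational sources with inconsistent formatting.
crm = [{"city": " vancouver ", "amount": "10.5"}, {"city": "Toronto", "amount": None}]
erp = [{"city": "TORONTO", "amount": 4}]
warehouse = load({}, transform(clean(extract([crm, erp]))))
warehouse = refresh(warehouse, [[{"city": "vancouver", "amount": "1.5"}]])
```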
Data Warehouse Development: A Recommended Approach
 The recommended approach is to implement the warehouse in an
incremental and evolutionary manner.
 First, a high-level corporate data model is defined within a reasonably short
period that provides a corporate-wide, consistent, integrated view of data
among different subjects.
 Second, independent data marts can be implemented in parallel with the
enterprise warehouse based on the same corporate data model set.
 Third, distributed data marts can be constructed to integrate different data
marts
[Figure: Data Warehouse Development: A Recommended Approach. Starting from the definition of a high-level corporate data model, data marts and the enterprise data warehouse are built with model refinement, leading to distributed data marts and a multi-tier data warehouse.]
Metadata Repository
 Metadata is the data defining warehouse objects. It stores:
 Description of the structure of the data warehouse
 schema, view, dimensions, hierarchies, derived data definitions, data mart
locations and contents
 Operational metadata
 data lineage (history of migrated data and transformation
path), currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
 The algorithms used for summarization
 Data related to system performance
 warehouse schema, view, etc.
 Business data
 business terms and definitions, ownership of data, charging policies
OLAP Server
 An OLAP server is a high-capacity, multi-user data
manipulation engine specifically designed to
support and operate on multidimensional data
structures.
 The available OLAP server types are
 MOLAP server
 ROLAP server
 HOLAP server
OLAP Server Architectures
 Relational OLAP (ROLAP)
 These are intermediate servers that stand in between a relational back-end
server and client front-end tools
 They use a relational or extended-relational DBMS to store and manage
warehouse data
 Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services
 Greater scalability than MOLAP
Relational OLAP: 3 Tier DSS
 Database Layer (Data Warehouse): store atomic data in industry standard RDBMS.
 Application Logic Layer (ROLAP Engine): generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.
 Presentation Layer (Decision Support Client): obtain multi-dimensional reports.
OLAP Server Architectures
 Multidimensional OLAP (MOLAP)
 These servers support multidimensional views of data.
 Array-based multidimensional storage engine
 Fast indexing to pre-computed summarized data
 Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server 2000)
 Combines ROLAP and MOLAP technology
 Allows large volumes of detail data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store
MOLAP: 2 Tier DSS
 Database Layer and Application Logic Layer (MDDB Engine): store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, and obtain OLAP functionality via proprietary algorithms running against this data.
 Presentation Layer (Decision Support Client): obtain multi-dimensional reports.
DW Implementation: Efficient Data Cube Computation
 Data warehouses contain huge volumes of data.
 OLAP engines demand that decision support queries be answered in
the order of seconds. Therefore, it is crucial for data warehouse
systems to support highly efficient cube computation techniques,
access methods, and query processing techniques.
 Data cube can be viewed as a lattice of cuboids
 One approach to cube computation is to use the compute cube operator
 The compute cube operator computes aggregates over all subsets of the
dimensions specified in the operation.
 This can require excessive storage space, especially for a large number of
dimensions.
 Data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cuboid
 The top-most cuboid (apex) contains only one cell
 What is the total number of cuboids, or group-bys, that can be computed
for a data cube that contains 3 attributes: city, item, year?
 2^3 = 8: {(city, item, year), (city, item), (city, year), (item, year), (city), (item),
(year), ()}
 Apex cuboid contains the total sum of all sales
 Base cuboid returns the total sales for any combination of the three dimensions
 Base cuboid is the least generalized of the cuboids
 Apex cuboid is the most generalized of the cuboids
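The count of 8 can be checked by enumerating every subset of the three dimensions. The function name `cuboids` is an assumption made for this sketch; only `itertools.combinations` from the standard library is used.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every group-by (cuboid) of the given dimensions, base cuboid first."""
    result = []
    for k in range(len(dimensions), -1, -1):
        result.extend(combinations(dimensions, k))
    return result


lattice = cuboids(("city", "item", "year"))
base, apex = lattice[0], lattice[-1]
```

Here `base` is the 3-D cuboid `("city", "item", "year")` and `apex` is the empty group-by `()`.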
 An SQL query containing no group-by, such as ‘compute the sum of total sales’,
is a zero-dimensional operation
 An SQL query containing one group-by, such as ‘compute the sum of total sales,
group by city’, is a one-dimensional operation
 Therefore, the cube operator is the n-dimensional generalization of the
group-by operator
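To make the generalization concrete, the sketch below runs an ordinary group-by once for every subset of the dimensions. It is a naive illustration, not the cube operator of any particular DBMS; the sales rows and the names `group_by` and `compute_cube` are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales rows: (city, item, year, sales_in_dollars)
SALES = [
    ("Vancouver", "phone", 2004, 100),
    ("Vancouver", "computer", 2004, 250),
    ("Toronto", "phone", 2004, 80),
    ("Toronto", "phone", 2005, 120),
]
DIMS = ("city", "item", "year")


def group_by(rows, dims):
    """SUM(sales) GROUP BY dims; an empty dims tuple gives the grand total."""
    idx = [DIMS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def compute_cube(rows):
    """Run group_by once for every subset of DIMS (2**n group-bys in total)."""
    return {
        dims: group_by(rows, dims)
        for k in range(len(DIMS) + 1)
        for dims in combinations(DIMS, k)
    }


cube = compute_cube(SALES)
```

`cube[()]` is the zero-dimensional total and `cube[("city",)]` is the one-dimensional group-by on city.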
[Figure: lattice of cuboids for city, item, and year. Apex: (). 1-D cuboids: (city), (item), (year). 2-D cuboids: (city, item), (city, year), (item, year). Base: (city, item, year).]
Cube Operation
 Cube definition and computation in DMQL
define cube sales[item, city, year]: sum(sales_in_dollars)
 For a cube with n-dimensions,
compute cube sales
 The cube computation operator was first introduced by Gray
 OLAP may need to access different cuboids for different queries,
so it helps to pre-compute cuboids in advance.
 Pre-computation leads to fast response time and avoids some redundant
computation.
 A major challenge related to this pre-computation, however, is that the required
storage space may explode if all of the cuboids in a data cube are pre-computed,
especially when the cube has several dimensions associated with multiple-level
hierarchies.
 The storage requirements are even more excessive when many dimensions have
associated concept hierarchies, each with multiple levels. This problem is
referred to as the curse of dimensionality.
 If there were no hierarchies associated with each dimension, then the total
number of cuboids for an n-dimensional data cube, as we have seen above, is
2^n. However, in practice, many dimensions do have hierarchies.
 day < week < month < quarter < year
 With hierarchies, the total number of cuboids is
$$T = \prod_{i=1}^{n} (L_i + 1)$$
 where L_i is the number of levels associated with dimension i.
 1 is added to include the virtual top level, all
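The cuboid count with hierarchies, T = (L_1 + 1)(L_2 + 1)...(L_n + 1), is easy to evaluate directly. The function name `total_cuboids` and the example level counts are assumptions chosen for illustration.

```python
import math


def total_cuboids(levels):
    """T = product over dimensions of (L_i + 1); the +1 is the virtual level 'all'."""
    return math.prod(n_levels + 1 for n_levels in levels)


# Without hierarchies each dimension has a single level, so T reduces to 2**n.
no_hierarchy = total_cuboids([1, 1, 1])
# Hypothetical cube: 10 dimensions, each with 4 levels below 'all'.
ten_dims = total_cuboids([4] * 10)
```

With only three flat dimensions the result matches the 2^3 = 8 cuboids counted earlier, while ten dimensions of four levels each already give 5^10 cuboids.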
Partial Materialization: Selected Computation of Cuboids
 Materialization of data cube
 No materialization: pre-compute only the base cuboid and none of the
remaining non-base cuboids
 Full materialization: pre-compute all of the cuboids
 Partial materialization: selectively compute a proper subset of the
whole set of possible cuboids
(1) identify the subset of cuboids to materialize,
• based on size, sharing, access frequency, etc.
(2) exploit the materialized cuboids during query processing, and
(3) efficiently update the materialized cuboids during load and refresh.
Iceberg Cube
 Compute only the cuboid cells whose count or other aggregate
satisfies a condition such as
HAVING COUNT(*) >= minsup
 Only the “interesting” cells are calculated: those with data above a certain
threshold
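The HAVING COUNT(*) >= minsup condition can be sketched as a filter over every cuboid. This naive version counts all cells first and then discards the small ones; it does not prune the search the way dedicated iceberg-cube algorithms do. The rows and the name `iceberg_cube` are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical fact rows: (city, item, year)
ROWS = [
    ("Vancouver", "phone", 2004),
    ("Vancouver", "phone", 2004),
    ("Vancouver", "computer", 2004),
    ("Toronto", "phone", 2005),
]
DIMS = ("city", "item", "year")


def iceberg_cube(rows, minsup):
    """For every cuboid, keep only the cells with COUNT(*) >= minsup."""
    cube = {}
    for k in range(len(DIMS) + 1):
        for dims in combinations(range(len(DIMS)), k):
            counts = Counter(tuple(row[i] for i in dims) for row in rows)
            cells = {cell: n for cell, n in counts.items() if n >= minsup}
            cube[tuple(DIMS[i] for i in dims)] = cells
    return cube


iceberg = iceberg_cube(ROWS, minsup=2)
```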
Indexing OLAP Data: Bitmap Index
 Bitmap indexing is a popular method in OLAP that allows quick searching in data
cubes
 It is an alternative representation of the record_ID (RID) list
 For each value v in the domain of a given attribute, there is a distinct bit vector Bv
 If the domain of a given attribute contains n values, then n bits are needed for
each entry in the bitmap index.
 If the attribute has the value v for a given row in the data table, then the bit
representing that value is set to 1 in the corresponding row of the bitmap index.
Indexing OLAP Data: Bitmap Index
Base table
RID  Item  City
R1   H     V
R2   C     V
R3   P     V
R4   S     V
R5   H     T
R6   C     T
R7   P     T
R8   S     T

Index on Item
RID  H  C  P  S
R1   1  0  0  0
R2   0  1  0  0
R3   0  0  1  0
R4   0  0  0  1
R5   1  0  0  0

Index on City
RID  V  T
R1   1  0
R2   1  0
R3   1  0
R4   1  0
R5   0  1
R6   0  1
R7   0  1
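The bitmap index for the base table above can be built with one integer per attribute value, where bit k stands for the k-th record. This is a minimal sketch; the packing of bit vectors into Python integers and the names `bitmap_index` and `rids_for` are assumptions made for illustration.

```python
# Base table from the slide: RID -> (item, city)
BASE = {
    "R1": ("H", "V"), "R2": ("C", "V"), "R3": ("P", "V"), "R4": ("S", "V"),
    "R5": ("H", "T"), "R6": ("C", "T"), "R7": ("P", "T"), "R8": ("S", "T"),
}
RIDS = list(BASE)


def bitmap_index(column):
    """Build one bit vector per distinct value; bit k belongs to RIDS[k]."""
    index = {}
    for position, rid in enumerate(RIDS):
        value = BASE[rid][column]
        index[value] = index.get(value, 0) | (1 << position)
    return index


def rids_for(bits):
    """Decode a bit vector back into the matching record IDs."""
    return [rid for position, rid in enumerate(RIDS) if (bits >> position) & 1]


item_index = bitmap_index(0)
city_index = bitmap_index(1)
# Item = H AND City = T becomes a single bitwise AND of two vectors.
matches = rids_for(item_index["H"] & city_index["T"])
```

Combining conditions with bitwise AND and OR is what makes this representation quick to search.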
Indexing OLAP Data: Join Indices
 Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
 In data warehouses, a join index relates the values of
the dimensions of a star schema to rows in the fact
table.
 E.g., fact table Sales and two dimensions, city
and product
• A join index on city maintains, for each
distinct city, a list of R-IDs of the tuples
recording the sales in that city
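A join index of this kind can be sketched as a map from each dimension value to the fact-table record IDs that carry it. The Sales rows, the record IDs, and the function name `join_index` below are hypothetical.

```python
from collections import defaultdict

# Hypothetical Sales fact rows: fact RID -> (city, product, dollars)
SALES = {
    "T57": ("Vancouver", "phone", 100),
    "T238": ("Vancouver", "computer", 250),
    "T459": ("Toronto", "phone", 80),
    "T884": ("Vancouver", "phone", 60),
}


def join_index(fact_rows, column):
    """Map each distinct dimension value to the list of fact RIDs that carry it."""
    index = defaultdict(list)
    for rid, row in fact_rows.items():
        index[row[column]].append(rid)
    return dict(index)


city_ji = join_index(SALES, 0)
product_ji = join_index(SALES, 1)
# Sales in Vancouver are found by following RIDs instead of scanning the fact table.
vancouver_total = sum(SALES[rid][2] for rid in city_ji["Vancouver"])
```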
Efficient Processing of OLAP Queries
 The purpose of materializing cuboids and constructing OLAP index structures is to speed
up query processing in data cubes. Given materialized views, query processing
proceeds as follows:
 Determine which operations should be performed on the available cuboids
 Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice =
selection + projection
 Determine which materialized cuboid(s) should be selected for the OLAP operation
 Suppose the data cube has dimensions {time, item, location}, with the dimension
hierarchies “day < month < quarter < year” for time, “item_name < brand < type” for item, and
“street < city < province_or_state < country” for location
 Let the query to be processed be on {brand, province_or_state} with the condition
“year = 2004”, and suppose there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
 Explore indexing structures and compressed
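One way to approach the cuboid-selection question is to check, per dimension, whether a cuboid is stored at the query's level or at a finer one. The sketch below only tests usability and does not model cost; the function name `can_answer` and the treatment of cuboid 4 as if it carried the year level are assumptions.

```python
# Hierarchy levels from finest to coarsest, as given on the slide.
HIERARCHIES = {
    "time": ["day", "month", "quarter", "year"],
    "item": ["item_name", "brand", "type"],
    "location": ["street", "city", "province_or_state", "country"],
}
LEVEL = {name: (dim, depth)
         for dim, levels in HIERARCHIES.items()
         for depth, name in enumerate(levels)}


def can_answer(cuboid, query_levels):
    """A cuboid is usable if, per dimension, it is at the query's level or finer."""
    have = dict(LEVEL[name] for name in cuboid)  # dimension -> depth stored
    for name in query_levels:
        dim, depth = LEVEL[name]
        if dim not in have or have[dim] > depth:
            return False
    return True


# The query groups by brand and province_or_state and selects on year.
QUERY = ["brand", "province_or_state", "year"]
CUBOIDS = {
    1: ["year", "item_name", "city"],
    2: ["year", "brand", "country"],
    3: ["year", "brand", "province_or_state"],
    # Assumption: cuboid 4 is pre-filtered to year = 2004, modelled as holding year.
    4: ["year", "item_name", "province_or_state"],
}
usable = [n for n, cuboid in CUBOIDS.items() if can_answer(cuboid, QUERY)]
```

Cuboid 2 drops out because country is coarser than province_or_state. Choosing among the remaining candidates depends on their sizes and available indices, which this sketch leaves out.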
From DW to DM
 DW Usage
 Data warehouses and data marts are used in a wide range of applications.
 Business executives in almost every industry use the data stored in data
warehouses and data marts to perform data analysis and make strategic
decisions.
 Initially, the data warehouse is mainly used for generating reports and answering
predefined queries.
 Progressively, it is used to analyze summarized and detailed data, where the
results are presented in the form of reports and charts.
 Later, the data warehouse is used for strategic purposes, performing
multidimensional analysis and sophisticated slice-and-dice operations.
 Finally, the data warehouse may be employed for knowledge discovery and
strategic decision making using data mining tools.
 Tools for data warehousing can be categorized into access and retrieval tools, database
reporting tools, data analysis tools, and data mining tools.
Data Warehouse Usage
 Three kinds of data warehouse applications
 Information processing: supports querying, basic statistical analysis,
and reporting using crosstabs, tables, charts and graphs
 Analytical processing: multidimensional analysis of data warehouse
data. It supports basic OLAP operations: slice-dice, drilling, pivoting
 Data mining: knowledge discovery from hidden patterns. It supports
finding associations, constructing analytical models, performing classification
and prediction, and presenting the mining results using visualization
tools.
From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
 OLAM integrates OLAP with data mining, mining knowledge in
multidimensional databases
 Why online analytical mining?
 High quality of data in data warehouses
• DW contains integrated, consistent, cleaned data
 Available information processing structure surrounding data warehouses
• ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP
tools
 OLAP-based exploratory data analysis
• mining with drilling, dicing, pivoting, etc.
 On-line selection of data mining functions
• integration and swapping of multiple mining functions, algorithms, and
tasks.
An OLAM Architecture
[Figure: OLAM architecture in four layers. Layer 1, data repository: databases and data warehouse, populated through data cleaning, data integration, and filtering, and reached through a database API. Layer 2, MDDB: multidimensional database with metadata, reached through a data cube API. Layer 3, OLAP/OLAM: OLAM engine and OLAP engine. Layer 4, user interface: graphical user interface API that takes mining queries and returns mining results.]

More Related Content

What's hot

Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Salah Amean
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsJustin Cletus
 
Dbms relational model
Dbms relational modelDbms relational model
Dbms relational modelChirag vasava
 
Data Reduction
Data ReductionData Reduction
Data ReductionRajan Shah
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingankur bhalla
 
Oracle basic queries
Oracle basic queriesOracle basic queries
Oracle basic queriesPRAKHAR JHA
 
Data Reduction Stratergies
Data Reduction StratergiesData Reduction Stratergies
Data Reduction StratergiesAnjaliSoorej
 
11. Storage and File Structure in DBMS
11. Storage and File Structure in DBMS11. Storage and File Structure in DBMS
11. Storage and File Structure in DBMSkoolkampus
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Salah Amean
 
B trees in Data Structure
B trees in Data StructureB trees in Data Structure
B trees in Data StructureAnuj Modi
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining Sulman Ahmed
 

What's hot (20)

3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
B tree
B treeB tree
B tree
 
Assosiate rule mining
Assosiate rule miningAssosiate rule mining
Assosiate rule mining
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Database Security
Database SecurityDatabase Security
Database Security
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Dbms relational model
Dbms relational modelDbms relational model
Dbms relational model
 
Data Reduction
Data ReductionData Reduction
Data Reduction
 
NUMPY
NUMPY NUMPY
NUMPY
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Oracle basic queries
Oracle basic queriesOracle basic queries
Oracle basic queries
 
Data Reduction Stratergies
Data Reduction StratergiesData Reduction Stratergies
Data Reduction Stratergies
 
3 Data Mining Tasks
3  Data Mining Tasks3  Data Mining Tasks
3 Data Mining Tasks
 
DBMS - RAID
DBMS - RAIDDBMS - RAID
DBMS - RAID
 
11. Storage and File Structure in DBMS
11. Storage and File Structure in DBMS11. Storage and File Structure in DBMS
11. Storage and File Structure in DBMS
 
The Relational Model
The Relational ModelThe Relational Model
The Relational Model
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
 
B trees in Data Structure
B trees in Data StructureB trees in Data Structure
B trees in Data Structure
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 

Viewers also liked

Data warehouse system and its concepts
Data warehouse system and its conceptsData warehouse system and its concepts
Data warehouse system and its conceptsGaurav Garg
 
OLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSEOLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSEZalpa Rathod
 
Star ,Snow and Fact-Constullation Schemas??
Star ,Snow and  Fact-Constullation Schemas??Star ,Snow and  Fact-Constullation Schemas??
Star ,Snow and Fact-Constullation Schemas??Abdul Aslam
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Miningidnats
 

Viewers also liked (6)

DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
Datawarehouse and OLAP
Datawarehouse and OLAPDatawarehouse and OLAP
Datawarehouse and OLAP
 
Data warehouse system and its concepts
Data warehouse system and its conceptsData warehouse system and its concepts
Data warehouse system and its concepts
 
OLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSEOLAP & DATA WAREHOUSE
OLAP & DATA WAREHOUSE
 
Star ,Snow and Fact-Constullation Schemas??
Star ,Snow and  Fact-Constullation Schemas??Star ,Snow and  Fact-Constullation Schemas??
Star ,Snow and Fact-Constullation Schemas??
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 

Similar to Uncover Hidden Information with Data Mining

Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesasnaparveen414
 
DMML1_overview.ppt
DMML1_overview.pptDMML1_overview.ppt
DMML1_overview.pptbutest
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1Mahmoud Alfarra
 
20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.pptPalaniKumarR2
 
Data Mining Concepts and Techniques
Data Mining Concepts and TechniquesData Mining Concepts and Techniques
Data Mining Concepts and TechniquesPratik Tambekar
 
20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.pptSamPrem3
 
PowerPoint Template
PowerPoint TemplatePowerPoint Template
PowerPoint Templatebutest
 
DMDW Lesson 04 - Data Mining Theory
DMDW Lesson 04 - Data Mining TheoryDMDW Lesson 04 - Data Mining Theory
DMDW Lesson 04 - Data Mining TheoryJohannes Hoppe
 
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxRupaRani28
 

Similar to Uncover Hidden Information with Data Mining (20)

Data mining 1
Data mining 1Data mining 1
Data mining 1
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notes
 
data mining
data miningdata mining
data mining
 
DMML1_overview.ppt
DMML1_overview.pptDMML1_overview.ppt
DMML1_overview.ppt
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
Dm unit i r16
Dm unit i   r16Dm unit i   r16
Dm unit i r16
 
Introduction to DataMining
Introduction to DataMiningIntroduction to DataMining
Introduction to DataMining
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1
 
20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt
 
Data mining
Data miningData mining
Data mining
 
Data Mining Concepts and Techniques
Data Mining Concepts and TechniquesData Mining Concepts and Techniques
Data Mining Concepts and Techniques
 
20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt20IT501_DWDM_PPT_Unit_II.ppt
20IT501_DWDM_PPT_Unit_II.ppt
 
Data mining
Data miningData mining
Data mining
 
Data mining
Data miningData mining
Data mining
 
PowerPoint Template
PowerPoint TemplatePowerPoint Template
PowerPoint Template
 
Data warehousing and Data mining
Data warehousing and Data mining Data warehousing and Data mining
Data warehousing and Data mining
 
DMDW Lesson 04 - Data Mining Theory
DMDW Lesson 04 - Data Mining TheoryDMDW Lesson 04 - Data Mining Theory
DMDW Lesson 04 - Data Mining Theory
 
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptx
 
2 Data-mining process
2   Data-mining process2   Data-mining process
2 Data-mining process
 

Recently uploaded

AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 

Recently uploaded (20)

AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Uncover Hidden Information with Data Mining

  • 1. M.Sc. Computer Science Data Mining The secret of success is to know something nobody else knows - Aristotle Onassis
  • 2. DATA MINING: Introduction; What is data mining?; Data mining: on what kind of data?; Data mining functionality; Are all the patterns interesting?; Classification of data mining systems; Major issues in data mining. (March 28, 2014, Module I: Data Mining and Warehousing)
  • 3. Introduction: Data is growing at a phenomenal rate. Users expect more sophisticated information. How? Uncover hidden information: data mining. (© Prentice Hall)
  • 4. Evolution of Database Technology. 1960s: data collection, database creation, data management (primitive file processing). 1970s: relational data model, relational DBMS implementation. 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.). 1990s-2000s: data mining and data warehousing, multimedia databases, and Web databases.
  • 5. What Is Data Mining? Data mining (knowledge discovery in databases): the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data mining refers to the discovery of new information, in terms of patterns or rules, from vast amounts of data.
  • 6. Why Data Mining? From a managerial perspective: strategic decision making, wealth generation, analyzing trends, security.
  • 7. Database Processing vs. Data Mining Processing. Database processing: the query is well defined and written in SQL; the data is operational data; the output is precise and is a subset of the database. Data mining processing: the query is poorly defined, with no precise query language; the data is not operational data; the output is not a subset of the database.
  • 8. Query Examples. Database: find all customers who live in Boa Vista; find all customers who use Mastercard; find all customers who missed one payment. Data mining: find all customers who are likely to miss one payment (classification); list all items that are frequently purchased with bicycles (association rules); find any “unusual” customers or behavior, e.g., phone calls (outlier detection, anomaly discovery).
  • 9. Data Mining vs. KDD. Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data. Data mining: the use of algorithms to extract the information and patterns derived by the KDD process.
  • 10. Data Mining: A KDD Process. Data mining is the core of the knowledge discovery process. [Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection and Transformation -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
  • 11. Steps of a KDD Process. Data cleaning: remove noise and inconsistent data. Data integration: combine multiple data sources. Data selection: obtain the relevant data from the database. Data transformation: convert the data to a common format, or consolidate it into forms appropriate for mining by performing aggregation or summary operations. Data mining: apply algorithms to obtain the desired patterns. Pattern evaluation: the patterns obtained in the data mining stage are turned into knowledge based on some interestingness measures. Knowledge presentation: the knowledge obtained is presented to end users in an understandable form, for example through visualization.
  • 12. Architecture of a Typical Data Mining System. [Figure: components, bottom to top: Databases, Data Warehouse, and WWW (with data cleaning, integration, and selection); database or data warehouse server; data mining engine; pattern evaluation; graphical user interface; plus a knowledge base]
  • 13. Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data. Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request. Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
  • 14. Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
  • 15. Data Mining: On What Kind of Data? Relational databases. A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
  • 16. Data warehouses. A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. [Figure: data sources in Chicago, New York, and Toronto -> clean, integrate, transform, load, refresh -> Data Warehouse -> query and analysis tools -> clients]
  • 17. Multidimensional Data. Sales volume as a function of product, month, and region. [Figure: data cube with axes labeled time (quarters) and item (types); dimensions: address, time, item]
  • 18. A Sample Data Cube. [Figure: cube over time (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), item (TV, comp, phone, sum), and city (Chicago, Toronto, New York, sum); the highlighted cell is the total annual sales of TV in Chicago over the past 4 quarters]
  • 19. Base table (store, product, sales): (1, Pen, 90); (1, Honey, 12); (2, Pencil, 50); (2, Pen, 30). Aggregated by product: Pen 120; Honey 12; Pencil 50. Aggregated by store: store 1 = 102; store 2 = 80. Total sales: 182.
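
The aggregates on slide 19 can be reproduced from the base rows. The following is a minimal sketch; the helper name `aggregate` is an assumption introduced for illustration, while the figures are the ones on the slide.

```python
from collections import defaultdict

# Base rows copied from the slide: (store, product, sales).
rows = [(1, "Pen", 90), (1, "Honey", 12), (2, "Pencil", 50), (2, "Pen", 30)]

def aggregate(rows, key_index):
    """Sum the sales column, grouped by the column at key_index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)

by_product = aggregate(rows, 1)          # group by product
by_store = aggregate(rows, 0)            # group by store
grand_total = sum(row[2] for row in rows)

print(by_product, by_store, grand_total)
```

Each grouping drops one dimension of the base table, which is the same idea as moving up the lattice of cuboids in a data cube.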
  • 20. Transactional databases; object-oriented and object-relational databases; spatial databases (contain space-related information); time-series and temporal data (time-related attributes); text databases and multimedia databases; heterogeneous and legacy databases; WWW. Example transactional table (Trans_ID: list of items): T100: 11, 13, 15, 16; T200: 12, 13, 18.
  • 21. Data Mining Functionalities (1). Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. A data mining task can be predictive or descriptive. Concept/class description: characterization and discrimination. Data can be associated with classes or concepts. The description of a class in summarized, concise, and yet precise terms is called a class/concept description. These descriptions can be derived via data characterization, data discrimination, or both. Data characterization is a summarization of the general characteristics or features of a target class of data.
  • 22. Data Mining Functionalities (2). The data corresponding to the user-specified class are typically collected by a database query. For example, a DM system should be able to produce a description summarizing the characteristics of customers who spend more than $1,000 a year. Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. For example, a DM system should be able to compare two groups of customers: those who shop for computer products regularly (more than two times a month) and those who rarely shop for such products (i.e., less than 3 times a year).
  • 23. Data Mining Functionalities (3). Mining frequent patterns, associations, and correlations. A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as milk and bread. Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Rules take the form X => Y, e.g., buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]. Confidence “is a measure of how often the consequent is true when the antecedent is true.” Here, if a customer buys a computer, there is a 50% chance that the customer will buy software as well.
  • 24. Data Mining Functionalities (4). “Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.” Here, 1% support means that 1% of all transactions under analysis showed computer and software purchased together. Association rules that contain a single predicate are referred to as single-dimensional association rules. A rule can also involve more predicates or attributes, for example: age(X, “20…29”) ^ income(X, “20K...29K”) => buys(X, “computer”) [support = 2%, confidence = 60%].
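
The two definitions quoted on slides 23 and 24 can be computed directly over a list of transactions. This is a minimal sketch: the function names and the four toy transactions are assumptions made for illustration and are not data from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent holds when the antecedent holds."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0

# Hypothetical transactions, chosen only to make the arithmetic easy to follow.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"bread", "milk"},
    {"milk"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```

With these toy transactions, one of four baskets contains both items and one of the two computer baskets also contains software, so the rule buys(computer) => buys(software) has support 0.25 and confidence 0.5.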
  • 25. Data Mining Functionalities (5) - Classification. Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. Given a set of items that have several classes, and given the past instances (training instances) with their associated class, classification is the process of predicting the class of a new item. The derived model can be represented using IF-THEN rules, decision trees, neural networks, etc.
  • 26. Classification Process: Model Construction. Training data (NAME, RANK, YEARS, TENURED): (Mike, Assistant Prof, 3, no); (Mary, Assistant Prof, 7, yes); (Bill, Professor, 2, yes); (Jim, Associate Prof, 7, yes); (Dave, Assistant Prof, 6, no); (Anne, Associate Prof, 3, no). The classification algorithm produces the classifier (model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’.
  • 27. Classification Process: Use the Model in Prediction. Testing data (NAME, RANK, YEARS, TENURED): (Tom, Assistant Prof, 2, no); (Merlisa, Associate Prof, 7, no); (George, Professor, 5, yes); (Joseph, Assistant Prof, 7, yes). Unseen data: (Jeff, Professor, 4). Tenured?
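
The two-step process on slides 26 and 27 can be traced by hand-coding the learned rule and applying it to the testing data. This is only a sketch: the function name `predict_tenured` is an assumption, and the rule is written out directly instead of being learned by an algorithm.

```python
def predict_tenured(rank, years):
    """The classifier from slide 26: IF rank = professor OR years > 6 THEN yes."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from slide 27: (name, rank, years, actual label).
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(1 for _, rank, years, label in testing
              if predict_tenured(rank, years) == label)
accuracy = correct / len(testing)

# Unseen tuple from the slide: (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)
print(accuracy, jeff)
```

The rule misclassifies Merlisa (7 years but not tenured), so accuracy on the testing data is 3 out of 4, and Jeff is predicted as tenured because his rank is Professor.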
  • 28. Data Mining Functionalities (6) - Classification. IF-THEN rules: age(X, “youth”) AND income(X, “high”) => class(X, “A”); age(X, “youth”) AND income(X, “low”) => class(X, “B”); age(X, “middle_aged”) => class(X, “C”); age(X, “senior”) => class(X, “C”).
  • 29. Data Mining Functionalities (7) - Classification. [Figure: a decision tree that tests age? first (youth -> test income?: high -> class A, low -> class B; middle_aged or senior -> class C), and a neural network with nodes f1 to f8 taking age and income as inputs and producing Class A, Class B, and Class C as outputs]
  • 30. Data Mining Functionalities (8) - Prediction. Prediction is used to predict missing or unavailable numeric data values rather than class labels. Classification and prediction may need to be preceded by relevance analysis, which attempts to identify attributes that do not contribute to the classification or prediction process.
  • 31. Data Mining Functionalities (9). Cluster analysis: similar to classification, but the class label is unknown and it is up to the clustering algorithm to discover acceptable classes. “Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other.” Example: an insurance company could use clustering to group clients by their age, location, and types of insurance purchased. The categories are unspecified, and this is referred to as ‘unsupervised learning’.
  • 32. Data Mining Functionalities (10)
  • 33. Data Mining Functionalities (11). Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. Intra-class similarity means similarity between objects in the same class. Inter-class similarity means similarity between objects of different classes. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived.
  • 34. Data Mining Functionalities (12). Outlier analysis. Outlier: a data object that does not comply with the general behavior of the data. It can be considered noise or an exception, but it is quite useful in fraud detection and rare-event analysis. Trend and evolution analysis: describes and models regularities or trends for objects whose behavior changes over time.
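
One simple way to flag objects that "do not comply with the general behavior of the data" is a z-score test. The slide does not prescribe any particular method, so the technique, the threshold of 2.0, the function name, and the sample values below are all assumptions made for illustration.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical daily call counts for one customer; 95 is the unusual day.
calls = [10, 12, 11, 13, 12, 11, 95]
outliers = zscore_outliers(calls)
print(outliers)
```

A z-score test assumes roughly bell-shaped data and is itself distorted by extreme values, so it is best read as a first screening step, not a full fraud-detection method.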
  • 35. Classification of Data Mining Systems: Confluence of Multiple Disciplines. [Figure: Data Mining at the center]
  • 36. Data Mining: Classification Schemes. Different views lead to different classifications. Kinds of databases to be mined: relational, transactional, spatial, etc. Kinds of knowledge to be discovered: based on DM functionalities such as characterization, discrimination, association, classification, etc. Kinds of techniques utilized: DM systems can be categorized according to the underlying DM technique employed. These techniques can be described according to the degree of user interaction involved or the methods of data analysis employed: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
  • 37. Data Mining: Classification Schemes. Kinds of applications adapted: DM systems can also be classified according to the applications they are adapted to, such as retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
  • 38. DATA MINING TASK PRIMITIVES. A data mining task can be specified in the form of a data mining query. A data mining query is defined in terms of the following data mining task primitives. Task-relevant data: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest. Kind of knowledge: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
  • 39. Background knowledge: Knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge; they allow data to be mined at multiple levels of abstraction. [Figure: example concept hierarchy with levels all; country (India, Canada); state or province (Kerala, Tamil Nadu, Columbia, Ontario); city (EKM, TVM, Coimb, Chennai, …)]
  • 40. Interestingness measures and thresholds: They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting. Representation for visualizing: This refers to the form in which discovered patterns are to be displayed, such as rules, tables, charts, graphs, decision trees, and cubes.
  • 41. INTEGRATION OF A DATA MINING SYSTEM WITH A DATABASE OR DATA WAREHOUSE SYSTEM. No coupling: The DM system does not utilize any function of a DB or DW system. It may fetch data from a particular source (such as a file system), process the data using some data mining algorithms, and then store the mining results in another file. This has drawbacks. First, a DB system provides a great deal of flexibility and efficiency at storing, organizing, accessing, and processing data; without a DB/DW system, a DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data. Second, there are many tested, scalable algorithms and data structures implemented in DB and DW systems, and it is feasible to realize efficient, scalable implementations using such systems.
  • 42. INTEGRATION OF A DATA MINING SYSTEM WITH A DATABASE OR DATA WAREHOUSE SYSTEM. Moreover, most data have been or will be stored in DB/DW systems. Without any coupling to such systems, a DM system will need to use other tools to extract data, making it difficult to integrate such a system into an information processing environment. Thus, no coupling is a poor design. Loose coupling: The data mining system uses some facilities of a DB/DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.
  • 43. Loose coupling gains some of the flexibility, efficiency, and other features provided by such systems. However, many loosely coupled mining systems are main-memory based. Because the mining does not exploit the data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets. Semi-tight coupling: Besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system. Some frequently used intermediate mining results can also be precomputed and stored in the DB/DW system, which enhances the performance of the DM system.
  • 44. Tight coupling: The DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system. Data mining queries and functions are optimized based on the query analysis, data structures, indexing schemes, and query processing methods of the DB or DW system. Tight coupling provides a uniform information processing environment.
  • 45. Major Issues in Data Mining (1). Mining methodology and user interaction: mining different kinds of knowledge in databases; interactive mining of knowledge at multiple levels of abstraction; incorporation of background knowledge; data mining query languages and ad hoc data mining; expression and visualization of data mining results; handling noise and incomplete data; pattern evaluation (the interestingness problem). Performance and scalability: efficiency and scalability of data mining algorithms; parallel, distributed, and incremental mining methods.
  • 46. Major Issues in Data Mining (2). Issues relating to the diversity of data types: handling relational and complex types of data; mining information from heterogeneous databases and global information systems (WWW). Issues related to applications and social impacts: application of discovered knowledge (domain-specific data mining tools, intelligent query answering, process control and decision making); integration of the discovered knowledge with existing knowledge (a knowledge fusion problem); protection of data security, integrity, and privacy.
  • 47. DATA WAREHOUSE. The main repository of an organization's historical data. It contains the raw material for management’s decision support system. The term Data Warehouse was coined by Bill Inmon in 1990: “A DW is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.”
  • 48. Subject oriented: A DW is organized around major subjects, such as customer, supplier, product, and sales. Rather than focusing on day-to-day operations, a DW concentrates on the modeling and analysis of data for decision makers. It provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. [Figure: subjects Sales, Products, Customers]
  • 49. Integrated: A DW is usually constructed by integrating multiple heterogeneous sources such as relational databases, flat files, etc. Data cleaning and data integration techniques are applied. [Figure: savings account and loans account data integrated under subject = account]
  • 50. Time-variant: The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current-value data. Data warehouse data: provide information from a historical perspective (e.g., the past 5-10 years).
  • 51. Nonvolatile: A physically separate store of data transformed from the operational environment. Operational update of data does not occur in the data warehouse environment. It does not require transaction processing, recovery, or concurrency control mechanisms. It requires only two operations in data accessing: initial loading of data and access of data.
  • 52. Data Warehouse vs. Heterogeneous DBMS. Traditional heterogeneous DB integration: build wrappers/mediators on top of heterogeneous databases. This is a query-driven approach: when a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set. It requires complex information filtering. Data warehouse: update-driven, high performance. Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis.
  • 53. Data Warehouse vs. Operational DBMS. OLTP (on-line transaction processing): the major task of a traditional relational DBMS; day-to-day operations such as purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. OLAP (on-line analytical processing): the major task of a data warehouse system; data analysis and decision making.
  • 54. Distinct features (OLTP vs. OLAP). User and system orientation: customer vs. market. Data contents: current, detailed vs. historical, consolidated. Database design: ER + application vs. star + subject. View: current, local vs. evolutionary, integrated. Access patterns: update vs. read-only but complex queries.
  • 55. (figure)
  • 56. A multi-dimensional data model. From tables and spreadsheets to data cubes: a data warehouse is based on a multidimensional data model, which views data in the form of a data cube. A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions. Dimension tables: such as item (item_name, brand, type) or time (day, week, month, quarter, year). Fact table: contains measures (such as dollars_sold) and keys to each of the related dimension tables. The data cube can be n-dimensional.
  • 57. In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
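
The lattice described on slide 57 can be enumerated by taking every subset of the dimensions as one group-by. This is a minimal sketch: the three dimension names are an assumption, and the count of 2^n cuboids holds only when no dimension has a concept hierarchy attached.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """List every cuboid (subset of dimensions), from the 0-D apex to the n-D base."""
    return [combo
            for size in range(len(dimensions) + 1)
            for combo in combinations(dimensions, size)]

# Hypothetical dimensions for a sales cube.
dims = ("time", "item", "location")
cuboids = cuboid_lattice(dims)

apex = cuboids[0]    # () : total over everything
base = cuboids[-1]   # all dimensions : no summarization
print(len(cuboids), apex, base)
```

For three dimensions this yields 8 cuboids: one apex, three 1-D, three 2-D, and one base cuboid.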
  • 58. Cube: A Lattice of Cuboids (figure)
  • 59. Conceptual Modeling of Data Warehouses. The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Star schema: a fact table in the middle connected to a set of dimension tables. Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake. Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema.
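
A star schema can be tried out with Python's built-in `sqlite3` module. The sketch below assumes a cut-down schema with only two dimension tables and invented rows; the column names (item_key, item_name, location_key, city, dollars_sold) follow the names used elsewhere in these slides, but the data is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT);
-- Fact table: one measure plus a key to each dimension table
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL
);
INSERT INTO item VALUES (1, 'TV'), (2, 'phone');
INSERT INTO location VALUES (1, 'Chicago'), (2, 'Toronto');
INSERT INTO sales VALUES (1, 1, 400.0), (1, 2, 250.0), (2, 1, 100.0), (1, 1, 50.0);
""")

# Star join: the fact table is joined to each dimension, then aggregated.
rows = conn.execute("""
    SELECT l.city, i.item_name, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item AS i ON i.item_key = s.item_key
    JOIN location AS l ON l.location_key = s.location_key
    GROUP BY l.city, i.item_name
    ORDER BY l.city, i.item_name
""").fetchall()
print(rows)
conn.close()
```

In a snowflake variant, `city` would move out of `location` into its own table referenced by a `city_key`, at the cost of one more join per query.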
  • 60. Example of Star Schema (figure)
  • 61. Example of Snowflake Schema (figure)
  • 62. Example of Fact Constellation (figure)
  • 63. A Data Mining Query Language, DMQL: Language Primitives. Cube definition (fact table): define cube <cube_name> [<dimension_list>]: <measure_list>. Dimension definition (dimension table): define dimension <dimension_name> as (<attribute_or_subdimension_list>). Special case (shared dimension tables): define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>.
  • 64. Defining a Star Schema in DMQL. define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*); define dimension time as (time_key, day, day_of_week, month, quarter, year); define dimension item as (item_key, item_name, brand, type, supplier_type); define dimension branch as (branch_key, branch_name, branch_type); define dimension location as (location_key, street, city, province_or_state, country)
  • 65. Defining a Snowflake Schema in DMQL. define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*); define dimension time as (time_key, day, day_of_week, month, quarter, year); define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)); define dimension branch as (branch_key, branch_name, branch_type); define dimension location as (location_key, street, city(city_key, province_or_state, country))
  • 66. Defining a Fact Constellation in DMQL. define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*); define dimension time as (time_key, day, day_of_week, month, quarter, year); define dimension item as (item_key, item_name, brand, type, supplier_type); define dimension branch as (branch_key, branch_name, branch_type); define dimension location as (location_key, street, city, province_or_state, country)
  • 67. define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*); define dimension time as time in cube sales; define dimension item as item in cube sales; define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type); define dimension from_location as location in cube sales; define dimension to_location as location in cube sales
  • 68. Measures of Data Cube. A data cube measure is a numerical function that can be evaluated at each point in the data cube space. A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point.
  • 69. Measures of Data Cube: Three Categories. Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning. E.g., count(), sum(), min(), max(). Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. E.g., avg(), min_N(), standard_deviation(). Holistic: if there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode(), rank().
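
The three categories on slide 69 differ in what must be kept per partition. The sketch below uses three hypothetical partitions of six numbers (an assumption for illustration) to show that sum() combines directly, avg() combines from a bounded pair of distributive values (sum, count), and median() does not combine from per-partition medians.

```python
import statistics

# Hypothetical data, split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of per-partition sums equals the sum over all data.
partial_sums = [sum(part) for part in partitions]
total = sum(partial_sums)

# Algebraic: avg() is computed from two distributive aggregates, sum and count.
partial_counts = [len(part) for part in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: combining per-partition medians does not give the true median.
median_of_medians = statistics.median(statistics.median(p) for p in partitions)
true_median = statistics.median(all_values)

print(total, average, median_of_medians, true_median)
```

This is why distributive and algebraic measures are cheap to precompute across cuboids, while holistic measures generally need access to the underlying values.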
  • 70. Concept Hierarchy. A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Consider the dimension location, with the cities Vancouver, Toronto, New York, and Chicago. Each city can be mapped to the province or state to which it belongs. The province or state can in turn be mapped to a country.
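
The location hierarchy on slide 70 can be written as two lookup tables, one per mapping step. In this sketch the city, province/state, and country names are real geography, while the function name `generalize` and the per-city sales figures are assumptions made for illustration.

```python
# One mapping per step of the hierarchy: city -> province_or_state -> country.
CITY_TO_STATE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
STATE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}

def generalize(city, level):
    """Map a city to the requested level: 'city', 'province_or_state', or 'country'."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "province_or_state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level}")

# Hypothetical sales per city, summarized at the country level.
sales_by_city = {"Vancouver": 30, "Toronto": 20, "New York": 50, "Chicago": 40}
sales_by_country = {}
for city, amount in sales_by_city.items():
    country = generalize(city, "country")
    sales_by_country[country] = sales_by_country.get(country, 0) + amount

print(sales_by_country)
```

Summarizing along the hierarchy in this way is what the roll-up operation does when it climbs from city to country.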
  • 71. A Concept Hierarchy: Dimension (location). [Figure: hierarchy levels, top to bottom: all; region (Europe, ..., North_America); country (Germany, ..., Spain, Canada, ..., Mexico); city (Frankfurt, ..., Vancouver, ..., Toronto); office (L. Chan, ..., M. Wind)]
  • 72. Typical OLAP Operations. Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction. Drill down (roll down): the reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions. Slice and dice: project and select. Pivot (rotate): reorient the cube for visualization, e.g., 3D to a series of 2D planes. Other operations: drill across (involving more than one fact table); drill through (through the bottom level of the cube to its back-end relational tables, using SQL).
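
Roll-up by dimension reduction, slice, and dice can be sketched over a tiny cube stored as a dictionary. The cube contents, the dimension order, and the function names below are assumptions made for illustration; real OLAP servers implement these operations very differently.

```python
# Hypothetical cube: (time, item, city) -> sales.
DIMS = ("time", "item", "city")
cube = {
    ("Q1", "TV", "Chicago"): 10,
    ("Q1", "phone", "Chicago"): 4,
    ("Q2", "TV", "Chicago"): 7,
    ("Q1", "TV", "Toronto"): 5,
    ("Q2", "phone", "Toronto"): 3,
}

def roll_up(cube, drop):
    """Roll up by dimension reduction: sum over the dropped dimension."""
    i = DIMS.index(drop)
    result = {}
    for key, value in cube.items():
        reduced = key[:i] + key[i + 1:]
        result[reduced] = result.get(reduced, 0) + value
    return result

def slice_cube(cube, dim, value):
    """Slice: select on a single dimension."""
    i = DIMS.index(dim)
    return {key: v for key, v in cube.items() if key[i] == value}

def dice(cube, criteria):
    """Dice: select on two or more dimensions; criteria maps dim -> allowed values."""
    return {key: v for key, v in cube.items()
            if all(key[DIMS.index(d)] in allowed for d, allowed in criteria.items())}

by_item_city = roll_up(cube, "time")
q1_only = slice_cube(cube, "time", "Q1")
q1_tv = dice(cube, {"time": {"Q1"}, "item": {"TV"}})
print(by_item_city, q1_only, q1_tv)
```

Drill-down is the reverse of `roll_up` and cannot be computed from the rolled-up result alone; it requires going back to the more detailed cube.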
  • 74. Design of Data Warehouse: A Business Analysis Framework  Four views regarding the design of a data warehouse  Top-down view • allows selection of the relevant information necessary for the data warehouse  Data source view • exposes the information being captured, stored, and managed by operational systems  Data warehouse view • consists of fact tables and dimension tables  Business query view • sees the perspectives of data in the warehouse from the viewpoint of the end user
  • 75. Data Warehouse Design Process  Top-down, bottom-up approaches or a combination of both  Top-down: starts with overall design and planning (mature and well known)  Bottom-up: starts with experiments and prototypes (rapid)  From a software engineering point of view  Waterfall: structured and systematic analysis at each step before proceeding to the next  Spiral: rapid generation of increasingly functional systems, with short intervals between successive releases  Typical data warehouse design process  Choose a business process to model, e.g., orders, invoices, etc.  Choose the grain (atomic level of data) of the business process  Choose the dimensions that will apply to each fact table record  Choose the measures that will populate each fact table record
  • 76. Data Warehouse: A Three-Tier DW Architecture [Figure: Bottom tier: data warehouse server, fed from operational DBs and other sources through extract, transform, load, and refresh, with a monitor and integrator, metadata, and data marts. Middle tier: OLAP server. Top tier: front-end tools for analysis, query, reports, and data mining.]
  • 77. Three Data Warehouse Models  Enterprise warehouse  collects all of the information about subjects spanning the entire organization  Data mart  a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart  Targeted to meet the needs of small groups within the organization o Independent vs. dependent (directly from warehouse) data mart  Dependent data mart: a subset that is created directly from a data warehouse  Independent data mart: a small data warehouse designed for a strategic business unit or a department
  • 78. Three Data Warehouse Models  Virtual warehouse  A set of views over operational databases  Only some of the possible summary views may be materialized
  • 79. Data Warehouse Back-End Tools and Utilities  Data extraction  get data from multiple, heterogeneous, and external sources  Data cleaning  detect errors in the data and rectify them when possible  Data transformation  convert data from legacy or host format to warehouse format  Load  sort, summarize, consolidate, compute views, check integrity, and build indices and partitions  Refresh  propagate the updates from the data sources to the warehouse
  • 80. Data Warehouse Development: A Recommended Approach  The recommended approach is to implement the warehouse in an incremental and evolutionary manner.  First, a high-level corporate data model is defined within a reasonably short period; it provides a corporate-wide, consistent, integrated view of data among different subjects.  Second, independent data marts can be implemented in parallel with the enterprise warehouse, based on the same corporate data model.  Third, distributed data marts can be constructed to integrate the different data marts.
  • 81. Data Warehouse Development: A Recommended Approach [Figure: starting from "Define a high-level corporate data model", model refinement leads to data marts and to an enterprise data warehouse; these lead to distributed data marts and finally to a multi-tier data warehouse.]
  • 82. Metadata Repository  Metadata is the data defining warehouse objects. It stores:  Description of the structure of the data warehouse  schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents  Operational metadata  data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)  The algorithms used for summarization  Data related to system performance  warehouse schema, views, etc.  Business data  business terms and definitions, ownership of data, charging policies
  • 83. OLAP Server  An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and operate on multidimensional data structures.  The available types of OLAP server are  MOLAP server  ROLAP server  HOLAP server
  • 84. OLAP Server Architectures  Relational OLAP (ROLAP)  These are intermediate servers that stand between a relational back-end server and client front-end tools  They use a relational or extended-relational DBMS to store and manage warehouse data  Include optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services  Greater scalability than MOLAP
  • 85. Relational OLAP: 3 Tier DSS  Database Layer (Data Warehouse): store atomic data in an industry-standard RDBMS.  Application Logic Layer (ROLAP Engine): generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.  Presentation Layer (Decision Support Client): obtain multidimensional reports.
  • 86. OLAP Server Architectures  Multidimensional OLAP (MOLAP)  These servers support multidimensional views of data  Array-based multidimensional storage engine  Fast indexing to pre-computed summarized data  Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server 2000)  Combines ROLAP and MOLAP technology  Allows large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store
  • 87. MOLAP: 2 Tier DSS  Database Layer and Application Logic Layer (MDDB Engine): store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, and obtain OLAP functionality via proprietary algorithms running against this data.  Presentation Layer (Decision Support Client): obtain multidimensional reports.
  • 88. DW Implementation: Efficient Data Cube Computation  Data warehouses contain huge volumes of data.  OLAP engines demand that decision support queries be answered in the order of seconds. Therefore, it is crucial for data warehouse systems to support highly efficient cube computation techniques, access methods, and query processing techniques.  A data cube can be viewed as a lattice of cuboids  One approach to cube computation is to use the compute cube operator  The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation.  This can require excessive storage space, especially for a large number of dimensions.
  • 89. DW Implementation: Efficient Data Cube Computation  A data cube can be viewed as a lattice of cuboids  The bottom-most cuboid is the base cuboid  The top-most cuboid (apex) contains only one cell  What is the total number of cuboids, or group-bys, that can be computed for a data cube with 3 attributes: city, item, year?  2^3 = 8: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}  The apex cuboid contains the total sum of all sales  The base cuboid returns the total sales for any combination of the three dimensions  The base cuboid is the least generalized of the cuboids  The apex cuboid is the most generalized of the cuboids
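The count of 8 cuboids can be checked by enumerating every subset of the three dimensions. This is a minimal sketch; the function name `cuboids` is an assumption for this example.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every group-by (cuboid) over the dimensions, base cuboid first, apex last."""
    result = []
    for size in range(len(dimensions), -1, -1):
        result.extend(combinations(dimensions, size))
    return result

lattice = cuboids(["city", "item", "year"])
for cuboid in lattice:
    print(cuboid)
```

The first entry is the base cuboid (city, item, year) and the last is the apex cuboid (), matching the set listed on the slide.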
  • 90.  An SQL query containing no group-by, such as ‘compute the sum of total sales’, is a zero-dimensional operation  An SQL query containing one group-by, such as ‘compute the sum of total sales, group by city’, is a one-dimensional operation  Therefore, the cube operator is the n-dimensional generalization of the group-by operator [Figure: lattice of cuboids, from the apex () through (city), (item), (year) and (city, item), (city, year), (item, year) down to the base (city, item, year)]
  • 91. Cube Operation  Cube definition and computation in DMQL  Definition: define cube sales[item, city, year]: sum(sales_in_dollars)  Computation for a cube with n dimensions: compute cube sales  The cube computation operator was first introduced by Gray et al.  OLAP may need to access different cuboids for different queries, which motivates pre-computation.  Pre-computation leads to fast response time and avoids some redundant computation.  A major challenge related to this pre-computation, however, is that the required storage space may explode if all of the cuboids in a data cube are pre-computed, especially when the cube has several dimensions associated with multiple-level hierarchies.
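As a rough illustration of what a full cube computation involves, the sketch below aggregates sum(sales_in_dollars) for every subset of the dimensions item, city, and year. The rows and the helper name `compute_cube` are assumptions for this example; it recomputes each group-by from the base rows, which practical cube algorithms avoid.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (item, city, year, sales_in_dollars).
rows = [
    ("phone", "Vancouver", 2003, 100),
    ("phone", "Toronto", 2003, 150),
    ("computer", "Vancouver", 2004, 200),
]
DIMENSIONS = ("item", "city", "year")

def compute_cube(rows, dimensions):
    """Compute sum(sales) for all 2^n group-bys; keys are tuples of dimension names."""
    cube = {}
    positions = range(len(dimensions))
    for size in range(len(dimensions) + 1):
        for subset in combinations(positions, size):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[i] for i in subset)] += row[-1]
            cube[tuple(dimensions[i] for i in subset)] = dict(totals)
    return cube

sales_cube = compute_cube(rows, DIMENSIONS)
print(sales_cube[()])          # apex cuboid: total sales
print(sales_cube[("city",)])   # one-dimensional group-by on city
```

Even on three rows the cube holds eight separate group-by results, which hints at why storage grows quickly with the number of dimensions.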
  • 92.  The storage requirements are even more excessive when many dimensions have associated concept hierarchies, each with multiple levels. This problem is referred to as the curse of dimensionality.  If there were no hierarchies associated with each dimension, then the total number of cuboids for an n-dimensional data cube, as we have seen above, is $2^n$. However, in practice, many dimensions do have hierarchies, e.g., day < week < month < quarter < year.  The total number of cuboids is then $T = \prod_{i=1}^{n} (L_i + 1)$  where $L_i$ is the number of levels associated with dimension i  1 is added to $L_i$ to include the virtual top level, all
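The product formula is easy to evaluate directly. In this sketch the function name `total_cuboids` and the example level counts are assumptions; each entry is the number of hierarchy levels of one dimension, not counting the virtual level all.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Total cuboids T = product over dimensions of (L_i + 1)."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions with a single level each: reduces to 2^3.
print(total_cuboids([1, 1, 1]))

# Ten dimensions with four levels each (five including "all").
print(total_cuboids([4] * 10))
```

With a single level per dimension the formula reduces to 2^n, while ten dimensions with four levels each already give 5^10 cuboids.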
  • 93. Partial Materialization: Selected Computation of Cuboids  Materialization of data cube  No materialization: pre-compute only the base cuboid and none of the remaining non-base cuboids  Full materialization: pre-compute all of the cuboids  Partial materialization: selectively compute a proper subset of the whole set of possible cuboids (1) identify the subset of cuboids to materialize • based on size, sharing, access frequency, etc. (2) exploit the materialized cuboids during query processing, and (3) efficiently update the materialized cuboids during load and refresh.
  • 94. Iceberg Cube  Compute only the cuboid cells whose count or other aggregate satisfies a condition such as HAVING COUNT(*) >= minsup  Only the “interesting” cells are calculated: those whose data is above a certain threshold
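A naive sketch of the iceberg condition: count every cell of every cuboid, then keep only the cells with count >= minsup. The rows and the name `iceberg_cube` are assumptions for illustration, and this version still counts all cells before filtering, whereas dedicated iceberg algorithms prune during computation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rows: (item, city).
rows = [
    ("phone", "Vancouver"),
    ("phone", "Vancouver"),
    ("phone", "Toronto"),
    ("computer", "Toronto"),
]
DIMENSIONS = ("item", "city")

def iceberg_cube(rows, dimensions, minsup):
    """Keep only cells whose COUNT(*) is at least minsup, per cuboid."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for subset in combinations(range(len(dimensions)), size):
            counts = Counter(tuple(row[i] for i in subset) for row in rows)
            kept = {cell: n for cell, n in counts.items() if n >= minsup}
            if kept:
                cube[tuple(dimensions[i] for i in subset)] = kept
    return cube

result = iceberg_cube(rows, DIMENSIONS, minsup=2)
print(result)
```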
  • 95. Indexing OLAP Data: Bitmap Index  Bitmap indexing is a popular method in OLAP because it allows quick searching in data cubes  The bitmap index is an alternative representation of the record_id (RID) list  For a given attribute, there is a distinct bit vector Bv for each value v in the attribute's domain  If the domain of a given attribute contains n values, then n bits are needed for each entry in the bitmap index  If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index, and all other bits for that row are set to 0
  • 96. Indexing OLAP Data: Bitmap Index
      Base table
        RID  Item  City
        R1   H     V
        R2   C     V
        R3   P     V
        R4   S     V
        R5   H     T
        R6   C     T
        R7   P     T
        R8   S     T
      Index on Item
        RID  H  C  P  S
        R1   1  0  0  0
        R2   0  1  0  0
        R3   0  0  1  0
        R4   0  0  0  1
        R5   1  0  0  0
        R6   0  1  0  0
        R7   0  0  1  0
        R8   0  0  0  1
      Index on City
        RID  V  T
        R1   1  0
        R2   1  0
        R3   1  0
        R4   1  0
        R5   0  1
        R6   0  1
        R7   0  1
        R8   0  1
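The bitmap indices for the base table on this slide can be built with a few lines of code. This is a sketch using plain Python lists as bit vectors; the helper name `build_bitmap_index` is an assumption, and real systems use packed, often compressed, bit arrays.

```python
# Base table from the slide: (RID, Item, City).
base_table = [
    ("R1", "H", "V"), ("R2", "C", "V"), ("R3", "P", "V"), ("R4", "S", "V"),
    ("R5", "H", "T"), ("R6", "C", "T"), ("R7", "P", "T"), ("R8", "S", "T"),
]

def build_bitmap_index(rows, column):
    """Return one bit vector per distinct value of the column, in row order."""
    values = sorted({row[column] for row in rows})
    return {v: [1 if row[column] == v else 0 for row in rows] for v in values}

item_index = build_bitmap_index(base_table, 1)
city_index = build_bitmap_index(base_table, 2)

# Selection "Item = H AND City = T" becomes a bitwise AND of two bit vectors.
hits = [
    base_table[i][0]
    for i, (a, b) in enumerate(zip(item_index["H"], city_index["T"]))
    if a & b
]
print(hits)
```

Turning comparisons into bit operations is what makes this index attractive for attributes with few distinct values.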
  • 97. Indexing OLAP Data: Join Indices  Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)  In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.  E.g., fact table Sales and two dimensions, city and product • A join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city
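A join index on city can be pictured as a mapping from each distinct city to the R-IDs of the matching fact rows. The fact rows, R-ID values, and the name `build_join_index` below are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical Sales fact rows: (R-ID, city, product, amount).
sales = [
    ("R1", "Vancouver", "phone", 100),
    ("R2", "Toronto", "phone", 150),
    ("R3", "Vancouver", "computer", 200),
    ("R4", "Toronto", "computer", 250),
]

def build_join_index(fact_rows, column):
    """Map each distinct dimension value to the list of fact-row R-IDs."""
    index = defaultdict(list)
    for row in fact_rows:
        index[row[column]].append(row[0])
    return dict(index)

city_join_index = build_join_index(sales, 1)
product_join_index = build_join_index(sales, 2)

# Fact rows for Vancouver can be located without scanning or re-joining.
print(city_join_index["Vancouver"])
```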
  • 98. Efficient Processing of OLAP Queries  The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing in data cubes. Given materialized views, query processing proceeds as follows:  Determine which operations should be performed on the available cuboids  Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection  Determine which materialized cuboid(s) should be selected for the OLAP operation  Example: let the data cube have the dimensions {time, item, location}, with the dimension hierarchies “day < month < quarter < year” for time, “item_name < brand < type” for item, and “street < city < province_or_state < country” for location  Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and let 4 materialized cuboids be available:
  • 99. 1) {year, item_name, city} 2) {year, brand, country} 3) {year, brand, province_or_state} 4) {item_name, province_or_state} where year = 2004  Which should be selected to process the query?  Explore indexing structures and compressed
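One way to reason about the question on this slide is to check, for each cuboid, whether every dimension it stores is at the same level as the query or at a finer one. The sketch below encodes the hierarchies from the example; the data layout and the name `can_answer` are assumptions, and the check covers usability only, not the cost of using each cuboid.

```python
# Hierarchies from the example, listed from finest to most general level.
HIERARCHIES = {
    "time": ["day", "month", "quarter", "year"],
    "item": ["item_name", "brand", "type"],
    "location": ["street", "city", "province_or_state", "country"],
}

def is_fine_enough(dimension, have, need):
    """True if level `have` is the same as, or finer than, level `need`."""
    levels = HIERARCHIES[dimension]
    return levels.index(have) <= levels.index(need)

def can_answer(cuboid, query_levels, query_filter):
    """cuboid: {'levels': {dim: level}, 'filter': {dim: (level, value)}}."""
    for dim, need in query_levels.items():
        have = cuboid["levels"].get(dim)
        if have is None or not is_fine_enough(dim, have, need):
            return False
    for dim, condition in query_filter.items():
        if cuboid["filter"].get(dim) == condition:
            continue  # the cuboid is already restricted to this condition
        have = cuboid["levels"].get(dim)
        if have is None or not is_fine_enough(dim, have, condition[0]):
            return False
    return True

cuboids = {
    1: {"levels": {"time": "year", "item": "item_name", "location": "city"}, "filter": {}},
    2: {"levels": {"time": "year", "item": "brand", "location": "country"}, "filter": {}},
    3: {"levels": {"time": "year", "item": "brand", "location": "province_or_state"}, "filter": {}},
    4: {"levels": {"item": "item_name", "location": "province_or_state"},
        "filter": {"time": ("year", 2004)}},
}
query_levels = {"item": "brand", "location": "province_or_state"}
query_filter = {"time": ("year", 2004)}

usable = [n for n, c in cuboids.items() if can_answer(c, query_levels, query_filter)]
print(usable)
```

Under these assumptions cuboid 2 is ruled out because country is more general than province_or_state. Choosing among the remaining candidates depends on their sizes and available indices: a cuboid at finer levels, such as cuboid 1, needs more roll-up work than cuboid 3, which already matches the query levels.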
  • 100. From DW to DM  DW Usage  Data warehouses and data marts are used in a wide range of applications.  Business executives in almost every industry use the data stored in data warehouses and data marts to perform data analysis and make strategic decisions.  Initially, the data warehouse is mainly used for generating reports and answering predefined queries.  Progressively, it is used to analyze summarized and detailed data, where the results are presented in the form of reports and charts.  Later, the data warehouse is used for strategic purposes, performing multidimensional analysis and sophisticated slice-and-dice operations.  Finally, the data warehouse may be employed for knowledge discovery and strategic decision making using data mining tools.  Tools for data warehousing can be categorized into access and retrieval tools, database reporting tools, data analysis tools, and data mining tools.
  • 101. Data Warehouse Usage  Three kinds of data warehouse applications  Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs  Analytical processing: multidimensional analysis of data warehouse data. It supports basic OLAP operations: slice-dice, drilling, pivoting  Data mining: knowledge discovery from hidden patterns. It supports finding associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
  • 102. From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)  OLAM integrates OLAP with data mining and mining knowledge in multidimensional databases  Why online analytical mining?  High quality of data in data warehouses • DW contains integrated, consistent, cleaned data  Available information processing structure surrounding data warehouses • ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools  OLAP-based exploratory data analysis • mining with drilling, dicing, pivoting, etc.  On-line selection of data mining functions • integration and swapping of multiple mining functions, algorithms, and tasks
  • 103. An OLAM Architecture [Figure: four layers. Layer 1, Data Repository: databases and data warehouse, populated through data cleaning, data integration, and filtering, and accessed through a database API. Layer 2, MDDB: multidimensional database with metadata, accessed through a data cube API after filtering and integration. Layer 3, OLAP/OLAM: OLAM engine and OLAP engine. Layer 4, User Interface: graphical user interface API, which takes the mining query and returns the mining result.]