Data Mining & Data Ware
Housing
(PGCSE302 C)
The New unified Syllabus for both CSE & IT
followed from the session 2013-14
by
Maulana Abul Kalam Azad University of Technology,
West Bengal
(formerly West Bengal University of Technology)
Dr. Bikramjit Sarkar
Assistant Professor
Dept. of Computer Science and Engineering
Dr. B. C. Roy Engineering College
Jemua Road, Fuljhore, Durgapur – 713206 (W. B.)
[www.bcrec.ac.in]
Presented by
Prescribed Curriculum (MAKAUT)
Data Mining & Data Ware Housing (PGCS302C): 36L
UNIT-I: 4 L
Introduction: Basics of Data Mining. Data Mining Functionalities, Classification of Data Mining
Systems, Data Mining Issues, Data Mining Goals. Stages of the Data Mining Process.
UNIT-II: 5 L
Data Warehouse and OLAP: Data Warehouse concepts, Data Warehouse Architecture, OLAP
technology, DBMS, OLTP vs. Data Warehouse Environment, Multidimensional data model,
Data marts.
UNIT-III: 6 L
Data Mining Techniques: Statistics, Similarity Measures, Decision Trees, Neural Networks, Genetic
Algorithms.
UNIT-IV: 9 L
Mining Association Rules: Basic Algorithms, Parallel and Distributed algorithms, Comparative study,
Incremental Rules, Advanced Association Rule Technique, Apriori Algorithm, Partition Algorithm,
Dynamic Item set Counting Algorithm, FP-tree Growth Algorithm, Border Algorithm.
Prescribed Curriculum (MAKAUT) – contd..
Data Mining & Data Ware Housing (PGCS302C): 36L
UNIT-V: 5 L
Clustering Techniques: Partitioning Algorithms - K-means Algorithm, CLARA, CLARANS;
Hierarchical algorithms - DBSCAN, ROCK.
UNIT-VI: 4 L
Classification Techniques: Statistical-based, Distance-based, Decision Tree-based.
UNIT-VII: 3 L
Applications and Trends in Data Mining: Applications, Advanced Techniques - Web Mining, Web
Content Mining, Structure Mining.
- - -
UNIT-I
Data, Information, Knowledge, Understanding,
Wisdom
Data
Data are any facts, numbers, or text that can be processed by a
computer. Today, organizations are accumulating vast and
growing amounts of data in different formats and different
databases. This includes:
(A) Operational or transactional data, such as sales, cost,
inventory, payroll, and accounting…
(B) Non-operational data, such as industry sales, forecast data,
and macro-economic data…
(C) Meta data - data about the data itself, such as logical database
design or data dictionary definitions…
Data, Information, Knowledge, Understanding,
Wisdom– contd..
Information
The patterns, associations, or relationships among all this data
can provide information. For example, analysis of retail point of
sale transaction data can yield information on which products
are selling and when.
Data, Information, Knowledge, Understanding,
Wisdom– contd..
Knowledge
Information can be converted into knowledge about historical
patterns and future trends. For example, summary information
on retail supermarket sales can be analyzed in light of
promotional efforts to provide knowledge of consumer buying
behavior. Thus, a manufacturer or retailer could determine
which items are most susceptible to promotional efforts.
Data, Information, Knowledge, Understanding,
Wisdom– contd..
Understanding
Understanding is an interpolative and probabilistic process. It is
cognitive and analytical. It is the process by which one can take
existing knowledge and synthesize new knowledge from it. The
difference between understanding and
knowledge is the difference between "learning" and
"memorizing". People who have understanding can undertake
useful actions because they can synthesize new knowledge, or
in some cases, at least new information, from what is
previously known (and understood).
Understanding – contd..
That is, understanding can build upon currently held
information, knowledge and understanding itself. In computer
parlance, AI systems possess understanding in the sense that
they are able to synthesize new knowledge from previously
stored information and knowledge.
Data, Information, Knowledge, Understanding,
Wisdom– contd..
Wisdom
Wisdom is an extrapolative and non-deterministic, non-
probabilistic process. It calls upon all the previous levels of
consciousness, and specifically upon special types of human
programming (moral, ethical codes, etc.). It beckons to give us
understanding about which there has previously been no
understanding, and in doing so, goes far beyond understanding
itself. It is the essence of philosophical probing. Unlike the
previous four levels, it asks questions to which there is no
(easily achievable) answer, and in some cases, to which there
can be no humanly known answer, period. Wisdom is therefore
the process by which we also discern, or judge, between right
and wrong, good and bad.
Data, Information, Knowledge, Understanding,
Wisdom– contd..
Wisdom – contd..
Computers do not have, and will never have the ability to
possess wisdom. Wisdom is a uniquely human state, or as I see
it, wisdom requires one to have a soul, for it resides as much in
the heart as in the mind. And a soul is something machines will
never possess (or perhaps I should reword that to say, a soul is
something that, in general, will never possess a machine).
Data, Information, Knowledge, Understanding,
Wisdom– contd..
The following diagram represents the transitions from data, to
information, to knowledge, and finally to wisdom:
It is understanding that supports the transition from each stage to
the next. Understanding is not a separate level of its own.
Concepts of Data Mining
Generally, data mining (sometimes called data or knowledge
discovery) is the process of analyzing data from different
perspectives and summarizing it into useful information -
information that can be used to increase revenue, cut costs, or
both.
Technically, data mining is the process of finding correlations or
patterns among dozens of fields in large relational databases.
However, continuous innovations in computer processing
power, disk storage, and statistical software are dramatically
increasing the accuracy of analysis while driving down the cost.
Concepts of Data Mining – contd..
Data Mining is a technology that uses data analysis tools with
sophisticated algorithms to search for useful information in
large volumes of data.
Data mining is also defined as a process of automatically
discovering useful information from massive data
repositories.
Concepts of Data Mining – contd..
Data mining is the practice of automatically searching large
stores of data to discover patterns and trends that go beyond
simple analysis. Data mining uses sophisticated mathematical
algorithms to segment the data and evaluate the probability of
future events. Data mining is also known as Knowledge
Discovery in Data (KDD).
Data mining can answer questions that cannot be addressed
through simple query and reporting techniques.
The key properties of data mining
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large data sets and databases
The key properties of data mining – contd..
Automatic Discovery
Data mining is accomplished by building models. A model uses
an algorithm to act on a set of data. The notion of automatic
discovery refers to the execution of data mining models.
Data mining models can be used to mine the data on which they
are built, but most types of models are generalizable to new
data. The process of applying a model to new data is known as
scoring.
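As a minimal illustration of building a model and then scoring new data, the following Python sketch assumes the scikit-learn library; the features (age, income) and the records are hypothetical:

```python
# A minimal sketch of "build, then score", assuming scikit-learn is
# installed. The features and records below are hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Historical data the model is built on: [age, income] -> responded (0/1)
X_train = [[25, 30000], [40, 60000], [35, 52000], [50, 90000]]
y_train = [0, 1, 0, 1]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Scoring: applying the built model to new, previously unseen records.
X_new = [[30, 45000], [45, 80000]]
print(model.predict(X_new))        # predicted outcomes
print(model.predict_proba(X_new))  # associated probabilities (confidence)
```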
Prediction
Many forms of data mining are predictive. For example, a
model might predict income based on education and other
demographic factors. Predictions have an associated probability
(How likely is this prediction to be true?). Prediction
probabilities are also known as confidence.
Some forms of predictive data mining generate rules, which are
conditions that imply a given outcome. For example, a rule
might specify that a person who has a bachelor's degree and
lives in a certain neighborhood is likely to have an income
greater than the regional average. Rules have an associated
support.
The key properties of data mining – contd..
Actionable Information
Data mining can derive actionable information from large
volumes of data. For example, a town planner might use a
model that predicts income based on demographics to develop
a plan for low-income housing. A car leasing agency might use
a model that identifies customer segments to design a
promotion targeting high-value customers.
The key properties of data mining – contd..
Grouping
Other forms of data mining identify natural groupings in the
data. For example, a model might identify the segment of the
population that has an income within a specified range, that has
a good driving record, and that leases a new car on a yearly
basis.
Data Mining and Knowledge Discovery
Data mining is an integral part of Knowledge Discovery in
Databases (KDD), which is the overall process of converting
raw data into useful information, as shown in the figure below.
This process consists of a series of transformation steps, from
pre-processing to post-processing of data mining results.
Knowledge Discovery in Databases
The following diagram represents the process of Knowledge
Discovery in databases:
Knowledge Discovery in Databases – contd..
The Knowledge Discovery in Databases process comprises a
few steps leading from raw data collections to some form of
new knowledge. The iterative process consists of the
following steps:
• Data cleaning: also known as data cleansing, it is a phase
in which noisy data and irrelevant data are removed from
the collection.
• Data integration: at this stage, multiple data sources, often
heterogeneous, may be combined in a common source.
• Data selection: At this step, the data relevant to the
analysis is decided on and retrieved from the data
collection.
Knowledge Discovery in Databases – contd..
• Data transformation: also known as data consolidation, it
is a phase in which the selected data is transformed into
forms appropriate for the mining procedure.
• Data mining: it is the crucial step in which clever
techniques are applied to extract patterns potentially useful.
• Pattern evaluation: in this step, strictly interesting patterns
representing knowledge are identified based on given
measures.
• Knowledge representation: is the final phase in which the
discovered knowledge is visually represented to the user.
This essential step uses visualization techniques to help
users understand and interpret the data mining results.
Knowledge Discovery in Databases – contd..
It is common to combine some of these steps together. For
instance, data cleaning and data integration can be performed
together as a pre-processing phase to generate a data
warehouse. Data selection and data transformation can also be
combined where the consolidation of the data is the result of the
selection, or, as for the case of data warehouses, the selection is
done on transformed data.
The KDD is an iterative process. Once the discovered
knowledge is presented to the user, the evaluation measures can
be enhanced, the mining can be further refined, new data can be
selected or further transformed, or new data sources can be
integrated, in order to get different, more appropriate results.
Motivating Challenges
Below are the challenges that motivated the development of
data mining:
• Scalability
• High Dimensionality
• Heterogeneous and complex data
• Data ownership and distribution
• Non-traditional Analysis
Motivating Challenges – contd..
Scalability
Scaling and performance are often considered together in Data
Mining. The problem of scalability in Data Mining is not only
how to process such large sets of data, but how to do it within a
useful timeframe. Many of the issues of scalability in Data
Mining and DBMS are similar to scaling performance issues
for Data Management in general.
Motivating Challenges – contd..
High Dimensionality
The variable in 1-D data is usually time. An example is the log
of interrupts in a processor. 2-D data can often be found in
statistics, like the number of financial transactions in a certain
period of time. 3-D data can be positions in 3-D space or points
on a surface where time (the third dimension) varies. High-
dimensional data contains all those sets of data that have more
than three variables under consideration. Examples are
locations in space that vary with time (here time is the fourth
dimension) or any other combination of more than three
variables, e.g. product - channel - territory - period -
customer’s income.
Motivating Challenges – contd..
Heterogeneous and complex data
Heterogeneous data means that the data set contains attributes
of different types, whereas traditional data analysis methods
assume data sets with attributes of a single type. Complex data
is data with richer structure and information, for example web
pages with hyperlinks, DNA and its 3-D structure, or climate
data (temperature, pressure, mist, humidity, time, location).
Motivating Challenges – contd..
Data ownership and distribution
Sometimes the data needed for an analysis is not stored in one
location or owned by one organization. Instead, the data is
distributed geographically among multiple entities. This
requires the development of distributed data mining
techniques.
Motivating Challenges – contd..
Non-traditional analysis
It is based on hypothesis and test paradigm. Hypothesis is
proposed one, it is an experiment designed to gather data.
Currently huge data is present in data repositories so it
requires thousands of hypotheses.
Data Mining Functionalities
Data mining functionalities are used to specify the kind of
patterns to be found in data mining tasks. In general, data
mining tasks can be classified into two categories:
• Description Methods: Here the objective is to derive
patterns that summarize the underlying relationships in
data. They find human-interpretable patterns that describe
the data.
• Predictive tasks: The objective of these tasks is to predict
the value of a particular attribute based on the values of
other attributes. They use some variables (independent /
explanatory variables) to predict unknown or future values
of other variables (dependent / target variables).
Data Mining Functionalities – contd..
There are four core tasks in Data Mining:
• Predictive modelling
• Association analysis
• Clustering analysis
• Anomaly detection
Data Mining Functionalities – contd..
Predictive modelling
Prediction finds missing or unavailable data values rather than
class labels. Although prediction may refer to both data value
prediction and class label prediction, it is usually confined to
data value prediction and is thus distinct from classification.
Prediction also encompasses the identification of distribution
trends based on the available data.
Data Mining Functionalities – contd..
Association analysis
It is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data.
For example, a data mining system may find association rules
like
major(X, “computing science”) ⇒ owns(X, “personal computer”)
[support = 12%, confidence = 98%]
where X is a variable representing a student. The rule indicates
that of the students under study, 12% (support) major in
computing science and own a personal computer. There is a
98% probability (confidence, or certainty) that a student in this
group owns a personal computer.
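The support and confidence in this example can be computed directly from a transaction database. Below is a minimal Python sketch over hypothetical transactions; the item names are illustrative only:

```python
# A minimal sketch computing support and confidence for a rule X => Y.
transactions = [
    {"computing science", "personal computer"},
    {"computing science", "personal computer", "printer"},
    {"history"},
    {"computing science"},
]
X = {"computing science"}
Y = {"personal computer"}

n = len(transactions)
count_X = sum(1 for t in transactions if X <= t)
count_XY = sum(1 for t in transactions if (X | Y) <= t)

support = count_XY / n           # fraction of all transactions with X and Y
confidence = count_XY / count_X  # fraction of X-transactions also holding Y
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```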
Data Mining Functionalities – contd..
Clustering analysis
Clustering analyzes data objects without consulting a known
class label. The objects are clustered or grouped based on the
principle of maximizing the intra-class similarity and
minimizing the interclass similarity. Each cluster that is formed
can be viewed as a class of objects.
Data Mining Functionalities – contd..
Anomaly detection
It is the task of identifying observations whose characteristics
are significantly different from the rest of the data. Such
observations are called anomalies or outliers. This is useful in
fraud detection and network intrusions.
Classification of Data Mining systems
A data mining system can be classified according to the
following criteria:
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization
• Other Disciplines
Classification of Data Mining systems – contd..
Apart from the previous criteria, a data mining system can
also be classified based on the kind of
• Databases mined
• Knowledge mined
• Techniques utilized
• Applications adapted
Classification Based on the Databases Mined
Database system can be classified according to different
criteria such as data models, types of data, etc. And the data
mining system can be classified accordingly.
For example, if we classify a database according to the data
model, then we may have a relational, transactional, object-
relational, or data warehouse mining system.
Classification of Data Mining systems – contd..
Classification Based on the kind of Knowledge Mined
It means the data mining system is classified on the basis of
functionalities such as
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Outlier Analysis
• Evolution Analysis
Classification of Data Mining systems – contd..
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of
techniques used. We can describe these techniques according to
the degree of user interaction involved or the methods of
analysis employed.
Classification of Data Mining systems – contd..
Classification Based on the Applications Adapted
We can classify a data mining system according to the
applications adapted. These applications are as follows:
• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail
Integration of Data Mining systems
If a data mining system is not integrated with a database or a
data warehouse system, then there will be no system to
communicate with. This scheme is known as the non-coupling
scheme. In this scheme, the main focus is on data mining
design and on developing efficient and effective algorithms for
mining the available data sets. Following are the Integration
Schemes:
• No Coupling
• Loose Coupling
• Semi−tight Coupling
• Tight coupling
Integration of Data Mining systems – contd..
No Coupling
In this scheme, the data mining system does not utilize any of
the database or data warehouse functions. It fetches the data
from a particular source and processes that data using some
data mining algorithms. The data mining result is stored in
another file.
Integration of Data Mining systems – contd..
Loose Coupling
In this scheme, the data mining system may use some of the
functions of database and data warehouse system. It fetches the
data from the data repository managed by these systems and
performs data mining on that data. It then stores the mining
result either in a file or in a designated place in a database or in
a data warehouse.
Integration of Data Mining systems – contd..
Semi-tight Coupling
In this scheme, the data mining system is linked with a
database or a data warehouse system and in addition to that,
efficient implementations of a few data mining primitives can
be provided in the database.
Integration of Data Mining systems – contd..
Tight coupling
In this coupling scheme, the data mining system is smoothly
integrated into the database or data warehouse system. The
data mining subsystem is treated as one functional component
of an information system.
Data Mining issues
Data mining is not an easy task, as the algorithms used can get
very complex and data is not always available at one place. It
needs to be integrated from various heterogeneous data sources.
These factors also create some issues. The major issues are
regarding
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues
Data Mining issues – contd..
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues:
• Mining different kinds of knowledge in databases:
Different users may be interested in different kinds of
knowledge. Therefore it is necessary for data mining to
cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of
abstraction: The data mining process needs to be
interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based
on the returned results.
Data Mining issues – contd..
Mining Methodology and User Interaction Issues – contd..
• Incorporation of background knowledge: To guide
discovery process and to express the discovered patterns,
the background knowledge can be used. Background
knowledge may be used to express the discovered patterns
not only in concise terms but at multiple levels of
abstraction.
• Data mining query languages and ad hoc data mining:
Data Mining Query language that allows the user to
describe ad hoc mining tasks, should be integrated with a
data warehouse query language and optimized for efficient
and flexible data mining.
Data Mining issues – contd..
Mining Methodology and User Interaction Issues – contd..
• Presentation and visualization of data mining results:
Once the patterns are discovered, they need to be expressed
in high-level languages and visual representations. These
representations should be easily understandable.
• Handling noisy or incomplete data: The data cleaning
methods are required to handle the noise and incomplete
objects while mining the data regularities. If the data
cleaning methods are not there then the accuracy of the
discovered patterns will be poor.
• Pattern evaluation: The patterns discovered may fail to be
interesting because they represent common knowledge or
lack novelty, so interestingness measures are needed.
Data Mining issues – contd..
Performance Issues
There can be performance-related issues such as follows:
• Efficiency and scalability of data mining algorithms: In
order to effectively extract information from the huge
amounts of data in databases, data mining algorithms must
be efficient and scalable.
Data Mining issues – contd..
Performance Issues – contd..
• Parallel, distributed and incremental mining
algorithms: The factors such as huge size of databases,
wide distribution of data, and complexity of data mining
methods motivate the development of parallel and
distributed data mining algorithms. These algorithms divide
the data into partitions, which are further processed in a
parallel fashion. The results from the partitions are then
merged. Incremental algorithms update the existing mining
results when the database is updated, without mining the
data again from scratch.
Data Mining issues – contd..
Diverse Data Types Issues
Diverse Data Types Issues may be as follows:
• Handling of relational and complex types of data: The
database may contain complex data objects, multimedia
data objects, spatial data, temporal data, etc. It is not
possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and
global information systems: The data is available at
different data sources on a LAN or WAN. These data sources
may be structured, semi-structured, or unstructured.
Therefore mining the knowledge from them adds
challenges to data mining.
Data Mining goals
Data Mining is an analytic process designed to explore data
(usually large amounts of data - typically business or market
related - also known as "big data") in search of consistent
patterns and/or systematic relationships between variables, and
then to validate the findings by applying the detected patterns
to new subsets of data. The ultimate goal of data mining is
prediction - and predictive data mining is the most common
type of data mining and one that has the most direct business
applications.
Stages of the Data Mining Process
The process of data mining consists of three stages:
• The initial exploration
• Model building or pattern identification with validation /
verification
• Deployment (i.e., the application of the model to new data
in order to generate predictions).
Stages of the Data Mining Process – contd..
Initial Exploration
This stage usually starts with data preparation which may
involve cleaning data, data transformations, selecting subsets
of records and - in case of data sets with large numbers of
variables ("fields") - performing some preliminary feature
selection operations to bring the number of variables to a
manageable range (depending on the statistical methods which
are being considered).
Stages of the Data Mining Process – contd..
Initial Exploration – contd..
Then, depending on the nature of the analytic problem, this
first stage of the process of data mining may involve anywhere
between a simple choice of straightforward predictors for a
regression model, to elaborate exploratory analyses using a
wide variety of graphical and statistical methods in order to
identify the most relevant variables and determine the
complexity and/or the general nature of models that can be
taken into account in the next stage.
Stages of the Data Mining Process – contd..
Model building and validation
This stage involves considering various models and choosing
the best one based on their predictive performance (i.e.,
explaining the variability in question and producing stable
results across samples). This may sound like a simple
operation, but in fact, it sometimes involves a very elaborate
process. There are a variety of techniques developed to achieve
that goal - many of which are based on so-called "competitive
evaluation of models," that is, applying different models to the
same data set and then comparing their performance to choose
the best. These techniques - which are often considered the
core of predictive data mining - include: Bagging (Voting,
Averaging), Boosting, Stacking (Stacked Generalizations), and
Meta-Learning.
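As an illustration of such competitive evaluation, the following sketch, assuming scikit-learn and a synthetic data set, applies a single tree, bagging, and boosting to the same data and compares their cross-validated performance:

```python
# A minimal sketch of "competitive evaluation of models": apply different
# models to the same data set and compare predictive performance.
# Assumes scikit-learn; the data set is synthetic, for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(random_state=0),   # voting / averaging
    "boosting": GradientBoostingClassifier(random_state=0),
}

# Choose the model with the highest, most stable cross-validated accuracy.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```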
Stages of the Data Mining Process – contd..
Deployment
This final stage involves using the model selected as best in the
previous stage and applying it to new data in order to generate
predictions or estimates of the expected outcome.
UNIT-II
Concepts of Data Warehousing
A data warehouse is constructed by integrating data from
multiple heterogeneous sources that support analytical reporting,
structured and / or ad hoc queries, and decision making. Data
warehousing involves data cleaning, data integration, and data
consolidations.
Data warehousing is the process of constructing and using a data
warehouse. Data warehousing is defined as a process of
centralized data management and retrieval.
Data Warehouse Features
The key features of a data warehouse are discussed below:
• Subject Oriented: A data warehouse is subject oriented
because it provides information around a subject rather than
the organization's ongoing operations. These subjects can be
product, customers, suppliers, sales, revenue, etc. A data
warehouse does not focus on the ongoing operations, rather it
focuses on modelling and analysis of data for decision
making.
• Integrated: A data warehouse is constructed by integrating
data from heterogeneous sources such as relational databases,
flat files, etc. This integration enhances the effective analysis
of data.
Data Warehouse Features – contd..
• Time Variant: The data collected in a data warehouse is
identified with a particular time period. The data in a data
warehouse provides information from the historical point of
view.
• Non-volatile: Non-volatile means the previous data is not
erased when new data is added. A data warehouse is kept
separate from the operational database, and therefore frequent
changes in the operational database are not reflected in the
data warehouse.
Note: A data warehouse does not require transaction
processing, recovery, and concurrency controls, because it is
physically stored and separate from the operational database.
Data Warehouse Applications
As discussed before, a data warehouse helps business executives
to organize, analyze, and use their data for decision making. A
data warehouse serves as a sole part of a plan-execute-assess
"closed-loop" feedback system for the enterprise management.
Data warehouses are widely used in the following fields:
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
Types of Data Warehouses
Information processing, analytical processing, and data mining
are the three types of data warehouse applications that are
discussed below:
• Information Processing: A data warehouse allows the data
stored in it to be processed by means of querying, basic
statistical analysis, and reporting using crosstabs, tables,
charts, or graphs.
• Analytical Processing: A data warehouse supports analytical
processing of the information stored in it. The data can be
analyzed by means of basic OLAP operations, including
slice-and-dice, drill down, drill up, and pivoting.
Types of Data Warehouses – contd..
• Data Mining: Data mining supports knowledge discovery by
finding hidden patterns and associations, constructing
analytical models, performing classification and prediction.
These mining results can be presented using visualization
tools.
Data Warehouse Architecture
Data warehouses normally adopt three-tier architecture:
• The bottom tier is a warehouse database server that is
almost always a relational database system. Data from
operational databases and from external sources are extracted
using application program interfaces known as gateways. A
gateway is supported by the underlying DBMS and allows
client programs to generate SQL code to be executed at the
server.
• The middle tier is an OLAP server that is typically
implemented using a relational OLAP (ROLAP) model.
• The top tier is a client, which contains query and reporting
tools, analysis tools, and / or data mining tools.
Data Warehouse models
From the architecture point of view there are three data
warehouse models:
• Enterprise Warehouse: An enterprise warehouse collects all
of the information about subjects spanning the entire
organization. It provides corporate-wide data integration,
usually from one or more operational systems and from
external information providers. It requires extensive business
modelling and may take years to design and build.
Data Warehouse models – contd..
• Data Mart: A data mart consists of a subset of corporate-
wide data that is of value to a specific group of users. Its
scope is confined to specific selected subjects. The data
contained in a data mart tends to be summarized.
• Virtual Warehouse: A virtual warehouse is a set of views
over operational databases. For efficient query processing,
only some of the possible summary views may be
materialized. A virtual warehouse is easy to build but
requires excess capacity on the operational database servers.
OLAP technology
OLAP (online analytical processing) is computer processing that
enables a user to easily and selectively extract and view data
from different points of view.
For example, a user can request that data be analyzed to display
a spreadsheet showing all of a company's beach ball products
sold in Florida in the month of July, compare revenue figures
with those for the same products in September, and then see a
comparison of other product sales in Florida in the same time
period.
OLAP technology – contd..
OLAP data is stored in a multidimensional database. Whereas a
relational database can be thought of as two-dimensional, a
multidimensional database considers each data attribute (such as
product, geographic sales region, and time period) as a separate
"dimension."
OLAP software can locate the intersection of dimensions (all
products sold in the Eastern region above a certain price during
a certain time period) and display them. Attributes such as time
periods can be broken down into sub-attributes.
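As a minimal illustration of attributes acting as dimensions, the following Python sketch, assuming the pandas library and hypothetical sales records, slices on the region dimension and cross-tabulates the rest:

```python
# A minimal sketch of OLAP-style slicing, assuming pandas is installed.
# The sales records are hypothetical, for illustration only.
import pandas as pd

sales = pd.DataFrame({
    "product": ["beach ball", "beach ball", "umbrella", "beach ball"],
    "region":  ["Florida", "Florida", "Florida", "Texas"],
    "month":   ["July", "September", "July", "July"],
    "revenue": [1200, 900, 400, 700],
})

# Each attribute (product, region, month) acts as a separate dimension.
# "Slice" on region = Florida, then view product-by-month revenue.
florida = sales[sales["region"] == "Florida"]
print(florida.pivot_table(values="revenue", index="product",
                          columns="month", aggfunc="sum"))
```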
OLAP technology – contd..
OLAP can be used for data mining or the discovery of
previously undiscerned relationships between data items. An
OLAP database does not need to be as large as a data
warehouse, since not all transactional data is needed for trend
analysis. Using Open Database Connectivity (ODBC), data can
be imported from existing relational databases to create a
multidimensional database for OLAP.
Data Warehouse vs. Operational Databases
A data warehouse is kept separate from operational databases
due to the following reasons:
• An operational database is constructed for well-known tasks
and workloads such as searching particular records, indexing,
etc. In contrast, data warehouse queries are often complex
and they present a general form of data.
• Operational databases support concurrent processing of
multiple transactions. Concurrency control and recovery
mechanisms are required for operational databases to ensure
robustness and consistency of the database.
Data Warehouse vs. Operational Databases – contd..
• An operational database query allows read and modify
operations, while an OLAP query needs only read-only
access to stored data.
• An operational database maintains current data. On the other
hand, a data warehouse maintains historical data.
Data Warehouse vs. Operational Databases – contd..
Data Warehouse (OLAP) vs. Operational Database (OLTP):
• OLAP involves historical processing of information; OLTP
involves day-to-day processing.
• OLAP systems are used by knowledge workers such as
executives, managers, and analysts; OLTP systems are used
by clerks, DBAs, or database professionals.
• OLAP is used to analyze the business; OLTP is used to run
the business.
• OLAP focuses on information out; OLTP focuses on data in.
• OLAP is based on the Star Schema, Snowflake Schema, and
Fact Constellation Schema; OLTP is based on the Entity
Relationship Model.
• OLAP is subject oriented; OLTP is application oriented.
Data Warehouse vs. Operational Databases – contd..
Data Warehouse (OLAP) vs. Operational Database (OLTP):
• OLAP contains historical data; OLTP contains current data.
• OLAP provides summarized and consolidated data and is
highly flexible; OLTP provides primitive, highly detailed
data and high performance.
• OLAP provides a summarized and multidimensional view of
data; OLTP provides a detailed and flat relational view.
• The number of OLAP users is in the hundreds; the number of
OLTP users is in the thousands.
• The number of records accessed by OLAP is in the millions;
by OLTP, in the tens.
• An OLAP database ranges from 100 GB to 100 TB in size;
an OLTP database from 100 MB to 100 GB.
UNIT-III
Data Mining techniques
Following is an overview of some of the most common data
mining techniques in use today. The techniques have been
divided into two broad categories:
• Classical Techniques: Statistics, Neighbourhoods and
Clustering
• Next Generation Techniques: Trees, Networks and Rules
These categories describe a number of data mining
algorithms at a high level and should help in understanding
how each algorithm fits into the landscape of data mining
techniques. Overall, six broad classes of data mining algorithms
are covered.
Data Mining techniques – contd..
Classical Techniques
This category contains descriptions of techniques that have
classically been used for decades and the next category
represents techniques that have only been widely used since the
early 1980s. The main techniques here are the ones that are used
99.9% of the time on existing business problems. There are
certainly many other ones as well as proprietary techniques from
particular vendors - but in general the industry is converging to
those techniques that work consistently and are understandable
and explainable.
Data Mining techniques – contd..
Classical Techniques – contd..
Statistics
By strict definition, statistics or statistical techniques are not
data mining. They were being used long before the term data
mining was coined to apply to business applications. However,
statistical techniques are driven by the data and are used to
discover patterns and build predictive models. This is why it is
important to have an idea of how statistical techniques work
and how they can be applied.
Data Mining techniques – contd..
Classical Techniques – contd..
Statistics – contd..
Prediction using Statistics
The term “prediction” is used for a variety of types of analysis
that may elsewhere be more precisely called regression.
Regression is further explained in order to simplify some of the
concepts and to emphasize the common and most important
aspects of predictive modelling. Nonetheless regression is a
powerful and commonly used tool in statistics.
Data Mining techniques – contd..
Classical Techniques – contd..
Statistics – contd..
Linear Regression
In statistics prediction is usually synonymous with regression of
some form. There are a variety of different types of regression in
statistics but the basic idea is that a model is created that maps
values from predictors in such a way that the lowest error occurs
in making a prediction. The simplest form of regression is
simple linear regression that just contains one predictor and a
prediction.
Data Mining techniques – contd..
Classical Techniques – contd..
Statistics – contd..
Linear Regression – contd..
The relationship between the two can be mapped on a two
dimensional space and the records plotted for the prediction
values along the Y axis and the predictor values along the X
axis. The simple linear regression model then could be viewed
as the line that minimized the error rate between the actual
prediction value and the point on the line (the prediction from
the model).
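A minimal Python sketch of this idea, fitting the least-squares line to hypothetical (predictor, prediction) pairs:

```python
# A minimal sketch of simple linear regression: one predictor (X axis),
# one prediction (Y axis), fit by minimizing squared error.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # actual prediction values

# The least-squares line y = a*x + b minimizes the error between the
# actual values and the points on the line (the model's predictions).
a, b = np.polyfit(x, y, deg=1)
print(f"model: y = {a:.2f}x + {b:.2f}")
print("prediction at x = 6:", a * 6 + b)
```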
Data Mining techniques – contd..
Classical Techniques – contd..
Statistics – contd..
Linear Regression – contd..
Graphically this would look as it does in the figure below:
Data Mining techniques – contd..
Classical Techniques – contd..
Nearest Neighbour
Clustering and the Nearest Neighbour prediction technique are
among the oldest techniques used in data mining. Most people
think of clustering as grouping like records together. Nearest
neighbour is a prediction technique that is quite similar to
clustering. Its essence is that, in order to predict the value of
one record, one looks for records with similar predictor values
in the historical database and uses the prediction value from the
record that is “nearest” to the unclassified record.
Data Mining techniques – contd..
Classical Techniques – contd..
Nearest Neighbour – contd..
The nearest neighbour prediction algorithm works in very much
the same way except that “nearness” in a database may consist
of a variety of factors not just where the person lives. It may, for
instance, be far more important to know which school someone
attended and what degree they attained when predicting income.
Nearest Neighbour techniques are easy to use and understand
because they work in a way similar to the way that people think
- by detecting closely matching examples.
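A minimal Python sketch of nearest-neighbour prediction over a hypothetical historical database, taking Euclidean distance as the measure of "nearness":

```python
# A minimal sketch of nearest-neighbour prediction: to predict a value
# for a new record, reuse the value of the "nearest" historical record.
# The predictors (years of schooling, age) -> income are hypothetical.
import math

history = [
    ((12, 30), 35000),
    ((16, 35), 60000),
    ((20, 45), 90000),
]

def predict(record):
    # Find the historical record nearest to the new one (Euclidean).
    nearest = min(history, key=lambda h: math.dist(h[0], record))
    return nearest[1]

print(predict((17, 36)))  # -> 60000, the income of the closest neighbour
```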
Data Mining techniques – contd..
Classical Techniques – contd..
Clustering
Clustering is basically a partition of the database so that each
partition or group is similar according to some criteria or metric.
Clustering according to similarity is a concept, which appears in
many disciplines. If a measure of similarity is available there are
a number of techniques for forming clusters. Membership of
groups can be based on the level of similarity between members
and from this the rules of membership can be defined. Another
approach is to build set functions that measure some property of
partitions i.e. groups or subsets as functions of some parameter
of the partition. This latter approach achieves what is known as
optimal partitioning.
Data Mining techniques – contd..
Classical Techniques – contd..
Clustering – contd..
Hierarchical Clustering
The hierarchical clustering techniques create a hierarchy of
clusters, from small to big. The main reason is that clustering is
an unsupervised learning technique, and as such, there is no
absolutely correct answer. Now depending upon the particular
application of the clustering, fewer or greater numbers of
clusters may be desired. With a hierarchy of clusters defined it is
possible to choose the number of clusters that are desired. Also
it is possible to have as many clusters as there are records in the
database.
Data Mining techniques – contd..
Classical Techniques – contd..
Clustering – contd..
Hierarchical Clustering – contd..
There are two main types of hierarchical clustering algorithms:
• Agglomerative: Agglomerative clustering techniques start
with as many clusters as there are records where each cluster
contains just one record. The clusters that are nearest to each
other are merged together to form the next largest cluster.
This merging is continued until a hierarchy of clusters is built
with just a single cluster containing all the records at the top
of the hierarchy.
Data Mining techniques – contd..
Classical Techniques – contd..
Clustering – contd..
Hierarchical Clustering – contd..
• Divisive: Divisive clustering techniques take the opposite
approach from agglomerative techniques. These techniques
start with all the records in one cluster and then try to split
that cluster into smaller pieces, and then in turn try to split
those smaller pieces into still smaller ones.
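To make the hierarchical idea concrete, here is a minimal sketch assuming the SciPy library and hypothetical 2-D points; once the hierarchy is built, any desired number of clusters can be chosen from it:

```python
# A minimal sketch of agglomerative clustering: start with one cluster
# per record and repeatedly merge the nearest clusters. Assumes SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 1]])

# linkage() records the full merge hierarchy, from n clusters down to 1.
hierarchy = linkage(points, method="single")

# With the hierarchy defined, choose the number of clusters desired.
print(fcluster(hierarchy, t=3, criterion="maxclust"))  # cut into 3
```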
Data Mining techniques – contd..
Classical Techniques – contd..
Clustering – contd..
Non-Hierarchical Clustering
There are two main non-hierarchical clustering techniques. Both
of them are very fast to compute on the database but have some
drawbacks.
• The first are the single pass methods. They derive their name
from the fact that the database must only be passed through
once in order to create the clusters (i.e. each record is only
read from the database once).
Data Mining techniques – contd..
Classical Techniques – contd..
Clustering – contd..
Non-Hierarchical Clustering – contd..
• The other class of techniques is called reallocation methods.
They get their name from the movement or “reallocation” of
records from one cluster to another in order to create better
clusters. The reallocation techniques do use multiple passes
through the database but are relatively fast in comparison to
the hierarchical techniques.
Data Mining techniques – contd..
Next Generation Techniques
This category of techniques include the following:
• Trees
• Networks
• Rules
Data Mining techniques – contd..
Next Generation Techniques – contd..
Decision Trees
A decision tree is a predictive model that, as its name implies,
can be viewed as a tree. Specifically each branch of the tree is a
classification question and the leaves of the tree are partitions of
the dataset with their classification.
There are some interesting things about the tree:
• It divides up the data on each branch point without losing any
of the data (the number of total records in a given parent node
is equal to the sum of the records contained in its two
children).
Data Mining techniques – contd..
Next Generation Techniques – contd..
Decision Trees – contd..
• The number of churners and non-churners is conserved as
you move up or down the tree.
• It is pretty easy to understand how the model is being built
(in contrast to the models from neural networks or from
standard statistics).
• It would also be pretty easy to use this model if you actually
had to target those customers that are likely to churn with a
targeted marketing offer.
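A minimal sketch of such a churn tree, assuming scikit-learn; the features and labels are hypothetical, and the printed rules illustrate how readable the model is:

```python
# A minimal sketch of a decision tree for churn targeting, assuming
# scikit-learn. The features and labels below are hypothetical.
from sklearn.tree import DecisionTreeClassifier, export_text

# [monthly bill, support calls] -> 1 = churned, 0 = stayed
X = [[30, 0], [80, 5], [45, 1], [95, 7], [50, 0], [85, 6]]
y = [0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Each branch point is a classification question; the leaves partition
# the data with their classification, and the rules are easy to read.
print(export_text(tree, feature_names=["monthly_bill", "support_calls"]))
```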
Data Mining techniques – contd..
Next Generation Techniques – contd..
Neural Networks
Neural networks is an approach to computing that involves
developing mathematical structures with the ability to learn. The
methods are the result of academic investigations to model
nervous system learning. Neural networks have the remarkable
ability to derive meaning from complicated or imprecise data.
This can be used to extract patterns and detect trends that are too
complex to be noticed by either humans or other computer
techniques. A trained neural network can be thought of as an
"expert" in the category of information it has been given to
analyze. This expert can then be used to provide projections
given new situations of interest and answer "what if" questions.
Data Mining techniques – contd..
Next Generation Techniques – contd..
Neural Networks – contd..
The structure of a neural network is shown in the figure below:
Data Mining techniques – contd..
Next Generation Techniques – contd..
Neural Networks – contd..
In the figure, the bottom layer represents the input layer, in this
case with 5 inputs, labelled X1 through X5. In the middle, there is
the hidden layer, with a variable number of nodes. The hidden
layer performs much of the work of the network. The output
layer in this case has two nodes, Z1 and Z2 representing output
values determined from the inputs.
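A minimal Python sketch of a forward pass through such a network, with random weights for illustration (training would adjust the weights to reduce prediction error):

```python
# A minimal sketch of the layered structure described above: 5 inputs,
# one hidden layer, and 2 outputs. Weights are random, for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(5)            # input layer: X1 .. X5

W1 = rng.random((4, 5))      # weights into 4 hidden nodes
W2 = rng.random((2, 4))      # weights into output nodes Z1, Z2

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

hidden = sigmoid(W1 @ x)     # the hidden layer does much of the work
z = sigmoid(W2 @ hidden)     # output values determined from the inputs
print("Z1, Z2 =", z)         # learning would adjust W1 and W2
```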
Data Mining techniques – contd..
Next Generation Techniques – contd..
Rule Induction
Rule induction is one of the major forms of data mining and is
the most common form of knowledge discovery in unsupervised
learning systems. Rule induction on a database can be a
massive undertaking in which all possible patterns are
systematically pulled out of the data, and an accuracy and a
significance are attached to them, telling the user how strong
the pattern is and how likely it is to occur again.
Data Mining techniques – contd..
Next Generation Techniques – contd..
Rule Induction – contd..
In general these rules are relatively simple. For a market
basket database of items scanned in a consumer market basket,
you might find interesting correlations such as:
• If bagels are purchased then cream cheese is purchased 90%
of the time and this pattern occurs in 3% of all shopping
baskets.
• If live plants are purchased from a hardware store then plant
fertilizer is purchased 60% of the time and these two items
are bought together in 6% of the shopping baskets.
UNIT-IV
Mining Association Rules
There are several efficient algorithms that cope with the popular
and computationally expensive tasks of association rule mining.
In brief, an association rule is an expression X ⇒ Y, where X
and Y are sets of items. The meaning of such rules is quite
intuitive: given a database D of transactions, where each
transaction T ∈ D is a set of items, X ⇒ Y expresses that
whenever a transaction T contains X, T probably contains Y
also. The rule confidence is defined as the percentage of
transactions containing Y in addition to X, with regard to the
overall number of transactions containing X.
Mining Association Rules – contd..
Below are the most common algorithms:
• BFS and Counting Occurrences
• BFS and TID-list Intersections
• DFS and Counting Occurrences
• DFS and TID-list Intersections
Mining Association Rules – contd..
Distributed Algorithms
Most parallel or distributed association rule algorithms strive to
parallelize either the data, known as data parallelism, or the
candidates, referred to as task parallelism. With task parallelism,
the candidates are partitioned and counted separately at each
processor. Obviously, the partition algorithm would be easy to
parallelize using the task parallelism approach.
Mining Association Rules – contd..
Distributed Algorithms – contd..
Other dimensions in differentiating the parallel association rule
algorithms are the load-balancing approach used and the
architecture. The data parallelism algorithms have reduced
communication costs over the task, because only the initial
candidates (the set of items) and the local counts must be
distributed at each iteration. With task parallelism, not only the
candidates but also the local set of transactions must be
broadcast to all other sites. However, the data parallelism
algorithms require that memory at each processor be large
enough to store all candidates at each scan (otherwise the
performance will degrade considerably because I/O is required
for both the database and the candidate set).
Mining Association Rules – contd..
Distributed Algorithms – contd..
The task parallelism approaches can avoid this because only the
subset of the candidates that are assigned to a processor during
each scan must fit into memory. Since not all partitions of the
candidates must be the same size, the task parallel algorithms
can adapt to the amount of memory at each site. The only
restriction is that the total size of all candidates be small enough
to fit into the total size of memory in all processors.
Mining Association Rules – contd..
Distributed Algorithms – contd..
The CDA Algorithm
One data parallelism algorithm is the count distribution
algorithm (CDA). The database is divided into p partitions, one
for each processor. Each processor counts the candidates for its
data and then broadcasts its counts to all other processors. Each
processor then determines the global counts. These then are used
to determine the large item sets and to generate the candidates
for the next scan.
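A minimal Python sketch of the count distribution idea, simulating the per-processor partitions sequentially; the transactions, candidates, and threshold are hypothetical:

```python
# A minimal sketch of CDA: each "processor" counts candidates over its
# own partition, local counts are exchanged and summed into global
# counts, which determine the large item sets for this scan.
from collections import Counter

partitions = [                      # database divided into p partitions
    [{"a", "b"}, {"a", "c"}],
    [{"a", "b"}, {"b", "c"}],
]
candidates = [frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})]

def local_counts(partition):
    # Step 1: count the candidates over one processor's partition.
    return Counter({c: sum(1 for t in partition if c <= t)
                    for c in candidates})

# Step 2: "broadcast" and sum the local counts into global counts.
global_counts = sum((local_counts(p) for p in partitions), Counter())

# Step 3: the global counts determine the large item sets.
min_count = 2
print([set(c) for c, n in global_counts.items() if n >= min_count])
```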
Mining Association Rules – contd..
Distributed Algorithms – contd..
The FDM Algorithm
The FDM (Fast Distributed Algorithm for Data Mining)
algorithm, proposed in (Cheung et al. 1996) has the following
distinguishing characteristics:
• Candidate set generation is Apriori-like. However, some
interesting properties of locally and globally frequent item
sets are used to generate a reduced set of candidates at each
iteration, thus resulting in a reduction in the number of
messages interchanged between sites.
Mining Association Rules – contd..
Distributed Algorithms – contd..
The FDM Algorithm – contd..
• After the candidate sets have been generated, two types of
reduction techniques are applied, namely a local reduction
and a global reduction, to eliminate some candidate sets from
each site.
• To be able to determine if a candidate set is frequent, the
algorithm needs only O(n) messages for the exchange of
support counts, where n is the number of sites in the
distributed system. This number is much less than a direct
adaptation of Apriori, which would need O(n²) messages for
calculating the support counts.
Mining Association Rules – contd..
Distributed Algorithms – contd..
Increasing the support factor also increases the performance of
the algorithms. Good performance is also obtained when the
support factor is low and the data set is large, provided the
number of processors is increased.
The increase in the number of processors should be made
relative to the size of the data set. For a relatively small data
set, a large increase in the number of processors can lead to
large sets of local candidates and a large number of messages,
thus increasing the execution time of the CDA and FDM
algorithms.
Mining Association Rules – contd..
Distributed Algorithms – contd..
The CDA algorithm has a simple synchronization scheme, using
only one set of messages for every step, while the FDM
algorithm uses two synchronizations and the same scheme as
CDA.
The distributed mining algorithms can be used on distributed
databases, as well as for mining large databases by partitioning
them between sites and processing them in a distributed manner.
The high flexibility, the scalability, the small cost/performance
ratio and the connectivity of a distributed system make them an
ideal platform for data mining.
Mining Association Rules – contd..
Incremental Rules
With the increasing use of record-based databases to which
data is continuously added, important recent applications
have called for incremental mining. In dynamic
transaction databases, new transactions are appended and
obsolete transactions are discarded as time advances. Several
research works have developed feasible algorithms for deriving
precise association rules efficiently and effectively in such
dynamic databases.
Mining Association Rules – contd..
Incremental Rules – contd..
The mining of association rules on a transactional database is
usually an offline process, since it is costly to find the
association rules in large databases. With usual market-basket
applications, new transactions are generated and old transactions
may be obsolete as time advances. As a result, incremental
updating techniques should be developed for maintenance of the
discovered association rules to avoid redoing mining on the
whole updated database.
Mining Association Rules – contd..
Apriori-Based Algorithms
Algorithm Apriori is an influential algorithm for mining
association rules. It uses prior knowledge of frequent item set
properties to help narrow the search space of required
frequent item sets. Specifically, k-item sets are used to explore
(k+1)-item sets during the level-wise process of frequent item
set generation. The set of frequent 1-itemsets (L1) is first
found by scanning the whole dataset once. L1 is then used,
through join and prune actions, to form the set of candidate 2-
itemsets (C2). After another data scan, the set of frequent 2-item
sets (L2) is identified and extracted from C2. The whole
process continues iteratively until no more candidate item sets
can be formed from the previous Lk.
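A minimal Python sketch of this level-wise process over hypothetical transactions; the helper names are illustrative, not from any standard library:

```python
# A minimal sketch of Apriori: find L1 in one scan, then repeatedly
# join Lk with itself to form candidate (k+1)-item sets, scan to count
# support, and prune. Transactions and threshold are hypothetical.
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support = 2  # minimum count for an item set to be frequent

def frequent(candidates):
    # One scan of the data set: keep candidates meeting min_support.
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_support}

Lk = frequent({frozenset([i]) for t in transactions for i in t})  # L1
k, all_frequent = 1, set(Lk)
while Lk:
    k += 1
    # Join step: combine frequent (k-1)-item sets into k-candidates.
    Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
    Lk = frequent(Ck)            # scan and prune
    all_frequent |= Lk

print(sorted(tuple(sorted(s)) for s in all_frequent))
```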
Mining Association Rules – contd..
Apriori-Based Algorithms for Incremental Mining
The Apriori heuristic is an anti-monotone principle. Specifically,
if any item set is not frequent in the database, its super item set
will never be frequent. Below are the algorithms belonging to
this category that adopt a level wise approach:
• Algorithm FUP (Fast UPdate)
• Algorithms FUP2 and FUP2H
• Algorithm UWEP (Update With Early Pruning)
• Algorithm Utilizing Negative Borders
• Algorithm DELI (Difference Estimation for Large Item sets)
• Algorithms MAAP (Maintaining Association rules with
Apriori Property) and PELICAN
Mining Association Rules – contd..
Partition-Based Algorithms
There are several techniques developed in prior works to
improve the efficiency of algorithm Apriori, e.g., hashing item
set counts, transaction reduction, data sampling, data
partitioning, and so on. Among these techniques, data
partitioning is of particular importance here, since the focus
is on incremental mining, where bulks of transactions may be
appended or discarded as time advances.
Mining Association Rules – contd..
Partition-Based Algorithms for Incremental Mining
In contrast to the Apriori heuristic, the partition-based
technique exploits a partitioning of the whole transactional
dataset. The key observation is that if X is a frequent item set
in a database D that is divided into n partitions
p1, p2, ..., pn, then X must be a frequent item set in at least
one of the n partitions. Consequently, algorithms belonging to
this category work on each partition of the data iteratively and
combine the information obtained from each partition to generate
the final (integrated) results.
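A minimal sketch of the resulting two-phase scheme: mine each partition locally, take the union of the local results as global candidates, then make one counting pass over the whole database. The brute-force local miner below is purely illustrative, standing in for a real local mining algorithm:

```python
# Partition-based idea: any globally frequent itemset is locally frequent
# in at least one partition, so the union of local results is a complete
# candidate set that needs only one final counting pass.
from itertools import combinations

def local_frequent(partition, min_sup):
    """All itemsets frequent within one partition (brute force, sketch only)."""
    items = sorted({i for t in partition for i in t})
    result = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            c = frozenset(c)
            if sum(1 for t in partition if c <= t) / len(partition) >= min_sup:
                result.add(c)
    return result

def partition_mine(database, n_parts, min_sup):
    size = max(1, len(database) // n_parts)
    parts = [database[i:i + size] for i in range(0, len(database), size)]
    # Phase 1: gather local frequent itemsets from each partition
    candidates = set()
    for p in parts:
        candidates |= local_frequent(p, min_sup)
    # Phase 2: one scan of the whole database keeps the truly frequent ones
    n = len(database)
    return {c for c in candidates
            if sum(1 for t in database if c <= t) / n >= min_sup}

db = [{'a', 'b'}, {'a', 'c'}, {'a', 'b', 'c'}, {'b', 'c'}]
print(partition_mine(db, 2, 0.5))
```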
Mining Association Rules – contd..
Partition-Based Algorithms for Incremental Mining – contd..
Below are the algorithms belonging to this category:
• Algorithm SWF (Sliding-Window Filtering)
• Algorithms FI_SWF and CI_SWF
Mining Association Rules – contd..
Pattern Growth Algorithms
The generation of frequent item sets in both the Apriori-based
algorithms and the partition-based algorithms follows the
candidate generate-and-test style. No matter how the search
space for candidate item sets is narrowed, in some cases a huge
number of candidate item sets may still have to be generated. In
addition, at least two database scans are required, and usually
some extra scans are needed to avoid unreasonable computing
overheads. These two problems are nontrivial and result from the
use of the Apriori approach.
Mining Association Rules – contd..
Pattern Growth Algorithms – contd..
To overcome these difficulties, tree structures that store
projected information about large datasets are utilized in some
prior works. The algorithm TreeProjection constructs a
lexicographical tree and projects the whole database onto it
based on the frequent item sets mined so far. The transaction
projection limits support counting to a relatively small space,
and the lexicographical tree facilitates the management of
candidate item sets. These features give algorithm
TreeProjection a great improvement in computing efficiency when
mining association rules.
Mining Association Rules – contd..
Pattern Growth Algorithms for Incremental Mining
Both the Apriori-based algorithms and the partition-based
algorithms aim at reducing the number of scans on the entire
dataset when updates occur. Generally speaking, the updated
portions, i.e., ∆− (deleted transactions) and ∆+ (appended
transactions), could be scanned several times during the
level-wise generation of frequent item sets in works belonging
to these two categories.
Below are the algorithms belonging to this category:
• Algorithms DB-tree and PotFp-tree (Potential Frequent
Pattern)
• Algorithm FELINE (FrEquent/Large patterns mINing with
CATS trEe)
UNIT-V
Clustering Techniques
A cluster is a collection of data objects that are similar to
one another within the same cluster and dissimilar to the
objects in other clusters. Below are the major clustering
approaches:
• Partitioning algorithms: construct various partitions and
then evaluate them by some criterion
• Hierarchical algorithms: create a hierarchical
decomposition of the set of data (or objects) using some
criterion
• Density-based algorithms: based on connectivity and
density functions
• Model-based algorithms: a model is hypothesized for each of
the clusters, and the idea is to find the best fit of the
data to the given model
Clustering Techniques – contd..
Partitioning Algorithms: Basic Concept
The aim is to construct a partition of a database D of n objects
into a set of k clusters. Given k, find the partition into k
clusters that optimizes the chosen partitioning criterion:
• Global optimum: exhaustively enumerate all partitions
• Heuristic methods: the k-means and k-medoids algorithms
• k-means (MacQueen '67): each cluster is represented by the
centre of the cluster
• k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
Rousseeuw '87): each cluster is represented by one of the
objects in the cluster
Clustering Techniques – contd..
Optimization problem
The goal is to optimize a score function. The most commonly
used is the square error criterion:

E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

where C_i denotes the i-th cluster and m_i is its mean (centroid).
Clustering Techniques – contd..
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the
current partition (the centroid is the centre, i.e. mean
point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no new assignments are made.
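A minimal sketch of these four steps, together with the square error criterion E from the previous slide (all names are illustrative):

```python
# A minimal k-means sketch on 2-D points, following the four steps above.
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)            # arbitrary initial seeds
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2: recompute each centroid as the mean point of its cluster
        new = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                        # Step 4: no change, stop
            break
        centroids = new
    return centroids, clusters

def sse(centroids, clusters):
    """Square error criterion E: sum of squared distances to cluster means."""
    return sum(dist2(p, m) for m, cl in zip(centroids, clusters) for p in cl)

pts = [(1, 1), (1.5, 2), (0.5, 1), (8, 8), (9, 9)]
cents, cls = kmeans(pts, 2)
print(cents, sse(cents, cls))
```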
Clustering Techniques – contd..
The K-Means Clustering Method – contd..
[Figure: four scatter plots (axes 0 to 10) illustrating successive iterations of the k-means method.]
Clustering Techniques – contd..
The K-Means Clustering Method – contd..
Strength
• Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
• Often terminates at a local optimum. The global optimum
may be found using techniques such as: deterministic
annealing and genetic algorithms
Weakness
• Applicable only when a mean is defined (what about
categorical data?)
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
Clustering Techniques – contd..
The K-Means Clustering Method – contd..
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
• starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids if
it improves the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not
scale well for large data sets
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): Randomized sampling
• Focusing + spatial data structure (Ester et al., 1995)
Clustering Techniques – contd..
The K-Means Clustering Method – contd..
PAM (Partitioning Around Medoids)
• PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
• Uses real objects to represent the clusters:
1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected
object i, calculate the total swapping cost TCih.
3. For each such pair, if TCih < 0, replace i by h; then
assign each non-selected object to the most similar
representative object.
4. Repeat steps 2-3 until there is no change.
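As a small illustration of step 2, the sketch below computes the total swapping cost TCih as the change in overall distance when medoid i is replaced by non-medoid h; a negative value means the swap improves the clustering. This is a sketch under that assumption, not a full PAM implementation:

```python
# Illustrative computation of the PAM swapping cost TC_ih.
def d(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(d(p, m) for m in medoids) for p in points)

def swapping_cost(points, medoids, i, h):
    """TC_ih = cost after replacing medoid i with h, minus current cost."""
    swapped = [h if m == i else m for m in medoids]
    return total_cost(points, swapped) - total_cost(points, medoids)

pts = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
medoids = [(0, 0), (5, 5)]
print(swapping_cost(pts, medoids, (0, 0), (1, 0)))  # < 0 would favor the swap
```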
Clustering Techniques – contd..
The K-Means Clustering Method – contd..
PAM (Partitioning Around Medoids) – contd..
[Figure: four scatter plots illustrating the four cases of the contribution Cjih of a non-selected object j to the swapping cost TCih, where i is the medoid considered for replacement, h is the candidate replacement, and t is another current medoid:
• Cjih = 0
• Cjih = d(j, h) - d(j, i)
• Cjih = d(j, t) - d(j, i)
• Cjih = d(j, h) - d(j, t)]
Clustering Techniques – contd..
The K-Means Clustering Method – contd..
CLARA (Clustering Large Applications)
• Built into statistical analysis packages, such as S+
• It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
• Strength:
• deals with larger data sets than PAM
• Weakness:
• Efficiency depends on the sample size
• A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the
sample is biased
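A minimal CLARA-style sketch: several random samples are drawn, the best medoids are found on each sample, and the medoid set that scores best on the whole data set is kept. The exhaustive medoid search stands in for PAM (feasible only on small samples), and the sampling parameters n_samples and sample_size are illustrative assumptions:

```python
# CLARA-style clustering: run a medoid search on samples, score on full data.
import random
from itertools import combinations

def d(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def cost(points, medoids):
    return sum(min(d(p, m) for m in medoids) for p in points)

def best_medoids(sample, k):
    """Exhaustive medoid search on a small sample (stand-in for PAM)."""
    return min(combinations(sample, k), key=lambda ms: cost(sample, ms))

def clara(points, k, n_samples=5, sample_size=10):
    best, best_cost = None, float('inf')
    for _ in range(n_samples):
        sample = random.sample(points, min(sample_size, len(points)))
        medoids = best_medoids(sample, k)
        c = cost(points, medoids)          # evaluate on the full data set
        if c < best_cost:
            best, best_cost = medoids, c
    return best

pts = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
print(clara(pts, 2))
```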
Clustering Techniques – contd..
The K-Means Clustering Method – contd..
CLARANS (“Randomized” CLARA)
• CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
• CLARANS draws a sample of neighbours dynamically
• The clustering process can be presented as searching a graph
where every node is a potential solution, that is, a set of k
medoids
• If a local optimum is found, CLARANS starts from a new
randomly selected node in search of a new local optimum
• It is more efficient and scalable than both PAM and CLARA
Clustering Techniques – contd..
Hierarchical Clustering
Hierarchical clustering uses a distance matrix as the clustering
criterion. This method does not require the number of clusters k
as an input, but it needs a termination condition.
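A minimal single-linkage agglomerative sketch, in which the termination condition is a target number of clusters (all names are illustrative):

```python
# Agglomerative clustering: repeatedly merge the two closest clusters
# until the termination condition (a target number of clusters) is met.
def d(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(c1, c2):
    """Single-linkage distance: closest pair across the two clusters."""
    return min(d(p, q) for p in c1 for q in c2)

def agglomerate(points, target_clusters):
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # find the closest pair of clusters and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerate([(0, 0), (0, 1), (5, 5), (6, 5)], 2))
```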
UNIT-VI
Classification Techniques
In machine learning and statistics, classification is the problem
of identifying to which of a set of categories (sub-populations) a
new observation belongs, on the basis of a training set of data
containing observations (or instances) whose category
membership is known.
In the terminology of machine learning, classification is
considered an instance of supervised learning, i.e. learning
where a training set of correctly identified observations is
available. The corresponding unsupervised procedure is known
as clustering, and involves grouping data into categories based
on some measure of inherent similarity or distance.
Classification Techniques – contd..
Statistical-based
Two main phases of work on classification can be identified
within the statistical community. The first, “classical” phase
concentrated on derivatives of Fisher’s early work on linear
discrimination. The second, “modern” phase exploits more
flexible classes of models, many of which attempt to provide an
estimate of the joint distribution of the features within each
class, which can in turn provide a classification rule.
Classification Techniques – contd..
Statistical-based – contd..
Statistical approaches are generally characterised by having an
explicit underlying probability model, which provides a
probability of being in each class rather than simply a
classification. In addition, it is usually assumed that the
techniques will be used by statisticians, and hence some human
intervention is assumed with regard to variable selection and
transformation, and overall structuring of the problem.
Classification Techniques – contd..
Distance-based
A typical distance-based classifier is kNN (k Nearest
Neighbours). kNN calculates the proximity between a test
instance and each of the training instances in order to select
the k nearest neighbours of the test instance. Majority voting
among these k nearest neighbours (training instances) is then
used to assign a class label to the test instance: it will be
the class of the majority of the training instances in the k-nn
set. The most commonly used proximity measures are Euclidean
distance and cosine similarity: with instances described by the
values of n attributes, proximity is computed between two
instances where each instance is treated as a vector in an
n-dimensional space.
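A minimal kNN sketch using Euclidean distance and majority voting, as described above; the data layout of (feature_vector, label) pairs is an illustrative assumption:

```python
# k Nearest Neighbours classification with majority voting.
from collections import Counter

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def knn_classify(train, test_instance, k):
    """train: list of (feature_vector, label) pairs."""
    # proximity to every training instance, then keep the k nearest
    neighbours = sorted(train, key=lambda xy: euclidean(xy[0], test_instance))[:k]
    # majority vote among the k nearest labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1, 1), 'A'), ((1, 2), 'A'), ((8, 8), 'B'), ((9, 8), 'B')]
print(knn_classify(train, (2, 1), 3))   # -> 'A'
```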
Classification Techniques – contd..
Distance-based – contd..
These classifiers are simple and powerful, but some well-known
limitations of kNN are given below:
• If there are many training instances, then kNN requires many
distance calculations as well.
• kNN suffers from model over-fitting: the classifier relies too
heavily on the training data for its predictions and is unable
to generalize its model to new test data. Over-fitting shows up
when the classification errors on the training set and the test
set are compared: the misclassification error on the training
set continues to decrease while the error on test instances
starts to increase again.
Classification Techniques – contd..
Decision Tree- based
A decision tree is a classifier expressed as a recursive partition
of the instance space. The decision tree consists of nodes that
form a rooted tree, meaning it is a directed tree with a node
called “root” that has no incoming edges. All other nodes have
exactly one incoming edge. A node with outgoing edges is
called an internal or test node. All other nodes are called leaves
(also known as terminal or decision nodes). In a decision tree,
each internal node splits the instance space into two or more
sub-spaces according to a certain discrete function of the input
attribute values. In the simplest and most frequent case, each
test considers a single attribute, such that the instance space is
partitioned according to the attribute’s value. In the case of
numeric attributes, the condition refers to a range.
Classification Techniques – contd..
Decision Tree- based – contd..
Each leaf is assigned to one class representing the most
appropriate target value. Alternatively, the leaf may hold a
probability vector indicating the probability of the target
attribute having a certain value. Instances are classified by
navigating them from the root of the tree down to a leaf,
according to the outcome of the tests along the path.
Decision tree inducers are algorithms that automatically
construct a decision tree from a given dataset. Typically the goal
is to find the optimal decision tree by minimizing the
generalization error. However, other target functions can also be
defined, for instance, minimizing the number of nodes or
minimizing the average depth.
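As a small illustration of how an inducer might choose the test at a node, the sketch below picks the attribute that minimizes the weighted entropy of the class label, one common target function; the dict-based data layout and names are illustrative assumptions:

```python
# ID3-style attribute selection: pick the split with lowest weighted entropy.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, attr, label_key):
    """Weighted entropy of the label after partitioning rows on attr."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[label_key])
    n = len(rows)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def best_split(rows, attrs, label_key='label'):
    return min(attrs, key=lambda a: split_entropy(rows, a, label_key))

rows = [
    {'outlook': 'sunny', 'windy': 'no',  'label': 'play'},
    {'outlook': 'rain',  'windy': 'yes', 'label': 'stay'},
    {'outlook': 'sunny', 'windy': 'yes', 'label': 'play'},
    {'outlook': 'rain',  'windy': 'no',  'label': 'stay'},
]
print(best_split(rows, ['outlook', 'windy']))   # -> 'outlook'
```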
UNIT-VII
Applications and Trends in Data Mining
Data mining is an interdisciplinary field with wide and diverse
applications. There exist nontrivial gaps between data mining
principles and domain-specific applications.
Some application domains:
• Financial data analysis
• Retail industry
• Telecommunication industry
• Biological data analysis
Applications and Trends in Data Mining – contd..
Financial Data Analysis
• Financial data collected in banks and financial institutions are
often relatively complete, reliable, and of high quality
• Design and construction of data warehouses for
multidimensional data analysis and data mining
• View the debt and revenue changes by month, by region,
by sector, and by other factors
• Access statistical information such as max, min, total,
average, trend, etc.
• Loan payment prediction/consumer credit policy analysis
• feature selection and attribute relevance ranking
• Loan payment performance
• Consumer credit rating
Applications and Trends in Data Mining – contd..
Financial Data Analysis – contd..
• Classification and clustering of customers for targeted
marketing
• multidimensional segmentation by nearest-neighbor,
classification, decision trees, etc. to identify customer
groups or associate a new customer to an appropriate
customer group
• Detection of money laundering and other financial crimes
• integration of data from multiple DBs (e.g., bank transactions,
federal/state crime history DBs)
• Tools: data visualization, linkage analysis, classification,
clustering tools, outlier analysis, and sequential pattern
analysis tools (find unusual access sequences)
Applications and Trends in Data Mining – contd..
Retail Industry
• Retail industry: huge amounts of data on sales, customer
shopping history, etc.
• Applications of retail data mining
• Identify customer buying behaviors
• Discover customer shopping patterns and trends
• Improve the quality of customer service
• Achieve better customer retention and satisfaction
• Enhance goods consumption ratios
• Design more effective goods transportation and
distribution policies
Applications and Trends in Data Mining – contd..
Telecomm. Industry
• A rapidly expanding and highly competitive industry with a
great demand for data mining
• Understand the business involved
• Identify telecommunication patterns
• Catch fraudulent activities
• Make better use of resources
• Improve the quality of service
• Multidimensional analysis of telecommunication data
• Intrinsically multidimensional: calling-time, duration,
location of caller, location of callee, type of call, etc.
Applications and Trends in Data Mining – contd..
Telecomm. Industry – contd..
• Fraudulent pattern analysis and the identification of unusual
patterns
• Identify potentially fraudulent users and their atypical
usage patterns
• Detect attempts to gain fraudulent entry to customer
accounts
• Discover unusual patterns which may need special
attention
• Multidimensional association and sequential pattern analysis
• Find usage patterns for a set of communication services
by customer group, by month, etc.
• Promote the sales of specific services
• Improve the availability of particular services in a region
Applications and Trends in Data Mining – contd..
Biomedical Data Analysis
• DNA sequences: 4 basic building blocks (nucleotides):
adenine (A), cytosine (C), guanine (G), and thymine (T).
• Gene: a sequence of hundreds of individual nucleotides
arranged in a particular order
• Humans have around 30,000 genes
• Tremendous number of ways that the nucleotides can be
ordered and sequenced to form distinct genes
• Semantic integration of heterogeneous, distributed genome
databases
• Current: highly distributed, uncontrolled generation and
use of a wide variety of DNA data
• Data cleaning and data integration methods developed in
data mining will help
Applications and Trends in Data Mining – contd..
Choosing a Data Mining System
• Commercial data mining systems have little in common
• Different data mining functionality or methodology
• May even work with completely different kinds of data
sets
• Need multiple dimensional view in selection
• Data types: relational, transactional, text, time sequence,
spatial?
• System issues
• running on only one or on several operating systems?
• a client/server architecture?
• Provide Web-based interfaces and allow XML data as
input and/or output?
Applications and Trends in Data Mining – contd..
Choosing a Data Mining System – contd..
• Data sources
• ASCII text files, multiple relational data sources
• support ODBC connections (OLE DB, JDBC)?
• Data mining functions and methodologies
• One vs. multiple data mining functions
• One vs. variety of methods per function
• More data mining functions and methods per function
provide the user with greater flexibility and analysis
power
• Coupling with DB and/or data warehouse systems
• Four forms of coupling: no coupling, loose coupling,
semitight coupling, and tight coupling
Applications and Trends in Data Mining – contd..
Choosing a Data Mining System – contd..
• Scalability
• Row (or database size) scalability
• Column (or dimension) scalability
• Curse of dimensionality: it is much more challenging to
make a system column scalable than row scalable
• Visualization tools
• “A picture is worth a thousand words”
• Visualization categories: data visualization, mining result
visualization, mining process visualization, and visual
data mining
• Data mining query language and graphical user interface
• Easy-to-use and high-quality graphical user interface
• Essential for user-guided, highly interactive data mining
Advanced Techniques of Data Mining
Web Mining
Web mining is the use of data mining techniques to
automatically discover and extract information from Web
documents and services.
There are three general classes of information that can be
discovered by web mining:
• Web activity, from server logs and Web browser activity
tracking.
• Web graph, from links between pages, people and other data.
• Web content, for the data found on Web pages and inside of
documents.
Advanced Techniques of Data Mining – contd..
Web Mining – contd..
Note that there’s no explicit reference to “search” in the above
description. While search is the biggest web miner by far, and
generates the most revenue, there are many other valuable end
uses for web mining results. A partial list includes:
• Business intelligence
• Competitive intelligence
• Pricing analysis
• Events
• Product data
• Popularity
• Reputation
Advanced Techniques of Data Mining – contd..
Web Mining – contd..
When extracting Web content information using web mining,
there are four typical steps:
• Collect: fetch the content from the Web
• Parse: extract usable data from formatted data (HTML, PDF,
etc)
• Analyze: tokenize, rate, classify, cluster, filter, sort, etc.
• Produce: turn the results of analysis into something useful
(report, search index, etc)
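A toy sketch of the collect/parse/analyze/produce pipeline using only the Python standard library; the regex-based parser and word-frequency analysis are crude illustrative stand-ins for real crawling and text-mining tooling:

```python
# Collect -> Parse -> Analyze -> Produce, end to end on a single page.
import re
import urllib.request
from collections import Counter

def collect(url):
    """Collect: fetch raw content from the Web."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode('utf-8', errors='replace')

def parse(html):
    """Parse: strip markup to recover usable text (crude, sketch only)."""
    text = re.sub(r'<(script|style)[^>]*>.*?</\1>', ' ', html, flags=re.S | re.I)
    return re.sub(r'<[^>]+>', ' ', text)

def analyze(text):
    """Analyze: tokenize and count terms."""
    return Counter(re.findall(r'[a-z]{3,}', text.lower()))

def produce(counts, top=10):
    """Produce: turn the analysis into a small report."""
    return '\n'.join(f'{w}: {c}' for w, c in counts.most_common(top))

# print(produce(analyze(parse(collect('https://example.com')))))
```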
Advanced Techniques of Data Mining – contd..
Web Mining versus Data Mining
When comparing web mining with traditional data mining, there
are three main differences to consider:
• Scale: In traditional data mining, processing 1 million
records from a database would be a large job. In web mining,
even 10 million pages wouldn't be a big number.
Advanced Techniques of Data Mining – contd..
Web Mining versus Data Mining – contd..
• Access: When doing data mining of corporate information,
the data is private and often requires access rights to read. For
web mining, the data is public and rarely requires access
rights. But web mining has additional constraints, due to the
implicit agreement with webmasters regarding automated
(non-user) access to this data. This implicit agreement is that
a webmaster allows crawlers access to useful data on the
website, and in return the crawler (a) promises not to
overload the site, and (b) has the potential to drive more
traffic to the website once the search index is published. With
web mining, there often is no such index, which means the
crawler has to be extra careful/polite during the crawling
process, to avoid causing any problems for the webmaster.
Advanced Techniques of Data Mining – contd..
Web Mining versus Data Mining – contd..
• Structure: A traditional data mining task gets information
from a database, which provides some level of explicit
structure. A typical web mining task is processing
unstructured or semi-structured data from web pages. Even
when the underlying information for web pages comes from a
database, this often is obscured by HTML markup.
Note that by “traditional” data mining we mean the type of
analysis supported by most vendor tools, which assumes you’re
processing table-oriented data that typically comes from a
database.
Text Books
• Roiger & Geatz, Data Mining, Pearson Education
• A. K. Pujari, Data Mining, University Press
• M. H. Dunham, Data Mining: Introductory and Advanced
Topics, Pearson Education
• J. Han and M. Kamber, Data Mining: Concepts and
Techniques, Morgan Kaufmann
Reference Books
• I. H. Witten and E. Frank, Data Mining: Practical Machine
Learning Tools and Techniques, Morgan Kaufmann
• D. Hand, H. Mannila and P. Smyth, Principles of Data
Mining, Prentice-Hall
Data mining:
“Drowning in Data yet Starving for
Knowledge”
More Related Content

What's hot

Classification using back propagation algorithm
Classification using back propagation algorithmClassification using back propagation algorithm
Classification using back propagation algorithmKIRAN R
 
Substitution techniques
Substitution techniquesSubstitution techniques
Substitution techniquesvinitha96
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptxmaha797959
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATAGauravBiswas9
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 
Association rule mining
Association rule miningAssociation rule mining
Association rule miningAcad
 
Predicting Diabetes Using Machine Learning
Predicting Diabetes Using Machine LearningPredicting Diabetes Using Machine Learning
Predicting Diabetes Using Machine LearningJohn Alex
 
Machine Learning for Disease Prediction
Machine Learning for Disease PredictionMachine Learning for Disease Prediction
Machine Learning for Disease PredictionMustafa Oğuz
 
data mining
data miningdata mining
data mininguoitc
 

What's hot (20)

Text MIning
Text MIningText MIning
Text MIning
 
Classification using back propagation algorithm
Classification using back propagation algorithmClassification using back propagation algorithm
Classification using back propagation algorithm
 
Substitution techniques
Substitution techniquesSubstitution techniques
Substitution techniques
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
Data Mining
Data MiningData Mining
Data Mining
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Data clustring
Data clustring Data clustring
Data clustring
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
 
Data warehousing ppt
Data warehousing pptData warehousing ppt
Data warehousing ppt
 
Data Mining
Data MiningData Mining
Data Mining
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
Predicting Diabetes Using Machine Learning
Predicting Diabetes Using Machine LearningPredicting Diabetes Using Machine Learning
Predicting Diabetes Using Machine Learning
 
Machine Learning for Disease Prediction
Machine Learning for Disease PredictionMachine Learning for Disease Prediction
Machine Learning for Disease Prediction
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
data mining
data miningdata mining
data mining
 
data mining
data miningdata mining
data mining
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 

Viewers also liked

Microblogging
MicrobloggingMicroblogging
Microblogginguday p
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architectureuncleRhyme
 
Main MeMory Data Base
Main MeMory Data BaseMain MeMory Data Base
Main MeMory Data BaseSiva Rushi
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecturepcherukumalla
 

Viewers also liked (8)

Neural networks
Neural networksNeural networks
Neural networks
 
Microblogging
MicrobloggingMicroblogging
Microblogging
 
Data ware housing- Introduction to data ware housing
Data ware housing- Introduction to data ware housingData ware housing- Introduction to data ware housing
Data ware housing- Introduction to data ware housing
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 
Data-ware Housing
Data-ware HousingData-ware Housing
Data-ware Housing
 
Main MeMory Data Base
Main MeMory Data BaseMain MeMory Data Base
Main MeMory Data Base
 
Decision trees
Decision treesDecision trees
Decision trees
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 

Similar to Data Mining and Data Warehousing (MAKAUT)

TTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueTTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueMehmet Beyaz
 
Week-1-Introduction to Data Mining.pptx
Week-1-Introduction to Data Mining.pptxWeek-1-Introduction to Data Mining.pptx
Week-1-Introduction to Data Mining.pptxTake1As
 
DM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfDM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfssuserb933d8
 
Data mining techniques
Data mining techniquesData mining techniques
Data mining techniquesHatem Magdy
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining Sushil Kulkarni
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerIJERA Editor
 
notes_dmdw_chap1.docx
notes_dmdw_chap1.docxnotes_dmdw_chap1.docx
notes_dmdw_chap1.docxAbshar Fatima
 
Business Intelligence Plan Essay
Business Intelligence Plan EssayBusiness Intelligence Plan Essay
Business Intelligence Plan EssayJennifer Letterman
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesFellowBuddy.com
 
knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)Kartik Kalpande Patil
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applicationsSubrat Swain
 
Data Mining Techniques
Data Mining TechniquesData Mining Techniques
Data Mining TechniquesSanzid Kawsar
 

Similar to Data Mining and Data Warehousing (MAKAUT) (20)

TTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueTTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining Technique
 
Week-1-Introduction to Data Mining.pptx
Week-1-Introduction to Data Mining.pptxWeek-1-Introduction to Data Mining.pptx
Week-1-Introduction to Data Mining.pptx
 
DM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfDM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdf
 
Data mining techniques
Data mining techniquesData mining techniques
Data mining techniques
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
 
Data mining
Data miningData mining
Data mining
 
notes_dmdw_chap1.docx
notes_dmdw_chap1.docxnotes_dmdw_chap1.docx
notes_dmdw_chap1.docx
 
2 Data-mining process
2   Data-mining process2   Data-mining process
2 Data-mining process
 
Data mining
Data miningData mining
Data mining
 
Data mining
Data miningData mining
Data mining
 
Business Intelligence Plan Essay
Business Intelligence Plan EssayBusiness Intelligence Plan Essay
Business Intelligence Plan Essay
 
Data mining
Data miningData mining
Data mining
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
 
knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)
 
Seminar Presentation
Seminar PresentationSeminar Presentation
Seminar Presentation
 
Data mining
Data miningData mining
Data mining
 
Unit 1.pptx
Unit 1.pptxUnit 1.pptx
Unit 1.pptx
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
 
Data Mining Techniques
Data Mining TechniquesData Mining Techniques
Data Mining Techniques
 

Data Mining and Data Warehousing (MAKAUT)

  • 1. Data Mining & Data Ware Housing (PGCSE302 C) The New unified Syllabus for both CSE & IT followed from the session 2013-14 by Maulana Abul Kalam Azad University of Technology, West Bengal (formerlyWest Bengal University ofTechnology)
  • 2. Dr. Bikramjit Sarkar Assistant Professor Dept. of Computer Science and Engineering Dr. B. C. Roy Engineering College Jemua Road, Fuljhore, Durgapur – 713206 (W. B.) [www.bcrec.ac.in] Presented by
  • 3. Prescribed Curriculum (MAKAUT) Data Mining & Data Ware Housing (PGCS302C): 36L UNIT-I: 4 L Introduction: Basics of Data Mining. Data Mining Functionalities, Classification of Data Mining Systems, Data Mining Issues, Data Mining Goals. Stages of the Data Mining Process. UNIT-II: 5 L Data Warehouse and OLAP: Data Warehouse concepts, Data Warehouse Architecture, OLAP technology, DBMS, OLTP VS. Data Warehouse Environment, Multidimensional data model Data marts. UNIT-III: 6 L Data Mining Techniques: Statistics, Similarity Measures, Decision Trees, Neural Networks, Genetic Algorithms. UNIT-IV: 9 L Mining Association Rules: Basic Algorithms, Parallel and Distributed algorithms, Comparative study, Incremental Rules, Advanced Association Rule Technique, Apriori Algorithm, Partition Algorithm, Dynamic Item set Counting Algorithm, FP tree growth Algorithm, Boarder Algorithm.
  • 4. Prescribed Curriculum (MAKAUT) – contd.. Data Mining & Data Ware Housing (PGCS302C): 36L UNIT-V: 5 L Clustering Techniques: Partitioning Algorithms-K- means Algorithm, CLARA, CLARANS, Hierarchical algorithms- DBSCAN, ROCK. UNIT-VI: 4 L Classification Techniques: Statistical–based, Distance-based, Decision Tree- based Decision tree. UNIT-VII: 3 L Applications and Trends in Data Mining: Applications, Advanced Techniques - Web Mining, Web Content Mining, Structure Mining. - - -
  • 6. Data, Information, Knowledge, Understanding, Wisdom Data Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes: (A) Operational or transactional data such as, sales, cost, inventory, payroll, and accounting… (B) Non-operational data, such as industry sales, forecast data, and macro-economic data… (C) Meta data - data about the data itself, such as logical database design or data dictionary definitions…
  • 7. Data, Information, Knowledge, Understanding, Wisdom– contd.. Information The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.
  • 8. Data, Information, Knowledge, Understanding, Wisdom– contd.. Knowledge Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
  • 9. Data, Information, Knowledge, Understanding, Wisdom– contd.. Understanding Understanding is an interpolative and probabilistic process. It is cognitive and analytical. It is the process by which I can take knowledge and synthesize new knowledge from the previously held knowledge. The difference between understanding and knowledge is the difference between "learning" and "memorizing". People who have understanding can undertake useful actions because they can synthesize new knowledge, or in some cases, at least new information, from what is previously known (and understood).
  • 10. Understanding – contd.. That is, understanding can build upon currently held information, knowledge and understanding itself. In computer parlance, AI systems possess understanding in the sense that they are able to synthesize new knowledge from previously stored information and knowledge. Data, Information, Knowledge, Understanding, Wisdom– contd..
  • 11. Data, Information, Knowledge, Understanding, Wisdom– contd.. Wisdom Wisdom is an extrapolative and non-deterministic, non- probabilistic process. It calls upon all the previous levels of consciousness, and specifically upon special types of human programming (moral, ethical codes, etc.). It beckons to give us understanding about which there has previously been no understanding, and in doing so, goes far beyond understanding itself. It is the essence of philosophical probing. Unlike the previous four levels, it asks questions to which there is no (easily-achievable) answer, and in some cases, to which there can be no humanly-known answer period. Wisdom is therefore, the process by which we also discern, or judge, between right and wrong, good and bad.
  • 12. Data, Information, Knowledge, Understanding, Wisdom– contd.. Wisdom – contd.. Computers do not have, and will never have the ability to possess wisdom. Wisdom is a uniquely human state, or as I see it, wisdom requires one to have a soul, for it resides as much in the heart as in the mind. And a soul is something machines will never possess (or perhaps I should reword that to say, a soul is something that, in general, will never possess a machine).
  • 13. Data, Information, Knowledge, Understanding, Wisdom– contd.. The following diagram represents the transitions from data, to information, to knowledge, and finally to wisdom: It is understanding that support the transition from each stage to the next. Understanding is not a separate level of its own.
  • 14. Concepts of Data Mining Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.
  • 15. Concepts of Data Mining – contd.. Data Mining is a technology that uses data analysis tools with sophisticated algorithms to search useful information from large volumes of data. Data mining is also defined as a process of automatically discovering useful information from massive amount of data repositories.
  • 16. Concepts of Data Mining – contd.. Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD). Data mining can answer questions that cannot be addressed through simple query and reporting techniques.
  • 17. The key properties of data mining • Automatic discovery of patterns • Prediction of likely outcomes • Creation of actionable information • Focus on large data sets and databases
  • 18. The key properties of data mining – contd.. Automatic Discovery Data mining is accomplished by building models. A model uses an algorithm to act on a set of data. The notion of automatic discovery refers to the execution of data mining models. Data mining models can be used to mine the data on which they are built, but most types of models are generalizable to new data. The process of applying a model to new data is known as scoring.
  • 19. Prediction Many forms of data mining are predictive. For example, a model might predict income based on education and other demographic factors. Predictions have an associated probability (How likely is this prediction to be true?). Prediction probabilities are also known as confidence. Some forms of predictive data mining generate rules, which are conditions that imply a given outcome. For example, a rule might specify that a person who has a bachelor's degree and lives in a certain neighborhood is likely to have an income greater than the regional average. Rules have an associated support. The key properties of data mining – contd..
  • 20. Actionable Information Data mining can derive actionable information from large volumes of data. For example, a town planner might use a model that predicts income based on demographics to develop a plan for low-income housing. A car leasing agency might a use model that identifies customer segments to design a promotion targeting high-value customers. The key properties of data mining – contd..
  • 21. The key properties of data mining – contd.. Grouping Other forms of data mining identify natural groupings in the data. For example, a model might identify the segment of the population that has an income within a specified range, that has a good driving record, and that leases a new car on a yearly basis.
  • 22. Data Mining and Knowledge Discovery Data mining is an integral part of Knowledge Discovery in databases (KDD), which is an overall process of converting raw data into useful information, as shown in figure below. This process consists of a series of transformation steps, from pre-processing to post-processing of data mining results.
  • 23. Knowledge Discovery in Databases The following diagram represents the process of Knowledge Discovery in databases:
  • 24. Knowledge Discovery in Databases – contd.. The Knowledge Discovery in Databases process comprises of a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps: • Data cleaning: also known as data cleansing, it is a phase in which noise data and irrelevant data are removed from the collection. • Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in a common source. • Data selection: At this step, the data relevant to the analysis is decided on and retrieved from the data collection.
  • 25. Knowledge Discovery in Databases – contd.. • Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure. • Data mining: it is the crucial step in which clever techniques are applied to extract patterns potentially useful. • Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures. • Knowledge representation: is the final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
  • 26. Steps in Knowledge Discovery in Databases Below are the steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps: • Data cleaning: also known as data cleansing, it is a phase in which noise data and irrelevant data are removed from the collection. • Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in a common source. • Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.
  • 27. Steps in Knowledge Discovery in Databases – contd.. • Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure. • Data mining: it is the crucial step in which clever techniques are applied to extract patterns potentially useful. • Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures. • Knowledge representation: is the final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
  • 28. Steps in Knowledge Discovery in Databases – contd.. It is common to combine some of these steps together. For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation can also be combined where the consolidation of the data is the result of the selection, or, as for the case of data warehouses, the selection is done on transformed data. The KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.
  • 29. Motivating Challenges Below are the motivation challenges that motivated Data mining: • Scalability • High Dimensionality • Heterogeneous and complex data • Data ownership and distribution • Non-traditional Analysis
  • 30. Motivating Challenges – contd.. Scalability Scaling and performance are often considered together in Data Mining. The problem of scalability in Data Mining is not only how to process such large sets of data, but how to do it within a useful timeframe. Many of the issues of scalability in Data Mining and DBMS are similar to scaling performance issues for Data Management in general.
  • 31. Motivating Challenges – contd.. High Dimensionality The variable in 1-D data is usually time. An example is the log of interrupts in a processor. 2D data can often be found in statistics like the number of financial transactions in a certain period of time. 3-D data can be positions in 3-D space or points on a surface whereas time (the 3rd dimension) varies. High- dimensional data contains all those sets of data that have more than three considered variables. Examples are locations in space that vary with time (here time is the fourth dimension) or any other combination of more than three variables, e.g. product - channel - territory - period - customer’s income.
  • 32. Motivating Challenges – contd.. Heterogeneous and complex data Heterogeneous data means data set contains attributes of different types. Traditional data analysis methods contain data sets with same types of attributes. Complex data is a data with different attribute and information. For example webpage with hyperlinks, DNA and 3D structure, climate data (temperature, pressure, mist, humidity, time, location).
  • 33. Motivating Challenges – contd.. Data ownership and distribution Sometimes the data needed for an analysis is not stored in one location or owned by one organization. Instead the data is distributed in geographically among multiple entities. This requires the development of distributed data mining techniques.
  • 34. Motivating Challenges – contd.. Non-traditional analysis It is based on hypothesis and test paradigm. Hypothesis is proposed one, it is an experiment designed to gather data. Currently huge data is present in data repositories so it requires thousands of hypotheses.
  • 35. Data Mining Functionalities Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data mining tasks can be classified into two categories: • Description Methods: Here the objective is to derive patterns that summarize the underlying relationships in data. They find human-interpretable patterns that describe the data. • Predictive tasks: The objective of these tasks is to predict the value of a particular attribute based on the values of other attribute. They use some variables (independent / explanatory variable) to predict unknown or future values of other variables (dependent / target variable).
  • 36. Data Mining Functionalities – contd.. There are four core tasks in Data Mining: • Predictive modelling • Association analysis • Clustering analysis • Anomaly detection
  • 37. Data Mining Functionalities – contd.. Predictive modelling Find some missing or unavailable data values rather than class labels referred to as prediction. Although prediction may refer to both data value prediction and class label prediction, it is usually confined to data value prediction and thus is distinct from classification. Prediction also encompasses the identification of distribution trends based on the available data.
  • 38. Data Mining Functionalities – contd.. Association analysis It is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. For example, a data mining system may find association rules like major(X, “computing science”) ? owns(X, “personal computer”) [support = 12%, confidence = 98%] where X is a variable representing a student. The rule indicates that of the students under study, 12% (support) major in computing science and own a personal computer. There is a 98% probability (confidence, or certainty) that a student in this group owns a personal computer.
  • 39. Data Mining Functionalities – contd.. Clustering analysis Clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of objects.
  • 40. Data Mining Functionalities – contd.. Anomaly detection It is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are called anomalies or outliers. This is useful in fraud detection and network intrusions.
  • 41. Classification of Data Mining systems A data mining system can be classified according to the following criteria: • Database Technology • Statistics • Machine Learning • Information Science • Visualization • Other Disciplines
  • 42. Classification of Data Mining systems – contd.. Apart from the previous criteria, a data mining system can also be classified based on the kind of • Databases mined • Knowledge mined • Techniques utilized • Applications adapted
  • 43. Classification Based on the Databases Mined Database system can be classified according to different criteria such as data models, types of data, etc. And the data mining system can be classified accordingly. For example, if we classify a database according to the data model, then we may have a relational, transactional, object- relational, or data warehouse mining system. Classification of Data Mining systems – contd..
  • 44. Classification of Data Mining systems – contd.. Classification Based on the kind of Knowledge Mined It means the data mining system is classified on the basis of functionalities such as • Characterization • Discrimination • Association and Correlation Analysis • Classification • Prediction • Prediction • Outlier Analysis • Evolution Analysis
  • 45. Classification of Data Mining systems – contd.. Classification Based on the Techniques Utilized We can classify a data mining system according to the kind of techniques used. We can describe these techniques according to the degree of user interaction involved or the methods of analysis employed.
  • 46. Classification of Data Mining systems – contd.. Classification Based on the Applications Adapted We can classify a data mining system according to the applications adapted. These applications are as follows: • Finance • Telecommunications • DNA • Stock Markets • E-mail
  • 47. Integration of Data Mining systems If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the available data sets. Following are the Integration Schemes: • No Coupling • Loose Coupling • Semi−tight Coupling • Tight coupling
  • 48. Integration of Data Mining systems – contd.. No Coupling In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular source and processes that data using some data mining algorithms. The data mining result is stored in another file.
  • 49. Integration of Data Mining systems – contd.. Loose Coupling In this scheme, the data mining system may use some of the functions of database and data warehouse system. It fetches the data from the data respiratory managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or in a data warehouse.
  • 50. Integration of Data Mining systems – contd.. Semi-tight Coupling In this scheme, the data mining system is linked with a database or a data warehouse system and in addition to that, efficient implementations of a few data mining primitives can be provided in the database.
  • 51. Integration of Data Mining systems – contd.. Tight coupling In this coupling scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.
  • 52. Data Mining issues Data mining is not an easy task, as the algorithms used can get very complex and data is not always available at one place. It needs to be integrated from various heterogeneous data sources. These factors also create some issues. The major issues are regarding • Mining Methodology and User Interaction • Performance Issues • Diverse Data Types Issues
  • 53. Data Mining issues – contd.. Mining Methodology and User Interaction Issues It refers to the following kinds of issues: • Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery task. • Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
  • 54. Data Mining issues – contd.. Mining Methodology and User Interaction Issues – contd.. • Incorporation of background knowledge: To guide discovery process and to express the discovered patterns, the background knowledge can be used. Background knowledge may be used to express the discovered patterns not only in concise terms but at multiple levels of abstraction. • Data mining query languages and ad hoc data mining: Data Mining Query language that allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
  • 55. Data Mining issues – contd.. Mining Methodology and User Interaction Issues – contd.. • Presentation and visualization of data mining results: Once the patterns are discovered it needs to be expressed in high level languages, and visual representations. These representations should be easily understandable. • Handling noisy or incomplete data: The data cleaning methods are required to handle the noise and incomplete objects while mining the data regularities. If the data cleaning methods are not there then the accuracy of the discovered patterns will be poor. • Pattern evaluation: The patterns discovered should be interesting because either they represent common knowledge or lack novelty.
  • 56. Data Mining issues – contd.. Performance Issues There can be performance-related issues such as follows: • Efficiency and scalability of data mining algorithms: In order to effectively extract the information from huge amount of data in databases, data mining algorithm must be efficient and scalable.
  • 57. Data Mining issues – contd.. Performance Issues – contd.. • Parallel, distributed and incremental mining algorithms: The factors such as huge size of databases, wide distribution of data, and complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which is further processed in a parallel fashion. Then the results from the partitions is merged. The incremental algorithms, update databases without mining the data again from scratch.
  • 58. Data Mining issues – contd.. Diverse Data Types Issues Diverse Data Types Issues may be as follows: • Handling of relational and complex types of data: The database may contain complex data objects, multimedia data objects, spatial data, temporal data etc. It is not possible for one system to mine all these kind of data. • Mining information from heterogeneous databases and global information systems: The data is available at different data sources on LAN or WAN. These data source may be structured, semi structured or unstructured. Therefore mining the knowledge from them adds challenges to data mining.
  • 59. Data Mining goals Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related - also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications.
  • 60. Stages of the Data Mining Process The process of data mining consists of three stages: • The initial exploration • Model building or pattern identification with validation / verification • Deployment (i.e., the application of the model to new data in order to generate predictions).
  • 61. Stages of the Data Mining Process – contd.. Initial Exploration This stage usually starts with data preparation which may involve cleaning data, data transformations, selecting subsets of records and - in case of data sets with large numbers of variables ("fields") - performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered).
• 62. Stages of the Data Mining Process – contd.. Initial Exploration – contd.. Then, depending on the nature of the analytic problem, this first stage of the process of data mining may involve anywhere from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods, in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.
  • 63. Stages of the Data Mining Process – contd.. Model building and validation This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal - many of which are based on so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques - which are often considered the core of predictive data mining - include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
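As a small illustration of competitive evaluation by voting, here is a hedged Python sketch; the toy models are invented stand-ins for trained classifiers, and for regression one would average predictions instead of voting.

```python
from collections import Counter

def bagged_predict(models, instance):
    """Voting (bagging for classification): each trained model casts a
    vote and the majority class wins; for regression one would average
    the predicted values instead."""
    votes = Counter(model(instance) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy models: each is just a function from instance -> class.
models = [lambda x: "churn" if x[0] > 0.5 else "stay",
          lambda x: "churn" if x[1] > 0.3 else "stay",
          lambda x: "stay"]
print(bagged_predict(models, (0.7, 0.2)))  # "stay" wins two votes to one
```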
  • 64. Stages of the Data Mining Process – contd.. Deployment This final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.
  • 66. Concepts of Data Warehousing A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and / or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidations. Data warehousing is the process of constructing and using a data warehouse. Data warehousing is defined as a process of centralized data management and retrieval.
  • 68. Data Warehouse Features The key features of a data warehouse are discussed below: • Subject Oriented: A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations, rather it focuses on modelling and analysis of data for decision making. • Integrated: A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
• 69. Data Warehouse Features – contd.. • Time Variant: The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view. • Non-volatile: Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse. Note: A data warehouse does not require transaction processing, recovery, or concurrency controls, because it is physically stored separately from the operational database.
• 70. Data Warehouse Applications As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as one part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields: • Financial services • Banking services • Consumer goods • Retail sectors • Controlled manufacturing
• 71. Types of Data Warehouses Information processing, analytical processing, and data mining are the three types of data warehouse applications discussed below: • Information Processing: A data warehouse allows the data stored in it to be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. • Analytical Processing: A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up, and pivoting.
  • 72. Types of Data Warehouses – contd.. • Data Mining: Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction. These mining results can be presented using visualization tools.
• 73. Data Warehouse Architecture Data warehouses normally adopt a three-tier architecture: • The bottom tier is a warehouse database server that is almost always a relational database system. Data from operational databases and from external sources are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to execute code. • The middle tier is an OLAP server that is typically implemented using a relational OLAP (ROLAP) model. • The top tier is a client, which contains query and reporting tools, analysis tools and / or data mining tools.
• 74. Data Warehouse models From the architecture point of view there are three data warehouse models: • Enterprise Warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems and from external information providers. It requires extensive business modelling and may take years to design and build.
  • 75. Data Warehouse models – contd.. • Data Mart: A data mart consists of a subset of corporate wide data that is of value to specific group of users. The scope is confined to specific selected subjects. The data contained in a data mart tend to be summarized. • Virtual Warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build and it requires excess capacity on the operational database servers.
  • 76. OLAP technology OLAP (online analytical processing) is computer processing that enables a user to easily and selectively extract and view data from different points of view. For example, a user can request that data be analyzed to display a spreadsheet showing all of a company's beach ball products sold in Florida in the month of July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in Florida in the same time period.
  • 77. OLAP technology – contd.. OLAP data is stored in a multidimensional database. Whereas a relational database can be thought of as two-dimensional, a multidimensional database considers each data attribute (such as product, geographic sales region, and time period) as a separate "dimension." OLAP software can locate the intersection of dimensions (all products sold in the Eastern region above a certain price during a certain time period) and display them. Attributes such as time periods can be broken down into sub-attributes.
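To make the notion of dimensions concrete, here is a toy Python sketch (all figures invented) that keys cube cells by (product, region, period) and filters them by fixing some dimensions; real OLAP servers implement this with specialized multidimensional storage and indexing.

```python
# Toy cube: cells keyed by (product, region, period) -> sales figure.
cube = {
    ("beach ball", "Florida", "July"): 1200,
    ("beach ball", "Florida", "September"): 700,
    ("umbrella",   "Florida", "July"): 300,
}
DIMS = ("product", "region", "period")

def dice(cube, **fixed):
    """Keep only the cells whose fixed dimensions match the given values."""
    return {cell: v for cell, v in cube.items()
            if all(cell[DIMS.index(d)] == val for d, val in fixed.items())}

print(dice(cube, region="Florida", period="July"))      # all July sales in Florida
print(sum(dice(cube, product="beach ball").values()))   # roll-up over one product
```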
  • 78. OLAP technology – contd.. OLAP can be used for data mining or the discovery of previously undiscerned relationships between data items. An OLAP database does not need to be as large as a data warehouse, since not all transactional data is needed for trend analysis. Using Open Database Connectivity (ODBC), data can be imported from existing relational databases to create a multidimensional database for OLAP.
• 79. Data Warehouse vs. Operational Databases A data warehouse is kept separate from operational databases for the following reasons: • An operational database is constructed for well-known tasks and workloads such as searching for particular records, indexing, etc. In contrast, data warehouse queries are often complex and present a general form of data. • Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
• 80. Data Warehouse vs. Operational Databases – contd.. • An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data. • An operational database maintains current data. On the other hand, a data warehouse maintains historical data.
• 81. Data Warehouse vs. Operational Databases – contd.. Data Warehouse (OLAP) vs. Operational Database (OLTP):
• OLAP involves historical processing of information; OLTP involves day-to-day processing.
• OLAP systems are used by knowledge workers such as executives, managers, and analysts; OLTP systems are used by clerks, DBAs, or database professionals.
• OLAP is used to analyze the business; OLTP is used to run the business.
• OLAP focuses on information out; OLTP focuses on data in.
• OLAP is based on the Star Schema, Snowflake Schema, and Fact Constellation Schema; OLTP is based on the Entity Relationship Model.
• OLAP is subject oriented; OLTP is application oriented.
• 82. Data Warehouse vs. Operational Databases – contd.. Data Warehouse (OLAP) vs. Operational Database (OLTP):
• OLAP contains historical data; OLTP contains current data.
• OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.
• OLAP is highly flexible; OLTP provides high performance.
• OLAP provides a summarized and multidimensional view of data; OLTP provides a detailed and flat relational view of data.
• The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.
• The number of records accessed in OLAP is in the millions; in OLTP it is in the tens.
• The OLAP database size is from 100 GB to 100 TB; the OLTP database size is from 100 MB to 100 GB.
• 84. Data Mining techniques Following is an overview of some of the most common data mining techniques in use today. The techniques have been divided into two broad categories: • Classical Techniques: Statistics, Neighbourhoods and Clustering • Next Generation Techniques: Trees, Networks and Rules These categories describe a number of data mining algorithms at a high level and help to show how each algorithm fits into the landscape of data mining techniques. Overall, six broad classes of data mining algorithms are covered.
  • 85. Data Mining techniques – contd.. Classical Techniques This category contains descriptions of techniques that have classically been used for decades and the next category represents techniques that have only been widely used since the early 1980s. The main techniques here are the ones that are used 99.9% of the time on existing business problems. There are certainly many other ones as well as proprietary techniques from particular vendors - but in general the industry is converging to those techniques that work consistently and are understandable and explainable.
• 86. Data Mining techniques – contd.. Classical Techniques – contd.. Statistics By strict definition, statistics or statistical techniques are not data mining. They were being used long before the term data mining was coined to apply to business applications. However, statistical techniques are driven by the data and are used to discover patterns and build predictive models. This is why it is important to have an idea of how statistical techniques work and how they can be applied.
  • 87. Data Mining techniques – contd.. Classical Techniques – contd.. Statistics – contd.. Prediction using Statistics The term “prediction” is used for a variety of types of analysis that may elsewhere be more precisely called regression. Regression is further explained in order to simplify some of the concepts and to emphasize the common and most important aspects of predictive modelling. Nonetheless regression is a powerful and commonly used tool in statistics.
  • 88. Data Mining techniques – contd.. Classical Techniques – contd.. Statistics – contd.. Linear Regression In statistics prediction is usually synonymous with regression of some form. There are a variety of different types of regression in statistics but the basic idea is that a model is created that maps values from predictors in such a way that the lowest error occurs in making a prediction. The simplest form of regression is simple linear regression that just contains one predictor and a prediction.
• 89. Data Mining techniques – contd.. Classical Techniques – contd.. Statistics – contd.. Linear Regression – contd.. The relationship between the two can be mapped in a two-dimensional space, with the records plotted using the prediction values along the Y axis and the predictor values along the X axis. The simple linear regression model is then the line that minimizes the error between the actual prediction value and the corresponding point on the line (the prediction from the model).
• 90. Data Mining techniques – contd.. Classical Techniques – contd.. Statistics – contd.. Linear Regression – contd.. Graphically, the model is the straight line drawn through the plotted records that minimizes the distance between the records and the line. [Figure: scatter plot of predictor versus prediction values with the fitted regression line.]
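As a minimal Python (NumPy) sketch of the idea, assuming a single numeric predictor: the slope and intercept are chosen to minimize the squared error between the actual values and the fitted line.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least-squares fit of y = a + b*x (minimizes squared error)."""
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # slope = cov(x, y) / var(x)
    a = y.mean() - b * x.mean()                    # line passes through the means
    return a, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
a, b = fit_simple_linear(x, y)
prediction = a + b * 5.0   # predicted value for a new predictor value
```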
• 91. Data Mining techniques – contd.. Classical Techniques – contd.. Nearest Neighbour Clustering and the nearest neighbour prediction technique are among the oldest techniques used in data mining. In clustering, similar records are grouped together; nearest neighbour is a prediction technique that is quite similar. Its essence is that, in order to predict the value in one record, one looks for records with similar predictor values in the historical database and uses the prediction value from the record that is "nearest" to the unclassified record.
  • 92. Data Mining techniques – contd.. Classical Techniques – contd.. Nearest Neighbour – contd.. The nearest neighbour prediction algorithm works in very much the same way except that “nearness” in a database may consist of a variety of factors not just where the person lives. It may, for instance, be far more important to know which school someone attended and what degree they attained when predicting income. Nearest Neighbour techniques are easy to use and understand because they work in a way similar to the way that people think - by detecting closely matching examples.
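A minimal sketch of this idea with hypothetical names: `history` holds (predictor values, prediction value) pairs, and `distance` is whatever notion of nearness fits the problem (geographic, school attended, degree attained, and so on).

```python
def nearest_neighbour_predict(record, history, distance):
    """Predict by copying the stored prediction value of the historical
    record whose predictor values are nearest to the new record."""
    nearest_predictors, nearest_value = min(
        history, key=lambda rec: distance(record, rec[0]))
    return nearest_value

# Example: predict income from (age, years_of_education).
history = [((25, 12), 30000), ((40, 16), 70000), ((38, 18), 90000)]
euclid = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
print(nearest_neighbour_predict((39, 18), history, euclid))   # -> 90000
```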
• 93. Data Mining techniques – contd.. Classical Techniques – contd.. Clustering Clustering is basically a partitioning of the database so that the members of each partition or group are similar according to some criterion or metric. Clustering according to similarity is a concept which appears in many disciplines. If a measure of similarity is available, there are a number of techniques for forming clusters. Membership of groups can be based on the level of similarity between members, and from this the rules of membership can be defined. Another approach is to build set functions that measure some property of partitions, i.e. groups or subsets, as functions of some parameter of the partition. This latter approach achieves what is known as optimal partitioning.
• 94. Data Mining techniques – contd.. Classical Techniques – contd.. Clustering – contd.. Hierarchical Clustering The hierarchical clustering techniques create a hierarchy of clusters, from small to big. The main reason for building a hierarchy is that clustering is an unsupervised learning technique and, as such, there is no absolutely correct answer. Depending upon the particular application of the clustering, fewer or more clusters may be desired. With a hierarchy of clusters defined, it is possible to choose the number of clusters that is desired. It is even possible to have as many clusters as there are records in the database.
  • 95. Data Mining techniques – contd.. Classical Techniques – contd.. Clustering – contd.. Hierarchical Clustering – contd.. There are two main types of hierarchical clustering algorithms: • Agglomerative: Agglomerative clustering techniques start with as many clusters as there are records where each cluster contains just one record. The clusters that are nearest to each other are merged together to form the next largest cluster. This merging is continued until a hierarchy of clusters is built with just a single cluster containing all the records at the top of the hierarchy.
  • 96. Data Mining techniques – contd.. Classical Techniques – contd.. Clustering – contd.. Hierarchical Clustering – contd.. • Divisive: Divisive clustering techniques take the opposite approach from agglomerative techniques. These techniques start with all the records in one cluster and then try to split that cluster into smaller pieces and then in turn to try to split those smaller pieces into more smaller ones.
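A minimal sketch of the agglomerative approach just described, assuming single linkage (the distance between two clusters is the distance between their closest members); starting from singleton clusters, the two nearest clusters are merged until the desired number remains.

```python
def agglomerative(points, k, dist):
    """Single-linkage agglomerative clustering down to k clusters."""
    clusters = [[p] for p in points]            # one singleton cluster per record
    while len(clusters) > k:
        # find the pair of clusters with the nearest pair of members
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(a, b)
                               for a in clusters[ij[0]]
                               for b in clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))      # merge the two nearest clusters
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
print(agglomerative(points, 3, euclid))
```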
• 97. Data Mining techniques – contd.. Classical Techniques – contd.. Clustering – contd.. Non-Hierarchical Clustering There are two main classes of non-hierarchical clustering techniques. Both of them are very fast to compute on the database but have some drawbacks. • The first class is the single-pass methods. They derive their name from the fact that the database must only be passed through once in order to create the clusters (i.e. each record is read from the database only once).
  • 98. Data Mining techniques – contd.. Classical Techniques – contd.. Clustering – contd.. Non-Hierarchical Clustering – contd.. • The other class of techniques is called reallocation methods. They get their name from the movement or “reallocation” of records from one cluster to another in order to create better clusters. The reallocation techniques do use multiple passes through the database but are relatively fast in comparison to the hierarchical techniques.
  • 99. Data Mining techniques – contd.. Next Generation Techniques This category of techniques include the following: • Trees • Networks • Rules
  • 100. Data Mining techniques – contd.. Next Generation Techniques – contd.. Decision Trees A decision tree is a predictive model that, as its name implies, can be viewed as a tree. Specifically each branch of the tree is a classification question and the leaves of the tree are partitions of the dataset with their classification. There are some interesting things about the tree: • It divides up the data on each branch point without losing any of the data (the number of total records in a given parent node is equal to the sum of the records contained in its two children).
• 101. Data Mining techniques – contd.. Next Generation Techniques – contd.. Decision Trees – contd.. • In a typical churn-prediction example, the number of churners and non-churners is conserved as you move up or down the tree. • It is pretty easy to understand how the model is being built (in contrast to the models from neural networks or from standard statistics). • It would also be pretty easy to use this model if you actually had to target those customers that are likely to churn with a targeted marketing offer.
  • 102. Data Mining techniques – contd.. Next Generation Techniques – contd.. Neural Networks Neural networks is an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data. This can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyze. This expert can then be used to provide projections given new situations of interest and answer "what if' questions.
  • 104. Data Mining techniques – contd.. Next Generation Techniques – contd.. Neural Networks – contd.. The structure of a neural network is shown in figure below:
• 105. Data Mining techniques – contd.. Next Generation Techniques – contd.. Neural Networks – contd.. In the figure, the bottom layer represents the input layer, in this case with 5 inputs labelled X1 through X5. In the middle there is the hidden layer, with a variable number of nodes; the hidden layer performs much of the work of the network. The output layer in this case has two nodes, Z1 and Z2, representing output values determined from the inputs.
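A hedged NumPy sketch of a forward pass through such a network; the weights are random placeholders for parameters that training (e.g., backpropagation) would learn, and the choice of three hidden nodes and tanh activation is an assumption for illustration.

```python
import numpy as np

def forward(x, W_hidden, b_hidden, W_out, b_out):
    """Forward pass: 5 inputs -> hidden layer -> 2 outputs (Z1, Z2)."""
    h = np.tanh(W_hidden @ x + b_hidden)   # hidden layer activations
    return W_out @ h + b_out               # output values Z1 and Z2

rng = np.random.default_rng(0)
x = rng.normal(size=5)                                  # inputs X1..X5
z = forward(x,
            rng.normal(size=(3, 5)), np.zeros(3),       # 3 hidden nodes (assumed)
            rng.normal(size=(2, 3)), np.zeros(2))       # 2 output nodes
```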
• 106. Data Mining techniques – contd.. Next Generation Techniques – contd.. Rule Induction Rule induction is one of the major forms of data mining and is the most common form of knowledge discovery in unsupervised learning systems. Rule induction on a database can be a massive undertaking in which all possible patterns are systematically pulled out of the data, and an accuracy and significance are attached to them that tell the user how strong the pattern is and how likely it is to occur again.
• 107. Data Mining techniques – contd.. Next Generation Techniques – contd.. Rule Induction – contd.. In general these rules are relatively simple. For a market basket database of items scanned in consumer shopping baskets, you might find interesting correlations such as: • If bagels are purchased, then cream cheese is purchased 90% of the time, and this pattern occurs in 3% of all shopping baskets. • If live plants are purchased from a hardware store, then plant fertilizer is purchased 60% of the time, and these two items are bought together in 6% of the shopping baskets.
• 109. Mining Association Rules There are several efficient algorithms that cope with the popular and computationally expensive task of association rule mining. In brief, an association rule is an expression X => Y, where X and Y are sets of items. The meaning of such rules is quite intuitive: given a database D of transactions, where each transaction T ∈ D is a set of items, X => Y expresses that whenever a transaction T contains X, T probably contains Y also. The rule's confidence is defined as the percentage of transactions containing Y in addition to X with regard to the overall number of transactions containing X.
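Both quantities behind X => Y, support and confidence, fall out of simple counting over the transaction list; a minimal sketch (the basket data is invented):

```python
def support_confidence(transactions, X, Y):
    """Support of X => Y: fraction of transactions containing X and Y.
    Confidence: fraction of transactions containing X that also contain Y."""
    X, Y = set(X), set(Y)
    n_x  = sum(1 for T in transactions if X <= T)
    n_xy = sum(1 for T in transactions if (X | Y) <= T)
    return n_xy / len(transactions), (n_xy / n_x if n_x else 0.0)

baskets = [{"bagels", "cream cheese"}, {"bagels", "cream cheese", "milk"},
           {"bagels"}, {"milk"}]
print(support_confidence(baskets, {"bagels"}, {"cream cheese"}))  # (0.5, 0.666...)
```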
  • 110. Mining Association Rules – contd.. Below are the most common algorithms: • BFS and Counting Occurrences • BFS and TID-list Intersections • DFS and Counting Occurrences • DFS and TID-list Intersections
  • 111. Mining Association Rules – contd.. Distributed Algorithms Most parallel or distributed association rule algorithms strive to parallelize either the data, known as data parallelism, or the candidates, referred to as task parallelism. With task parallelism, the candidates are partitioned and counted separately at each processor. Obviously, the partition algorithm would be easy to parallelize using the task parallelism approach.
• 112. Mining Association Rules – contd.. Distributed Algorithms – contd.. Other dimensions for differentiating the parallel association rule algorithms are the load-balancing approach used and the architecture. The data parallelism algorithms have reduced communication costs compared with the task parallelism algorithms, because only the initial candidates (the set of items) and the local counts must be distributed at each iteration. With task parallelism, not only the candidates but also the local set of transactions must be broadcast to all other sites. However, the data parallelism algorithms require that the memory at each processor be large enough to store all candidates at each scan (otherwise performance will degrade considerably, because I/O is required for both the database and the candidate set).
  • 113. Mining Association Rules – contd.. Distributed Algorithms – contd.. The task parallelism approaches can avoid this because only the subset of the candidates that are assigned to a processor during each scan must fit into memory. Since not all partitions of the candidates must be the same size, the task parallel algorithms can adapt to the amount of memory at each site. The only restriction is that the total size of all candidates be small enough to fit into the total size of memory in all processors.
  • 114. Mining Association Rules – contd.. Distributed Algorithms – contd.. The CDA Algorithm One data parallelism algorithm is the count distribution algorithm (CDA). The database is divided into p partitions, one for each processor. Each processor counts the candidates for its data and then broadcasts its counts to all other processors. Each processor then determines the global counts. These then are used to determine the large item sets and to generate the candidates for the next scan.
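A sequential simulation of one CDA scan, as a hedged sketch: each inner list stands for the partition held at one processor, and the broadcast of local counts is simulated by simply collecting and summing them.

```python
def count_distribution(partitions, candidates):
    """One scan of the count distribution algorithm (CDA), simulated:
    each 'processor' counts the candidates on its own partition, then
    the local counts are exchanged and summed into global counts."""
    local_counts = []
    for part in partitions:                     # one partition per processor
        counts = {c: sum(1 for T in part if c <= T) for c in candidates}
        local_counts.append(counts)             # broadcast step (simulated)
    return {c: sum(lc[c] for lc in local_counts) for c in candidates}

parts = [[{"a", "b"}, {"a"}], [{"a", "b", "c"}, {"b"}]]
cands = [frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})]
print(count_distribution(parts, cands))   # global counts across both partitions
```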
• 115. Mining Association Rules – contd.. Distributed Algorithms – contd.. The FDM Algorithm The FDM (Fast Distributed Algorithm for Data Mining) algorithm, proposed in (Cheung et al. 1996), has the following distinguishing characteristics: • Candidate set generation is Apriori-like. However, some interesting properties of locally and globally frequent item sets are used to generate a reduced set of candidates at each iteration, resulting in a reduction in the number of messages exchanged between sites.
• 116. Mining Association Rules – contd.. Distributed Algorithms – contd.. The FDM Algorithm – contd.. • After the candidate sets are generated, two types of reduction techniques are applied, namely local reduction and global reduction, to eliminate some candidate sets at each site. • To determine whether a candidate set is frequent, the algorithm needs only O(n) messages for the exchange of support counts, where n is the number of sites in the distributed system. This is much less than a direct adaptation of Apriori, which would need O(n^2) messages for calculating the support counts.
• 117. Mining Association Rules – contd.. Distributed Algorithms – contd.. Increasing the support factor also increases the performance of the algorithms. Good performance is also obtained when the support factor is low and the data set is large, provided the number of processors is increased. The increase in the number of processors should be made relative to the size of the data set: for a relatively small data set, a large increase in the number of processors can lead to large sets of local candidates and a large number of messages, thus increasing the execution time of the CDA and FDM algorithms.
• 118. Mining Association Rules – contd.. Distributed Algorithms – contd.. The CDA algorithm has a simple synchronization scheme, using only one set of messages for every step, while the FDM algorithm uses two synchronizations with otherwise the same scheme as CDA. The distributed mining algorithms can be used on distributed databases, as well as for mining large databases by partitioning them between sites and processing them in a distributed manner. The high flexibility, scalability, small cost/performance ratio and connectivity of a distributed system make it an ideal platform for data mining.
• 119. Mining Association Rules – contd.. Incremental Rules With the increasing use of record-based databases to which data is continuously added, recent important applications have called for incremental mining. In dynamic transaction databases, new transactions are appended and obsolete transactions are discarded as time advances. Several research works have developed feasible algorithms for deriving precise association rules efficiently and effectively in such dynamic databases.
  • 120. Mining Association Rules – contd.. Incremental Rules – contd.. The mining of association rules on transactional database is usually an offline process since it is costly to find the association rules in large databases. With usual market-basket applications, new transactions are generated and old transactions may be obsolete as time advances. As a result, incremental updating techniques should be developed for maintenance of the discovered association rules to avoid redoing mining on the whole updated database.
• 121. Mining Association Rules – contd.. Apriori-Based Algorithms Algorithm Apriori is an influential algorithm for mining association rules. It uses prior knowledge of frequent item set properties to help narrow the search space of required frequent item sets. Specifically, k-item sets are used to explore (k+1)-item sets during the level-wise process of frequent item set generation. The set of frequent 1-itemsets (L1) is first found by scanning the whole dataset once. L1 is then used, by performing join and prune actions, to form the set of candidate 2-itemsets (C2). After another data scan, the set of frequent 2-itemsets (L2) is identified and extracted from C2. The whole process continues iteratively until no more candidate item sets can be formed from the previous Lk.
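A compact sketch of this level-wise process, assuming transactions are given as frozensets and min_support is an absolute count; the join and prune steps here are a simplified version of the textbook candidate generation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori sketch: L1 -> C2 -> L2 -> ... as described above.
    Returns a dict mapping each frequent item set to its support count."""
    def count(candidates):
        counts = {c: 0 for c in candidates}
        for T in transactions:                 # one scan of the data set
            for c in candidates:
                if c <= T:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([i]) for T in transactions for i in T}
    frequent, L = {}, count(items)             # L1
    k = 2
    while L:
        frequent.update(L)
        # join: combine frequent (k-1)-item sets into k-item candidates;
        # prune: keep only those whose (k-1)-subsets are all frequent
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = count(candidates)
        k += 1
    return frequent

baskets = [frozenset(t) for t in
           ({"bagels", "cream cheese"}, {"bagels", "cream cheese", "milk"},
            {"bagels", "milk"}, {"milk"})]
print(apriori(baskets, min_support=2))
```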
• 122. Mining Association Rules – contd.. Apriori-Based Algorithms for Incremental Mining The Apriori heuristic is an anti-monotone principle: if any item set is not frequent in the database, its super item sets will never be frequent. Below are the algorithms belonging to this category that adopt a level-wise approach: • Algorithm FUP (Fast UPdate) • Algorithms FUP2 and FUP2H • Algorithm UWEP (Update With Early Pruning) • Algorithm Utilizing Negative Borders • Algorithm DELI (Difference Estimation for Large Item sets) • Algorithms MAAP (Maintaining Association rules with Apriori Property) and PELICAN
• 123. Mining Association Rules – contd.. Partition-Based Algorithms There are several techniques developed in prior works to improve the efficiency of algorithm Apriori, e.g., hashing item set counts, transaction reduction, data sampling, data partitioning and so on. Among these techniques, data partitioning is of particular importance here, since the focus is on incremental mining, where bulks of transactions may be appended or discarded as time advances.
• 124. Mining Association Rules – contd.. Partition-Based Algorithms for Incremental Mining In contrast to the Apriori heuristic, the partition-based technique makes good use of a partitioning of the whole transactional dataset. After the partitioning, it holds that if X is a frequent item set in a database D which is divided into n partitions p1, p2, ..., pn, then X must be a frequent item set in at least one of the n partitions. Consequently, algorithms belonging to this category work on each partition of data iteratively and gather the information obtained from the processing of each partition to generate the final (integrated) results.
• 125. Mining Association Rules – contd.. Partition-Based Algorithms for Incremental Mining – contd.. Below are the algorithms belonging to this category: • Algorithm SWF (Sliding-Window Filtering) • Algorithms FI_SWF and CI_SWF
• 126. Mining Association Rules – contd.. Pattern Growth Algorithms The generation of frequent item sets in both the Apriori-based algorithms and the partition-based algorithms is in the style of candidate generate-and-test. No matter how the search space for candidate item sets is narrowed, in some cases a huge number of candidate item sets may still need to be generated. In addition, the number of database scans is at least two, and usually some extra scans are needed to avoid unreasonable computing overheads. These two problems are nontrivial and result from the use of the Apriori approach.
• 127. Mining Association Rules – contd.. Pattern Growth Algorithms – contd.. To overcome these difficulties, tree structures that store projected information about large datasets are utilized in some prior works. The algorithm TreeProjection constructs a lexicographical tree and projects the whole database onto the frequent item sets mined so far. The transaction projection can limit the support counting to a relatively small space, and the lexicographical tree facilitates the management of candidate item sets. These features of algorithm TreeProjection provide a great improvement in computing efficiency when mining association rules.
• 128. Mining Association Rules – contd.. Pattern Growth Algorithms for Incremental Mining Both the Apriori-based algorithms and the partition-based algorithms aim at the goal of reducing the number of scans on the entire dataset when updates occur. Generally speaking, the updated portions, i.e., ∆− and ∆+, could be scanned several times during the level-wise generation of frequent item sets in works belonging to these two categories. Below are the algorithms belonging to this category: • Algorithms DB-tree and PotFp-tree (Potential Frequent Pattern) • Algorithm FELINE (FrEquent/Large patterns mINing with CATS trEe)
  • 129. UNIT-V
• 130. Clustering Techniques A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Below are the major clustering approaches: • Partitioning algorithms: construct various partitions and then evaluate them by some criterion • Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion • Density-based algorithms: based on connectivity and density functions • Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
• 131. Clustering Techniques – contd.. Partitioning Algorithms: Basic Concept Construct a partition of a database D of n objects into a set of k clusters. Given k, find a partition into k clusters that optimizes the chosen partitioning criterion. • Global optimum: exhaustively enumerate all partitions • Heuristic methods: k-means and k-medoids algorithms • k-means (MacQueen’67): each cluster is represented by the centre of the cluster • k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster
• 132. Clustering Techniques – contd.. Optimization problem The goal is to optimize a score function. The most commonly used is the square-error criterion: E = Σ_{i=1..k} Σ_{p ∈ Ci} |p − mi|², where mi is the mean (centroid) of cluster Ci and p is a data object assigned to Ci.
• 133. Clustering Techniques – contd.. The K-Means Clustering Method Given k, the k-means algorithm is implemented in four steps: • Partition the objects into k nonempty subsets. • Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e. the mean point, of the cluster). • Assign each object to the cluster with the nearest seed point. • Go back to step 2; stop when no new assignments occur.
• 134. Clustering Techniques – contd.. The K-Means Clustering Method – contd.. [Figure: successive k-means iterations on a small two-dimensional data set - objects are reassigned to the nearest centroid and the centroids are recomputed until the assignments no longer change.]
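A minimal NumPy sketch of those four steps (an illustration, not the only formulation); it assumes numeric data and, for simplicity, that no cluster becomes empty during the iterations.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array of objects, k the number of
    clusters. Returns the final centroids and each object's cluster label."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial seeds
    for _ in range(max_iter):
        # assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):   # stop: no more change
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
centroids, labels = k_means(X, k=2)
```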
• 135. Clustering Techniques – contd.. The K-Means Clustering Method – contd.. Strength • Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. • Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms. Weakness • Applicable only when the mean is defined - so what about categorical data? • Need to specify k, the number of clusters, in advance • Unable to handle noisy data and outliers • Not suitable for discovering clusters with non-convex shapes
• 136. Clustering Techniques – contd.. The K-Medoids Clustering Method • Find representative objects, called medoids, in clusters • PAM (Partitioning Around Medoids, 1987) • starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering • PAM works effectively for small data sets, but does not scale well for large data sets • CLARA (Kaufmann & Rousseeuw, 1990) • CLARANS (Ng & Han, 1994): randomized sampling • Focusing + spatial data structure (Ester et al., 1995)
• 137. Clustering Techniques – contd.. The K-Medoids Clustering Method – contd.. PAM (Partitioning Around Medoids) • PAM (Kaufman and Rousseeuw, 1987), built into S-Plus • Uses real objects to represent the clusters: • Select k representative objects arbitrarily • For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih • If TCih < 0, i is replaced by h • Then assign each non-selected object to the most similar representative object • Repeat the previous steps until there is no change
• 138. Clustering Techniques – contd.. The K-Medoids Clustering Method – contd.. PAM (Partitioning Around Medoids) – contd.. [Figure: four cases for the contribution Cjih of a non-selected object j to the cost of swapping medoid i with non-medoid h, where t is j's next-best medoid:] • Case 1 - j stays with its other medoid t: Cjih = 0 • Case 2 - j moves from i to h: Cjih = d(j, h) - d(j, i) • Case 3 - j moves from i to t: Cjih = d(j, t) - d(j, i) • Case 4 - j moves from t to h: Cjih = d(j, h) - d(j, t)
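The four cases can be folded into a short computation. A hedged sketch, assuming k >= 2, a precomputed distance matrix D (a list of lists) and the current medoid indices; a negative total TCih means the swap improves the clustering.

```python
def swap_cost(D, medoids, i, h):
    """Total swapping cost TCih of replacing medoid i with non-medoid h,
    summing the contribution Cjih of every non-selected object j."""
    others = [m for m in medoids if m != i]          # medoids kept after the swap
    total = 0.0
    for j in range(len(D)):
        if j == h or j in medoids:
            continue                                  # only non-selected objects
        d_t = min(D[j][m] for m in others)            # distance to next-best medoid t
        if D[j][i] > d_t:                             # j currently belongs to t
            total += min(D[j][h] - d_t, 0)            # case 1 (zero) or case 4
        else:                                         # j currently belongs to i
            total += min(D[j][h], d_t) - D[j][i]      # case 2 or case 3
    return total
```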
• 139. Clustering Techniques – contd.. The K-Medoids Clustering Method – contd.. CLARA (Clustering Large Applications) • Built into statistical analysis packages, such as S+ • It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output • Strength: deals with larger data sets than PAM • Weaknesses: efficiency depends on the sample size; a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
• 140. Clustering Techniques – contd.. The K-Medoids Clustering Method – contd.. CLARANS (“Randomized” CLARA) • CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94) • CLARANS draws a sample of neighbours dynamically • The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids • If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum • It is more efficient and scalable than both PAM and CLARA
• 141. Clustering Techniques – contd.. Hierarchical Clustering Hierarchical clustering uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
• 143. Classification Techniques In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.
  • 144. Classification Techniques – contd.. Statistical-based Two main phases of work on classification can be identified within the statistical community. The first, “classical” phase concentrated on derivatives of Fisher’s early work on linear discrimination. The second, “modern” phase exploits more flexible classes of models, many of which attempt to provide an estimate of the joint distribution of the features within each class, which can in turn provide a classification rule.
  • 145. Classification Techniques – contd.. Statistical-based – contd.. Statistical approaches are generally characterised by having an explicit underlying probability model, which provides a probability of being in each class rather than simply a classification. In addition, it is usually assumed that the techniques will be used by statisticians, and hence some human intervention is assumed with regard to variable selection and transformation, and overall structuring of the problem.
• 146. Classification Techniques – contd.. Distance-based A typical distance-based classifier is Knn (K Nearest Neighbours). Knn calculates the proximity between a test instance and each of the training instances in order to select the k nearest neighbours of the test instance. Majority voting is then used to assign a class label to the test instance: it receives the class of the majority of the training instances in the k-nn set. The most commonly used proximity measures are Euclidean distance and cosine similarity: with instances described by the values of n attributes, proximity is computed between two instances where each instance is viewed as a vector in an n-dimensional space.
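A minimal majority-voting sketch, assuming training instances are (attribute vector, class label) pairs and using Euclidean distance via math.dist:

```python
from collections import Counter
import math

def knn_classify(test, training, k):
    """Majority-vote k-NN: rank training instances by Euclidean distance
    to the test instance, take the k nearest, and return the majority class."""
    neighbours = sorted(training,
                        key=lambda rec: math.dist(test, rec[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B")]
print(knn_classify((1.1, 1.0), training, k=3))   # "A" wins 2 votes to 1
```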
• 147. Classification Techniques – contd.. Distance-based – contd.. These classifiers are simple and powerful, but some well-known limitations of Knn are given below: • If there are many training instances, then Knn requires many distance calculations as well. • Knn is prone to model over-fitting: the classifier relies too much on the training data for its predictions and is not able to generalize its model to new test data. Over-fitting shows up when observing the classification errors on the training set and on the test set: the misclassification error on the training set continues to decrease while the error on test instances starts to increase again.
  • 148. Classification Techniques – contd.. Decision Tree- based A decision tree is a classifier expressed as a recursive partition of the instance space. The decision tree consists of nodes that form a rooted tree, meaning it is a directed tree with a node called “root” that has no incoming edges. All other nodes have exactly one incoming edge. A node with outgoing edges is called an internal or test node. All other nodes are called leaves (also known as terminal or decision nodes). In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attributes values. In the simplest and most frequent case, each test considers a single attribute, such that the instance space is partitioned according to the attribute’s value. In the case of numeric attributes, the condition refers to a range.
  • 149. Classification Techniques – contd.. Decision Tree- based – contd.. Each leaf is assigned to one class representing the most appropriate target value. Alternatively, the leaf may hold a probability vector indicating the probability of the target attribute having a certain value. Instances are classified by navigating them from the root of the tree down to a leaf, according to the outcome of the tests along the path. Decision tree inducers are algorithms that automatically construct a decision tree from a given dataset. Typically the goal is to find the optimal decision tree by minimizing the generalization error. However, other target functions can be also defined, for instance, minimizing the number of nodes or minimizing the average depth.
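A minimal sketch of this root-to-leaf navigation, using a hypothetical hand-built tree with categorical tests; a real inducer (e.g., ID3 or C4.5) would construct the tree automatically from data.

```python
class Node:
    """Internal test node (tests one attribute) or leaf (holds a class)."""
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute   # attribute tested at this internal node
        self.children = children     # attribute value -> child Node
        self.label = label           # class label if this node is a leaf

def classify(node, instance):
    """Navigate from the root down to a leaf, following the outcome of
    the test at each internal node."""
    while node.label is None:
        node = node.children[instance[node.attribute]]
    return node.label

# Hypothetical tree: first test 'outlook', then 'windy' for rainy days.
tree = Node("outlook", {
    "sunny": Node(label="play"),
    "rainy": Node("windy", {"yes": Node(label="stay home"),
                            "no":  Node(label="play")}),
})
print(classify(tree, {"outlook": "rainy", "windy": "no"}))   # -> "play"
```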
  • 151. Applications and Trends in Data Mining Data mining is an interdisciplinary field with wide and diverse applications. There exist nontrivial gaps between data mining principles and domain-specific applications. Some application domains: • Financial data analysis • Retail industry • Telecommunication industry • Biological data analysis
  • 152. Applications and Trends in Data Mining – contd.. Financial Data Analysis • Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality • Design and construction of data warehouses for multidimensional data analysis and data mining • View the debt and revenue changes by month, by region, by sector, and by other factors • Access statistical information such as max, min, total, average, trend, etc. • Loan payment prediction/consumer credit policy analysis • feature selection and attribute relevance ranking • Loan payment performance • Consumer credit rating
• 153. Applications and Trends in Data Mining – contd.. Financial Data Analysis – contd.. • Classification and clustering of customers for targeted marketing • multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer with an appropriate customer group • Detection of money laundering and other financial crimes • integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs) • Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)
  • 154. Applications and Trends in Data Mining – contd.. Retail Industry • Retail industry: huge amounts of data on sales, customer shopping history, etc. • Applications of retail data mining • Identify customer buying behaviors • Discover customer shopping patterns and trends • Improve the quality of customer service • Achieve better customer retention and satisfaction • Enhance goods consumption ratios • Design more effective goods transportation and distribution policies
  • 155. Applications and Trends in Data Mining – contd.. Telecomm. Industry • A rapidly expanding and highly competitive industry and a great demand for data mining • Understand the business involved • Identify telecommunication patterns • Catch fraudulent activities • Make better use of resources • Improve the quality of service • Multidimensional analysis of telecommunication data • Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.
  • 156. Applications and Trends in Data Mining – contd.. Telecomm. Industry – contd.. • Fraudulent pattern analysis and the identification of unusual patterns • Identify potentially fraudulent users and their atypical usage patterns • Detect attempts to gain fraudulent entry to customer accounts • Discover unusual patterns which may need special attention • Multidimensional association and sequential pattern analysis • Find usage patterns for a set of communication services by customer group, by month, etc. • Promote the sales of specific services • Improve the availability of particular services in a region
  • 157. Applications and Trends in Data Mining – contd.. Biomedical Data Analysis • DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T). • Gene: a sequence of hundreds of individual nucleotides arranged in a particular order • Humans have around 30,000 genes • Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes • Semantic integration of heterogeneous, distributed genome databases • Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data • Data cleaning and data integration methods developed in data mining will help
  • 158. Applications and Trends in Data Mining – contd.. Choosing a Data Mining System • Commercial data mining systems have little in common • Different data mining functionality or methodology • May even work with completely different kinds of data sets • Need multiple dimensional view in selection • Data types: relational, transactional, text, time sequence, spatial? • System issues • running on only one or on several operating systems? • a client/server architecture? • Provide Web-based interfaces and allow XML data as input and/or output?
  • 159. Applications and Trends in Data Mining – contd.. Choosing a Data Mining System – contd.. • Data sources • ASCII text files, multiple relational data sources • support ODBC connections (OLE DB, JDBC)? • Data mining functions and methodologies • One vs. multiple data mining functions • One vs. variety of methods per function • More data mining functions and methods per function provide the user with greater flexibility and analysis power • Coupling with DB and/or data warehouse systems • Four forms of coupling: no coupling, loose coupling, semitight coupling, and tight coupling
• 160. Applications and Trends in Data Mining – contd.. Choosing a Data Mining System – contd.. • Scalability • Row (or database size) scalability • Column (or dimension) scalability • Curse of dimensionality: it is much more challenging to make a system column scalable than row scalable • Visualization tools • “A picture is worth a thousand words” • Visualization categories: data visualization, mining result visualization, mining process visualization, and visual data mining • Data mining query language and graphical user interface • Easy-to-use and high-quality graphical user interface • Essential for user-guided, highly interactive data mining
  • 161. Advanced Techniques of Data Mining Web Mining Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. There are three general classes of information that can be discovered by web mining: • Web activity, from server logs and Web browser activity tracking. • Web graph, from links between pages, people and other data. • Web content, for the data found on Web pages and inside of documents.
  • 162. Advanced Techniques of Data Mining – contd.. Web Mining – contd.. Note that there’s no explicit reference to “search” in the above description. While search is the biggest web miner by far, and generates the most revenue, there are many other valuable end uses for web mining results. A partial list includes: • Business intelligence • Competitive intelligence • Pricing analysis • Events • Product data • Popularity • Reputation
  • 163. Advanced Techniques of Data Mining – contd.. Web Mining – contd.. When extracting Web content information using web mining, there are four typical steps: • Collect: fetch the content from the Web • Parse: extract usable data from formatted data (HTML, PDF, etc) • Analyze: tokenize, rate, classify, cluster, filter, sort, etc. • Produce: turn the results of analysis into something useful (report, search index, etc)
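A toy sketch of these four steps using only the Python standard library; a production miner would use a proper HTML parser, honour robots.txt, and rate-limit its requests (the function name and regexes here are illustrative choices).

```python
import re
from collections import Counter
from urllib.request import urlopen

def mine_page(url):
    """Collect -> parse -> analyze -> produce for a single page."""
    html = urlopen(url).read().decode("utf-8", errors="ignore")   # collect
    text = re.sub(r"<[^>]+>", " ", html)                          # parse: strip tags
    tokens = re.findall(r"[a-z]{3,}", text.lower())               # analyze: tokenize
    return Counter(tokens).most_common(10)                        # produce: top terms

# Example (any publicly crawlable URL):
# print(mine_page("https://example.com/"))
```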
• 164. Advanced Techniques of Data Mining – contd.. Web Mining versus Data Mining When comparing web mining with traditional data mining, there are three main differences to consider: • Scale: In traditional data mining, processing 1 million records from a database would be a large job. In web mining, even 10 million pages wouldn't be a big number.
  • 165. Advanced Techniques of Data Mining – contd.. Web Mining versus Data Mining – contd.. • Access: When doing data mining of corporate information, the data is private and often requires access rights to read. For web mining, the data is public and rarely requires access rights. But web mining has additional constraints, due to the implicit agreement with webmasters regarding automated (non-user) access to this data. This implicit agreement is that a webmaster allows crawlers access to useful data on the website, and in return the crawler (a) promises not to overload the site, and (b) has the potential to drive more traffic to the website once the search index is published. With web mining, there often is no such index, which means the crawler has to be extra careful/polite during the crawling process, to avoid causing any problems for the webmaster.
  • 166. Advanced Techniques of Data Mining – contd.. Web Mining versus Data Mining – contd.. • Structure: A traditional data mining task gets information from a database, which provides some level of explicit structure. A typical web mining task is processing unstructured or semi-structured data from web pages. Even when the underlying information for web pages comes from a database, this often is obscured by HTML markup. Note that by “traditional” data mining we mean the type of analysis supported by most vendor tools, which assumes you’re processing table-oriented data that typically comes from a database.
  • 167. Text Books • Roiger & Geatz, Data Mining, Pearson Education • A.K.Pujari, Data Mining, University Press • M. H. Dunham. Data Mining: Introductory and Advanced Topics. Pearson Education. • J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufman. References Books • I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann. • D. Hand, H. Mannila and P. Smyth. Principles of Data Mining. Prentice-Hall.
  • 168. Data mining: “Drowning in Data yet Starving for Knowledge”