Overview of basic concepts related to Data Mining: database, data model, fuzzy sets, information retrieval, data warehouse, dimensional modeling, data cubes, OLAP, machine learning.
2.1 Database/OLTP Systems
• Unlike a simple set, the data in a database are viewed as having a particular
structure, or schema, with which they are associated.
• Unlike a file, a database is independent of the physical method used to store it.
• A data model describes the data, their attributes, and the relationships among
them. A common data model is the ER (entity-relationship) model. It can be
viewed as a documentation and communication tool that conveys the type and
structure of the actual data. A data model is independent of the particular DBMS used.
• Basic database queries are well defined with precise results. Data mining
applications, conversely, are often vaguely defined with imprecise results. A data
mining query outputs a KDD object.
• A KDD object is a rule, a classification, or a cluster; it does not exist
before the query executes and is not part of the database being queried.
2.2 Fuzzy Sets and Fuzzy Logic
• A fuzzy set is a set, 𝐹, whose membership function, 𝑓, is real-valued (as opposed to
Boolean) with output in the range [0,1]. An element 𝑥 is said to belong to 𝐹 with
degree 𝑓(𝑥) and simultaneously to ¬𝐹 with degree 1 − 𝑓(𝑥); these are grades of
membership, not probabilities.
• Because the membership function is not Boolean, the results of a query against a fuzzy set
are themselves fuzzy. A classification problem can be solved by assigning each record a
membership value for each class; the record is then assigned to the class with the
highest membership value.
• Association rules are generated with a confidence value that indicates the degree to which
the rule holds in the entire database. This confidence can be thought of as a membership function.
• Fuzzy logic uses rules and membership functions to estimate a continuous function. Fuzzy logic is
a valuable tool to develop control systems for such things as elevators, trains, and heating
systems.
𝑚𝑒𝑚(¬𝑥) = 1 − 𝑚𝑒𝑚(𝑥)
𝑚𝑒𝑚(𝑥 ∧ 𝑦) = min(𝑚𝑒𝑚(𝑥), 𝑚𝑒𝑚(𝑦))
𝑚𝑒𝑚(𝑥 ∨ 𝑦) = max(𝑚𝑒𝑚(𝑥), 𝑚𝑒𝑚(𝑦))
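These operators, and the class-assignment step described above, can be sketched directly; the record and its membership values are hypothetical:

```python
def fuzzy_not(mem_x):
    """mem(not x) = 1 - mem(x)"""
    return 1 - mem_x

def fuzzy_and(mem_x, mem_y):
    """mem(x and y) = min(mem(x), mem(y))"""
    return min(mem_x, mem_y)

def fuzzy_or(mem_x, mem_y):
    """mem(x or y) = max(mem(x), mem(y))"""
    return max(mem_x, mem_y)

def classify(memberships):
    """Assign a record to the class with the highest membership value."""
    return max(memberships, key=memberships.get)

# hypothetical record with one membership value per class
record = {"tall": 0.7, "medium": 0.4, "short": 0.1}
print(classify(record))  # tall
```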
2.3 Information Retrieval
The effectiveness of an IR system in processing a
query is often measured by precision and recall:
• 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = |𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 ∩ 𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑| / |𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑|
• 𝑅𝑒𝑐𝑎𝑙𝑙 = |𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 ∩ 𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑| / |𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡|
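A minimal sketch of the two measures over sets of document IDs; the documents here are hypothetical:

```python
def precision(relevant, retrieved):
    # fraction of retrieved documents that are actually relevant
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    # fraction of relevant documents that were retrieved
    return len(relevant & retrieved) / len(relevant)

relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
# 2 of the 3 retrieved are relevant; 2 of the 4 relevant were retrieved
print(precision(relevant, retrieved), recall(relevant, retrieved))
```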
The inverse document frequency (IDF) is often used
by similarity measures. Given a keyword, 𝑘, and
𝑛 documents, IDF can be defined as:
• 𝐼𝐷𝐹ₖ = log(𝑛 / |𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝑘|) + 1
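A sketch of this definition on a hypothetical corpus of keyword sets; the natural log is assumed, since the formula does not specify a base:

```python
import math

def idf(keyword, documents):
    """IDF_k = log(n / |documents containing k|) + 1"""
    df = sum(1 for doc in documents if keyword in doc)
    return math.log(len(documents) / df) + 1

# hypothetical corpus: each document is its set of keywords
docs = [{"data", "mining"}, {"data", "warehouse"},
        {"fuzzy", "logic"}, {"data", "cube"}]

# the rarer keyword receives the higher weight
print(idf("fuzzy", docs) > idf("data", docs))  # True
```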
Concept hierarchies (trees or DAGs, i.e., directed acyclic
graphs) can be used in spatial data mining.
2.5 Data Warehousing
• A data warehouse is a set of data that supports DSS and is subject-
oriented, integrated, time-variant, and nonvolatile.
• A DW contains informational data, which are used to support
other functions such as planning and forecasting.
• OLAP retrieval tools facilitate quick query response at all granularities.
2.6 Dimensional Modeling
(Figures: a multidimensional star schema and a multidimensional constellation schema.)
• A dimension is a collection of logically related attributes and is viewed as an axis
for modeling the data.
• The specific data stored are called facts and usually are numeric data. In a
relational system each dimension is a table and facts are stored in a fact table.
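In a relational system, this can be sketched as below: two hypothetical dimension tables plus a fact table whose rows hold foreign keys into the dimensions and a numeric fact.

```python
# hypothetical dimension tables, keyed by surrogate key
product_dim = {1: {"product": "widget", "category": "tools"},
               2: {"product": "gadget", "category": "toys"}}
time_dim = {10: {"month": "Jan", "year": 2024},
            11: {"month": "Feb", "year": 2024}}

# fact table: foreign keys into each dimension plus a numeric measure
sales_facts = [
    {"product_id": 1, "time_id": 10, "amount": 120.0},
    {"product_id": 2, "time_id": 10, "amount": 75.0},
    {"product_id": 1, "time_id": 11, "amount": 90.0},
]

def expand(fact):
    """Join one fact row with its dimension rows (the star-schema join)."""
    row = dict(fact)
    row.update(product_dim[fact["product_id"]])
    row.update(time_dim[fact["time_id"]])
    return row
```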
2.7 Online Analytic Processing (OLAP)
OLAP supports ad hoc querying of the data warehouse. OLAP requires
a multidimensional view of the data and involves some analysis.
Operations supported: Slice, Dice, Roll up, Drill down, Visualization
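Two of these operations can be sketched on a tiny in-memory cube (the dimensions and values are hypothetical): a slice fixes one dimension at a single value, while a roll up aggregates a dimension away.

```python
from collections import defaultdict

# tiny two-dimensional cube: (product, month) -> sales
cube = {("widget", "Jan"): 120, ("widget", "Feb"): 90,
        ("gadget", "Jan"): 75, ("gadget", "Feb"): 60}

def slice_month(cube, month):
    """Slice: fix the month dimension at a single value."""
    return {product: v for (product, m), v in cube.items() if m == month}

def roll_up_over_months(cube):
    """Roll up: aggregate the month dimension away."""
    totals = defaultdict(int)
    for (product, _), v in cube.items():
        totals[product] += v
    return dict(totals)

print(slice_month(cube, "Jan"))      # sales per product in January
print(roll_up_over_months(cube))     # total sales per product
```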
2.8 Statistics
• Even such simple concepts as determining a data distribution and calculating a mean and a
variance can be viewed as data mining techniques in their own right: each is a descriptive
model for the data under consideration.
• When a model is generated, the goal is to fit it to the entire data, not just the sample
examined. Assumptions often made about the independence of data may be incorrect,
leading to errors in the resulting model. Any model should be statistically significant,
meaningful, and valid.
• An often-used tool in data mining and machine learning is sampling: a subset
of the total population is examined, and a generalization (model) about the entire
population is made from this subset.
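A minimal sketch of this idea on a synthetic population (the distribution parameters are hypothetical): statistics computed on a sample serve as estimates for the whole population.

```python
import random
import statistics

random.seed(0)  # reproducible sketch

# synthetic population: 10,000 values drawn from a normal distribution
population = [random.gauss(50, 10) for _ in range(10_000)]

# examine only a small subset of the population
sample = random.sample(population, 200)

# the sample statistics generalize (approximately) to the population
sample_mean = statistics.mean(sample)
population_mean = statistics.mean(population)
print(sample_mean, population_mean)
```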
• The term exploratory data analysis describes the fact that the data can actually drive the
creation of the model and any statistical characteristics.
• Some data mining applications determine correlations among data. These relationships,
however, are not necessarily causal in nature. Care must be taken when assigning
significance to such relationships.
2.9 Machine Learning
• Data mining involves not only modeling but also the development of effective and
efficient algorithms and data structures to perform the modeling on large data sets.
• Machine learning is the area of AI that examines how to write programs that can learn. In
data mining, machine learning is often used for prediction or classification.
• Predictive modeling is done in two phases. During the training phase, historical or
sampled data are used to create a model that represents those data. It is assumed to
hold for the whole database and its future states. The testing phase then applies this
model to the remaining and future data.
• With supervised learning, a sample of the database is used to train the system to properly
perform the desired task; the quality of the training data determines how well the
program learns. With unsupervised learning, there is no prior knowledge of the correct
answers against which to check the model's output.
• The objective for data mining is to uncover useful information and provide it to humans,
while machine learning research is focused more on the learning portion.
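The two-phase scheme described above can be sketched with a nearest-mean classifier on hypothetical one-dimensional data: the training phase builds the model (a mean per class), and the testing phase applies it to held-out records.

```python
# hypothetical labeled data: (value, class)
train = [(1.0, "a"), (1.2, "a"), (0.8, "a"),
         (5.0, "b"), (5.3, "b"), (4.7, "b")]
test = [(1.1, "a"), (5.1, "b")]

# training phase: the model is the mean value of each class
means = {}
for label in {"a", "b"}:
    values = [x for x, y in train if y == label]
    means[label] = sum(values) / len(values)

# testing phase: apply the model to held-out data
def predict(x):
    return min(means, key=lambda label: abs(x - means[label]))

accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(accuracy)
```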