Mis2013 chapter 12 business intelligence and knowledge management
- 2. Understand the need for business
intelligence systems.
Know the characteristics of reporting
systems.
Know the purpose and role of data
warehouses and data marts.
Understand fundamental data-mining
techniques.
Know the purpose, features, and functions of
knowledge management systems.
© 2007 Prentice Hall, Inc. 2
- 3. According to a study done at the University of
California at Berkeley, a total of 403
petabytes of new data were created.
403 petabytes is roughly the amount of all
printed material ever written.
◦ The printed collection of the Library of Congress is
0.01 petabytes.
◦ 400 petabytes equals 40,000 copies of the print
collection of the Library of Congress.
- 4. The generation of all these data has much to
do with Moore’s Law.
The capacity of storage devices increases as
their costs decrease.
Today, storage capacity is nearly unlimited.
We are drowning in data and starving for
information.
- 5. Source: Used with permission of Peter Lyman and Hal R. Varian, University of California at Berkeley.
- 6. Source: Used with permission of Peter Lyman and Hal R. Varian, University of California at Berkeley.
- 7. Tools for searching business data in an
attempt to find patterns are called business
intelligence (BI) tools.
Reporting tools are programs that read data
from a variety of sources, process that data,
produce formatted reports, and deliver those
reports to the users who need them.
- 8. The processing of data is simple:
◦ Data are sorted and grouped.
◦ Simple totals and averages are calculated.
Reporting tools are used primarily for
assessment.
◦ They are used to address questions like:
What has happened in the past?
What is the current situation?
How does the current situation compare to the past?
- 9. Data-mining tools process data using
statistical techniques, many of which are
sophisticated and mathematically complex.
Data mining involves searching for patterns
and relationships among data.
In most cases, data-mining tools are used to
make predictions.
For example, we can use one form of analysis to compute
the probability that a customer will default on a loan.
- 10. Another way to distinguish reporting tools
from data-mining tools is by the operations they use:
◦ Reporting tools use simple operations like sorting,
grouping, and summing.
◦ Data-mining tools use sophisticated techniques.
- 11. An information system is a collection of
hardware, software, data, procedures, and
people.
The purpose of a business intelligence (BI)
system is to provide the right information, to
the right user, at the right time.
BI systems help users accomplish their goals
and objectives by producing insights that
lead to actions.
- 12. A reporting tool can generate a report that
shows a customer has canceled an important
order.
A reporting system, however, alerts that
customer’s salesperson to this unwanted
news, and does so in time for the salesperson
to try to alter the customer’s decision.
A data-mining tool can create an equation
that computes the probability that a customer
will default on a loan.
- 13. A data-mining system uses that equation to
enable banking personnel to assess new loan
applications.
- 14. The purpose of a reporting system is to
create meaningful information from disparate
data sources and to deliver that information
to the proper user on a timely basis.
Reporting systems generate information from
data as a result of four operations:
◦ Filtering data
◦ Sorting data
◦ Grouping data
◦ Making simple calculations on the data
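The four operations above can be illustrated with a minimal Python sketch. The order records, the field names, the `build_report` helper, and the 100-unit threshold are all invented for illustration; they are not from the slides.

```python
from collections import defaultdict

# Hypothetical order records; all names and values are illustrative.
orders = [
    {"region": "West", "customer": "Acme", "amount": 250.0},
    {"region": "East", "customer": "Birch", "amount": 75.0},
    {"region": "West", "customer": "Cedar", "amount": 400.0},
    {"region": "East", "customer": "Dune", "amount": 125.0},
]

def build_report(records, min_amount):
    # 1. Filter: keep only orders at or above the threshold.
    kept = [r for r in records if r["amount"] >= min_amount]
    # 2. Sort: order the remaining rows by region, then amount descending.
    kept.sort(key=lambda r: (r["region"], -r["amount"]))
    # 3. Group: collect the amounts for each region.
    groups = defaultdict(list)
    for r in kept:
        groups[r["region"]].append(r["amount"])
    # 4. Simple calculations: total and average per group.
    return {
        region: {"total": sum(a), "average": sum(a) / len(a)}
        for region, a in groups.items()
    }

report = build_report(orders, min_amount=100.0)
```

Nothing here goes beyond sorting, grouping, and summing, which is the point of the slide: a reporting system does simple processing and adds value mainly through delivery.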
- 17. A reporting system maintains a database of
reporting metadata.
The metadata describes the reports, users,
groups, roles, events, and other entities
involved in the reporting activity.
The reporting system uses the metadata to
prepare and deliver reports to the proper
users on a timely basis.
- 20. In terms of report type, reports can be
static or dynamic.
Static reports are prepared once from the
underlying data, and they do not change.
◦ Example: a report of the past year’s sales
Dynamic reports: the reporting system reads
the most current data and generates the
report using that fresh data.
◦ Examples: a report on today’s sales and a report
on current stock prices
- 21. Query reports are prepared in response to
data entered by users.
Online analytical processing (OLAP) reports
allow the user to dynamically change the
report grouping structures.
- 22. Reports are delivered via many different
report media or channels.
Some reports are printed on paper, and
others are created in a format like PDF
whereby they can be printed or viewed
electronically.
Other reports are delivered to computer
screens.
Companies sometimes place reports on
internal corporate Web sites for employees to
access.
- 23. Another report medium is a digital
dashboard, which is an electronic display
customized for a particular user.
◦ Vendors like Yahoo! and MSN provide common
examples.
◦ Users of these services can define the content they
want (say, a local weather forecast, a list of stock
prices, or a list of news sources).
◦ The vendor constructs the display customized for
each user.
- 24. Other dashboards are particular to an
organization.
◦ The organization might have a dashboard that shows
up-to-the-minute production and sales activities.
Alerts are another form of report.
◦ Users can declare that they wish to receive
notifications of events, say, via email or on their cell
phones.
Reports can be published via a Web service.
◦ The Web service produces the report in response to
requests from the service-consuming application.
- 26. The report mode can be either push or
pull.
Organizations send a push report to users
according to a preset schedule.
◦ Users receive the report without any activity on
their part.
Users must request a pull report.
◦ To obtain a pull report, a user goes to a Web portal
or digital dashboard and clicks a link or button to
cause the reporting system to produce and deliver
the report.
- 27. Three functions of reporting systems are:
◦ Authoring
◦ Management
◦ Delivery
Report authoring involves connecting to data
sources, creating the reporting structure, and
formatting the report.
- 28. Source: Microsoft product screen shot reprinted with permission from Microsoft Corporation.
- 29. Source: Microsoft product screen shot reprinted with permission from Microsoft Corporation.
- 30. The purpose of report management is to
define who receives what reports, when, and
by what means.
Most report-management systems allow the
report administrator to define user accounts
and user groups and to assign particular
users to particular groups.
Reports that have been created using the
report-authoring system are assigned to groups
and users.
- 31. Assigning reports to groups saves the
administrator work.
◦ When a report is created, changed, or removed, the
administrator need only change the report
assignments to the group.
◦ All of the users in the group will inherit the
changes.
Metadata also indicates what channel is to be
used and whether the report is to be pushed
or pulled.
◦ If the report is to be pushed, the administrator
declares whether the report is to be generated on a
regular schedule or as an alert.
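One way to picture report-management metadata is as a pair of small lookup tables. The sketch below is only an illustration: the group names, user names, report names, and the fields `channel`, `mode`, and `trigger` are hypothetical and do not describe any particular product.

```python
# Hypothetical reporting metadata; every name here is illustrative.
groups = {
    "regional_managers": ["ana", "bo"],
    "executives": ["cy"],
}
report_assignments = {
    "weekly_sales": {"groups": ["regional_managers", "executives"],
                     "channel": "email", "mode": "push", "trigger": "schedule"},
    "order_cancelled": {"groups": ["regional_managers"],
                        "channel": "sms", "mode": "push", "trigger": "alert"},
    "inventory_query": {"groups": ["executives"],
                        "channel": "portal", "mode": "pull", "trigger": None},
}

def recipients(report_name):
    """Return the sorted users who inherit a report through their groups."""
    users = set()
    for group in report_assignments[report_name]["groups"]:
        users.update(groups[group])
    return sorted(users)

def pushed_reports(user):
    """List the push reports a user receives without requesting them."""
    return sorted(
        name for name, meta in report_assignments.items()
        if meta["mode"] == "push" and user in recipients(name)
    )
```

Because reports are assigned to groups rather than to individuals, adding a user to `groups["executives"]` is enough for that user to inherit every report assigned to the group.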
- 32. The report-delivery function of a reporting
system pushes reports or allows them to be
pulled according to report-management
metadata.
Reports can be delivered via an email server,
Web site, XML Web services, or by other
program-specific means.
The report-delivery system uses the
operating system and other program security
components to ensure that only authorized
users receive authorized reports.
- 33. The report-delivery system also ensures that
push reports are produced at appropriate
times.
For query reports, the report-delivery system
serves as an intermediary between the user
and the report generator.
◦ It receives user query data, such as item numbers
in an inventory query, passes the query data to the
report generator, receives the resulting report, and
delivers the report to the user.
- 34. RFM analysis is a way of analyzing and
ranking customers according to their
purchasing patterns.
It is a simple technique that considers how
recently (R) a customer has ordered, how
frequently (F) a customer orders, and how
much money (M) the customer spends per
order.
To produce an RFM score, the program first
sorts customer purchase records by the date
of their most recent (R) purchase.
- 35. In a common form of this analysis, the
program then divides the customers into five
groups and gives customers in each group a
score of 1 to 5.
◦ The top 20% of the customers having the most
recent orders are given an R score of 1 (highest).
The program then re-sorts the customers on
the basis of how frequently they order.
◦ The top 20% of the customers who order most
frequently are given an F score of 1 (highest).
Finally, the program sorts the customers again
according to the amount spent on their
orders.
◦ The 20% who have ordered the most expensive
items are given an M score of 1 (highest).
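The three sort-and-score passes described above can be sketched in Python. This is a minimal illustration, not a production scoring routine: the customer records, the field names, and the helpers `quintile_scores` and `rfm_scores` are hypothetical, and ties are broken simply by sort order.

```python
from datetime import date

# Hypothetical customer summaries; field names and values are illustrative.
customers = [
    {"id": "A", "last_order": date(2024, 6, 1), "orders": 12, "avg_amount": 50.0},
    {"id": "B", "last_order": date(2024, 1, 15), "orders": 30, "avg_amount": 20.0},
    {"id": "C", "last_order": date(2023, 11, 3), "orders": 2, "avg_amount": 900.0},
    {"id": "D", "last_order": date(2024, 5, 20), "orders": 7, "avg_amount": 120.0},
    {"id": "E", "last_order": date(2022, 8, 9), "orders": 1, "avg_amount": 15.0},
]

def quintile_scores(records, key):
    """Sort best-first by key and assign 1 (top 20%) through 5 (bottom 20%)."""
    ranked = sorted(records, key=key, reverse=True)
    n = len(ranked)
    return {c["id"]: (i * 5) // n + 1 for i, c in enumerate(ranked)}

def rfm_scores(records):
    """Return {customer id: (R, F, M)} using three independent sorts."""
    r = quintile_scores(records, key=lambda c: c["last_order"])  # most recent first
    f = quintile_scores(records, key=lambda c: c["orders"])      # most frequent first
    m = quintile_scores(records, key=lambda c: c["avg_amount"])  # biggest spend first
    return {c["id"]: (r[c["id"]], f[c["id"]], m[c["id"]]) for c in records}

scores = rfm_scores(customers)
```

With these sample records, customer C scores (4, 4, 1): an infrequent, not-recent buyer who spends heavily per order, while customer E scores (5, 5, 5) on every measure.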
- 36. A reporting system can generate the RFM
data and deliver it in many ways:
◦ A report of RFM scores for all customers can be
pushed to the vice president of sales.
◦ Reports with scores for particular regions can be
pushed to regional sales managers.
◦ Reports of scores for particular accounts can be
pushed to the account salespeople.
◦ All of this reporting can be automated.
- 38. Online analytical processing (OLAP) provides
the ability to sum, count, average, and
perform other simple arithmetic operations
on groups of data.
The remarkable characteristic of OLAP
reports is that they are dynamic.
The viewer of the report can change the
report’s format; hence the term online.
- 39. An OLAP report has measures and
dimensions.
A measure is the data item of interest.
◦ It is the item that is to be summed or averaged or
otherwise processed in the OLAP report.
A dimension is a characteristic of a measure.
◦ Purchase date, customer type, customer location, and
sales region are all examples of dimensions.
- 40. With an OLAP report, it is possible to drill
down into the data.
◦ This term means to further divide the data into
more detail.
Special-purpose products called OLAP servers
have been developed to perform OLAP
analysis.
An OLAP server reads data from an
operational database, performs preliminary
calculations, and stores the results of those
operations in an OLAP database.
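Measures, dimensions, and drill-down can be shown with a small Python sketch. The sales facts and the `summarize` helper are assumptions made for illustration; a real OLAP server would precompute and store such totals rather than scan rows on each request.

```python
from collections import defaultdict

# Hypothetical sales facts: "sales" is the measure; the other fields are dimensions.
facts = [
    {"region": "West", "customer_type": "Retail", "city": "Seattle", "sales": 100},
    {"region": "West", "customer_type": "Retail", "city": "Portland", "sales": 40},
    {"region": "West", "customer_type": "Wholesale", "city": "Seattle", "sales": 300},
    {"region": "East", "customer_type": "Retail", "city": "Boston", "sales": 80},
]

def summarize(rows, dimensions, measure="sales"):
    """Sum the measure for each combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

# The viewer changes the grouping structure simply by choosing other dimensions.
by_region = summarize(facts, ["region"])
by_type = summarize(facts, ["customer_type"])
# Drill down: divide each region's total into more detail by city.
drill_down = summarize(facts, ["region", "city"])
```

Changing the `dimensions` argument corresponds to the viewer regrouping the report, and adding a dimension corresponds to drilling down.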
- 45. Basic reports and simple OLAP analyses can
be made directly from operational data.
For the most part, such reports display the
current state of the business; and if there are
a few missing values or small inconsistencies
with the data, no one is too concerned.
Operational data are unsuited to more
sophisticated analyses, particularly
data-mining analyses that require high-quality
input for accurate and useful results.
- 46. Many organizations choose to extract
operational data into facilities called data
warehouses and data marts, both of which
are facilities that prepare, store, and manage
data specifically for data mining and other
analyses.
Programs read operational data and extract,
clean, and prepare that data for BI
processing.
The prepared data are stored in a
data-warehouse database using a data-warehouse
DBMS, which can be different from the
organization’s operational DBMS.
- 47. Data warehouses include data that are
purchased from outside sources.
Metadata concerning the data, its source, its
format, its assumptions and constraints, and
other facts about the data is kept in a data-
warehouse metadata database.
The data-warehouse DBMS extracts and
provides data to business intelligence tools
such as data-mining programs.
- 50. Most operational and purchased data have
problems that inhibit their usefulness for
business intelligence.
Problematic data are termed dirty data.
◦ Examples are values of B for customer gender and
of 213 for customer age.
Purchased data often contain missing elements.
◦ Most data vendors state the percentage of missing
values for each attribute in the data they sell.
◦ An organization buys such data because for some
uses, some data is better than no data at all.
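A data-cleaning step might flag dirty and missing values along the following lines. This is a sketch under stated assumptions: the set of valid gender codes, the maximum plausible age, and the helper names `find_problems` and `missing_rate` are hypothetical choices, not rules from the slides.

```python
# Hypothetical validity rules; the allowed codes and age limit are assumptions.
VALID_GENDERS = {"F", "M"}
MAX_AGE = 120

def find_problems(record):
    """Return a list of data-quality problems found in one customer record."""
    problems = []
    gender = record.get("gender")
    age = record.get("age")
    if gender is None:
        problems.append("gender missing")
    elif gender not in VALID_GENDERS:
        problems.append(f"gender has invalid value {gender!r}")
    if age is None:
        problems.append("age missing")
    elif not 0 <= age <= MAX_AGE:
        problems.append(f"age {age} out of range")
    return problems

def missing_rate(records, field):
    """Fraction of records in which the field is absent or None."""
    return sum(r.get(field) is None for r in records) / len(records)
```

For example, `find_problems({"gender": "B", "age": 213})` reports both of the dirty values mentioned on the slide, and `missing_rate` gives the kind of per-attribute percentage that data vendors publish.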
- 51. Inconsistent data are particularly common for
data that have been gathered over time.
◦ When an area code changes, for example, the
phone number for a given customer before the
change will not match the customer’s number after
the change.
Some data inconsistencies arise from the
nature of the business activity.
Nonintegrated data can cause problems when
data comes from different management
information systems.
- 52. Data can be too fine or too coarse.
◦ It is possible to capture the customer’s clicking
behavior in what is termed clickstream data, which
includes everything a customer does at a Web site.
If data are too fine or too coarse, that condition
is sometimes expressed by saying the data
have the wrong granularity.
Because of a phenomenon called the curse of
dimensionality, the more attributes there are,
the easier it is to build a model that fits the
sample data but that is worthless as a
predictor.
- 54. The data warehouse takes data from the data
manufacturers (operational systems and
purchased data), cleans and processes the
data, and locates the data on the shelves, so
to speak, of the data warehouse.
A data mart is a data collection, smaller than
the data warehouse, that addresses a
particular component or functional area of
the business.
- 55. The data warehouse is like the distributor in
the supply chain and the data mart is like the
retail store in the supply chain.
Users in the data mart obtain data that
pertain to a particular business function from
the data warehouse.
It is expensive to create, staff, and operate
data warehouses and data marts.
- 57. Data mining is the application of statistical
techniques to find patterns and relationships
among data and to classify and predict.
Data mining represents a convergence of
disciplines.
Data-mining techniques emerged from
statistics and mathematics and from artificial
intelligence and machine-learning fields in
computer science.
- 59. With unsupervised data mining, analysts do
not create a model or hypothesis before
running the analysis.
Instead, they apply the data-mining
technique to the data and observe the results.
Analysts create hypotheses after the analysis
to explain the patterns found.
- 60. One common unsupervised technique is
cluster analysis.
◦ A common use for cluster analysis is to find groups
of similar customers from customer order and
demographic data.
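The slides do not name a particular clustering algorithm. As one possible illustration, the sketch below uses k-means, a widely used technique, on hypothetical (age, orders per year) pairs. The data, the `kmeans` helper, and the naive choice of the first k points as starting centroids are all assumptions made to keep the example short and repeatable.

```python
def kmeans(points, k, iterations=20):
    """Very small k-means for 2-D points; returns (centroids, labels)."""
    centroids = list(points[:k])  # naive deterministic start: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                        + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Hypothetical customers as (age, orders per year).
customers = [(25, 2), (62, 20), (27, 3), (60, 18), (24, 1), (65, 22)]
centroids, labels = kmeans(customers, k=2)
```

No hypothesis is stated in advance: the analyst looks at the resulting groups (here, younger occasional buyers and older frequent buyers) and explains them afterward, which is what makes the technique unsupervised.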
- 61. With supervised data mining, data miners
develop a model prior to the analysis and
apply statistical techniques to data to
estimate parameters of the model.
One such analysis, which measures the
impact of a set of variables on another
variable, is called a regression analysis.
Neural networks are another popular
supervised data-mining technique used to
predict values and make classifications such
as “good prospect” or “poor prospect”
customers.
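To make "estimating the parameters of a model" concrete, the sketch below fits the simplest regression model, a straight line y = a + b*x with a single explanatory variable, by ordinary least squares. The slide describes a set of variables; one variable is used here only to keep the example short. The data and the `fit_line` helper are hypothetical.

```python
def fit_line(xs, ys):
    """Estimate intercept a and slope b of y = a + b*x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: advertising spend (x) versus units sold (y).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units = [12.0, 14.0, 16.0, 18.0, 20.0]
intercept, slope = fit_line(ad_spend, units)

# Use the fitted model to predict a new case.
predicted = intercept + slope * 6.0
```

The model form (a line) is chosen before the analysis, and the data are used only to estimate its parameters, which is the sense in which the slide calls the technique supervised.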
- 62. A market-basket analysis is a data-mining
technique for determining sales patterns.
A market-basket analysis shows the products
that customers tend to buy together.
In market-basket terminology, support is the
probability that two items will be purchased
together.
You can expect market-basket analysis to
become a standard CRM analysis during your
career.
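Support, as defined above, can be computed directly from a list of transactions. The sketch also computes confidence and lift, which appear in the key terms list; the definitions used here are the common ones (confidence as the conditional probability of one item given another, lift as confidence divided by the base probability) and are stated as assumptions because the slides do not spell them out. The baskets are invented.

```python
# Hypothetical transactions; each basket is the set of items in one sale.
transactions = [
    {"mask", "fins"},
    {"mask", "fins", "tank"},
    {"mask"},
    {"tank"},
    {"fins", "tank"},
]

def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(a, b, baskets):
    """Estimated probability of buying b given that a was bought."""
    return support({a, b}, baskets) / support({a}, baskets)

def lift(a, b, baskets):
    """Confidence divided by the base probability of buying b."""
    return confidence(a, b, baskets) / support({b}, baskets)

mask_fins_support = support({"mask", "fins"}, transactions)
mask_fins_confidence = confidence("mask", "fins", transactions)
mask_fins_lift = lift("mask", "fins", transactions)
```

A lift above 1 suggests that buying the first item makes the second more likely than its base rate; with these five baskets the lift of fins given a mask is about 1.11.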
- 64. A decision tree is a hierarchical arrangement
of criteria that predict a classification or a
value.
Decision tree analyses are an unsupervised
data-mining technique.
The analyst sets up the computer program
and provides the data to analyze, and the
decision tree program produces the tree.
- 66. A common business application of decision
trees is to classify loans by likelihood of
default.
Organizations analyze data from past loans
to produce a decision tree that can be
converted to loan-decision rules.
◦ A financial institution could use such a tree to
assess the default risk on a new loan.
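Once a tree has been produced, its branches can be read off as If/Then rules. The sketch below shows only the shape such rules might take: the attribute names and every threshold are hypothetical and were not derived from any real loan data or from the tree in the slides.

```python
def default_risk(loan):
    """Classify a loan with hand-written rules shaped like decision-tree branches.

    All attribute names and cut-off values are illustrative assumptions.
    """
    # If half or more of the payments are past due, classify as high risk.
    if loan["percent_past_due"] >= 0.5:
        return "high"
    # Otherwise, a low credit score alone is enough for high risk.
    if loan["credit_score"] < 570:
        return "high"
    # Otherwise, a very high loan-to-value ratio means medium risk.
    if loan["current_ltv"] > 0.94:
        return "medium"
    return "low"

new_loan = {"percent_past_due": 0.1, "credit_score": 700, "current_ltv": 0.8}
risk = default_risk(new_loan)
```

In practice the tree program, not the analyst, would choose the attributes and cut-off values from past loan data; the rules are then applied to each new application.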
- 67. Source: Used with permission of Insightful Corporation. Copyright © 1999-2005 Insightful Corporation. All Rights Reserved.
- 68. Knowledge management systems concern the
sharing of knowledge that is already known
to exist, either in libraries of documents, in
the heads of employees, or in other known
sources.
Knowledge management (KM) is the process
of creating value from intellectual capital and
sharing that knowledge with employees,
managers, suppliers, customers, and others
who need that capital.
- 69. Knowledge management is a process that is
supported by the five components of an
information system.
◦ Its emphasis is on people, their knowledge, and
effective means for sharing that knowledge with
others.
The benefits of KM concern the application of
knowledge to enable employees and others to
leverage organizational knowledge to work
smarter.
KM preserves organizational memory by
capturing and storing the lessons learned and
best practices of key employees.
- 70. Content management systems are
information systems that track organizational
documents, Web pages, graphics, and related
materials.
Such systems differ from operational
document systems in that they do not directly
support business operations.
KM content management systems are
concerned with the creation, management,
and delivery of documents that exist for the
purpose of imparting knowledge.
- 71. Typical users of content management
systems are companies that sell complicated
products and want to share their knowledge
of those products with employees and
customers.
The basic functions of content management
systems are the same as for report
management systems: author, manage, and
deliver.
The only requirement that content managers
place on document authoring is that the
document has been created in a standardized format.
- 72. Content management functions are, however,
exceedingly complicated.
Most content databases are huge; some have
thousands of individual documents, pages, and
graphics.
- 73. Documents may refer to one another or multiple
documents may refer to the same product or
procedure.
◦ When one of them changes, others must change as
well.
◦ Some content management systems keep semantic
linkages among documents so that content
dependencies can be known and used to maintain
document consistency.
- 74. Document contents are perishable.
◦ Documents become obsolete and need to be altered,
removed, or replaced.
Multinational companies have to ensure
document language translations.
- 75. Source: microsoft.com/backstage/inside.htm (accessed February 2004). © 2003 Microsoft Corporation. All rights reserved.
- 76. Source: Used with permission of Tom Rizzo of Microsoft Corporation.
- 77. Source: Used with permission of Tom Rizzo of Microsoft Corporation.
- 78. Almost all users of content management
systems pull the contents.
Users cannot pull content if they do not know
it exists.
◦ The content must be arranged and indexed, and a
facility for searching the content devised.
Documents that reside behind a corporate
firewall, however, are not publicly accessible
and will not be reachable by Google or other
search engines.
◦ Organizations must index their own proprietary
documents and provide their own search capability
for them.
- 79. Web browsers and other programs can readily
format content expressed in HTML, PDF, or
another standard format.
XML documents often contain their own
formatting rules that browsers can interpret.
◦ The content management system will have to
determine an appropriate format for content
expressed in other ways.
- 80. Nothing is more frustrating for a manager to
contemplate than the situation in which one
employee struggles with a problem that another
employee knows how to solve easily.
KM systems are concerned not only with the sharing
of content but also with the sharing of
knowledge among humans.
◦ How can one person share her knowledge with another?
◦ How can one person learn of another person’s great
idea?
- 81. Three forms of technology are used for
knowledge-sharing among humans:
◦ Portals, discussion groups, and email
◦ Collaboration systems
◦ Expert systems
Portals
◦ Employees can share ideas by posting knowledge on a
Web portal whereby managers and employees can pull
the knowledge from the portal.
- 83. Discussion Groups
◦ Discussion groups allow employees or customers to
post questions and queries seeking solutions to
problems they have.
◦ Oracle, IBM, PeopleSoft, and other vendors support
product discussion groups where users can post
questions and where employees, vendors, and other
users can answer them.
◦ Later, the organization can edit and summarize the
questions from such discussion groups into frequently
asked questions (FAQs).
- 84. Discussion groups (continued)
◦ Basic email can also be used for knowledge-sharing,
especially if email lists have been constructed with KM
in mind.
◦ Two human factors inhibit knowledge-sharing.
Employees can be reluctant to exhibit their ignorance.
Competition exists between employees.
◦ A KM application may be ill-suited to a competitive
group.
The company may be able to restructure rewards and
incentives to foster sharing of ideas among employees.
- 85. Collaboration Systems
◦ Collaboration systems are information systems that
enable people to work together more effectively.
◦ The Internet can be used as a broadcast medium for
speeches, panel discussions, and other types of
meetings.
◦ Web broadcasts, because they are digital, can be readily
saved and replayed at the viewer’s convenience.
◦ Web broadcasts can also be made interactive by
combining them with discussion group bulletin boards
that are live during the broadcast.
◦ Video conferencing is another popular form of
IT-supported meeting.
Video-conferencing equipment is expensive and normally is
located in selected sites in the organization.
- 86. Collaboration Systems (continued)
◦ Net meetings are a means by which individuals can
participate in remote meetings without leaving their
desk.
With a speaker and a Web camera, virtual meetings can be
conducted among employees who sit in their own offices.
- 88. Expert Systems
◦ Expert systems are created by interviewing experts in a
given business domain and codifying the rules stated
by those experts.
◦ Many expert systems were created in the late 1980s
and 1990s, and some of them have been successful.
◦ Expert systems suffer from three major disadvantages.
They are difficult and expensive to develop.
They are difficult to maintain.
They have been unable to live up to the high expectations set by
their name.
- 89. Enormous amounts of data are generated
each year.
Business intelligence (BI) tools search these
increasing amounts of data for useful
information.
Reporting tools, which tend to be used for
assessment, process data using simple
calculations such as sums and averages.
- 90. Data-mining tools, which tend to be used for
prediction, process data using sophisticated
statistical and mathematical techniques.
Reporting systems create meaningful
information from disparate data sources and
deliver that information to the proper user on
a timely basis.
RFM and OLAP are two examples of report
applications.
- 91. Data warehouses and data marts are facilities
that prepare, store, and manage data for data
mining and other analyses.
Market-basket analysis determines
groups of products that customers tend to
purchase together.
Decision trees are used to construct
“If…Then…” rules for predicting
classifications.
- 92. Knowledge management is the process of
creating value from intellectual capital and
sharing that knowledge with employees,
managers, suppliers, customers, and others
who need that capital.
Human knowledge-sharing systems use
portals, bulletin boards, and email to
facilitate knowledge interchange.
Collaboration systems include net
conferencing, video conferencing, and expert
systems.
- 93. Business intelligence (BI) systems
Business intelligence (BI) tools
Clickstream data
Cluster analysis
Collaboration systems
Confidence
Content management systems
Curse of dimensionality
Data mart
Data mining
Data-mining tools
Data warehouse
Decision trees
Digital dashboard
Dimension
Dirty data
Discussion groups
Drill down
Dynamic report
Exabyte
Expert systems
Frequently asked questions (FAQs)
- 94. Granularity
If…then… rules
Knowledge management (KM)
Lift
Market-basket analysis
Measure
Neural networks
OLAP cube
OLAP server
Online analytical processing (OLAP)
Petabyte
Portals
Pull report
Push report
Query report
Regression analysis
Report media
Report mode
Report type
Reporting systems
Reporting tools
RFM analysis
Semantic security
- 95. Static report
Supervised data mining
Support
Unsupervised data mining
- 96. Security is a very difficult problem, and it gets
worse every year.
Physical security is hard enough: How do we know
that the person (or program) that signs on as
Megan Cho is really Megan Cho?
We use passwords, but files of passwords can be
stolen.
Suppose Megan works in the HR department, so
she has access to personal and private data of
other employees.
- 97. We need to design the reporting system so that
Megan can access all of the data she needs to do
her job, and no more.
A reporting server is an obvious and juicy target
for any would-be intruder.
Someone can break in and change access permissions.
Or, a hacker could pose as someone else to obtain
reports.
- 98. Semantic security concerns the unintended
release of a combination of reports or documents
that are independently not protected.
Megan was given just two reports to do her job.
Yet she combined the information in those reports with
publicly available information and was able to deduce
salaries for at least some employees.
These salaries are much more than she is supposed to
know.
This is a semantic security problem.
- 99. The product managers wanted the data miners to
analyze customer clicks on a Web page to
determine customer preferences for particular
product lines.
The products were competing with one another for
resources.
“Sampling?” asked the product managers in a chorus.
“Sampling? No way. We want all the data. This is
important, and we don’t want a guess.”
- 100. There’s nothing wrong with sampling.
Properly done, the results from a sample are just as
accurate as results from the complete data set.
Studies done from samples are also cheaper and faster.
Sampling is a great way to save time and money.
In truth, skill is required to develop a good sample.
The product managers should have listened to the data
miners’ sampling plan and ensured that the sample
would be appropriate, given the goals of the study.
Understanding this concept will save you and your
organization substantial money!
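A small simulation can show why a properly drawn sample is usually good enough. The sketch below is illustrative only: the "clicks per visitor" population is randomly generated rather than real clickstream data, the seed and sizes are arbitrary, and a simple random sample stands in for whatever sampling plan the data miners would actually design.

```python
import random

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Hypothetical clickstream summary: clicks per visitor for 100,000 visitors.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [rng.randint(0, 100) for _ in range(100_000)]

# Simple random sample without replacement: 1% of the data.
sample = rng.sample(population, 1_000)

full_mean = mean(population)
sample_mean = mean(sample)
difference = abs(full_mean - sample_mean)
```

With a sample of this size the two means typically differ by about one click or less, while the sample needs only one hundredth of the processing. A badly drawn sample (for example, only the first 1,000 visitors of the day) would carry no such assurance, which is why skill is required.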
- 101. Classification is a useful human skill.
Sorting and classifying are necessary, important,
and essential activities.
But those activities can also be dangerous.
Serious ethical issues arise when we classify
people.
What makes someone a good or bad “prospect”?
If we’re talking about classifying customers in order to
prioritize our sales calls, then the ethical issue may not
be too serious.
What about classifying applicants for college?
- 102. I’m not really a contrarian about data mining.
I believe in it.
But data mining in the real world is a lot different from
the way it’s described in textbooks.
One problem is that data are always dirty, with missing
values, values way out of the range of possibility, and
time values that make no sense.
Another problem is that you know the least when
you start the study.
So you work for a few months and learn that if you had
another variable, say the customer’s zip code, or age, or
something else, you could do a much better analysis.
- 103. Overfitting is another problem, a huge one.
With neural networks, you can create a model of any
level of complexity you want, except that none of those
equations will predict new cases with any accuracy at all.
When using neural nets, you have to be very careful not
to overfit the data.
Another problem is seasonality:
Say all your training data are from the summer. Will your
model be valid for the winter?
- 104. When you start a data-mining project, you never
know how it will turn out.
Some projects were bad and a waste of time.
Some were good: they found interesting and
important patterns and information and created very
accurate predictive models.
It’s not easy, though; you have to be very careful
and lucky.
- 105. Computer simulation of World War III: a project at
the Pentagon, 1971-1973
Analysis process
Run the simulation and obtain a set of results.
The military analysts and weapons experts would
examine the results, and if results weren’t quite what
was expected or wanted, the analysts would ask to
change some of the inputs or a portion of the model.
Over time, an accumulated set of results was approved.
The accumulated results were presented to the four-star
generals and other senior Pentagon managers.
Sometimes these senior people would see problems
in the analyses and give instructions to discard
some of the results.
- 106. Observation
I do not believe that anyone thought they were deceiving
anyone else.
The top managers didn’t realize that the results they saw
left out a substantial portion of the unfavorable
simulations.
They never knew about the other results.
The analysts who were filtering the outcomes by
throwing out the numbers didn’t like being dishonest.
They simply thought that those results were wrong
or unrealistic.
I do not think they realized they were using the
computer to promulgate their prior ideas about
military needs.
- 107. Questions to think about
Why perform the analysis?
What are you going to do with the results?
What is it that you want to know or to decide?
Answer the questions above before you begin the
analysis.
Then, pay attention to the results.
Don’t argue with the data.
If the results don’t conform to your expectations, think
long and hard about changing the model, adjusting the
data, or modifying the answers.