TYBSC-CS SEM 5 (AI) 2018-19 NOTES
FOR PROGRAMS AND SOLUTION REFER CLASSROOM NOTES
WE-IT TUTORIALS CLASSES FOR BSC-IT AND BSC-CS (THANE) 8097071144/55, WEB www.weit.in
Unit 1 Chapter 1
What Is AI
Definition of AI
• Artificial Intelligence is a branch of Science which deals with helping machines find
solutions to complex problems in a more human-like fashion.
• This generally involves borrowing characteristics from human intelligence and applying
them as algorithms in a computer-friendly way.
• A more or less flexible or efficient approach can be taken depending on the established
requirements, which influences how artificial the intelligent behavior appears.
• AI is generally associated with Computer Science, but it has many important links with other
fields such as Maths, Psychology, Cognition, Biology and Philosophy, among many others. Our
ability to combine knowledge from all these fields will ultimately benefit our progress in the
quest of creating an intelligent artificial being.
AI is a branch of computer science which is concerned with the study and creation of computer
systems that exhibit
• some form of intelligence
or
• those characteristics which we associate with intelligence in human behavior
What is intelligence?
Intelligence is a property of mind that encompasses many related mental abilities, such as the
capabilities to
• reason
• plan
• solve problems
• think abstractly
• comprehend ideas and language and
• learn
THE FOUNDATIONS OF ARTIFICIAL INTELLIGENCE
Foundation of AI is based on
1. Mathematics
2. Neuroscience
3. Control Theory
4. Linguistics
Mathematics
• More formal logical methods
• Boolean logic
• Fuzzy logic
• Uncertainty
• Most modern approaches to handling uncertainty in AI applications are based on
o Probability theory
o Modal and Temporal logics
Neuroscience
How does the brain work?
• Early studies (1824) relied on injured and abnormal patients to understand which parts of
the brain do what
• More recent studies use accurate sensors to correlate brain activity to human thought
o By monitoring individual neurons, monkeys can now control a computer mouse
using thought alone
• Extrapolating Moore’s law suggests that computers will have as many gates as humans have
neurons by about 2020
• How close are we to having a mechanical brain?
o Parallel computation, remapping, interconnections
Control Theory
• Machines can modify their behavior in response to the environment (sense/action loop)
o Water-flow regulator, steam engine governor, thermostat
• The theory of stable feedback systems (1894)
o Build systems that transition from initial state to goal state with minimum
energy
o In 1950, control theory could only describe linear systems and AI largely rose as
a response to this shortcoming
Linguistics
• Speech demonstrates so much of human intelligence
o Analysis of human language reveals thought taking place in ways not understood
in other settings
▪ Children can create sentences they have never heard before
▪ Language and thought are believed to be tightly intertwined
History of Artificial Intelligence
1. 1943 McCulloch & Pitts: Boolean circuit model of brain
2. 1950 Turing's "Computing Machinery and Intelligence"
3. 1956 Dartmouth meeting: "Artificial Intelligence" adopted
4. 1950s Early AI programs, including Samuel's checkers program, Newell & Simon's
Logic Theorist, and Gelernter's Geometry Engine
5. 1965 Robinson's complete algorithm for logical reasoning
6. 1966—73 AI discovers computational complexity; neural network research almost
disappears
7. 1969—79 Early development of knowledge-based systems
8. 1980-- AI becomes an industry
9. 1986-- Neural networks return to popularity
10. 1987-- AI becomes a science
11. 1995-- The emergence of intelligent agents
THE STATE OF THE ART
The highest level of development, as of a device, technique, or scientific field, achieved at a particular
time.
• A program that plays chess automatically
• A program that books tickets online via voice and suggests the best booking rates and days
• An auto-driven car that lets the driver relax and drives on the highway without the driver
touching the steering wheel
• A full diagnostic system that checks symptoms and gives a proper solution according to the
test cases
Unit 1 Chapter 2
Intelligent Agents:
Intelligent Systems: Categorization of Intelligent System
1. Systems that think like humans
2. Systems that act like humans
3. Systems that think rationally
4. Systems that act rationally
Systems that think like humans
• Most of the time it is a black box; we are not clear about our own thought processes.
• One has to know the functioning of the brain and its mechanism for processing information.
• It is an area of cognitive science.
o The stimuli are converted into mental representation.
o Cognitive processes manipulate representation to build new representations
that are used to generate actions.
• Neural network is a computing model for processing information similar to brain.
Systems that act like humans
• The overall behavior of the system should be human like.
• It could be achieved by observation.
Systems that think rationally
• Such systems rely on logic rather than humans to measure correctness.
• For thinking rationally or logically, logic formulas and theories are used for synthesizing
outcomes.
• For example,
o given John is a human and all humans are mortal then one can conclude logically
that John is mortal
• Not all intelligent behavior is mediated by logical deliberation.
Systems that act rationally
• Rational behavior means doing the right thing.
• Even if the method is illogical, the observed behavior must be rational.
Components of AI Program
AI techniques must be independent of the problem domain as far as possible.
• AI program should have
o knowledge base
o navigational capability
o inferencing
Knowledge Base
• AI programs should be learning in nature and should update their knowledge accordingly.
• Knowledge base consists of facts and rules.
• Characteristics of Knowledge:
o It is voluminous in nature and requires proper structuring
o It may be incomplete and imprecise
o It may keep on changing (dynamic)
Navigational Capability
• Navigational capability contains various control strategies
• Control Strategy
o determines the rule to be applied
o some heuristics (rules of thumb) may be applied
Inferencing
• Inferencing requires
• searching through the knowledge base and
• deriving new knowledge
Agents
The branch of computer science concerned with making computers behave like humans.
“Artificial Intelligence is the study of human intelligence such that it can be replicated
artificially.”
An agent is anything that can be viewed as perceiving its environment through sensors and
acting upon that environment through effectors.
• A human agent has eyes, ears, and other organs for sensors, and hands, legs, mouth, and
other body parts for effectors.
• A robotic agent substitutes cameras and infrared range finders for the sensors and
various motors for the effectors.
• A software agent has encoded bit strings as its percepts and actions.
Agents interact with environments through sensors and actuators.
Simple Terms
• Percept
o Agent’s perceptual inputs at any given instant
• Percept sequence: Complete history of everything that the agent has ever perceived
• Agent’s behavior is mathematically described by
o Agent function
o A function mapping any given percept sequence to an action
• Practically it is described by
• An agent program
• The real implementation
Example: Vacuum-cleaner world
• Percepts: location and status, e.g. Is the square Clean or Dirty? Which square is the agent in?
• Actions: Move Left, Move Right, Suck (clean), do nothing (NoOp)
A vacuum-cleaner world with just two locations.
The following program implements the agent function tabulated in the figure above:
function REFLEX-VACUUM-AGENT([location, status]) returns an action
if status = Dirty then return Suck
else if location = A then return Right
else if location = B then return Left
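The same agent can be written as a short program. Below is a minimal Python sketch; the two locations, statuses, and action names follow the pseudocode above, while the function name and the simulation loop are illustrative additions.

```python
def reflex_vacuum_agent(location, status):
    """Return an action for the two-location vacuum world."""
    if status == "Dirty":
        return "Suck"
    elif location == "A":
        return "Right"
    else:  # location == "B"
        return "Left"

# Tiny illustrative simulation: both squares start dirty, agent starts at A.
world = {"A": "Dirty", "B": "Dirty"}
location = "A"
for step in range(4):
    action = reflex_vacuum_agent(location, world[location])
    print(step, location, world[location], "->", action)
    if action == "Suck":
        world[location] = "Clean"
    else:
        location = "B" if action == "Right" else "A"
```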
Concept of Rationality
• A rational agent is one that does the right thing. As a first approximation, we will
say that the right action is the one that will cause the agent to be most successful.
• That leaves us with the problem of deciding how and when to evaluate the agent's
success.
• We use the term performance measure for the how—the criteria that determine
how successful an agent is.
• In summary, what is rational at any given time depends on four things:
o The performance measure that defines degree of success.
o Everything that the agent has perceived so far. We will call this complete
perceptual history the percept sequence.
o What the agent knows about the environment.
o The actions that the agent can perform.
This leads to a definition of an ideal rational agent
For each possible percept sequence, an ideal rational agent should do whatever action is
expected to maximize its performance measure, on the basis of the evidence provided by the
percept sequence and whatever built-in knowledge the agent has.
Example of a rational agent
• Performance measure
o Awards one point for each clean square
▪ at each time step, over 10000 time steps
• Prior knowledge about the environment
o The geography of the environment
o Only two squares
o The effect of the actions
• Actions that the agent can perform
o Left, Right, Suck and NoOp (No Operation)
• Percept sequence
o Where is the agent?
o Does the location contain dirt?
Under these circumstances, the agent is rational.
Environment
When designing artificial intelligence (AI) solutions, we spend a lot of time focusing on aspects such as the
nature of the learning algorithms (e.g. supervised, unsupervised, semi-supervised) or the characteristics of the
data (e.g. classified, unclassified). However, little attention is often paid to the nature of the
environment in which the AI solution operates. As it turns out, the characteristics of the environment are
one of the key elements in determining the right models for an AI solution.
There are several aspects that distinguish AI environments. The shape and frequency of the data, the
nature of the problem, and the volume of knowledge available at any given time are some of the elements
that differentiate one type of AI environment from another. Understanding the characteristics of the AI
environment is one of the first tasks AI practitioners focus on in order to tackle a specific AI
problem. From that perspective, there are several categories we use to group AI problems based on the
nature of the environment.
The Nature of Environment
PEAS: To design a rational agent, we must specify the task environment
Consider, e.g., the task of designing an automated taxi:
Performance measure?
o The performance measure should make the taxi the most successful agent possible, i.e. give
flawless performance.
o e.g. Safety, destination, profits, legality, comfort, . . .
Environment?
o It is the first step in designing an agent. We should specify an environment suitable for the
agent's actions: if swimming is the task for an agent, then the environment must be water, not air.
o e.g. Streets/freeways, traffic, pedestrians, weather . . .
Actuators?
o These are the important parts of the agent through which it performs actions in the related,
specified environment.
o e.g. Steering, accelerator, brake, horn, speaker/display, . . .
Sensors?
o These are the means by which the agent receives different attributes from the environment.
o e.g. Cameras, accelerometers, gauges, engine sensors, keyboard, GPS . . .
(In designing an agent, the first step must always be to specify the task environment as fully as
possible)
Properties of task environments
Fully observable vs. partially observable
If an agent’s sensors give it access to the complete state of the environment at each point in time,
then the environment is effectively fully observable, i.e. the sensors detect all aspects that
are relevant to the choice of action.
An environment might be partially observable because of noisy and inaccurate sensors, or because
parts of the state are simply missing from the sensor data.
Deterministic vs. nondeterministic (stochastic)
If the next state of the environment is completely determined by the current state and the actions
selected by the agents, then we say the environment is deterministic. In principle, an agent need not
worry about uncertainty in an accessible, deterministic environment. If the environment is
inaccessible, however, then it may appear to be nondeterministic. This is particularly true if the
environment is complex, making it hard to keep track of all the inaccessible aspects. Thus, it is often
better to think of an environment as deterministic or nondeterministic from the point of view of the
agent.
E.g. taxi driving (nondeterministic); the vacuum world as described (deterministic)
Episodic vs. sequential (non-episodic).
In an episodic environment, the agent’s experience is divided into “episodes.” Each episode consists
of the agent perceiving and then acting. The quality of its action depends just on the episode itself,
because subsequent episodes do not depend on what actions occur in previous episodes. Episodic
environments are much simpler because the agent does not need to think ahead.
E.g. chess (sequential)
Static vs. dynamic.
If the environment can change while an agent is deliberating, then we say the environment is dynamic
for that agent; otherwise it is static. Static environments are easy to deal with because the agent need
not keep looking at the world while it is deciding on an action, nor need it worry about the passage
of time. If the environment does not change with the passage of time but the agent’s performance
score does, then we say the environment is semi dynamic.
Discrete vs. continuous.
If there are a limited number of distinct, clearly defined percepts and actions we say that the
environment is discrete. Chess is discrete—there are a fixed number of possible moves on each turn.
Taxi driving is continuous—the speed and location of the taxi and the other vehicles sweep through
a range of continuous values.
Single agent vs. multiagent
Playing a crossword puzzle is a single-agent environment; chess playing involves two agents.
• Competitive multiagent environment: chess playing
• Cooperative multiagent environment: automated taxi drivers avoiding collisions
Examples of task environments
Structure of Intelligent agents
• The job of AI is to design the agent program: a function that implements the agent mapping
from percepts to actions.
• We assume this program will run on some sort of computing device, which we will call the
architecture. Obviously, the program we choose has to be one that the architecture will
accept and run.
• The architecture might be a plain computer, or it might include special‐purpose hardware for
certain tasks, such as processing camera images or filtering audio input. It might also include
software that provides a degree of insulation between the raw computer and the agent
program, so that we can program at a higher level.
• In general, the architecture makes the percepts from the sensors available to the program,
runs the program, and feeds the program's action choices to the effectors as they are generated.
• The relationship among agents, architectures, and programs can be summed up as follows:
• Agent = architecture + program
• Software agents (or software robots or softbot) exist in rich, unlimited domains. Imagine a
softbot designed to fly a flight simulator for a 747.
• The simulator is a very detailed, complex environment, and the software agent must choose
from a wide variety of actions in real time.
• Now we have to decide how to build a real program to implement the mapping from percepts
to action.
• We will find that different aspects of driving suggest different types of agent program.
Intelligent agents are categorized into five classes based on their degree of perceived
intelligence and capability:
1. Simple reflex agents
2. Model-based reflex agents
3. Goal-based agents
4. Utility-based agents
5. Learning agents
Simple reflex agents
• Simple reflex agents act only on the basis of the current percept, ignoring the rest of the
percept history.
• The agent function is based on the condition-action rule: if condition then action.
• This agent function only succeeds when the environment is fully observable.
• Some reflex agents can also contain information on their current state which allows them
to disregard conditions whose actuators are already triggered.
• Infinite loops are often unavoidable for simple reflex agents operating in partially
observable environments.
• Note: If the agent can randomize its actions, it may be possible to escape from infinite
loops.
Model-based reflex agents
• A model-based agent can handle a partially observable environment.
• Its current state is stored inside the agent maintaining some kind of structure which
describes the part of the world which cannot be seen.
• This knowledge about "how the world works" is called a model of the world, hence the
name "model-based agent".
• A model-based reflex agent should maintain some sort of internal model that depends on
the percept history and thereby reflects at least some of the unobserved aspects of the
current state.
• It then chooses an action in the same way as the reflex agent.
Goal-based agents
• Goal-based agents further expand on the capabilities of the model-based agents, by using
"goal" information.
• Goal information describes situations that are desirable.
• This allows the agent a way to choose among multiple possibilities, selecting the one
which reaches a goal state.
• Search and planning are the subfields of artificial intelligence devoted to finding action
sequences that achieve the agent's goals.
• In some instances the goal-based agent appears to be less efficient, but it is more flexible
because the knowledge that supports its decisions is represented explicitly and can be
modified.
Utility-based agents
• Goal-based agents only distinguish between goal states and non-goal states.
• It is possible to define a measure of how desirable a particular state is.
• This measure can be obtained through the use of a utility function which maps a state to
a measure of the utility of the state.
• A more general performance measure should allow a comparison of different world
states according to exactly how happy they would make the agent.
• The term utility can be used to describe how "happy" the agent is.
• A rational utility-based agent chooses the action that maximizes the expected utility of
the action outcomes, that is, the utility the agent expects to derive, on average, given the
probabilities and utilities of each outcome (a small numeric sketch follows this list).
• A utility-based agent has to model and keep track of its environment, tasks that have
involved a great deal of research on perception, representation, reasoning, and learning.
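To make "maximizes the expected utility" concrete, here is a small hedged sketch; the actions, outcome probabilities, and utility values are all invented for illustration.

```python
# Illustrative only: invented actions, probabilities, and utilities.
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

actions = {
    "go_highway": [(0.9, 10), (0.1, -20)],  # usually fast, small risk of a jam
    "go_city":    [(1.0, 4)],               # slower but certain
}

best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best, expected_utility(actions[best]))  # go_highway 7.0
```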
Learning agents
• Learning has the advantage that it allows an agent to initially operate in unknown
environments and to become more competent than its initial knowledge alone might
allow.
• The most important distinction is between the "learning element", which is responsible
for making improvements, and the "performance element", which is responsible for
selecting external actions.
• The learning element uses feedback from the "critic" on how the agent is doing and
determines how the performance element should be modified to do better in the future.
• The performance element is what we have previously considered to be the entire agent:
it takes in percepts and decides on actions.
• The last component of the learning agent is the "problem generator".
• It is responsible for suggesting actions that will lead to new and informative experiences.
Unit 1 Chapter 3
Problem Solving by searching
PROBLEM-SOLVING AGENTS
Intelligent agents are supposed to act in such a way that the environment goes through a sequence of
states that maximizes the performance measure.
What is Problem solving agent?
It is a kind of Goal-based Agents
4 general steps in problem-solving:
1. Goal Formulation
2. Problem Formulation
3. Search
4. Execute
➢ E.g. Driving from Arad to Bucharest...
Consider one example: "A map is given with different cities connected, and their distance values are
also mentioned. The agent starts from one city and must reach another."
Subclass of goal-based agents
• Goal formulation
• Problem formulation
• Example problems
• Toy problems
• Real-world problems
• search
• search strategies
• Constraint satisfaction
• solution
Goal Formulation
Goal formulation, based on the current situation, is the first step in problem solving. As well as
formulating a goal, the agent may wish to decide on some other factors that affect the desirability of
different ways of achieving the goal. For now, let us assume that the agent will consider actions at
the level of driving from one major town to another. The states it will consider therefore correspond
to being in a particular town.
• Declaring the Goal: Goal information given to agent i.e. start from Arad and reach to
Bucharest.
• Ignoring some actions: the agent has to ignore actions that will not lead it to the
desired goal. I.e. there are three roads out of Arad, one toward Sibiu, one to Timisoara, and
one to Zerind. None of these achieves the goal, so unless the agent is very familiar with the
geography of Romania, it will not know which road to follow. In other words, the agent will
not know which of its possible actions is best, because it does not know enough about the
state that results from taking each action.
• Limits the objective that the agent is trying to achieve: the agent will decide its actions when it
has some added knowledge about the map, i.e. the map of Romania is given to the agent.
• Goal can be defined as set of world states: The agent can use this information to consider
subsequent stages of a hypothetical journey through each of the three towns, to try to
find a journey that eventually gets to Bucharest. i.e. once it has found a path on the map
from Arad to Bucharest, it can achieve its goal by carrying out the driving actions.
Problem Formulation
Problem formulation is the process of deciding what actions and states to consider, given a
goal.
• The process of looking for an action sequence (the series of actions that the agent carries
out to reach the goal) is called search. A search algorithm takes a problem as input and
returns a solution in the form of an action sequence. Once a solution is found, the
actions it recommends can be carried out. This is called the execution phase.
Thus, we have a simple "formulate, search, execute" design for the agent.
Well-defined problem and solutions.
A problem is defined by four items:
Initial state: The initial state that the agent starts in. e.g., the initial state for our agent in
Romania might be described as “In(Arad)”.
Successor function S(x): a description of the possible actions available to the agent. The most
common formulation uses a successor function: given a particular state x, SUCCESSOR-FN(x)
returns a set of <action, successor> ordered pairs, where each action is one of the legal actions in
state x and each successor is a state that can be reached from x by applying the action. E.g., from
state In(Arad), the successor function for the Romania problem would return
{<Go(Zerind), In(Zerind)>, <Go(Sibiu), In(Sibiu)>, <Go(Timisoara), In(Timisoara)>}.
Goal test: it determines whether a given state is a goal state.
Path cost (additive): a function that assigns a numeric cost to each path, e.g. the sum of distances,
the number of actions executed, etc. It is usually given as c(x, a, y), the step cost from x to y by
action a, assumed to be ≥ 0.
“A solution is a sequence of actions leading from the initial state to a goal state”.
Example:
The 8-puzzle consists of a 3x3 board with 8 numbered tiles and a blank space. A tile adjacent to
the blank space can slide into the space.
The standard formulation is as follows:
• States: a state description specifies the location of each of the eight tiles and of the blank in
one of the nine squares.
• Initial state: any state can be designated as the initial state. Note that any given
goal can be reached from exactly half of the possible initial states.
• Successor function: this generates the legal states that result from trying the four actions
(blank moves Left, Right, Up or Down).
• Goal test: this checks whether the state matches the goal configuration shown in the
figure.
• Path cost: each step costs 1, so the path cost is the number of steps in the path.
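A minimal Python sketch of this formulation, assuming states are 9-tuples read row by row with 0 standing for the blank; the goal layout chosen here is just one common convention.

```python
GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)     # an assumed goal layout

def successors(state):
    """Yield (action, next_state) pairs: the blank moves Left/Right/Up/Down."""
    i = state.index(0)                  # position of the blank
    row, col = divmod(i, 3)
    moves = {"Left": (0, -1), "Right": (0, 1), "Up": (-1, 0), "Down": (1, 0)}
    for action, (dr, dc) in moves.items():
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:   # legal moves stay on the board
            j = r * 3 + c
            s = list(state)
            s[i], s[j] = s[j], s[i]     # slide the adjacent tile into the blank
            yield action, tuple(s)

def goal_test(state):
    return state == GOAL

def step_cost(state, action, next_state):
    return 1                            # each step costs 1
```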
Two types of search strategies are used in path finding:
1) Uninformed search strategies.
2) Informed search strategies.
Uninformed Search Methods
Uninformed search means that they have no additional information about states beyond that
provided in the problem definition. All they can do is generate successors and distinguish a goal
state from non-goal state.
Uninformed strategies use only the information available in the problem definition
a. Breadth-first search
b. Depth-first search
c. Depth-limited search
d. Iterative deepening search
Note: (Only for Understanding)
1) It is important to understand the distinction between nodes and states.
A node is book keeping data structure used to represent the search tree.
A state corresponds to a configuration of the world.
2) We also need to represent the collection of nodes that have been generated but
not yet expanded; this collection is called the fringe.
3) In AI ,where the graph is represented implicitly by the initial state and successor
function and is frequently infinite , its complexity expressed in terms of three
quantities:
b: the branching factor, or maximum number of successors of any node;
d: the depth of the shallowest goal node; and
m: the maximum length of any path in the state space.
Breadth First Search (BFS Algorithm)
• Breadth First Search (BFS) searches breadth-wise in the problem space.
• Breadth-First search is like traversing a tree where each node is a state which may be a
potential candidate for solution.
• Breadth first search expands nodes from the root of the tree and then generates one
level of the tree at a time until a solution is found.
• It is very easily implemented by maintaining a queue of nodes.
• Initially the queue contains just the root.
• In each iteration, node at the head of the queue is removed and then expanded.
• The generated child nodes are then added to the tail of the queue.
Algorithm:
1. Place the starting node in the queue.
2. If the queue is empty, return failure and stop.
3. If the first element of the queue is a goal node, return success and stop.
4. Otherwise, remove and expand the first element of the queue and place all its children at the
end of the queue in any order.
5. Go back to step 2.
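A hedged Python sketch of this algorithm; `successors(state)` is an assumed helper that yields the neighbouring states, and a visited set is added so repeated states are not re-expanded.

```python
from collections import deque

def breadth_first_search(start, goal_test, successors):
    frontier = deque([(start, [start])])    # step 1: place the starting node
    visited = {start}
    while frontier:                         # step 2: empty queue means failure
        state, path = frontier.popleft()
        if goal_test(state):                # step 3: goal check
            return path                     # success: the path to the goal
        for next_state in successors(state):    # step 4: expand, enqueue children
            if next_state not in visited:
                visited.add(next_state)
                frontier.append((next_state, path + [next_state]))
    return None                             # failure
```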
Advantages:
Breadth-first search will never get trapped exploring a useless path forever. If there is a
solution, BFS will definitely find it. If there is more than one solution, BFS can find the
minimal one, i.e. the one that requires the fewest steps.
Disadvantages:
If the solution is far away from the root, breadth-first search will consume a lot of time.
Depth First Search (DFS)
Depth-first search (DFS) is an algorithm for traversing or searching a tree or graph.
One starts at the root (selecting some node as the root in the graph case) and explores as far as
possible along each branch before backtracking.
Algorithm:
1. Push the root node onto a stack.
2. Pop a node from the stack and examine it.
• If the element sought is found in this node, quit the search and return a result.
• Otherwise push all its successors (child nodes) that have not yet been discovered onto
the stack.
3. If the stack is empty, every node in the tree has been examined – quit the search and return
"not found".
4. If the stack is not empty, repeat from Step 2.
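A hedged Python sketch of the stack-based procedure above; as before, `successors(state)` is an assumed helper that yields neighbouring states.

```python
def depth_first_search(start, goal_test, successors):
    stack = [start]                     # step 1: push the root node
    visited = set()
    while stack:                        # step 3: empty stack means "not found"
        state = stack.pop()             # step 2: pop a node and examine it
        if goal_test(state):
            return state                # quit the search and return a result
        if state in visited:
            continue
        visited.add(state)
        for next_state in successors(state):
            if next_state not in visited:   # push undiscovered successors
                stack.append(next_state)
    return None                         # "not found"
```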
Advantages:
• If depth-first search finds a solution without exploring much of the search space, then the time
and space it takes will be very small.
• The advantage of depth-first search is that its memory requirement is only linear in the
depth of the search tree. This is in contrast with breadth-first search, which requires
more space.
Disadvantages:
• Depth-first search is not guaranteed to find a solution.
• It is not a complete algorithm; it can go into infinite loops.
Depth Limited Search:
• Depth limited search (DLS) is a modification of depth-first search that limits the
depth to which the search algorithm may go.
• In addition to starting with a root and goal node, a depth is provided that the algorithm
will not descend below.
• Any nodes below that depth are omitted from the search.
• This modification keeps the algorithm from indefinitely cycling by halting the search
after the pre-imposed depth.
• The time and space complexity are similar to those of DFS, from which the algorithm is derived.
• Time complexity: O(b^l) and space complexity: O(bl), where l is the depth limit.
Iterative Deepening Search:
• Iterative Deepening Search (IDS) is a derivative of DLS and combines the feature of
depth-first search with that of breadth-first search.
• IDS operate by performing DLS searches with increased depths until the goal is found.
• The depth begins at one, and increases until the goal is found, or no further nodes can
be enumerated.
• By minimizing the depth of the search, we force the algorithm to also search the breadth
of a graph.
• If the goal is not found, the depth that the algorithm is permitted to search is increased
and the algorithm is started again.
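A hedged recursive sketch of both algorithms; `successors(state)` is again an assumed helper yielding neighbouring states.

```python
def depth_limited_search(state, goal_test, successors, limit):
    """Return a goal state within `limit` moves of `state`, else None."""
    if goal_test(state):
        return state
    if limit == 0:                  # nodes below the depth limit are omitted
        return None
    for next_state in successors(state):
        result = depth_limited_search(next_state, goal_test, successors, limit - 1)
        if result is not None:
            return result
    return None

def iterative_deepening_search(start, goal_test, successors, max_depth=50):
    """Run DLS with depth limits 0, 1, 2, ... until the goal is found."""
    for depth in range(max_depth + 1):
        result = depth_limited_search(start, goal_test, successors, depth)
        if result is not None:
            return result
    return None                     # no goal within max_depth
```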
Comparing Search Strategies:
Criterion     BFS       DFS      DLS      IDS
Complete?     Yes       No       No       Yes
Time          O(b^d)    O(b^m)   O(b^l)   O(b^d)
Space         O(b^d)    O(bm)    O(bl)    O(bd)
Optimal?      Yes*      No       No       Yes*
(b = branching factor, d = depth of the shallowest goal, m = maximum depth of the state space,
l = depth limit; * optimal when all step costs are equal)
Informed Search techniques (heuristic search):
• A strategy that uses problem-specific knowledge beyond the definition of the problem
itself.
• Also known as “heuristic search,” informed search strategies use information about the
domain to (usually) head in the general direction of the goal node(s).
• Informed search methods: hill climbing, best-first, greedy search, beam search, A, A*.
Best First Search:
• It is an algorithm in which a node is selected for expansion based on an evaluation
function f(n).
• Traditionally, the node with the lowest evaluation function value is selected.
• “Best-first” is not an entirely accurate name: truly expanding the best node first would be a
straight march to the goal.
• Instead, we choose the node that appears to be the best.
• There is a whole family of Best-First Search algorithms with different evaluation
functions.
• Each has a heuristic function h(n).
• h(n) = estimated cost of the cheapest path from node n to a goal node
• Example: in route planning the estimate of the cost of the cheapest path might be the
straight line distance between two cities.
• Quick Review,
o g(n) = cost from the initial state to the current state n
o h(n) = estimated cost of the cheapest path from node n to a goal node
o f(n) = evaluation function to select a node for expansion (usually the lowest cost
node)
• Best-First Search can be represented in two ways,
o Greedy Best-First Search
o A* search
• Greedy Best-First Search:
• Greedy Best-First search tries to expand the node that is closest to the goal assuming it
will lead to a solution quickly
o f(n) = h(n)
o aka “Greedy Search”
• Implementation
o Expand the “most desirable” node into the fringe queue.
o Sort the queue in decreasing order of desirability.
• Example: consider the straight-line-distance heuristic hSLD
o Expand the node that appears to be closest to the goal.
o hSLD(In(Arad)) = 366
• Notice that the values of hSLD cannot be computed from the problem description itself.
• It takes some experience to know that hSLD is correlated with actual road distances,
and it is therefore a useful heuristic.
So the path is: Arad-Sibiu-Fagaras-Bucharest
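A hedged Python sketch of greedy best-first search on a small fragment of the Romania map; the road links and straight-line distances follow the standard textbook example, and the priority queue keeps the node with the lowest h(n) in front.

```python
import heapq

graph = {                        # road connections (a fragment of the map)
    "Arad": ["Zerind", "Sibiu", "Timisoara"],
    "Sibiu": ["Arad", "Oradea", "Fagaras", "Rimnicu Vilcea"],
    "Fagaras": ["Sibiu", "Bucharest"],
    "Rimnicu Vilcea": ["Sibiu", "Pitesti"],
    "Pitesti": ["Rimnicu Vilcea", "Bucharest"],
    "Zerind": ["Arad"], "Timisoara": ["Arad"], "Oradea": ["Sibiu"],
    "Bucharest": [],
}
h_sld = {                        # straight-line distance to Bucharest
    "Arad": 366, "Sibiu": 253, "Fagaras": 176, "Rimnicu Vilcea": 193,
    "Pitesti": 100, "Zerind": 374, "Timisoara": 329, "Oradea": 380,
    "Bucharest": 0,
}

def greedy_best_first(start, goal):
    frontier = [(h_sld[start], start, [start])]    # f(n) = h(n)
    visited = set()
    while frontier:
        _, city, path = heapq.heappop(frontier)    # lowest h(n) first
        if city == goal:
            return path
        if city in visited:
            continue
        visited.add(city)
        for nxt in graph[city]:
            heapq.heappush(frontier, (h_sld[nxt], nxt, path + [nxt]))
    return None

print(greedy_best_first("Arad", "Bucharest"))
# ['Arad', 'Sibiu', 'Fagaras', 'Bucharest']
```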
A* search
• A* (A star) is the most widely known form of Best-First search
– It evaluates nodes by combining g(n) and h(n).
– f(n) = g(n) + h(n).
– Where
• g(n) = cost so far to reach n.
• h(n) = estimated cost to goal from n.
• f(n) = estimated total cost of path through n.
• When h(n) = actual cost to goal
– Only nodes in the correct path are expanded
– Optimal solution is found
• When h(n) < actual cost to goal
– Additional nodes are expanded
– Optimal solution is found
• When h(n) > actual cost to goal
– Optimal solution can be overlooked.
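A hedged A* sketch, reusing the `graph` and `h_sld` tables from the greedy example above and adding road distances for g(n) (again following the standard textbook figures). Note that it finds the cheaper route through Rimnicu Vilcea and Pitesti that greedy search misses.

```python
import heapq

road = {("Arad", "Sibiu"): 140, ("Arad", "Zerind"): 75,
        ("Arad", "Timisoara"): 118, ("Sibiu", "Oradea"): 151,
        ("Sibiu", "Fagaras"): 99, ("Sibiu", "Rimnicu Vilcea"): 80,
        ("Rimnicu Vilcea", "Pitesti"): 97, ("Pitesti", "Bucharest"): 101,
        ("Fagaras", "Bucharest"): 211}
cost = {**road, **{(b, a): d for (a, b), d in road.items()}}   # symmetric

def a_star(start, goal):
    frontier = [(h_sld[start], 0, start, [start])]   # (f, g, city, path)
    best_g = {start: 0}
    while frontier:
        f, g, city, path = heapq.heappop(frontier)   # lowest f = g + h first
        if city == goal:
            return path, g
        for nxt in graph[city]:
            g2 = g + cost[(city, nxt)]
            if g2 < best_g.get(nxt, float("inf")):   # keep only cheaper routes
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h_sld[nxt], g2, nxt, path + [nxt]))
    return None

print(a_star("Arad", "Bucharest"))
# (['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest'], 418)
```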
The main drawback of A*, and indeed of any best-first search, is its memory
requirement.
Heuristic Function:
“A rule of thumb, simplification, or educated guess that reduces or limits the search for
solutions in domains that are difficult and poorly understood.”
– h(n) = estimated cost of the cheapest path from node n to goal node.
– If n is goal then h(n)=0
It is a technique which evaluates a state and finds the significance of that state w.r.t. the goal
state; because of this it is possible to compare various states and choose the best state to visit
next. Heuristics are used to improve the efficiency of the search process. The search can be improved
by evaluating the states. The states can be evaluated by applying some evaluation function
which tells the significance of a state for achieving the goal state. The function which evaluates the state
is the heuristic function, and the value calculated by this function is the heuristic value of the state.
The heuristic function is represented as h(n)
8-puzzle problem
Admissible heuristics
h1(n) = number of misplaced tiles
h2(n) = total Manhattan distance (i.e., the number of squares each tile is from its desired location)
In the example
h1(S) = 6
h2(S) = 2 + 0 + 3 + 1 + 0 + 1 + 3 + 4 = 14
If h2(n) ≥ h1(n) for every node n, we say h2 dominates h1; h2 is then better for search than h1.
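Both heuristics are easy to compute. Here is a hedged sketch using the 9-tuple state encoding from the 8-puzzle formulation sketch earlier (0 stands for the blank; the goal layout is an assumption).

```python
GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)    # assumed goal layout

def h1(state):
    """Number of misplaced tiles (the blank is not counted)."""
    return sum(1 for i, tile in enumerate(state)
               if tile != 0 and tile != GOAL[i])

def h2(state):
    """Total Manhattan distance of each tile from its goal square."""
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        g = GOAL.index(tile)
        total += abs(i // 3 - g // 3) + abs(i % 3 - g % 3)
    return total
```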
Memory Bounded Heuristic Search
The simplest way to reduce memory requirements for A* is to adapt the idea of iterative
deepening search to the heuristic search context.
Types of memory bounded algorithms
1. Iterative deepening A*(IDA*)- Here cutoff information is the f-cost (g+h) instead of
depth
2. Recursive best first search (RBFS) - Recursive algorithm that attempts to mimic
standard best-first search with linear space.
3. Simplified Memory bounded A* (SMA*)- Drop the worst-leaf node when memory is
full
Iterative deepening A*(IDA*)
Just as iterative deepening solved the space problem of breadth-first search, iterative deepening
A* (IDA*) eliminates the memory constraints of A* search algorithm without sacrificing solution
optimality. Each iteration of the algorithm is a depth-first search that keeps track of the cost, f(n)
= g(n) + h(n), of each node generated. As soon as a node is generated whose cost exceeds a
threshold for that iteration, its path is cut off, and the search backtracks before continuing. The
cost threshold is initialized to the heuristic estimate of the initial state, and in each successive
iteration is increased to the total cost of the lowest-cost node that was pruned during the
previous iteration. The algorithm terminates when a goal state is reached whose total cost does
not exceed the current threshold.
Since Iterative Deepening A* performs a series of depth-first searches, its memory requirement
is linear with respect to the maximum search depth. In addition, if the heuristic function is
admissible, IDA* finds an optimal solution. Finally, by an argument similar to that presented for
DFID, IDA* expands the same number of nodes, asymptotically, as A* on a tree, provided that the
number of nodes grows
exponentially with solution cost. These costs, together with the optimality of A*, imply that IDA*
is asymptotically optimal in time and space over all heuristic search algorithms that find optimal
solutions on a tree. Additional benefits of IDA* are that it is much easier to implement, and often
runs faster than A*, since it does not incur the overhead of managing the open and closed lists.
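A hedged sketch of the contour scheme just described: each iteration is a depth-first search cut off at the current f-cost threshold, and the threshold then rises to the lowest pruned f-cost. Here `successors(state)` is assumed to yield (step_cost, next_state) pairs and `h` to be an admissible heuristic.

```python
def ida_star(start, goal_test, successors, h):
    threshold = h(start)              # initial cutoff: heuristic of the root

    def search(state, g, path):
        f = g + h(state)
        if f > threshold:             # cut off this path, remember its cost
            return f
        if goal_test(state):
            return path
        minimum = float("inf")        # lowest f-cost pruned below this node
        for step_cost, nxt in successors(state):
            if nxt in path:           # avoid cycling back along this path
                continue
            result = search(nxt, g + step_cost, path + [nxt])
            if isinstance(result, list):
                return result         # a goal path was found
            minimum = min(minimum, result)
        return minimum

    while True:
        result = search(start, 0, [start])
        if isinstance(result, list):
            return result
        if result == float("inf"):
            return None               # no solution at any threshold
        threshold = result            # raise cutoff to lowest pruned f-cost
```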
Recursive best first search (RBFS)
IDA* search is no longer a best-first search since the total cost of a child can be less than that of
its parent, and thus nodes are not necessarily expanded in best-first order. Recursive Best-First
Search (RBFS) is an alternative algorithm. Recursive best-first search is a best-first search that
runs in space that is linear with respect to the maximum search depth, regardless of the cost
function used. Even with an admissible cost function, Recursive Best-First Search generates
fewer nodes than IDA*, and is generally superior to IDA*, except for a small increase in the cost
per node generation.
Simplified Memory bounded A* (SMA*)
• Use all available memory.
• I.e. expand the best leaves until available memory is full.
• When full, SMA* drops the worst leaf node (the one with the highest f-value).
• Like RBFS, it backs up the value of the forgotten node to its parent.
• What if all leaves have the same f-value?
o The same node could be selected for expansion and deletion.
o SMA* solves this by expanding the newest best leaf and deleting the oldest worst leaf.
• SMA* is complete if a solution is reachable, and optimal if the optimal solution is reachable.
– SMA* will utilize whatever memory is made available to it.
– It avoids repeated states as far as its memory allows.
– It is complete if the available memory is sufficient to store the shallowest
solution path.
Unit 2
LEARNING FROM EXAMPLES
An agent is learning if it improves its performance on future tasks after making
observations about the world.
FORMS OF LEARNING
Any component of an agent can be improved by learning from data. The improvements depend on four major factors:
Which component is to be improved.
• A direct mapping from conditions on the current state to actions (e.g. when to apply the brake)
• Inferring relevant properties of the world from the percept sequence (e.g. wet roads)
• Information about the way the world evolves (as per new experiences)
What prior knowledge the agent already has.
• machine learning research covers inputs that form a factored representation—a vector of
attribute values—and outputs that can be either a continuous numerical value or a discrete
value.
What representation is used for the data and the component.
What feedback is available to learn from.
• There are three types of feedback:
o The most common unsupervised learning task is clustering: detecting potentially
useful clusters of input examples.
o reinforcement learning the agent learns from a series of reinforcements—rewards
or punishments.
o In supervised learning the agent observes some example input–output pairs and
learns a function that maps from input to output.
SUPERVISED LEARNING
TRAINING SET
• Given a training set of N example input–output pairs
• (x1, y1), (x2, y2), . . . (xN, yN) ,
• where each yj was generated by an unknown function y = f(x),
• discover a function h that approximates the true function f.
HYPOTHESIS
• The function h is a hypothesis
• Learning is a search through the space of possible hypotheses for one that will perform well
TEST SET
• To measure the accuracy of a hypothesis we give it a test set of examples that are distinct
from the training set.
GENERALIZATION
• We say a hypothesis generalizes well if it correctly predicts the value of y for novel examples
CLASSIFICATION
• When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the
learning problem is called classification.
REGRESSION
• When y is a number (such as tomorrow’s temperature), the learning problem is called
regression.
HYPOTHESIS SPACE
• approximate the true function f with a function h selected from a hypothesis space (the set of all legal hypotheses)
CONSISTENT
• A hypothesis (such as a line fitted to data points) is called a consistent hypothesis if it agrees with all the data.
REALIZABLE
• We say that a learning problem is realizable if the hypothesis space contains the true function.
Learning Decision Tree
• A decision tree is a decision support tool that uses a tree-like graph or model of decisions and
their possible consequences, including chance event outcomes, resource costs, and utility.
• It is one way to display an algorithm.
• Decision trees are commonly used in operations research, specifically in decision analysis, to
help identify a strategy most likely to reach a goal.
• Another use of decision trees is as a descriptive means for calculating conditional probabilities.
• Decision trees can express any function of the input attributes.
• Training input:
o Data points with a set of attributes
• Classifier output:
o Can be boolean or have multiple outputs
o Each leaf stores an “answer”
Example
Should we wait for a table at a
restaurant?
Possible attributes:
• Alternate restaurant nearby?
• Is there a bar to wait in?
• Is it Friday or Saturday?
• How hungry are we?
• How busy is the restaurant?
• How many people in the restaurant?
Expressiveness of Decision Trees
• Each path through the tree corresponds to an implication like
FORALL r Patrons(r,Full) & WaitEstimate(r,0-10) & Hungry(r,N) -> WillWait(r)
Hence a decision tree corresponds to a conjunction of implications.
• Cannot express tests that refer to two different objects like:
EXISTS r2 Nearby(r2) & Price(r,p) & Price(r2,p2) & Cheaper(p2,p)
• Expressiveness essentially propositional logic (no function symbols, no existential
quantifier)
• The number of distinct functions of n attributes is 2^(2^n), since each function must be defined
for each of the 2^n combinations of attribute values (e.g. for n = 6 there are about 2 × 10^19
different functions).
• Functions like the parity function (1 for even, 0 for odd) or the majority function (1 if more than
half of the inputs are 1) result in large decision trees.
Different Solutions
• Trivial solution: construct a decision tree that has one path to a leaf for each example.
It fits the given examples, but is bad for anything else.
• Occam's razor: the most likely hypothesis is the simplest one that is consistent with all
observations.
• Finding the smallest decision tree is intractable, hence heuristic decisions:
o Test the most important attribute first.
o Most important = makes the most difference to the classification of an example.
This yields short paths in the tree and small trees.
• Compare splitting the examples by testing different attributes (cf. Patrons, Type).
Assessing the performance of the learning algorithm
A learning algorithm is good if it produces hypotheses that do a good job of predicting the
classifications of unseen examples.
A prediction is good if it turns out to be true, so we can assess the quality of a hypothesis by
checking its predictions against the correct classification once we know it. We do this on a set of
examples known as the test set.
Selecting Best Attribute
If we train on all our available examples, then we will have to go out and get some more to test
on, so it is often more convenient to adopt the following methodology:
1. Collect a large set of examples.
2. Divide it into two disjoint sets: the training set and the test set.
3. Use the learning algorithm with the training set as examples to generate a hypothesis H.
4. Measure the percentage of examples in the test set that are correctly classified by H.
5. Repeat steps 1 to 4 for different sizes of training sets and different randomly selected
training sets of each size.
Choosing attribute tests
ENTROPY
• We will use the notion of information gain, which is defined in terms of entropy, the
fundamental quantity in information theory.
• Entropy is a measure of the uncertainty of a random variable; acquisition of
information corresponds to a reduction in entropy.
• A random variable with only one value—a coin that always comes up heads—has no
uncertainty and thus its entropy is defined as zero; thus, we gain no information by
observing its value. A flip of a fair coin is equally likely to come up heads or tails, 0
or 1, and we will soon show that this counts as “1 bit” of entropy.
• In general, the entropy of a random variable V with values vk, each with probability
P(vk), is defined as H(V) = − Σk P(vk) log2 P(vk)
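A hedged sketch of the entropy formula and of the information gain used to choose an attribute; the example counts are invented.

```python
import math

def entropy(probs):
    """H(V) = - sum_k P(v_k) * log2 P(v_k)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # a fair coin: 1.0 bit
print(entropy([1.0]))         # a certain outcome: 0.0 bits

def information_gain(parent_counts, splits):
    """Entropy reduction from splitting examples on one attribute.
    parent_counts: class counts before the split, e.g. [6, 6];
    splits: class counts in each branch, e.g. [[6, 0], [0, 6]]."""
    def h(counts):
        n = sum(counts)
        return entropy([c / n for c in counts]) if n else 0.0
    n = sum(parent_counts)
    remainder = sum(sum(c) / n * h(c) for c in splits)
    return h(parent_counts) - remainder

# Splitting 6 positive / 6 negative examples into two pure halves gains 1 bit:
print(information_gain([6, 6], [[6, 0], [0, 6]]))   # 1.0
```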
Generalization and overfitting
• Overfitting is the phenomenon in which the learning system fits the given training
data so tightly that it becomes inaccurate in predicting the outcomes of the
untrained data.
• One of the methods used to address overfitting in decision trees is
called pruning, which is done after the initial training is complete. In pruning, you trim
off branches of the tree, i.e., remove decision nodes starting from the leaf
nodes, such that the overall accuracy is not disturbed.
EVALUATING AND CHOOSING THE BEST HYPOTHESIS
• To make that precise we need to define “future data” and “best.”
• We make the stationarity assumption: that there is a probability distribution over examples that
remains stationary over time.
• Examples that satisfy assumptions are called independent and identically distributed IID
• To define “best fit,” we define the error rate of a hypothesis as the proportion of mistakes it
makes—the proportion of times that h(x) ≠ y for an (x, y) example.
• HOLDOUT CROSS-VALIDATION : randomly split the available data into a training set from which
the learning algorithm produces h and a test set on which the accuracy of h is evaluated. This
method, sometimes called holdout cross-validation
• k-fold cross-validation : We can squeeze more out of the data and still get an accurate estimate
using a technique called k-fold cross-validation.
• If the test set is locked away, but you still want to measure performance on unseen data as a way
of selecting a good hypothesis, then divide the available data (without the test set) into a training
set and a validation set.
Model selection: Complexity versus goodness of fit
• MODEL SELECTION
o higher-degree polynomials can fit the training data better, but when the degree is too
high they will overfit, and perform poorly on validation data. Choosing the degree of the
polynomial is an instance of the problem of model selection.
• OPTIMIZATION
o model selection defines the hypothesis space and then optimization finds the best
hypothesis within that space.
• WRAPPER
o The wrapper enumerates models according to a parameter, size. For each size, it uses
cross validation on Learner to compute the average error rate on the training and test
sets.
From error rates to loss
LOSS FUNCTION
• In machine learning it is traditional to express utilities by means of a loss function. The loss
function L(x, y, ŷ) is defined as the amount of utility lost by predicting h(x) = ŷ when the
correct answer is f(x) = y.
NOISE
• f may be nondeterministic or noisy—it may return different values for f(x) each time x
occurs. By definition, noise cannot be predicted; in many cases, it arises because the observed
labels y are the result of attributes of the environment not listed in x.
THE THEORY OF LEARNING
• How can we be sure that our learning algorithm has produced a hypothesis that will predict the
correct value for previously unseen inputs?
• how do we know that the hypothesis h is close to the target function f if we don’t know what f is?
• A hypothesis h is called approximately correct if error(h) ≤ ε, where ε is a small constant. We
will show that we can find an N such that, after seeing N examples, with high probability, all
consistent hypotheses will be approximately correct. One can think of an approximately correct
hypothesis as being “close” to the true function in hypothesis space: it lies inside what is called the
ε-ball around the true function f. The hypothesis space outside this ball is called Hbad.
REGRESSION AND CLASSIFICATION WITH LINEAR MODELS
the class of linear functions of continuous-valued inputs.
Univariate linear regression
• A univariate linear function (a straight line) with input x and output y has the form y =w1x+w0,
where w0 and w1 are real-valued coefficients to be learned. We use the letter w because we think of
the coefficients as weights; the value of y is changed by changing the relative weight of one term
or another.
• The task of finding the hw that best fits these data is called linear regression. To fit a line to the
data, all we have to do is find the values of the weights [w0, w1] that minimize the
empirical loss (i.e. the loss that is verifiable by observation).
• These problems can be addressed by a hill-climbing algorithm that follows the gradient of the function
to be optimized. In this case, because we are trying to minimize the loss, we will use gradient
descent. We choose any starting point in weight space—here, a point in the (w0, w1) plane—and
then move to a neighboring point that is downhill, repeating until we converge on the minimum
possible loss:
wi ← wi − α ∂Loss(w)/∂wi
• The parameter α, which we called the step size, is usually called the learning rate when we are
trying to minimize loss in a learning problem.
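A hedged sketch of univariate linear regression fitted by gradient descent on squared loss; the toy data points and the learning rate are invented.

```python
def fit_line(xs, ys, alpha=0.01, steps=5000):
    w0, w1 = 0.0, 0.0                   # any starting point in weight space
    n = len(xs)
    for _ in range(steps):
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        # move each weight downhill along the gradient of the mean squared loss
        w0 -= alpha * (2 / n) * sum(errors)
        w1 -= alpha * (2 / n) * sum(e * x for e, x in zip(errors, xs))
    return w0, w1

xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]     # invented points on y = 2x + 1
print(fit_line(xs, ys))                 # approximately (1.0, 2.0)
```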
Multivariate linear regression
• We can easily extend to multivariate linear regression problems, in which each example xj is an
n-element vector.
• Multivariate linear regression is actually not much more complicated than the univariate case we
just covered. Gradient descent will reach the (unique) minimum of the loss function; the update
equation for each weight wi is wi ← wi + α Σj xj,i (yj − hw(xj))
• It is also possible to solve analytically for the w that minimizes loss. Let y be the vector of outputs
for the training examples, and X be the data matrix, i.e., the matrix of inputs with one
n-dimensional example per row. Then the solution that minimizes the squared error is
w* = (X^T X)^(-1) X^T y (the normal equation).
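A hedged NumPy sketch of this analytic solution; the toy data are invented, and a column of 1s supplies the intercept weight w0.

```python
import numpy as np

X = np.array([[1.0, 0, 0],              # each row: [1, x1, x2]
              [1.0, 1, 0],
              [1.0, 0, 1],
              [1.0, 1, 1]])
y = np.array([1.0, 3.0, 2.0, 4.0])      # generated by y = 1 + 2*x1 + 1*x2

# Normal equation w* = (X^T X)^(-1) X^T y, solved without an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                                # [1. 2. 1.]
```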
Linear classifiers with a hard threshold
Linear functions can be used to do classification as well as regression.
A. Plot of two seismic data parameters, body wave magnitude x1 and surface wave magnitude
x2, for earthquakes (white circles) and nuclear explosions (black circles) occurring between
1982 and 1990 in Asia and the Middle East (Kebeasy et al., 1998). Also shown is a decision
boundary between the classes.
B. The same domain with more data points. The earthquakes and explosions are no longer
linearly separable.
• A decision boundary is a line (or a surface, in higher dimensions) that separates the two
classes.
• In Figure (a), the decision boundary is a straight line. A linear decision boundary is called a linear
separator and data that admit such a separator are called linearly separable.
• we can think of h as the result of passing the linear function through a threshold function.
• A training curve measures the classifier performance on a fixed training set as the learning
process proceeds on that same training set. The curve shows the update rule converging to a zero-
error linear separator. The “convergence” process isn’t exactly pretty, but it always works.
• A typical training process uses a decaying learning rate schedule, e.g. α(t) = 1000/(1000 + t)
Linear classification with logistic regression
Two soft threshold functions are commonly used: the integral of the standard normal distribution
(used for the probit model) and the logistic function (used for the logit model). Although the two
functions are very similar in shape, the logistic function has more convenient mathematical properties.
(a) The hard threshold function Threshold(z) with 0/1 output.
(b) The logistic function, Logistic(z) = 1/(1 + e^(−z)), also known as the sigmoid function.
(c) Plot of a logistic regression hypothesis hw(x) = Logistic(w · x).
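A hedged sketch of the logistic function and the corresponding hypothesis; the weight vector below is invented.

```python
import math

def logistic(z):
    """Logistic(z) = 1 / (1 + e^(-z)), the sigmoid function."""
    return 1.0 / (1.0 + math.exp(-z))

def h(w, x):
    """h_w(x) = Logistic(w . x): a soft 0/1 output read as a probability."""
    return logistic(sum(wi * xi for wi, xi in zip(w, x)))

w = [-1.5, 1.0]             # invented weights: [bias weight, input weight]
print(h(w, [1.0, 0.0]))     # well on the 0 side: about 0.18
print(h(w, [1.0, 3.0]))     # well on the 1 side: about 0.82
```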
ARTIFICIAL NEURAL NETWORKS
• the hypothesis that mental activity consists primarily of electrochemical activity in networks of
brain cells called neurons.
• Inspired by this hypothesis, some of the earliest AI work aimed to create artificial neural
networks.
A simple mathematical model for a neuron.
Neural network structures
• Neural networks are composed of nodes or units connected by directed links.
• A link from unit i to unit j serves to propagate the activation ai from i to j.
• Each link also has a numeric weight wi,j associated with it, which determines the strength and sign
of the connection.
• Each unit j first computes a weighted sum of its inputs, inj = Σi wi,j ai. Then it applies an
activation function g to this sum to derive the output: aj = g(inj).
• The activation function g is typically either a hard threshold, in which case the unit is called a
perceptron, or a logistic function, in which case the term sigmoid perceptron is used.
• Having decided on the mathematical model for individual “neurons,” the next task is to connect
them together to form a network.
There are two fundamentally distinct ways to do this.
1. A feed-forward network has connections only in one direction—that is, it forms a directed
acyclic graph.
2. A recurrent network, on the other hand, feeds its outputs back into its own inputs.
• Feed-forward networks are usually arranged in layers, such that each unit receives input only
from units in the immediately preceding layer.
Single-layer feed-forward neural networks (perceptrons)
• A network with all the inputs connected directly to the outputs is called a single-layer neural
network, or a perceptron network.
A perceptron network with two inputs and two output units.
Multilayer feed-forward neural networks
• A network in which the inputs are connected to the outputs through one or more layers of hidden
units is called a multilayer neural network.
A neural network with two inputs, one hidden layer of two units, and one output unit. Not shown are the dummy
inputs and their associated weights.
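A sketch of a forward pass through exactly this architecture (two inputs, one hidden layer of two sigmoid units, one output unit); the weights are hypothetical, and the dummy bias inputs are written explicitly:

```python
import math

# Forward pass for the two-input, one-hidden-layer network in the figure.

def g(z):                          # activation function (sigmoid)
    return 1.0 / (1.0 + math.exp(-z))

def unit(weights, inputs):         # weighted sum of inputs, then activation
    return g(sum(w * a for w, a in zip(weights, inputs)))

def forward(x1, x2):
    a3 = unit([0.5, 1.0, -1.0], [1.0, x1, x2])    # hidden unit 3 (first weight = bias)
    a4 = unit([-0.5, -1.0, 1.0], [1.0, x1, x2])   # hidden unit 4
    a5 = unit([0.1, 2.0, 2.0], [1.0, a3, a4])     # output unit 5
    return a5

print(forward(0.0, 1.0))
```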
Learning neural network structures
• Like all statistical models, neural networks are subject to overfitting when there are too many
parameters in the model.
• If we stick to fully connected networks, the only choices to be made concern the number of hidden
layers and their sizes. The usual approach is to try several and keep the best.
• The optimal brain damage algorithm begins with a fully connected network and removes
connections from it. After the network is trained for the first time, an information-theoretic
approach identifies an optimal selection of connections that can be dropped.
• The tiling algorithm resembles decision-list learning. The idea is to start with a single unit that does
its best to produce the correct output on as many of the training examples as possible.
NONPARAMETRIC MODELS
• A learning model that summarizes data with a set of parameters of fixed size (independent of the
number of training examples) is called a parametric model.
• A nonparametric model is one that cannot be characterized by a bounded set of parameters.
• For example, suppose that each hypothesis we generate simply retains within itself all of the
training examples and uses all of them to predict the next example.
• Such a hypothesis family would be nonparametric because the effective number of parameters is
unbounded: it grows with the number of examples. This approach is called instance-based learning
or memory-based learning.
• The simplest instance-based learning method is table lookup: take all the training examples, put
them in a lookup table, and then when asked for h(x), see if x is in the table; if it is, return the
corresponding y.
Nearest neighbor models
• We can improve on table lookup with a slight variation: given a query xq, find the k examples that
are nearest to xq. This is called k-nearest neighbors lookup. We’ll use the notation NN(k, xq) to
denote the set of k nearest neighbors.
• Consider the decision boundary of k-nearest-neighbors classification for k = 1 and 5 on the
earthquake data set shown earlier. Nonparametric methods are still subject to underfitting and
overfitting, just like parametric methods. In this case 1-nearest neighbors is overfitting; it reacts
too much to the black outlier in the upper right and the white outlier at (5.4, 3.7). The 5-nearest-
neighbors decision boundary is good; higher k would underfit. As usual, cross-validation can be
used to select the best value of k.
• The very word “nearest” implies a distance metric.
• How do we measure the distance from a query point xq to an example point xj? Typically, distances
are measured with a Minkowski distance: Lᵖ(xj, xq) = (Σi |xj,i − xq,i|ᵖ)^(1/p).
• With p = 2 this is Euclidean distance and with p = 1 it is Manhattan distance. With Boolean attribute
values, the number of attributes on which the two points differ is called the Hamming distance.
• Often p=2 is used if the dimensions are measuring similar properties, such as the width, height
and depth of parts on a conveyor belt, and
• Manhattan distance is used if they are dissimilar, such as age, weight, and gender of a patient.
• To avoid results changing when the units of measurement change (e.g., from mm to cm), it is
common to apply normalization to the measurements in each dimension.
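A sketch of NN(k, xq) lookup and majority-vote classification using the Minkowski distance; the example points and labels are hypothetical, and the dimensions are assumed to be already normalized:

```python
# k-nearest-neighbors lookup NN(k, xq) with a Minkowski (L^p) distance.

def minkowski(a, b, p=2):
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)

def knn_classify(examples, labels, xq, k=5, p=2):
    # NN(k, xq): the k examples nearest to the query point xq
    nearest = sorted(zip(examples, labels),
                     key=lambda e: minkowski(e[0], xq, p))[:k]
    votes = [y for _, y in nearest]
    return max(set(votes), key=votes.count)   # majority vote among the neighbors

examples = [[4.5, 3.0], [5.4, 3.7], [6.1, 4.2], [4.0, 2.8], [5.9, 4.0]]
labels = ["earthquake", "explosion", "earthquake", "explosion", "earthquake"]
print(knn_classify(examples, labels, [5.0, 3.5], k=3))
```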
Finding nearest neighbors with k-d trees
• A balanced binary tree over data with an arbitrary number of dimensions is called a k-d tree, for
k-dimensional tree. (In our notation the number of dimensions is n, so they would really be n-d trees.)
• The construction of a k-d tree is similar to the construction of a one-dimensional balanced binary
tree: we split the examples at the median value along the chosen dimension.
• We then recursively make a tree for the left and right sets of examples, stopping when there are
fewer than two examples left. To choose a dimension to split on at each node of the tree, one can
simply select dimension i mod n at level i of the tree.
• Exact lookup from a k-d tree is just like lookup from a binary tree (with the slight complication
that you need to pay attention to which dimension you are testing at each node).
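A minimal construction sketch along these lines, splitting at the median of dimension (level mod n) and stopping when fewer than two examples remain; the point set is hypothetical:

```python
# Building a k-d tree by median split on dimension (level mod n).

def build_kdtree(points, level=0):
    if len(points) < 2:
        return points[0] if points else None    # leaf (single point) or empty
    n = len(points[0])                          # number of dimensions
    d = level % n                               # dimension to split on at this level
    points = sorted(points, key=lambda p: p[d])
    mid = len(points) // 2                      # median position
    return {
        "dim": d,
        "split": points[mid][d],
        "left": build_kdtree(points[:mid], level + 1),
        "right": build_kdtree(points[mid:], level + 1),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["dim"], tree["split"])
```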
Locality-sensitive hashing
• Hash tables have the potential to provide even faster lookup than binary trees. But how can we
find nearest neighbors using a hash table, when hash codes rely on an exact match? Hash codes
randomly distribute values among the bins, but we want to have near points grouped together in
the same bin; we want a locality-sensitive hash (LSH).
• First we define the approximate near-neighbors problem: given a data set of example points and a
query point xq, find, with high probability, an example point (or points) close to xq.
SUPPORT VECTOR MACHINES
• The support vector machine or SVM framework is currently the most popular approach for “off-
the-shelf” supervised learning: if you don’t have any specialized prior knowledge about a domain,
then the SVM is an excellent method to try first. There are three properties that make SVMs
attractive:
1. SVMs construct a maximum margin separator—a decision boundary with the largest possible
distance to example points. This helps them generalize well.
2. SVMs create a linear separating hyperplane, but they have the ability to embed the data into a
higher-dimensional space using the so-called kernel trick, so that a data set that is not linearly
separable in the original space often becomes separable in the higher-dimensional one.
3. SVMs are a nonparametric method—they retain training examples and potentially need to store
them all. On the other hand, in practice they often end up retaining only a small fraction of the
number of examples—sometimes as few as a small constant times the number of dimensions.
o Thus SVMs combine the advantages of nonparametric and parametric models: they have
the flexibility to represent complex functions, but they are resistant to overfitting.
Support vector machine classification:
a) Two classes of points (black and white circles) and three candidate linear separators.
b) The maximum margin separator (heavy line) is at the midpoint of the margin (area between
dashed lines). The support vectors (points with large circles) are the examples closest to the
separator.
• Instead of minimizing expected empirical loss on the training data, SVMs attempt to minimize
expected generalization loss. We call this separator the maximum margin separator; the margin is
the width of the area bounded by the dashed lines.
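A hedged usage sketch with scikit-learn's SVC (assuming scikit-learn is available; the toy data are hypothetical). The RBF kernel plays the role of the kernel trick, and support_vectors_ exposes the retained examples:

```python
from sklearn import svm

X = [[4.5, 3.0], [5.4, 3.7], [6.1, 4.2], [4.0, 2.8], [5.9, 4.0], [5.0, 3.1]]
y = [0, 1, 1, 0, 1, 0]

clf = svm.SVC(kernel="rbf", C=1.0)    # maximum-margin classifier with a kernel
clf.fit(X, y)
print(clf.predict([[5.2, 3.5]]))
print(len(clf.support_vectors_))      # often only a small fraction of the data
```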
ENSEMBLE LEARNING
• The idea of ensemble learning methods is to select a collection, or ensemble, of hypotheses from
the hypothesis space and combine their predictions. For example, during cross-validation we might
generate twenty different decision trees, and have them vote on the best classification for a new
example.
• The motivation for ensemble learning is simple. Consider an ensemble of K =5 hypotheses and
suppose that we combine their predictions using simple majority voting. For the ensemble to
misclassify a new example, at least three of the five hypotheses have to misclassify it.
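A quick check of this arithmetic, under the idealized assumption that the five hypotheses err independently with the same probability p:

```python
from math import comb

# Majority vote of K = 5 hypotheses errs only when at least 3 of the 5 err.
def majority_error(p, K=5):
    return sum(comb(K, k) * p**k * (1 - p)**(K - k)
               for k in range(K // 2 + 1, K + 1))

print(majority_error(0.1))   # ~0.0086, far below the individual error rate 0.1
```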
Illustration of the increased expressive power obtained by ensemble learning. We take three linear threshold
hypotheses, each of which classifies positively on the unshaded side, and classify as positive any example
classified positively by all three. The resulting triangular region is a hypothesis not expressible in the original
hypothesis space.
• Another way to think about the ensemble idea is as a generic way of enlarging the hypothesis space.
That is, think of the ensemble itself as a hypothesis and the new hypothesis space as the set of all
possible ensembles constructable from hypotheses in the original space.
• If the original hypothesis space allows for a simple and efficient learning algorithm, then the
ensemble method provides a way to learn a much more expressive class of hypotheses without
incurring much additional computational or algorithmic complexity.
The most widely used ensemble method is called boosting.
• Boosting starts with wj =1 for all the examples (i.e., a normal training set). From this set, it generates
the first hypothesis, h1. This hypothesis will classify some of the training examples correctly and
some incorrectly.
• We would like the next hypothesis to do better on the misclassified examples, so we increase their
weights while decreasing the weights of the correctly classified examples.
• In some cases, boosting has been shown to yield better accuracy than bagging, but it also tends to
be more likely to over-fit the training data.
• There are many variants of the basic boosting idea, with different ways of adjusting the weights
and combining the hypotheses. One specific algorithm, called ADABOOST, operates on a
weighted training set.
• In such a training set, each example has an associated weight wj ≥ 0. The higher the weight of an
example, the higher is the importance attached to it during the learning of a hypothesis.
ADABOOST has a very important property:
• If the input learning algorithm L is a weak learning algorithm—which means that L always returns
a hypothesis with accuracy on the training set that is slightly better than random guessing (i.e.,
50% + ε for Boolean classification)—then ADABOOST will return a hypothesis that classifies the
training data perfectly for large enough K. Thus, the algorithm boosts the accuracy of the original
learning algorithm on the training data.
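A compact sketch of one standard AdaBoost variant for labels in {−1, +1}, using single-feature threshold stumps as the weak learner L; the data set and number of rounds K are hypothetical:

```python
import math

# AdaBoost sketch: reweight examples, favoring those the last hypothesis missed.

def stump(data, w):
    """Weak learner: best single-feature threshold classifier under weights w."""
    best = None
    for d in range(len(data[0][0])):
        for thr in sorted({x[d] for x, _ in data}):
            for sign in (+1, -1):
                h = lambda x, d=d, thr=thr, sign=sign: sign * (1 if x[d] >= thr else -1)
                err = sum(wi for (x, y), wi in zip(data, w) if h(x) != y)
                if best is None or err < best[0]:
                    best = (err, h)
    return best

def adaboost(data, K=10):
    N = len(data)
    w = [1.0 / N] * N                     # start with equal weights
    hyps = []
    for _ in range(K):
        err, h = stump(data, w)
        err = max(err, 1e-10)
        if err >= 0.5:
            break                         # not even weakly better than guessing
        z = 0.5 * math.log((1 - err) / err)            # hypothesis weight
        w = [wi * math.exp(-z * y * h(x)) for (x, y), wi in zip(data, w)]
        s = sum(w)
        w = [wi / s for wi in w]          # renormalize the example weights
        hyps.append((z, h))
    return lambda x: 1 if sum(z * h(x) for z, h in hyps) >= 0 else -1

data = [((1.0, 2.0), -1), ((2.0, 1.0), -1), ((3.0, 3.5), 1), ((4.0, 3.0), 1)]
H = adaboost(data)
print([H(x) for x, _ in data])
```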
Online Learning
• So far we have assumed the data are i.i.d. (independent and identically distributed); online learning
deals with the case where they can change over time. In this case, it matters when we make a
prediction, so we adopt the perspective called online learning:
• An agent receives an input xj from nature, predicts the corresponding yj, and then is told the correct
answer. Then the process repeats with xj+1, and so on.
• One might think this task is hopeless—if nature is adversarial, all the predictions may be wrong.
• For example, each day a set of K pundits predicts whether the stock market will go up or down, and
our task is to pool those predictions and make our own. One way to do this is to keep track of how
well each expert performs, and choose to believe them in proportion to their past performance. This
is called the randomized weighted majority algorithm:
1. Initialize a set of weights {w1, . . . ,wK} all to 1.
2. Receive the predictions {ŷ1, . . . , ŷK} from the experts.
3. Randomly choose an expert k∗ in proportion to its weight: P(k) = wk / Σk′ wk′.
4. Predict ŷk∗.
5. Receive the correct answer y.
6. For each expert k such that ŷk ≠ y, update wk ← βwk, where β is a penalty factor with 0 < β < 1.
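A direct transcription of these six steps; the expert predictions, the answer stream, and β are hypothetical:

```python
import random

# Randomized weighted majority over K experts with penalty factor beta.

def rwm(expert_predictions, answers, beta=0.5, seed=0):
    rng = random.Random(seed)
    K = len(expert_predictions[0])
    w = [1.0] * K                                  # step 1: all weights start at 1
    mistakes = 0
    for preds, y in zip(expert_predictions, answers):
        k = rng.choices(range(K), weights=w)[0]    # step 3: pick expert proportional to weight
        if preds[k] != y:                          # step 4: predict that expert's answer
            mistakes += 1
        for i in range(K):                         # step 6: penalize the wrong experts
            if preds[i] != y:
                w[i] *= beta
    return mistakes, w

preds = [("up", "down", "up"), ("up", "up", "down"), ("down", "down", "up")]
answers = ["up", "up", "down"]
print(rwm(preds, answers))
```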
PRACTICAL MACHINE LEARNING
• We consider two aspects of practical machine learning. The first involves finding algorithms capable
of learning to recognize handwritten digits and squeezing every last drop of predictive performance
out of them. The second involves almost the opposite: pointing out that obtaining, cleaning, and
representing the data can be at least as important as algorithm engineering.
• Recognizing handwritten digits is an important problem with many applications, including
automated sorting of mail by postal code, automated reading of checks and tax returns, and data
entry for hand-held computers. It is an area where rapid progress has been made, in part because of
better learning algorithms and in part because of the availability of better training sets.
• Examples from the NIST database of handwritten digits. Top row: examples of digits 0–9 that are easy to
identify. Bottom row: more difficult examples of the same digits.
A table in the source notes (not reproduced here) summarizes the error rate and some of the other
characteristics of the seven techniques discussed.
Case study: Word senses and house prices
• In practical applications of machine learning, the data set is usually large, multidimensional, and
messy. The data are not handed to the analyst in a prepackaged set of (x, y) values; rather, the analyst
needs to go out and acquire the right data.
• There is a task to be accomplished, and most of the engineering problem is deciding what data are
necessary to accomplish the task; a smaller part is choosing and implementing an appropriate
machine learning method to process the data.
• The figure shows a typical real-world example, comparing five learning algorithms on the task of word-
sense classification (given a sentence such as “The bank folded,” classify the word “bank” as
“money-bank” or “river-bank”). The point is that machine learning researchers have focused
mainly on the vertical direction: can I invent a new learning algorithm that performs better than
previously published algorithms on a standard training set of 1 million words? But the graph shows
there is more room for improvement in the horizontal direction:
• Instead of inventing a new algorithm, all I need to do is gather 10 million words of training data;
even the worst algorithm at 10 million words is performing better than the best algorithm at 1
million. As we gather even more data, the curves continue to rise, dwarfing the differences between
algorithms.
Learning curves for five learning algorithms on a common task. Note that there appears to be more room for
improvement in the horizontal direction (more training data) than in the vertical direction (different machine
learning algorithm). Adapted from Banko and Brill (2001).
• Similarly, consider the task of estimating the true value of houses that are for sale: it requires
gathering data on house size, number of bedrooms, locality, and other relevant features.
Unit 3
LEARNING PROBABILISTIC MODELS
Agents can handle uncertainty by using the methods of probability and decision theory, but first they must
learn their probabilistic theories of the world from experience.
We will see that a Bayesian view of learning is extremely powerful, providing general solutions to the
problems of noise, overfitting, and optimal prediction. It also takes into account the fact that a less-than-
omniscient agent can never be certain about which theory of the world is correct, yet must still make
decisions by using some theory of the world.
STATISTICAL LEARNING
• The key concepts are data and hypotheses. Here, the data are evidence—that is, instantiations of
some or all of the random variables describing the domain.
• Consider a simple example. Our favorite Surprise candy comes in two flavors: cherry (yum) and
lime (ugh). The manufacturer has a peculiar sense of humor and wraps each piece of candy in the
same opaque wrapper, regardless of flavor. The candy is sold in very large bags, of which there are
known to be five kinds—again, indistinguishable from the outside:
h1: 100% cherry,
h2: 75% cherry + 25% lime,
h3: 50% cherry + 50% lime,
h4: 25% cherry + 75% lime,
h5: 100% lime .
• Given a new bag of candy, the random variable H (for hypothesis) denotes the type of the bag, with
possible values h1 through h5. H is not directly observable, of course. As the pieces of candy are
opened and inspected, data are revealed—D1, D2, . . ., DN, where each Di is a random variable with
possible values cherry and lime.
• The basic task faced by the agent is to predict the flavor of the next piece of candy.
• Despite its apparent triviality, this scenario serves to introduce many of the major issues. The
agent really does need to infer a theory of its world, albeit a very simple one.
Bayesian learning (partial beliefs)
• Bayesian learning simply calculates the probability of each hypothesis, given the data, and makes
predictions on that basis. That is, the predictions are made by using all the hypotheses, weighted
by their probabilities, rather than by using just a single “best” hypothesis.
• A Bayesian estimate calculates the probability of a proposition by combining a prior estimate with
new relevant evidence.
Bayes’ Theorem
• P(hi | d) = αP(d | hi)P(hi), where α is a normalization constant.
• P(hi | d) is the posterior probability of hypothesis hi given the data d.
• P(hi) is the prior probability of hi.
• P(d | hi) is the likelihood of the data under hypothesis hi.
• The key quantities in the Bayesian approach are the hypothesis prior, P(hi), and the likelihood of
the data under each hypothesis, P(d | hi).
a) Posterior probabilities P(hi | d1, . . . , dN). The number of observations N ranges from 1 to 10,
and each observation is of a lime candy.
b) Bayesian prediction P(dN+1 = lime | d1, . . . , dN).
• the Bayesian prediction eventually agrees with the true hypothesis. This is characteristic of
Bayesian learning.
• For any fixed prior that does not rule out the true hypothesis, the posterior probability of any false
hypothesis will, under certain technical conditions, eventually vanish.
• This happens simply because the probability of generating “uncharacteristic” data indefinitely is
vanishingly small.
• A very common approximation—one that is usually adopted in science—is to make predictions
based on a single most probable hypothesis—that is, an hi that maximizes P(hi | d). This is often
called a maximum a posteriori or MAP (pronounced “em-ay-pee”) hypothesis.
• Predictions made according to an MAP hypothesis hMAP are approximately Bayesian to the extent
that P(X | d) ≈ P(X | hMAP). In our candy example, hMAP = h5 after three lime candies in a row.
• A final simplification is provided by assuming a uniform prior over the space of hypotheses. In
that case, MAP learning reduces to choosing an hi that maximizes P(d | hi).
• This is called a maximum-likelihood (ML) hypothesis, hML. Maximum-likelihood learning is very
common in statistics, a discipline in which many researchers distrust the subjective nature of
hypothesis priors.
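The candy example can be worked through directly in code. The prior (0.1, 0.2, 0.4, 0.2, 0.1) over h1–h5 is the one used in the standard treatment of this example; the sketch computes the posterior after observing limes, the MAP hypothesis, and the full Bayesian prediction:

```python
# Bayesian learning for the five candy-bag hypotheses.

theta = {"h1": 1.0, "h2": 0.75, "h3": 0.5, "h4": 0.25, "h5": 0.0}  # P(cherry | h)
prior = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}

def posterior(n_limes):
    # P(h | d) = alpha * P(d | h) * P(h), with d = n_limes lime candies in a row
    unnorm = {h: prior[h] * (1 - theta[h]) ** n_limes for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

post = posterior(3)
print(post)                                   # h5 dominates after 3 limes
print(max(post, key=post.get))                # hMAP
# Bayesian prediction P(next = lime | d): weighted average over all hypotheses
print(sum(post[h] * (1 - theta[h]) for h in post))
```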
LEARNING WITH COMPLETE DATA
• The general task of learning a probability model, given data that are assumed to be generated from
that model, is called density estimation.
• the simplest case, where we have complete data. Data are complete when each data point contains
values for every variable in the probability model being learned.
• We focus on parameter learning—finding the numerical parameters for a probability model
whose structure is fixed.
Maximum-likelihood parameter learning: Discrete models
• Suppose we buy a bag of lime and cherry candy from a new manufacturer whose lime–cherry
proportions are completely unknown; the fraction could be anywhere between 0 and 1. In that case,
we have a continuum of hypotheses. The parameter in this case, which we call θ, is the proportion
of cherry candies, and the hypothesis is hθ. (The proportion of limes is just 1 − θ.) If we assume
that all proportions are equally likely a priori, then a maximum-likelihood approach is reasonable.
• If we model the situation with a Bayesian network, we need just one random variable, Flavor (the
flavor of a randomly chosen candy from the bag). It has values cherry and lime, where the
probability of cherry is θ
(a) Bayesian network model for the case of candies with an unknown proportion of cherries and limes.
(b) Model for the case where the wrapper color depends (probabilistically) on the candy flavor.
• Now suppose we unwrap N candies, of which c are cherries and ℓ = N − c are limes. The likelihood
of this particular data set is P(d | hθ) = θ^c (1 − θ)^ℓ.
• The maximum-likelihood hypothesis is given by the value of θ that maximizes this expression.
The same value is obtained by maximizing the log likelihood, L(d | hθ) = c log θ + ℓ log(1 − θ),
whose maximum is at θ = c/N.
Naive Bayes models
• When you have hundreds of thousands of data points and quite a few variables in your training data
set, Naive Bayes is a natural first choice, since it can be extremely fast relative to other classification
algorithms. It works on Bayes’ theorem of probability to predict the class of an unknown data set.
• It is a classification technique based on Bayes’ Theorem with an assumption of independence
among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature. For example, a fruit
may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these
features depend on each other or upon the existence of the other features, all of these properties
independently contribute to the probability that this fruit is an apple and that is why it is known
as ‘Naive’.
• Naive Bayes model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
• Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and
P(x|c): P(c|x) = P(x|c) P(c) / P(x).
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
• Let’s understand it using an example. Below I have a training data set of weather and corresponding
target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify whether
players will play or not based on weather condition. Let’s follow the below steps to perform it.
• Step 1: Convert the data set into a frequency table
• Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29 and
probability of playing is 0.64.
• Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of prediction.
• Problem: Players will play if the weather is sunny. Is this statement correct?
• We can solve it using the above method of posterior probability.
• P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
• Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
• Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability, so the prediction
is that play will happen.
• Naive Bayes uses a similar method to predict the probability of different class based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple
classes.
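The sunny/play calculation above, transcribed as code using the same frequency counts (9 Yes out of 14 days, 3 of the 9 Yes days sunny, 5 of 14 days sunny):

```python
# P(Yes | Sunny) via Bayes' theorem, from the frequency table in the notes.

p_sunny_given_yes = 3 / 9      # P(Sunny | Yes) = 0.33
p_yes = 9 / 14                 # P(Yes)         = 0.64
p_sunny = 5 / 14               # P(Sunny)       = 0.36

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6 -> predict "play"
```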
Maximum-likelihood parameter learning: Continuous models
• Because continuous variables are ubiquitous in real-world applications, it is important to know how
to learn the parameters of continuous models from data. The principles for maximum-likelihood
learning are identical in the continuous and discrete cases.
(a) A linear Gaussian model described as y =θ1x + θ2 plus Gaussian noise with fixed variance.
(b) A set of 50 data points generated from this model.
Density estimation with nonparametric models
• It is possible to learn a probability model without making any assumptions about its structure and
parameterization by adopting nonparametric methods.
• The task of nonparametric density estimation is typically done in continuous domains, such as
that shown in Figure (a). The figure shows a probability density function on a space defined by two
continuous variables. In Figure (b) we see a sample of data points from this density function.
• First we will consider k-nearest-neighbors models.
• Given a sample of data points, to estimate the unknown probability density at a query point x we
can simply measure the density of the data points in the neighborhood of x.
• The figure shows two query points (small squares). For each query point we have drawn the smallest
circle that encloses 10 neighbors—the 10-nearest-neighborhood. We can see that the central circle is
large, meaning there is a low density there, and the circle on the right is small, meaning there is a
high density there.
(a) A 3D plot of the mixture of Gaussians.
(b) A 128-point sample of points from the mixture, together with two query points (small squares) and their 10-
nearest-neighborhoods (medium and large circles).
LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM
• The preceding section dealt with the fully observable case. Many real-world problems have hidden
variables (sometimes called latent variables), which are not observable in the data that are
available for learning. For example, medical records often include the observed symptoms, the
physician’s diagnosis, the treatment applied, and perhaps the outcome of the treatment, but they
seldom contain a direct observation of the disease itself! (Note that the diagnosis is not the disease;
it is a causal consequence of the observed symptoms, which are in turn caused by the disease.) One
might ask, “If the disease is not observed, why not construct a model without it?”
• Assume that each variable has three possible values (e.g., none, moderate, and severe). Removing
the hidden variable from the network in (a) yields the network in (b); the total number of parameters
increases from 78 to 708. Thus, latent variables can dramatically reduce the number of parameters
required to specify a Bayesian network. This, in turn, can dramatically reduce the amount of data
needed to learn the parameters.
• Hidden variables are important, but they do complicate the learning problem. For example, it is not
obvious how to learn the conditional distribution for HeartDisease, given its parents, because we
do not know the value of HeartDisease in each case; the same problem arises in learning the
distributions for the symptoms.
• This section describes an algorithm called expectation–maximization, or EM, that solves this
problem in a very general way.
(a) A simple diagnostic network for heart disease, which is assumed to be a hidden variable. Each variable has three
possible values and is labeled with the number of independent parameters in its conditional distribution; the total
number is 78.
(b) The equivalent network with HeartDisease removed. Note that the symptom variables are no longer
conditionally independent given their parents. This network requires 708 parameters.
Unsupervised clustering: Learning mixtures of Gaussians
• Unsupervised clustering is the problem of discerning multiple categories in a collection of objects.
The problem is unsupervised because the category labels are not given. For example, suppose we
record the spectra of a hundred thousand stars; are there different types of stars revealed by the
spectra, and, if so, how many types and what are their characteristics? We are all familiar with
terms such as “red giant” and “white dwarf,” but the stars do not carry these labels on their hats—
astronomers had to perform unsupervised clustering to identify these categories. Other examples
include the identification of species, genera, orders, and so on in the Linnean taxonomy and the
creation of natural kinds for ordinary objects.
• Unsupervised clustering begins with data. Figure 20.11(b) shows 500 data points, each of which
specifies the values of two continuous attributes. The data points might correspond to stars, and the
attributes might correspond to spectral intensities at two particular frequencies. Next, we need to
understand what kind of probability distribution might have generated the data. Clustering
presumes that the data are generated from a mixture distribution, P. Such a distribution has k
components, each of which is a distribution in its own right. A data point is generated by first
choosing a component and then generating a sample from that component. Let the random variable
C denote the component, with values 1, . . . , k; then the mixture distribution is given by
P(x) = Σi P(C = i) P(x | C = i), summed over i = 1, . . . , k,
• where x refers to the values of the attributes for a data point. For continuous data, a natural choice
for the component distributions is the multivariate Gaussian, which gives the so-called mixture
of Gaussians family of distributions. The parameters of a mixture of Gaussians are
o wi =P(C =i) (the weight of each component),
o μi (the mean of each component), and
o Σi (the covariance of each component).
(a) A Gaussian mixture model with three components; the weights (left-to right) are 0.2, 0.3, and 0.5.
(b) 500 data points sampled from the model in (a). (c) The model reconstructed by EM from the data in (b).
• If we knew which component generated each data point, then it would be easy to recover the
component Gaussians: we could just select all the data points from a given component and then
apply (a multivariate version of) the maximum-likelihood Gaussian-fitting equations to that set of data.
• For the mixture of Gaussians, we initialize the mixture-model parameters arbitrarily and then iterate
the following two steps:
• E-step: Compute the probabilities pij =P(C =i | xj), the probability that datum xj was generated by
component i. By Bayes’ rule, we have pij =αP(xj |C =i)P(C =i). The term P(xj |C =i) is just the
probability at xj of the ith Gaussian, and the term P(C =i) is just the weight parameter for the ith
Gaussian. Define ni =Σ j pij, the effective number of data points currently assigned to component
i.
• M-step: Compute the new mean, covariance, and component weights from the expected counts, in
sequence: μi ← Σj pij xj / ni, then Σi ← Σj pij (xj − μi)(xj − μi)ᵀ / ni, then wi ← ni / N,
• where N is the total number of data points.
• The E-step, or expectation step, can be viewed as computing the expected values pij of the hidden
indicator variables Zij, where Zij is 1 if datum xj was generated by the ith component and 0
otherwise.
• The M-step, or maximization step, finds the new values of the parameters that maximize the log
likelihood of the data, given the expected values of the hidden indicator variables.
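A minimal one-dimensional EM sketch for a mixture of two Gaussians (the multivariate case replaces the variances with covariance matrices); the synthetic data and initial parameters are arbitrary:

```python
import numpy as np

# EM for a 1-D mixture of two Gaussians, following the E-step/M-step above.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])

w = np.array([0.5, 0.5])          # component weights w_i = P(C = i)
mu = np.array([-1.0, 1.0])        # component means
var = np.array([1.0, 1.0])        # component variances

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: p_ij = P(C = i | x_j) by Bayes' rule
    p = np.vstack([w[i] * gauss(x, mu[i], var[i]) for i in range(2)])
    p /= p.sum(axis=0)
    n = p.sum(axis=1)             # effective counts n_i
    # M-step: re-estimate means, variances, and weights from the expected counts
    mu = (p @ x) / n
    var = np.array([(p[i] * (x - mu[i]) ** 2).sum() / n[i] for i in range(2)])
    w = n / len(x)

print(w.round(2), mu.round(2), var.round(2))   # recovers ~(0.4, 0.6), (-2, 3)
```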
Learning Bayesian networks with hidden variables
• To learn a Bayesian network with hidden variables, we apply the same insights that worked for
mixtures of Gaussians.
(a) A mixture model for candy. The proportions of different flavors, wrappers, presence of holes depend on the bag,
which is not observed.
(b) Bayesian network for a Gaussian mixture. The mean and covariance of the observable variables X depend on the
component C.
• Figure represents a situation in which there are two bags of candies that have been mixed together.
Candies are described by three features: in addition to the Flavor and the Wrapper , some candies
have a Hole in the middle and some do not. The distribution of candies in each bag is described by
a naive Bayes model: the features are independent, given the bag, but the conditional probability
distribution for each feature depends on the bag.
• In the figure, the bag is a hidden variable because, once the candies have been mixed together, we
no longer know which bag each candy came from. In such a case, can we recover the descriptions
of the two bags by observing candies from the mixture?
Learning hidden Markov models
• Our final application of EM involves learning the transition probabilities in hidden Markov models
(HMMs).
• A hidden Markov model can be represented by a dynamic Bayes net with a single discrete state
variable. Each data point consists of an observation sequence of finite length, so the problem is to
learn the transition probabilities from a set of observation sequences (or from just one long
sequence).
• In a hidden Markov model, on the other hand, the individual transition probabilities from state i to
state j at time t, θijt = P(Xt+1 = j | Xt = i), are repeated across time—that is, θijt = θij for all t. To estimate
the transition probability from state i to state j, we simply calculate the expected proportion of times
that the system undergoes a transition to state j when in state i:
θ̂ij = (expected number of transitions from i to j) / (expected number of transitions out of i).
• The expected counts are computed by an HMM inference algorithm. The forward–backward
algorithm shown in Figure 15.4 can be modified very easily to compute the necessary probabilities.
One important point is that the probabilities required are obtained by smoothing rather than
filtering; that is, we need to pay attention to subsequent evidence in estimating the probability that
a particular transition occurred. The evidence in a murder case is usually obtained after the crime
(i.e., the transition from state i to state j) has taken place.
REINFORCEMENT LEARNING
• The agent needs to know that something good has happened when it (accidentally) checkmates the
opponent, and that something bad has happened when it is checkmated—or vice versa, if the game
is suicide chess. This kind of feedback is called a reward, or reinforcement. In games like chess,
the reinforcement is received only at the end of the game.
• The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly
optimal) policy for the environment.
• In many complex domains, reinforcement learning is the only feasible way to train a program to
perform at high levels.
• Reinforcement learning might be considered to encompass all of AI: an agent is placed in an
environment and must learn to behave successfully therein.
• A utility-based agent learns a utility function on states and uses it to select actions that maximize
the expected outcome utility.
• A Q-learning agent learns an action-utility function, or Q-function, giving the expected utility of
taking a given action in a given state.
• A reflex agent learns a policy that maps directly from states to actions.
• passive learning, where the agent’s policy is fixed and the task is to learn the utilities of states (or
state–action pairs); this could also involve learning a model of the environment.
• active learning, where the agent must also learn what to do. The principal issue is exploration: an
agent must experience as much as possible of its environment in order to learn how to behave in it.
PASSIVE REINFORCEMENT LEARNING
• we start with the case of a passive learning agent using a state-based representation in a fully
observable environment. In passive learning, the agent’s policy π is fixed: in state s, it always
executes the action π(s).
• Its goal is simply to learn how good the policy is—that is, to learn the utility function Uπ(s).
• The passive learning task is similar to the policy evaluation task, part of the policy iteration
algorithm. The main difference is that the passive learning agent does not know the transition
model P(s′ | s, a), which specifies the probability of reaching state s′ from state s after doing action
a; nor does it know the reward function R(s), which specifies the reward for each state.
(a) A policy π for the 4×3 world; this policy happens to be optimal with rewards of R(s)= − 0.04 in the nonterminal
states and no discounting.
(b) The utilities of the states in the 4×3 world, given policy π.
• The agent executes a set of trials in the environment using its policy π. In each trial, the agent starts
in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal
states, (4,2) or (4,3). Its percepts supply both the current state and the reward received in that state.
A typical trial might pass through (1,1), (1,2), (1,3), (2,3), (3,3), and (4,3), receiving −0.04 in each
nonterminal state and +1 at the end.
Direct utility estimation
• It is clear that direct utility estimation is just an instance of supervised learning where each
example has the state as input and the observed reward-to-go as output. This means that we
have reduced reinforcement learning to a standard inductive learning problem
• Direct utility estimation succeeds in reducing the reinforcement learning problem to an inductive
learning problem, about which much is known. Unfortunately, it misses a very important source of
information, namely, the fact that the utilities of states are not independent! The utility of each state
equals its own reward plus the expected utility of its successor states.
• That is, the utility values obey the Bellman equations for a fixed policy:
Uπ(s) = R(s) + γ Σs′ P(s′ | s, π(s)) Uπ(s′).
Adaptive dynamic programming
• An adaptive dynamic programming (or ADP) agent takes advantage of the constraints among
the utilities of states by learning the transition model that connects them and solving the
corresponding Markov decision process using a dynamic programming method.
• The process of learning the model itself is easy, because the environment is fully observable.
This means that we have a supervised learning task where the input is a state–action pair and the
output is the resulting state.
Temporal-difference learning
• Temporal difference (TD) learning is an approach to learning how to predict a quantity that
depends on future values of a given signal. The name TD derives from its use of changes, or
differences, in predictions over successive time steps to drive the learning process. The prediction
at any given time step is updated to bring it closer to the prediction of the same quantity at the next
time step. It is a supervised learning process in which the training signal for a prediction is a future
prediction.
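A sketch of the TD(0) update U(s) ← U(s) + α(R(s) + γU(s′) − U(s)) applied to a single hypothetical trial in the 4×3 world:

```python
# TD(0): move each state's utility estimate toward the one-step target.

def td_update(U, trial, alpha=0.1, gamma=1.0):
    """One pass over a trial given as a list of (state, reward) pairs."""
    s_T, r_T = trial[-1]
    U[s_T] = r_T                            # terminal state's utility is its reward
    for (s, r), (s_next, _) in zip(trial, trial[1:]):
        U.setdefault(s, 0.0)
        U.setdefault(s_next, 0.0)
        # move U(s) toward the target R(s) + gamma * U(s')
        U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U

trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
print(td_update({}, trial))
```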
ACTIVE REINFORCEMENT LEARNING
• An active agent must decide what actions to take. Let us begin with the adaptive dynamic
programming agent and consider how it must be modified to handle this new freedom.
• First, the agent will need to learn a complete model with outcome probabilities for all actions, rather
than just the model for the fixed policy. The simple learning mechanism used by PASSIVE-ADP-
AGENT will do just fine for this. Next, we need to take into account the fact that the agent has a
choice of actions. The utilities it needs to learn are those defined by the optimal policy; they obey
the Bellman equations U(s) = R(s) + γ maxa Σs′ P(s′ | s, a) U(s′).
Exploration
• The agent does not learn the true utilities or the true optimal policy! What happens instead is that,
in the 39th trial, it finds a policy that reaches the +1 reward along the lower route via (2,1), (3,1),
(3,2), and (3,3).
• After experimenting with minor variations, from the 276th trial onward it sticks to that policy, never
learning the utilities of the other states and never finding the optimal route via (1,2), (1,3), and
(2,3).
• We call this agent the greedy agent. Repeated experiments show that the greedy agent very seldom
converges to the optimal policy for this environment and sometimes converges to really horrendous
policies.
• Actions do more than provide rewards; by improving the model, the agent will receive greater
rewards in the future.
• An agent therefore must make a tradeoff between exploitation to maximize its reward—as
reflected in its current utility estimates—and exploration to maximize its long-term well-being.
• Pure exploration to improve one’s knowledge is of no use if one never puts that knowledge into
practice. In the real world, one constantly has to decide between continuing in a comfortable
existence and striking out into the unknown in the hopes of discovering a new and better life.
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 

Recently uploaded (20)

Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 

AI NOTES CHAPTER SUMMARY

  • 1. TYBSC-CS SEM 5 (AI) 2018-19 NOTES FOR PROGRAMS AND SOLUTION REFER CLASSROOM NOTES WE-IT TUTORIALS CLASSES FOR BSC-IT AND BSC-CS (THANE) 8097071144/55, WEB www.weit.in 1 Unit 1 Chapter 1 What Is AI Definition of AI • Artificial Intelligence is a branch of Science which deals with helping machines find solutions to complex problems in a more human-like fashion. • This generally involves borrowing characteristics from human intelligence, and applying them as algorithms in a computer friendly way. • A more or less flexible or efficient approach can be taken depending on the requirements established, which influences how artificial the intelligent behavior appears. • AI is generally associated with Computer Science, but it has many important links with other fields such as Maths, Psychology, Cognition, Biology and Philosophy, among many others. Our ability to combine knowledge from all these fields will ultimately benefit our progress in the quest of creating an intelligent artificial being. AI is a branch of computer science which is concerned with the study and creation of computer systems that exhibit • some form of intelligence or • those characteristics which we associate with intelligence in human behavior What is intelligence? Intelligence is a property of mind that encompasses many related mental abilities, such as the capabilities to • reason • plan • solve problems • think abstractly • comprehend ideas and language and • learn THE FOUNDATIONS OF ARTIFICIAL INTELLIGENCE Foundation of AI is based on 1. Mathematics 2. Neuroscience 3. Control Theory 4. Linguistics
  • 2. TYBSC-CS SEM 5 (AI) 2018-19 NOTES FOR PROGRAMS AND SOLUTION REFER CLASSROOM NOTES WE-IT TUTORIALS CLASSES FOR BSC-IT AND BSC-CS (THANE) 8097071144/55, WEB www.weit.in 2 Mathematics • More formal logical methods • Boolean logic • Fuzzy logic • Uncertainty • The basis for most modern approaches to handle uncertainty in AI applications can be handled by o Probability theory o Modal and Temporal logics Neuroscience How do the brain works? • Early studies (1824) relied on injured and abnormal people to understand what parts of brain work • More recent studies use accurate sensors to correlate brain activity to human thought o By monitoring individual neurons, monkeys can now control a computer mouse using thought alone • Moore’s law states that computers will have as many gates as humans have neurons in 2020 • How close are we to have a mechanical brain? o Parallel computation, remapping, interconnections Control Theory • Machines can modify their behavior in response to the environment (sense/action loop) o Water-flow regulator, steam engine governor, thermostat • The theory of stable feedback systems (1894) o Build systems that transition from initial state to goal state with minimum energy o In 1950, control theory could only describe linear systems and AI largely rose as a response to this shortcoming Linguistics • Speech demonstrates so much of human intelligence o Analysis of human language reveals thought taking place in ways not understood in other settings ▪ Children can create sentences they have never heard before ▪ Language and thought are believed to be tightly intertwined History of Artificial Intelligence 1. 1943 McCulloch & Pitts: Boolean circuit model of brain 2. 1950 Turing's "Computing Machinery and Intelligence" 3. 1956 Dartmouth meeting: "Artificial Intelligence" adopted
4. 1950s Early AI programs, including Samuel's checkers program, Newell & Simon's Logic Theorist, and Gelernter's Geometry Engine
5. 1965 Robinson's complete algorithm for logical reasoning
6. 1966–73 AI discovers computational complexity; neural network research almost disappears
7. 1969–79 Early development of knowledge-based systems
8. 1980– AI becomes an industry
9. 1986– Neural networks return to popularity
10. 1987– AI becomes a science
11. 1995– The emergence of intelligent agents

THE STATE OF THE ART
The state of the art is the highest level of development of a device, technique, or scientific field achieved at a particular time. Current AI systems can, for example:
• Play chess automatically
• Book tickets online via voice and suggest the best booking rates for given days
• Drive a car autonomously on the highway, letting the driver relax without touching the steering wheel
• Run a full diagnostic system that checks symptoms and gives a proper solution according to the test cases

Unit 1 Chapter 2
Intelligent Agents
Intelligent Systems: Categorization of Intelligent Systems
1. Systems that think like humans
2. Systems that act like humans
3. Systems that think rationally
4. Systems that act rationally

Systems that think like humans
• Most of the time it is a black box, where we are not clear about our thought process.
• One has to know the functioning of the brain and its mechanism for processing information.
• It is an area of cognitive science.
o The stimuli are converted into mental representations.
o Cognitive processes manipulate representations to build new representations that are used to generate actions.
• A neural network is a computing model for processing information in a way similar to the brain.

Systems that act like humans
• The overall behavior of the system should be human-like.
• This can be assessed by observation.

Systems that think rationally
• Such systems rely on logic rather than on humans to measure correctness.
• For thinking rationally or logically, logical formulas and theories are used for synthesizing outcomes.
• For example,
o given that John is a human and all humans are mortal, one can conclude logically that John is mortal.
• Not all intelligent behavior is mediated by logical deliberation.

Systems that act rationally
• Rational behavior means doing the right thing.
• Even if the method is illogical, the observed behavior must be rational.

Components of an AI Program
AI techniques must be independent of the problem domain as far as possible.
• An AI program should have
o a knowledge base
o navigational capability
o inferencing

Knowledge Base
• AI programs should be learning in nature and update their knowledge accordingly.
• A knowledge base consists of facts and rules.
• Characteristics of knowledge:
o It is voluminous in nature and requires proper structuring
o It may be incomplete and imprecise
o It may keep on changing (dynamic)

Navigational Capability
• Navigational capability contains various control strategies.
• A control strategy
o determines the rule to be applied
o may apply some heuristics (rules of thumb)

Inferencing
• Inferencing requires
o searching through the knowledge base and
o deriving new knowledge

Agents
AI is the branch of computer science concerned with making computers behave like humans.
"Artificial Intelligence is the study of human intelligence such that it can be replicated artificially."

An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors.
• A human agent has eyes, ears, and other organs for sensors, and hands, legs, mouth, and other body parts for effectors.
• A robotic agent substitutes cameras and infrared range finders for the sensors and various motors for the effectors.
• A software agent has encoded bit strings as its percepts and actions.
Agents interact with environments through sensors and actuators.

Simple Terms
• Percept
o The agent's perceptual inputs at any given instant
• Percept sequence
o The complete history of everything that the agent has ever perceived
• The agent's behavior is mathematically described by
o the agent function
o a function mapping any given percept sequence to an action
• Practically it is described by
o an agent program
o the real implementation

Example: Vacuum-cleaner world
• Percepts: Is the current square Clean or Dirty? Which square is the agent in?
• Actions: Move left, Move right, Suck (clean), Do nothing (NoOp)

A vacuum-cleaner world with just two locations, A and B. The following program implements the agent function tabulated for this world:

    function Reflex-Vacuum-Agent([location, status]) returns an action
        if status = Dirty then return Suck
        else if location = A then return Right
        else if location = B then return Left
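As an illustration, here is a minimal Python sketch of this reflex vacuum agent; the function name reflex_vacuum_agent and the toy simulation loop are illustrative assumptions, not part of the notes.

    # A minimal sketch of the two-location reflex vacuum agent described above.
    def reflex_vacuum_agent(location, status):
        """Map the current percept (location, status) directly to an action."""
        if status == "Dirty":
            return "Suck"
        elif location == "A":
            return "Right"
        else:  # location == "B"
            return "Left"

    # Tiny simulation: both squares start dirty, the agent starts at A.
    world = {"A": "Dirty", "B": "Dirty"}
    location = "A"
    for _ in range(4):
        action = reflex_vacuum_agent(location, world[location])
        print(location, world[location], "->", action)
        if action == "Suck":
            world[location] = "Clean"
        elif action == "Right":
            location = "B"
        elif action == "Left":
            location = "A"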
Concept of Rationality
• A rational agent is one that does the right thing. As a first approximation, we will say that the right action is the one that will cause the agent to be most successful.
• That leaves us with the problem of deciding how and when to evaluate the agent's success.
• We use the term performance measure for the how: the criteria that determine how successful an agent is.
• In summary, what is rational at any given time depends on four things:
o The performance measure that defines the degree of success.
o Everything that the agent has perceived so far. We will call this complete perceptual history the percept sequence.
o What the agent knows about the environment.
o The actions that the agent can perform.

This leads to a definition of an ideal rational agent:
For each possible percept sequence, an ideal rational agent should do whatever action is expected to maximize its performance measure, on the basis of the evidence provided by the percept sequence and whatever built-in knowledge the agent has.

Example of a rational agent (the vacuum-cleaner world)
• Performance measure
o Awards one point for each clean square
▪ at each time step, over 10000 time steps
• Prior knowledge about the environment
o The geography of the environment
o Only two squares
o The effect of the actions
• Actions that the agent can perform
o Left, Right, Suck and NoOp (No Operation)
• Percept sequence
o Where is the agent?
o Does the location contain dirt?
• Under these circumstances, the agent is rational.

Environment
When designing artificial intelligence (AI) solutions, we spend a lot of time focusing on aspects such as the nature of the learning algorithms (e.g. supervised, unsupervised, semi-supervised) or the characteristics of the data (e.g. classified, unclassified). However, little attention is often paid to the nature of the environment in which the AI solution operates. As it turns out, the characteristics of the environment are one of the key elements in determining the right models for an AI solution.

Several aspects distinguish AI environments. The shape and frequency of the data, the nature of the problem, and the volume of knowledge available at any given time are some of the elements that differentiate one type of AI environment from another. Understanding the characteristics of the environment is one of the first tasks that AI practitioners focus on when tackling a specific AI problem. From that perspective, there are several categories we use to group AI problems based on the nature of the environment.
The Nature of the Environment
PEAS: To design a rational agent, we must specify the task environment.
Consider, e.g., the task of designing an automated taxi:

Performance measure?
o The performance given by the taxi should make it a most successful agent, i.e. flawless performance.
o e.g. safety, destination, profits, legality, comfort, . . .

Environment?
o Specifying the environment is a first step in designing an agent. We should specify an environment suitable for the agent's actions: if swimming is the task for an agent, then the environment must be water, not air.
o e.g. streets/freeways, traffic, pedestrians, weather, . . .

Actuators?
o These are the parts of the agent through which it performs actions in the specified environment.
o e.g. steering, accelerator, brake, horn, speaker/display, . . .

Sensors?
o These are the way to receive different attributes from the environment.
o e.g. cameras, accelerometers, gauges, engine sensors, keyboard, GPS, . . .
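To make the PEAS description concrete, here is a minimal Python sketch that records it as a plain data structure; the PEAS dataclass, its field names, and the automated_taxi instance are illustrative assumptions.

    # A PEAS description of the automated taxi as a plain data structure (sketch).
    from dataclasses import dataclass

    @dataclass
    class PEAS:
        """One list per element of the task-environment description."""
        performance: list
        environment: list
        actuators: list
        sensors: list

    automated_taxi = PEAS(
        performance=["safety", "destination", "profits", "legality", "comfort"],
        environment=["streets/freeways", "traffic", "pedestrians", "weather"],
        actuators=["steering", "accelerator", "brake", "horn", "speaker/display"],
        sensors=["cameras", "accelerometers", "gauges", "engine sensors", "GPS"],
    )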
(In designing an agent, the first step must always be to specify the task environment as fully as possible.)

Properties of task environments

Fully observable vs. partially observable
If an agent's sensors give it access to the complete state of the environment at each point in time, then the environment is fully observable, i.e. the sensors detect all aspects that are relevant to the choice of action. An environment might be partially observable because of noisy and inaccurate sensors, or because parts of the state are simply missing from the sensor data.

Deterministic vs. nondeterministic (stochastic)
If the next state of the environment is completely determined by the current state and the actions selected by the agents, then we say the environment is deterministic. In principle, an agent need not worry about uncertainty in an accessible, deterministic environment. If the environment is inaccessible, however, then it may appear to be nondeterministic. This is particularly true if the environment is complex, making it hard to keep track of all the inaccessible aspects. Thus, it is often better to think of an environment as deterministic or nondeterministic from the point of view of the agent. E.g. taxi driving (nondeterministic), a humid environment (deterministic).

Episodic vs. sequential
In an episodic environment, the agent's experience is divided into "episodes." Each episode consists of the agent perceiving and then acting. The quality of its action depends just on the episode itself, because subsequent episodes do not depend on what actions occur in previous episodes. Episodic environments are much simpler because the agent does not need to think ahead. E.g. chess (sequential).

Static vs. dynamic
If the environment can change while an agent is deliberating, then we say the environment is dynamic for that agent; otherwise it is static. Static environments are easy to deal with because the agent need not keep looking at the world while it is deciding on an action, nor need it worry about the passage of time. If the environment does not change with the passage of time but the agent's performance score does, then we say the environment is semidynamic.
Discrete vs. continuous
If there are a limited number of distinct, clearly defined percepts and actions, we say that the environment is discrete. Chess is discrete: there are a fixed number of possible moves on each turn. Taxi driving is continuous: the speed and location of the taxi and the other vehicles sweep through a range of continuous values.

Single agent vs. multiagent
Playing a crossword puzzle involves a single agent; chess playing involves two agents.
• Competitive multiagent environment: chess playing
• Cooperative multiagent environment: automated taxi drivers avoiding collisions

Examples of task environments
(The notes tabulate example task environments, such as chess and taxi driving, against the properties above; the table is not reproduced here.)

Structure of Intelligent Agents
• The job of AI is to design the agent program: a function that implements the agent mapping from percepts to actions.
• We assume this program will run on some sort of computing device, which we will call the architecture. Obviously, the program we choose has to be one that the architecture will accept and run.
• The architecture might be a plain computer, or it might include special-purpose hardware for certain tasks, such as processing camera images or filtering audio input. It might also include software that provides a degree of insulation between the raw computer and the agent program, so that we can program at a higher level.
• In general, the architecture makes the percepts from the sensors available to the program, runs the program, and feeds the program's action choices to the effectors as they are generated.
• The relationship among agents, architectures, and programs can be summed up as follows:
agent = architecture + program
• Software agents (or software robots or softbots) exist in rich, unlimited domains. Imagine a softbot designed to fly a flight simulator for a 747.
• The simulator is a very detailed, complex environment, and the software agent must choose from a wide variety of actions in real time.
• Now we have to decide how to build a real program to implement the mapping from percepts to action.
• We will find that different aspects of driving suggest different types of agent program.

Intelligent agents fall into five classes based on their degree of perceived intelligence and capability:
1. Simple reflex agents
2. Model-based reflex agents
3. Goal-based agents
4. Utility-based agents
5. Learning agents

Simple reflex agents
• Simple reflex agents act only on the basis of the current percept, ignoring the rest of the percept history.
• The agent function is based on the condition-action rule: if condition then action.
• This agent function only succeeds when the environment is fully observable.
• Some simple reflex agents can also contain information on their current state, which allows them to disregard conditions whose actuators are already triggered.
• Infinite loops are often unavoidable for simple reflex agents operating in partially observable environments.
• Note: If the agent can randomize its actions, it may be possible to escape from infinite loops.
Model-based reflex agents
• A model-based agent can handle a partially observable environment.
• Its current state is stored inside the agent, which maintains some kind of structure describing the part of the world that cannot be seen.
• This knowledge about "how the world works" is called a model of the world, hence the name "model-based agent".
• A model-based reflex agent should maintain some sort of internal model that depends on the percept history and thereby reflects at least some of the unobserved aspects of the current state.
• It then chooses an action in the same way as the reflex agent.
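A minimal Python sketch of a model-based reflex agent in the vacuum world; the class name, the belief dictionary, and the prediction that Suck leaves a square clean are illustrative assumptions.

    # Sketch: a reflex agent that keeps an internal model of both squares.
    class ModelBasedVacuumAgent:
        """Tracks the believed status of squares A and B; idles when both seem clean."""

        def __init__(self):
            self.model = {"A": None, "B": None}   # internal state: believed status

        def act(self, location, status):
            self.model[location] = status          # update the model from the percept
            if all(s == "Clean" for s in self.model.values()):
                return "NoOp"                      # believed goal state: do nothing
            if status == "Dirty":
                self.model[location] = "Clean"     # predict the effect of Suck
                return "Suck"
            return "Right" if location == "A" else "Left"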
Goal-based agents
• Goal-based agents further expand on the capabilities of model-based agents by using "goal" information.
• Goal information describes situations that are desirable.
• This gives the agent a way to choose among multiple possibilities, selecting the one which reaches a goal state.
• Search and planning are the subfields of artificial intelligence devoted to finding action sequences that achieve the agent's goals.
• In some instances the goal-based agent appears to be less efficient, but it is more flexible because the knowledge that supports its decisions is represented explicitly and can be modified.

Utility-based agents
• Goal-based agents only distinguish between goal states and non-goal states.
• It is possible to define a measure of how desirable a particular state is.
• This measure can be obtained through the use of a utility function which maps a state to a measure of the utility of the state.
• A more general performance measure should allow a comparison of different world states according to exactly how happy they would make the agent.
• The term utility can be used to describe how "happy" the agent is.
• A rational utility-based agent chooses the action that maximizes the expected utility of the action outcomes, that is, the utility the agent expects to derive, on average, given the probabilities and utilities of each outcome.
• A utility-based agent has to model and keep track of its environment, tasks that have involved a great deal of research on perception, representation, reasoning, and learning.
Learning agents
• Learning has the advantage that it allows an agent to initially operate in an unknown environment and to become more competent than its initial knowledge alone might allow.
• The most important distinction is between the "learning element", which is responsible for making improvements, and the "performance element", which is responsible for selecting external actions.
• The learning element uses feedback from the "critic" on how the agent is doing and determines how the performance element should be modified to do better in the future.
• The performance element is what we have previously considered to be the entire agent: it takes in percepts and decides on actions.
• The last component of the learning agent is the "problem generator".
• It is responsible for suggesting actions that will lead to new and informative experiences.

Unit 1 Chapter 3
Problem Solving by Searching
PROBLEM-SOLVING AGENTS
Intelligent agents are supposed to act in such a way that the environment goes through a sequence of states that maximizes the performance measure.

What is a problem-solving agent?
It is a kind of goal-based agent.

4 general steps in problem solving:
1. Goal formulation
2. Problem formulation
3. Search
4. Execute

E.g. driving from Arad to Bucharest: consider the example in which a map is given with different cities connected and their distance values mentioned; the agent starts from one city and has to reach another.

Subclass of goal-based agents
• Goal formulation
• Problem formulation
• Example problems
• Toy problems
• Real-world problems
• Search
• Search strategies
• Constraint satisfaction
• Solution

Goal Formulation
Goal formulation, based on the current situation, is the first step in problem solving. As well as formulating a goal, the agent may wish to decide on some other factors that affect the desirability of different ways of achieving the goal. For now, let us assume that the agent will consider actions at the level of driving from one major town to another. The states it will consider therefore correspond to being in a particular town.
• Declaring the goal: goal information is given to the agent, i.e. start from Arad and reach Bucharest.
• Ignoring some actions: the agent has to ignore actions that will not lead it to the desired goal. There are three roads out of Arad, one toward Sibiu, one to Timisoara, and one to Zerind. None of these achieves the goal directly, so unless the agent is very familiar with the geography of Romania, it will not know which road to follow. In other words, the agent will not know which of its possible actions is best, because it does not know enough about the state that results from taking each action.
• Limiting the objective that the agent is trying to achieve: the agent can decide its actions once it has some added knowledge about the map, i.e. the map of Romania is given to the agent.
• A goal can be defined as a set of world states: the agent can use this information to consider subsequent stages of a hypothetical journey through each of the three towns, trying to find a journey that eventually gets to Bucharest. Once it has found a path on the map from Arad to Bucharest, it can achieve its goal by carrying out the driving actions.

Problem Formulation
Problem formulation is the process of deciding what actions and states to consider, given a goal.
• The process of looking for an action sequence (the series of actions the agent carries out to reach the goal) is called search. A search algorithm takes a problem as input and returns a solution in the form of an action sequence. Once a solution is found, the actions it recommends can be carried out. This is called the execution phase. Thus, we have a simple "formulate, search, execute" design for the agent.

Well-defined problems and solutions
A problem is defined by four items:

Initial state: The state that the agent starts in, e.g. the initial state for our agent in Romania might be described as In(Arad).
Successor function: A description of the possible actions available to the agent. The most common formulation uses a successor function: given a particular state x, SUCCESSOR-FN(x) returns a set of <action, successor> ordered pairs, where each action is one of the legal actions in state x and each successor is a state that can be reached from x by applying the action. E.g., from the state In(Arad), the successor function for the Romania problem would return
{<Go(Zerind), In(Zerind)>, <Go(Sibiu), In(Sibiu)>, <Go(Timisoara), In(Timisoara)>}

Goal test: Determines whether a given state is a goal state.

Path cost (additive): A function that assigns a numeric cost to each path, e.g. the sum of distances, the number of actions executed, etc. It is usually given as c(x, a, y), the step cost of going from x to y by action a, assumed to be ≥ 0.

"A solution is a sequence of actions leading from the initial state to a goal state."

Example: The 8-puzzle consists of a 3x3 board with 8 numbered tiles and a blank space. A tile adjacent to the blank space can slide into the space. The standard formulation is as follows:
• States: A state description specifies the location of each of the eight tiles; the blank is in one of the nine squares.
• Initial state: Any state can be designated as the initial state. Note that any given goal can be reached from exactly half of the possible initial states.
• Successor function: This generates the legal states that result from trying the four actions (blank moves Left, Right, Up or Down).
• Goal test: This checks whether the state matches the goal configuration.
• Path cost: Each step costs 1, so the path cost is the number of steps in the path.
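As an illustration of such a problem formulation, here is a minimal Python sketch of an 8-puzzle successor function; the state encoding (a tuple of nine entries with 0 for the blank) and the function names are illustrative assumptions, not from the notes.

    # State: tuple of 9 entries read row by row; 0 marks the blank square.
    GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)

    MOVES = {"Left": -1, "Right": 1, "Up": -3, "Down": 3}

    def successors(state):
        """Return a list of <action, successor> pairs, as SUCCESSOR-FN(x)."""
        blank = state.index(0)
        row, col = divmod(blank, 3)
        result = []
        for action, delta in MOVES.items():
            # Reject moves that would take the blank off the board.
            if action == "Left" and col == 0: continue
            if action == "Right" and col == 2: continue
            if action == "Up" and row == 0: continue
            if action == "Down" and row == 2: continue
            new = list(state)
            swap = blank + delta
            new[blank], new[swap] = new[swap], new[blank]
            result.append((action, tuple(new)))
        return result

    def goal_test(state):
        return state == GOAL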
There are two types of search strategies used in path finding:
1) Uninformed search strategies
2) Informed search strategies

Uninformed Search Methods
Uninformed search means that the strategies have no additional information about states beyond that provided in the problem definition. All they can do is generate successors and distinguish a goal state from a non-goal state. Uninformed strategies use only the information available in the problem definition.
a. Breadth-first search
b. Depth-first search
c. Depth-limited search
d. Iterative deepening search

Note (only for understanding):
1) It is important to understand the distinction between nodes and states. A node is a bookkeeping data structure used to represent the search tree. A state corresponds to a configuration of the world.
2) We also need to represent the collection of nodes that have been generated but not yet expanded; this collection is called the fringe.
3) In AI, where the graph is represented implicitly by the initial state and successor function and is frequently infinite, complexity is expressed in terms of three quantities:
b: the branching factor, or maximum number of successors of any node;
d: the depth of the shallowest goal node; and
m: the maximum length of any path in the state space.

Breadth First Search (BFS)
• Breadth First Search (BFS) searches breadth-wise in the problem space.
• Breadth-first search is like traversing a tree where each node is a state which may be a potential candidate for a solution.
• Breadth-first search expands nodes from the root of the tree and then generates one level of the tree at a time until a solution is found.
• It is very easily implemented by maintaining a queue of nodes.
• Initially the queue contains just the root.
• In each iteration, the node at the head of the queue is removed and then expanded.
• The generated child nodes are then added to the tail of the queue.

Algorithm:
1. Place the starting node on the queue.
2. If the queue is empty, return failure and stop.
3. If the first element on the queue is a goal node, return success and stop.
4. Otherwise, remove and expand the first element from the queue and place all its children at the end of the queue in any order.
5. Go back to step 2.
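A minimal Python sketch of this queue-based BFS, reusing the successors/goal_test interface from the 8-puzzle sketch above; the parent-tracking used to recover the action sequence is an illustrative assumption.

    from collections import deque

    def breadth_first_search(start, successors, goal_test):
        """Return a list of actions from start to a goal, or None."""
        frontier = deque([start])            # the fringe, as a FIFO queue
        parent = {start: (None, None)}       # child -> (parent, action)
        while frontier:
            state = frontier.popleft()
            if goal_test(state):
                # Walk back through the parents to recover the action sequence.
                actions = []
                while True:
                    prev, action = parent[state]
                    if prev is None:
                        break
                    actions.append(action)
                    state = prev
                return list(reversed(actions))
            for action, child in successors(state):
                if child not in parent:       # avoid revisiting states
                    parent[child] = (state, action)
                    frontier.append(child)
        return None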
Advantages:
• Breadth-first search will never get trapped exploring a useless path forever.
• If there is a solution, BFS will definitely find it.
• If there is more than one solution, BFS can find the minimal one, i.e. the one that requires the smallest number of steps.

Disadvantages:
• If the solution is far away from the root, breadth-first search will consume a lot of time.

Depth First Search (DFS)
Depth-first search (DFS) is an algorithm for traversing or searching a tree, tree structure, or graph. One starts at the root (selecting some node as the root in the graph case) and explores as far as possible along each branch before backtracking.

Algorithm:
1. Push the root node onto a stack.
2. Pop a node from the stack and examine it.
• If the element sought is found in this node, quit the search and return a result.
• Otherwise push all its successors (child nodes) that have not yet been discovered onto the stack.
3. If the stack is empty, every node in the tree has been examined; quit the search and return "not found".
4. If the stack is not empty, repeat from step 2.

Advantages:
• If depth-first search finds a solution without exploring much of the space, the time and space it takes will be very small.
• The advantage of depth-first search is that its memory requirement is only linear with respect to the depth of the search. This is in contrast with breadth-first search, which requires much more space.

Disadvantages:
• Depth-first search is not guaranteed to find a solution.
• It is not a complete algorithm: it can go into infinite loops.
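A matching stack-based DFS sketch, on the same illustrative successors/goal_test interface as the BFS sketch above; the explicit visited set is an assumption that keeps the search from looping on graphs.

    def depth_first_search(start, successors, goal_test):
        """Return a goal state reachable from start, or None."""
        stack = [start]
        visited = {start}                 # guard against infinite loops on graphs
        while stack:
            state = stack.pop()           # examine the most recently pushed node
            if goal_test(state):
                return state
            for _action, child in successors(state):
                if child not in visited:
                    visited.add(child)
                    stack.append(child)
        return None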
Depth Limited Search
• Depth-limited search (DLS) is a modification of depth-first search that bounds the depth to which the search algorithm may go.
• In addition to starting with a root and goal node, a depth limit is provided below which the algorithm will not descend.
• Any nodes below that depth are omitted from the search.
• This modification keeps the algorithm from cycling indefinitely by halting the search at the pre-imposed depth.
• The time and space complexity is similar to DFS, from which the algorithm is derived.
• With depth limit l, the time complexity is O(b^l) and the space complexity is O(bl).
Iterative Deepening Search
• Iterative deepening search (IDS) is a derivative of DLS and combines the feature of depth-first search with that of breadth-first search.
• IDS operates by performing DLS searches with increasing depths until the goal is found.
• The depth begins at one and increases until the goal is found, or no further nodes can be enumerated.
• By limiting the depth of the search, we force the algorithm to also search the breadth of the graph.
• If the goal is not found, the depth that the algorithm is permitted to search is increased and the algorithm is started again.
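A recursive sketch of DLS together with the IDS driver around it, again on the illustrative successors/goal_test interface used above; the max_depth cap is an assumption that keeps the sketch from running forever on unsolvable inputs.

    def depth_limited_search(state, successors, goal_test, limit):
        """Return a goal state within `limit` moves of `state`, else None."""
        if goal_test(state):
            return state
        if limit == 0:
            return None                   # cutoff: do not descend further
        for _action, child in successors(state):
            found = depth_limited_search(child, successors, goal_test, limit - 1)
            if found is not None:
                return found
        return None

    def iterative_deepening_search(start, successors, goal_test, max_depth=50):
        """Run DLS with limits 0, 1, 2, ... until the goal is found."""
        for limit in range(max_depth + 1):
            found = depth_limited_search(start, successors, goal_test, limit)
            if found is not None:
                return found
        return None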
Comparing search strategies
(The notes compare the four strategies in a table. In terms of the quantities b, d, m and depth limit l defined above, the standard comparison is: BFS is complete, and optimal for unit step costs, with time and space O(b^d); DFS needs only O(bm) space but takes O(b^m) time and is neither complete nor optimal; DLS takes O(b^l) time and O(bl) space; IDS is complete, and optimal for unit step costs, with O(b^d) time and only O(bd) space.)

Informed Search Techniques (Heuristic Search)
• A strategy that uses problem-specific knowledge beyond the definition of the problem itself.
• Also known as "heuristic search", informed search strategies use information about the domain to (try to) head in the general direction of the goal node(s).
• Informed search methods: hill climbing, best-first, greedy search, beam search, A, A*.

Best First Search
• Best-first search is an algorithm in which a node is selected for expansion based on an evaluation function f(n).
• Traditionally the node with the lowest evaluation function is selected.
• It is not an accurate name: truly expanding the best node first would be a straight march to the goal.
• Choose the node that appears to be the best.
• There is a whole family of best-first search algorithms with different evaluation functions.
• Each has a heuristic function h(n):
o h(n) = estimated cost of the cheapest path from node n to a goal node
o Example: in route planning, the estimate of the cost of the cheapest path might be the straight-line distance between two cities.
• Quick review:
o g(n) = cost from the initial state to the current state n
o h(n) = estimated cost of the cheapest path from node n to a goal node
o f(n) = evaluation function used to select a node for expansion (usually the lowest-cost node)
• Best-first search comes in two main forms:
o Greedy best-first search
o A* search

Greedy Best-First Search
• Greedy best-first search tries to expand the node that is closest to the goal, assuming it will lead to a solution quickly:
o f(n) = h(n)
o a.k.a. "greedy search"
• Implementation
o Expand the "most desirable" node from the fringe queue.
o Sort the queue in decreasing order of desirability.
• Example: consider the straight-line distance heuristic hSLD
o Expand the node that appears to be closest to the goal.
o hSLD(In(Arad)) = 366
• Notice that the values of hSLD cannot be computed from the problem description itself.
• It takes some experience to know that hSLD is correlated with actual road distances, and is therefore a useful heuristic.
So the path is: Arad - Sibiu - Fagaras - Bucharest

A* Search
• A* (A star) is the most widely known form of best-first search.
• It evaluates nodes by combining g(n) and h(n):
o f(n) = g(n) + h(n)
o where
▪ g(n) = cost so far to reach n
▪ h(n) = estimated cost to goal from n
▪ f(n) = estimated total cost of the path through n
• When h(n) = actual cost to goal
o Only nodes on the correct path are expanded
o The optimal solution is found
• When h(n) < actual cost to goal
o Additional nodes are expanded
o The optimal solution is found
• When h(n) > actual cost to goal
o The optimal solution can be overlooked
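A compact A* sketch using a priority queue ordered by f(n) = g(n) + h(n); the heuristic argument, the unit default step cost, and the tie-breaking counter are illustrative assumptions on top of the successors/goal_test interface used above.

    import heapq
    import itertools

    def a_star_search(start, successors, goal_test, h, step_cost=lambda s, a, t: 1):
        """Return (cost, goal state) for a cheapest path, or None."""
        counter = itertools.count()       # tie-breaker so states need not be comparable
        frontier = [(h(start), next(counter), 0, start)]   # entries: (f, tie, g, state)
        best_g = {start: 0}
        while frontier:
            f, _, g, state = heapq.heappop(frontier)
            if goal_test(state):
                return g, state
            for action, child in successors(state):
                g2 = g + step_cost(state, action, child)
                if g2 < best_g.get(child, float("inf")):   # found a cheaper route
                    best_g[child] = g2
                    heapq.heappush(frontier, (g2 + h(child), next(counter), g2, child))
        return None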
The main drawback of the A* algorithm, and indeed of any best-first search, is its memory requirement.

Heuristic Function
"A rule of thumb, simplification, or educated guess that reduces or limits the search for solutions in domains that are difficult and poorly understood."
o h(n) = estimated cost of the cheapest path from node n to a goal node
o If n is a goal, then h(n) = 0

A heuristic is a technique which evaluates a state and finds the significance of that state with respect to the goal state; because of this it is possible to compare various states and choose the best state to visit next. Heuristics are used to improve the efficiency of the search process. The search can be improved by evaluating the states. States can be evaluated by applying an evaluation function which tells the significance of a state for achieving the goal state. The function which evaluates a state is the heuristic function, and the value it calculates is the heuristic value of the state. The heuristic function is written h(n).

8-puzzle problem: admissible heuristics
h1(n) = number of misplaced tiles
h2(n) = total Manhattan distance (i.e., the number of squares each tile is away from its desired location)
In the example configuration,
h1(S) = 6
h2(S) = 2 + 0 + 3 + 1 + 0 + 1 + 3 + 4 = 14
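Minimal sketches of both heuristics, on the tuple state encoding assumed in the 8-puzzle sketch above; the goal layout is repeated here so the block stands alone.

    GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)   # same encoding as the 8-puzzle sketch above

    def h1(state, goal=GOAL):
        """Number of misplaced tiles (blank excluded)."""
        return sum(1 for tile, want in zip(state, goal) if tile != 0 and tile != want)

    def h2(state, goal=GOAL):
        """Total Manhattan distance of each tile from its goal square."""
        total = 0
        for pos, tile in enumerate(state):
            if tile == 0:
                continue
            want = goal.index(tile)
            total += abs(pos // 3 - want // 3) + abs(pos % 3 - want % 3)
        return total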
If h2 dominates h1, then h2 is better for search than h1.

Memory Bounded Heuristic Search
The simplest way to reduce the memory requirements of A* is to adapt the idea of iterative deepening search to the heuristic search context.

Types of memory-bounded algorithms:
1. Iterative deepening A* (IDA*): here the cutoff information is the f-cost (g + h) instead of the depth.
2. Recursive best-first search (RBFS): a recursive algorithm that attempts to mimic standard best-first search with linear space.
3. Simplified memory-bounded A* (SMA*): drops the worst leaf node when memory is full.

Iterative Deepening A* (IDA*)
Just as iterative deepening solved the space problem of breadth-first search, iterative deepening A* (IDA*) eliminates the memory constraints of the A* search algorithm without sacrificing solution optimality. Each iteration of the algorithm is a depth-first search that keeps track of the cost, f(n) = g(n) + h(n), of each node generated. As soon as a node is generated whose cost exceeds a threshold for that iteration, its path is cut off, and the search backtracks before continuing. The cost threshold is initialized to the heuristic estimate of the initial state, and in each successive iteration is increased to the total cost of the lowest-cost node that was pruned during the previous iteration. The algorithm terminates when a goal state is reached whose total cost does not exceed the current threshold.

Since iterative deepening A* performs a series of depth-first searches, its memory requirement is linear with respect to the maximum search depth. In addition, if the heuristic function is admissible, IDA* finds an optimal solution. Finally, by an argument similar to that presented for DFID, IDA* expands asymptotically the same number of nodes as A* on a tree, provided that the number of nodes grows exponentially with solution cost. These properties, together with the optimality of A*, imply that IDA* is asymptotically optimal in time and space over all heuristic search algorithms that find optimal solutions on a tree. Additional benefits of IDA* are that it is much easier to implement, and often runs faster, than A*, since it does not incur the overhead of managing the open and closed lists.

Recursive Best-First Search (RBFS)
IDA* search is no longer a best-first search, since the total cost of a child can be less than that of its parent, and thus nodes are not necessarily expanded in best-first order. Recursive best-first search (RBFS) is an alternative algorithm: a best-first search that runs in space that is linear with respect to the maximum search depth, regardless of the cost function used. Even with an admissible cost function, RBFS generates fewer nodes than IDA* and is generally superior to IDA*, except for a small increase in the cost per node generation.
Simplified Memory-Bounded A* (SMA*)
• Uses all available memory.
• I.e. it expands the best leaves until the available memory is full.
• When memory is full, SMA* drops the worst leaf node (the one with the highest f-value).
• Like RBFS, it backs up the value of the forgotten node to its parent.
• What if all leaves have the same f-value?
o The same node could be selected for both expansion and deletion.
o SMA* solves this by expanding the newest best leaf and deleting the oldest worst leaf.
• SMA* is complete if a solution is reachable, and optimal if the optimal solution is reachable.
o SMA* will utilize whatever memory is made available to it.
o It avoids repeated states as far as its memory allows.
o It is complete if the available memory is sufficient to store the shallowest solution path.
Unit 2
LEARNING FROM EXAMPLES
An agent is learning if it improves its performance on future tasks after making observations about the world.

FORMS OF LEARNING
How an agent can be improved by learning from data depends on four major factors:

Which component is to be improved.
• A direct mapping from conditions on the current state to actions (e.g. when to apply the brake)
• A means to infer relevant properties of the world from the percept sequence (e.g. wet roads)
• Information about the way the world evolves (updated with new experiences)

What prior knowledge the agent already has.
• Most machine learning research covers inputs that form a factored representation, a vector of attribute values, and outputs that can be either a continuous numerical value or a discrete value.

What representation is used for the data and the component.

What feedback is available to learn from.
• There are three types of feedback:
o In unsupervised learning the agent learns patterns in the input without explicit feedback. The most common unsupervised learning task is clustering: detecting potentially useful clusters of input examples.
o In reinforcement learning the agent learns from a series of reinforcements: rewards or punishments.
o In supervised learning the agent observes some example input-output pairs and learns a function that maps from input to output.

SUPERVISED LEARNING
TRAINING SET
• Given a training set of N example input-output pairs
(x1, y1), (x2, y2), . . . , (xN, yN),
• where each yj was generated by an unknown function y = f(x),
• discover a function h that approximates the true function f.

HYPOTHESIS
• The function h is a hypothesis.
• Learning is a search through the space of possible hypotheses for one that will perform well.

TEST SET
• To measure the accuracy of a hypothesis we give it a test set of examples that are distinct from the training set.

GENERALIZATION
• We say a hypothesis generalizes well if it correctly predicts the value of y for novel examples.

CLASSIFICATION
• When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the learning problem is called classification.

REGRESSION
• When y is a number (such as tomorrow's temperature), the learning problem is called regression.

HYPOTHESIS SPACE
• We approximate f with a function h selected from a hypothesis space (the set of all legal hypotheses).
CONSISTENT
• A hypothesis (e.g. a fitted line) is called consistent if it agrees with all the data.

REALIZABLE
• We say that a learning problem is realizable if the hypothesis space contains the true function.

Learning Decision Trees
• A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
• It is one way to display an algorithm.
• Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal.
• Another use of decision trees is as a descriptive means for calculating conditional probabilities.
• Decision trees can express any function of the input attributes.
• Training input:
o Data points with a set of attributes
• Classifier output:
o Can be boolean or have multiple outputs
o Each leaf stores an "answer"

Example: Should we wait for a table at a restaurant? Possible attributes:
• Alternate restaurant nearby?
• Is there a bar to wait in?
• Is it Friday or Saturday?
• How hungry are we?
• How busy is the restaurant?
• How many people are in the restaurant?
Expressiveness of Decision Trees
• Each path through the tree corresponds to an implication like
FORALL r Patrons(r, Full) & WaitEstimate(r, 0-10) & Hungry(r, N) -> WillWait(r)
Hence a decision tree corresponds to a conjunction of implications.
• It cannot express tests that refer to two different objects, like:
EXISTS r2 Nearby(r2) & Price(r, p) & Price(r2, p2) & Cheaper(p2, p)
• The expressiveness is essentially that of propositional logic (no function symbols, no existential quantifier).
• The number of distinct boolean functions of n attributes is 2^(2^n), since for each of the 2^n input combinations an output value has to be defined (e.g. for n = 6 there are about 2 x 10^19 different functions).
• Functions like the parity function (1 for an even number of true inputs, 0 for odd) or the majority function (1 if more than half of the inputs are 1) end up as large decision trees.

Different Solutions
• Trivial solution: construct a decision tree that has one path to a leaf for each example. This fits the given examples, but generalizes badly.
• Occam's razor: the most likely hypothesis is the simplest one that is consistent with all observations.
• Finding the smallest decision tree is intractable, hence heuristic decisions:
o Test the most important attribute first.
o Most important = makes the most difference to the classification of an example. This yields short paths in the tree and small trees.
• Compare splitting the examples by testing different attributes (cf. Patrons vs. Type).

Assessing the Performance of the Learning Algorithm
A learning algorithm is good if it produces hypotheses that do a good job of predicting the classifications of unseen examples.
A prediction is good if it turns out to be true, so we can assess the quality of a hypothesis by checking its predictions against the correct classification once we know it. We do this on a set of examples known as the test set.

Selecting the Best Attribute
If we train on all our available examples, then we will have to go out and get some more to test on, so often it is more convenient to adopt the following methodology:
1. Collect a large set of examples.
2. Divide it into two disjoint sets: the training set and the test set.
3. Use the learning algorithm with the training set as examples to generate a hypothesis H.
4. Measure the percentage of examples in the test set that are correctly classified by H.
5. Repeat steps 1 to 4 for different sizes of training sets and different randomly selected training sets of each size.

Choosing attribute tests
ENTROPY
• We will use the notion of information gain, which is defined in terms of entropy, the fundamental quantity in information theory.
• Entropy is a measure of the uncertainty of a random variable; acquisition of information corresponds to a reduction in entropy.
• A random variable with only one value (a coin that always comes up heads) has no uncertainty, and thus its entropy is defined as zero; we gain no information by observing its value. A flip of a fair coin is equally likely to come up heads or tails, 0 or 1, and this counts as "1 bit" of entropy.
• In general, the entropy of a random variable V with values vk, each with probability P(vk), is defined as
H(V) = - Σ_k P(vk) log2 P(vk)
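A small sketch computing this entropy and the information gain of an attribute split; the helper names and the (positives, total) split encoding are illustrative assumptions.

    from math import log2

    def entropy(probs):
        """H(V) = -sum_k P(vk) log2 P(vk), skipping zero-probability values."""
        return -sum(p * log2(p) for p in probs if p > 0)

    def boolean_entropy(q):
        """Entropy of a Boolean variable that is true with probability q."""
        return entropy([q, 1 - q])

    print(boolean_entropy(0.5))   # fair coin: 1.0 bit
    print(boolean_entropy(1.0))   # always heads: 0.0 bits

    def information_gain(parent_pos, parent_n, splits):
        """Gain = parent entropy minus the weighted entropy of the splits.
        `splits` is a list of (positives, total) pairs, one per attribute value."""
        remainder = sum(t / parent_n * boolean_entropy(p / t) for p, t in splits if t)
        return boolean_entropy(parent_pos / parent_n) - remainder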
Generalization and overfitting
• Overfitting is the phenomenon in which the learning system fits the given training data so tightly that it becomes inaccurate in predicting the outcomes of unseen data.
• One of the methods used to address overfitting in decision trees is called pruning, which is done after the initial training is complete. In pruning, you trim off branches of the tree, i.e., remove decision nodes starting from the leaf nodes, such that the overall accuracy is not disturbed.
EVALUATING AND CHOOSING THE BEST HYPOTHESIS
• To make "best fit on future data" precise, we need to define "future data" and "best."
• We make the stationarity assumption: there is a probability distribution over examples that remains stationary over time.
• Examples that satisfy this assumption are called independent and identically distributed (IID).
• To define "best fit," we define the error rate of a hypothesis as the proportion of mistakes it makes—the proportion of times that h(x) ≠ y for an (x, y) example.
• HOLDOUT CROSS-VALIDATION: randomly split the available data into a training set, from which the learning algorithm produces h, and a test set, on which the accuracy of h is evaluated. This method is sometimes called holdout cross-validation.
• k-FOLD CROSS-VALIDATION: we can squeeze more out of the data and still get an accurate estimate using a technique called k-fold cross-validation, in which each example serves both as training data and as test data (see the sketch below).
• If the test set is locked away, but you still want to measure performance on unseen data as a way of selecting a good hypothesis, then divide the available data (without the test set) into a training set and a validation set.
Model selection: Complexity versus goodness of fit
• MODEL SELECTION
o Higher-degree polynomials can fit the training data better, but when the degree is too high they overfit and perform poorly on validation data. Choosing the degree of the polynomial is an instance of the problem of model selection.
• OPTIMIZATION
o Model selection defines the hypothesis space, and then optimization finds the best hypothesis within that space.
• WRAPPER
o The wrapper enumerates models according to a parameter, size. For each size, it uses cross-validation on Learner to compute the average error rate on the training and test sets.
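Here is a minimal sketch of k-fold cross-validation as described above. The `learner` callable and the example format are hypothetical placeholders, not a specific library API.

```python
# Minimal sketch of k-fold cross-validation; examples are (x, y) pairs and
# learner(training_examples) is assumed to return a hypothesis h with h(x) -> y.
import random

def k_fold_cross_validation(examples, learner, k=5):
    """Return the average validation error rate over k folds."""
    examples = examples[:]          # don't mutate the caller's list
    random.shuffle(examples)
    fold_size = len(examples) // k
    errors = []
    for i in range(k):
        validation = examples[i * fold_size:(i + 1) * fold_size]
        training = examples[:i * fold_size] + examples[(i + 1) * fold_size:]
        h = learner(training)       # train on the other k-1 folds
        mistakes = sum(1 for (x, y) in validation if h(x) != y)
        errors.append(mistakes / len(validation))
    return sum(errors) / k
```

Each example is used for validation exactly once and for training k−1 times, which is why this squeezes more out of the data than a single holdout split.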
From error rates to loss
LOSS FUNCTION
• In machine learning it is traditional to express utilities by means of a loss function. The loss function L(x, y, ŷ) is defined as the amount of utility lost by predicting h(x) = ŷ when the correct answer is f(x) = y.
NOISE
• f may be nondeterministic or noisy—it may return different values for f(x) each time x occurs. By definition, noise cannot be predicted; in many cases it arises because the observed labels y are the result of attributes of the environment not listed in x.
THE THEORY OF LEARNING
• How can we be sure that our learning algorithm has produced a hypothesis that will predict the correct value for previously unseen inputs?
• How do we know that the hypothesis h is close to the target function f if we don't know what f is?
• A hypothesis h is called approximately correct if error(h) ≤ ε, where ε is a small constant. We can find an N such that, after seeing N examples, with high probability all consistent hypotheses will be approximately correct. One can think of an approximately correct hypothesis as being "close" to the true function in hypothesis space: it lies inside what is called the ε-ball around the true function f. The hypothesis space outside this ball is called H_bad.
REGRESSION AND CLASSIFICATION WITH LINEAR MODELS
We now consider the class of linear functions of continuous-valued inputs.
Univariate linear regression
• A univariate linear function (a straight line) with input x and output y has the form y = w1·x + w0, where w0 and w1 are real-valued coefficients to be learned. We use the letter w because we think of the coefficients as weights; the value of y is changed by changing the relative weight of one term or another.
• The task of finding the hw that best fits the data is called linear regression. To fit a line to the data, all we have to do is find the values of the weights [w0, w1] that minimize the empirical (verifiable by observation) loss.
• Such problems can be addressed by a hill-climbing algorithm that follows the gradient of the function to be optimized. Because we are trying to minimize the loss, we use gradient descent. We choose any starting point in weight space—here, a point in the (w0, w1) plane—and then move to a neighboring point that is downhill, repeating until we converge on the minimum possible loss:
wi ← wi − α · ∂/∂wi Loss(w)
• The parameter α, which we called the step size, is usually called the learning rate when we are trying to minimize loss in a learning problem.
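The sketch below implements this gradient descent for univariate linear regression on squared loss. The toy data points and the learning rate are chosen for illustration only.

```python
# Minimal sketch of univariate linear regression by gradient descent.
def fit_line(points, alpha=0.01, steps=5000):
    """Learn y = w1*x + w0 minimizing squared loss via gradient descent."""
    w0, w1 = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        # Gradient of (1/n) * sum (y - (w1*x + w0))^2 with respect to w0, w1
        g0 = sum(-2 * (y - (w1 * x + w0)) for x, y in points) / n
        g1 = sum(-2 * (y - (w1 * x + w0)) * x for x, y in points) / n
        w0 -= alpha * g0   # move downhill in weight space
        w1 -= alpha * g1
    return w0, w1

data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0)]   # roughly y = 2x
print(fit_line(data))  # w0 near 0, w1 near 2
```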
Multivariate linear regression
• We can easily extend to multivariate linear regression problems, in which each example xj is an n-element vector.
• Multivariate linear regression is actually not much more complicated than the univariate case we just covered. Gradient descent will reach the (unique) minimum of the loss function; the update equation for each weight wi is
wi ← wi + α · Σj (yj − hw(xj)) · xj,i
• It is also possible to solve analytically for the w that minimizes loss. Let y be the vector of outputs for the training examples, and X be the data matrix, i.e., the matrix of inputs with one n-dimensional example per row. Then the solution
w* = (XᵀX)⁻¹ Xᵀ y
minimizes the squared error.
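The following sketch computes this analytic least-squares solution with NumPy. The small data matrix is invented for illustration; `np.linalg.lstsq` solves the same least-squares problem as the closed-form expression, but more stably than forming the inverse explicitly.

```python
# Minimal sketch of the analytic solution for multivariate linear regression.
import numpy as np

# Data matrix X: one example per row; a leading column of 1s folds the
# intercept weight w0 into the weight vector, so hw(x) = w . x.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0],
              [1.0, 4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 12.0])

# Equivalent to w* = (X^T X)^-1 X^T y:
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_star)            # learned weights [w0, w1, w2]
print(X @ w_star)        # predictions on the training examples
```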
Linear classifiers with a hard threshold
Linear functions can be used to do classification as well as regression.
(Figure: A. Plot of two seismic data parameters, body wave magnitude x1 and surface wave magnitude x2, for earthquakes (white circles) and nuclear explosions (black circles) occurring between 1982 and 1990 in Asia and the Middle East (Kebeasy et al., 1998). Also shown is a decision boundary between the classes. B. The same domain with more data points; the earthquakes and explosions are no longer linearly separable.)
• A decision boundary is a line (or a surface, in higher dimensions) that separates the two classes.
• In Figure (a), the decision boundary is a straight line. A linear decision boundary is called a linear separator, and data that admit such a separator are called linearly separable.
• We can think of h as the result of passing the linear function through a threshold function.
• A training curve measures classifier performance on a fixed training set as the learning process proceeds on that same training set. For linearly separable data, the curve shows the update rule converging to a zero-error linear separator. The "convergence" process isn't exactly pretty, but it always works on separable data.
• For noisy or nonseparable data, the training process uses a decaying learning rate schedule such as α(t) = 1000/(1000 + t).
Linear classification with logistic regression
Two candidate soft thresholds are the integral of the standard normal distribution (used for the probit model) and the logistic function (used for the logit model). Although the two functions are very similar in shape, the logistic function has more convenient mathematical properties.
(a) The hard threshold function Threshold(z) with 0/1 output.
(b) The logistic function, Logistic(z) = 1/(1 + e^(−z)), also known as the sigmoid function.
(c) Plot of a logistic regression hypothesis hw(x) = Logistic(w · x).
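As a minimal sketch of logistic regression trained by gradient descent, the snippet below uses an invented two-feature dataset; the per-example update (y − h)·x shown here is the gradient step for log-loss, not the textbook's exact derivation.

```python
# Minimal sketch of logistic regression with per-example gradient updates.
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(points, alpha=0.1, steps=2000):
    """points: list of (x1, x2, y) with y in {0, 1}. Returns weights."""
    w = [0.0, 0.0, 0.0]                       # [w0 (bias), w1, w2]
    for _ in range(steps):
        for x1, x2, y in points:
            h = logistic(w[0] + w[1] * x1 + w[2] * x2)
            # Gradient of log-loss gives the update (y - h) * input
            w[0] += alpha * (y - h)
            w[1] += alpha * (y - h) * x1
            w[2] += alpha * (y - h) * x2
    return w

data = [(1, 1, 0), (2, 1, 0), (4, 3, 1), (5, 4, 1)]
w = train_logistic(data)
print(logistic(w[0] + w[1] * 4.5 + w[2] * 3.5))  # probability of class 1
```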
ARTIFICIAL NEURAL NETWORKS
• A long-standing hypothesis holds that mental activity consists primarily of electrochemical activity in networks of brain cells called neurons.
• Inspired by this hypothesis, some of the earliest AI work aimed to create artificial neural networks, built from a simple mathematical model of a neuron.
Neural network structures
• Neural networks are composed of nodes or units connected by directed links.
• A link from unit i to unit j serves to propagate the activation ai from i to j.
• Each link also has a numeric weight wi,j associated with it, which determines the strength and sign of the connection.
• Each unit computes a weighted sum of its inputs and then applies an activation function g to this sum to derive the output:
aj = g( Σi wi,j · ai )
• If the activation function g is a hard threshold, the unit is called a perceptron.
• Having decided on the mathematical model for individual "neurons," the next task is to connect them together to form a network. There are two fundamentally distinct ways to do this:
1. A feed-forward network has connections only in one direction—that is, it forms a directed acyclic graph.
2. A recurrent network, on the other hand, feeds its outputs back into its own inputs.
• Feed-forward networks are usually arranged in layers, such that each unit receives input only from units in the immediately preceding layer.
Single-layer feed-forward neural networks (perceptrons)
• A network with all the inputs connected directly to the outputs is called a single-layer neural network, or a perceptron network. (Figure: a perceptron network with two inputs and two output units.)
Multilayer feed-forward neural networks
• A network in which the inputs are connected to the outputs through a hidden layer of units is called a multilayer neural network. (Figure: a neural network with two inputs, one hidden layer of two units, and one output unit; the dummy inputs and their associated weights are not shown.)
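The sketch below runs a forward pass through exactly such a 2-2-1 network. The hand-picked weights are an invented illustration chosen so the network computes (approximately) XOR, a function no single-layer perceptron can represent.

```python
# Minimal sketch of a forward pass through a 2-2-1 feed-forward network.
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """x: two inputs. Each weight row is [bias, w_from_input1, w_from_input2]."""
    hidden = [logistic(w[0] + w[1] * x[0] + w[2] * x[1]) for w in w_hidden]
    # Output unit: aj = g(sum_i w_ij * ai) over the hidden activations
    return logistic(w_out[0] + w_out[1] * hidden[0] + w_out[2] * hidden[1])

# Hand-picked weights that make the network compute (approximately) XOR:
w_hidden = [[-5.0, 10.0, 10.0],    # hidden unit 1 ~ OR of the inputs
            [-15.0, 10.0, 10.0]]   # hidden unit 2 ~ AND of the inputs
w_out = [-5.0, 10.0, -10.0]        # output ~ OR AND NOT(AND) = XOR

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(forward(x, w_hidden, w_out), 3))
```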
Learning neural network structures
• Like all statistical models, neural networks are subject to overfitting when there are too many parameters in the model.
• If we stick to fully connected networks, the only choices to be made concern the number of hidden layers and their sizes. The usual approach is to try several and keep the best.
• The optimal brain damage algorithm begins with a fully connected network and removes connections from it. After the network is trained for the first time, an information-theoretic approach identifies an optimal selection of connections that can be dropped.
• The tiling algorithm resembles decision-list learning. The idea is to start with a single unit that does its best to produce the correct output on as many of the training examples as possible.
NONPARAMETRIC MODELS
• A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model.
• A nonparametric model is one that cannot be characterized by a bounded set of parameters.
• For example, suppose that each hypothesis we generate simply retains within itself all of the training examples and uses all of them to predict the next example.
• Such a hypothesis family would be nonparametric because the effective number of parameters is unbounded: it grows with the number of examples. This approach is called instance-based learning or memory-based learning.
• The simplest instance-based learning method is table lookup: take all the training examples, put them in a lookup table, and then when asked for h(x), see if x is in the table; if it is, return the corresponding y.
Nearest neighbor models
• We can improve on table lookup with a slight variation: given a query xq, find the k examples that are nearest to xq. This is called k-nearest-neighbors lookup. We use the notation NN(k, xq) to denote the set of k nearest neighbors.
• Consider the decision boundary of k-nearest-neighbors classification for k = 1 and 5 on the earthquake data set shown earlier. Nonparametric methods are still subject to underfitting and overfitting, just like parametric methods. In this case 1-nearest-neighbors is overfitting; it reacts too much to the black outlier in the upper right and the white outlier at (5.4, 3.7). The 5-nearest-neighbors decision boundary is good; higher k would underfit. As usual, cross-validation can be used to select the best value of k.
• The very word "nearest" implies a distance metric. How do we measure the distance from a query point xq to an example point xj? Typically, distances are measured with a Minkowski distance:
L^p(xj, xq) = ( Σi |xj,i − xq,i|^p )^(1/p)
• With p = 2 this is Euclidean distance and with p = 1 it is Manhattan distance. With Boolean attribute values, the number of attributes on which the two points differ is called the Hamming distance.
• Often p = 2 is used if the dimensions measure similar properties, such as the width, height, and depth of parts on a conveyor belt, and Manhattan distance (p = 1) if they measure dissimilar properties, such as the age, weight, and gender of a patient.
• To avoid problems when dimensions have different scales (e.g., mm versus cm), it is common to apply normalization to the measurements in each dimension.
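A minimal sketch of k-nearest-neighbors classification with the Minkowski distance above follows; the two-dimensional "quake"/"blast" points are invented for illustration, loosely echoing the earthquake example.

```python
# Minimal sketch of k-NN classification with a Minkowski distance.
from collections import Counter

def minkowski(a, b, p=2):
    """L^p distance: p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)

def knn_classify(examples, xq, k=5, p=2):
    """examples: list of (point, label). Majority vote among NN(k, xq)."""
    neighbors = sorted(examples, key=lambda e: minkowski(e[0], xq, p))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data = [((1.0, 1.0), "quake"), ((1.2, 0.9), "quake"), ((0.8, 1.1), "quake"),
        ((3.0, 3.2), "blast"), ((3.1, 2.9), "blast"), ((2.8, 3.0), "blast")]
print(knn_classify(data, (1.1, 1.0), k=3))   # -> "quake"
```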
Finding nearest neighbors with k-d trees
• A balanced binary tree over data with an arbitrary number of dimensions is called a k-d tree, for k-dimensional tree. (In our notation, the number of dimensions is n, so they would be n-d trees.)
• The construction of a k-d tree is similar to the construction of a one-dimensional balanced binary tree.
• We recursively make a tree for the left and right sets of examples, stopping when there are fewer than two examples left. To choose a dimension to split on at each node of the tree, one can simply select dimension i mod n at level i of the tree.
• Exact lookup from a k-d tree is just like lookup from a binary tree (with the slight complication that you need to pay attention to which dimension you are testing at each node).
Locality-sensitive hashing
• Hash tables have the potential to provide even faster lookup than binary trees. But how can we find nearest neighbors using a hash table, when hash codes rely on an exact match? Hash codes randomly distribute values among the bins, but we want near points grouped together in the same bin; we want a locality-sensitive hash (LSH).
• First we define the approximate near-neighbors problem.
SUPPORT VECTOR MACHINES
• The support vector machine or SVM framework is currently the most popular approach for "off-the-shelf" supervised learning: if you don't have any specialized prior knowledge about a domain, then the SVM is an excellent method to try first. There are three properties that make SVMs attractive:
1. SVMs construct a maximum margin separator—a decision boundary with the largest possible distance to example points. This helps them generalize well.
2. SVMs create a linear separating hyperplane, but they have the ability to embed the data into a higher-dimensional space, using the so-called kernel trick, so that data that are not linearly separable in the original space become separable in the new space.
3. SVMs are a nonparametric method—they retain training examples and potentially need to store them all. On the other hand, in practice they often end up retaining only a small fraction of the examples—sometimes as few as a small constant times the number of dimensions.
o Thus SVMs combine the advantages of nonparametric and parametric models: they have the flexibility to represent complex functions, but they are resistant to overfitting.
(Figure: Support vector machine classification. a) Two classes of points (black and white circles) and three candidate linear separators. b) The maximum margin separator (heavy line) is at the midpoint of the margin (the area between the dashed lines). The support vectors (points with large circles) are the examples closest to the separator.)
• Instead of minimizing expected empirical loss on the training data, SVMs attempt to minimize expected generalization loss. We call this separator the maximum margin separator (the margin is the width of the area bounded by the dashed lines).
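The snippet below is a minimal sketch of the kernel trick using scikit-learn's SVC. The tiny XOR-style dataset is invented for illustration: it is not linearly separable in the original two dimensions, yet an RBF kernel separates it in the implicit higher-dimensional space.

```python
# Minimal sketch: a kernelized SVM separating non-linearly-separable data.
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [0.1, 0.1], [0.9, 0.1], [0.1, 0.9], [0.9, 0.9]]
y = [0, 1, 1, 0, 0, 1, 1, 0]

# The RBF kernel implicitly embeds the data in a higher-dimensional space
# where a linear separating hyperplane exists.
clf = SVC(kernel="rbf", C=10.0).fit(X, y)
print(clf.predict([[0.05, 0.95], [0.95, 0.9]]))   # expect [1, 0]
print(len(clf.support_vectors_))  # the SVM retains only the support vectors
```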
ENSEMBLE LEARNING
• The idea of ensemble learning methods is to select a collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions. For example, during cross-validation we might generate twenty different decision trees and have them vote on the best classification for a new example.
• The motivation for ensemble learning is simple. Consider an ensemble of K = 5 hypotheses and suppose that we combine their predictions using simple majority voting. For the ensemble to misclassify a new example, at least three of the five hypotheses have to misclassify it (a numeric illustration appears below).
(Figure: illustration of the increased expressive power obtained by ensemble learning. We take three linear threshold hypotheses, each of which classifies positively on the unshaded side, and classify as positive any example classified positively by all three. The resulting triangular region is a hypothesis not expressible in the original hypothesis space.)
• Another way to think about the ensemble idea is as a generic way of enlarging the hypothesis space. That is, think of the ensemble itself as a hypothesis and the new hypothesis space as the set of all possible ensembles constructable from hypotheses in the original space.
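This short sketch quantifies the majority-voting argument: if K = 5 hypotheses err independently with probability p each, the ensemble errs only when at least 3 of the 5 are wrong. Independence is an idealized assumption; real hypotheses are correlated, so the actual gain is smaller.

```python
# Minimal sketch: error probability of a majority-vote ensemble of k
# independent hypotheses, each with individual error rate p.
from math import comb

def ensemble_error(p, k=5):
    """P(majority of k independent hypotheses is wrong)."""
    majority = k // 2 + 1
    return sum(comb(k, m) * p**m * (1 - p)**(k - m)
               for m in range(majority, k + 1))

print(ensemble_error(0.1))   # ~0.0086, far below the individual 0.1
print(ensemble_error(0.3))   # ~0.163, still below 0.3
```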
• If the original hypothesis space allows for a simple and efficient learning algorithm, then the ensemble method provides a way to learn a much more expressive class of hypotheses without incurring much additional computational or algorithmic complexity. The most widely used ensemble method is called boosting.
• Boosting starts with wj = 1 for all the examples (i.e., a normal training set). From this set, it generates the first hypothesis, h1. This hypothesis will classify some of the training examples correctly and some incorrectly.
• We would like the next hypothesis to do better on the misclassified examples, so we increase their weights while decreasing the weights of the correctly classified examples.
• In some cases, boosting has been shown to yield better accuracy than bagging, but it also tends to be more likely to overfit the training data.
• There are many variants of the basic boosting idea, with different ways of adjusting the weights and combining the hypotheses. One specific algorithm is called ADABOOST.
Weighted training set
• In such a training set, each example has an associated weight wj ≥ 0. The higher the weight of an example, the higher is the importance attached to it during the learning of a hypothesis.
ADABOOST has a very important property
• If the input learning algorithm L is a weak learning algorithm—which means that L always returns a hypothesis with accuracy on the training set that is slightly better than random guessing (i.e., 50% + ε for Boolean classification)—then ADABOOST will return a hypothesis that classifies the training data perfectly for large enough K. Thus, the algorithm boosts the accuracy of the original learning algorithm on the training data.
Online Learning
• What do we do when the data are not i.i.d. (independent and identically distributed), when they can change over time? In this case, it matters when we make a prediction, so we adopt the perspective called online learning:
• An agent receives an input xj from nature, predicts the corresponding yj, and then is told the correct answer. Then the process repeats with xj+1, and so on.
• One might think this task is hopeless—if nature is adversarial, all the predictions may be wrong.
• Suppose each day a set of K pundits predicts whether the stock market will go up or down, and our task is to pool those predictions and make our own. One way to do this is to keep track of how well each expert performs, and choose to believe them in proportion to their past performance. This is called the randomized weighted majority algorithm:
1. Initialize a set of weights {w1, . . . , wK} all to 1.
2. Receive the predictions {ŷ1, . . . , ŷK} from the experts.
3. Randomly choose an expert k*, in proportion to its weight: P(k) = wk / Σk' wk'.
4. Predict ŷk*.
5. Receive the correct answer y.
6. For each expert k such that ŷk ≠ y, update wk ← β·wk.
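The sketch below follows the six steps above directly; the expert predictions and outcomes are invented for illustration.

```python
# Minimal sketch of the randomized weighted majority algorithm.
import random

def randomized_weighted_majority(expert_rounds, outcomes, beta=0.5):
    """expert_rounds[t][k] is expert k's prediction at step t."""
    k_experts = len(expert_rounds[0])
    w = [1.0] * k_experts                      # step 1: all weights start at 1
    mistakes = 0
    for preds, y in zip(expert_rounds, outcomes):
        k_star = random.choices(range(k_experts), weights=w)[0]  # step 3
        if preds[k_star] != y:                 # steps 4-5: predict, then learn y
            mistakes += 1
        for k in range(k_experts):             # step 6: penalize wrong experts
            if preds[k] != y:
                w[k] *= beta
    return mistakes, w

rounds = [["up", "up", "down"], ["down", "up", "down"], ["up", "down", "up"]]
truth = ["up", "down", "up"]
print(randomized_weighted_majority(rounds, truth))
```

Because wrong experts are repeatedly multiplied by β < 1, the weight mass concentrates on the experts with the best track record.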
PRACTICAL MACHINE LEARNING
• We consider two aspects of practical machine learning. The first involves finding algorithms capable of learning to recognize handwritten digits and squeezing every last drop of predictive performance out of them. The second involves anything but—pointing out that obtaining, cleaning, and representing the data can be at least as important as algorithm engineering.
• Recognizing handwritten digits is an important problem with many applications, including automated sorting of mail by postal code, automated reading of checks and tax returns, and data entry for hand-held computers. It is an area where rapid progress has been made, in part because of better learning algorithms and in part because of the availability of better training sets.
(Figure: examples from the NIST database of handwritten digits. Top row: examples of digits 0–9 that are easy to identify. Bottom row: more difficult examples of the same digits.)
• A table (omitted here) summarizes the error rate and some of the other characteristics of the seven techniques discussed.
Case study: Word senses and house prices
• In practical applications of machine learning, the data set is usually large, multidimensional, and messy. The data are not handed to the analyst in a prepackaged set of (x, y) values; rather, the analyst needs to go out and acquire the right data.
• There is a task to be accomplished, and most of the engineering problem is deciding what data are necessary to accomplish the task; a smaller part is choosing and implementing an appropriate machine learning method to process the data.
• The figure below shows a typical real-world example, comparing five learning algorithms on the task of word-sense classification (given a sentence such as "The bank folded," classify the word "bank" as "money-bank" or "river-bank"). The point is that machine learning researchers have focused mainly on the vertical direction: Can I invent a new learning algorithm that performs better than previously published algorithms on a standard training set of 1 million words? But the graph shows there is more room for improvement in the horizontal direction:
• Instead of inventing a new algorithm, all I need to do is gather 10 million words of training data; even the worst algorithm at 10 million words performs better than the best algorithm at 1 million. As we gather even more data, the curves continue to rise, dwarfing the differences between algorithms.
(Figure: learning curves for five learning algorithms on a common task. Note that there appears to be more room for improvement in the horizontal direction (more training data) than in the vertical direction (a different machine learning algorithm). Adapted from Banko and Brill (2001).)
• The same lesson applies to the task of estimating the true value of houses that are for sale: it calls for gathering more data relevant to the target, such as house size, number of bedrooms, and locality.
Unit 3
LEARNING PROBABILISTIC MODELS
Agents can handle uncertainty by using the methods of probability and decision theory, but first they must learn their probabilistic theories of the world from experience. We will see that a Bayesian view of learning is extremely powerful, providing general solutions to the problems of noise, overfitting, and optimal prediction. It also takes into account the fact that a less-than-omniscient agent can never be certain about which theory of the world is correct, yet must still make decisions by using some theory of the world.
STATISTICAL LEARNING
• The key concepts are data and hypotheses. Here, the data are evidence—that is, instantiations of some or all of the random variables describing the domain.
• Consider a simple example. Our favorite Surprise candy comes in two flavors: cherry (yum) and lime (ugh). The manufacturer has a peculiar sense of humor and wraps each piece of candy in the same opaque wrapper, regardless of flavor. The candy is sold in very large bags, of which there are known to be five kinds—again, indistinguishable from the outside:
h1: 100% cherry,
h2: 75% cherry + 25% lime,
h3: 50% cherry + 50% lime,
h4: 25% cherry + 75% lime,
h5: 100% lime.
• Given a new bag of candy, the random variable H (for hypothesis) denotes the type of the bag, with possible values h1 through h5. H is not directly observable, of course. As the pieces of candy are opened and inspected, data are revealed—D1, D2, . . ., DN, where each Di is a random variable with possible values cherry and lime.
• The basic task faced by the agent is to predict the flavor of the next piece of candy.
• Despite its apparent triviality, this scenario serves to introduce many of the major issues. The agent really does need to infer a theory of its world, albeit a very simple one.
Bayesian learning (partial beliefs)
• Bayesian learning simply calculates the probability of each hypothesis given the data, and makes predictions on that basis. That is, predictions are made by using all the hypotheses, weighted by their probabilities, rather than by using just a single "best" hypothesis.
• A Bayesian estimate calculates the validity of a proposition from a prior estimate and newly acquired relevant evidence.
Bayes' Theorem
• P(hi | d) = α P(d | hi) P(hi)
• P(hi | d) is the posterior probability of hypothesis hi given the data.
• P(hi) is the prior probability of the hypothesis.
• P(d | hi) is the likelihood of the data under the hypothesis.
• The key quantities in the Bayesian approach are the hypothesis prior, P(hi), and the likelihood of the data under each hypothesis, P(d | hi).
(Figure: a) posterior probabilities P(hi | d1, . . . , dN), where the number of observations N ranges from 1 to 10 and each observation is of a lime candy; b) the Bayesian prediction P(dN+1 = lime | d1, . . . , dN).)
• The Bayesian prediction eventually agrees with the true hypothesis. This is characteristic of Bayesian learning.
• For any fixed prior that does not rule out the true hypothesis, the posterior probability of any false hypothesis will, under certain technical conditions, eventually vanish.
• This happens simply because the probability of generating "uncharacteristic" data indefinitely is vanishingly small.
• A very common approximation—one that is usually adopted in science—is to make predictions based on a single most probable hypothesis—that is, an hi that maximizes P(hi | d). This is often called a maximum a posteriori or MAP (pronounced "em-ay-pee") hypothesis.
• Predictions made according to a MAP hypothesis hMAP are approximately Bayesian to the extent that P(X | d) ≈ P(X | hMAP). In our candy example, hMAP = h5.
• A final simplification is provided by assuming a uniform prior over the space of hypotheses. In that case, MAP learning reduces to choosing an hi that maximizes P(d | hi).
• This is called a maximum-likelihood (ML) hypothesis, hML. Maximum-likelihood learning is very common in statistics, a discipline in which many researchers distrust the subjective nature of hypothesis priors.
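The sketch below runs this Bayesian update on the candy example, using the textbook's prior (0.1, 0.2, 0.4, 0.2, 0.1) over h1..h5 and a stream of lime observations, reproducing the behavior plotted in the figure above.

```python
# Minimal sketch of Bayesian learning on the candy-bag example.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]          # P(hi)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]        # P(lime | hi) for h1..h5

def update(posterior, observation):
    """One Bayes-rule step: P(hi | d) = alpha * P(d | hi) * P(hi)."""
    likelihood = [p if observation == "lime" else 1 - p for p in p_lime]
    unnorm = [l * h for l, h in zip(likelihood, posterior)]
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

posterior = priors
for n in range(1, 11):                       # ten lime candies in a row
    posterior = update(posterior, "lime")
    pred_lime = sum(p * h for p, h in zip(p_lime, posterior))
    print(n, [round(h, 3) for h in posterior], round(pred_lime, 3))
```

Running this shows the posterior mass shifting toward h5 (it becomes the MAP hypothesis after three lime candies) while the prediction P(lime) climbs toward 1.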
LEARNING WITH COMPLETE DATA
• The general task of learning a probability model, given data that are assumed to be generated from that model, is called density estimation.
• We start with the simplest case, where we have complete data. Data are complete when each data point contains values for every variable in the probability model being learned.
• We focus on parameter learning—finding the numerical parameters for a probability model whose structure is fixed.
Maximum-likelihood parameter learning: Discrete models
• Suppose we buy a bag of lime and cherry candy from a new manufacturer whose lime–cherry proportions are completely unknown; the fraction could be anywhere between 0 and 1. In that case, we have a continuum of hypotheses. The parameter in this case, which we call θ, is the proportion of cherry candies, and the hypothesis is hθ. (The proportion of limes is just 1 − θ.) If we assume that all proportions are equally likely a priori, then a maximum-likelihood approach is reasonable.
• If we model the situation with a Bayesian network, we need just one random variable, Flavor (the flavor of a randomly chosen candy from the bag). It has values cherry and lime, where the probability of cherry is θ.
(Figure: (a) Bayesian network model for the case of candies with an unknown proportion of cherries and limes. (b) Model for the case where the wrapper color depends (probabilistically) on the candy flavor.)
• Now suppose we unwrap N candies, of which c are cherries and ℓ = N − c are limes. The likelihood of this particular data set is
P(d | hθ) = θ^c · (1 − θ)^ℓ
• The maximum-likelihood hypothesis is given by the value of θ that maximizes this expression. The same value is obtained by maximizing the log likelihood
L(d | hθ) = c log θ + ℓ log(1 − θ),
whose maximum is at θ = c/N.
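As a quick numeric check of that result, the sketch below searches a grid of θ values for the log-likelihood maximum; the counts (7 cherries out of 10) are invented for illustration.

```python
# Minimal numeric check: c*log(t) + l*log(1-t) peaks at t = c/N.
import math

c, N = 7, 10                      # 7 cherries out of 10 candies observed
limes = N - c

def log_likelihood(theta):
    return c * math.log(theta) + limes * math.log(1 - theta)

best = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)
print(best)                       # ~0.7 = c/N
```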
Naive Bayes models
• When you have hundreds of thousands of data points and quite a few variables in your training data set, Naive Bayes can be extremely fast relative to other classification algorithms. It works on Bayes' theorem of probability to predict the class of an unknown data point.
• It is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as "naive."
• A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
• Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):
P(c|x) = P(x|c) · P(c) / P(x)
o P(c|x) is the posterior probability of class c (target) given predictor x (attributes).
o P(c) is the prior probability of the class.
o P(x|c) is the likelihood, the probability of the predictor given the class.
o P(x) is the prior probability of the predictor.
• Let's understand it using an example. Below we have a training data set of weather conditions and a corresponding target variable "Play" (suggesting the possibility of playing). We need to classify whether players will play or not based on the weather condition. Let's follow the steps below:
• Step 1: Convert the data set into a frequency table.
• Step 2: Create a likelihood table by finding the probabilities, e.g., P(Overcast) = 0.29 and P(Play = Yes) = 0.64.
• Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
• Problem: players will play if the weather is sunny. Is this statement correct?
• We can solve it using the method of posterior probability discussed above:
P(Yes | Sunny) = P(Sunny | Yes) · P(Yes) / P(Sunny)
• Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
• Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which is the higher probability, so the prediction is Yes.
• Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and in problems having multiple classes. (A worked sketch of the computation follows below.)
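Here is the worked sketch of the weather/Play example; the 14-row data set is reconstructed to match the counts quoted above (3 of the 9 "Yes" days are Sunny, 5 of 14 days are Sunny) and is otherwise invented.

```python
# Worked sketch of P(Yes | Sunny) for the weather/Play example above.
weather = ["Sunny"]*5 + ["Overcast"]*4 + ["Rainy"]*5
play    = ["No","No","Yes","Yes","Yes",          # Sunny: 3 Yes / 2 No
           "Yes","Yes","Yes","Yes",              # Overcast: 4 Yes
           "Yes","No","Yes","No","No"]           # Rainy: 2 Yes / 3 No

n = len(weather)
p_yes = play.count("Yes") / n                            # P(Yes) = 9/14
p_sunny = weather.count("Sunny") / n                     # P(Sunny) = 5/14
sunny_and_yes = sum(1 for w, p in zip(weather, play)
                    if w == "Sunny" and p == "Yes")
p_sunny_given_yes = sunny_and_yes / play.count("Yes")    # P(Sunny|Yes) = 3/9

posterior = p_sunny_given_yes * p_yes / p_sunny
print(round(posterior, 2))                               # ~0.6 -> predict Yes
```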
Maximum-likelihood parameter learning: Continuous models
• Since continuous variables are ubiquitous in real-world applications, it is important to know how to learn the parameters of continuous models from data. The principles of maximum-likelihood learning are identical in the continuous and discrete cases.
(Figure: (a) a linear Gaussian model described as y = θ1·x + θ2 plus Gaussian noise with fixed variance; (b) a set of 50 data points generated from this model.)
Density estimation with nonparametric models
• It is possible to learn a probability model without making any assumptions about its structure and parameterization by adopting nonparametric methods.
• The task of nonparametric density estimation is typically done in continuous domains, such as that shown in the figure below. The figure shows a probability density function on a space defined by two continuous variables, along with a sample of data points from this density function.
• First we consider k-nearest-neighbors models.
• Given a sample of data points, to estimate the unknown probability density at a query point x, we can simply measure the density of the data points in the neighborhood of x.
• The figure shows two query points (small squares). For each query point we have drawn the smallest circle that encloses 10 neighbors—the 10-nearest-neighborhood. We can see that the central circle is large, meaning there is a low density there, and the circle on the right is small, meaning there is a high density there.
(Figure: (a) a 3D plot of the mixture of Gaussians; (b) a 128-point sample of points from the mixture, together with two query points (small squares) and their 10-nearest-neighborhoods (medium and large circles).)
LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM
• The preceding section dealt with the fully observable case. Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning. For example, medical records often include the observed symptoms, the physician's diagnosis, the treatment applied, and perhaps the outcome of the treatment, but they seldom contain a direct observation of the disease itself! (Note that the diagnosis is not the disease; it is a causal consequence of the observed symptoms, which are in turn caused by the disease.) One might ask, "If the disease is not observed, why not construct a model without it?"
• Assume that each variable has three possible values (e.g., none, moderate, and severe). Removing the hidden variable from the network in (a) yields the network in (b); the total number of parameters increases from 78 to 708. Thus, latent variables can dramatically reduce the number of parameters required to specify a Bayesian network. This, in turn, can dramatically reduce the amount of data needed to learn the parameters.
• Hidden variables are important, but they do complicate the learning problem. For example, it is not obvious how to learn the conditional distribution for HeartDisease, given its parents, because we do not know the value of HeartDisease in each case; the same problem arises in learning the distributions for the symptoms.
• This section describes an algorithm called expectation–maximization, or EM, that solves this problem in a very general way.
(Figure: (a) a simple diagnostic network for heart disease, which is assumed to be a hidden variable. Each variable has three possible values and is labeled with the number of independent parameters in its conditional distribution; the total number is 78. (b) The equivalent network with HeartDisease removed. Note that the symptom variables are no longer conditionally independent given their parents. This network requires 708 parameters.)
Unsupervised clustering: Learning mixtures of Gaussians
• Unsupervised clustering is the problem of discerning multiple categories in a collection of objects. The problem is unsupervised because the category labels are not given. For example, suppose we record the spectra of a hundred thousand stars; are there different types of stars revealed by the spectra, and, if so, how many types and what are their characteristics? We are all familiar with terms such as "red giant" and "white dwarf," but the stars do not carry these labels on their hats—astronomers had to perform unsupervised clustering to identify these categories. Other examples include the identification of species, genera, orders, and so on in the Linnaean taxonomy, and the creation of natural kinds for ordinary objects.
• Unsupervised clustering begins with data. The figure below shows 500 data points, each of which specifies the values of two continuous attributes. The data points might correspond to stars, and the attributes might correspond to spectral intensities at two particular frequencies. Next, we need to understand what kind of probability distribution might have generated the data. Clustering presumes that the data are generated from a mixture distribution P. Such a distribution has k components, each of which is a distribution in its own right. A data point is generated by first choosing a component and then generating a sample from that component. Let the random variable C denote the component, with values 1, . . . , k; then the mixture distribution is given by
P(x) = Σ(i=1..k) P(C = i) · P(x | C = i)
where x refers to the values of the attributes for a data point. For continuous data, a natural choice for the component distributions is the multivariate Gaussian, which gives the so-called mixture of Gaussians family of distributions. The parameters of a mixture of Gaussians are:
o wi = P(C = i) (the weight of each component),
o μi (the mean of each component), and
o Σi (the covariance of each component).
(Figure: (a) a Gaussian mixture model with three components; the weights (left to right) are 0.2, 0.3, and 0.5. (b) 500 data points sampled from the model in (a). (c) The model reconstructed by EM from the data in (b).)
• If we knew which component generated each data point, then it would be easy to recover the component Gaussians: we could just select all the data points from a given component and then apply (a multivariate version of) the standard formulas for fitting the parameters of a Gaussian to a set of data.
• For the mixture of Gaussians, we initialize the mixture-model parameters arbitrarily and then iterate the following two steps:
• E-step: Compute the probabilities pij = P(C = i | xj), the probability that datum xj was generated by component i. By Bayes' rule, we have pij = α·P(xj | C = i)·P(C = i). The term P(xj | C = i) is just the probability at xj of the ith Gaussian, and the term P(C = i) is just the weight parameter for the ith Gaussian. Define ni = Σj pij, the effective number of data points currently assigned to component i.
• M-step: Compute the new mean, covariance, and component weights using the following steps in sequence:
μi ← Σj pij·xj / ni
Σi ← Σj pij·(xj − μi)(xj − μi)ᵀ / ni
wi ← ni / N
where N is the total number of data points.
• The E-step, or expectation step, can be viewed as computing the expected values pij of the hidden indicator variables Zij, where Zij is 1 if datum xj was generated by the ith component and 0 otherwise.
• The M-step, or maximization step, finds the new values of the parameters that maximize the log likelihood of the data, given the expected values of the hidden indicator variables.
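The sketch below runs this E-step/M-step loop for a one-dimensional, two-component Gaussian mixture. The sample data and the initialization are invented for illustration; real applications would use multivariate Gaussians as described above.

```python
# Minimal sketch of EM for a 1-D mixture of two Gaussians.
import math
import random

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)] + \
       [random.gauss(5, 1) for _ in range(300)]

w, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]   # arbitrary initialization
for _ in range(50):
    # E-step: p[j][i] = P(C=i | xj) by Bayes' rule
    p = [[w[i] * gaussian(x, mu[i], var[i]) for i in range(2)] for x in data]
    p = [[pij / sum(row) for pij in row] for row in p]
    # M-step: re-estimate means, variances, and weights from the pij
    for i in range(2):
        ni = sum(row[i] for row in p)
        mu[i] = sum(row[i] * x for row, x in zip(p, data)) / ni
        var[i] = sum(row[i] * (x - mu[i]) ** 2 for row, x in zip(p, data)) / ni
        w[i] = ni / len(data)

print([round(m, 2) for m in mu], [round(x, 2) for x in w])  # ~[0, 5], ~[0.4, 0.6]
```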
Learning Bayesian networks with hidden variables
• To learn a Bayesian network with hidden variables, we apply the same insights that worked for mixtures of Gaussians.
(Figure: (a) a mixture model for candy. The proportions of different flavors, wrappers, and presence of holes depend on the bag, which is not observed. (b) Bayesian network for a Gaussian mixture. The mean and covariance of the observable variables X depend on the component C.)
• The figure represents a situation in which there are two bags of candies that have been mixed together. Candies are described by three features: in addition to the Flavor and the Wrapper, some candies have a Hole in the middle and some do not. The distribution of candies in each bag is described by a naive Bayes model: the features are independent given the bag, but the conditional probability distribution for each feature depends on the bag.
• In the figure, the bag is a hidden variable because, once the candies have been mixed together, we no longer know which bag each candy came from. In such a case, can we recover the descriptions of the two bags by observing candies from the mixture?
Learning hidden Markov models
• Our final application of EM involves learning the transition probabilities in hidden Markov models (HMMs).
• A hidden Markov model can be represented by a dynamic Bayes net with a single discrete state variable. Each data point consists of an observation sequence of finite length, so the problem is to learn the transition probabilities from a set of observation sequences (or from just one long sequence).
• In a hidden Markov model, the individual transition probabilities from state i to state j at time t, θijt = P(Xt+1 = j | Xt = i), are repeated across time—that is, θijt = θij for all t. To estimate the transition probability from state i to state j, we simply calculate the expected proportion of times that the system undergoes a transition to state j when in state i:
θij ← (expected number of transitions from i to j) / (expected number of times in state i)
• The expected counts are computed by an HMM inference algorithm. The forward–backward algorithm can be modified very easily to compute the necessary probabilities. One important point is that the probabilities required are obtained by smoothing rather than filtering; that is, we need to pay attention to subsequent evidence in estimating the probability that a particular transition occurred. The evidence in a murder case is usually obtained after the crime (i.e., the transition from state i to state j) has taken place.
REINFORCEMENT LEARNING
• The agent needs to know that something good has happened when it (accidentally) checkmates the opponent, and that something bad has happened when it is checkmated—or vice versa, if the game is suicide chess. This kind of feedback is called a reward, or reinforcement. In games like chess, the reinforcement is received only at the end of the game.
• The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment.
• In many complex domains, reinforcement learning is the only feasible way to train a program to perform at high levels.
• Reinforcement learning might be considered to encompass all of AI: an agent is placed in an environment and must learn to behave successfully therein.
• A utility-based agent learns a utility function on states and uses it to select actions that maximize the expected outcome utility.
• A Q-learning agent learns an action-utility function, or Q-function, giving the expected utility of taking a given action in a given state.
• A reflex agent learns a policy that maps directly from states to actions.
• Passive learning: the agent's policy is fixed and the task is to learn the utilities of states (or state–action pairs); this could also involve learning a model of the environment.
• Active learning: the agent must also learn what to do. The principal issue is exploration: an agent must experience as much as possible of its environment in order to learn how to behave in it.
PASSIVE REINFORCEMENT LEARNING
• We start with the case of a passive learning agent using a state-based representation in a fully observable environment. In passive learning, the agent's policy π is fixed: in state s, it always executes the action π(s).
• Its goal is simply to learn how good the policy is—that is, to learn the utility function Uπ(s).
• The passive learning task is similar to the policy evaluation task, part of the policy iteration algorithm. The main difference is that the passive learning agent does not know the transition model P(s' | s, a), which specifies the probability of reaching state s' from state s after doing action a; nor does it know the reward function R(s), which specifies the reward for each state.
(Figure: (a) a policy π for the 4×3 world; this policy happens to be optimal with rewards of R(s) = −0.04 in the nonterminal states and no discounting. (b) The utilities of the states in the 4×3 world, given policy π.)
• The agent executes a set of trials in the environment using its policy π. In each trial, the agent starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states, (4,2) or (4,3). Its percepts supply both the current state and the reward received in that state. A typical trial might look like this (each step shows the state and its reward):
(1,1)−.04 → (1,2)−.04 → (1,3)−.04 → (2,3)−.04 → (3,3)−.04 → (4,3)+1
Direct utility estimation
• It is clear that direct utility estimation is just an instance of supervised learning where each example has the state as input and the observed reward-to-go as output. This means that we have reduced reinforcement learning to a standard inductive learning problem.
• Direct utility estimation succeeds in reducing the reinforcement learning problem to an inductive learning problem, about which much is known. Unfortunately, it misses a very important source of information: the fact that the utilities of states are not independent! The utility of each state equals its own reward plus the expected utility of its successor states.
• That is, the utility values obey the Bellman equations for a fixed policy.
Adaptive dynamic programming
• An adaptive dynamic programming (or ADP) agent takes advantage of the constraints among the utilities of states by learning the transition model that connects them and solving the corresponding Markov decision process using a dynamic programming method.
• The process of learning the model itself is easy, because the environment is fully observable. This means that we have a supervised learning task where the input is a state–action pair and the output is the resulting state.
Temporal-difference learning
• Temporal-difference (TD) learning is an approach to learning how to predict a quantity that depends on future values of a given signal. The name TD derives from its use of changes, or differences, in predictions over successive time steps to drive the learning process. The prediction at any given time step is updated to bring it closer to the prediction of the same quantity at the next time step. It is a supervised learning process in which the training signal for a prediction is a future prediction. When a transition occurs from state s to state s', the utility estimate is updated as
Uπ(s) ← Uπ(s) + α ( R(s) + γ·Uπ(s') − Uπ(s) )
ACTIVE REINFORCEMENT LEARNING
• An active agent must decide what actions to take. Let us begin with the adaptive dynamic programming agent and consider how it must be modified to handle this new freedom.
• First, the agent will need to learn a complete model with outcome probabilities for all actions, rather than just the model for the fixed policy. The simple learning mechanism used by PASSIVE-ADP-AGENT will do just fine for this. Next, we need to take into account the fact that the agent has a choice of actions. The utilities it needs to learn are those defined by the optimal policy.
Exploration
• A purely greedy agent does not learn the true utilities or the true optimal policy! What happens instead is that, in the 39th trial, it finds a policy that reaches the +1 reward along the lower route via (2,1), (3,1), (3,2), and (3,3).
• After experimenting with minor variations, from the 276th trial onward it sticks to that policy, never learning the utilities of the other states and never finding the optimal route via (1,2), (1,3), and (2,3).
• We call this agent the greedy agent. Repeated experiments show that the greedy agent very seldom converges to the optimal policy for this environment and sometimes converges to really horrendous policies.
• By improving the model, the agent will receive greater rewards in the future.
• An agent therefore must make a tradeoff between exploitation, to maximize its reward as reflected in its current utility estimates, and exploration, to maximize its long-term well-being.
• Pure exploration to improve one's knowledge is of no use if one never puts that knowledge into practice. In the real world, one constantly has to decide between continuing in a comfortable existence and striking out into the unknown in the hopes of discovering a new and better life.
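As a closing sketch, the snippet below applies the TD update rule above to passive learning from trials. The trial data (state/reward sequences in a 4×3-style grid) are invented for illustration rather than generated by a real environment.

```python
# Minimal sketch of passive TD(0) learning from recorded trials.
from collections import defaultdict

def td_learn(trials, alpha=0.1, gamma=1.0):
    """Each trial is a list of (state, reward) pairs ending at a terminal."""
    U = defaultdict(float)
    for trial in trials:
        for (s, r), (s_next, _) in zip(trial, trial[1:]):
            # U(s) <- U(s) + alpha * (R(s) + gamma*U(s') - U(s))
            U[s] += alpha * (r + gamma * U[s_next] - U[s])
        s_last, r_last = trial[-1]
        U[s_last] += alpha * (r_last - U[s_last])   # terminal state
    return dict(U)

trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
U = td_learn([trial] * 100)
print({s: round(u, 2) for s, u in U.items()})
```

Note that TD needs no transition model at all: each observed transition supplies its own training signal, which is exactly the contrast with the ADP agent described above.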