18. Attribute_selection_method: a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split point or a splitting subset. Output: A decision tree. Method: 1. Create a node N; 2. If the tuples in D are all of the same class C, then return N as a leaf node labeled with the class C; 3. If attribute_list is empty, then return N as a leaf node labeled with the majority class in D; 4. Apply Attribute_selection_method to find the "best" splitting_criterion;
19. 5. Label node N with splitting_criterion; 6. If splitting_attribute is discrete-valued and multiway splits are allowed, then attribute_list = attribute_list – splitting_attribute; 7. For each outcome j of splitting_criterion: 8. Let Dj be the set of data tuples in D satisfying outcome j; 9. If Dj is empty then 10. attach a leaf labeled with the majority class in D to node N; 11. else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N; endfor; 12. Return N;
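The steps above can be sketched in Python. This is a minimal sketch, assuming information gain as the Attribute_selection_method, discrete-valued attributes with multiway splits, and tuples represented as plain sequences whose last element is the class label:

```python
from collections import Counter
from math import log2

def entropy(rows):
    # Info(D) = -sum p_i * log2(p_i) over the class distribution (last column)
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def generate_decision_tree(rows, attrs):
    classes = [r[-1] for r in rows]
    # Steps 2-3: all tuples in one class -> class leaf; no attributes left
    # -> majority-class leaf.
    if len(set(classes)) == 1:
        return classes[0]
    if not attrs:
        return Counter(classes).most_common(1)[0][0]

    # Step 4: pick the attribute with the highest information gain,
    # i.e. the lowest Info_A(D).
    def info_a(a):
        parts = {}
        for r in rows:
            parts.setdefault(r[a], []).append(r)
        return sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    best = min(attrs, key=info_a)

    # Steps 5-6: label the node and drop the discrete splitting attribute.
    node = {"split": best, "branches": {}}
    remaining = [a for a in attrs if a != best]

    # Steps 7-11: recurse on each outcome. (An empty Dj cannot arise here
    # because outcomes are enumerated from the tuples themselves; slide
    # steps 9-10 cover the variant that enumerates the full attribute domain.)
    partitions = {}
    for r in rows:
        partitions.setdefault(r[best], []).append(r)
    for value, part in partitions.items():
        node["branches"][value] = generate_decision_tree(part, remaining)
    return node  # step 12

# The 7-tuple example from the later slides (age, income, student,
# credit_rating, buy_computer):
data = [
    ("Young", "high", "no", "fair", "no"),
    ("Young", "high", "no", "excellent", "no"),
    ("Middle", "high", "no", "fair", "yes"),
    ("Senior", "medium", "yes", "fair", "yes"),
    ("Senior", "low", "yes", "excellent", "no"),
    ("Middle", "medium", "no", "fair", "yes"),
    ("Senior", "medium", "no", "excellent", "no"),
]
tree = generate_decision_tree(data, [0, 1, 2, 3])  # root splits on age (index 0)
```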
21. An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition D of class-labeled training tuples into individual classes. If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure (i.e., all of the tuples that fall into a given partition would belong to the same class).
22. There are three main measures: Information Gain, Gain Ratio, and Gini Index.
23. Example:
Age | Income | Student | Credit_Rating | Class: Buy_Computer
Young | high | no | fair | no
Young | high | no | excellent | no
Middle | high | no | fair | yes
Senior | medium | yes | fair | yes
Senior | low | yes | excellent | no
Middle | medium | no | fair | yes
Senior | medium | no | excellent | no
24. Information Gain: ID3 uses information gain as its attribute selection measure. The measure is based on pioneering work by Claude Shannon on information theory, which studied the value, or "information content", of messages.
Info(D) = -∑ pi log2(pi) (where i = 1 to m)
Info_A(D) = ∑ (|Dj| / |D|) * Info(Dj) (where j = 1 to v)
Gain(A) = Info(D) – Info_A(D)
25. In the example, the class buy_computer has two distinct values {yes, no}, so m = 2. Let class C1 correspond to yes and C2 to no. There are 3 tuples with "yes" and 4 with "no", total = 3 + 4 = 7, so
Info(D) = -(3/7) log2(3/7) – (4/7) log2(4/7) = 0.9852
For age there are 2 young, 2 middle, and 3 senior tuples. Among young, both are from the "no" class; among middle, both are from the "yes" class; among senior, 1 is from "yes" and 2 are from "no". So
Info_age(D) = (2/7) * (-(2/2) log2(2/2) – (0/2) log2(0/2)) + (2/7) * (-(2/2) log2(2/2) – (0/2) log2(0/2)) + (3/7) * (-(1/3) log2(1/3) – (2/3) log2(2/3)) = 0.3936
(the 0 log2(0) terms are taken as 0). So
Gain(age) = 0.9852 – 0.3936 ≈ 0.5917
As we calculated the gain for age, we calculate the gain for every attribute. The attribute with the highest gain value becomes our split node.
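The arithmetic on this slide can be checked with a few lines of Python (a sketch; the partition counts are read off the example table):

```python
from math import log2

def entropy(labels):
    # Info of a partition: -sum p_i * log2(p_i) over the classes present
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

# Class column of the 7-tuple example: 3 "yes", 4 "no".
classes = ["no", "no", "yes", "yes", "no", "yes", "no"]
info_d = entropy(classes)                               # ≈ 0.9852

# Partitions induced by age: Young -> 2 "no", Middle -> 2 "yes",
# Senior -> 1 "yes" / 2 "no".
parts = [["no", "no"], ["yes", "yes"], ["yes", "no", "no"]]
info_age = sum(len(p) / 7 * entropy(p) for p in parts)  # ≈ 0.3936
gain_age = info_d - info_age                            # ≈ 0.5917
```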
26. [Figure: the tree after splitting on AGE. The root node has three branches, Young, Middle, and Senior, each leading to the partition of tuples with that age value, described by the remaining attributes Income, Student, Credit_Rating and the class Buy_Computer.]
27. 2. Gain Ratio: The information gain measure is biased toward tests with many outcomes; that is, it prefers to select attributes having a large number of values. For example, consider an attribute that acts as a unique identifier, such as product_ID. It would give a large number of partitions, each containing a single tuple and therefore pure, so Info_product_ID(D) = 0 and the information gain is maximal, yet the split is useless for classification. The gain ratio, used by C4.5, normalizes the gain:
SplitInfo_A(D) = -∑ (|Dj| / |D|) * log2(|Dj| / |D|) (where j = 1 to v)
GainRatio(A) = Gain(A) / SplitInfo_A(D)
For our example there are 2 tuples for young, 2 for middle, and 3 for senior, so
SplitInfo_age(D) = -(2/7) log2(2/7) – (2/7) log2(2/7) – (3/7) log2(3/7) = 1.5567
With Gain(age) ≈ 0.5917, the gain ratio = 0.5917 / 1.5567 ≈ 0.3801.
The attribute with the maximum gain ratio is selected as the split node.
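The split-information computation for age is a one-liner over the partition sizes (2, 2, 3 out of 7, taken from the table):

```python
from math import log2

# SplitInfo_age(D): age partitions the 7 tuples into sizes 2 (Young),
# 2 (Middle), 3 (Senior).
sizes, total = [2, 2, 3], 7
split_info = -sum(s / total * log2(s / total) for s in sizes)  # ≈ 1.5567

gain_age = 0.5917                   # Gain(age) from the previous slide (rounded)
gain_ratio = gain_age / split_info  # ≈ 0.3801
```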
28. 3. Gini Index: The Gini index is used in CART. It measures the impurity of D:
Gini(D) = 1 – ∑ pi² (where i = 1 to m)
The Gini index considers a binary split for each attribute. If an attribute has v possible values, there are 2^v possible subsets. For example, for income with values {low, medium, high} the subsets are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {high}, {medium}, and {}. Excluding the full set and the empty set, which do not represent a split, we consider only 2^v – 2 subsets. For a binary split of D into D1 and D2:
Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
ΔGini(A) = Gini(D) – Gini_A(D)
The attribute (and subset) giving the minimum Gini_A(D), i.e., the maximum reduction in impurity, is chosen as the split node, because it has the lowest impurity.
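The Gini computations can be sketched the same way. This uses the example table, and the split {Young, Middle} vs {Senior} is just one of the candidate binary splits, chosen here for illustration:

```python
from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum p_i^2 over the class distribution
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

# Class column of the example: 3 "yes", 4 "no".
classes = ["yes"] * 3 + ["no"] * 4
g = gini(classes)                    # 1 - (3/7)^2 - (4/7)^2 ≈ 0.4898

# One candidate binary split on age: {Young, Middle} vs {Senior}.
d1 = ["no", "no", "yes", "yes"]      # Young + Middle tuples
d2 = ["yes", "no", "no"]             # Senior tuples
gini_split = len(d1) / 7 * gini(d1) + len(d2) / 7 * gini(d2)  # ≈ 0.4762
reduction = g - gini_split           # ΔGini for this split
```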
29. After calculating the selection measure, we grow the decision tree at the split node chosen by whichever measure we use. The process continues until every partition contains tuples of a single class. This is how the decision tree algorithm is implemented.