Extended Isolation Forest
by Sahand Hariri, Matias Carrasco Kind, Robert J. Brunner
Presenter: Jaehyeong Ahn
Department of Applied Statistics, Konkuk University
jayahn0104@gmail.com
Contents
Motivation
What causes this problem?
Solution
Notation
Methodology
1 Training Stage
2 Evaluation Stage
Empirical Results
Summary
Motivation
• Anomaly score maps produced by Isolation Forest reveal that the scores
are inconsistent
• The best way to see the problem is to examine very simple datasets where
we have an intuition of how the scores should be distributed and what
constitutes an anomaly
• We will see three motivational situations with synthetic datasets which
show the problem of standard Isolation Forest
Motivation
Normally distributed data
• Consider a two-dimensional dataset sampled from a 2-D normal
distribution with zero mean vector and identity covariance matrix
• We expect to see an anomaly score map with an almost circular and
symmetric pattern, with values increasing as we move radially
outward (i.e., similar score values at fixed distances from the origin)
Figure: (a) Single Blob (b) Anomaly Score Map
Motivation
Two normally distributed clusters
• Consider two separate clusters of normally distributed data
concentrated around (0, 10) and (10, 0)
Figure: (a) Multiple Blobs (b) Anomaly Score Map
• We can observe "ghost" clusters close to (0, 0) and (10, 10), which raise a
significant problem
• Not only does this increase the chance of false positives, it also wrongly
indicates a non-existent structure in the data
Motivation
Sinusoidal data points with Gaussian noise
• The data has an inherent structure, the sinusoidal shape with a Gaussian
noise added on top
Figure: (a) Sinusoidal data points with Gaussian noise (b) Anomaly Score Map
• We can see that the algorithm performs very poorly
• Isolation Forest fails to capture the structure of the data
What causes this problem?
• The problem arises from the way branching is performed in an
Isolation Tree
• The figure below shows the branching process during the training phase
for an anomaly point and a nominal point, using the standard Isolation
Tree on the normally distributed data
Figure: (a) Anomaly point (b) Nominal point
• In this case the random cuts are all either vertical or horizontal because of
the nature of the Isolation Tree
What causes this problem?
• Let’s consider one randomly selected fully grown tree with all its branch
rules and criteria
Figure: (a) Single blob (b) Multiple Blobs (c) Sinusoidal
• Note that in each step, we pick a random feature (dimension), Xj , and a
random value, v, for this feature
• As we move down the branches of the tree, the range of possible values
for v decreases, so the cut lines tend to cluster where most of the data
points are concentrated
• Because of the constraint that the branch cuts are only vertical and
horizontal, regions that don’t necessarily contain many data points end
up with many branch cuts
Solution
• Hariri et al. (2018) propose the "Extended Isolation Forest",
which uses hyperplanes with random slopes (non-axis-parallel) at each
split to overcome these drawbacks
Figure: (a) Anomaly (b) Nominal
• For Extended Isolation Forest, the selection of the branch cuts requires
only two pieces of information:
1 A random slope for the branch cut
2 A random intercept for the split which is chosen from the range of
available values of the training data
• This is equivalent to the standard Isolation Forest, which also requires two
pieces of information (a random feature and a random value for that feature)
Solution
Random Slope for the branch cut
• For a D-dimensional dataset, selecting a random slope for the branch cut
is the same as choosing a normal vector d uniformly over the unit D-sphere
• This is equivalent to drawing each coordinate of d from N(0, 1),
the standard normal distribution
• For the intercept, p, we simply draw from a uniform distribution over the
range of values present at each branching point
• The branching criterion for splitting the data at a given point x is as
follows:
(x − p) · d ≤ 0 (1)
• If the condition is satisfied, the data point x is passed to the left branch,
otherwise it moves down to the right branch
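To make the recipe concrete, here is a minimal NumPy sketch of a single branch cut (the helper name random_cut and the seeding are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cut(X):
    """One Extended IF branch cut on the points X reaching a node.

    Returns a boolean mask: True -> left branch, False -> right branch.
    """
    n, D = X.shape
    # Random slope: drawing each coordinate from N(0, 1) yields a direction
    # uniform over the unit D-sphere (the norm does not affect the sign of
    # (x - p) . d, so normalization can be skipped)
    d = rng.standard_normal(D)
    # Random intercept: drawn uniformly over the range of values present
    # at this branching point
    p = rng.uniform(X.min(axis=0), X.max(axis=0))
    # Branching criterion, inequality (1)
    return (X - p) @ d <= 0

X = rng.normal(size=(1000, 2))
mask = random_cut(X)
X_left, X_right = X[mask], X[~mask]
```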
Solution
• A fully grown tree with all its branch rules and criteria, using random slopes
Figure: (a) Single blob (b) Multiple Blobs (c) Sinusoidal
• Note that despite the randomness of the process, higher-density regions
are better represented schematically
• We can observe that there are no regions that artificially receive more
attention than the rest
• The result is score maps that are free of the artifacts observed previously
Solution
High Dimensional Data
• The algorithm generalizes readily to higher dimensions
• In this case, the branch cuts are no longer straight lines, but (D − 1)-
dimensional hyperplanes
• The same branching criterion, inequality (1), applies in higher dimensions
Extension Levels
• We can consider D levels of extension by setting coordinates of the
normal vector to zero
• For example, in the case of three-dimensional data:
Figure: (a) Ex 2 (b) Ex 1 (c) Ex 0
• So for any given D-dimensional dataset, the lowest level of extension of
the Extended Isolation Forest coincides with the standard Isolation
Forest (i.e., it is a generalization of Isolation Forest)
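A short sketch of how the extension levels can be realized, assuming the convention that level L keeps L + 1 nonzero coordinates of the normal vector (so level D − 1 is fully extended and level 0 reduces to axis-parallel cuts, i.e., the standard IF); the function name is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_normal_vector(D, extension_level):
    """Normal vector for one branch cut at the given extension level."""
    d = rng.standard_normal(D)
    # Zero out D - extension_level - 1 randomly chosen coordinates;
    # at extension_level = 0 only one coordinate survives, which makes
    # the cut vertical/horizontal exactly as in the standard IF
    zeroed = rng.choice(D, size=D - extension_level - 1, replace=False)
    d[zeroed] = 0.0
    return d

for level in (2, 1, 0):  # the three panels in the figure above, for 3-D data
    print(level, random_normal_vector(3, level))
```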
Notation
• X = {x, · · · , xn}, xi ∈ Rd
for all i = , · · · , N
• (only consider continuous valued attributes)
• T is a node of an isolation tree
• T is either an external node (terminal node) with no children, or an
internal node with one test and exactly two daughter nodes (Tl, Tr)
• A test consists of a normal vector d and an intercept p such that the
test (x − p) · d ≤ 0 divides the data points into Tl and Tr
• h(x): path length of a point x, measured by the number of edges x
traverses in an iTree from the root node until the traversal terminates
at an external node
• c(n): average path length of an unsuccessful search in a BST (Binary Search
Tree) [B. R. Preiss, 1999], used to normalize h(x); H(i) is the harmonic
number, H(i) ≈ ln(i) + 0.5772156649
c(n) = 2H(n − 1) − 2(n − 1)/n (2)
• s(x, n): anomaly score
s(x, n) = 2^(−E(h(x)) / c(n)) (3)
• s(x, n): Anomaly score
s(x, n) = 2^(−E(h(x)) / c(n))
• when E(h(x)) → c(n), s → 0.5
• when E(h(x)) → 0, s → 1
• when E(h(x)) → n − 1, s → 0
⇒ s is monotonic in h(x)
• 0 < s ≤ 1 for 0 < h(x) ≤ n − 1
• (a) if instances return s very close to 1, then they are definitely
anomalies
• (b) if instances have s much smaller than 0.5, then they are quite
safe to be regarded as normal instances
• (c) if all the instances return s ≈ 0.5, then the entire sample does
not really have any distinct anomaly
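The normalization constant and the score are straightforward to compute; a small sketch (function names ours) that also reproduces the limiting cases listed above:

```python
import numpy as np

EULER_GAMMA = 0.5772156649  # Euler's constant in H(i) ~ ln(i) + gamma

def c(n):
    """Average path length of an unsuccessful BST search, Equation (2)."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(expected_path_length, n):
    """s(x, n) = 2^(-E(h(x)) / c(n)), Equation (3)."""
    return 2.0 ** (-expected_path_length / c(n))

n = 256
print(anomaly_score(c(n), n))     # E(h(x)) = c(n)  -> s = 0.5
print(anomaly_score(0.0, n))      # E(h(x)) -> 0    -> s = 1
print(anomaly_score(n - 1.0, n))  # E(h(x)) -> n-1  -> s -> 0
```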
Methodology
Algorithm 0: iForest(X, t, ψ)
Inputs : X - input data, t - number of trees, ψ - sub-sampling size
Outputs: a set of t iTrees
begin
    initialize Forest ← ∅;
    set height limit l = ceiling(log2 ψ);
    for i = 1 to t do
        X′ ← sample(X, ψ);
        Forest ← Forest ∪ iTree(X′, 0, l);
    end
    return Forest
end
Methodology
Algorithm 1: iTree(X, e, l)
Inputs : X - input data, e - current tree height, l - height limit
Outputs: an iTree
begin
    if e ≥ l or |X| ≤ 1 then
        return exNode{Size ← |X|};
    else
        randomly select a normal vector d ∈ R^D (D = data dimensionality)
        by drawing each coordinate of d from a standard Gaussian distribution;
        randomly select an intercept point p ∈ R^D in the range of X;
        set coordinates of d to zero according to the extension level;
        Xl ← filter(X, (X − p) · d ≤ 0);
        Xr ← filter(X, (X − p) · d > 0);
        return inNode{Left ← iTree(Xl, e + 1, l),
                      Right ← iTree(Xr, e + 1, l),
                      Normal ← d,
                      Intercept ← p}
    end
end
Methodology
Algorithm 2: PathLength(x, T, e)
Inputs : x - an instance, T - an iTree, e - current path length;
to be initialized to zero when first called
Outputs: path length of x
begin
    if T is an external node then
        return e + c(T.size) {c(·) is defined in Equation (2)}
    end
    d ← T.Normal;
    p ← T.Intercept;
    if (x − p) · d ≤ 0 then
        return PathLength(x, T.Left, e + 1);
    else
        return PathLength(x, T.Right, e + 1)
    end
end
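Putting Algorithms 0-2 together, here is a compact end-to-end sketch of an Extended Isolation Forest in Python (a simplified illustration under the same extension-level convention as above, not the authors' reference implementation; all class and function names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)
EULER_GAMMA = 0.5772156649

def c(n):
    # Normalization term from Equation (2)
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

class Node:
    """Internal node stores (normal, intercept, children); external node stores size."""
    def __init__(self, size=None, normal=None, intercept=None, left=None, right=None):
        self.size, self.normal, self.intercept = size, normal, intercept
        self.left, self.right = left, right

def itree(X, e, l, extension_level):
    # Algorithm 1: grow one extended isolation tree
    n, D = X.shape
    if e >= l or n <= 1:
        return Node(size=n)
    d = rng.standard_normal(D)
    # zero coordinates according to the extension level
    d[rng.choice(D, size=D - extension_level - 1, replace=False)] = 0.0
    p = rng.uniform(X.min(axis=0), X.max(axis=0))
    mask = (X - p) @ d <= 0  # inequality (1)
    return Node(normal=d, intercept=p,
                left=itree(X[mask], e + 1, l, extension_level),
                right=itree(X[~mask], e + 1, l, extension_level))

def iforest(X, t=100, psi=256, extension_level=None):
    # Algorithm 0: train t trees on subsamples of size psi
    if extension_level is None:
        extension_level = X.shape[1] - 1  # fully extended by default
    psi = min(psi, X.shape[0])
    limit = int(np.ceil(np.log2(psi)))
    return [itree(X[rng.choice(X.shape[0], psi, replace=False)], 0, limit,
                  extension_level) for _ in range(t)], psi

def path_length(x, T, e=0):
    # Algorithm 2: e is initialized to zero on the first call
    if T.normal is None:  # external node
        return e + c(T.size)
    branch = T.left if (x - T.intercept) @ T.normal <= 0 else T.right
    return path_length(x, branch, e + 1)

def score(x, forest, psi):
    # Equation (3): s(x, n) = 2^(-E(h(x)) / c(n))
    Eh = np.mean([path_length(x, T) for T in forest])
    return 2.0 ** (-Eh / c(psi))

# Usage on the single-blob example: far points should score higher
X = rng.normal(size=(1000, 2))
forest, psi = iforest(X, t=100, psi=256)
print(score(np.array([0.0, 0.0]), forest, psi))  # near the mode -> low
print(score(np.array([5.0, 5.0]), forest, psi))  # far out -> high
```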
Empirical Results
Score Maps
Figure: Score maps for the three synthetic datasets. Top row (a)-(c): Standard IF; bottom row (a)-(c): Extended IF
Empirical Results
Variance of anomaly scores
• Comparing the mean and variance of the anomaly scores of points
distributed along roughly constant-score lines, for various cases, between
the Standard IF and the Extended IF
• For normally distributed data, the anomaly scores along each circle
should remain more or less constant
Figure: (a) Data (b) Score Mean (c) Score Variance
• Beyond about 3σ, the variance among the scores computed by the
Extended IF is much smaller and more stable than that of the Standard IF
• This means that the Extended IF is a much more robust anomaly detector
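This check is easy to reproduce with the end-to-end sketch above (a continuation of that snippet: forest, psi, and score are assumed to be in scope from it):

```python
import numpy as np

# Points on circles of radius r around the origin should all get roughly
# the same score; the per-circle variance measures how well that holds
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
for r in (1.0, 2.0, 3.0, 4.0):
    ring = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    scores = np.array([score(p, forest, psi) for p in ring])
    print(f"r = {r}: mean = {scores.mean():.3f}, variance = {scores.var():.6f}")
```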
Empirical Results
• Let's consider the higher-dimensional case and the extension levels
• Choose 4-D blobs of normally distributed data around the origin
with unit variance in all directions
• Since the data are four-dimensional, there are three levels of extension
beyond the standard case (with extension level 0 coinciding with the Standard IF)
Figure: (a) Score Mean (b) Score Variance
• Beyond about 3σ, and similar to the 2-D case, the Extended IF produces much
lower score variance than the Standard IF
• We can also observe that the variance decreases as the extension level
increases (i.e., the fully extended IF produces the most reliable and robust
anomaly scores)
Empirical Results
AUC Comparison
• In this section we report AUC values for ROC and PRC
• In all cases the improvement is obvious, especially in the case of the
double blob and the sinusoid
Empirical Results
Real data experiments
Figure: Table of data properties
Figure: Table of AUC values for both ROC and PRC
Summary
• This work presents an extension to the anomaly detection
algorithm known as Isolation Forest
• The motivation for the study arose from the observation that score maps
on two-dimensional datasets exhibit unexpected artifacts
• The authors propose that this problem can be solved by using random slopes
(non-axis-parallel hyperplanes) at each split of the branching procedure
• Furthermore, we have seen that this solution improves anomaly detection
not only on synthetic data, but also on real datasets
• Additionally, the authors report that the method does not sacrifice
computational efficiency compared to the Standard IF
Future work
• Dimensionality reduction (either feature selection or feature
extraction) for anomaly detection in high-dimensional data
• Dealing with categorical variables
References
1 Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation Forest." In
2008 Eighth IEEE International Conference on Data Mining, pp. 413-422.
IEEE, 2008.
2 Hariri, Sahand, Matias Carrasco Kind, and Robert J. Brunner. "Extended
Isolation Forest." arXiv preprint arXiv:1811.02141 (2018).