3. Introduction & Objectives
• What is a model? $\text{observation} = \text{model} + \varepsilon$, with $\varepsilon$ being white noise
ESGF 5IFM Q1 2012
• What is the point of writing models?
Describe data behaviour
vinzjeannin@hotmail.com
Model data behaviour
Forecast data behaviour
• Acquire theory knowledge on Econometrics & Statistics
• Step by step from OLS to ANOVA on residuals
• Usage of R and Excel
5. OLS & Exploration
OLS: Ordinary Least Squares
Linear regression model
Minimize the sum of the squared vertical distances
between the observations and the linear
approximation
$y = f(x) = ax + b$
Residual ε
6. Two parameters to estimate:
• Intercept α (written b in the formulas)
• Slope β (written a in the formulas)
Minimising residuals
$E = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left(y_i - (a x_i + b)\right)^2$
When is E minimal?
When the partial derivatives with respect to a and b are both 0
8. $\frac{\partial E}{\partial b} = 0$ leads easily to the intercept
$a \sum_{i=1}^{n} x_i + n b = \sum_{i=1}^{n} y_i$
$a n \bar{x} + n b = n \bar{y}$
$a \bar{x} + b = \bar{y}$
$b = \bar{y} - a \bar{x}$
The regression line goes through $(\bar{x}, \bar{y})$
The distance from this point to the line is indeed 0
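The slope comes from the other condition, $\frac{\partial E}{\partial a} = 0$. A sketch of that step, using the same $a$, $b$ and $E$ and the intercept relation $b = \bar{y} - a\bar{x}$:

```latex
\begin{aligned}
\frac{\partial E}{\partial a} &= -2\sum_{i=1}^{n} x_i\left(y_i - a x_i - b\right) = 0 \\
\sum_{i=1}^{n} x_i y_i &= a\sum_{i=1}^{n} x_i^2 + b\, n\bar{x} \\
\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} &= a\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)
  \quad \text{(substituting } b = \bar{y} - a\bar{x}\text{)} \\
a &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned}
```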
11. $a = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ (covariance over variance)
$a = \dfrac{\mathrm{Cov}_{XY}}{\sigma_X^2}$
$b = \bar{y} - a \bar{x}$
You can use the Excel functions INTERCEPT and SLOPE
12. Calculate the variances and covariance of X = {1,2,3,3,1,2} and Y = {2,3,1,1,3,2}
You can use the Excel functions VAR.P, COVARIANCE.P and STDEV.P
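The same exercise can be sketched in R. Note that R's built-in `var()` and `cov()` divide by n - 1, so the helpers `pop_var` and `pop_cov` below (names chosen here for illustration) compute the population versions that correspond to the Excel `.P` functions. The slope and intercept are then derived from them and compared with `lm`.

```r
# Exercise data from the slide
X <- c(1, 2, 3, 3, 1, 2)
Y <- c(2, 3, 1, 1, 3, 2)

# Population (divide-by-n) moments, the counterpart of Excel's *.P functions
pop_var <- function(v) mean((v - mean(v))^2)
pop_cov <- function(u, v) mean((u - mean(u)) * (v - mean(v)))

var_x  <- pop_var(X)
var_y  <- pop_var(Y)
cov_xy <- pop_cov(X, Y)

# OLS slope and intercept from covariance and variance
a <- cov_xy / var_x
b <- mean(Y) - a * mean(X)

# Cross-check against R's own least-squares fit
fit <- lm(Y ~ X)
print(c(var_x = var_x, var_y = var_y, cov_xy = cov_xy, a = a, b = b))
print(coef(fit))
```

Both variances come out at 2/3 and the covariance at -0.5, which gives a slope of -0.75 and an intercept of 3.5.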
13. Let’s assess the quality of the regression
Let’s calculate the correlation coefficient (aka Pearson Product-Moment
Correlation Coefficient – PPMCC):
$r = \dfrac{\mathrm{Cov}_{XY}}{\sigma_X \sigma_Y}$, a value between -1 and 1
$|r| = 1$: perfect linear dependence
$r \approx 0$: no linear dependence
Gives an idea of the dispersion of the scatterplot
You can use the Excel function CORREL
15. What is good quality?
Slightly discretionary…
If $|r| \geq \frac{\sqrt{3}}{2} \approx 0.866$
It’s widely accepted as the threshold between acceptable and poor
16. The regression itself introduces a bias
Let’s introduce the coefficient of determination R-Squared
Total Dispersion = Dispersion Residual + Dispersion Regression
$\sum_{i} (y_i - \bar{y})^2 = \sum_{i} (y_i - \hat{y}_i)^2 + \sum_{i} (\hat{y}_i - \bar{y})^2$
$R^2 = \dfrac{\text{Dispersion Regression}}{\text{Total Dispersion}}$
In other words, the part of the total dispersion explained by the regression
You can use the Excel function RSQ
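A minimal sketch of this decomposition in R, reusing the X and Y of the slide 12 exercise (the variable names `ss_total`, `ss_residual` and `ss_regression` are chosen here for illustration):

```r
X <- c(1, 2, 3, 3, 1, 2)
Y <- c(2, 3, 1, 1, 3, 2)

fit   <- lm(Y ~ X)
y_hat <- fitted(fit)          # values on the regression line

ss_total      <- sum((Y - mean(Y))^2)       # total dispersion
ss_residual   <- sum((Y - y_hat)^2)         # dispersion left in the residuals
ss_regression <- sum((y_hat - mean(Y))^2)   # dispersion explained by the line

r_squared <- ss_regression / ss_total

print(c(total = ss_total, residual = ss_residual, regression = ss_regression))
print(c(r_squared = r_squared,
        cor_squared = cor(X, Y)^2,
        from_lm = summary(fit)$r.squared))
```

On this data the total dispersion of 4 splits into 1.75 (residual) and 2.25 (regression), so R-Squared is 0.5625, which is also the square of the correlation coefficient (-0.75).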
17. In a simple linear regression with intercept, $R^2 = r^2$
Is a good correlation coefficient and a good coefficient of
determination enough to accept the regression?
Not necessarily!
Residuals must carry no remaining structure; in other words, they must be white noise!
19. Don’t get fooled by numbers!
For every dataset of the quartet:
$\bar{x} = 9$
$\bar{y} = 7.5$
$y = 3 + 0.5x$
$r = 0.82$
$R^2 = 0.67$
Can you say at this stage which regression is the best?
Certainly not those on the right: you need a LINEAR dependence
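These summary statistics match Anscombe's quartet, which ships with R as the `anscombe` data frame. Assuming that is the dataset shown on the slide, the sketch below recomputes the figures for each of the four series (`stats_for` is a helper name introduced here):

```r
# `anscombe` is part of R's built-in datasets: columns x1..x4 and y1..y4
stats_for <- function(k) {
  x <- anscombe[[paste0("x", k)]]
  y <- anscombe[[paste0("y", k)]]
  fit <- lm(y ~ x)
  c(mean_x    = mean(x),
    mean_y    = mean(y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r         = cor(x, y),
    r_squared = summary(fit)$r.squared)
}

# One row per dataset of the quartet
quartet <- t(sapply(1:4, stats_for))
print(round(quartet, 2))
```

The four rows are nearly identical even though the scatterplots look completely different, which is why the numbers alone cannot rank the regressions.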
20. Is any linear regression useless?
Think about what you could do to the series
Polynomial transformation, log transformation,…
Otherwise, use non-linear regression, but that’s another story
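As an illustration of the log transformation, here is a small sketch on a synthetic, noise-free exponential series (the data is made up for the example and is not from the course):

```r
# Synthetic series: y grows exponentially with x (illustration only)
x <- 1:20
y <- 2 * exp(0.3 * x)

r_raw <- cor(x, y)        # linear correlation on the raw series
r_log <- cor(x, log(y))   # linear correlation after the log transformation

print(c(r_raw = r_raw, r_log = r_log))
```

Since log(y) = log(2) + 0.3 x is exactly linear in x, the correlation after transformation is 1, while the raw series is only partly captured by a straight line.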
21. First application to financial markets
S&P / AmEx in 2011
22. $r = \dfrac{\mathrm{Cov}_{AmEx,S\&P}}{\sigma_{AmEx}\,\sigma_{S\&P}} = 0.8501$
$R^2 = r^2 = 0.7227$
Oops :-o
Is Excel wrong?
R-Squared has different calculation methods
Let’s accept the following regression then, as the quality seems pretty good
$\text{AmEx} = 0.06\% + 1.1046 \times \text{S\&P}$
23. How to use this?
• Forecasting? Not really…
Both are random variables
• Hedging? Yes, but with basis risk
Yes, but be careful with the residuals…
In theory, what is the daily result of the hedge? ε
Let’s have a try!
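A minimal sketch of the hedge arithmetic, assuming a long position in the stock hedged by a short position of beta times the notional in the index. The function `hedge_pnl` and the three daily returns are hypothetical and only illustrate the mechanics; the intercept and slope are the ones from the regression above. Strictly, the daily result per unit of notional is the intercept plus the residual, and the intercept is small here.

```r
# Daily P&L of a long stock position hedged by shorting beta * notional of the index
hedge_pnl <- function(r_asset, r_index, notional, beta) {
  notional * r_asset - beta * notional * r_index
}

# Hypothetical daily returns, for illustration only (not the 2011 series)
r_spx  <- c(0.010, -0.005, 0.002)
r_amex <- c(0.012, -0.004, 0.001)

alpha <- 0.0006   # intercept of the regression (0.06%)
beta  <- 1.1046   # slope of the regression

pnl <- hedge_pnl(r_amex, r_spx, notional = 1e6, beta = beta)

# Regression residuals for the same days
eps <- r_amex - (alpha + beta * r_spx)

print(pnl)
print(1e6 * (alpha + eps))   # same numbers: the hedge result is alpha + residual
```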
24. Hedging $1.0M of AmEx Stocks with $1.1046M of S&P
It would have been too easy… Great differences… Why?
Sensitivity to the size of the sample
Heteroscedasticity
25. Let’s take a similar approach using proper statistics and econometrics software: R
• Free
• Open Source
• Developments shared by developers
Let’s begin with statistical exploration to get familiar with the series
and the software
> Val<-read.csv(file="C:/Users/Vinz/Desktop/Val.csv",header=TRUE,sep=",")
> summary(Val)
      SPX                  AMEX
 Min.   :-0.0666344   Min.   :-0.0883287
 1st Qu.:-0.0069082   1st Qu.:-0.0094580
 Median : 0.0010016   Median : 0.0013007
 Mean   : 0.0001249   Mean   : 0.0005891
 3rd Qu.: 0.0075235   3rd Qu.: 0.0102923
 Max.   : 0.0474068   Max.   : 0.0710967
27. These are clearly negatively skewed distributions
Reminders
$\mathrm{Skew}(X) = E\left[\left(\dfrac{X - \mu}{\sigma}\right)^3\right] = \dfrac{E\left[(X - \mu)^3\right]}{\left(E\left[(X - \mu)^2\right]\right)^{3/2}}$
• Negative skew: long left tail, mass on the right, skew to the left
• Positive skew: long right tail, mass on the left, skew to the right
> library(moments)
> skewness(Val$AMEX)
[1] -0.2453693
> skewness(Val$SPX)
[1] -0.4178701
28. These are clearly leptokurtic distributions
Reminders
$\mathrm{Kurt}(X) = E\left[\left(\dfrac{X - \mu}{\sigma}\right)^4\right] = \dfrac{E\left[(X - \mu)^4\right]}{\left(E\left[(X - \mu)^2\right]\right)^{2}}$
> library(moments)
> kurtosis(Val$AMEX)
[1] 5.770583
> kurtosis(Val$SPX)
[1] 5.671254
What is their excess kurtosis K? Subtract 3 to make it relative to the
normal distribution…
29. Quick check: what are the Skewness and Kurtosis of {1,2,-3,0,-2,1,1}?
ESGF 4IFM Q1 2012
Excel function SKEW
R function skewness (package moments)
30. Excel function KURT
R function kurtosis (package moments)
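The quick check can be sketched in base R directly from the formulas in the reminders, using population moments (the helper `central_moment` is a name introduced here). To the best of my knowledge the `moments` package uses the same definitions, whereas Excel's SKEW and KURT apply small-sample corrections (and KURT returns an excess kurtosis), so their results will differ on such a short series.

```r
x <- c(1, 2, -3, 0, -2, 1, 1)

# k-th central moment, dividing by n (population convention)
central_moment <- function(v, k) mean((v - mean(v))^k)

skew <- central_moment(x, 3) / central_moment(x, 2)^(3/2)
kurt <- central_moment(x, 4) / central_moment(x, 2)^2

print(c(skewness = skew, kurtosis = kurt, excess_kurtosis = kurt - 3))
```

With these definitions the skewness is about -0.71 and the kurtosis is 2.03 (excess kurtosis -0.97).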
31. By the way, what is the most platykurtic distribution in nature?
Toss it!
Heads = Success = 1 / Tails = Failure = 0
> library(moments)
> toss<-rbinom(10000000,1,0.5)
> mean(toss)
[1] 0.5001777
> kurtosis(toss)
[1] 1.000001
> kurtosis(toss)-3
[1] -1.999999
> hist(toss, breaks=10, main="Tossing a coin 10 million times", xlab="Result of the trial", ylab="Occurrence")
> sum(toss)
[1] 5001777
32. 50.01777% rate of success: fair or not fair? Trick coin?
Can be tested later with a Bayesian approach
On a perfect 50/50, Kurtosis would be 1, Excess Kurtosis -2: the minimum!
This is a Bernoulli trial
$B(n, p)$ with $n \geq 1$ and $0 < p < 1$, $p \in \mathbb{R}$ and $n$ integer; a single toss is the case $n = 1$, for which:
Mean: $p$
SD: $\sqrt{p(1 - p)}$
Skewness: $\dfrac{1 - 2p}{\sqrt{p(1 - p)}}$
Kurtosis: $\dfrac{1}{p(1 - p)} - 3$
Easy to demonstrate that the Kurtosis is lowest if p=0.5
A bit more complicated to demonstrate it for any distribution
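The argument for p = 0.5 is short: p(1 - p) is at most 1/4, reached at p = 1/2, so 1/(p(1 - p)) - 3 is at least 4 - 3 = 1. A small numerical sketch of the same thing, scanning a grid of p values (`bernoulli_kurtosis` is a helper name introduced here):

```r
# Kurtosis of a single Bernoulli trial with success probability p
bernoulli_kurtosis <- function(p) 1 / (p * (1 - p)) - 3

p_grid <- seq(0.01, 0.99, by = 0.01)
k_grid <- bernoulli_kurtosis(p_grid)

p_min <- p_grid[which.min(k_grid)]   # p where the kurtosis is lowest
k_min <- min(k_grid)

print(c(p_min = p_min, kurtosis = k_min, excess_kurtosis = k_min - 3))
```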
33. Back to our series, a good tool is the BoxPlot
Too many outliers! There should be 2 at most for the series to be normal
Fatter tails than the normal distribution
boxplot(Val$AMEX,Val$SPX, main="AMEX & S&P BoxPlots",
names=c("AMEX","SPX"),col="blue")
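The "2 at most" rule of thumb can be checked with a short calculation. This sketch assumes the default boxplot whiskers (1.5 times the interquartile range beyond the quartiles) and the sample size of 251 used later in the deck; it computes how many points a normal sample of that size is expected to leave outside the whiskers.

```r
n <- 251   # sample size of the series

# Quartiles and interquartile range of the standard normal distribution
q1  <- qnorm(0.25)
q3  <- qnorm(0.75)
iqr <- q3 - q1

# Lower whisker limit; the upper one is symmetric
lower_fence <- q1 - 1.5 * iqr

# Probability of falling outside either whisker, and expected count for n points
p_outside         <- 2 * pnorm(lower_fence)
expected_outliers <- n * p_outside

print(c(p_outside = p_outside, expected_outliers = expected_outliers))
```

About 0.7% of a normal sample lies outside the whiskers, which is slightly under 2 points out of 251.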
34. Leptokurtic distributions
Negatively skewed distribution
Are they normal distributions?
Let’s compare them to normal distributions with the same
mean and standard deviation and draw the QQ plots
38. Can use many tests…
• Kolmogorov-Smirnov
• Jarque-Bera
• Chi-Square
• Shapiro-Wilk
Let’s try Kolmogorov-Smirnov
It compares the distance between the empirical
CDF and the CDF of the reference distribution
39.
x=seq(-4,4,length=1000)
plot(ecdf(Val$AMEX),do.points=FALSE, col="red", lwd=3,
main="Normal Distribution against AMEX - CDFs", xlab="x",
ylab="P(X<=x)")
lines(x,pnorm(x,mean=mean(Val$AMEX),sd=sd(Val$AMEX)),col="blue",type="l",lwd=3)
x=seq(-4,4,length=1000)
plot(ecdf(Val$SPX),do.points=FALSE, col="red", lwd=3,
main="Normal Distribution against S&P - CDFs", xlab="x",
ylab="P(X<=x)")
lines(x,pnorm(x,mean=mean(Val$SPX),sd=sd(Val$SPX)),col="blue",type="l",lwd=3)
40. > ks.test(Val$SPX, "pnorm")
One-sample Kolmogorov-Smirnov test
data: Val$SPX
D = 0.4811, p-value < 2.2e-16
alternative hypothesis: two-sided
> ks.test(Val$AMEX, "pnorm")
One-sample Kolmogorov-Smirnov test
data: Val$AMEX
D = 0.4742, p-value < 2.2e-16
alternative hypothesis: two-sided
Note: called this way, ks.test compares each series with the standard normal N(0,1); to compare with a normal of the same mean and standard deviation, pass them as extra arguments, e.g. ks.test(Val$SPX, "pnorm", mean(Val$SPX), sd(Val$SPX))
The null hypothesis (H0) is that the distribution is normal
Do we accept or reject H0 at the 95% confidence level?
The hypothesis regarding the distributional
form is rejected if the test statistic, D, is greater
than the critical value obtained from a table
41. Sample size: 251
Critical value: $\dfrac{1.36}{\sqrt{251}} = 0.086$
Rejected or not?
Rejected! The series do not fit a normal distribution
The p-value was already giving the answer
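A sketch of the critical-value arithmetic, together with an illustration of how much the reference distribution matters. The series below is synthetic (normal draws with a standard deviation of 1%, a scale picked here to resemble daily returns), not the course data. Called with only `"pnorm"`, `ks.test` compares against N(0,1); passing the sample mean and standard deviation compares against a normal with matching parameters.

```r
n <- 251
critical_value <- 1.36 / sqrt(n)   # 5% critical value used on the slide

# Synthetic, genuinely normal "returns" (illustration only)
set.seed(1)
x <- rnorm(n, mean = 0, sd = 0.01)

# Against the standard normal N(0,1): large D even though x is normal
d_standard <- unname(ks.test(x, "pnorm")$statistic)

# Against a normal with the sample's own mean and standard deviation
d_fitted <- unname(ks.test(x, "pnorm", mean(x), sd(x))$statistic)

print(c(critical_value = critical_value,
        d_standard = d_standard,
        d_fitted = d_fitted))
```

The first statistic is close to 0.5 simply because the data is on a much smaller scale than N(0,1), so the second form is the one that answers the normality question. When the mean and standard deviation are estimated from the same sample, the tabulated critical value is only approximate (the Lilliefors variant of the test addresses this).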
42. OK, we now know a bit more about the 2 series we want to regress
> lm(Val$AMEX~Val$SPX)
Call:
lm(formula = Val$AMEX ~ Val$SPX)
Coefficients:
(Intercept)     Val$SPX
  0.0004505   1.1096287
plot(Val$SPX,Val$AMEX, main="S&P / AmEx", xlab="S&P", ylab="AmEx",
col="red")
abline(lm(Val$AMEX~Val$SPX), col="blue")
$y = 110.96\% \times x + 0.045\%$
43. The next important step is to analyse the residuals
> Reg<-lm(Val$AMEX~Val$SPX)
> summary(Reg)
Call:
lm(formula = Val$AMEX ~ Val$SPX)
Residuals:
      Min        1Q    Median        3Q       Max
-0.030387 -0.006072 -0.000114  0.006624  0.027824
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0004505  0.0006365   0.708     0.48
Val$SPX     1.1096287  0.0434231  25.554   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01008 on 249 degrees of freedom
Multiple R-squared: 0.7239, Adjusted R-squared: 0.7228
F-statistic: 653 on 1 and 249 DF, p-value: < 2.2e-16
The residuals need to be white noise; their quartiles give a first assessment (they should be roughly symmetric around a median close to 0)
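Several figures in this summary can be rebuilt from one another, which helps when reading it. The sketch below only uses numbers printed in the output above (variable names are chosen here): the t value is the estimate divided by its standard error, the F-statistic of a one-regressor model is the square of the slope's t value, and the adjusted R-squared corrects for the number of observations.

```r
n        <- 251         # observations (249 degrees of freedom + 2 parameters)
estimate <- 1.1096287   # slope
std_err  <- 0.0434231   # its standard error
r2       <- 0.7239      # multiple R-squared

t_value <- estimate / std_err
f_value <- t_value^2                          # holds for a single regressor
adj_r2  <- 1 - (1 - r2) * (n - 1) / (n - 2)   # adjusted R-squared

print(round(c(t = t_value, F = f_value, adj_r2 = adj_r2, r = sqrt(r2)), 4))
```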
45. The QQ plot compares the quantiles (inverse CDFs) of two distributions
A perfect fit is a straight line
Left tail noticeably different
46. Residuals should be randomly distributed around the 0 horizontal line
You don’t want to see a trend or a dependence
To accept the regression you need the residuals to be white noise
Their mean should be 0
47. Nothing suggesting white noise
• Square root of the absolute standardized residuals as a function of the
fitted values
• There should be no obvious trend in this plot
48. Now showing leverage
Marginal importance of a point in the regression
Far-away points suggest outliers or a poor model
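The four views discussed on slides 45 to 48 (QQ plot, residuals against fitted values, scale-location, leverage) correspond to the default diagnostic plots R draws for a fitted `lm` object; whether the original figures were produced this way is an assumption. Since Val.csv is not available here, the sketch uses R's built-in `cars` dataset as a stand-in.

```r
# Stand-in data: built-in `cars` dataset (speed, dist) replaces Val.csv
fit <- lm(dist ~ speed, data = cars)

# Draw the four default diagnostics on one page:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# With an intercept in the model, OLS residuals average to zero by construction
mean_resid <- mean(resid(fit))
print(mean_resid)
```

For the course data the call would be `plot(Reg)` on the regression object fitted earlier.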
49. So do we accept the regression?
Probably not… But let’s check…
Kolmogorov-Smirnov on residuals
$D = \dfrac{1.36}{\sqrt{251}} = 0.086$: upper bound for H0 to be accepted
Resid<-resid(Reg)
ks.test(Resid, "pnorm")
One-sample Kolmogorov-Smirnov test
data: Resid
D = 0.4889, p-value < 2.2e-16
alternative hypothesis: two-sided
Rejected! Regressions between 2 different assets are very often poor
Heteroscedasticity
Basis risk if you hedge anyway