You have probably read many times that a Data Scientist's job is to sift through huge amounts of data in search of correlations between variables of interest, and that one of the worst pitfalls (besides confusing correlation with causation) is spurious correlation. But what is correlation, really? Are there several types of correlation? Some "good", some "bad"? How are they estimated? This talk is a highly visual presentation of the notions of correlation and dependence. I will first illustrate how the standard linear correlation (the Pearson coefficient) is estimated, then present a more robust alternative: the Spearman coefficient. Building on a geometric understanding of their nature, I will present a generalization that can help Data Scientists explore, interpret, and measure the dependence (not necessarily linear or comonotonic) between the variables of a given dataset. Financial time series (stocks, credit default swaps, FX rates) and features from UCI datasets serve as use cases.
1. HELLEBORECAPITAL
Introduction
Standard correlation coefficients
A metric space for copulas
Applications
Conclusion
A closer look at correlations
Paris Machine Learning Meetup #3 Season 4
G. Marti, S. Andler, F. Nielsen, P. Donnat
HELLEBORECAPITAL
November 9, 2016
Gautier Marti A closer look at correlations
1 Introduction
2 Standard correlation coefficients
Pearson correlation coefficient
Spearman correlation coefficient
3 A metric space for copulas
On the importance of the normalization
Which metric? (Regularized) Optimal Transport
A customizable dependence coefficient: TFDC
4 Applications
Explore the correlations with clustering
Query your dataset about correlations with TFDC
5 Conclusion
What is correlation?
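$$\rho(X_i, X_j) = \frac{E[X_i X_j] - E[X_i]E[X_j]}{\sqrt{(E[X_i^2] - E[X_i]^2)(E[X_j^2] - E[X_j]^2)}} \in [-1, 1]$$

$$\hat{\rho}(X_i, X_j) = \frac{\sum_{k=1}^{N} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{N} (x_{ik} - \bar{x}_i)^2} \sqrt{\sum_{k=1}^{N} (x_{jk} - \bar{x}_j)^2}} \in [-1, 1]$$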
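import numpy as np
# np.corrcoef returns the 2x2 correlation matrix; the coefficient is the off-diagonal entry
rho = np.corrcoef(x_i, x_j)[0, 1]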
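As a sanity check, the sample formula on this slide can be written out term by term and compared with NumPy. This is only a minimal sketch: the function name `pearson` and the synthetic data are illustrative assumptions, not part of the talk.

```python
import numpy as np

def pearson(x_i, x_j):
    """Sample Pearson correlation, written term by term from the sample formula."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    d_i = x_i - x_i.mean()  # deviations from the sample mean of x_i
    d_j = x_j - x_j.mean()  # deviations from the sample mean of x_j
    return np.sum(d_i * d_j) / np.sqrt(np.sum(d_i ** 2) * np.sum(d_j ** 2))

# Synthetic, linearly related data (an assumption made for illustration only)
rng = np.random.default_rng(0)
x_i = rng.normal(size=1000)
x_j = 0.5 * x_i + rng.normal(size=1000)

rho_manual = pearson(x_i, x_j)
rho_numpy = np.corrcoef(x_i, x_j)[0, 1]
```

Both values should agree up to floating-point error, since `np.corrcoef` computes the same normalized covariance.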
Spearman correlation: Pearson on ranks
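The slide title can be taken literally: replace each observation by its rank, then apply Pearson. A minimal sketch, assuming no ties in the data (the helper names `ranks` and `spearman` are illustrative):

```python
import numpy as np

def ranks(x):
    """Ranks 1..N of the values in x (assumes no ties)."""
    x = np.asarray(x)
    r = np.empty(len(x), dtype=float)
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

def spearman(x_i, x_j):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return np.corrcoef(ranks(x_i), ranks(x_j))[0, 1]

# A monotonic but non-linear relation (synthetic data, for illustration only)
rng = np.random.default_rng(0)
x_i = rng.normal(size=500)
x_j = np.exp(x_i)

rho_pearson = np.corrcoef(x_i, x_j)[0, 1]
rho_spearman = spearman(x_i, x_j)
```

Because `exp` preserves the ordering, the two rank vectors coincide and the Spearman coefficient equals 1 up to rounding, while the Pearson coefficient stays clearly below 1.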
Spearman correlation with outliers
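To make the robustness claim concrete, here is a small experiment under assumed synthetic data: a single extreme point is injected into an otherwise strongly linear sample, and both coefficients are recomputed. The sample size and the outlier position are arbitrary choices for illustration.

```python
import numpy as np

def ranks(x):
    """Ranks 1..N of the values in x (assumes no ties)."""
    r = np.empty(len(x), dtype=float)
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

def pearson(x_i, x_j):
    return np.corrcoef(x_i, x_j)[0, 1]

def spearman(x_i, x_j):
    return np.corrcoef(ranks(x_i), ranks(x_j))[0, 1]

rng = np.random.default_rng(42)
x_i = rng.normal(size=200)
x_j = x_i + 0.1 * rng.normal(size=200)  # strong linear dependence

# Copy the sample and replace one observation by an extreme outlier
x_i_out, x_j_out = x_i.copy(), x_j.copy()
x_i_out[0], x_j_out[0] = 100.0, -100.0

pearson_clean, pearson_out = pearson(x_i, x_j), pearson(x_i_out, x_j_out)
spearman_clean, spearman_out = spearman(x_i, x_j), spearman(x_i_out, x_j_out)
```

In this setup the single outlier is enough to flip the sign of the Pearson estimate, whereas the Spearman estimate moves only slightly, because the outlier can change each rank by a bounded amount.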
From ranks to empirical copula
Sklar's Theorem [3]
For $(X_i, X_j)$ having continuous marginal cdfs $F_{X_i}$, $F_{X_j}$, its joint cumulative distribution $F$ is uniquely expressed as
$$F(X_i, X_j) = C(F_{X_i}(X_i), F_{X_j}(X_j)),$$
where $C$ is known as the copula of $(X_i, X_j)$.
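In practice the marginal cdfs are unknown, so they are commonly replaced by normalized ranks, and the copula is then estimated from the resulting points in the unit square. The sketch below is one possible discretization, not necessarily the one used in the talk: the rank normalization `r / (N + 1)` and the 10 x 10 histogram are assumptions, and the data are synthetic.

```python
import numpy as np

def pseudo_observations(x):
    """Normalized ranks r / (N + 1), an estimate of F_X(x) (assumes no ties)."""
    n = len(x)
    r = np.empty(n, dtype=float)
    r[np.argsort(x)] = np.arange(1, n + 1)
    return r / (n + 1)

def empirical_copula_density(x_i, x_j, n_bins=10):
    """Histogram of the rank-transformed sample on [0, 1]^2, normalized to sum to 1."""
    u_i = pseudo_observations(x_i)
    u_j = pseudo_observations(x_j)
    hist, _, _ = np.histogram2d(u_i, u_j, bins=n_bins, range=[[0, 1], [0, 1]])
    return hist / hist.sum()

rng = np.random.default_rng(0)
x_i = rng.normal(size=1000)
x_j = x_i ** 3 + rng.normal(size=1000)

density = empirical_copula_density(x_i, x_j)
```

Whatever the original marginals, the rows and the columns of `density` each carry the same total mass: the rank transform makes both marginals uniform, so only the dependence structure remains.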
Minimum, Independence, Maximum copulas
Fréchet–Hoeffding copula bounds
For any copula $C : [0, 1]^2 \to [0, 1]$ and any $(u, v) \in [0, 1]^2$, the following bounds hold:
$$W(u, v) \le C(u, v) \le M(u, v),$$
where $W$ is the copula for counter-monotonic random variables, and $M$ is the copula for co-monotonic random variables.
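The three reference copulas have simple closed forms in the bivariate case: $W(u, v) = \max(u + v - 1, 0)$, $\Pi(u, v) = uv$ for independence, and $M(u, v) = \min(u, v)$. A minimal numerical check of the bounds for the independence copula on a grid (the grid resolution is an arbitrary choice):

```python
import numpy as np

def W(u, v):
    """Lower Frechet-Hoeffding bound (counter-monotonic copula)."""
    return np.maximum(u + v - 1.0, 0.0)

def Pi(u, v):
    """Independence copula."""
    return u * v

def M(u, v):
    """Upper Frechet-Hoeffding bound (co-monotonic copula)."""
    return np.minimum(u, v)

grid = np.linspace(0.0, 1.0, 101)
u, v = np.meshgrid(grid, grid)

tol = 1e-12  # tolerance for floating-point rounding
lower_ok = bool(np.all(W(u, v) <= Pi(u, v) + tol))
upper_ok = bool(np.all(Pi(u, v) <= M(u, v) + tol))
```

Both flags come out true: $uv \ge u + v - 1$ follows from $(1 - u)(1 - v) \ge 0$, and $uv \le \min(u, v)$ holds because both factors lie in $[0, 1]$.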
[Figure: six heatmaps over (u_i, u_j) in [0, 1]^2, titled w(u_i, u_j), W(u_i, u_j), π(u_i, u_j), Π(u_i, u_j), m(u_i, u_j), and M(u_i, u_j).]
A metric space for copulas
Which metric? (Regularized) Optimal Transport
Distance is the minimum cost of transportation to transform one
pile of dirt into another one, i.e. the amount of dirt moved times
the distance by which it is moved.
$$\mathrm{EMD} = |x_1 - x_2| \qquad \mathrm{EMD} = \tfrac{1}{6}|x_1 - x_3| + \tfrac{1}{6}|x_2 - x_3|$$
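In one dimension the "pile of dirt" cost can be computed directly: the mass that must cross each gap between consecutive points is the difference of the two cumulative distributions. The sketch below reproduces both expressions above. The numeric positions and the mass layout of the second case (1/6 at $x_1$, 1/6 at $x_2$, 4/6 at $x_3$, all moved to $x_3$) are assumptions chosen to match the formula, since the original picture is not available.

```python
import numpy as np

def emd_1d(positions, p, q):
    """Earth mover's distance between histograms p and q on shared 1D support points."""
    positions = np.asarray(positions, dtype=float)
    order = np.argsort(positions)
    x = positions[order]
    diff = (np.asarray(p, dtype=float) - np.asarray(q, dtype=float))[order]
    flow = np.cumsum(diff)[:-1]  # net mass crossing each gap between neighbors
    return float(np.sum(np.abs(flow) * np.diff(x)))

x1, x2, x3 = 0.0, 1.0, 3.0  # arbitrary positions, for illustration

# Case 1: a unit pile at x1 is moved to x2
emd_a = emd_1d([x1, x2], [1.0, 0.0], [0.0, 1.0])

# Case 2 (assumed layout): 1/6 of the mass sits at x1, 1/6 at x2, the rest at x3,
# and the target pile is entirely at x3
emd_b = emd_1d([x1, x2, x3], [1 / 6, 1 / 6, 4 / 6], [0.0, 0.0, 1.0])
```

With these positions, `emd_a` equals $|x_1 - x_2| = 1$ and `emd_b` equals $\tfrac{1}{6} \cdot 3 + \tfrac{1}{6} \cdot 2 = \tfrac{5}{6}$.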
Its geometry has good properties in general [1], and for copulas [2].
[Figure: four copula heatmaps on [0, 1]^2, followed by two panels titled "Bregman barycenter copula" and "Wasserstein barycenter copula".]
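The "regularized" variant refers to the entropy-regularized optimal transport of [1], which is solved by alternately rescaling the rows and columns of a kernel matrix. Below is a bare-bones NumPy sketch of these Sinkhorn iterations on 1D histograms. The regularization strength, the iteration count, the ground cost, and the test histograms are all illustrative assumptions; it is not the implementation used for the talk.

```python
import numpy as np

def sinkhorn(p, q, cost, reg=0.2, n_iter=1000):
    """Entropy-regularized transport cost between histograms p and q, with the plan."""
    K = np.exp(-cost / reg)      # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)        # rescale to match the column marginal q
        u = p / (K @ v)          # rescale to match the row marginal p
    plan = u[:, None] * K * v[None, :]
    return float(np.sum(plan * cost)), plan

x = np.linspace(0.0, 1.0, 20)
cost = np.abs(x[:, None] - x[None, :])  # ground cost |x - y|

def bump(center, width=0.1):
    """Strictly positive histogram concentrated around `center`."""
    w = np.exp(-0.5 * ((x - center) / width) ** 2)
    return w / w.sum()

p, q = bump(0.2), bump(0.8)
d_pq, plan = sinkhorn(p, q, cost)
d_pp, _ = sinkhorn(p, p, cost)
```

The returned plan has marginals `p` and `q` up to numerical precision. Note that `d_pp` is small but not zero: the entropy term blurs the plan, which is the price paid for the fast iterations.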
The Target/Forget Dependence Coefficient (TFDC)
Now, we can define our bespoke dependence coefficient:
Build the forget-dependence copulas $\{C^F_l\}_l$
Build the target-dependence copulas $\{C^T_k\}_k$
Compute the empirical copula $C_{ij}$ from $x_i$, $x_j$
$$\mathrm{TFDC}(C_{ij}) = \frac{\min_l D(C^F_l, C_{ij})}{\min_l D(C^F_l, C_{ij}) + \min_k D(C_{ij}, C^T_k)}$$
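The steps above translate almost line by line into code once a distance $D$ between copulas is available. In the sketch below, `tfdc` follows the formula, but the distance is a deliberately simple placeholder (a Frobenius norm between grid densities) standing in for the optimal transport distance of the talk. The 10 x 10 grid and the choice of forget and target copulas are likewise assumptions for illustration.

```python
import numpy as np

def tfdc(c_ij, forget_copulas, target_copulas, dist):
    """Target/Forget Dependence Coefficient for a given copula distance `dist`."""
    d_forget = min(dist(c_f, c_ij) for c_f in forget_copulas)
    d_target = min(dist(c_ij, c_t) for c_t in target_copulas)
    # Undefined (0 / 0) if c_ij is at distance 0 from both families
    return d_forget / (d_forget + d_target)

def placeholder_dist(a, b):
    """Placeholder distance (Frobenius norm), NOT the optimal transport distance."""
    return float(np.linalg.norm(a - b))

n = 10
independence = np.full((n, n), 1.0 / n ** 2)    # uniform mass on the grid
comonotonic = np.eye(n) / n                     # mass on the diagonal
countermonotonic = np.fliplr(np.eye(n)) / n     # mass on the anti-diagonal

forget = [independence]                         # dependence to ignore
target = [comonotonic, countermonotonic]        # dependence to detect

score_indep = tfdc(independence, forget, target, placeholder_dist)
score_comon = tfdc(comonotonic, forget, target, placeholder_dist)
score_mixed = tfdc(0.5 * independence + 0.5 * comonotonic,
                   forget, target, placeholder_dist)
```

The coefficient is 0 on a forget copula, 1 on a target copula, and in between otherwise. Changing the two lists changes what the coefficient measures, which is the sense in which it is customizable.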
TFDC Power
[Plot: eight panels of power curves; x-axis: Noise Level (0 to 100); y-axis: Power (0.0 to 1.0); legend: cor, dCor, MIC, ACE, MMD, CMMD, RDC, TFDC.]
Figure: Power of several dependence coefficients as a function of the
noise level in eight different scenarios. Insets show the noise-free form of
each association pattern. The coefficient power was estimated via 500
simulations with sample size 500 each.
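The caption does not spell out how power is computed. One common procedure, assumed here, is to compare the statistic on dependent samples against its 95% quantile on samples where the dependence has been broken. The sketch below applies it to a linear pattern with the absolute Pearson coefficient. The noise model, the significance level, and the reduced simulation sizes in the example calls are illustrative assumptions.

```python
import numpy as np

def estimate_power(stat, noise, n_sims=500, n_samples=500, level=0.05, seed=0):
    """Fraction of dependent samples whose statistic exceeds the null quantile."""
    rng = np.random.default_rng(seed)
    null_stats = np.empty(n_sims)
    alt_stats = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(size=n_samples)
        y = x + noise * rng.normal(size=n_samples)  # linear pattern plus noise
        alt_stats[s] = stat(x, y)
        x_indep = rng.uniform(size=n_samples)       # fresh x, independent of y
        null_stats[s] = stat(x_indep, y)
    threshold = np.quantile(null_stats, 1.0 - level)
    return float(np.mean(alt_stats > threshold))

def abs_pearson(x, y):
    return abs(np.corrcoef(x, y)[0, 1])

# Smaller settings than in the figure, to keep the example fast
power_low_noise = estimate_power(abs_pearson, noise=0.1, n_sims=100, n_samples=100)
power_high_noise = estimate_power(abs_pearson, noise=100.0, n_sims=100, n_samples=100)
```

Any other dependence coefficient can be passed as `stat`, which is how a comparison across coefficients and noise levels could be assembled.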
Clustering of empirical copulas
Financial correlations - Stocks CAC 40
Figure: Stocks: More mass in the bottom-left corner, i.e. lower tail
dependence. Stock prices tend to plummet together.
Financial correlations - Credit Default Swaps
Figure: Credit default swaps: More mass in the top-right corner, i.e. upper tail dependence. The cost of insuring against entities' default tends to soar in stressed markets.
Financial correlations - FX rates
Figure: FX rates: Empirical copulas show that the dependence between FX rates varies. For example, rates may exhibit either strong dependence or independence while being anti-correlated during extreme events.
Associations between features in UCI datasets
Dependence patterns (= clustering centroids) found between features in UCI datasets
[Figure: five centroid copulas on [0, 1]^2 for each of the datasets Breast Cancer (wdbc), Libras Movement, Parkinsons, and Gamma Telescope.]
The Art of formulating questions about correlations
Encode your dependence hypothesis as a copula, and your query as a
“k-NN search”.
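A query of this kind can be sketched as a nearest-neighbor search over the empirical copulas of all variable pairs. Everything below is hypothetical scaffolding: the pair names, the toy copulas, and the placeholder Frobenius distance (standing in for the optimal transport distance of the talk) are assumptions for illustration.

```python
import numpy as np

def knn_query(query_copula, copulas, dist, k=3):
    """Return the k (pair, distance) entries closest to the query copula."""
    scored = [(pair, dist(query_copula, c)) for pair, c in copulas.items()]
    return sorted(scored, key=lambda item: item[1])[:k]

def placeholder_dist(a, b):
    """Placeholder distance (Frobenius norm), NOT the optimal transport distance."""
    return float(np.linalg.norm(a - b))

n = 10
independence = np.full((n, n), 1.0 / n ** 2)
comonotonic = np.eye(n) / n
countermonotonic = np.fliplr(np.eye(n)) / n

# Hypothetical dataset: one empirical copula per pair of variables
copulas = {
    ("a", "b"): comonotonic,
    ("a", "c"): independence,
    ("b", "c"): countermonotonic,
}

# Hypothesis to test: "which pairs move (almost) perfectly together?"
query = 0.9 * comonotonic + 0.1 * independence

nearest = knn_query(query, copulas, placeholder_dist, k=2)
nearest_pairs = [pair for pair, _ in nearest]
```

Here the pair `("a", "b")` is returned first, followed by `("a", "c")`. With real data, the dictionary would hold the empirical copula of every pair of columns, and the query copula would encode the dependence pattern of interest.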
Internships at Hellebore
If you are interested in an internship at Hellebore:
in applied machine learning for finance (NLP, text classification, information extraction), please contact: stage@helleboretech.com
in ML/finance research (copulas, Bayesian inference, clustering, time series analysis), please contact: gmarti@helleborecapital.com
[1] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[2] Gautier Marti, Sébastien Andler, Frank Nielsen, and Philippe Donnat. Optimal transport vs. Fisher-Rao distance between copulas for clustering multivariate time series. In IEEE Statistical Signal Processing Workshop, SSP 2016, Palma de Mallorca, Spain, June 26-29, 2016, pages 1–5, 2016.
[3] A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Université Paris 8, 1959.