Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 1
Curse of Dimensionality
and
Big Data
Stephane Marchand-Maillet
Viper group
University of Geneva
Switzerland
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 2
• Are you familiar with vector spaces?
– Dimension, projection
• Are you familiar with statistics?
– Mean, variance, Gaussian distribution
• Are you familiar with linear algebra?
– Matrix, inner product
• Are you familiar with indexing?
– Principle, methods
• Do you realise all the above is one and the same
thing?
– That’s what we’ll see 
– I hope it will not be just trivial…
Quick get-to-know (profiling  )
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 3
• To provide you with an overview of
– Basics of data modelling
– Potential issues with high-dimensional data
– Large-scale indexing techniques
• To create bridges between basic techniques
– For better intuition
– To understand what is the information we
manipulate
– To understand what approximations are made
• To start you on doing your own data modelling
Objectives of the tutorial
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 4
Outline
• Motivation and context
• Large-scale high-dimensional data
• Fighting the dimensionality
• Fighting large-scale volumes
Note: Several illustrations from within these slides have been borrowed from the Web, including Wikipedia or teaching
material. Please do not reproduce without permission from the respective authors. When in doubt, don't.
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 5
Data Production
• Growth of Data
– 1,250 billion GB (≈1,250 EB, i.e. ≈1.25 ZB) of data generated in 2010.
– Data generation growing at a rate of 58%
per year
• Baraniuk, R., “More is Less: Signal Processing and the
Data Deluge”, Science, V331, 2011.
1 exabyte (EB) = 1,073,741,824 gigabytes
[Chart: Data Generation Growth — data size (EB) per year, 2010–2014]
http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
http://www.ritholtz.com/blog/2011/12/financial-industry-interconnectedness/
[Figures: Internet, Scientific and Industry data — by Sverre Jarp, by Felix Schürmann]
© Copyright attached
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 6
A digital world
© Copyright attached
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 7
[Picture from: http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html]
Data communication
© Copyright attached
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 8
User “productivity”
[Picture from: http://www.go-gulf.com/wp-content/themes/go-gulf/blog/online-time.jpg - Feb 2012]
© Copyright attached
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 9
Motivation
• Decision making requires informed
choices
• The information is often not easy to
manage and access
• The information is often
overwhelming
– « Big Data » trend
We need to bring structure to the raw data
• Document (data) representation
• Similarity measurements
• Further analysis: mining, retrieval, learning
© Copyright attached
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 10
Information management process
Raw documents
Representation space
(visualisation)
Document features
User interaction
Feature extraction
“Appropriate”
mapping
“Decision”
process
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 11
Example: text
Text documents
Feature extraction
“Appropriate”
mapping
User interaction
“Decision” process
“Word” occurrences
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 12
Example: Images
Images
Feature extraction
“Appropriate”
mapping
User interaction
“Decision” process
Photo collage
Filtering
Color histogram
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 13
Also...
• Any type of media: webpage, audio, video,
data,...
• Objects, based on their characteristics
• People in social networks
• Concepts: processes, states, ... Etc
⇒ Anything for which “characteristics” may be measured
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 14
The key is distance
• Features help characterize a single document (summary)
• Features help compare two documents
• How can they help structuring a collection?
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 15
Most often back to the local neighbours
- Information retrieval
- Similarity query
- Machine learning
- Generalization
- Data mining
- Discover continuous patterns
Distance measurements
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 16
However
Two main issues:
• High-dimensional data
– «Curse of dimensionality»
• Large data
– «Big data»
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 17
• Raw data (the documents) carries information
• Computers essentially perform additions
• We need to represent the data somehow, to provide the computer with information that is as faithful as possible
• The representation is an opportunity for us to
transfer some prior (domain) knowledge as
design assumptions
If this (data modelling) step is flawed, the
computer will work with random information
Representation spaces (intuition)
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 18
Given a set C of N documents di, mapped onto a set X of points xi of an M-dimensional vector space R^M
• To index and organise (exploit) this collection,
we must understand its underlying structure
We study its geometrical properties
Notion of distance, neighbourhood
We study its statistical properties
Density, generative law
Both are the same information!
Approach
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 19
Terminology
Given a set C of N documents di, mapped onto a set X of points xi of an M-dimensional vector space R^M
Two main issues:
• High-dimensional data
– M increases
• Large data
– N increases
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 20
• C={d1,d2,…,dN} a collection of documents
– For each document, perform feature extraction f
– di is represented by its feature vector xi in RM
– xi is the view of di from the computer perspective
– f: C  X = {x1,x2,…,xN}
• Examples
– Images: xi is a 128-bin color histogram: M=128
– Text: xi measures the occurrence of each word of
the dictionary: M=50’000
Representation spaces
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 21
We have
We want to create an order or a structure over X
– We define a topology on the representation space
We study distances
We study neighborhoods (kNN)
Representation spaces
X = {x1, x2, …, xN} ⊂ R^M
{e1, e2, …, eM} a basis of R^M
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 22
Norms and distances
X = {x1, x2, …, xN} ⊂ R^M
• Norm
– ‖x‖: norm of x, a vector of R^M
– ‖x‖² = ⟨x, x⟩ = xᵀx if the norm derives from an inner product
– Example: ⟨x, y⟩ = Σi xi·yi and ‖x‖₂ = (Σi xi²)^(1/2)
• Distance (metric): d: X × X → R⁺ such that
– d(x, y) ≥ 0 and d(x, x) = 0 ∀x
– d(x, y) = d(y, x) ∀x, y
– d(x, z) ≤ d(x, y) + d(y, z) ∀x, y, z
• Norm and distance: d(x, y) = ‖x − y‖
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 23
• Examples of norms (distances)
– Minkowski (Lp norms): ‖x‖p = (Σi |xi|^p)^(1/p)
• p = 1: L1 norm: ‖x‖₁ = Σi |xi|
• p = 2: L2 norm (Euclidean): dE(x, y) = d₂(x, y) = ‖x − y‖₂, with ‖x − y‖₂² = (x − y)ᵀ(x − y)
• p = ∞: L∞ norm: ‖x‖∞ = maxi |xi|
• Unit ball for distance d(·,·):
Bd(x) = {y s.t. d(x, y) < 1} (open)
B̄d(x) = {y s.t. d(x, y) ≤ 1} (closed)
Norms and distances — X = {x1, …, xN} ⊂ R^M
Illustrations: http://www.viz.tamu.edu/faculty/ergun, Wikipedia
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 24
• Generalised Euclidean distance
dG(x, y)² = Σi (1/wi)(xi − yi)²
• Mahalanobis distance
A ∈ R^{M×M} s.p.d. (xᵀAx > 0 ∀x ≠ 0)
dA(x, y)² = (x − y)ᵀ A⁻¹ (x − y)
– if A = Id then dA = d₂
– if A = diag(wi) then dA = dG
Norms and distances
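As an illustration, a minimal sketch (assuming NumPy is available; the vectors and the matrix A are arbitrary toy values, not data from the slides) of the norms and distances above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

d1   = np.sum(np.abs(x - y))          # L1 (Manhattan) distance
d2   = np.sqrt(np.sum((x - y) ** 2))  # L2 (Euclidean) distance
dinf = np.max(np.abs(x - y))          # L-infinity distance

# Mahalanobis distance with a symmetric positive-definite matrix A
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
diff = x - y
dA = np.sqrt(diff @ np.linalg.inv(A) @ diff)

print(d1, d2, dinf, dA)
```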
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 25
• Hausdorff distance (set distance)
X, Y subsets of C
dH(X, Y) = max( sup_{x∈X} inf_{y∈Y} d(x, y), sup_{y∈Y} inf_{x∈X} d(x, y) )
Norms and distances
(Illustration: Wikipedia)
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 26
• Unit masses at positions xi, X = {x1, …, xN} ⊂ R^M
• Center of mass: g = (1/N) Σi xi
• Inertia wrt point a: Ia(X) = Σi d(a, xi)²
• Inertia wrt subspace F: IF(X) = Σi d(F, xi)²
• Huygens theorem: Ia(X) = Ig(X) + N·d(a, g)²
Physics and statistics
Physics | Statistics
Mass(xi) | Probability P(xi)
Centre of mass g | Mean m = E[X]
Inertia Ig | Variance σ² = V(X)
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 27
• Relation between standard deviation and distribution around the mean
– Centred, reduced variable: X* = (X − E(X)) / σ(X), so that E(X*) = 0 and σ(X*) = 1
– Chebyshev inequality: P(|X − μ| ≥ nσ) ≤ 1/n²
– n = √2: at least half of the values are between [μ − √2σ, μ + √2σ]
– Gaussian distribution N(0,1): P(|X| ≤ 3) ≈ 0.9973
Chebyshev inequality
Illustration: Wikipedia
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 28
Markov inequality
• Upper bound on the tail of the distribution: for a non-negative random variable X,
P(X ≥ a) ≤ E(X)/a, ∀a > 0
• Useful for proofs and upper bounds
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 29
n random variables (X1, …, Xn) such that E(Xi) = m,
then X̄n = (1/n) Σi Xi is an « estimator » of m:
X̄n → m (in probability), i.e. ∀ε > 0, lim_{n→∞} P(|X̄n − m| > ε) = 0
and if V(Xi) = σ²: V(X̄n) = σ²/n → 0
Weak law of large numbers
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 30
• N uniform draws U([0,1])
• X = −(1/λ)·ln(U)  (here λ = 3)
Simulation: exponential distribution
[Panels: empirical distribution, mean and standard deviation for n = 10, 100, 1’000, 3’000, 10’000]
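A small sketch of this simulation (assuming NumPy; the rate λ = 3 follows the slide): draw U ~ U([0,1]), set X = −ln(U)/λ (inverse-transform sampling of an exponential), and watch the empirical mean approach E(X) = 1/λ as n grows, as the weak law of large numbers predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0
for n in (10, 100, 1_000, 3_000, 10_000):
    u = rng.uniform(size=n)
    x = -np.log(u) / lam             # inverse-transform sampling of Exp(lambda)
    print(n, x.mean(), x.std())      # empirical mean and std both tend to 1/lam
```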
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 31
X such that E(X) = m and V(X) = σ²
X1, …, Xn random variables iid with X
X̄n = (1/n) Σi Xi,  Zn = √n (X̄n − m) / σ
Then Zn converges (in distribution) to N(0,1):
lim_{n→∞} P(a ≤ Zn ≤ b) = ∫_a^b (1/√(2π)) e^(−x²/2) dx
Central Limit Theorem
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 32
• n uniform draws U([−0.5, 0.5])
• Average the n draws; repeat to obtain the distribution of X̄ (and of Zn)
Simulation: Normal distribution
[Panels: distribution of Zn for n = 1, 2, 3, 4, 100; Zn mean and standard deviation]
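A sketch of this simulation (assuming NumPy): average n draws of U([−0.5, 0.5]), standardise, and the resulting Zn behaves more and more like N(0, 1) as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.sqrt(1.0 / 12.0)           # standard deviation of U([-0.5, 0.5])
for n in (1, 2, 3, 4, 100):
    xbar = rng.uniform(-0.5, 0.5, size=(100_000, n)).mean(axis=1)
    z = np.sqrt(n) * xbar / sigma     # the mean of U([-0.5, 0.5]) is 0
    print(n, z.mean(), z.std())       # mean -> 0, standard deviation -> 1
```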
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 33
• X: random variable whose mean m is to be estimated
– Example: « diameter »
• Xi: population
– Example: « apples »
• xi: measures
– Example: « measured diameters »
• X̄ (the mean of the measures) tends to m
(by the Weak Law of Large Numbers)
• The Central Limit Theorem says that the error on the estimate of m (Zn) follows a normal law N(0,1)
• Zn is a random variable representing the error carried by X̄
Interpretation
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 34
• In vector spaces, distances are essentially measured using differences of coordinates
• Statistical distributions may be considered as statistical objects with inter-distances (similarity)
• However, it would not be relevant to compare their intrinsic values directly. We rather use divergences
• The best-known and most-used divergence is the KL divergence (Kullback–Leibler)
– Given two distributions P and Q, the KL divergence between P and Q is the measure of how much information is lost when Q is used to approximate P
– The discrete formulation of the KL divergence is
DKL(P‖Q) = Σi P(i) ln( P(i) / Q(i) )
– DKL is non-symmetric; it can be symmetrised (to better approach a distance) as
D(P, Q) = ( DKL(P‖Q) + DKL(Q‖P) ) / 2
A quick note on divergences
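A minimal sketch (assuming NumPy; P and Q are arbitrary example histograms) of the discrete KL divergence and of its symmetrised version:

```python
import numpy as np

P = np.array([0.1, 0.4, 0.5])
Q = np.array([0.2, 0.3, 0.5])

def kl(p, q):
    # D_KL(P || Q) = sum_i P(i) * ln(P(i) / Q(i)); assumes strictly positive histograms
    return np.sum(p * np.log(p / q))

d_pq  = kl(P, Q)                      # non-symmetric divergence
d_sym = 0.5 * (kl(P, Q) + kl(Q, P))   # symmetrised divergence
print(d_pq, d_sym)
```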
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 35
Topology (very loose intuition)
• A topology is built based on neighbourhood
• The neighbourhood is the basis for the definition of continuity
• Continuity implies some assumption about how a function propagates (some smoothness)
In our context
• We are given data points (localised scattered information)
• We need to gain some “smoothness”
• We will propagate the information “around” our data points
• Distance identifies neighbourhoods
• We somehow “interpolate” (spread) information between data points
• Because that is our “best guess”!
• Everything depends on having informative characteristics that localise similar documents (di) as neighbouring points (xi)
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 36
One of the main problems in Data Analysis
• Given a query point q ∈ R^M, k ∈ N*
• Find its neighbourhood (vicinity) V
k-NN (nearest neighbours):
V = {x_{i1}, …, x_{ik}} s.t. d(q, x_{ij}) ≤ d(q, x_l) ∀l ∉ {i1, …, ik}
– x_{i1} is the nearest (1-)neighbour
– x_{ik} is the farthest k-neighbour
ε-NN (range query), ε > 0 fixed:
V = {x_{i1}, …, x_{ik}} s.t. d(q, x_{ij}) ≤ ε
Nearest neighbours
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 37
Voronoi diagram
ci: Voronoi cell associated to point xi
Delaunay Graph
D=(C,E) : points xi are the vertices of D
(xi,xj) is an edge if ci and cj have a common edge
The graph connects neighbouring cells
Space partitioning
X = {x1, …, xN} ⊂ R^M
ci = { y ∈ R^M s.t. d(xi, y) ≤ d(xj, y) ∀j ≠ i }
Illustrations: http://www.wblut.com, Wikipedia
[Figure: Voronoi cells ci around the points xi; Delaunay edge (xi, xj) between neighbouring cells]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 38
We, as humans, are experts in 1D, 2D, 3D, a bit less in 4D (time) and less so afterwards.
In high dimensions (e.g. 20 is enough), counter-intuitive properties appear, e.g.:
• Sparsity
• Concentration of distances
• Relation to kNN: hubness
which we try to model here, to better understand why things go wrong (or not as well as expected)
Curse of Dimensionality
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 39
• M is the dimension of the space (and the data)
– Measures, characteristics, …
• X is therefore a sample of an M-dimensional space
What if M increases?
– Influence on geometric measures (distances, k-NN)
– Influence on statistical distributions
« Curse of dimensionality »
Richard Ernest Bellman (1961). Adaptive control processes: a guided tour. Princeton University Press.
High dimensionality — X = {x1, …, xN} ⊂ R^M
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 40
Imagine a data sample in [a,b]^M
We quantify every dimension with k bins
To estimate the distribution we require n samples in each bin on average
• M=1: N ~ k·n
• M=2: N ~ n·k²
…
• M: N ~ n·k^M
Example: k=10, n=10, M=6 ⇒ N ~ 10’000’000 samples required
High dimensionality
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 41
Curse of dimensionality
• Sparsity
– N samples
– M dimensions
– k quantization steps
n samples per bin: n ~ N / k^M
or N ~ n·k^M to maintain n constant
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 42
Curse of dimensionality
Expected number of samples per bin: E( #{x ∈ bin_i} ) ~ N / k^M
• Consequences:
– With finite sample size (limited data collection), most
of the cells are empty if the feature dimension is too
high
– The estimation of probability density is unreliable
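A quick sketch (assuming NumPy) of this sparsity effect: with a fixed sample size N and k bins per dimension, the fraction of occupied cells collapses as the dimension M grows (there are k^M cells in total):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 10_000, 10
for M in (1, 2, 3, 6):
    x = rng.uniform(size=(N, M))
    occupied = np.unique(np.floor(x * k).astype(int), axis=0)   # distinct occupied cells
    print(M, len(occupied), "occupied out of", k ** M, "cells")
```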
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 43
Curse of dimensionality
• Gaussian distribution: P(|Xi| ≤ 3, ∀i = 1…M) = (0.9973)^M

M | P(|Xi| ≤ 3 ∀i)
1 | 99.7%
10 | 97.3%
100 | 76.3%
500 | 25.8%
1000 | 6.7%
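The table can be reproduced with a tiny computation (plain Python): the probability that all M coordinates of a standard Gaussian sample stay within 3 standard deviations is 0.9973^M:

```python
for M in (1, 10, 100, 500, 1000):
    print(M, f"{0.9973 ** M:.1%}")
```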
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 44
Neighbourhood structure
• S: a ball around a point (radius r, dimension M)
V_S(M) = 2 π^(M/2) r^M / (M Γ(M/2))
• C: a cube around a point, [−r, +r]^M
V_C(M) = (2r)^M
• ratio: V_S(M) / V_C(M) = π^(M/2) / (2^(M−1) M Γ(M/2)) → 0 as M → ∞
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 45
Neighbourhood structure
V_S(M) / V_C(M) → 0 as M → ∞, hence E( P(xi ∈ S) ) → 0
• Most of the neighbours of the centre are «in the corners of the cube»
• Empty space: each point (centre) sees its neighbours far away
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 46
• S: a ball around a point (radius r, dimension M)
V_S(M) = 2 π^(M/2) r^M / (M Γ(M/2))
• C: enclosed cube of side a, with a = 2r/√M (so that the cube fits inside the ball)
V_C(M) = (2r/√M)^M
• ratio: V_S(M) / V_C(M) = π^(M/2) M^(M/2) / (2^(M−1) M Γ(M/2)) → ?
Neighbourhood structure
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 47
Dmax and Dmin are the largest and smallest neighbour distances.
High-dimensional k-NN
Thm [Beyer et al., 1999]:
if lim_{M→∞} Var( ‖X_M‖ / E[‖X_M‖] ) = 0
then ∀ε > 0, lim_{M→∞} P( (Dmax − Dmin)/Dmin ≤ ε ) = 1
Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. (1999). When is “nearest neighbor” meaningful? In Proceedings of the 7th International Conference on Database Theory, pages 217–235.
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 48
Loss of contrast: (Dmax − Dmin) / Dmin → 0 (in probability) as M → ∞
High-dimensional k-NN
• Computational imprecision prevents relevance
• Noise is taking over
• ε-NN: all or nothing
• k-NN: random draw
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 49
Loss of contrast
[Plot: ‖x‖₂ / √M for x ~ U([0,1])^M, as a function of the dimension M]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 50
Loss of contrast
[Plot: ‖x‖₂ / √M for x ~ N(0,1)^M, as a function of the dimension M]
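A sketch (assuming NumPy) of the loss-of-contrast effect behind these plots: for uniform data, the relative gap (Dmax − Dmin)/Dmin between the farthest and the nearest neighbour of a query shrinks as the dimension M grows:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
for M in (2, 10, 100, 1000):
    x = rng.uniform(size=(N, M))            # data points
    q = rng.uniform(size=M)                 # query point
    d = np.linalg.norm(x - q, axis=1)
    print(M, (d.max() - d.min()) / d.min()) # relative contrast, shrinking with M
```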
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 51
• Consequences
– Database index based on metric distances
• K-d-tree
• VP-tree
have to perform exhaustive search
“Every enclosing rectangle encloses everything”
High dimensional k-NN
Illustrations: Peter N. Yianilos.
Data Structures and Algorithms
for Nearest Neighbor Search in
General Metric Spaces. Fourth
ACM-SIAM Symposium on
Discrete Algorithms, 1993
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 52
In M dimensions, the unit hypercube has as diagonal u = [1 1 … 1]ᵀ, then
‖u‖₂ = √M → ∞ as M → ∞
High dimensional diagonals
[Plot: ‖u‖₂ as a function of the dimension; sketch of u and e1]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 53
In M dimensions, the unit hypercube has as diagonal u = [1 1 … 1]ᵀ, then
cos(θ) = cos(u, e1) = uᵀe1 / (‖u‖ ‖e1‖) = 1/√M → 0 as M → ∞
High dimensional diagonals
[Sketch of u and e1]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 54
In M dimensions, the unit hypercube has as diagonal u = [1 1 … 1]ᵀ (and 2^(M−1) such diagonal directions), then
• In high dimensions, all 2^(M−1) diagonal vectors become orthogonal to the basis vectors
• High dimensional spaces thus contain an exponential number of such (almost orthogonal) directions
• Everything lying along the diagonals is projected onto the origin
High dimensional diagonals
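A sketch (assuming NumPy) of the first property: the angle between the main diagonal of the unit hypercube and any basis vector goes to 90° as the dimension grows, since cos(u, e1) = 1/√M:

```python
import numpy as np

for M in (2, 10, 100, 10_000):
    u = np.ones(M)                 # diagonal of the unit hypercube
    e1 = np.zeros(M); e1[0] = 1.0  # first basis vector
    cos = (u @ e1) / (np.linalg.norm(u) * np.linalg.norm(e1))
    print(M, cos, np.degrees(np.arccos(cos)))
```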
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 55
Given a Gaussian distribution in an M-dimensional space, N(m_M, Σ_M), what is the density of samples at radius r, i.e. in the shell ]r−dr, r+dr[?
With no loss of generality we study the centred distribution N(0, I_M)
Gaussian distribution
]r−dr, r+dr[
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 56
Gaussian distribution ]r−dr, r+dr[
X = (X1, …, XM)ᵀ, Xi ~ N(0,1)
Estimation of P( ‖X‖ ∈ ]r−dr, r+dr[ ):
‖X‖² = Σi Xi² ~ χ²_M, with E(χ²_k) = k and V(χ²_k) = 2k
Let r² = ‖X‖² / M. Using E(aX + b) = aE(X) + b and V(aX) = a²V(X):
E(r²) = 1 and V(r²) = 2/M → 0 as M → ∞
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 57
]r−dr, r+dr[
X = (X1, …, XM)ᵀ, Xi ~ N(0,1); estimation for P( ‖X‖ ∈ ]r−dr, r+dr[ ): E(r²) = 1, V(r²) = 2/M
« Gaussian egg »
[Plot vs. dimension, values from 0 to 1: P(X ≤ (r, …, r)ᵀ), P(‖X‖ ≤ r), P(‖X‖ ≥ r)]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 58
X = (X1, …, XM)ᵀ, Xi ~ N(0,1); estimation for P( ‖X‖ ∈ ]r−dr, r+dr[ ): E(r²) = 1, V(r²) = 2/M
« Gaussian egg »
[Plot vs. dimension: P(X ≤ (r, …, r)ᵀ), P(‖X‖ ≤ r), P(‖X‖ ≥ r)]
For an M-dimensional Normal distribution of mean 0 and s.d. 1, the expected distribution marginalised over concentric spheres has a mean of 1 and a variance converging to 0.
Intuition: the volume of the sphere tending to 0 goes against the high density at the centre.
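A sketch (assuming NumPy) of the «Gaussian egg»: for X ~ N(0, I_M), the normalised radius ‖X‖/√M concentrates around 1 as M grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for M in (1, 10, 100, 1000):
    x = rng.standard_normal((10_000, M))
    r = np.linalg.norm(x, axis=1) / np.sqrt(M)   # normalised distance to the centre
    print(M, r.mean(), r.std())                  # mean -> 1, standard deviation -> 0
```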
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 59
Empirical evidence (10’000 samples)
[Histograms of the radius, bins on [0,2], for increasing dimension]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 60
Empirical evidence (10’000 samples)
[Plot of probability vs. dimension: P(X ≤ (σ, …, σ)ᵀ), P(‖X‖ ≤ σ), P(‖X‖ ≥ σ)]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 61
Empirical evidence (10’000 samples)
[Plot of cumulative probability vs. dimension: P(X ≤ (σ, …, σ)ᵀ), P(‖X‖ ≤ σ), P(‖X‖ ≥ σ)]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 62
Consequences
• Loss of contrast: the relative spread of points
is not seen accurately
• Conversely: using high dimensional Gaussian
distributions to model the data may not be as
accurate
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 63
• We want to characterise the number of times a sample appears in the k-NN of another sample:
p_{k,i}(x) = 1 if x ∈ kNN(xi), 0 otherwise
N_k(x) = Σi p_{k,i}(x)
The distribution of N_k is skewed: a small number of samples (hubs) appear in the neighbourhood of many samples.
Hubs
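A sketch (assuming NumPy) of the hubness measure N_k on random Gaussian data: count how many times each point occurs among the k nearest neighbours of the other points:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k = 1000, 100, 20
x = rng.standard_normal((N, M))

sq = np.sum(x ** 2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)   # squared pairwise distances
np.fill_diagonal(d2, np.inf)                        # a point is not its own neighbour
knn = np.argsort(d2, axis=1)[:, :k]                 # indices of the k-NN of every point

N_k = np.bincount(knn.ravel(), minlength=N)         # N_k(x) for every point
print(N_k.mean(), N_k.max())   # the mean is k, but a few hubs reach much larger values
```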
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 64
[Histogram of N_k: 20-NN, M = 100, 1000 samples, 50 bins]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 65
Hubness
[Histograms of N_k (50 bins) for increasing dimension]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 66
When using the cosine distance as similarity measure,
d_cos(x, y) = 1 − xᵀy / (‖x‖ ‖y‖),
centering the data helps reduce the hubness.
Hubs: centering
I. Suzuki et al. Centering Similarity Measures to Reduce Hubs. 2013 Conf. on Empirical Methods in NLP.
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 67
Lesson
Although data points may be uniformly distributed, the Lp norms are sums of coordinate distances:
‖x‖p = ( Σi |xi|^p )^(1/p)
so the computed distances are corrupted by the excess of uninformative dimensions.
As a result, points appear non-uniformly distributed.
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 68
Summary
Two main issues:
• High-dimensional data
– «Curse of dimensionality»
– Making distance measurements unreliable
– Making statistical estimation inaccurate
• Large data
– «Big data»
⇒ Reduction of dimension
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 69
Dimension reduction: principle
• Given a set of data in a M-dimensional space,
we seek an equivalent representation of lower
dimension
• Dimension reduction induces a loss. What to
sacrifice? What to preserve?
– Preserve local: neighbourhood, distances
– Preserve global: distribution of data, variance
– Sacrifice local: noise
– Sacrifice global: large distances
– Map linearly
– Unfold data
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 70
Some example techniques:
• SFC: preserve neighbourhoods
• PCA: preserve global linear structures
• MDS: preserve linear neighbourhoods
• IsoMAP: Unfold neighbourhoods
• SNE family: unfold statistically
Not studied here (but also valid):
• SOM (visualisation), LLE, Random projections
(hashing)
Dimension reduction
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 71
Space-filling curves
• Definition:
– A continuous curve which passes through
every point of a closed n-cell in Euclidean n-
space Rn is called a space filling curve (SFC)
• The idea is to map the complete space
onto a simple index: a continuous line
– Directly implies an order on the dataset
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 72
Application of SFC
• Mapping multi-dimensional space to one dimensional sequence
• Applications in computer science:
– Database multi-attribute access
– Image compression
– Information visualization
– ……
Various types
• Non-recursive
– Z-Scan Curve
– Snake Scan Curve
• Recursive
– Hilbert Curve
– Peano Curve
– Gray Code Curve
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 73
Hilbert curve — Illustrations: Wikipedia
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 74
Peano Curve — Illustrations: Wikipedia
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 75
SFC
• In our case, the idea is to use SFC to
“explore” local neighborhoods, hoping
that neighborhoods will appear
“compact” on the curve
• Hence we study such mapping for SFC
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 76
Visualizing 4D Hyper-Sphere Surface
• Z-Curve  Hilbert Curve
[Illustrations from the lecture “SFC in Info Viz”, Jiwen Huo, Uni Waterloo, CA]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 77
• Points can be identified as vectors from the
origin
• Orthogonal projection
• x gets projected in x* onto u (which we take of
unit length to represent the subspace Fu)
Projection
[Figure: x is projected onto the line F_u (through o, direction u); x* is the projection and x − x* the residual]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 78
Projection
x* = ⟨x, u⟩ u
x* = argmin_{y ∈ F_u} d(x, y)²
d(x, F_u) = ‖x − x*‖ = ‖x − ⟨x, u⟩ u‖
F_u: the line through o with direction u
[Figure: x, its projection x* onto F_u, and the residual x − x*]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 79
• x* is the part of x that lives in F_u (e.g. the subspace of interest)
• x − x* is the residual (what is not represented)
• x* and x − x* are orthogonal (they represent complementary information)
• Point x* is the closest point of F_u to x (minimal loss, maximal representation)
Interpretation
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 80
• Given a set of data in a M dimensional space,
we seek a subspace of lower dimension onto
which to project our data
• Criterion: preserve most inertia of the dataset
• Consequence: project and minimize residuals
• We construct incrementally a new subspace
by successive projections
– X is projected onto ui, find an orthogonal ui+1 to
project the residual
– At most M ui s can be found, we then select the
most representative (preserving most inertia)
Principal Component Analysis (PCA)
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 81
• The data is centred around its mean to get minimal global inertia: xi ← xi − m
• We then look for u1, the direction capturing most inertia (minimizing the global sum of residuals), with X = {x1, …, xN} ⊂ R^M:
u1 = argmin_u Σi d(xi, F_u)²
d(xi, F_u)² = ⟨xi − ⟨xi, u⟩u, xi − ⟨xi, u⟩u⟩ = xiᵀxi − (xiᵀu)²  (for ‖u‖ = 1)
u1 = argmin_u Σi d(xi, F_u)² = argmax_u tr(uᵀ Σ u)  (Σ: covariance matrix of the centred data)
• Lagrangian for the constraint ‖u‖² = 1:
L = uᵀΣu − λ(uᵀu − 1),  ∂L/∂u = 2Σu − 2λu = 0  ⇒  Σu = λu
PCA
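A minimal PCA sketch (assuming NumPy; the data is an arbitrary toy sample): centre the data, diagonalise the covariance matrix, and keep the eigenvectors associated with the largest eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # toy data, N x M

Xc = X - X.mean(axis=0)               # centre around the mean
cov = (Xc.T @ Xc) / len(Xc)           # M x M covariance matrix
eigval, eigvec = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigval)[::-1]      # sort by decreasing variance

m = 2                                 # reduced dimension m << M
U = eigvec[:, order[:m]]              # top-m principal directions
X_reduced = Xc @ U                    # projected data, N x m
print(X_reduced.shape, eigval[order[:m]])
```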
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 82
• PCA incrementally finds “best fit” subspaces, orthogonal to
each other (minimize residuals)
• PCA incrementally finds directions of max variance
(maximize trace of the cov matrix)
• PCA transforms the data linearly (eigen decomposition) to
align it with its axis of max variance (and make the
covariance matrix diagonal)
• The reduction of dimension is made by selecting
eigenvectors corresponding to the (m<<M) largest
eigenvalues
• Because of the max-variance criterion, PCA sees the dataset as a set of data drawn from a centred distribution penalised by their deviation (distance) to the centre: a Normal distribution
⇒ PCA is a linear transformation adapted to non-clustered data
PCA
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 83
PCA
[Illustration Wikipedia]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 84
• “Discriminant” ⇒ supervised: (xi, yi), where the yi's are “labels”
• Simple case: 2 classes
– We seek u such that the projections xi* of the xi's onto F_u are best linearly separated
Linear Discriminant Analysis (LDA)
[Figure: two classes projected onto a direction u]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 85
• Intuitively: maximise the inter-class distance
u = argmax_u |m1* − m2*|  (u parallel to the original m1 − m2)
• Fisher criterion adds: minimise the intra-class spread (s²)
u = argmin_u (s1*² + s2*²)
• Fisher criterion
u = argmax_u |m1* − m2*|² / (s1*² + s2*²)
LDA
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 86
• Both the inter- and intra-class criteria can be generalised to multi-class
• The criteria consider each class as one Gaussian distribution N(mj, sj)
• Resolved by solving an eigensystem
⇒ Linear solution
• Can be used for supervised projection onto a
reduced set of dimensions
LDA
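A small sketch (assuming NumPy; the two classes are arbitrary toy samples) of the two-class Fisher direction, u ∝ Sw⁻¹(m1 − m2) with Sw the pooled within-class scatter:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.standard_normal((100, 5)) + 1.0   # class 1 (toy data)
X2 = rng.standard_normal((120, 5)) - 1.0   # class 2 (toy data)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
u = np.linalg.solve(Sw, m1 - m2)           # Fisher discriminant direction (up to scale)
u /= np.linalg.norm(u)

print((X1 @ u).mean(), (X2 @ u).mean())    # well-separated projections of the two classes
```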
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 87
Given δij, a set of inter-distances between points of a supposed M-dimensional set X (M unknown),
• We seek points X* in an m-dimensional space (m given) such that dij(X*) approximates δij
• We define stress functions, e.g.
X* = argmin_Y [ Σ_{i<j} (δij − dij(Y))² / Σ_{i<j} dij(Y)² ]
which are optimised by majorization
Note: weighting by δij may help privilege local structures (less penalty on small distance values)
Multi Dimensional Scaling (MDS)
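One possible way to run this (assuming scikit-learn and NumPy are available; scikit-learn's MDS minimises the stress with the SMACOF majorization algorithm mentioned above) on a precomputed inter-distance matrix:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))                            # toy high-dimensional data
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # delta_ij

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X_star = mds.fit_transform(D)    # m = 2 embedding approximating the delta_ij
print(X_star.shape, mds.stress_)
```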
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 88
Shepard diagram: plot dij against dij(X*)
• Ideally along the diagonal (or highly
correlated)
• Helps seeing where the discrepancy appears
MDS
[Illustration from I. Borg & PJF Groenen. Modern Multidimensional Scaling. Springer 2005]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 89
• Recall: d: X × X → R⁺ with
– d(x, y) ≥ 0 and d(x, x) = 0 ∀x
– d(x, y) = d(y, x) ∀x, y
– d(x, z) ≤ d(x, y) + d(y, z) ∀x, y, z
• This implies that if D is an inter-distance matrix
– D is symmetric
– D is p.s.d. (xᵀDx ≥ 0 ∀x ≠ 0)
A quick note on “distance” matrices
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 90
• Euclidean distances say that the shortest distance
between two points is along a straight line (any
diversion increases the distance value)
• This also says that if y is close to x and z, then x and z
should be reasonably close to each other
• This may not always be true
– Social nets : if y is friend with x and z, it says nothing about
the social distance between x and z (may be large)
– Data Manifold: if the data lies on a complex manifold, the
straight line is irrelevant
Non Euclidean distances
d(x, z) ≤ d(x, y) + d(y, z) ∀y
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 91
• A local neighbourhood graph (eg 5-NN graph) is built to create
a topology and ensure continuity
• Distances are replaced by geodesics (paths on the
neighbourhood graph)
• MDS is applied on this interdistance matrix (eg with m=2)
IsoMap (non Euclidean)
[Illustration from http://isomap.stanford. edu]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 92
• Locally Euclidean neighbourhoods are
considered
– Requires a good (dense, uniform) data distribution
– Choice of the neighbourhood size to ensure
connectivity and avoid infinite distances
• Powerful to “unfold” the manifold
IsoMap
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 93
• Deterministic distance-based neighbourhoods,
which may contain noise or outlying values, are
replaced by a stochastic view
• Distances are then taken between probability
distributions
• The embedding is made “in probability”
• Given X in M-dimensional space, and m
– pj|i is the probability of xi to pick xj as a neighbour in
M-dimensional space
– qj|i is the probability of xi* to pick xj* as a neighbour in
m-dimensional space
– Do so that q stays “close” to p (divergence)
Stochastic Neighbourhood Embedding (SNE)
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 94
• X* is found by minimizing F(X') using gradient-based optimisation
• The definitions of P and Q relax the rigid constraints found when using distances
• The exponential decay of the likelihood favors local structures
• t-SNE uses a Student t-distribution in the low-dimensional space
SNE
p_{j|i} = exp( −d(xi, xj)² / 2σi² ) / Σ_{k≠i} exp( −d(xi, xk)² / 2σi² )
q_{j|i} = exp( −d(xi*, xj*)² ) / Σ_{k≠i} exp( −d(xi*, xk*)² )
F(X') = Σi DKL(Pi ‖ Qi) = Σi Σj p_{j|i} log( p_{j|i} / q_{j|i} )
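A sketch (assuming NumPy) of the neighbourhood probabilities p_{j|i} and q_{j|i} and of the KL-based cost, with a single fixed bandwidth σ (the actual SNE/t-SNE algorithms tune σi per point from a perplexity target and then optimise X* by gradient descent):

```python
import numpy as np

def neighbour_probs(X, sigma=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)    # squared pairwise distances
    logits = -d2 / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)                    # p_{i|i} = 0
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)              # row i holds the p_{j|i}

rng = np.random.default_rng(0)
X  = rng.standard_normal((200, 20))   # high-dimensional data
Xs = rng.standard_normal((200, 2))    # current low-dimensional embedding X*

P = neighbour_probs(X, sigma=1.0)
Q = neighbour_probs(Xs, sigma=np.sqrt(0.5))   # q_{j|i} uses exp(-d^2), i.e. 2*sigma^2 = 1
cost = np.sum(P * np.log((P + 1e-12) / (Q + 1e-12)))     # sum_i D_KL(P_i || Q_i)
print(cost)
```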
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 95
• MNIST dataset
t-SNE example
[Illustration from L. van der Maaten’s website]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 96
Traces of our everyday activities can be:
• Captured, exchanged (production,
communication)
• Aggregated, Stored
• Filtered, Mined (Processing)
The “V”’s of Big Data:
• Volume, Variety, Velocity (technical)
• and hopefully... Value
Raw data is almost worthless, the added value is in
juicing the data into information (and knowledge)
Big Data
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 97
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 98
However
Two main issues:
• High-dimensional data
– «Curse of dimensionality»
– Making distance measurements unreliable
– Making statistical estimation inaccurate
• Large data
– «Big data»
– Could compensate for sparsity problems
– But hard to manage efficiently
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 99
Solutions
Two main issues
• High-dimensional data
– Reduce the dimension
– Indexing for solving the kNN problem efficiently
• Large data
– Reduce the volume
– Filter, compress, cluster,…
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 100
Indexing structures
…+
M-tree
Tries
Suffix array
Suffix Tree
Inverted files
LSH…
Illustration: Wikipedia
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 101
Main ideas:
• A point is described by its neighbourhood
• The neighbourhood of a point encodes its
position
• We use only neighboring landmarks
– To be fast
• We don’t keep distances, just ranks
– To be faster
Permutation-based Indexing
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 102
Permutation-based Indexing
n = 5:
D = {x1, . . . , xN}, N objects,
R = {r1, . . . , rn} ⊂ D, n references
Each xi is identified by an ordered list:
L(xi, R) = (ri1, . . . , rin) such that
d(xi, rij) ≤ d(xi, rij+1) ∀j = 1, . . . , n − 1
Example: L(x1, R) = (r1, r2, r3, r4, r5), L(x2, R) = (r1, r2, r3, r4, r5), L(x3, R) = (r5, r3, r2, r4, r1)
[Figure: points x1 … x5 and reference points r1 … r5]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 103
Permutation-based Indexing
Indexing: build the ordered lists
Querying (kNN):
• Build the query's ordered list
• Compare it with the points' ordered lists
using the Spearman Footrule Distance:
d(q, x) ≈ d_SFD(q, x) = Σj | rank(rj, L(q, R)) − rank(rj, L(x, R)) |
Solving kNN: “I see what you see if I am close to you”
[Figure: points x1 … x5 and reference points r1 … r5]
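A sketch (assuming NumPy; the data, the number of references n = 8 and the candidate count are arbitrary choices) of the two ingredients above, the ordered reference lists and the Spearman Footrule Distance between them:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 20))                       # N objects
R = X[rng.choice(len(X), size=8, replace=False)]       # n reference points drawn from D

def ordered_list(p, refs):
    # permutation of the reference indices, closest reference first
    return np.argsort(np.linalg.norm(refs - p, axis=1))

def spearman_footrule(perm_a, perm_b):
    # rank[i] = position of reference i in each permutation
    return np.sum(np.abs(np.argsort(perm_a) - np.argsort(perm_b)))

lists = np.array([ordered_list(x, R) for x in X])      # the index: one list per object

q = rng.uniform(size=20)
Lq = ordered_list(q, R)
sfd = np.array([spearman_footrule(Lq, Lx) for Lx in lists])
print(np.argsort(sfd)[:10])    # approximate 10-NN candidates of the query q
```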
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 104
PBI in practice
Given a query point q, we seek objects xi such
that L(xi,R) ~ L(q,R)
• We use inverted files to (pre-)select objects
such that L(xi,R)|rj ~ L(q,R)|rj
• We prune the lists with the assumption that
only local neighborhood is important
• We quantize the lists for easier indexing
• … (still an active development)
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 105
Efficiency of PBI
• Still uses distances for creating lists
• Issues with ordering due to the curse of
dimensionality
However
• The choice of reference points (location,
number) may be optimised
• Empirical performance is good
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 106
Optimising PBI
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 107
PBI: Encoding Model
[Figure: reference points r1 … r5 with the boundaries δij between them; each region of the space is encoded by the ordered list of its closest references, e.g. L(o, R) = (1,2), (1,4), (1,5), (5,1), (4,5), (4,2), (3,4), (3,2), (2,3), (2,4), (3,5), (5,3), (5,4), (2,1), (4,1)]
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 108
Optimising PBI
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 109
Map-Reduce principle
Two-step parallel processing of data:
• Map the data properties (values) onto
respective keys (data)
– (key,value) pairs
• Reduce the list of values for each of the keys
– (key, list of values)
– Process the list
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 110
Map Reduce – Word Count example
[Illustration: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/]
• Keys: stems
• Values: occurrence (1)
• Reducing: sum (frequency)
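A toy sketch of the same word-count idea in plain Python (no Hadoop; the three documents are made-up examples): map each word to a (key, 1) pair, shuffle by key, then reduce by summing:

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: emit (word, 1) for every word occurrence
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum the list of values of each key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # e.g. {'the': 3, 'quick': 2, ...}
```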
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 111
MapReduce
• The MapReduce programming interface provides a simple and powerful interface for data parallelization, while keeping the user away from the communications and exchange of data.
1. Mapping
2. Shuffling
3. Merging
4. Reducing
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 112
Distributed inverted files
• Data size: 36GB of XML data.
• Hadoop: 40 minutes.
• The best ratio between the mappers and reducers is
found to be:
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 113
• Host:
– The computer hosting the GPU card
• Device:
– The GPU itself
• Kernel:
– A function run by thousands of threads in parallel
• Grids:
– Two- or three-dimensional arrangements of blocks
• Blocks:
– Consist of an upper limit of 512 or 1024 threads
• Memory:
– Local memory (fast and small, KB)
– Global memory (slow and big, GB)
GPU Architecture
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 114
Permutation Based Indexing on GPU — PDPS, PDSS, PIOF
Complexity:
O( Nn/P + N(n log n + n) + t1 )
O( N(2n + n log n)/P + t2 )
O( N(2n + n log n)/P + t2 )
Memory:
= s × (Nl × m + n × m + 2(Nl × n))
= s × (Nl × m + n × m + Nl × n + (Nl × n))
= s × (Nl × m + n × m + (Nl × n))
PIOF does the sorting while it calculates the distances!
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 115
• Indexing looks at organising neighborhoods to
avoid exhaustive search
• Indexing may be tailored to the issue in
question
– Inverted files for text search
– Spatial indexing for neighbourhood search
Summary
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 116
• Hashing
– LSH, Random projections,
• Outlier detection
– Including in high-dimensional spaces
• Classification, regression
– With sparse data
Were not studied here…
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 117
Conclusion
“Distance is key”
– Defines the neighbourhood of points
– Defines the standard deviation around the mean
– Defines the notion of similarity
However
– Distance may have a non-intuitive behavior
– Distance may not be strictly needed
• Stochastic model for neighbourhoods (SNE)
• Ranking approach for neighbourhoods (PBI)
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 118
References
Big Data and Large-scale data
– Mohammed, H., & Marchand-Maillet, S. (2015). Scalable Indexing for
Big Data Processing. Chapman & Hall.
– Marchand-Maillet, S., & Hofreiter, B. (2014). Big Data Management
and Analysis for Business Informatics. Enterprise Modelling and
Information Systems Architectures (EMISA), 9.
– M. von Wyl, H. Mohamed, E. Bruno, S. Marchand-Maillet, “A parallel
cross-modal search engine over large-scale multimedia collections
with interactive relevance feedback” in ICMR 2011 - ACM International
Conference on Multimedia Retrieval.
– H. Mohamed, M. von Wyl, E. Bruno and S. Marchand-Maillet,
“Learning-based interactive retrieval in large-scale multimedia
collections” in AMR 2011 - 9th International Workshop on Adaptive
Multimedia Retrieval.
– von Wyl, M., Hofreiter, B., & Marchand-Maillet, S. (2012).
Serendipitous Exploration of Large-scale Product Catalogs. In 14th IEEE
International Conference on Commerce and Enterprise Computing
(CEC 2012), Hangzhou, CN.
More at http://viper.unige.ch/publications
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 119
References
Large-scale Indexing
– Mohamed, H., & Marchand-Maillet, S. (2015). Quantized Ranking for Permutation-
Based Indexing. Information Systems.
– Mohamed, H., Osipyan, H., & Marchand-Maillet, S. (2014). Multi-Core (CPU and
GPU) For Permutation-Based Indexing. In Proceedings of the 7th International
Conference on Similarity Search and Applications (SISAP2014), Los Cabos, Mexico.
– H. Mohamed and S. Marchand-Maillet “Parallel Approaches to Permutation-Based
Indexing using Inverted Files” in SISAP 2012 - 5th International Conference on
Similarity Search and Applications .
– H. Mohamed and S. Marchand-Maillet “Distributed Media indexing based on MPI
and MapReduce” in CBMI 2012 - 10th Workshop on Content-Based Multimedia
Indexing.
– H. Mohamed and S. Marchand-Maillet “Enhancing MapReduce using MPI and an
optimized data exchange policy”, P2S2 2012 - Fifth International Workshop
onParallel Programming Models and Systems Software for High-End Computing.
– Mohamed, H., & Marchand-Maillet, S. (2014). Distributed media indexing based on
MPI and MapReduce. Multimedia Tools and Applications, 69(2).
– Mohamed, H., & Marchand-Maillet, S. (2013). Permutation-Based Pruning for
Approximate K-NN Search. In DEXA, Prague, CZ.
More at http://viper.unige.ch/publications
Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 120
References
Large data analysis – Manifold learning
– Sun, K., Morrison, D., Bruno, E., & Marchand-Maillet, S. (2013).
Learning Representative Nodes in Social Networks. In 17th Pacific-Asia
Conference on Knowledge Discovery and Data Mining, Gold Coast, AU.
– Sun, K., Bruno, E., & Marchand-Maillet, S. (2012). Unsupervised
Skeleton Learning for Manifold Denoising and Outlier Detection. In
International Conference on Pattern Recognition (ICPR'2012), Tsukuba,
JP.
– Sun, K., & Marchand-Maillet, S. (2014). An Information Geometry of
Statistical Manifold Learning. In Proceedings of the International
Conference on Machine Learning (ICML 2014), Beijing, China.
– Wang, J., Sun, K., Sha, F., Marchand-Maillet, S., & Kalousis, A. (2014).
Two-Stage Metric Learning. In Proceedings of the International
Conference on Machine Learning (ICML 2014), Beijing, China.
– Sun, K., Bruno, E., & Marchand-Maillet, S. (2012). Stochastic Unfolding.
In IEEE Machine Learning for Signal Processing Workshop (MLSP'2012),
Santander, Spain.
More at http://viper.unige.ch/publications

More Related Content

What's hot

Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
nextlib
 
Curse of dimensionality
Curse of dimensionalityCurse of dimensionality
Curse of dimensionality
Nikhil Sharma
 
Lecture 1 graphical models
Lecture 1  graphical modelsLecture 1  graphical models
Lecture 1 graphical models
Duy Tung Pham
 

What's hot (20)

Image Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A surveyImage Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A survey
 
Feature selection
Feature selectionFeature selection
Feature selection
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Master's Thesis Presentation
Master's Thesis PresentationMaster's Thesis Presentation
Master's Thesis Presentation
 
Wrapper feature selection method
Wrapper feature selection methodWrapper feature selection method
Wrapper feature selection method
 
Curse of dimensionality
Curse of dimensionalityCurse of dimensionality
Curse of dimensionality
 
Graph Representation Learning
Graph Representation LearningGraph Representation Learning
Graph Representation Learning
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
 
InfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksInfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial Networks
 
Lecture 1 graphical models
Lecture 1  graphical modelsLecture 1  graphical models
Lecture 1 graphical models
 
Image classification using CNN
Image classification using CNNImage classification using CNN
Image classification using CNN
 
GANs and Applications
GANs and ApplicationsGANs and Applications
GANs and Applications
 
ViT.pptx
ViT.pptxViT.pptx
ViT.pptx
 
Machine learning in image processing
Machine learning in image processingMachine learning in image processing
Machine learning in image processing
 
Dimensionality reduction
Dimensionality reductionDimensionality reduction
Dimensionality reduction
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
 
Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Computer Vision Introduction
Computer Vision IntroductionComputer Vision Introduction
Computer Vision Introduction
 
Hyperparameter Optimization with Hyperband Algorithm
Hyperparameter Optimization with Hyperband AlgorithmHyperparameter Optimization with Hyperband Algorithm
Hyperparameter Optimization with Hyperband Algorithm
 

Viewers also liked

hitachi-ebook-social-innovation-forbes-insights
hitachi-ebook-social-innovation-forbes-insightshitachi-ebook-social-innovation-forbes-insights
hitachi-ebook-social-innovation-forbes-insights
Ingrid Fernandez, PhD
 
Automated Face Detection and Recognition
Automated Face Detection and RecognitionAutomated Face Detection and Recognition
Automated Face Detection and Recognition
Waldir Pimenta
 

Viewers also liked (20)

Big Data Analysis: The curse of dimensionality in official statistics
Big Data Analysis: The curse of dimensionality in official statisticsBig Data Analysis: The curse of dimensionality in official statistics
Big Data Analysis: The curse of dimensionality in official statistics
 
1st KeyStone Summer School - Hackathon Challenge
1st KeyStone Summer School - Hackathon Challenge1st KeyStone Summer School - Hackathon Challenge
1st KeyStone Summer School - Hackathon Challenge
 
Keystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenanceKeystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenance
 
Search, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving DataSearch, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving Data
 
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joinedKeystone summer school_2015_miguel_antonio_ldcompression_4-joined
Keystone summer school_2015_miguel_antonio_ldcompression_4-joined
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
Aggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document RelevanceAggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document Relevance
 
hitachi-ebook-social-innovation-forbes-insights
hitachi-ebook-social-innovation-forbes-insightshitachi-ebook-social-innovation-forbes-insights
hitachi-ebook-social-innovation-forbes-insights
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sources
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
 
SFD2014_FOSS, Cloud and BigData in Vietnam
SFD2014_FOSS, Cloud and BigData in VietnamSFD2014_FOSS, Cloud and BigData in Vietnam
SFD2014_FOSS, Cloud and BigData in Vietnam
 
CS404 Pattern Recognition - Locality Preserving Projections
CS404   Pattern Recognition - Locality Preserving ProjectionsCS404   Pattern Recognition - Locality Preserving Projections
CS404 Pattern Recognition - Locality Preserving Projections
 
School intro
School introSchool intro
School intro
 
Information Retrieval Evaluation
Information Retrieval EvaluationInformation Retrieval Evaluation
Information Retrieval Evaluation
 
Fyp
FypFyp
Fyp
 
Humanizing The Machine
Humanizing The MachineHumanizing The Machine
Humanizing The Machine
 
Understandig PCA and LDA
Understandig PCA and LDAUnderstandig PCA and LDA
Understandig PCA and LDA
 
Automated Face Detection and Recognition
Automated Face Detection and RecognitionAutomated Face Detection and Recognition
Automated Face Detection and Recognition
 
k10741 major assig on rac
 k10741 major assig on rac k10741 major assig on rac
k10741 major assig on rac
 
SETTING PERIPHERAL
SETTING PERIPHERALSETTING PERIPHERAL
SETTING PERIPHERAL
 

Similar to Curse of Dimensionality and Big Data

TUW-ASE-Summer 2015: Advanced Services Engineering - Introduction
TUW-ASE-Summer 2015: Advanced Services Engineering - IntroductionTUW-ASE-Summer 2015: Advanced Services Engineering - Introduction
TUW-ASE-Summer 2015: Advanced Services Engineering - Introduction
Hong-Linh Truong
 
cv_loig_thivend-friday
cv_loig_thivend-fridaycv_loig_thivend-friday
cv_loig_thivend-friday
Loig Thivend
 

Similar to Curse of Dimensionality and Big Data (20)

Multimedia Information Retrieval
Multimedia Information RetrievalMultimedia Information Retrieval
Multimedia Information Retrieval
 
Stephan vincent_lancrin_ocde
Stephan vincent_lancrin_ocdeStephan vincent_lancrin_ocde
Stephan vincent_lancrin_ocde
 
MOVING: Applying digital science methodology for TVET
MOVING: Applying digital science methodology for TVETMOVING: Applying digital science methodology for TVET
MOVING: Applying digital science methodology for TVET
 
ppt_ids-data science.pdf
ppt_ids-data science.pdfppt_ids-data science.pdf
ppt_ids-data science.pdf
 
data science and its role in big data analytics.pptx
data science and its role in big data analytics.pptxdata science and its role in big data analytics.pptx
data science and its role in big data analytics.pptx
 
Using Web Science for Educational Research
Using Web Science for Educational ResearchUsing Web Science for Educational Research
Using Web Science for Educational Research
 
Bart van der Laar @ SURF Summerschool 09
Bart van der Laar @ SURF Summerschool 09Bart van der Laar @ SURF Summerschool 09
Bart van der Laar @ SURF Summerschool 09
 
TUW-ASE-Summer 2015: Advanced Services Engineering - Introduction
TUW-ASE-Summer 2015: Advanced Services Engineering - IntroductionTUW-ASE-Summer 2015: Advanced Services Engineering - Introduction
TUW-ASE-Summer 2015: Advanced Services Engineering - Introduction
 
2022_12_16 «Informatics – A Fundamental Discipline for the 21st Century»
2022_12_16 «Informatics – A Fundamental Discipline for the 21st Century»2022_12_16 «Informatics – A Fundamental Discipline for the 21st Century»
2022_12_16 «Informatics – A Fundamental Discipline for the 21st Century»
 
[OOFHEC2018] Manuel Castro: Identifying the best practices in e-engineering t...
[OOFHEC2018] Manuel Castro: Identifying the best practices in e-engineering t...[OOFHEC2018] Manuel Castro: Identifying the best practices in e-engineering t...
[OOFHEC2018] Manuel Castro: Identifying the best practices in e-engineering t...
 
CV
CVCV
CV
 
EADTU 2018 conference e-LIVES project
EADTU 2018 conference e-LIVES project EADTU 2018 conference e-LIVES project
EADTU 2018 conference e-LIVES project
 
Data in the 21st century
Data in the 21st centuryData in the 21st century
Data in the 21st century
 
LTCI Information Communications Lab
LTCI Information Communications LabLTCI Information Communications Lab
LTCI Information Communications Lab
 
Quantum Mechanics meet Information Search and Retrieval – The QUARTZ Project
Quantum Mechanics meet Information Search and Retrieval – The QUARTZ ProjectQuantum Mechanics meet Information Search and Retrieval – The QUARTZ Project
Quantum Mechanics meet Information Search and Retrieval – The QUARTZ Project
 
School on the Cloud: lessons from Digital Earth, Karl Donert
School on the Cloud: lessons from Digital Earth, Karl DonertSchool on the Cloud: lessons from Digital Earth, Karl Donert
School on the Cloud: lessons from Digital Earth, Karl Donert
 
Digital examination, forms and tools for aggregation of information and cogni...
Digital examination, forms and tools for aggregation of information and cogni...Digital examination, forms and tools for aggregation of information and cogni...
Digital examination, forms and tools for aggregation of information and cogni...
 
Cognitive Electronics (COEL) Project
Cognitive Electronics (COEL) ProjectCognitive Electronics (COEL) Project
Cognitive Electronics (COEL) Project
 
Sensors1(1)
Sensors1(1)Sensors1(1)
Sensors1(1)
 
cv_loig_thivend-friday
cv_loig_thivend-fridaycv_loig_thivend-friday
cv_loig_thivend-friday
 

Curse of Dimensionality and Big Data

• 10. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 10 Information management process: Raw documents → Feature extraction → Document features → “Appropriate” mapping → Representation space (visualisation) → “Decision” process ↔ User interaction
• 11. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 11 Example: text — Text documents → Feature extraction (“word” occurrences) → “Appropriate” mapping → “Decision” process ↔ User interaction
• 12. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 12 Example: Images — Images → Feature extraction (e.g. color histogram) → “Appropriate” mapping → “Decision” process (e.g. filtering) ↔ User interaction; illustrated with a photo collage
  • 13. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 13 Also... • Any type of media: webpage, audio, video, data,... • Objects, based on their characteristics • People in social networks • Concepts: processes, states, ... Etc  Anything for which “characteristics” may be measured
  • 14. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 14 The key is distance • Features help characterizing 1 document (summary) • Features help comparing 2 documents • How can they help structuring a collection?
• 15. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 15 Distance measurements — most often back to the local neighbours: Information retrieval (similarity query), Machine learning (generalization), Data mining (discover continuous patterns)
  • 16. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 16 However Two main issues: • High-dimensional data – «Curse of dimensionality» • Large data – «Big data»
• 17. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 17 Representation spaces (intuition) • Raw data (the documents) carries information • Computers essentially perform additions • We need to represent the data somehow, to provide the computer with information that is as faithful as possible • The representation is an opportunity for us to transfer some prior (domain) knowledge as design assumptions. If this (data modelling) step is flawed, the computer will work with random information
  • 18. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 18 Given a set C of N documents di, mapped onto a set X of points xi of a M-dimensional vector space RM • To index and organise (exploit) this collection, we must understand its underlying structure We study its geometrical properties Notion of distance, neighbourhood We study its statistical properties Density, generative law Both are the same information! Approach
  • 19. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 19 Terminology Given a set C of N documents di, mapped onto a set X of points xi of a M-dimensional vector space RM Two main issues: • High-dimensional data – M increases • Large data – N increases
  • 20. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 20 • C={d1,d2,…,dN} a collection of documents – For each document, perform feature extraction f – di is represented by its feature vector xi in RM – xi is the view of di from the computer perspective – f: C  X = {x1,x2,…,xN} • Examples – Images: xi is a 128-bin color histogram: M=128 – Text: xi measures the occurrence of each word of the dictionary: M=50’000 Representation spaces
• 21. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 21 Representation spaces — We have $X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^M$, with $\{e_1, e_2, \ldots, e_M\}$ a basis of $\mathbb{R}^M$. We want to create an order or a structure over X – We define a topology on the representation space We study distances We study neighborhoods (kNN)
• 22. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 22 Norms and distances — $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^M$ • Norm: $\|x\|$, the norm of x, a vector of $\mathbb{R}^M$ – if the norm derives from an inner product: $\|x\|^2 = \langle x, x\rangle = x^T x$. Example: $\langle x, y\rangle = \sum_{i=1}^M x_i y_i$, so $\|x\|^2 = \sum_{i=1}^M x_i^2$ • Distance (metric): $d: X \times X \to \mathbb{R}^+$ with $d(x,x) = 0\ \forall x$, $d(x,y) = d(y,x)\ \forall x, y$ and $d(x,z) \leq d(x,y) + d(y,z)\ \forall x, y, z$ • Norm and distance: $d(x,y) = \|x - y\|$
• 23. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 23 Norms and distances • Examples of norms (distances) – Minkowski ($L_p$ norms): $\|x\|_p = \left(\sum_{i=1}^M |x_i|^p\right)^{1/p}$ • p=1: $L_1$ norm, $\|x\|_1 = \sum_{i=1}^M |x_i|$ • p=2: $L_2$ norm (Euclidean), $d_E(x,y)^2 = d_2(x,y)^2 = \|x-y\|_2^2 = (x-y)^T(x-y)$ • $p=\infty$: $L_\infty$ norm, $\|x\|_\infty = \max_i |x_i|$ • Unit ball for distance d(.,.): $B_d(x) = \{y \text{ s.t. } d(x,y) < 1\}$ (open), $B_d(x) = \{y \text{ s.t. } d(x,y) \leq 1\}$ (closed) Illustrations: http://www.viz.tamu.edu/faculty/ergun Wikipedia
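As an illustration of the Minkowski family above, here is a minimal NumPy sketch (not from the slides; the helper names are ours) computing L1, L2 and L∞ distances and testing unit-ball membership:

```python
import numpy as np

def minkowski_distance(x, y, p=2.0):
    """L_p distance between two vectors; p=np.inf gives the max (Chebyshev) norm."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

def in_unit_ball(x, y, p=2.0, closed=True):
    """Membership of y in the unit ball B(x) for the L_p distance."""
    d = minkowski_distance(x, y, p)
    return d <= 1.0 if closed else d < 1.0

x = np.array([0.0, 0.0, 0.0])
y = np.array([0.5, 0.5, 0.5])
for p in (1, 2, np.inf):
    print(f"L{p}: d = {minkowski_distance(x, y, p):.3f}, in unit ball: {in_unit_ball(x, y, p)}")
```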
• 24. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 24 Norms and distances • Generalised Euclidean distance: $d_G(x,y)^2 = \sum_{i=1}^M \frac{1}{w_i}(x_i - y_i)^2$ • Mahalanobis distance: $A \in \mathbb{R}^{M \times M}$ s.p.d. ($x^T A x > 0\ \forall x \neq 0$), $d_A(x,y)^2 = (x-y)^T A^{-1}(x-y)$ – if $A = \mathrm{Id}$, $d_A = d_2$ – if $A = \mathrm{diag}(w_i)$, $d_A = d_G$
• 25. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 25 Norms and distances • Hausdorff distance (set distance), X, Y subsets of C: $d_H(X,Y) = \max\left(\sup_{x \in X}\inf_{y \in Y} d(x,y),\ \sup_{y \in Y}\inf_{x \in X} d(x,y)\right)$ (Illustration: Wikipedia)
• 26. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 26 Physics and statistics — $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^M$ • Unit masses at positions $x_i$ • Center of mass: $g = \frac{1}{N}\sum_i x_i$ • Inertia wrt point a: $I_a(X) = \sum_i d(a, x_i)^2$ • Inertia wrt subspace F: $I_F(X) = \sum_i d(F, x_i)^2$ • Huygens theorem: $I_a(X) = I_g(X) + N\, d(a,g)^2$ Physics ↔ Statistics: Mass($x_i$) ↔ Probability $P(x_i)$; Centre of mass g ↔ Mean $m = E[X]$; Inertia $I_g$ ↔ Variance $\sigma^2 = V(X)$
• 27. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 27 Chebyshev inequality • Centred reduced variable: $X^* = \frac{X - E(X)}{\sigma_X}$, so that $E(X^*) = 0$ and $\sigma_{X^*} = 1$ • Relation between standard deviation and distribution around the mean: $P(|X - \mu| \geq n\sigma) \leq \frac{1}{n^2}$ – for $n = \sqrt{2}$: at least half of the values are between $[\mu - \sqrt{2}\sigma,\ \mu + \sqrt{2}\sigma]$ – Gaussian distribution N(0,1): $P(|X| \leq 3) \approx 0.9973$ Illustration: Wikipedia
• 28. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 28 Markov inequality • Upper bound on the tail (complementary cumulative distribution): for $X \geq 0$ and $a > 0$, $P(X \geq a) \leq \frac{E(X)}{a}$ • Useful for proofs and upper bounds
• 29. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 29 Weak law of large numbers — n random variables $(X_1, \ldots, X_n)$ such that $E(X_i) = \mu$; then $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ is an «estimator» for $\mu$: $E(\bar{X}_n) = \mu$ and $\lim_{n\to\infty} P(|\bar{X}_n - \mu| \geq \varepsilon) = 0\ \forall \varepsilon > 0$; and if $V(X_i) = \sigma^2$ then $V(\bar{X}_n) = \frac{\sigma^2}{n}$
• 30. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 30 Simulation: exponential distribution • N uniform draws $U \sim U([0,1])$, transformed as $X = -\frac{1}{\lambda}\ln(U) + 3$ • [Figure: empirical mean and standard deviation of the sample for n = 10, 100, 1’000, 3’000, 10’000 draws, converging as n grows]
• 31. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 31 Central Limit Theorem — X such that $E(X) = \mu$ and $V(X) = \sigma^2$; $X_1, \ldots, X_n$ random variables iid with X. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ and $Z_n = \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma}$. Then $Z_n$ converges (in distribution) to N(0,1): $\lim_{n\to\infty} P(a \leq Z_n \leq b) = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx$
• 32. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 32 Simulation: Normal distribution • n uniform draws $X \sim U([-0.5, 0.5])$ • Average over n distributions: n draws of $\bar{X}_n$ • [Figure: histograms of $Z_n$ for n = 1, 2, 3, 4, 100, with the mean and standard deviation of $Z_n$, approaching N(0,1)]
• 33. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 33 Interpretation • X: random variable whose mean $\mu$ is to be estimated – Example: «Diameter» • $X_i$: population – Example: «Apples» • $x_i$: measures – Example: «measured diameters» • $\bar{X}$ (the mean of the measures) tends to $\mu$ (by the Weak Law of Large Numbers) • The Central Limit Theorem says that the error on the estimate of $\mu$ ($Z_n$) follows a normal law N(0,1); $Z_n$ is a random variable representing the error carried by $\bar{X}$
• 34. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 34 A quick note on divergences • In vector spaces, distances are essentially measured using differences of coordinates • Statistical distributions may be considered as statistical objects with inter-distances (similarity) • However, it would not be relevant to compare their intrinsic values directly. We rather use divergences • The most known/used divergence: the KL divergence (Kullback–Leibler) – Given two distributions P and Q, the KL divergence between P and Q measures how much information is lost when Q is used to approximate P – The discrete formulation of the KL divergence is $D_{KL}(P\|Q) = \sum_i P(i)\ln\frac{P(i)}{Q(i)}$ – $D_{KL}$ is non-symmetric; it can be symmetrised (to better approach a distance) as $D(P,Q) = \frac{D_{KL}(P\|Q) + D_{KL}(Q\|P)}{2}$
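A small sketch (ours, not from the slides) of the discrete KL divergence and its symmetrised form given above; the small ε guard against empty bins is an implementation choice, not part of the definition:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete D_KL(P || Q) = sum_i P(i) * ln(P(i) / Q(i))."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()          # renormalise after the guard
    return float(np.sum(p * np.log(p / q)))

def symmetrised_kl(p, q):
    """D(P, Q) = (D_KL(P||Q) + D_KL(Q||P)) / 2 -- symmetric, but still not a metric."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(kl_divergence(P, Q), symmetrised_kl(P, Q))
```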
• 35. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 35 Topology (very loose intuition) • A topology is built based on neighbourhood • The neighbourhood is the base for the definition of continuity • Continuity implies some assumption on the propagation of a function (some smoothness) In our context • We are given data points (localised scattered information) • We need to gain some “smoothness” • We will propagate the information “around” our data points • Distance identifies neighbourhoods • We somehow “interpolate” (spread) information between data points • Because that is our “best guess”! • Everything depends on having informative characteristics, so that similar documents (di) end up as neighbouring points (xi)
• 36. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 36 Nearest neighbours — One of the main problems in Data Analysis • Given a query point $q \in \mathbb{R}^M$ • Find its neighbourhood (vicinity) V • k-NN (nearest neighbours), $k \in \mathbb{N}^*$: $V = \{x_{i_1}, \ldots, x_{i_k}\}$ s.t. $d(q, x_{i_j}) \leq d(q, x_l)\ \forall l \notin \{i_1, \ldots, i_k\}$; $x_{i_1}$ is the nearest (1-)neighbour and $x_{i_k}$ is the farthest k-neighbour • $\varepsilon$-NN, $\varepsilon > 0$ fixed (range query): $V = \{x_{i_1}, \ldots, x_{i_k}\}$ s.t. $d(q, x_{i_j}) \leq \varepsilon$
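Both query types can be written directly as an exhaustive (brute-force) search; a sketch with NumPy, assuming Euclidean distance (helper names are ours):

```python
import numpy as np

def knn(X, q, k):
    """Indices of the k nearest neighbours of q among the rows of X, closest first."""
    d = np.linalg.norm(X - q, axis=1)
    order = np.argsort(d)
    return order[:k], d[order[:k]]

def range_query(X, q, eps):
    """Indices of all points within distance eps of q (the epsilon-NN / range query)."""
    d = np.linalg.norm(X - q, axis=1)
    return np.where(d <= eps)[0]

rng = np.random.default_rng(0)
X = rng.random((1000, 8))       # N=1000 points in M=8 dimensions
q = rng.random(8)
idx, dist = knn(X, q, k=5)
print("5-NN:", idx, dist)
print("range query:", range_query(X, q, eps=0.5).size, "points within 0.5")
```

Indexing structures (later in the tutorial) exist precisely to avoid this O(N) scan per query.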
• 37. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 37 Space partitioning — $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^M$ Voronoi diagram — $c_i$: Voronoi cell associated to point $x_i$: $c_i = \{y \in \mathbb{R}^M \text{ s.t. } d(y, x_i) \leq d(y, x_j)\ \forall j \neq i\}$ Delaunay Graph D=(C,E): the points $x_i$ are the vertices of D; $(x_i, x_j)$ is an edge if $c_i$ and $c_j$ have a common edge. The graph connects neighbouring cells Illustrations: http://www.wblut.com Wikipedia
• 38. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 38 Curse of Dimensionality — We, as humans, are experts in 1D, 2D, 3D, a bit less in 4D (time) and less so afterwards. In high dimensions (e.g. 20 is enough), counter-intuitive properties appear, e.g.: • Sparsity • Concentration of distances • Relation to kNN: hubness — which we try to model here, to better understand why things go wrong (or not as well as expected)
• 39. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 39 High dimensionality — $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^M$ • M is the dimension of the space (and the data) – Measures, characteristics, … • X is therefore the sample data of a M-dimensional space. What if M increases? – Influence on geometric measures (distances, k-NN) – Influence on statistical distributions «Curse of dimensionality» Richard Ernest Bellman (1961). Adaptive control processes: a guided tour. Princeton University Press.
• 40. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 40 High dimensionality — Imagine a data sample in $[a,b]^M$. We quantify every dimension with k bins. To estimate the distribution we require n samples in each bin on average • M=1: $N \sim k\,n$ • M=2: $N \sim n\,k^2$ … • M: $N \sim n\,k^M$ Example: k=10, n=10, M=6 => N ~ 10’000’000 samples required
• 41. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 41 Curse of dimensionality • Sparsity – N samples – M dimensions – k quantization steps: $n \sim \frac{N}{k^M}$ samples per bin, or $N \sim n\,k^M$ to maintain n constant
• 42. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 42 Curse of dimensionality — $E[\#(x \in \text{bin}_i)] \sim \frac{N}{k^M}$ • Consequences: – With finite sample size (limited data collection), most of the cells are empty if the feature dimension is too high – The estimation of probability density is unreliable
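A quick simulation of this sparsity effect (our own sketch, with arbitrary N and k): quantize each of the M dimensions into k bins and count how many of the k^M cells actually receive at least one of the N uniform samples.

```python
import numpy as np

def occupied_cell_fraction(N=10000, k=10, dims=(1, 2, 3, 4, 5, 6)):
    """Fraction of the k^M grid cells containing at least one of N uniform samples."""
    rng = np.random.default_rng(0)
    for M in dims:
        X = rng.random((N, M))
        cells = np.floor(X * k).astype(int)                    # bin index per dimension
        flat = np.ravel_multi_index(tuple(cells.T), (k,) * M)  # one integer id per cell
        occupied = np.unique(flat).size
        print(f"M={M}: {occupied} of {k**M} cells occupied ({occupied / k**M:.4%})")

occupied_cell_fraction()
```

With N=10’000 and k=10, every cell is hit for M=1 or 2, but for M=6 only about 1% of the million cells ever receive a sample.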
• 43. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 43 Curse of dimensionality • Gaussian distribution: $P(\|X\|_\infty \leq 3) = (0.9973)^M$ — M=1: 99.7%, M=10: 97.3%, M=100: 76.3%, M=500: 25.8%, M=1000: 6.7%
• 44. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 44 Neighbourhood structure • S a ball around a point (radius r, dimension M): $V_S(M) = \frac{\pi^{M/2}\, r^M}{\Gamma(M/2 + 1)}$ • C a cube around the same point, $[-r, +r]^M$: $V_C(M) = (2r)^M$ • ratio: $\frac{V_S(M)}{V_C(M)} = \frac{\pi^{M/2}}{2^M\,\Gamma(M/2+1)} \xrightarrow[M \to \infty]{} 0$
• 45. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 45 Neighbourhood structure — $\frac{V_S(M)}{V_C(M)} \xrightarrow[M\to\infty]{} 0$ and $E[P(x_i \in S)] \xrightarrow[M\to\infty]{} 0$ • Most of the neighbours of the centre are «in the corners of the cube» • Empty space: each point (centre) sees its neighbours far away
• 46. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 46 Neighbourhood structure • S a ball around a point (radius r, dimension M): $V_S(M) = \frac{\pi^{M/2}\, r^M}{\Gamma(M/2+1)}$ • C the enclosed (inscribed) cube, of side $a = \frac{2r}{\sqrt{M}}$: $V_C(M) = \left(\frac{2r}{\sqrt{M}}\right)^M$ • ratio: $\frac{V_S(M)}{V_C(M)} = \frac{\pi^{M/2}\, M^{M/2}}{2^M\,\Gamma(M/2+1)} \xrightarrow[M\to\infty]{} ?$
• 47. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 47 High-dimensional k-NN — Dmax and Dmin are the largest and smallest neighbour distances. Thm [Beyer et al., 1999]: if $\lim_{M\to\infty} \mathrm{Var}\!\left(\frac{\|X_M\|_k}{E[\|X_M\|_k]}\right) = 0$ then $\forall \varepsilon > 0,\ \lim_{M\to\infty} P\!\left(D_{\max} \leq (1+\varepsilon)\, D_{\min}\right) = 1$ Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. (1999). When is “nearest neighbor” meaningful? In Proceedings of the 7th International Conference on Database Theory, pages 217–235
• 48. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 48 High-dimensional k-NN — Loss of contrast: $\frac{D_{\max} - D_{\min}}{D_{\min}} \xrightarrow[M\to\infty]{P} 0$  Computational imprecision prevents relevance  Noise is taking over  $\varepsilon$-NN: all or nothing  k-NN: random draw
• 49. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 49 Loss of contrast — [Figure: distribution of $\|x\|_2/\sqrt{M}$ for $x \sim U([0,1])^M$ as a function of the dimension M: the norms concentrate as M grows]
• 50. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 50 Loss of contrast — [Figure: distribution of $\|x\|_2/\sqrt{M}$ for $x \sim N(0,1)^M$ as a function of the dimension M: same concentration effect]
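The loss of contrast shown in these plots can be reproduced with a short Monte-Carlo sketch (ours): draw points uniformly, compute the distances to a random query, and watch (Dmax − Dmin)/Dmin shrink as the dimension grows.

```python
import numpy as np

def relative_contrast(M, N=1000, seed=0):
    """(Dmax - Dmin) / Dmin for Euclidean distances from a random query to N uniform points."""
    rng = np.random.default_rng(seed)
    X = rng.random((N, M))
    q = rng.random(M)
    d = np.linalg.norm(X - q, axis=1)
    return (d.max() - d.min()) / d.min()

for M in (2, 10, 100, 1000):
    print(f"M={M:5d}  relative contrast = {relative_contrast(M):.3f}")
```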
  • 51. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 51 • Consequences – Database index based on metric distances • K-d-tree • VP-tree have to perform exhaustive search “Every enclosing rectangle encloses everything” High dimensional k-NN Illustrations: Peter N. Yianilos. Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. Fourth ACM-SIAM Symposium on Discrete Algorithms, 1993
• 52. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 52 High dimensional diagonals — In M dimensions, the unit hypercube has as diagonal $u = [1\ 1\ \cdots\ 1]^T$, then $\|u\|_2 = \sqrt{M} \xrightarrow[M\to\infty]{} \infty$
• 53. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 53 High dimensional diagonals — In M dimensions, the unit hypercube has as diagonal $u = [1\ 1\ \cdots\ 1]^T$, then $\cos(\theta) = \cos(u, e_1) = \frac{u^T e_1}{\|u\|\,\|e_1\|} = \frac{1}{\sqrt{M}} \xrightarrow[M\to\infty]{} 0$
• 54. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 54 High dimensional diagonals — In M dimensions, the unit hypercube has as diagonal $u = [1\ 1\ \cdots\ 1]^T$, then $\theta(u, e_i) \xrightarrow[M\to\infty]{} \frac{\pi}{2}$ • In high dimensions, all $2^{M-1}$ diagonal vectors are (nearly) orthogonal to the basis vectors • High dimensional spaces therefore contain an exponential number of such almost-orthogonal diagonal directions • Everything along the diagonals is projected onto the origin
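A two-line numerical check of this statement (a sketch of ours): the angle between the main diagonal u = [1,…,1]^T and a basis vector tends to 90° as M grows.

```python
import numpy as np

for M in (2, 3, 10, 100, 10000):
    u = np.ones(M)
    e1 = np.zeros(M); e1[0] = 1.0
    cos_theta = u @ e1 / (np.linalg.norm(u) * np.linalg.norm(e1))   # = 1/sqrt(M)
    theta_deg = np.degrees(np.arccos(cos_theta))
    print(f"M={M:6d}  cos(theta)={cos_theta:.4f}  theta={theta_deg:.2f} deg")
```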
  • 55. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 55 Given a Gaussian distribution in a M-dimensional space N(mM,SM), what is the density of samples of radius r? With no loss of generality we study the centered distribution N(0,IM) Gaussian distribution ]r-dr,r+dr[
• 56. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 56 Gaussian distribution — $X = (X_1, \ldots, X_M)^T$, $X_i \sim N(0,1)$; estimation of $P(\|X\| \in [r-dr, r+dr])$. With $r^2 = \frac{\|X\|_2^2}{M} = \frac{1}{M}\sum_{i=1}^M X_i^2$ (a $\chi^2_M$ variable scaled by $1/M$), and using $E(aX+b) = aE(X)+b$, $V(aX+b) = a^2 V(X)$ and, for $\chi^2_k$, $E = k$, $V = 2k$: $E(r^2) = 1$ and $V(r^2) = \frac{2}{M} \xrightarrow[M\to\infty]{} 0$
• 57. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 57 «Gaussian egg» — $X = (X_1, \ldots, X_M)^T$, $X_i \sim N(0,1)$; estimation for $P(\|X\| \in [r-dr, r+dr])$: $E(r^2) = 1$, $V(r^2) = \frac{2}{M}$ [Figure: as the dimension grows, the probability mass splits between $P(\|X\| < r)$, $P(\|X\| \in [r-dr, r+dr])$ and $P(\|X\| > r)$, concentrating on the shell]
• 58. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 58 «Gaussian egg» — $E(r^2) = 1$, $V(r^2) = \frac{2}{M}$: for a M-dimensional Normal distribution of mean 0 and s.d. 1, the norm distribution marginalised over concentric spheres has a mean of 1 and a variance converging to 0 Intuition: the volume of the inner sphere tends to 0, which counteracts the high density at the centre
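This “Gaussian egg” behaviour is easy to verify empirically; a sketch (ours) of the distribution of the normalised radius ||X||/√M for X ~ N(0, I_M):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10000
for M in (1, 10, 100, 1000):
    X = rng.standard_normal((n_samples, M))
    r = np.linalg.norm(X, axis=1) / np.sqrt(M)   # normalised radius of each sample
    print(f"M={M:5d}  mean={r.mean():.3f}  std={r.std():.4f}")
```

The mean stays close to 1 while the standard deviation shrinks with M: the samples live on a thin shell, not near the mode.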
• 59. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 59 Empirical evidence (10’000 samples) — [Figure: histograms of the normalised norm, over bins on [0,2], as a function of the dimension]
• 60. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 60 Empirical evidence (10’000 samples) — [Figure: probability of falling below, within and above the shell $[\sigma - d\sigma, \sigma + d\sigma]$ as a function of the dimension]
• 61. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 61 Empirical evidence (10’000 samples) — [Figure: the same three probabilities, cumulative view, as a function of the dimension]
  • 62. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 62 Consequences • Loss of contrast: the relative spread of points is not seen accurately • Conversely: using high dimensional Gaussian distributions to model the data may not be as accurate
• 63. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 63 Hubs • We want to characterise the number of times a sample appears in the k-NN of another sample: $P_{ik}(x) = 1$ if $x \in \text{kNN}(x_i)$, 0 otherwise, and $N_k(x) = \sum_i P_{ik}(x)$. The distribution of $N_k$ becomes skewed, with a long right tail: a small number of samples — the hubs — appear in the neighbourhood of many samples
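A sketch (ours) of the N_k statistic defined above: count how often each point appears in the k-NN lists of the others and compare low and high dimensions. The mean of N_k is always k; it is the spread and the maximum that explode with the dimension.

```python
import numpy as np

def hubness_counts(X, k=20):
    """N_k(x): how many times each point occurs among the k-NN lists of the other points."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T     # squared Euclidean distances
    np.fill_diagonal(D2, np.inf)                       # a point is not its own neighbour
    knn = np.argsort(D2, axis=1)[:, :k]                # k-NN list of every point
    return np.bincount(knn.ravel(), minlength=X.shape[0])

rng = np.random.default_rng(0)
for M in (3, 100):
    X = rng.random((1000, M))
    Nk = hubness_counts(X, k=20)
    print(f"M={M:3d}  mean N_k={Nk.mean():.0f}  max N_k={Nk.max()}  std={Nk.std():.2f}")
```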
• 64. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 64 [Figure: histogram (50 bins) of $N_{20}$ for 20-NN, M=100, 1000 samples: frequency per bin, showing a long tail]
• 65. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 65 Hubness — [Figure: distribution of $N_k$ per bin as a function of the dimension: the skew increases with M]
• 66. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 66 Hubs: centering — When using the cosine distance as similarity measure, $d_{\cos}(x,y) = 1 - \frac{x^T y}{\|x\|\,\|y\|}$, centering the data helps reducing the hubness I. Suzuki et al. Centering Similarity Measures to Reduce Hubs. 2013 Conf. on Empirical Methods in NLP.
• 67. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 67 Lesson — Although data points may be uniformly distributed, the $L_p$ norms being sums of coordinate distances, $\|x\|_p = \left(\sum_{i=1}^M |x_i|^p\right)^{1/p}$, the computed distances are corrupted by the excess of uninformative dimensions. As a result, points appear non-uniformly distributed
  • 68. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 68 Summary Two main issues: • High-dimensional data – «Curse of dimensionality» – Making distance measurements unreliable – Making statistical estimation inaccurate • Large data – «Big data»  Reduction of dimension
  • 69. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 69 Dimension reduction: principle • Given a set of data in a M-dimensional space, we seek an equivalent representation of lower dimension • Dimension reduction induces a loss. What to sacrifice? What to preserve? – Preserve local: neighbourhood, distances – Preserve global: distribution of data, variance – Sacrifice local: noise – Sacrifice global: large distances – Map linearly – Unfold data
  • 70. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 70 Some example techniques: • SFC: preserve neighbourhoods • PCA: preserve global linear structures • MDS: preserve linear neighbourhoods • IsoMAP: Unfold neighbourhoods • SNE family: unfold statistically Not studied here (but also valid): • SOM (visualisation), LLE, Random projections (hashing) Dimension reduction
• 71. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 71 Space-filling curves • Definition: – A continuous curve which passes through every point of a closed n-cell in Euclidean n-space Rn is called a space filling curve (SFC) • The idea is to map the complete space onto a simple index: a continuous line – Directly implies an order on the dataset
  • 72. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 72 Application of SFC • Mapping multi-dimensional space to one dimensional sequence • Applications in computer science: – Database multi-attribute access – Image compression – Information visualization – …… Various types • Non-recursive – Z-Scan Curve – Snake Scan Curve • Recursive – Hilbert Curve – Peano Curve – Gray Code Curve
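Among the curves listed above, the (non-recursive) Z-scan curve is the simplest to implement: interleave the bits of the quantized coordinates (the Morton code). A small 2-D sketch of ours, not taken from the slides:

```python
def morton_2d(x, y, bits=16):
    """Interleave the bits of two non-negative integers -> position on the Z-order curve."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # even bit positions: x
        code |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions:  y
    return code

# Sorting 2-D cells by their Morton code gives the familiar "Z" visiting order.
cells = [(x, y) for x in range(4) for y in range(4)]
print(sorted(cells, key=lambda c: morton_2d(*c)))
```

Recursive curves such as Hilbert’s preserve locality better, at the cost of a more involved encoding.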
  • 73. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 73 Hilbert curve Ilustrations: Wikipedia
  • 74. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 74 Peano Curve Ilustrations: Wikipedia
  • 75. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 75 SFC • In our case, the idea is to use SFC to “explore” local neighborhoods, hoping that neighborhoods will appear “compact” on the curve • Hence we study such mapping for SFC
  • 76. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 76 Visualizing 4D Hyper-Sphere Surface • Z-Curve  Hilbert Curve [Illustrations from the lecture “SFC in Info Viz”, Jiwen Huo, Uni Waterloo, CA]
• 77. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 77 Projection • Points can be identified as vectors from the origin • Orthogonal projection • x gets projected in x* onto u (which we take of unit length to represent the subspace Fu) [Figure: x, its projection x* on the line Fu spanned by u, and the residual x − x*]
• 78. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 78 Projection — $x^* = \arg\min_{y \in F_u} d(x, y) = \langle x, u\rangle\, u$, and $d(x, F_u) = \|x - x^*\| = \|x - \langle x, u\rangle\, u\|$
• 79. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 79 Interpretation • x* is the part of x that lives in Fu (e.g. the subspace of interest) • x − x* is the residual (what is not represented) • x* and x − x* are orthogonal (they represent complementary information) • Point x* is the closest point from Fu to x (minimal loss, maximal representation)
  • 80. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 80 • Given a set of data in a M dimensional space, we seek a subspace of lower dimension onto which to project our data • Criterion: preserve most inertia of the dataset • Consequence: project and minimize residuals • We construct incrementally a new subspace by successive projections – X is projected onto ui, find an orthogonal ui+1 to project the residual – At most M ui s can be found, we then select the most representative (preserving most inertia) Principal Component Analysis (PCA)
• 81. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 81 PCA — $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^M$ • The data is centred around its mean, $x_i \leftarrow x_i - m$, to get minimal global inertia • We then look for $u_1$, the direction capturing most inertia (minimizing the global sum of residuals): $u_1 = \arg\min_u \sum_i d^2(x_i, F_u)$ with $d^2(x_i, F_u) = x_i^T x_i - (u^T x_i)^2$ for unit u, so that $u_1 = \arg\min_u \sum_i d^2(x_i, F_u) = \arg\max_u u^T \Sigma u$; with the Lagrangian $L = u^T \Sigma u - \lambda(u^T u - 1)$: $\frac{\partial L}{\partial u} = 2\Sigma u - 2\lambda u = 0 \Rightarrow \Sigma u = \lambda u$
• 82. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 82 PCA • PCA incrementally finds “best fit” subspaces, orthogonal to each other (minimize residuals) • PCA incrementally finds directions of max variance (maximize the trace of the covariance matrix) • PCA transforms the data linearly (eigendecomposition) to align it with its axes of max variance (and make the covariance matrix diagonal) • The reduction of dimension is made by selecting the eigenvectors corresponding to the (m<<M) largest eigenvalues • Because of the max variance criterion, PCA sees the dataset as data drawn from a centred distribution penalised by their deviation (distance) to the centre: a Normal distribution  PCA is a linear transformation adapted to non-clustered data
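A compact sketch (ours, not a reference implementation) of the procedure just described: centre the data, eigendecompose the covariance matrix, keep the top-m eigenvectors and project.

```python
import numpy as np

def pca(X, m):
    """Project the rows of X onto the m directions of maximal variance."""
    Xc = X - X.mean(axis=0)                      # centre the data
    C = np.cov(Xc, rowvar=False)                 # M x M covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)         # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    U = eigvecs[:, order[:m]]                    # top-m principal directions
    return Xc @ U, eigvals[order]

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # anisotropic cloud
Y, spectrum = pca(X, m=1)
print("explained variance ratio of the first component:", spectrum[0] / spectrum.sum())
```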
  • 83. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 83 PCA [Illustration Wikipedia]
• 84. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 84 Linear Discriminant Analysis (LDA) • “Discriminant”  Supervised: $(x_i, y_i)$, where the $y_i$s are “labels” • Simple case: 2 classes – We seek u such that the projections $x_i^*$ of the $x_i$s onto $F_u$ are best linearly separated
• 85. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 85 LDA • Intuitively: max inter-class distance – u parallel to the original $m_1 - m_2$: $u_1 = \arg\max_u |m_1^* - m_2^*|$ • Fisher criterion adds min intra-class spread ($s^2$): $u_1 = \arg\min_u \left(s_1^{*2} + s_2^{*2}\right)$ • Fisher criterion: $u_1 = \arg\max_u \frac{|m_1^* - m_2^*|^2}{s_1^{*2} + s_2^{*2}}$
  • 86. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 86 • Both inter- and intra-class criterion can be generalised to multi-class • Criteria consider classes as one Gaussian distribution N(mj,sj) each • Resolved by solving an eigensystem Linear solution • Can be used for supervised projection onto a reduced set of dimensions LDA
• 87. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 87 Multi Dimensional Scaling (MDS) — Given $\delta_{ij}$ a set of inter-distances between points of a supposed M-dimensional set X (M unknown), • We seek points X* in a m-dimensional space (m given) such that $d_{ij}(X^*)$ approximates $\delta_{ij}$ • We define stress functions such as $X^* = \arg\min_{Y \in \mathbb{R}^m} \frac{\sum_{i<j} \left(\delta_{ij} - d_{ij}(Y)\right)^2}{\sum_{i<j} d_{ij}(Y)^2}$, which are optimised by majorization Note: weighting by $\delta_{ij}$ may help privileging local structures (less penalty on small distance values)
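The stress above is usually minimised iteratively by majorization (SMACOF); as a simpler illustration of the same idea, here is a sketch (ours) of classical (Torgerson) MDS, the closed-form relative that recovers a configuration from an inter-distance matrix by double centering and eigendecomposition.

```python
import numpy as np

def classical_mds(D, m=2):
    """Classical (Torgerson) MDS: embed points so that Euclidean distances approximate D."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:m]        # keep the m largest eigenvalues
    L = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * L                 # N x m configuration X*

# Recover a 2-D configuration from its own inter-distance matrix.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
X_star = classical_mds(D, m=2)
D_star = np.linalg.norm(X_star[:, None, :] - X_star[None, :, :], axis=2)
print("max distance discrepancy:", np.abs(D - D_star).max())
```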
  • 88. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 88 Shepard diagram: plot dij against dij(X*) • Ideally along the diagonal (or highly correlated) • Helps seeing where the discrepancy appears MDS [Illustration from I. Borg & PJF Groenen. Modern Multidimensional Scaling. Springer 2005]
• 89. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 89 A quick note on “distance” matrices • Recall: $d: X \times X \to \mathbb{R}^+$ with $d(x,x) = 0\ \forall x$, $d(x,y) = d(y,x)\ \forall x, y$ and $d(x,z) \leq d(x,y) + d(y,z)\ \forall x, y, z$ • This implies that if D is an inter-distance matrix – D is symmetric, with a zero diagonal and non-negative entries
• 90. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 90 Non Euclidean distances — $d(x,z) \leq d(x,y) + d(y,z)$ • Euclidean distances say that the shortest distance between two points is along a straight line (any diversion increases the distance value) • This also says that if y is close to x and z, then x and z should be reasonably close to each other • This may not always be true – Social nets: if y is friend with x and z, it says nothing about the social distance between x and z (may be large) – Data Manifold: if the data lies on a complex manifold, the straight line is irrelevant
  • 91. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 91 • A local neighbourhood graph (eg 5-NN graph) is built to create a topology and ensure continuity • Distances are replaced by geodesics (paths on the neighbourhood graph) • MDS is applied on this interdistance matrix (eg with m=2) IsoMap (non Euclidean) [Illustration from http://isomap.stanford. edu]
  • 92. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 92 • Locally Euclidean neighbourhoods are considered – Requires a good (dense, uniform) data distribution – Choice of the neighbourhood size to ensure connectivity and avoid infinite distances • Powerful to “unfold” the manifold IsoMap
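A sketch (ours) of the geodesic-distance step described above: build a k-NN graph, then take shortest paths over it (a plain Floyd–Warshall here, for clarity rather than speed). MDS from the previous slides would then be applied to the resulting matrix.

```python
import numpy as np

def geodesic_distances(X, k=5):
    """Approximate geodesic distances: k-NN graph + Floyd-Warshall shortest paths."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]       # k nearest neighbours (self excluded)
    for i in range(N):
        G[i, nn[i]] = D[i, nn[i]]
        G[nn[i], i] = D[i, nn[i]]                # keep the graph symmetric
    for p in range(N):                           # Floyd-Warshall, O(N^3)
        G = np.minimum(G, G[:, p:p + 1] + G[p:p + 1, :])
    return G                                     # inf entries signal a disconnected graph

rng = np.random.default_rng(0)
t = rng.uniform(0, 3 * np.pi, 200)
X = np.c_[t * np.cos(t), t * np.sin(t)]          # a 2-D spiral: a simple curved manifold
G = geodesic_distances(X, k=5)
print("Euclidean vs geodesic span:", np.linalg.norm(X[0] - X[-1]), G[0, -1])
```

The gap between the two printed values illustrates why the straight-line distance is misleading on a curved manifold.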
  • 93. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 93 • Deterministic distance-based neighbourhoods, which may contain noise or outlying values, are replaced by a stochastic view • Distances are then taken between probability distributions • The embedding is made “in probability” • Given X in M-dimensional space, and m – pj|i is the probability of xi to pick xj as a neighbour in M-dimensional space – qj|i is the probability of xi* to pick xj* as a neighbour in m-dimensional space – Do so that q stays “close” to p (divergence) Stochastic Neighbourhood Embedding (SNE)
• 94. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 94 SNE • X* is found by minimizing F(X’) using gradient-based optimisation: $p_{j|i} = \frac{e^{-d^2(x_i, x_j)/2\sigma_i^2}}{\sum_{k \neq i} e^{-d^2(x_i, x_k)/2\sigma_i^2}}$, $q_{j|i} = \frac{e^{-d^2(x_i^*, x_j^*)}}{\sum_{k \neq i} e^{-d^2(x_i^*, x_k^*)}}$, $F(X') = \sum_i D_{KL}(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log\frac{p_{j|i}}{q_{j|i}}$ • The definition of P and Q relaxes the rigid constraints found when using distances • The exponential decay of likelihood favors local structures • t-SNE uses a Student t-distribution in the low dimensional space
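A direct transcription (ours) of the p, q and KL objective above, for a fixed bandwidth σ; the real algorithm tunes σ_i per point via a perplexity target and then descends the gradient of this objective, both of which are skipped here.

```python
import numpy as np

def conditional_probs(X, sigma2=1.0):
    """p_{j|i} proportional to exp(-d(x_i, x_j)^2 / (2 sigma^2)), with p_{i|i} = 0."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    P = np.exp(-D2 / (2.0 * sigma2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum(axis=1, keepdims=True)

def sne_objective(X_high, X_low, sigma2=1.0):
    """F(X') = sum_i KL(P_i || Q_i), the quantity SNE minimises by gradient descent."""
    P = conditional_probs(X_high, sigma2)
    Q = conditional_probs(X_low, 0.5)            # low-dim space: exp(-d^2), i.e. 2*sigma^2 = 1
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], 1e-12))))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
Y = rng.standard_normal((100, 2))                # a random, un-optimised embedding
print("KL objective of a random embedding:", sne_objective(X, Y))
```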
  • 95. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 95 • MNIST dataset t-SNE example [Illustration from L. van der Maaten’s website]
  • 96. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 96 Traces of our everyday activities can be: • Captured, exchanged (production, communication) • Aggregated, Stored • Filtered, Mined (Processing) The “V”’s of Big Data: • Volume, Variety, Velocity (technical) • and hopefully... Value Raw data is almost worthless, the added value is in juicing the data into information (and knowledge) Big Data
  • 97. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 97
  • 98. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 98 However Two main issues: • High-dimensional data – «Curse of dimensionality» – Making distance measurements unreliable – Making statistical estimation inaccurate • Large data – «Big data» – Could compensate for sparsity problems – But hard to manage efficiently
  • 99. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 99 Solutions Two main issues • High-dimensional data – Reduce the dimension – Indexing for solving the kNN problem efficiently • Large data – Reduce the volume – Filter, compress, cluster,…
  • 100. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 100 Indexing structures …+ M-tree Tries Suffix array Suffix Tree Inverted files LSH… Illustration: Wikipedia
  • 101. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 101 Main ideas: • A point is described by its neighbourhood • The neighbourhood of a point encodes its position • We use only neighboring landmarks – To be fast • We don’t keep distances, just ranks – To be faster Permutation-based Indexing
• 102. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 102 Permutation-based Indexing — D = {x1, . . . , xN}, N objects; R = {r1, . . . , rn} ⊂ D, n references. Each xi is identified by an ordered list L(xi, R) = (ri1, . . . , rin) such that d(xi, rij) ≤ d(xi, rij+1) ∀ j = 1, . . . , n − 1. Example (n=5): L(x1, R) = (r1, r2, r3, r4, r5), L(x2, R) = (r1, r2, r3, r4, r5), L(x3, R) = (r5, r3, r2, r4, r1)
• 103. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 103 Permutation-based Indexing — Indexing: building the ordered lists. Querying (kNN): • Build the query ordered list • Compare it with the points’ ordered lists using the Spearman Footrule Distance: $d(q, x_i) \approx d_{SFD}(q, x_i) = \sum_j \left| \mathrm{rank}(r_j, L(q,R)) - \mathrm{rank}(r_j, L(x_i,R)) \right|$ Solving kNN: “I see what you see if I am close to you”
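A minimal sketch (ours) of the two steps just described: each object is encoded as the permutation of reference points ordered by distance, and queries are compared to the stored permutations with the Spearman footrule. Real systems add inverted files, list pruning and quantization on top of this.

```python
import numpy as np

def permutation(x, refs):
    """Ordered list L(x, R): indices of the reference points sorted by distance to x."""
    return np.argsort(np.linalg.norm(refs - x, axis=1))

def spearman_footrule(perm_a, perm_b):
    """Sum over references of |rank in L_a - rank in L_b|."""
    rank_a = np.argsort(perm_a)                  # rank of each reference in list a
    rank_b = np.argsort(perm_b)
    return int(np.abs(rank_a - rank_b).sum())

rng = np.random.default_rng(0)
X = rng.random((1000, 10))                        # database
R = X[rng.choice(len(X), size=8, replace=False)]  # n=8 reference points
index = [permutation(x, R) for x in X]            # indexing step (ordered lists)

q = rng.random(10)
Lq = permutation(q, R)
scores = np.array([spearman_footrule(Lq, Lx) for Lx in index])
print("10 candidate NN by footrule:", np.argsort(scores)[:10])
```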
  • 104. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 104 PBI in practice Given a query point q, we seek objects xi such that L(xi,R) ~ L(q,R) • We use inverted files to (pre-)select objects such that L(xi,R)|rj ~ L(q,R)|rj • We prune the lists with the assumption that only local neighborhood is important • We quantize the lists for easier indexing • … (still an active development)
  • 105. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 105 Efficiency of PBI • Still uses distances for creating lists • Issues with ordering due to the curse of dimensionality However • The choice of reference points (location, number) may be optimised • Empirical performance are good
  • 106. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 106 Optimising PBI
• 107. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 107 PBI: Encoding Model — [Figure: reference points r1 … r5 and their pairwise bisectors δ12, δ23, δ24, δ43, δ14, δ15, δ53, δ45 partition the space into cells, each labelled by the ordered list of its two nearest references, e.g. L(o, R) = (1,2), (2,1), (1,4), (4,1), (1,5), (5,1), (4,5), (5,4), (4,2), (2,4), (2,3), (3,2), (3,4), (3,5), (5,3)]
  • 108. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 108 Optimising PBI
  • 109. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 109 Map-Reduce principle Two-step parallel processing of data: • Map the data properties (values) onto respective keys (data) – (key,value) pairs • Reduce the list of values for each of the keys – (key, list of values) – Process the list
  • 110. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 110 Map Reduce – Word Count example [Illustration: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/] • Keys: stems • Values: occurrence (1) • Reducing: sum (frequency)
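The word-count example can be mimicked in a few lines of plain Python (ours, not Hadoop code) to make the map / shuffle / reduce phases explicit:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (key, value) pair (word, 1) for every word of the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each list of values (here, by summing the counts)."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [p for doc in documents for p in map_phase(doc)]   # mappers run in parallel
print(reduce_phase(shuffle(pairs)))                        # {'the': 3, 'fox': 2, ...}
```

In a real MapReduce deployment the mappers and reducers run on different machines; only the shuffle moves data across the network.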
  • 111. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 111 MapReduce • The MapReduce programming interface brings a simple and powerful interface for data parallelization, by keeping the user away from the communications and exchange of data. 1. Mapping 2. Shuffling 3. Merging 4. Reducing
  • 112. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 112 Distributed inverted files • Data size: 36GB of XML data. • Hadoop: 40 minutes. • The best ratio between the mappers and reducers is found to be:
• 113. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 113 GPU Architecture • Host: – the computer that hosts the GPU card. • Device: – the GPU itself. • Kernel: – a function run by thousands of threads in parallel. • Grids: – two- or three-dimensional arrangements of blocks. • Blocks: – contain up to an upper limit of 512 or 1024 threads. • Memory: – Local memory (fast and small, KB). – Global memory (slow and big, GB).
• 114. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 114 Permutation-Based Indexing on GPU — variants PDPS, PDSS, PIOF. Complexity: PDPS: $O\!\left(\frac{Nn}{P} + N(n\log n + n)\right) + t_1$; PDSS, PIOF: $O\!\left(\frac{N(2n + n\log n)}{P}\right) + t_2$. Memory: $s \times (N_l \times m + n \times m + 2(N_l \times n))$, $s \times (N_l \times m + n \times m + N_l \times n + N_l \times n)$ and $s \times (N_l \times m + n \times m + N_l \times n)$ for the three variants. PIOF does the sorting while it calculates the distances!
  • 115. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 115 • Indexing looks at organising neighborhoods to avoid exhaustive search • Indexing may be tailored to the issue in question – Inverted files for text search – Spatial indexing for neighbourhood search Summary
  • 116. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 116 • Hashing – LSH, Random projections, • Outlier detection – Including in high-dimensional spaces • Classification, regression – With sparse data Were not studied here…
  • 117. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 117 Conclusion “Distance is key” – Defines the neighbourhood of points – Defines the standard deviation around the mean – Defines the notion of similarity However – Distance may have a non-intuitive behavior – Distance may not be strictly needed • Stochastic model for neighbourhoods (SNE) • Ranking approach for neighbourhoods (PBI)
  • 118. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 118 References Big Data and Large-scale data – Mohammed, H., & Marchand-Maillet, S. (2015). Scalable Indexing for Big Data Processing. Chapman & Hall. – Marchand-Maillet, S., & Hofreiter, B. (2014). Big Data Management and Analysis for Business Informatics. Enterprise Modelling and Information Systems Architectures (EMISA), 9. – M. von Wyl, H. Mohamed, E. Bruno, S. Marchand-Maillet, “A parallel cross-modal search engine over large-scale multimedia collections with interactive relevance feedback” in ICMR 2011 - ACM International Conference on Multimedia Retrieval. – H. Mohamed, M. von Wyl, E. Bruno and S. Marchand-Maillet, “Learning-based interactive retrieval in large-scale multimedia collections” in AMR 2011 - 9th International Workshop on Adaptive Multimedia Retrieval. – von Wyl, M., Hofreiter, B., & Marchand-Maillet, S. (2012). Serendipitous Exploration of Large-scale Product Catalogs. In 14th IEEE International Conference on Commerce and Enterprise Computing (CEC 2012), Hangzhou, CN. More at http://viper.unige.ch/publications
  • 119. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 119 References Large-scale Indexing – Mohamed, H., & Marchand-Maillet, S. (2015). Quantized Ranking for Permutation- Based Indexing. Information Systems. – Mohamed, H., Osipyan, H., & Marchand-Maillet, S. (2014). Multi-Core (CPU and GPU) For Permutation-Based Indexing. In Proceedings of the 7th Internation Conference on Similarity Search and Applications (SISAP2014), Los Cabos, Mexico. – H. Mohamed and S. Marchand-Maillet “Parallel Approaches to Permutation-Based Indexing using Inverted Files” in SISAP 2012 - 5th International Conference on Similarity Search and Applications . – H. Mohamed and S. Marchand-Maillet “Distributed Media indexing based on MPI and MapReduce” in CBMI 2012 - 10th Workshop on Content-Based Multimedia Indexing. – H. Mohamed and S. Marchand-Maillet “Enhancing MapReduce using MPI and an optimized data exchange policy”, P2S2 2012 - Fifth International Workshop onParallel Programming Models and Systems Software for High-End Computing. – Mohamed, H., & Marchand-Maillet, S. (2014). Distributed media indexing based on MPI and MapReduce. Multimedia Tools and Applications, 69(2). – Mohamed, H., & Marchand-Maillet, S. (2013). Permutation-Based Pruning for Approximate K-NN Search. In DEXA, Prague, CZ. More at http://viper.unige.ch/publications
  • 120. Stephane.Marchand-Maillet@unige.ch – University of Geneva – KEYSTONE Summer School – © July 2015 - 120 References Large data analysis – Manifold learning – Sun, K., Morrison, D., Bruno, E., & Marchand-Maillet, S. (2013). Learning Representative Nodes in Social Networks. In 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Gold Coast, AU. – Sun, K., Bruno, E., & Marchand-Maillet, S. (2012). Unsupervised Skeleton Learning for Manifold Denoising and Outlier Detection. In International Conference on Pattern Recognition (ICPR'2012), Tsukuba, JP. – Sun, K., & Marchand-Maillet, S. (2014). An Information Geometry of Statistical Manifold Learning. In Proceedings of the International Conference on Machine Learning (ICML 2014), Beijing, China. – Wang, J., Sun, K., Sha, F., Marchand-Maillet, S., & Kalousis, A. (2014). Two-Stage Metric Learning. In Proceedings of the International Conference on Machine Learning (ICML 2014), Beijing, China. – Sun, K., Bruno, E., & Marchand-Maillet, S. (2012). Stochastic Unfolding. In IEEE Machine Learning for Signal Processing Workshop (MLSP'2012), Santander, Spain. More at http://viper.unige.ch/publications