Learning to Rank - From pairwise approach to listwise
1. Learning To Rank: From Pairwise Approach to Listwise Approach
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li
Presented by Hasan Hüseyin Topcu
2. Outline
• Related Work
• Learning System
• Learning to Rank
• Pairwise vs. Listwise Approach
• Experiments
• Conclusion
3. Related Work
• Pairwise Approach: the learning task is formalized as classification of object pairs into two categories (correctly ranked and incorrectly ranked)
• Classification methods:
  • Ranking SVM (Herbrich et al., 1999); Joachims (2002) applied Ranking SVM to Information Retrieval
  • RankBoost (Freund et al., 1998)
  • RankNet (Burges et al., 2005)
5. Learning System
• Training data, data preprocessing, …
• How are objects identified? How are instances modeled?
• Learning algorithms: SVM, ANN, Boosting
• Evaluate with test data
(Adapted from Pattern Classification, Duda, Hart, Stork)
8. Learning to Rank
• A number of queries are provided
• Each query is associated with a perfect ranking list of documents (Ground Truth)
• A ranking function is created using the training data, such that the model can precisely predict the ranking list
• A loss function is optimized for learning. Note that the loss function for ranking is slightly different, in the sense that it makes use of sorting
10. Data Labeling
• Explicit human judgment (Perfect, Excellent, Good, Fair, Bad)
• Implicit relevance judgment: derived from click data (search log data)
• Ordered pairs between documents (A > B)
• List of judgments (scores)
13. Pairwise Approach
• Collects document pairs from the ranking list and assigns a label to each document pair
• Data labels: +1 if the score of A > B, and −1 if A < B
• Formalizes the problem of learning to rank as binary classification
• Methods: Ranking SVM, RankBoost and RankNet
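As an illustration of the pair-generation step (not from the slides; `make_pairs` and the toy feature vectors are hypothetical), one query's graded documents can be turned into ±1-labeled instances like this:

```python
from itertools import combinations

def make_pairs(feature_vectors, grades):
    """Turn one query's graded documents into +/-1 labeled pairs.

    Each pair of documents with different grades yields one instance:
    the feature-vector difference, labeled +1 if the first document
    is the more relevant one, -1 otherwise.
    """
    pairs = []
    for j, k in combinations(range(len(grades)), 2):
        if grades[j] == grades[k]:
            continue  # equally graded pairs carry no preference
        label = 1 if grades[j] > grades[k] else -1
        diff = [a - b for a, b in zip(feature_vectors[j], feature_vectors[k])]
        pairs.append((diff, label))
    return pairs

# Three documents for one query with grades 2 (good), 1 (fair), 0 (bad).
docs = [[1.0, 0.5], [0.4, 0.2], [0.1, 0.9]]
pairs = make_pairs(docs, [2, 1, 0])
print(len(pairs))  # 3: every pair of distinct grades becomes an instance
```

Note the quadratic growth in the number of pairs, which is one of the drawbacks discussed next.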
14. Pairwise Approach Drawbacks
• The objective of learning is formalized as minimizing errors in the classification of document pairs, rather than minimizing errors in the ranking of documents
• The training process is computationally costly, as the number of document pairs is very large
15. Pairwise Approach Drawbacks
• Treats document pairs across different grades (labels) equally (Ex. 1)
• The number of generated document pairs varies largely from query to query, which results in training a model biased toward queries with more document pairs (Ex. 2)
16. Listwise Approach
• Training data instances are document lists
• The objective of learning is formalized as minimization of the total losses with respect to the training data
• The listwise loss function uses probability models: Permutation Probability and Top One Probability
Excerpt from the paper: given documents d^{(i')}, we construct feature vectors x^{(i')} from them and use the trained ranking function to assign scores to the documents d^{(i')}. Finally we rank the documents d^{(i')} in descending order of the scores. The learning problem described above is the listwise approach to learning to rank.

By contrast, in the pairwise approach, a new training data set T' is created from T, in which each feature vector pair x^{(i)}_j and x^{(i)}_k (j ≠ k) forms a new instance; +1 is assigned to the pair if y^{(i)}_j is larger than y^{(i)}_k, and −1 otherwise. The training data T' is thus a data set for binary classification, and a classification model such as SVM can be created.

The permutation probability of permutation \pi given the list of scores s is defined as
P_s(\pi) = \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})},
where s_{\pi(j)} is the score of the object ranked at position j of \pi. For example, for ranking scores s = (s_1, s_2, s_3) and permutation \pi = \langle 1, 2, 3 \rangle:
P_s(\pi) = \frac{\phi(s_1)}{\phi(s_1) + \phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_2)}{\phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_3)}{\phi(s_3)}.
17. Permutation Probability
• Objects: {A, B, C}; permutations: ABC, ACB, BAC, BCA, CAB, CBA
• Suppose a ranking function assigns scores sA, sB and sC to the objects
• Permutation Probability: the likelihood of a permutation
• P(ABC) > P(CBA) if sA > sB > sC
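A minimal sketch of the permutation probability (using φ = exp; the scores are made up for illustration):

```python
import math
from itertools import permutations

def permutation_probability(scores, perm, phi=math.exp):
    """P_s(pi): product over positions j of
    phi(score at position j) / sum of phi over the objects at positions j..n."""
    p = 1.0
    for j in range(len(perm)):
        p *= phi(scores[perm[j]]) / sum(phi(scores[i]) for i in perm[j:])
    return p

s = [3.0, 2.0, 1.0]  # sA > sB > sC (hypothetical scores)
p_abc = permutation_probability(s, (0, 1, 2))  # permutation ABC
p_cba = permutation_probability(s, (2, 1, 0))  # permutation CBA
print(p_abc > p_cba)  # True: the order agreeing with the scores is more likely

# The six permutation probabilities form a distribution over permutations.
total = sum(permutation_probability(s, p) for p in permutations(range(3)))
print(abs(total - 1.0) < 1e-9)  # True
```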
18. Top One Probability
• Objects: {A, B, C}; permutations: ABC, ACB, BAC, BCA, CAB, CBA
• Suppose a ranking function assigns scores sA, sB and sC to the objects
• The top one probability of an object is the probability of its being ranked on the top, given the scores of all the objects
• P(A) = P(ABC) + P(ACB); P(B) = P(BAC) + P(BCA); P(C) = P(CBA) + P(CAB)
• Notice that computed this way, the n top one probabilities still require calculating n! permutation probabilities
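The paper shows the n! sum is unnecessary: the top one probability has the closed form P_s(j) = φ(s_j) / Σ_k φ(s_k), i.e. a softmax over the scores. A sketch verifying this against the brute-force definition (scores are made up):

```python
import math
from itertools import permutations

def permutation_probability(scores, perm, phi=math.exp):
    """P_s(pi) as defined on the previous slide."""
    p = 1.0
    for j in range(len(perm)):
        p *= phi(scores[perm[j]]) / sum(phi(scores[i]) for i in perm[j:])
    return p

def top_one_probability(scores, j, phi=math.exp):
    """Closed form from the paper: a softmax over the scores."""
    return phi(scores[j]) / sum(phi(s) for s in scores)

s = [3.0, 2.0, 1.0]  # sA, sB, sC
# Brute force: P(A) = P(ABC) + P(ACB), i.e. sum over all permutations
# that put object 0 on top.
brute = sum(permutation_probability(s, p)
            for p in permutations(range(3)) if p[0] == 0)
print(abs(brute - top_one_probability(s, 0)) < 1e-12)  # True
```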
19. Listwise Loss Function
• With the use of top one probability, given two lists of scores, we can use any metric to represent the distance between the two score lists
• For example, when we use Cross Entropy as the metric, the listwise loss function becomes L(y, z) = −Σ_j P_y(j) log P_z(j)
• Ground Truth: ABCD vs. ranking output: ACBD or ABDC
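A sketch of the cross-entropy listwise loss over top one probabilities (the score lists here are hypothetical: `truth` induces the order ABCD, `swapped` induces ACBD):

```python
import math

def top_one_probs(scores):
    """Softmax: top one probability of each object given its score."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_loss(y_scores, z_scores):
    """Cross entropy between the two top-one distributions:
    L(y, z) = -sum_j P_y(j) * log P_z(j)."""
    py = top_one_probs(y_scores)
    pz = top_one_probs(z_scores)
    return -sum(p * math.log(q) for p, q in zip(py, pz))

truth = [3.0, 2.0, 1.0, 0.0]    # induces ground-truth order ABCD
swapped = [3.0, 1.0, 2.0, 0.0]  # induces order ACBD
# A ranking that disagrees with the ground truth incurs a larger loss.
print(listwise_loss(truth, truth) < listwise_loss(truth, swapped))  # True
```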
20. ListNet
• Learning method: ListNet
• Optimizes the listwise loss function based on top one probability, using a Neural Network as the model and Gradient Descent as the optimization algorithm
• A Linear Network Model is used for simplicity: y = w^T x + b
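A minimal single-query training sketch (not the paper's code; `listnet_step`, the learning rate and the toy data are assumptions). For the linear model z_j = w·x_j, the gradient of the cross-entropy listwise loss with respect to w is Σ_j (P_z(j) − P_y(j)) x_j:

```python
import math

def top_one_probs(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_step(w, x_list, y_scores, lr=0.1):
    """One gradient-descent step on a single query (linear model, no bias)."""
    z_scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in x_list]
    py = top_one_probs(y_scores)
    pz = top_one_probs(z_scores)
    # grad_d = sum_j (P_z(j) - P_y(j)) * x_jd
    grad = [sum((pz[j] - py[j]) * x_list[j][d] for j in range(len(x_list)))
            for d in range(len(w))]
    return [wi - lr * g for wi, g in zip(w, grad)]

def loss(w, x_list, y_scores):
    z = [sum(wi * xi for wi, xi in zip(w, x)) for x in x_list]
    py, pz = top_one_probs(y_scores), top_one_probs(z)
    return -sum(p * math.log(q) for p, q in zip(py, pz))

# Toy query: 3 documents, 2 features, ground-truth scores 2 > 1 > 0.
x_list = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
y_scores = [2.0, 1.0, 0.0]
w = [0.0, 0.0]
before = loss(w, x_list, y_scores)
for _ in range(100):
    w = listnet_step(w, x_list, y_scores)
print(loss(w, x_list, y_scores) < before)  # True: gradient descent reduces the loss
```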
22. Ranking Accuracy
• ListNet vs. RankNet, Ranking SVM, RankBoost
• 3 datasets: TREC 2003, OHSUMED and CSearch
• TREC 2003: relevance judgments (Relevant and Irrelevant), 20 features extracted
• OHSUMED: relevance judgments (Definitely Relevant, Positively Relevant and Irrelevant), 30 features
• CSearch: relevance judgments from 4 ('Perfect Match') to 0 ('Bad Match'), 600 features
• Evaluation measures: Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP)
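For reference, a sketch of NDCG@k using one common gain/discount form, (2^rel − 1) / log2(i + 1); the relevance lists below are made up:

```python
import math

def dcg(relevances, k):
    """DCG@k with gain (2^rel - 1) and log2 position discount."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, k):
    """Normalized DCG: divide by the DCG of the ideal (sorted) ranking."""
    best = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / best if best > 0 else 0.0

# Hypothetical ranked lists with graded labels (2 = relevant ... 0 = bad):
print(ndcg([2, 1, 0, 0], k=4))  # 1.0: ideal order
print(ndcg([2, 0, 1, 0], k=4))  # < 1.0: the grade-1 document is ranked too low
```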
26. Conclusion
• Discussed:
  • Learning to Rank
  • The pairwise approach and its drawbacks
  • The listwise approach, which outperforms the existing pairwise approaches
• Evaluation of the paper:
  • A Linear Neural Network model is used. What about a non-linear model?
  • The listwise loss function is the key issue (probability models)
27. References
• Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), Zoubin Ghahramani (Ed.). ACM, New York, NY, USA, 129-136. DOI=10.1145/1273496.1273513 http://doi.acm.org/10.1145/1273496.1273513
• Hang Li: A Short Introduction to Learning to Rank. IEICE Transactions 94-D(10): 1854-1862 (2011)
• Learning to Rank. Hang Li. Microsoft Research Asia. ACL-IJCNLP 2009 Tutorial. Aug. 2, 2009. Singapore