Talk from the REVEAL workshop at RecSys 2019 on 2019-09-20 in Copenhagen, Denmark. The slides were primarily made by Ajinkya More and the paper was also joint work with Linas Baltrunas and Nikos Vlassis.
The paper is available here: https://drive.google.com/open?id=1oaM5Fu2bJ0GzMC09yyqjA7eZD9axzSKb
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Bandits with Large Action Spaces
1. Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Bandits with Large Action Spaces
Ajinkya More, Linas Baltrunas, Nikos Vlassis, Justin Basilico
REVEAL Workshop at RecSys 2019
Copenhagen, Denmark, Sept 20, 2019
2. Introduction
Contextual bandits are a common modeling tool in modern personalization systems (e.g. artwork personalization at Netflix, playlist ordering on Spotify, ad ranking in web search)
Evaluating bandit algorithms via online A/B testing is expensive
Off-policy evaluation is a preferred alternative
Goal: A sensitive, efficient estimator that aligns with online metrics
3. Challenge
Several metrics proposed in the literature have limitations in personalization settings:
Replay, IPS, SNIPS - High variance (IPS and SNIPS are sketched below)
Few matches when there are many arms
Require very large amounts of data
Direct Method, Doubly Robust - Require a reward model
This model is used to evaluate other models
Introduces modeling bias
E.g. what features to use in the reward model?
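To make the variance issue concrete, here is a minimal sketch of the IPS and SNIPS baselines for a deterministic target policy; the function and argument names are my own, not from the paper:

```python
import numpy as np

def ips_snips(rewards, logged_actions, target_actions, logging_probs):
    """IPS and SNIPS estimates of a deterministic target policy's value.

    rewards[t]:        observed reward at trial t
    logged_actions[t]: action chosen by the logging policy
    target_actions[t]: action the target policy would choose for x_t
    logging_probs[t]:  pi(x_t, a_pi), the logging propensity
    """
    r = np.asarray(rewards, dtype=float)
    p = np.asarray(logging_probs, dtype=float)
    # Full credit only on exact matches -- rare when there are many
    # arms, which is the source of the high variance noted above.
    match = (np.asarray(target_actions) == np.asarray(logged_actions)).astype(float)
    w = match / p
    ips = float(np.mean(r * w))
    snips = float(np.sum(r * w) / np.sum(w)) if np.sum(w) > 0 else float("nan")
    return ips, snips
```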
4. Overview
This paper:
Introduce a new off-policy evaluation metric: Recap
Trades off bias for significantly lowering variance in off-policy evaluation
Model free - No need to define reward model
Study properties via simulated data
Investigate alignment with online metric
5. Notation
Let A = {1, 2, ..., K} be the set of actions.
In trial t ∈ {1, 2, ..., n}, the environment chooses a context xt and in
response the agent chooses an arm at ∈ A.
The environment then reveals a reward ra,t ∈ [0, 1].
Denote the probability of the logging policy taking action a for
context x as π(x, a) and the logged action at trial t as aπ.
We want to estimate the expected reward E[ra] of a deterministic
target policy τ given data from a potentially different logging policy.
We assume the target policy is mediated by a scorer or a ranker, that
assigns a real valued score s(x, a) to each arm and selects the arm
argmaxa∈As(x, a).
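As a small illustration of this notation (a sketch of my own, not code from the paper), the deterministic target policy is just an argmax over the scorer's outputs:

```python
import numpy as np

def target_action(scores):
    """Given the target policy's scores s(x, a) for all arms a in A
    (an array indexed by arm), return argmax_a s(x, a)."""
    return int(np.argmax(np.asarray(scores)))

# e.g. for one context x with 4 arms:
target_action([0.10, 0.05, 0.30, 0.20])  # -> 2
```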
7. Recap

$$\hat{V}^{\tau}_{\text{Recap}} = \frac{\sum_{t=1}^{n} r_{a_\pi} \frac{RR}{\pi(x_t, a_\pi)}}{\sum_{t=1}^{n} \frac{RR}{\pi(x_t, a_\pi)}} \qquad (3)$$

where

$$RR = \frac{1}{\sum_{a \in A} I(s(x, a) \ge s(x, a_\pi))}$$

is simply the reciprocal rank of the logged action according to the scores assigned by the target policy.
It assigns a "partial credit" when the logged action and target policy action are different (which may be seen as "stochastifying" a deterministic policy).
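A minimal sketch of equation (3) in code, assuming the logged-data layout described in the docstring (names are mine, not from the paper):

```python
import numpy as np

def recap(rewards, logged_actions, logging_probs, scores):
    """Recap estimate of the target policy's value, per equation (3).

    scores[t] holds the target policy's scores s(x_t, a) for all arms;
    logged_actions[t] indexes into scores[t].
    """
    r = np.asarray(rewards, dtype=float)
    p = np.asarray(logging_probs, dtype=float)
    rr = np.empty(len(r))
    for t, (s, a) in enumerate(zip(scores, logged_actions)):
        s = np.asarray(s, dtype=float)
        # Reciprocal rank of the logged action under the target scores:
        # RR = 1 / sum_a I(s(x, a) >= s(x, a_pi))
        rr[t] = 1.0 / np.sum(s >= s[a])
    w = rr / p  # per-trial weight RR / pi(x_t, a_pi)
    # Self-normalized ratio, analogous to SNIPS but with partial credit.
    return float(np.sum(r * w) / np.sum(w))
```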
8. Recap(m)
More generally, for $m \in \mathbb{N}$ (or even $m \in \mathbb{R}_+$), we define

$$\hat{V}^{\tau}_{\text{Recap}(m)} = \frac{\sum_{t=1}^{n} r_{a_\pi} \frac{RR^m}{\pi(x_t, a_\pi)}}{\sum_{t=1}^{n} \frac{RR^m}{\pi(x_t, a_\pi)}} \qquad (4)$$

$\hat{V}^{\tau}_{\text{Recap}(1)} = \hat{V}^{\tau}_{\text{Recap}}$
As $m \to \infty$, $\hat{V}^{\tau}_{\text{Recap}(m)} \to \hat{V}^{\tau}_{\text{SNIPS}}$
Thus, $m$ allows us to smoothly trade off bias for variance, though we find $m = 1$ works well in practice.
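The earlier sketch generalizes to equation (4) by raising the reciprocal rank to the power $m$; large $m$ concentrates all the weight on exact rank-1 matches, which is how the SNIPS-like limit arises (again an illustrative sketch, not the authors' code):

```python
import numpy as np

def recap_m(rewards, logged_actions, logging_probs, scores, m=1.0):
    """Recap(m) estimate per equation (4); m = 1 recovers Recap.

    As m grows, RR**m stays 1 only when the logged action is ranked
    first (an exact match), so the estimator approaches SNIPS.
    """
    r = np.asarray(rewards, dtype=float)
    p = np.asarray(logging_probs, dtype=float)
    rr = np.array([
        1.0 / np.sum(np.asarray(s, dtype=float) >= np.asarray(s, dtype=float)[a])
        for s, a in zip(scores, logged_actions)
    ])
    w = rr ** m / p
    return float(np.sum(r * w) / np.sum(w))
```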
9. Simulations
Set up (a code sketch of this setup follows below):
1. The rewards for arm $a$ are drawn from a Bernoulli distribution with parameter $\theta_a = 1/a$.
2. The arm scores $s(x, a)$ are drawn from a normal distribution $N(\mu, \sigma)$, where $\sigma = 0.1$ and $\mu$ is drawn from a uniform distribution with support $[0, 0.2]$.
3. In each trial, the logging policy samples the action from a multinomial distribution over the arms.
4. The target policy is $\arg\max_{a \in A} s(x, a)$.
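A hedged reconstruction of this setup in code; the slide does not specify the number of arms, the number of trials, the logging policy's multinomial weights, or whether $\mu$ is drawn once per arm, so the choices below (uniform logging weights, per-arm $\mu$, K = 100, n = 10,000) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 100, 10_000                      # number of arms and trials (assumed)
arms = np.arange(1, K + 1)
theta = 1.0 / arms                      # Bernoulli reward parameter theta_a = 1/a
mu = rng.uniform(0.0, 0.2, size=K)      # per-arm score means, mu ~ U[0, 0.2] (assumed per arm)
logging_probs = np.full(K, 1.0 / K)     # assumed uniform multinomial logging policy

rewards, logged, probs, scores = [], [], [], []
for _ in range(n):
    s = rng.normal(mu, 0.1)             # arm scores s(x, a) ~ N(mu_a, 0.1)
    a = int(rng.choice(K, p=logging_probs))
    rewards.append(rng.binomial(1, theta[a]))  # reward of the logged arm only
    logged.append(a)
    probs.append(logging_probs[a])
    scores.append(s)

# The target policy picks argmax_a s(x, a); feeding these logs to the
# recap/recap_m sketches above lets one compare the estimates against
# the target policy's true expected reward.
```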
10. Behavior in small action spaces
11. Behavior in large action spaces
12. Variance: Small number of arms
13. Variance: Medium number of arms
14. Variance: Large number of arms
17. Summary
Recap: An off-policy estimator for bandits based on MRR (mean reciprocal rank)
Trades off a small increase in bias (compared with IPS) for significantly reduced variance compared to both IPS and SNIPS
Recap uses all the available data, making it more efficient with small amounts of logged data or large numbers of arms
Doesn't require a reward model, unlike the Direct Method and Doubly Robust
Shows signs of alignment with online metrics
18. Ongoing Work
Analyze some theoretical properties of the estimator
Bias/Variance/Risk of the estimator
Closed form expressions
Asymptotic properties
Is the estimator consistent?
Deviation bounds
Training models to optimize Recap
Connection to Learning to Rank approaches