3. A two-armed risky bandit task
March, J. G. (1996). Learning to be risk averse. Psychological Review, 103(2), 309-319.
Denrell, J. (2007). Adaptive learning and risk taking. Psychological Review, 114(1), 177.
Hertwig, R., Barron, G., Weber, E. U., & Erev, I. (2004). Decisions from experience and the effect of rare events in risky choice. Psychological Science, 15(8), 534-539.
[Figure: panels a-d contrasting a risky but higher-payoff option with a safe but lower-payoff option across trials (the decision horizon).]
Consider a decision-maker facing a repeated choice between a safe (i.e. certain) alternative 𝑠 and a risky (i.e. uncertain) alternative 𝑟 over 𝑇 trials. The decision-maker's goal is to maximise their total payoff over the trials. Because the time horizon is limited, there is a trade-off between exploration and exploitation.
Though this task setting might seem artificial, it captures the basic principles underlying the exploration-exploitation dilemma and decision-making from experience, which bear on real-life situations ranging from choosing restaurants, investing in stocks, and finding mates to developing new technologies and innovations.
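To make the setup concrete, here is a minimal Python sketch of such a two-armed bandit environment. The safe payoff of 10 echoes the figure above, and 70 trials matches the task screen shown later; the risky option's mean and variance are illustrative assumptions, not parameters from the source.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def draw_payoff(choice: str) -> float:
    """One trial's payoff. 's' (safe) always pays 10; 'r' (risky) pays a
    noisy amount with a higher mean, so it is better in expectation but
    uncertain on any single trial (parameter values are assumptions)."""
    if choice == "s":
        return 10.0
    return rng.normal(loc=12.0, scale=8.0)

# A decision-maker faces T such trials and tries to maximise the total
# payoff, trading off exploring the risky option against exploiting
# whichever option currently looks better.
T = 70
total_payoff = sum(draw_payoff("r") for _ in range(T))
```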
4. Reinforcement learning model
(i.e. the baseline asocial learning model)
[Figure: a worked example across trials t = 1 to t = 4. At each trial, Q-values are mapped to choice probabilities, a choice is made and a payoff is received (e.g. 125 for risky, 10 for safe), and the Q-values are updated.]
5. Rescorla-Wagner Rule for Value Updating

Q_{t+1} = (1 - α) × Q_t + α × payoff_t

(e.g. the Q-values at t = 2 are a weighted mix of the Q-values at t = 1 and the payoff obtained at t = 1, here 125.)
A decision-maker updates the value of choosing each of the two alternatives at time t following the Rescorla-Wagner rule. α is the learning rate (i.e. step-size parameter), which controls the step size of belief updating: the larger α, the more weight is given to recent experience (i.e. myopic learning). The Q-value of the unchosen option is left unchanged.
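As a minimal sketch of this update rule (the payoff of 125 echoes the slide's example; α = 0.3 and the initial Q-values are illustrative assumptions):

```python
def rescorla_wagner_update(q, chosen, payoff, alpha):
    """One Rescorla-Wagner step: Q_{t+1} = (1 - alpha) * Q_t + alpha * payoff.

    Only the chosen option's Q-value moves; the unchosen option's
    Q-value stays as it is, as noted above.
    """
    q = dict(q)  # copy so the caller's Q-values are not mutated
    q[chosen] = (1 - alpha) * q[chosen] + alpha * payoff
    return q

# Example echoing the slide: a risky payoff of 125 at t = 1.
q_t1 = {"risky": 0.0, "safe": 10.0}
q_t2 = rescorla_wagner_update(q_t1, chosen="risky", payoff=125.0, alpha=0.3)
# q_t2["risky"] == 0.7 * 0.0 + 0.3 * 125.0 == 37.5; q_t2["safe"] is unchanged.
```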
6. The 'Softmax' Choice Rule

Q-values (decision values) are translated into choice probabilities via the 'softmax' transformation:

P_i = exp(β Q_i) / Σ_k exp(β Q_k)
Q-values are then translated into choice probabilities through a softmax (or multinomial-logistic) function. β is the inverse temperature, regulating how sensitive the choice probability is to differences in Q-values. As β approaches 0, choice becomes random (i.e. highly explorative). Conversely, a large β makes choices almost deterministic in favour of the option with the highest Q-value (i.e. highly exploitative).
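A direct sketch of this rule, demonstrating both limits (the Q-values and β values below are illustrative assumptions):

```python
import numpy as np

def softmax_choice_prob(q_values, beta):
    """P_i = exp(beta * Q_i) / sum_k exp(beta * Q_k).

    Subtracting max(Q) before exponentiating is a standard numerical
    stabiliser and leaves the probabilities unchanged.
    """
    z = beta * (np.asarray(q_values, dtype=float) - np.max(q_values))
    e = np.exp(z)
    return e / e.sum()

q = [10.0, 37.5]                    # illustrative Q-values
softmax_choice_prob(q, beta=0.0)    # [0.5, 0.5]: uniformly random (explorative)
softmax_choice_prob(q, beta=2.0)    # near [0.0, 1.0]: almost deterministic (exploitative)
```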
7. A collective learning situation
[Figure: the task screen. Each player sees their own payoff history for the safe and risky options over time ("Round: 2/70. Make a next choice!") together with a frequency-based social cue, e.g. "4 people chose this" vs. "2 people chose this".]
Let's consider a collective learning situation in which multiple individuals play the task simultaneously and obtain social information while doing so. A frequency-based social cue shows how many people chose each option in the preceding round; the others' payoff information is kept private.
8. Social learning model
Toyokawa et al. 2017; 2019; Aplin et al. 2017; McElreath et al. 2005; 2008; Deffner et al. 2020
Each individual's choice probability mixes reward-based (asocial) reinforcement learning, weighted by 1 - σ, with frequency-based social learning, weighted by σ:

P_i = (1 - σ) × exp(β Q_i) / Σ_k exp(β Q_k) + σ × F_i^θ / Σ_k F_k^θ

that is, Choice_probability = (1 - σ) × Asocial_choice + σ × Social_influence,

where F_i is the number of other individuals who chose option i in the preceding round, σ (the social learning weight) regulates reliance on social information versus private payoff-based learning, and θ is the conformity exponent.
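Putting the two components together, a minimal sketch of this mixed choice rule (the frequencies 4 and 2 echo the social cue on the example screen; all other parameter values are illustrative assumptions):

```python
import numpy as np

def social_choice_prob(q_values, freq, beta, sigma, theta):
    """P_i = (1 - sigma) * softmax(beta * Q)_i + sigma * F_i**theta / sum_k F_k**theta.

    sigma weighs social information against private payoff-based learning;
    theta is the conformity exponent (theta > 1 disproportionately favours
    the majority option). Assumes at least one other individual made a
    choice, so the frequencies are not all zero.
    """
    z = beta * (np.asarray(q_values, dtype=float) - np.max(q_values))
    asocial = np.exp(z) / np.exp(z).sum()
    f = np.asarray(freq, dtype=float) ** theta
    social = f / f.sum()
    return (1 - sigma) * asocial + sigma * social

# 4 vs. 2 others chose each option, as on the example screen; other values assumed.
social_choice_prob([10.0, 37.5], freq=[4, 2], beta=0.5, sigma=0.4, theta=2.0)
```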