This document discusses neural semi-supervised learning under domain shift. It presents three research areas:
1) Learning across domains by selecting relevant source domain data for transfer learning using Bayesian optimization. Experimental results on sentiment analysis, POS tagging, and dependency parsing show this approach outperforms baselines.
2) Revisiting classic semi-supervised learning techniques like self-training, tri-training, and comparing them to recent advances. Experiments on sentiment analysis and POS tagging find tri-training works best.
3) The possibility of leveraging pre-trained language models for semi-supervised learning when the target task differs from the source task.
2. ‣ Across domains
(Ruder & Plank, EMNLP 2017, ACL 2018;
Howard* & Ruder*, ACL 2018)
‣ Across tasks
(Ruder et al., arXiv 2017;
Augenstein* & Ruder* et al., NAACL 2018)
‣ Across languages
(Ruder et al., JAIR 2018;
Søgaard, Ruder & Vulic, ACL 2018;
Ruder* & Cotterell* et al., EMNLP 2018;
Kementchedjhieva, Ruder et al., CoNLL 2018)
2
Research overview:
Transfer learning
* equal contribution
3. ‣ Across domains
(Ruder & Plank, EMNLP 2017, ACL 2018;
Howard* & Ruder*, ACL 2018)
3
Research overview:
Transfer learning
* equal contribution
Beware of non-i.i.d. data!
‣ We never know how well our models truly generalise if we
just test them on data of the same distribution.
‣ CIFAR-10 classifiers don’t even generalise to CIFAR-10 (Recht
et al., 2018).
‣ “A challenge to the community: we should evaluate on out-of-
distribution data or on a new task.”
- Percy Liang, DeepGen workshop, NAACL-HLT 2018
Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2018). Do CIFAR-10 Classifiers Generalize to CIFAR-10?
4. 4
Learning under Domain Shift
[Diagram: labeled source data, unlabeled target data, different task. Question: how to select the most relevant data?]
5. 5
Data setting 1:
Multiple source domains
Target domain
Source domains
Ruder, S., & Plank, B. (2017). Learning to select data for transfer learning with Bayesian Optimization. In
Proceedings of EMNLP 2017.
6. Why select data for domain adaptation at all?
Why don’t we just train on all source data?
‣ Prevent negative transfer for dissimilar domains.
‣ e.g. “electrifying” is positive in some domains, but negative in others
Existing approaches
‣ use a single similarity metric in isolation;
‣ focus on a single task.
6
Background
7. Intuition
‣ Different tasks and domains require different notions of
similarity.
Idea
‣ Learn a data selection policy using Bayesian Optimisation.
7
Our approach
8. 8
Our approach
[Diagram: training examples x1 … xn are scored by the selection policy S = φ(x)⊤w and sorted; the top m examples are selected.]
‣ Related: curriculum learning (Tsvetkov et al., 2016)
Tsvetkov, Y., Faruqui, M., Ling, W., & Dyer, C. (2016). Learning the Curriculum with Bayesian Optimization for
Task-Specific Word Representation Learning. In Proceedings of ACL 2016.
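As an illustration of the selection step, a minimal sketch (assumed, not the paper's implementation); featurize stands in for the similarity/diversity feature extraction described on the next slides.

```python
# Minimal sketch of the selection policy (illustrative; `featurize` is a
# placeholder for the similarity/diversity features on the next slides):
# score every source example with S = phi(x)^T w and keep the top m.
import numpy as np

def select_top_m(examples, featurize, w, m):
    phi = np.array([featurize(x) for x in examples])  # n x d feature matrix
    scores = phi @ w                                   # S = phi(x)^T w
    order = np.argsort(-scores)                        # sort by decreasing score
    return [examples[i] for i in order[:m]]
```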
9. ‣ Treat objective as a black box, which we iteratively approximate
‣ Use Bayesian Optimisation (BO) to obtain best parameter setting.
cf. Fang & Cohn (2017) who use RL for selecting data for active
learning
‣ Sample-efficient; only need about 100-200 samples to converge.
‣ BO is typically used for hyper-parameter tuning (Snoek et al.,
2012; Melis et al., 2018).
‣ Alternative: Learn latent permutation with Sinkhorn operator
(Adams and Zemel, 2011; Mena et al., 2018)
9
Learning the data selection policy
Fang, M., Li, Y., & Cohn, T. (2017). Learning how to Active Learn: A Deep Reinforcement Learning Approach. In Proceedings of EMNLP
2017.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of
NIPS 2012.
Melis, G., Dyer, C., & Blunsom, P. (2018). On the State of the Art of Evaluation in Neural Language Models. In Proceedings of ICLR 2018.
Mena, G. E., Belanger, D., Linderman, S., & Snoek, J. (2018). Learning Latent Permutations with Gumbel-Sinkhorn Networks. In
Proceedings of ICLR 2018.
11. Two important choices
‣ Surrogate model: used to approximate objective function; e.g.
Gaussian Process (GP)
‣ Acquisition function: propose new samples; trades off exploration
vs. exploitation; e.g. Expected Improvement (EI)
Procedure:
‣ Sample the next weight vector w_t by optimising the acquisition function u over the GP: w_t = argmax_w u(w | D_{1:t−1})
‣ Obtain a noisy validation score ŷ_t from the trained model
‣ Append the sample to D_{1:t} = {D_{1:t−1}, (w_t, ŷ_t)} and update the GP
11
Bayesian Optimisation
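For concreteness, a minimal sketch of this loop (an illustration, not the paper's implementation): a scikit-learn GP serves as the surrogate and Expected Improvement as the acquisition function; validation_score(w) is a hypothetical callback that trains a model on the data selected with w and returns its noisy validation score.

```python
# Minimal BO sketch (assumed setup, not the paper's code): GP surrogate +
# Expected Improvement; the acquisition argmax is approximated over random
# candidate weight vectors. `validation_score(w)` is a hypothetical callback.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, best_y):
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

def learn_selection_policy(validation_score, n_features, n_iters=100,
                           n_candidates=1000, seed=0):
    rng = np.random.default_rng(seed)
    W = [rng.uniform(-1, 1, n_features)]          # D_1: initial random weights
    y = [validation_score(W[0])]                  # noisy objective evaluation
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_iters):
        gp.fit(np.asarray(W), np.asarray(y))      # update GP on D_{1:t-1}
        cand = rng.uniform(-1, 1, size=(n_candidates, n_features))
        w_t = cand[np.argmax(expected_improvement(cand, gp, max(y)))]
        W.append(w_t)                             # w_t = argmax_w u(w | D_{1:t-1})
        y.append(validation_score(w_t))           # append (w_t, y_t) to D_{1:t}
    return W[int(np.argmax(y))]                   # best weight vector found
```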
12. ‣ Treat each source example and the entire target domain as distributions P and Q based on term and topic probabilities
12
Features
Similarity features (between P and Q): Jensen-Shannon divergence, Rényi divergence, Bhattacharyya distance, cosine similarity, Euclidean distance, variational distance
Diversity features: # word types, type-token ratio, entropy, Simpson's index, Rényi entropy, quadratic entropy
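For example, the Jensen-Shannon divergence feature could be computed roughly as follows (a sketch that assumes simple whitespace tokenisation and smoothing, not the paper's exact feature extraction):

```python
# Sketch of one similarity feature: JS divergence between a source example's
# term distribution P and the target domain's term distribution Q
# (whitespace tokenisation and add-epsilon smoothing are simplifying assumptions).
import numpy as np

def term_distribution(texts, vocab):
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros(len(vocab))
    for text in texts:
        for tok in text.split():
            if tok in index:
                counts[index[tok]] += 1
    probs = counts + 1e-12                      # smooth to avoid zero probabilities
    return probs / probs.sum()

def js_divergence(p, q):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# similarity feature of one source example x w.r.t. the whole target domain:
# -js_divergence(term_distribution([x], vocab), term_distribution(target_texts, vocab))
```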
13. 13
Data & Tasks
Three tasks:
Sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2007)
POS tagging and dependency parsing on the SANCL 2012 dataset (Petrov and McDonald, 2012)
Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, bollywood, boom-boxes and blenders: Domain
adaptation for sentiment classification. In Proceedings of ACL 2007.
Petrov, S., & McDonald, R. (2012). Overview of the 2012 shared task on parsing the web. In Notes of the First
Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
14. 14
Sentiment Analysis Results
Selecting 2,000 from 6,000 source domain examples
[Bar chart: accuracy (%) per target domain (Book, DVD, Electronics, Kitchen) for Random, JS divergence (examples), JS divergence (domain), Similarity (topics), Diversity, Similarity + diversity, and All source data (6,000 examples)]
‣ Selecting relevant data is useful when domains are very different.
15. 15
POS Tagging Results
Selecting 2,000 from 14-17.5k source domain examples
[Bar chart: accuracy (%) per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ) for JS divergence (examples), Similarity (terms), Diversity, Similarity + diversity, and All source data]
‣ Learned data selection outperforms static selection, but is less
useful when domains are very similar.
17. 17
Cross-Model Transfer Results
Training a BiLSTM with the policy learned by a BiLSTM and a
Structured Perceptron for POS tagging
[Bar chart: accuracy (%) per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ) for the BiLSTM policy (similarity + diversity) vs. the Structured Perceptron policy (similarity + diversity)]
‣ The data selection policy can be learned with a cheap model and
transferred to more expensive models.
18. ‣ Bayesian Optimisation is an efficient way to optimise
an expensive function, e.g. order of training examples.
‣ Different domains & tasks have different notions of
similarity.
‣ Preferring certain examples is mainly useful when
domains are dissimilar.
‣ Diversity complements similarity.
‣ The learned policy transfers (to some extent) across
models, tasks, and domains.
18
Takeaways
…
19. 19
Learning under Domain Shift
[Diagram: labeled source data, unlabeled target data. Question: how well does SSL work with NNs?]
20. 20
Data setting 2:
Single source domain
Target domain
Source domain
Ruder, S., & Plank, B. (2018). Strong Baselines for Neural Semi-supervised Learning under Domain Shift. In
Proceedings of ACL 2018.
21. ‣ State-of-the-art domain adaptation approaches
‣ leverage task-specific features
‣ evaluate on proprietary datasets or on a single
benchmark
‣ Only compare against weak baselines
‣ Almost none evaluate against approaches from the
extensive semi-supervised learning (SSL) literature
21
Learning under Domain Shift
22. ‣ How do classics in SSL compare to recent advances?
‣ Can we combine the best of both worlds?
‣ How well do these approaches work on out-of-distribution
data?
22
Revisiting Semi-Supervised Learning
Classics in a Neural World
23. • Self-training
• (Co-training*)
• Tri-training
• Tri-training with disagreement
Bootstrapping algorithms
* used in concurrent work: Wu, J., Li, L., & Wang, W. Y. (2018). Reinforced Co-Training. In Proceedings of
NAACL-HLT 2018.
24. 1. Train model on labeled data.
2. Use confident predictions on unlabeled data
as training examples. Repeat.
24
Self-training
- Error amplification
‣ Mixed success in NLP. Some recent success in CV
(Radosavovic et al., 2018).
Radosavovic, I., Dollár, P., Girshick, R., Gkioxari, G., & He, K. (2018). Data Distillation: Towards
Omni-Supervised Learning. In Proceedings of CVPR 2018.
25. ‣ Calibration
‣ Output probabilities in neural networks are poorly
calibrated.
‣ Throttling (Abney, 2007), i.e. selecting the top n highest-confidence unlabeled examples, works best.
‣ Online learning
‣ Training until convergence on labeled data and then on
unlabeled data works best.
25
Self-training variants
Radosavovic, I., Dollár, P., Girshick, R., Gkioxari, G., & He, K. (2018). Data Distillation: Towards Omni-
Supervised Learning. In Proceedings of CVPR 2018.
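Putting the last two slides together, a minimal self-training sketch (illustrative only: a scikit-learn classifier stands in for the actual model, and the schedule is simplified), with throttling via the top-n most confident predictions:

```python
# Minimal self-training sketch with throttling (assumed setup, not the paper's):
# train on labeled data, then repeatedly add the top-n most confident
# predictions on unlabeled data as pseudo-labels.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, n=100, rounds=5):
    model = LogisticRegression(max_iter=1000)
    X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(rounds):
        model = clone(model).fit(X, y)
        if len(pool) == 0:
            break
        probs = model.predict_proba(pool)
        conf = probs.max(axis=1)
        pred = model.classes_[probs.argmax(axis=1)]
        top = np.argsort(-conf)[:n]              # throttling: top-n most confident
        X = np.vstack([X, pool[top]])
        y = np.concatenate([y, pred[top]])       # pseudo-labels as training examples
        pool = np.delete(pool, top, axis=0)
    return model
```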
26. 26
1. Train three models on bootstrapped samples.
2. Use predictions on unlabeled data for third if two agree.
[Diagram: two models predict y = 1 for an unlabeled example x, which is then added as a training example for the third model.]
Tri-training
27. 27
Tri-training
1. Train three models on bootstrapped samples.
2. Use predictions on unlabeled data for third if two agree.
3. Final prediction: majority voting
[Diagram: the three models predict y = 1, y = 1, and y = 0 for an example x; majority voting yields y = 1.]
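A minimal tri-training sketch (illustrative: scikit-learn classifiers stand in for the neural models, and integer class labels 0..K−1 are assumed):

```python
# Minimal tri-training sketch (assumed setup): three models trained on
# bootstrap samples; an unlabeled example is pseudo-labeled for model i
# whenever the other two models agree on its label.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def tri_train(X_lab, y_lab, X_unlab, rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(3):                           # 1. bootstrap samples
        idx = rng.integers(0, len(X_lab), len(X_lab))
        models.append(LogisticRegression(max_iter=1000).fit(X_lab[idx], y_lab[idx]))
    for _ in range(rounds):
        preds = [m.predict(X_unlab) for m in models]
        new_models = []
        for i in range(3):
            j, k = (i + 1) % 3, (i + 2) % 3
            agree = preds[j] == preds[k]         # 2. the other two models agree
            X = np.vstack([X_lab, X_unlab[agree]])
            y = np.concatenate([y_lab, preds[j][agree]])
            new_models.append(clone(models[i]).fit(X, y))
        models = new_models
    return models

def predict(models, X):                          # 3. final prediction: majority vote
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```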
29. ‣ Sampling unlabeled data
‣ Producing predictions for all unlabeled examples is
expensive
‣ Sample a fixed number of unlabeled examples instead
‣ Confidence thresholding
‣ Not effective for classic approaches, but essential for
our method
29
Tri-training hyper-parameters
30. 30
[Diagram: two of the model's output layers predict y = 1 for an unlabeled example x, which becomes a pseudo-labeled training example for the third.]
Multi-task tri-training
1. Train one model with 3 objective functions.
2. Use predictions on unlabeled data for the third if two agree.
3. Restrict the final layers to use different representations.
4. Train the third objective function only on pseudo-labeled data to bridge the domain shift.
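A minimal sketch of the model structure (the architecture and the orthogonality penalty shown here are assumptions for illustration, not the exact model of the paper): one shared encoder with three output heads playing the roles of the three tri-training models.

```python
# Minimal multi-task tri-training sketch (assumed architecture): a shared
# encoder with three softmax heads; an orthogonality penalty on two heads is
# one way to push them towards different representations (step 3 above).
import torch
import torch.nn as nn

class MTTri(nn.Module):
    def __init__(self, input_dim, hidden_dim, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, n_classes) for _ in range(3))

    def forward(self, x):
        h = self.encoder(x)
        return [head(h) for head in self.heads]   # one set of logits per head

def orthogonality_penalty(model):
    # Penalise overlap between the representations used by heads 1 and 2.
    w1, w2 = model.heads[0].weight, model.heads[1].weight
    return torch.linalg.norm(w1 @ w2.t()) ** 2
```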
32. 32
Data & Tasks
Two tasks:
Sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2007)
POS tagging on the SANCL 2012 dataset (Petrov and McDonald, 2012)
33. Sentiment Analysis Results
[Bar chart: accuracy averaged over 4 target domains for VFAE*, DANN*, Asym*, Source only, Self-training, Tri-training, Tri-training-Disagr., and MT-Tri]
* result from Saito et al. (2017)
33
‣ Multi-task tri-training slightly outperforms tri-training, but
has higher variance.
34. 34
POS Tagging Results
Trained on 10% labeled data (WSJ)
[Bar chart: accuracy averaged over 5 target domains for Source (+embeds), Self-training, Tri-training, Tri-training-Disagr., and MT-Tri]
‣ Tri-training with disagreement works best with little data.
35. 35
POS Tagging Results
* result from Schnabel & Schütze (2014)
Trained on full labeled data (WSJ)
[Bar chart: accuracy averaged over 5 target domains for TnT, Stanford*, Source (+embeds), Tri-training, Tri-training-Disagr., and MT-Tri]
‣ Tri-training works best in the full data setting.
36. 36
POS Tagging Analysis
Accuracy on out-of-vocabulary (OOV) tokens
[Chart: accuracy on OOV tokens for Src, Tri, and MT-Tri, together with the % of OOV tokens, per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs)]
‣ Classic tri-training works best on OOV tokens.
‣ MT-Tri does worse than source-only baseline on OOV.
37. 37
POS accuracy per binned log frequency
[Chart: accuracy delta vs. the source-only baseline for Tri and MT-Tri across binned log frequencies]
‣ Tri-training works best on low-frequency tokens (leftmost
bins).
POS Tagging Analysis
38. 38
POS Tagging Analysis
Accuracy on unknown word-tag (UWT) tokens
[Chart: accuracy on UWT tokens for Src, Tri, MT-Tri, and FLORS*, together with the UWT rate, per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs)]
‣ No bootstrapping method works well on unknown word-tag combinations; these are very difficult cases.
‣ The less lexicalized FLORS approach is superior.
* result from Schnabel & Schütze (2014)
39. ‣ Classic tri-training works best: outperforms recent
state-of-the-art methods for sentiment analysis.
‣ We address the drawback of tri-training (space &
time complexity) via the proposed MT-Tri model
‣ MT-Tri works best on sentiment, but not for POS.
‣ Importance of:
‣ Comparing neural methods to classics (strong
baselines)
‣ Evaluation on multiple tasks & domains
39
Takeaways
Tri-training
40. 40
Learning under Domain Shift
[Diagram: labeled source data (different task), unlabeled target data. Question: how can we leverage pretrained LMs?]
41. 41
Data setting 3:
Different target task
Target domain
Target task
Source domain
Source task
Howard, J.*, & Ruder, S.* (2018). Universal Language Model Fine-tuning for Text Classification. In
Proceedings of ACL 2018.
* equal contribution
42. ‣ Best practice: initialise first layer with pretrained word
embeddings
‣ Recent approaches (McCann et al., 2017; Peters et al.,
2018): Pretrained embeddings as fixed features. Peters et
al. (2018) is task-specific.
‣ Why not initialise remaining parameters?
‣ Dai and Le (2015) first proposed fine-tuning a LM.
However: No pretraining. Naive fine-tuning.
42
Transfer learning for NLP: status quo
McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in Translation: Contextualized Word
Vectors. In Proceedings of NIPS 2017.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep
contextualized word representations. In Proceedings of NAACL-HLT 2018.
Dai, A. M., & Le, Q. V. (2015). Semi-supervised Sequence Learning. In Proceedings of NIPS 2015.
43. 43
Universal Language Model Fine-tuning
3 step recipe:
1. Train language model (LM) on general domain data.
2. Fine-tune LM on target data.
3. Train classifier on labeled data on top of LM.
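As a toy, self-contained illustration of this recipe (random placeholder data and a tiny LSTM stand in for the AWD-LSTM, WikiText-103 and the real target datasets; the fine-tuning tricks come on the following slides):

```python
# Toy sketch of the 3-step recipe (all data below is random placeholder data;
# a tiny LSTM stands in for the AWD-LSTM used in the paper).
import torch
import torch.nn as nn

VOCAB, EMB, HID, CLS = 1000, 32, 64, 2

class LM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h), h

def train_lm(lm, corpus, epochs=1):
    opt, loss_fn = torch.optim.Adam(lm.parameters()), nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits, _ = lm(corpus[:, :-1])                  # predict the next token
        loss = loss_fn(logits.reshape(-1, VOCAB), corpus[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

general = torch.randint(0, VOCAB, (64, 50))             # stand-in for WikiText-103
target_unlab = torch.randint(0, VOCAB, (64, 50))        # stand-in for target-domain text
target_x = torch.randint(0, VOCAB, (32, 50))            # small labeled target set
target_y = torch.randint(0, CLS, (32,))

lm = LM()
train_lm(lm, general)                                   # 1. general-domain LM pretraining
train_lm(lm, target_unlab)                              # 2. fine-tune LM on target data
head = nn.Linear(HID, CLS)                              # 3. classifier on top of the LM
opt = torch.optim.Adam(list(lm.parameters()) + list(head.parameters()))
for _ in range(3):
    _, h = lm(target_x)
    loss = nn.functional.cross_entropy(head(h[:, -1]), target_y)
    opt.zero_grad(); loss.backward(); opt.step()
```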
44. ‣ Model: AWD-LSTM language model
‣ 3-layer LSTM
‣ Tuned dropout hyperparameters
‣ Data: WikiText-103
‣ 103 million tokens of Wikipedia text
‣ Train for ~24 hours on a Tesla V100
‣ Recently: deeper models, trained on more data, for longer
(Radford et al., 2018)
44
Language Model Pretraining
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by
Generative Pre-Training.
45. ‣ Discriminative fine-tuning
Different layers capture different types of information. They should be fine-tuned to different extents, with a separate learning rate η^l per layer l:
θ^l_t = θ^l_{t−1} − η^l · ∇_{θ^l} J(θ)
‣ Slanted triangular learning rates
The model should converge quickly to a suitable region and then refine its parameters.
45
Language Model Fine-tuning
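A minimal sketch of these two tricks (the hyper-parameter values and the layer_groups structure are assumptions for illustration):

```python
# Sketch of discriminative fine-tuning (a separate learning rate per layer)
# and slanted triangular learning rates; values like decay=2.6 are illustrative.
import torch

def discriminative_param_groups(layer_groups, base_lr=4e-3, decay=2.6):
    # Lower layers get smaller learning rates: eta^(l-1) = eta^l / decay.
    groups = []
    for depth, layer in enumerate(reversed(layer_groups)):
        groups.append({"params": layer.parameters(), "lr": base_lr / decay ** depth})
    return groups

def slanted_triangular_lr(step, total_steps, max_lr=4e-3, cut_frac=0.1, ratio=32):
    # Short linear warm-up to max_lr, then a long linear decay.
    cut = max(1, int(total_steps * cut_frac))
    p = step / cut if step < cut else 1 - (step - cut) / max(1, total_steps - cut)
    return max_lr * (1 + p * (ratio - 1)) / ratio

# usage sketch (hypothetical training loop):
# optimizer = torch.optim.SGD(discriminative_param_groups(layer_groups), lr=4e-3)
# base_lrs = [g["lr"] for g in optimizer.param_groups]
# for step in range(total_steps):
#     scale = slanted_triangular_lr(step, total_steps, max_lr=1.0)
#     for g, base in zip(optimizer.param_groups, base_lrs):
#         g["lr"] = base * scale
```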
46. ‣ Concat pooling
Concatenate pooled representations of the hidden states to capture long document contexts:
h_c = [h_T, maxpool(H), meanpool(H)]
‣ Gradual unfreezing
Gradually unfreeze the layers starting from the last layer to prevent catastrophic forgetting.
‣ Bidirectional language model
Pretrain both forward and backward LMs and fine-tune them independently.
46
Classifier Fine-tuning
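A minimal sketch of the first two techniques (the tensor layout and the one-layer-group-per-epoch schedule are assumptions for illustration):

```python
# Sketch of concat pooling (assumes H has shape [seq_len, batch, hidden]) and
# gradual unfreezing of layer groups, starting from the last (top) group.
import torch

def concat_pool(H):
    # h_c = [h_T, maxpool(H), meanpool(H)]
    return torch.cat([H[-1], H.max(dim=0).values, H.mean(dim=0)], dim=-1)

def gradual_unfreeze(layer_groups, epoch):
    # After `epoch` epochs, the top `epoch + 1` layer groups are trainable.
    for depth, layer in enumerate(reversed(layer_groups)):
        for p in layer.parameters():
            p.requires_grad = depth <= epoch
```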
47. 47
Dataset    Type       # classes  # examples
TREC-6     question   6          5.5k
IMDb       sentiment  2          25k
Yelp-bi    sentiment  2          560k
Yelp-full  sentiment  5          650k
AG News    topic      4          120k
DBpedia    topic      14         560k
Data & Tasks
48. 48
Results
Previous SOTA vs. ULMFiT
[Bar chart: error rate (%) of the previous SOTA vs. ULMFiT on IMDb, TREC-6, AG News, DBpedia, Yelp-bi, and Yelp-full]
‣ ULMFiT outperforms the state-of-the-art by a significant
margin on many of the datasets.
49. 49
Few-shot Learning
[Charts: error rate (%) vs. number of training examples (100 to 20k for IMDb, 100 to 108k for AG News) for training from scratch, ULMFiT supervised, and ULMFiT semi-supervised]
‣ With 100 labeled examples, matches performance of
training from scratch with 10x and 20x more data.
‣ With 50-100k additional unlabeled examples, matches
performance of training with 50x and 20x more data.
50. ‣ Proposed a general approach for fine-tuning a
pretrained language model.
‣ Proposed new techniques to reduce catastrophic
forgetting during fine-tuning.
‣ Approach achieves new SOTA on 6 text
classification tasks.
‣ Very sample-efficient.
50
Takeaways
51. ‣ In order to understand how well our models truly
generalise, we need to measure their performance on out-
of-distribution data.
‣ It is important to evaluate our models on different domains
and tasks.
‣ Using pretrained language models is an effective way of
doing transfer / semi-supervised learning (SSL).
‣ Can be complemented by “explicit” SSL. We can take
lessons from traditional approaches.
‣ Dealing with stark domain differences is still a challenge
and requires ways to explicitly avoid negative transfer.
51
Final Takeaways