This document discusses neural semi-supervised learning under domain shift. It presents three research areas:
1) Learning across domains by selecting relevant source domain data for transfer learning using Bayesian optimization. Experimental results on sentiment analysis, POS tagging, and dependency parsing show this approach outperforms baselines.
2) Revisiting classic semi-supervised learning techniques like self-training, tri-training, and comparing them to recent advances. Experiments on sentiment analysis and POS tagging find tri-training works best.
3) The possibility of leveraging pre-trained language models for semi-supervised learning when the target task differs from the source task.
2. ‣ Across domains
(Ruder & Plank, EMNLP 2017, ACL 2018;
Howard* & Ruder*, ACL 2018)
‣ Across tasks
(Ruder et al., arXiv 2017;
Augenstein* & Ruder* et al., NAACL 2018)
‣ Across languages
(Ruder et al., JAIR 2018;
Søgaard, Ruder & Vulic, ACL 2018;
Ruder* & Cotterell* et al., EMNLP 2018;
Kementchedjhieva, Ruder et al., CoNLL 2018)
2
Research overview:
Transfer learning
* equal contribution
3. ‣ Across domains
(Ruder & Plank, EMNLP 2017, ACL 2018;
Howard* & Ruder*, ACL 2018)
3
Research overview:
Transfer learning
* equal contribution
Beware of non-i.i.d. data!
‣ We never know how well our models truly generalise if we
just test them on data of the same distribution.
‣ CIFAR-10 classifiers don’t even generalise to CIFAR-10 (Recht
et al., 2018).
‣ “A challenge to the community: we should evaluate on out-of-
distribution data or on a new task.”
- Percy Liang, DeepGen workshop, NAACL-HLT 2018
Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2018). Do CIFAR-10 Classifiers Generalize to CIFAR-10?
4. 4
Learning under Domain Shift
[Diagram: labeled source data, unlabeled target data, different task. Question: how to select the most relevant data?]
5. 5
Data setting 1:
Multiple source domains
Target domain
Source domains
Ruder, S., & Plank, B. (2017). Learning to select data for transfer learning with Bayesian Optimization. In
Proceedings of EMNLP 2017.
6. Why select data for domain adaptation at all?
Why don’t we just train on all source data?
‣ Prevent negative transfer for dissimilar domains.
‣ e.g. “electrifying” is positive in some domains, but negative in others
Existing approaches
‣ use a single similarity metric in isolation;
‣ focus on a single task.
6
Background
7. Intuition
‣ Different tasks and domains require different notions of
similarity.
Idea
‣ Learn a data selection policy using Bayesian Optimisation.
7
Our approach
8. 8
Our approach
[Diagram: training examples x1 … xn are scored by the selection policy S = φ(x)⊤w and sorted; the top m examples are selected.]
‣ Related: curriculum learning (Tsvetkov et al., 2016)
Tsvetkov, Y., Faruqui, M., Ling, W., & Dyer, C. (2016). Learning the Curriculum with Bayesian Optimization for
Task-Specific Word Representation Learning. In Proceedings of ACL 2016.
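As an illustration of the selection step, a minimal sketch (assumed, not the paper's implementation); featurize stands in for the similarity/diversity feature extraction described on the next slides.

```python
# Minimal sketch of the selection policy (illustrative; `featurize` is a
# placeholder for the similarity/diversity features on the next slides):
# score every source example with S = phi(x)^T w and keep the top m.
import numpy as np

def select_top_m(examples, featurize, w, m):
    phi = np.array([featurize(x) for x in examples])  # n x d feature matrix
    scores = phi @ w                                   # S = phi(x)^T w
    order = np.argsort(-scores)                        # sort by decreasing score
    return [examples[i] for i in order[:m]]
```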
9. ‣ Treat objective as a black box, which we iteratively approximate
‣ Use Bayesian Optimisation (BO) to obtain best parameter setting.
cf. Fang & Cohn (2017) who use RL for selecting data for active
learning
‣ Sample-efficient; only need about 100-200 samples to converge.
‣ BO is typically used for hyper-parameter tuning (Snoek et al.,
2012; Melis et al., 2018).
‣ Alternative: Learn latent permutation with Sinkhorn operator
(Adams and Zemel, 2011; Mena et al., 2018)
9
Learning the data selection policy
Fang, M., Li, Y., & Cohn, T. (2017). Learning how to Active Learn: A Deep Reinforcement Learning Approach. In Proceedings of EMNLP
2017.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of
NIPS 2012.
Melis, G., Dyer, C., & Blunsom, P. (2018). On the State of the Art of Evaluation in Neural Language Models. In Proceedings of ICLR 2018.
Mena, G. E., Belanger, D., Linderman, S., & Snoek, J. (2018). Learning Latent Permutations with Gumbel-Sinkhorn Networks. In
Proceedings of ICLR 2018.
11. Two important choices
‣ Surrogate model: used to approximate objective function; e.g.
Gaussian Process (GP)
‣ Acquisition function: propose new samples; trades off exploration
vs. exploitation; e.g. Expected Improvement (EI)
Procedure:
‣ Sample the next weight vector w_t by optimising the acquisition function u over the GP: w_t = argmax_w u(w | D_{1:t−1})
‣ Obtain a noisy validation score ŷ_t from the trained model
‣ Append the sample to D_{1:t} = {D_{1:t−1}, (w_t, ŷ_t)} and update the GP
11
Bayesian Optimisation
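For concreteness, a minimal sketch of this loop (an illustration, not the paper's implementation): a scikit-learn GP serves as the surrogate and Expected Improvement as the acquisition function; validation_score(w) is a hypothetical callback that trains a model on the data selected with w and returns its noisy validation score.

```python
# Minimal BO sketch (assumed setup, not the paper's code): GP surrogate +
# Expected Improvement; the acquisition argmax is approximated over random
# candidate weight vectors. `validation_score(w)` is a hypothetical callback.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, best_y):
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

def learn_selection_policy(validation_score, n_features, n_iters=100,
                           n_candidates=1000, seed=0):
    rng = np.random.default_rng(seed)
    W = [rng.uniform(-1, 1, n_features)]          # D_1: initial random weights
    y = [validation_score(W[0])]                  # noisy objective evaluation
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_iters):
        gp.fit(np.asarray(W), np.asarray(y))      # update GP on D_{1:t-1}
        cand = rng.uniform(-1, 1, size=(n_candidates, n_features))
        w_t = cand[np.argmax(expected_improvement(cand, gp, max(y)))]
        W.append(w_t)                             # w_t = argmax_w u(w | D_{1:t-1})
        y.append(validation_score(w_t))           # append (w_t, y_t) to D_{1:t}
    return W[int(np.argmax(y))]                   # best weight vector found
```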
12. ‣ Treat each source example and the entire target domain as distributions P and Q based on term and topic probabilities
12
Features
Similarity features (between P and Q): Jensen-Shannon divergence, Rényi divergence, Bhattacharyya distance, cosine similarity, Euclidean distance, variational distance
Diversity features: # word types, type-token ratio, entropy, Simpson's index, Rényi entropy, quadratic entropy
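For example, the Jensen-Shannon divergence feature could be computed roughly as follows (a sketch that assumes simple whitespace tokenisation and smoothing, not the paper's exact feature extraction):

```python
# Sketch of one similarity feature: JS divergence between a source example's
# term distribution P and the target domain's term distribution Q
# (whitespace tokenisation and add-epsilon smoothing are simplifying assumptions).
import numpy as np

def term_distribution(texts, vocab):
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros(len(vocab))
    for text in texts:
        for tok in text.split():
            if tok in index:
                counts[index[tok]] += 1
    probs = counts + 1e-12                      # smooth to avoid zero probabilities
    return probs / probs.sum()

def js_divergence(p, q):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# similarity feature of one source example x w.r.t. the whole target domain:
# -js_divergence(term_distribution([x], vocab), term_distribution(target_texts, vocab))
```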
13. 13
Data & Tasks
Three tasks:
Sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2007)
POS tagging and dependency parsing on the SANCL 2012 dataset (Petrov and McDonald, 2012)
Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, bollywood, boom-boxes and blenders: Domain
adaptation for sentiment classification. In Proceedings of ACL 2007.
Petrov, S., & McDonald, R. (2012). Overview of the 2012 shared task on parsing the web. In Notes of the First
Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
14. 14
Sentiment Analysis Results
Selecting 2,000 from 6,000 source domain examples
[Bar chart: accuracy (%) per target domain (Book, DVD, Electronics, Kitchen) for Random, JS divergence (examples), JS divergence (domain), Similarity (topics), Diversity, Similarity + diversity, and All source data (6,000 examples)]
‣ Selecting relevant data is useful when domains are very different.
15. 15
POS Tagging Results
Selecting 2,000 from 14-17.5k source domain examples
[Bar chart: accuracy (%) per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ) for JS divergence (examples), Similarity (terms), Diversity, Similarity + diversity, and All source data]
‣ Learned data selection outperforms static selection, but is less
useful when domains are very similar.
17. 17
Cross-Model Transfer Results
Training a BiLSTM with the policy learned by a BiLSTM and a
Structured Perceptron for POS tagging
[Bar chart: accuracy (%) per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ) for the BiLSTM policy (similarity + diversity) vs. the Structured Perceptron policy (similarity + diversity)]
‣ The data selection policy can be learned with a cheap model and
transferred to more expensive models.
18. ‣ Bayesian Optimisation is an efficient way to optimise
an expensive function, e.g. order of training examples.
‣ Different domains & tasks have different notions of
similarity.
‣ Preferring certain examples is mainly useful when
domains are dissimilar.
‣ Diversity complements similarity.
‣ The learned policy transfers (to some extent) across
models, tasks, and domains.
18
Takeaways
…
19. 19
Learning under Domain Shift
[Diagram: labeled source data, unlabeled target data. Question: how well does SSL work with NNs?]
20. 20
Data setting 2:
Single source domain
Target domain
Source domain
Ruder, S., & Plank, B. (2018). Strong Baselines for Neural Semi-supervised Learning under Domain Shift. In
Proceedings of ACL 2018.
21. ‣ State-of-the-art domain adaptation approaches
‣ leverage task-specific features
‣ evaluate on proprietary datasets or on a single
benchmark
‣ Only compare against weak baselines
‣ Almost none evaluate against approaches from the
extensive semi-supervised learning (SSL) literature
21
Learning under Domain Shift
22. ‣ How do classics in SSL compare to recent advances?
‣ Can we combine the best of both worlds?
‣ How well do these approaches work on out-of-distribution
data?
22
Revisiting Semi-Supervised Learning
Classics in a Neural World
23. • Self-training
• (Co-training*)
• Tri-training
• Tri-training with disagreement
Bootstrapping algorithms
* used in concurrent work: Wu, J., Li, L., & Wang, W. Y. (2018). Reinforced Co-Training. In Proceedings of
NAACL-HLT 2018.
24. 1. Train model on labeled data.
2. Use confident predictions on unlabeled data
as training examples. Repeat.
24
Self-training
- Error amplification
‣ Mixed success in NLP. Some recent success in CV
(Radosavovic et al., 2018).
Radosavovic, I., Dollár, P., Girshick, R., Gkioxari, G., & He, K. (2018). Data Distillation: Towards
Omni-Supervised Learning. In Proceedings of CVPR 2018.
25. ‣ Calibration
‣ Output probabilities in neural networks are poorly
calibrated.
‣ Throttling (Abney, 2007), i.e. selecting the top n highest-confidence unlabeled examples, works best.
‣ Online learning
‣ Training until convergence on labeled data and then on
unlabeled data works best.
25
Self-training variants
Radosavovic, I., Dollár, P., Girshick, R., Gkioxari, G., & He, K. (2018). Data Distillation: Towards Omni-
Supervised Learning. In Proceedings of CVPR 2018.
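Putting the last two slides together, a minimal self-training sketch (illustrative only: a scikit-learn classifier stands in for the actual model, and the schedule is simplified), with throttling via the top-n most confident predictions:

```python
# Minimal self-training sketch with throttling (assumed setup, not the paper's):
# train on labeled data, then repeatedly add the top-n most confident
# predictions on unlabeled data as pseudo-labels.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, n=100, rounds=5):
    model = LogisticRegression(max_iter=1000)
    X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(rounds):
        model = clone(model).fit(X, y)
        if len(pool) == 0:
            break
        probs = model.predict_proba(pool)
        conf = probs.max(axis=1)
        pred = model.classes_[probs.argmax(axis=1)]
        top = np.argsort(-conf)[:n]              # throttling: top-n most confident
        X = np.vstack([X, pool[top]])
        y = np.concatenate([y, pred[top]])       # pseudo-labels as training examples
        pool = np.delete(pool, top, axis=0)
    return model
```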
26. 26
1. Train three models on bootstrapped samples.
2. Use predictions on unlabeled data for third if two agree.
[Diagram: two models predict y = 1 for an unlabeled example x, which is then added as a training example for the third model.]
Tri-training
27. 27
Tri-training
1. Train three models on bootstrapped samples.
2. Use predictions on unlabeled data for third if two agree.
3. Final prediction: majority voting
[Diagram: the three models predict y = 1, y = 1, and y = 0 for an example x; majority voting yields y = 1.]
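A minimal tri-training sketch (illustrative: scikit-learn classifiers stand in for the neural models, and integer class labels 0..K−1 are assumed):

```python
# Minimal tri-training sketch (assumed setup): three models trained on
# bootstrap samples; an unlabeled example is pseudo-labeled for model i
# whenever the other two models agree on its label.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def tri_train(X_lab, y_lab, X_unlab, rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(3):                           # 1. bootstrap samples
        idx = rng.integers(0, len(X_lab), len(X_lab))
        models.append(LogisticRegression(max_iter=1000).fit(X_lab[idx], y_lab[idx]))
    for _ in range(rounds):
        preds = [m.predict(X_unlab) for m in models]
        new_models = []
        for i in range(3):
            j, k = (i + 1) % 3, (i + 2) % 3
            agree = preds[j] == preds[k]         # 2. the other two models agree
            X = np.vstack([X_lab, X_unlab[agree]])
            y = np.concatenate([y_lab, preds[j][agree]])
            new_models.append(clone(models[i]).fit(X, y))
        models = new_models
    return models

def predict(models, X):                          # 3. final prediction: majority vote
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```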
29. ‣ Sampling unlabeled data
‣ Producing predictions for all unlabeled examples is
expensive
‣ Sample a fixed number of unlabeled examples instead
‣ Confidence thresholding
‣ Not effective for classic approaches, but essential for
our method
29
Tri-training hyper-parameters
30. 30
[Diagram: two of the model's output layers predict y = 1 for an unlabeled example x, which becomes a pseudo-labeled training example for the third.]
Multi-task tri-training
1. Train one model with 3 objective functions.
2. Use predictions on unlabeled data for the third if two agree.
3. Restrict the final layers to use different representations.
4. Train the third objective function only on pseudo-labeled data to bridge the domain shift.
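A minimal sketch of the model structure (the architecture and the orthogonality penalty shown here are assumptions for illustration, not the exact model of the paper): one shared encoder with three output heads playing the roles of the three tri-training models.

```python
# Minimal multi-task tri-training sketch (assumed architecture): a shared
# encoder with three softmax heads; an orthogonality penalty on two heads is
# one way to push them towards different representations (step 3 above).
import torch
import torch.nn as nn

class MTTri(nn.Module):
    def __init__(self, input_dim, hidden_dim, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, n_classes) for _ in range(3))

    def forward(self, x):
        h = self.encoder(x)
        return [head(h) for head in self.heads]   # one set of logits per head

def orthogonality_penalty(model):
    # Penalise overlap between the representations used by heads 1 and 2.
    w1, w2 = model.heads[0].weight, model.heads[1].weight
    return torch.linalg.norm(w1 @ w2.t()) ** 2
```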
32. 32
Data & Tasks
Two tasks:
Sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2007)
POS tagging on the SANCL 2012 dataset (Petrov and McDonald, 2012)
33. Sentiment Analysis Results
[Bar chart: accuracy averaged over 4 target domains for VFAE*, DANN*, Asym*, Source only, Self-training, Tri-training, Tri-training-Disagr., and MT-Tri]
* result from Saito et al. (2017)
33
‣ Multi-task tri-training slightly outperforms tri-training, but
has higher variance.
34. 34
POS Tagging Results
Trained on 10% labeled data (WSJ)
[Bar chart: accuracy averaged over 5 target domains for Source (+embeds), Self-training, Tri-training, Tri-training-Disagr., and MT-Tri]
‣ Tri-training with disagreement works best with little data.
35. 35
POS Tagging Results
* result from Schnabel & Schütze (2014)
Trained on full labeled data (WSJ)
[Bar chart: accuracy averaged over 5 target domains for TnT, Stanford*, Source (+embeds), Tri-training, Tri-training-Disagr., and MT-Tri]
‣ Tri-training works best in the full data setting.
36. 36
POS Tagging Analysis
Accuracy on out-of-vocabulary (OOV) tokens
[Chart: accuracy on OOV tokens for Src, Tri, and MT-Tri, together with the % of OOV tokens, per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs)]
‣ Classic tri-training works best on OOV tokens.
‣ MT-Tri does worse than source-only baseline on OOV.
37. 37
POS accuracy per binned log frequency
[Chart: accuracy delta vs. the source-only baseline for Tri and MT-Tri across binned log frequencies]
‣ Tri-training works best on low-frequency tokens (leftmost
bins).
POS Tagging Analysis
38. 38
POS Tagging Analysis
Accuracy on unknown word-tag (UWT) tokens
[Chart: accuracy on UWT tokens for Src, Tri, MT-Tri, and FLORS*, together with the UWT rate, per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs)]
‣ No bootstrapping method works well on unknown word-tag combinations; these are very difficult cases.
‣ The less lexicalized FLORS approach is superior.
* result from Schnabel & Schütze (2014)
39. ‣ Classic tri-training works best: outperforms recent
state-of-the-art methods for sentiment analysis.
‣ We address the drawback of tri-training (space &
time complexity) via the proposed MT-Tri model
‣ MT-Tri works best on sentiment, but not for POS.
‣ Importance of:
‣ Comparing neural methods to classics (strong
baselines)
‣ Evaluation on multiple tasks & domains
39
Takeaways
Tri-training
40. 40
Learning under Domain Shift
[Diagram: labeled source data (different task), unlabeled target data. Question: how can we leverage pretrained LMs?]
41. 41
Data setting 3:
Different target task
Target domain
Target task
Source domain
Source task
Howard, J.*, & Ruder, S.* (2018). Universal Language Model Fine-tuning for Text Classification. In
Proceedings of ACL 2018.
* equal contribution
42. ‣ Best practice: initialise first layer with pretrained word
embeddings
‣ Recent approaches (McCann et al., 2017; Peters et al.,
2018): Pretrained embeddings as fixed features. Peters et
al. (2018) is task-specific.
‣ Why not initialise remaining parameters?
‣ Dai and Le (2015) first proposed fine-tuning a LM.
However: No pretraining. Naive fine-tuning.
42
Transfer learning for NLP: status quo
McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in Translation: Contextualized Word
Vectors. In Proceedings of NIPS 2017.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep
contextualized word representations. In Proceedings of NAACL-HLT 2018.
Dai, A. M., & Le, Q. V. (2015). Semi-supervised Sequence Learning. In Proceedings of NIPS 2015.
43. 43
Universal Language Model Fine-tuning
3 step recipe:
1. Train language model (LM) on general domain data.
2. Fine-tune LM on target data.
3. Train classifier on labeled data on top of LM.
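As a toy, self-contained illustration of this recipe (random placeholder data and a tiny LSTM stand in for the AWD-LSTM, WikiText-103 and the real target datasets; the fine-tuning tricks come on the following slides):

```python
# Toy sketch of the 3-step recipe (all data below is random placeholder data;
# a tiny LSTM stands in for the AWD-LSTM used in the paper).
import torch
import torch.nn as nn

VOCAB, EMB, HID, CLS = 1000, 32, 64, 2

class LM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h), h

def train_lm(lm, corpus, epochs=1):
    opt, loss_fn = torch.optim.Adam(lm.parameters()), nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits, _ = lm(corpus[:, :-1])                  # predict the next token
        loss = loss_fn(logits.reshape(-1, VOCAB), corpus[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

general = torch.randint(0, VOCAB, (64, 50))             # stand-in for WikiText-103
target_unlab = torch.randint(0, VOCAB, (64, 50))        # stand-in for target-domain text
target_x = torch.randint(0, VOCAB, (32, 50))            # small labeled target set
target_y = torch.randint(0, CLS, (32,))

lm = LM()
train_lm(lm, general)                                   # 1. general-domain LM pretraining
train_lm(lm, target_unlab)                              # 2. fine-tune LM on target data
head = nn.Linear(HID, CLS)                              # 3. classifier on top of the LM
opt = torch.optim.Adam(list(lm.parameters()) + list(head.parameters()))
for _ in range(3):
    _, h = lm(target_x)
    loss = nn.functional.cross_entropy(head(h[:, -1]), target_y)
    opt.zero_grad(); loss.backward(); opt.step()
```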
44. ‣ Model: AWD-LSTM language model
‣ 3-layer LSTM
‣ Tuned dropout hyperparameters
‣ Data: WikiText-103
‣ 103 million tokens of Wikipedia text
‣ Train for ~24 hours on a Tesla V100
‣ Recently: deeper models, trained on more data, for longer
(Radford et al., 2018)
44
Language Model Pretraining
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by
Generative Pre-Training.
45. ‣ Discriminative fine-tuning
Different layers capture different types of information. They should be fine-tuned to different extents, with a separate learning rate η^l per layer l:
θ^l_t = θ^l_{t−1} − η^l · ∇_{θ^l} J(θ)
‣ Slanted triangular learning rates
The model should converge quickly to a suitable region and then refine its parameters.
45
Language Model Fine-tuning
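A minimal sketch of these two tricks (the hyper-parameter values and the layer_groups structure are assumptions for illustration):

```python
# Sketch of discriminative fine-tuning (a separate learning rate per layer)
# and slanted triangular learning rates; values like decay=2.6 are illustrative.
import torch

def discriminative_param_groups(layer_groups, base_lr=4e-3, decay=2.6):
    # Lower layers get smaller learning rates: eta^(l-1) = eta^l / decay.
    groups = []
    for depth, layer in enumerate(reversed(layer_groups)):
        groups.append({"params": layer.parameters(), "lr": base_lr / decay ** depth})
    return groups

def slanted_triangular_lr(step, total_steps, max_lr=4e-3, cut_frac=0.1, ratio=32):
    # Short linear warm-up to max_lr, then a long linear decay.
    cut = max(1, int(total_steps * cut_frac))
    p = step / cut if step < cut else 1 - (step - cut) / max(1, total_steps - cut)
    return max_lr * (1 + p * (ratio - 1)) / ratio

# usage sketch (hypothetical training loop):
# optimizer = torch.optim.SGD(discriminative_param_groups(layer_groups), lr=4e-3)
# base_lrs = [g["lr"] for g in optimizer.param_groups]
# for step in range(total_steps):
#     scale = slanted_triangular_lr(step, total_steps, max_lr=1.0)
#     for g, base in zip(optimizer.param_groups, base_lrs):
#         g["lr"] = base * scale
```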
46. ‣ Concat pooling
Concatenate pooled representations of the hidden states to capture long document contexts:
h_c = [h_T, maxpool(H), meanpool(H)]
‣ Gradual unfreezing
Gradually unfreeze the layers starting from the last layer to prevent catastrophic forgetting.
‣ Bidirectional language model
Pretrain both forward and backward LMs and fine-tune them independently.
46
Classifier Fine-tuning
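A minimal sketch of the first two techniques (the tensor layout and the one-layer-group-per-epoch schedule are assumptions for illustration):

```python
# Sketch of concat pooling (assumes H has shape [seq_len, batch, hidden]) and
# gradual unfreezing of layer groups, starting from the last (top) group.
import torch

def concat_pool(H):
    # h_c = [h_T, maxpool(H), meanpool(H)]
    return torch.cat([H[-1], H.max(dim=0).values, H.mean(dim=0)], dim=-1)

def gradual_unfreeze(layer_groups, epoch):
    # After `epoch` epochs, the top `epoch + 1` layer groups are trainable.
    for depth, layer in enumerate(reversed(layer_groups)):
        for p in layer.parameters():
            p.requires_grad = depth <= epoch
```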
47. 47
Dataset    Type       # classes  # examples
TREC-6     question   6          5.5k
IMDb       sentiment  2          25k
Yelp-bi    sentiment  2          560k
Yelp-full  sentiment  5          650k
AG News    topic      4          120k
DBpedia    topic      14         560k
Data & Tasks
48. 48
Results
Previous SOTA vs. ULMFiT
[Bar chart: error rate (%) of the previous SOTA vs. ULMFiT on IMDb, TREC-6, AG News, DBpedia, Yelp-bi, and Yelp-full]
‣ ULMFiT outperforms the state-of-the-art by a significant
margin on many of the datasets.
49. 49
Few-shot Learning
[Charts: error rate (%) vs. number of training examples (100 to 20k for IMDb, 100 to 108k for AG News) for training from scratch, ULMFiT supervised, and ULMFiT semi-supervised]
‣ With 100 labeled examples, matches performance of
training from scratch with 10x and 20x more data.
‣ With 50-100k additional unlabeled examples, matches
performance of training with 50x and 20x more data.
50. ‣ Proposed a general approach for fine-tuning a
pretrained language model.
‣ Proposed new techniques to reduce catastrophic
forgetting during fine-tuning.
‣ Approach achieves new SOTA on 6 text
classification tasks.
‣ Very sample-efficient.
50
Takeaways
51. ‣ In order to understand how well our models truly
generalise, we need to measure their performance on out-
of-distribution data.
‣ It is important to evaluate our models on different domains
and tasks.
‣ Using pretrained language models is an effective way of
doing transfer / semi-supervised learning (SSL).
‣ Can be complemented by “explicit” SSL. We can take
lessons from traditional approaches.
‣ Dealing with stark domain differences is still a challenge
and requires ways to explicitly avoid negative transfer.
51
Final Takeaways